Proceedings embedded world Conference 2018

Conference Chair: Prof. Dr. Matthias Sturm, HTWK Leipzig

Project Manager: Renate Ester, P +49 (0)89 255 56-1349, E-Mail: REster@weka-fachmedien.de

Coordinator Conference Attendees: Juliane Heger, P +49 (0)89 255 56-1155, E-Mail: JHeger@weka-fachmedien.de

WEKA FACHMEDIEN GmbH, Richard-Reitzner-Allee 2, 85540 Haar, Germany
www.weka-fachmedien.de

ISBN 978-3-645-50173-6
www.embedded-world.eu



Copyright

©2018 WEKA FACHMEDIEN GmbH, Richard-Reitzner-Allee 2, 85540 Haar, Germany, phone: +49 (0)89 255 56-1000

All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, including photocopying, scanning, duplicating or transmitting electronically, without the written permission of the copyright holder, application for which should be addressed to the publisher. Such written permission must also be obtained before any part of this publication is stored in a retrieval system of any nature.

The publisher, its employees and agents exercise the customary degree of care in accepting and checking advertisement texts and conference papers, but are not liable for misleading or deceptive conduct by the client.

The company addresses contained in these proceedings are subject to data protection law. The use of this information for advertising purposes is prohibited.



Emulation and Rich, Non-intrusive Analytics Address Verification Complexity

Rupert Baines, UltraSoC, UK, Rupert.Baines@ultrasoc.com
Russell Klein, Mentor, a Siemens Company, Wilsonville, OR, U.S.A., Russell.Klein@Siemens.com

Breaking down the pre-silicon/post-silicon divide by combining hardware-based on-chip analytics with leading-edge emulation technology enables a consistent debug approach through all stages of the design cycle. Tracing capabilities added to SoC devices provide debug visibility into systems running at full speed and under typical operating conditions. However, even this type of advanced tracing cannot deliver complete visibility into the hardware. Diagnosing certain problems requires visibility into any register or net in the design. Emulation systems deliver this complete debug visibility, but have traditionally presented the data in a manner inconsistent with post-silicon debug tools, which disrupts the debug process. Combining these technologies delivers the optimal combination of performance and visibility.

I. INTRODUCTION

Software plays an increasing role in the functionality of modern SoCs. As such, verification of SoCs must include software execution. Many of the most challenging bugs are seen only when the SoC is run as a complete system of both hardware and software, operating at full speed and under typical operating conditions. These bugs are often sensitive to the smallest change in operating conditions: adding even a single-character logging statement can dramatically change the behavior of the bug, or mask it completely. Diagnosing this type of bug requires a non-intrusive tracing technique, capable of working while the SoC is running at full speed and processing typical datasets.

Sometimes trace data from the processors, bus fabrics, and surrounding logic can provide the insight needed to understand the bug. Other times additional debug data is needed. This additional data is usually obtained by reverting to a pre-silicon execution platform, where hardware debug visibility is greater than is possible in post-silicon environments. Typically this occurs when the root of the problem is in hardware, so deeper introspection into the hardware is required.

Traditionally, pre-silicon and post-silicon debug approaches have presented data in very different ways, as the nature of the data was quite different. Because the time scales of operation can be orders of magnitude apart, common data collection and presentation is the exception rather than the rule. This results in a discontinuity in the debug process: as developers move from a characterization of the problem in post-silicon, they need to re-capture and re-characterize the problem in the pre-silicon environment. When the bug is sensitive to minor timing or operating-condition changes this can prove both time consuming and frustrating, just when debug efficiency is needed most.

II. THE CHALLENGE

One of the biggest challenges in SoC design today is systemic complexity. Block-level verification means we can be confident in the test coverage of individual blocks; but when these are integrated into a whole system the complexity increases and problems slip through. This is especially the case for heterogeneous, multi-core systems or those with many different IP blocks.

The problems are worsened because the hardware blocks themselves may come from many different sources. Some will be designed in-house; others will be licensed-in from external vendors. It is a hard job to bring together these various CPUs, GPUs and accelerators, particularly in the absence of a unifying tool-chain that can deal with IP from many vendors.

The number, complexity and interaction of IP blocks is by no means the end of the story. The software that runs on a large chip will be every bit as complex as the hardware: verifying the functionality of the software itself, and its interaction with the underlying hardware, brings yet another level of complexity.

However, the problems for the SoC team do not end there. Still more complexity is revealed when we understand the end goal, which is to produce an SoC that functions correctly, and with the expected performance, in real-life situations. It is not uncommon to encounter issues that reveal themselves in the field only on a timescale of days or weeks of continuous running. Such issues cannot practically be found in simulation and verification, because of the time involved.

The endgame is that many of today's chips are so complex that it is impossible for the design team that created them to fully understand their operation in-life or in-field.

III. CURRENT APPROACHES

The modern SoC might have a billion transistors and more than 100 individual IP blocks. Such a system obviously presents a challenge to simulate, validate or verify. Typically the design flow moves from simulation to emulation and FPGA prototyping, then tape-out and post-silicon integration, bring-up, system-level test and finally deployment.

This has created two distinct parts of the flow, demarcated at tape-out: the pre- and post-silicon worlds have traditionally been completely separate, with little connection between the two.

A. Pre-Silicon

The pre-silicon world of simulation, emulation and prototyping has typically been considered a single domain served by EDA vendors. That domain is virtual, which has huge advantages in terms of flexibility and scope.

Today's simulators and similar tools are highly competent: we can have great confidence that a given block will work as designed. But there are difficulties: it is challenging to test software at anything approaching real-time speeds; it is very difficult to model systems that depend on real-world inputs; and the systemic complexity of the hardware may require extended run times to achieve anything like acceptable coverage.

In particular, it is not sufficient to model only hardware states: it is also necessary to include the execution of software. Software activity is typically dependent on real-world contexts and external inputs, making the challenge even greater. Tools like Mentor's Veloce emulation platform help address this problem by enabling pre-silicon testing and debug at hardware speeds, using real-world data, while both hardware and software designs are still fluid.

B. Post-Silicon

Bring-up, integration, verification and validation are done by systems developers, typically with very few specific tools to help them address their problems. Traditional debug tools are processor-centric: the ARM CoreSight system may be of little use in spotting an interaction issue with a CEVA DSP.

The situation is typified by the fact that perhaps the only "standard" tool that can be of help here (aside from free or open source debug software like GDB and commercial tools like Lauterbach's TRACE32) is JTAG, a 30-year-old technology which is very limited in its scope.

In recent years there have been some steps forward in solving this problem. UltraSoC, for example, provides analytics IP that allows the construction of a universal on-chip debug infrastructure, independent of the main system. The CPU vendors themselves have responded to such developments with improvements to their embedded debug capabilities, though these remain by no means "universal".

UltraSoC's IP integrates a system of monitors and analytics modules into the device itself, at hardware level. This gives the post-silicon team a system-wide view of their SoC, as a complete entity and under real-world operating conditions. Importantly, this includes both hardware and software, enabling full insight into processor performance and into how processes interact with each other and with hardware elements inside the system.

IV. EMBEDDED ANALYTICS

The embedded analytics concept puts hard-wired, non-intrusive, vendor-neutral, 'smart' monitoring and analysis capabilities into the chip itself. It includes local intelligence, again hard-wired, for filtering and statistics, reducing the amount of information that needs to be brought off-chip.

Such a capability makes it very much easier to bring up and debug the chip, even if there are subtle issues or interactions involved.

UltraSoC's embedded analytics capability includes protocol-aware bus monitors for major interconnects and buses, including AMBA 5 CHI, AXI, OCP and AHB. It supports all of the common processors (including ARM, MIPS, CEVA, Xtensa, and RISC-V), co-processors and custom logic (with sophisticated logic-analyzer functionality), and delivers the development team an integrated view of their chip. Within the RISC-V ecosystem, UltraSoC is the only company with commercial products for run-control, debug and processor trace.

The offering also includes analytics software that supports engineers in developing and optimizing their products.

V. EMULATION TECHNOLOGY

The Mentor Veloce emulation platform enables high-speed emulation of complex SoCs to quickly identify design issues under critical traffic conditions, and enables users to improve device performance and reduce time to market. It has both the performance and the capacity to execute an SoC design in the pre-silicon phase and run it under typical operating conditions, including a full software payload.

Veloce delivers full SoC hardware visibility. That is, the developer can trace all register states in the complete design, and all wires (or nets) that connect those registers. These traces can be collected for any time period through the execution of the SoC. Veloce uses hardware and software that enable this data collection to be performed even when the design is running in the emulator at the emulator's maximum operating speed.

In a typical debug session with Veloce, the developer will not collect traces for all registers for all time, but will trace a limited set of signals and registers for a selected period of time, strategically chosen to provide insight into the problem being diagnosed. Examination of the collected traces may lead the developer to want additional traces, perhaps from other parts of the design, and possibly earlier or later in the run of the system. Through a sophisticated system of hardware and software that can save and restore the design's state in the emulator, and effectively reconstruct the design's state at any time in a past emulation, Veloce is able to deliver traces for any part of the design, and for any time in the emulation run.

This level of debug visibility into the hardware state ensures that hardware problems can be quickly and confidently diagnosed and resolved.



As stated earlier, the activity of the design is dependent on the inputs to the system and the context of the operation. Veloce has a complete library of virtual peripherals and traffic generators, as well as a large number of connections to physical devices. Both virtual and physical peripherals can be time-synchronized with the time domain of the design running in the emulator, allowing realistic timing and performance characterization.

VI. COMBINING ANALYTICS AND EMULATION CREATES A POWERFUL PLATFORM

Bringing together on-chip analytics and hardware-based emulation tools like UltraSoC and Veloce allows designers using RISC-V not only to improve the effectiveness of their emulation efforts, but also to take an important step towards bringing together the currently disparate pre- and post-silicon worlds.

UltraSoC IP incorporated within the device can gather information about the real-world environment the chip will encounter, information that can be gathered both in the lab and even after deployment in the field.

This real-world traffic can be used effectively to extend the use of the Veloce platform to prototyping.

Uniting the pre- and post-silicon worlds in this way gives the design team access to visualizations and statistics based on real-world behavior, and allows them to compare modeled/predicted behavior with actual/captured behavior to identify discrepancies. Those discrepancies might be bugs (for example deadlocks). They may also be more subtle phenomena, for example contention or underutilization, that can impact long-term performance or affect power dissipation without creating a catastrophic bug.

Many of these issues may not have been observable in the slow virtual world of traditional simulation, but with this combination they can easily be observed and addressed.

This approach also enables an easy transition from emulation to post-silicon implementation, giving a seamless flow from virtual to physical, and on to in-life/in-field deployment and optimization.



Efficiency of the RISC-V ISA-Level Custom Extension for AES Standard Acceleration: a Case Study

Pavel Smirnov, Grigory Okhotnikov, Dmitry Pavlov
Syntacore, Saint Petersburg, Russia
{sps|go|dp}@syntacore.com

In this work, we present an example of a custom RISC-V ISA extension, using the familiar AES cryptographic standard as a target workload.

The recently introduced RISC-V ISA standard is flexible and extensible by design. We utilize the standard extensibility features offered by the ISA and stay within its basic capacity. Based on an analysis of the algorithm, a custom AES instruction extension is designed and implemented in HW. The proposed extension includes 6 new instructions, operates over the standard RV32GC FPU register file, and supports all the variations of the AES algorithm defined by the AES standard, both for encryption and decryption.

The proposed custom instruction set extension was implemented in hardware. The demonstrated results are based on a real-time FPGA implementation and end-to-end benchmarking using an accelerated SW library with extension support. The prototype setup includes a RISC-V RV32GC based processor core with a GCC toolchain modified to support the designed custom extension. The resulting implementation demonstrates more than 50x cipher speed-up vs the base, SW-only implementation for the RISC-V RV32GC system, at the expense of a modest additional HW footprint (~30 kgates).

The resulting extension is then compared with AES extensions from other contemporary commercial CPUs currently available on the market, and proves to be competitive in both code density and performance.

Keywords—RISC-V; ISA; AES; custom instructions

I. INTRODUCTION

Recent slowdowns in semiconductor technology scaling [8] and "traditional" CPU performance growth [2] have established platform heterogeneity and HW specialization as a fundamental trend in computer architecture and design. Contemporary SoCs are increasingly heterogeneous, but the dominant use case is the addition of workload-specific accelerators with a driver model for SW deployment, which limits resource reuse and the efficiency of the resulting solution.

The recently introduced open RISC-V ISA [9] includes support for user-defined extensions. Although technologies based on ISA extensibility are well known, and have been both extensively explored in academia and successfully applied in industry in several cases, RISC-V for the first time enables such technologies in the main system sockets for a wide range of applications.

In this work, we explore the efficiency of ISA-level RISC-V extensibility for acceleration of the familiar AES algorithm suite and compare it with contemporary alternatives. We intend to study RISC-V ISA suitability and limitations based on a practical case, provide qualitative and quantitative comparison with functional equivalents in other contemporary ISAs, and measure the performance and efficiency of the developed custom extension using a real-time FPGA prototype and an end-to-end SW stack.

II. APPLICATION ANALYSIS

A. AES Overview

AES [1] is arguably one of the most widely adopted encryption standards. For this work, we consider all the applicable key lengths defined by the AES specification.

Without going into full details of the algorithm itself, we would like to start from a high-level algorithm overview, which will be useful for the following ISA mapping. We also note a few aspects important for the following analysis.

AES operates over 128-bit blocks that are processed by a sequence of encryption rounds using round keys. The number of rounds and, accordingly, of round keys depends on the length of the master key; keys of 128, 192 or 256 bits are defined by the standard. Depending on the key length, 10, 12 or 14 rounds are used, each round key being 128 bits.
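The relationship between key length and round count follows the FIPS-197 formula Nr = Nk + 6, where Nk is the key length in 32-bit words. A quick sketch (the helper name is ours, not from the paper):

```python
def aes_round_count(key_bits: int) -> int:
    """Number of AES transformation rounds Nr for a given master key length.

    FIPS-197 defines Nr = Nk + 6, where Nk is the key length in 32-bit words.
    """
    if key_bits not in (128, 192, 256):
        raise ValueError("AES defines only 128-, 192- and 256-bit keys")
    return key_bits // 32 + 6
```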

The input message block (called the "state") is sequentially turned into a cipher block of the same length as a result of the transformation rounds. Each AES round transformation depends on the results of the previous one, which restricts parallel rounds execution.

1) AES Encryption

Overall, the AES encryption operation sequence consists of the following steps (Fig. 1).



Fig 1. AES encryption algorithm

AddRoundKey is a 128-bit bitwise 'exclusive or' operation on the round key and the state.

SubBytes is a nonlinear bijective byte substitution.

ShiftRows is a shuffling of bytes by a fixed permutation.

MixColumns is a linear transform: a multiplication of a fixed 4×4 coefficient matrix by the 4×4 state byte matrix, in the Galois field defined by a fixed irreducible polynomial.
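The ShiftRows and MixColumns steps can be illustrated with a short Python sketch (helper names are ours, not part of the paper's extension; the state is a flat 16-byte list in FIPS-197 column-major order):

```python
def shift_rows(state):
    """ShiftRows: row r of the 4x4 state is rotated left by r positions.

    The state is a flat 16-byte list with byte s[r][c] stored at index r + 4*c.
    """
    return [state[r + 4 * ((c + r) % 4)] for c in range(4) for r in range(4)]

def xtime(a):
    """Multiply by x (i.e. by 2) in GF(2^8) modulo x^8 + x^4 + x^3 + x + 1."""
    return ((a << 1) ^ (0x1B if a & 0x80 else 0)) & 0xFF

def mix_single_column(col):
    """MixColumns on one 4-byte column: the fixed 2-3-1-1 circulant matrix."""
    a0, a1, a2, a3 = col
    mul3 = lambda v: xtime(v) ^ v  # 3*v = 2*v XOR v in GF(2^8)
    return [
        xtime(a0) ^ mul3(a1) ^ a2 ^ a3,
        a0 ^ xtime(a1) ^ mul3(a2) ^ a3,
        a0 ^ a1 ^ xtime(a2) ^ mul3(a3),
        mul3(a0) ^ a1 ^ a2 ^ xtime(a3),
    ]
```

The column transform matches the standard test vector: mixing the column [0xDB, 0x13, 0x53, 0x45] yields [0x8E, 0x4D, 0xA1, 0xBC].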

2) AES Decryption

Block decryption is a sequence of inverse operations. The standard [1] defines two algorithms: Inverse Cipher and Equivalent Inverse Cipher.

The main differences between these algorithms are the sequence in which the operations are applied and the InvMixColumns transform, which must be applied to the round key for the Equivalent Inverse Cipher (Fig. 2).

Fig 3. AES decryption algorithms

III. RISC-V ISA EXTENSIBILITY OVERVIEW

The RISC-V ISA [9] supports variable-length instructions, where the length unit is 16 bits. The length is encoded in the first (least significant) bits of the instruction (Fig. 4).

Fig 4. Possible lengths of instruction
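This length encoding can be sketched in Python from the spec's rules (our own helper, covering only the 16- and 32-bit cases relevant to RV32GC):

```python
def instruction_length_bits(parcel: int) -> int:
    """Length of a RISC-V instruction, judged from its first 16-bit parcel.

    Per the RISC-V spec:
      bits [1:0] != 11                        -> 16-bit (compressed)
      bits [1:0] == 11 and bits [4:2] != 111  -> 32-bit
    Longer encodings (48-, 64-bit, ...) are left out of this sketch.
    """
    if parcel & 0b11 != 0b11:
        return 16
    if parcel & 0b11100 != 0b11100:
        return 32
    raise NotImplementedError("48-bit and longer encodings not modeled")
```

For example, the low parcel of a base ADDI encoding (0x0013) decodes as 32-bit, while a compressed parcel such as 0x4501 decodes as 16-bit.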

Base RV32I ISA instructions have a length of 32 bits. The specification defines several instruction formats (Fig. 5): R-type for register-only operations, I-type for instructions with immediate operands (including loads), S-type for store instructions, B-type for branches, U-type for upper-bits immediate operands, and J-type for jump/call instructions. Such unification significantly simplifies decoding.

Fig 2. Order of operations in Inverse Cipher and Equivalent Inverse Cipher

The AES decryption operation sequence consists of the following steps (Fig. 3).

Fig 5. Types of instructions



All RISC-V standard extensions use these instruction formats and can define their own types/subtypes. For example, the single-precision (F) and double-precision (D) floating-point extensions define multiply-add operations with 3 source and 1 destination register operands as the so-called R4-type subtype of the R-type format (Fig. 6).

Fig. 6. R- and R4-type of instruction

The base opcode map for 32-bit instructions is represented in Fig. 7.

Fig. 7. Base opcode map

Already in the basic ISA, some opcode space is reserved for custom instructions. For example, the RV32 and RV64 architectures can use the so-called custom-0 and custom-1 opcodes.

A. Staying Compliant with Standard RISC-V Extensibility Features

For the initial evaluation exercise, we intentionally stay within the capabilities provided by the standard basic instruction set and its extensibility features. We intend to explore all the supported basic XLEN values, but this initial work focuses specifically on the RV32GC basic set.

In addition, for this work we do not use advanced features like additional I/O ports, non-standard registers, and load-store units. Instead, the proposed implementation utilizes part of the standard opcode space reserved for custom extensions, the so-called "custom0/custom1" opcodes.

IV. CUSTOM AES EXTENSION DESIGN

A. Custom AES Instructions Functional Description

The algorithm described in [1] consists of transformations, namely SubBytes, ShiftRows, MixColumns, AddRoundKey, and their inverses. Each operation transforms a single 128-bit block, also named the "state". The AddRoundKey transform requires an additional 128-bit argument called the "round key". Each operand is split into a high-bits part and a low-bits part. The detailed description of these transformations can be found in Chapter 5 of [1].

Based on the analysis of the algorithm, the following instruction basis can be proposed:

• XOR128;
• AESENC;
• AESENCLAST;
• AESDEC;
• AESDECLAST;
• KEYGENASSIST.

The rest of this section contains a functional description of the basic operations in pseudo-code.

1) XOR128

The AddRoundKey transformation from [1] is basically a 128-bit XOR operation over the operands.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
state_high[63:0] := state_high[63:0] ^ key_high[63:0]
state_low[63:0]  := state_low[63:0] ^ key_low[63:0]
Output: state_high[64], state_low[64]
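The same semantics in Python (a behavioral model of the pseudo-code, not the RTL):

```python
MASK64 = (1 << 64) - 1

def xor128(state_high, state_low, key_high, key_low):
    """XOR128/AddRoundKey: two independent 64-bit XORs over the operand pair."""
    return (state_high ^ key_high) & MASK64, (state_low ^ key_low) & MASK64
```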

2) AESENC

A composition of the ShiftRows, SubBytes, MixColumns, and AddRoundKey transformations, which performs a single round of encryption.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := ShiftRows(tmp[127:0])
tmp[127:0]  := SubBytes(tmp[127:0])
tmp[127:0]  := MixColumns(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

3) AESENCLAST

A special kind of AESENC instruction that is performed during the last stage of encryption. This instruction combines the ShiftRows, SubBytes, and AddRoundKey transformations.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := ShiftRows(tmp[127:0])
tmp[127:0]  := SubBytes(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

4) AESDEC

Performs the transformation of a single decryption round, which is a composition of the InvShiftRows, InvSubBytes, AddRoundKey, and InvMixColumns transformations. It should be noted that this instruction follows the Inverse Cipher procedure described in Chapter 5.3 of [1], not the Equivalent Inverse Cipher.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := InvShiftRows(tmp[127:0])
tmp[127:0]  := InvSubBytes(tmp[127:0])
tmp[63:0], tmp[127:64] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
tmp[127:0]  := InvMixColumns(tmp[127:0])
state_high[63:0] := tmp[63:0]
state_low[63:0]  := tmp[127:64]
Output: state_high[64], state_low[64]



5) AESDECLAST

This instruction, similar to AESENCLAST, performs the last stage of decryption. It combines the InvShiftRows, InvSubBytes, and AddRoundKey transformations.

Input: state_high[64], state_low[64], key_high[64], key_low[64]
tmp[63:0]   := state_high[63:0]
tmp[127:64] := state_low[63:0]
tmp[127:0]  := InvShiftRows(tmp[127:0])
tmp[127:0]  := InvSubBytes(tmp[127:0])
state_high[63:0], state_low[63:0] := XOR128(tmp[63:0], tmp[127:64], key_high[63:0], key_low[63:0])
Output: state_high[64], state_low[64]

6) KEYGENASSIST

This is a helper instruction, used in the process of round key expansion.

Two auxiliary functions are introduced to simplify the notation. The SubWord transformation performs byte substitution sequentially, similar to SubBytes. RotWord performs a byte-wise rotation of a 32-bit word: for a word of bytes [a0, a1, a2, a3], RotWord returns [a1, a2, a3, a0].

Input: state_high[64], state_low[64], rcon[8]
tmp1[31:0] := state_high[31:0]
tmp2[31:0] := state_high[63:32]
tmp3[31:0] := state_low[31:0]
tmp4[31:0] := state_low[63:32]
state_high[31:0]  := RotWord(SubWord(tmp1[31:0])) ^ rcon[7:0]
state_high[63:32] := SubWord(tmp1[31:0])
state_low[31:0]   := RotWord(SubWord(tmp3[31:0])) ^ rcon[7:0]
state_low[63:32]  := SubWord(tmp3[31:0])
Output: state_high[64], state_low[64]
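RotWord on its own is easy to model. Note the byte-order assumption is ours: taking a0 as the most significant byte of the word, the rotation is a 32-bit left rotate by 8 bits.

```python
def rot_word(w: int) -> int:
    """RotWord: rotate a 32-bit word by one byte, [a0,a1,a2,a3] -> [a1,a2,a3,a0].

    With a0 taken as the most significant byte, this is a left rotate by 8 bits.
    """
    return ((w << 8) | (w >> 24)) & 0xFFFFFFFF
```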

V. CUSTOM AES ISA EXTENSION

A. Operand Format and FP Register File Reuse

Each of the instructions in the proposed AES extension takes two 128-bit input operands, performs the required sequence of transformations, and returns one 128-bit output.

In this initial implementation, the proposed AES extension uses pairs of 64-bit floating-point registers as aliases for 128-bit operands. The half that contains the most significant bits of an argument is called the "high part" (state_high) and the other half is called the "low part" (state_low).

This approach is practical (vs dedicated 128-bit storage) and helps minimize the additional area required to support the extension.
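The register-pairing convention can be modeled simply (helper names are ours, purely illustrative):

```python
MASK64 = (1 << 64) - 1

def split128(x: int):
    """Split a 128-bit operand into its (high, low) 64-bit register-pair halves."""
    return (x >> 64) & MASK64, x & MASK64

def join128(high: int, low: int) -> int:
    """Reassemble a 128-bit value from the (high, low) register pair."""
    return ((high & MASK64) << 64) | (low & MASK64)
```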

B. AES Operational Basis Interface Requirements

The functional basis proposed in the previous section allows the encryption, decryption and key expansion procedures to be specified in terms of rounds. This simplifies the programming process because the number of instructions stacked together corresponds exactly to the number of transformation rounds. Thus, only 6 custom instructions are required for the AES encryption/decryption algorithms.

The data flow of the AES algorithm allows accumulative updating of the 128-bit state block. Correspondingly, when operating over a register file with 64-bit entries, every instruction needs four input registers, and two of those registers should be updated as the result of the transformation.

The proposed basis fits into the standard R4-type instruction format, but does not strictly comply with the standard R4-type instruction template semantics. An ordinary R4-type instruction takes three source registers and returns its result to one destination register, while the described basis requires two destination registers. This suggests a specific modification of the R4-type instructions, as described in the following section.

C. AES Custom Extension Instruction Format

To implement the instruction-level interface of the described functionality, we introduce an additional format as a subset of the custom0/custom1 R4 (Fig. 6) format encoding. In this newly introduced format, we define instructions with funct3 values of '001' and '101' within the R4-type. The funct2 field encodes the different AES transformations.

Each instruction of the AES custom extension set takes 4 FPU registers as operands. Every such instruction has one important change in the operation semantics: the 'rd' and 'rs3' arguments are both source and destination registers, while other opcodes in this format have 3 source operands and a single destination operand. To prevent data corruption, we do not allow instructions in which a single register is used for both the 'rd' and 'rs3' destinations.

A full listing of the proposed custom AES extension instructions is given in Table I.<br />

TABLE I. AES EXTENSION INSTRUCTIONS<br />
Instruction | rs3 (31:27) | funct2 (26:25) | rs2 (24:20) | rs1 (19:15) | funct3 (14:12) | rd (11:7) | opcode (6:0)<br />
AESENC | any | 00 | any | any | 001 | any | 0001011<br />
AESENCLAST | any | 01 | any | any | 001 | any | 0001011<br />
AESDEC | any | 10 | any | any | 001 | any | 0001011<br />
AESDECLAST | any | 11 | any | any | 001 | any | 0001011<br />
XOR128 | any | 00 | any | any | 101 | any | 0001011<br />
AESKEYGENASSIST | any | 11 | any | any | 101 | any | 0001011<br />
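The field layout in Table I can be turned into a small encoder sketch. The bit positions come directly from the table header; the helper names below are ours, not part of the authors' toolchain.

```cpp
#include <cassert>
#include <cstdint>

// Packs a 32-bit R4-type instruction word from the fields of Table I:
// rs3[31:27], funct2[26:25], rs2[24:20], rs1[19:15], funct3[14:12],
// rd[11:7], opcode[6:0].
uint32_t encode_r4(uint32_t rs3, uint32_t funct2, uint32_t rs2,
                   uint32_t rs1, uint32_t funct3, uint32_t rd,
                   uint32_t opcode) {
    return (rs3 << 27) | (funct2 << 25) | (rs2 << 20) |
           (rs1 << 15) | (funct3 << 12) | (rd << 7) | opcode;
}

// AESENC per Table I: funct2 = 00, funct3 = 001, opcode = 0001011 (custom0).
uint32_t encode_aesenc(uint32_t rd, uint32_t rs1, uint32_t rs2, uint32_t rs3) {
    return encode_r4(rs3, 0b00, rs2, rs1, 0b001, rd, 0b0001011);
}
```

For example, `encode_aesenc(1, 2, 3, 4)` yields the word `0x2031108B`.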

Overall, if the implementation allows full operand-register flexibility (which is preferable from<br />
the instrumentation point of view, but not required), the proposed extension occupies around 6M<br />
opcodes (mainly due to the operand address fields), leaving more than 90% (~56 million<br />
combinations) of the RV32 opcode space available (Table II below).<br />
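The opcode-space figures can be sanity-checked with simple arithmetic. This rough check ignores the small correction for excluded rd == rs3 encodings, so it reproduces the exact percentage in Table II rather than the slightly smaller adjusted count.

```cpp
#include <cassert>
#include <cstdint>

// Back-of-the-envelope check of the custom opcode space. Each custom major
// opcode fixes the 7 opcode bits, leaving 25 free bits (2^25 ~ 32M encodings
// each for custom0 and custom1). An AES instruction also fixes funct3 and
// funct2, leaving only the four 5-bit register fields free: 2^20 encodings
// per instruction, times six instructions.
constexpr uint64_t kPerMajorOpcode = 1ull << 25;           // custom0 or custom1
constexpr uint64_t kTotalSpace     = 2 * kPerMajorOpcode;  // 64M combinations
constexpr uint64_t kPerAesInsn     = 1ull << 20;           // 4 x 5 register bits
constexpr uint64_t kAesUsed        = 6 * kPerAesInsn;      // six instructions
```

The used fraction is 6 x 2^20 / 2^26 = 9.375%, matching Table II.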

TABLE II. CUSTOM INSTRUCTIONS OPCODE SPACE<br />
  | custom0 | custom1 | Total | %<br />
Full opcode space | 2^25 ~32M | 2^25 ~32M | 64M | 100<br />
Number of possible opcodes used for AES instructions (a) | 6x2^20 - 2^15 ~6M | 0 | 6M | 9.375<br />
Opcode space still available | 3x2^23 + 2^21 + 2^15 >26M | 32M | 56M | 90.625<br />
a. With full 4-operand register flexibility<br />



VI.<br />

AES FUNCTIONAL UNIT IMPLEMENTATION<br />

This section provides a high-level overview of the AES functional unit which implements the<br />
proposed ISA extension. The full RTL code of the module is published in [6].<br />

A. AES Module Block Diagram<br />

The high-level AES functional unit diagram is shown in Fig. 8.<br />

As this evaluation exercise is ISA-centric, the AES unit implementation is straightforward. The<br />
unit has four 64-bit input operands and two 64-bit results. The unit is implemented as a<br />
single-stage pipeline, where the input operands and the command are latched in the input register<br />
for subsequent processing. The computational part of the current unit is implemented as<br />
combinational logic. Correspondingly, the latency of the initial AES unit implementation is just a<br />
single clock in the current prototype. The data path is rather simple, however, and can be<br />
pipelined if necessary. The “Substitution” and “Inverse Substitution” units are simple 256-entry<br />
LUTs. The “Keygen” unit executes a bit interleaver as a logic data path. The “Shift rows” and<br />
“Mix columns” units execute row and column interleaving, correspondingly.<br />
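The contents of the 256-entry substitution LUT are fixed by FIPS-197. As an illustration of what the LUT holds (this is standard AES, not the authors' RTL), the table can be generated programmatically from the GF(2^8) inverse plus the affine transform:

```cpp
#include <cassert>
#include <cstdint>

// GF(2^8) multiplication modulo the AES polynomial x^8 + x^4 + x^3 + x + 1.
static uint8_t gf_mul(uint8_t a, uint8_t b) {
    uint8_t p = 0;
    for (int i = 0; i < 8; ++i) {
        if (b & 1) p ^= a;
        bool hi = a & 0x80;
        a <<= 1;
        if (hi) a ^= 0x1b;
        b >>= 1;
    }
    return p;
}

// Multiplicative inverse as a^254 (the group has order 255); inv(0) := 0.
static uint8_t gf_inv(uint8_t a) {
    uint8_t r = 1;
    for (int i = 0; i < 254; ++i) r = gf_mul(r, a);
    return a ? r : 0;
}

static uint8_t rotl8(uint8_t x, int n) {
    return static_cast<uint8_t>((x << n) | (x >> (8 - n)));
}

// One S-box entry: affine transform of the field inverse, per FIPS-197.
uint8_t sbox(uint8_t x) {
    uint8_t b = gf_inv(x);
    return b ^ rotl8(b, 1) ^ rotl8(b, 2) ^ rotl8(b, 3) ^ rotl8(b, 4) ^ 0x63;
}
```

In hardware, of course, the 256 precomputed bytes are simply stored as a LUT rather than derived on the fly.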

Fig. 9. AES block interface<br />

B. Implementation Complexity<br />

The described implementation has been synthesized at<br />

TSMC 28nm library. Synthesis results are included in Table III<br />

below.<br />

TABLE III. IMPLEMENTATION COMPLEXITY<br />
Module | Basic core complexity, kGates | Extended core complexity, kGates | Diff, %<br />
Core total, logic only | 275 | 307 | 11.5<br />
AES module | — | 20.8 | 7.5<br />
Other core modifications | — | 10.9 | 4<br />

Fig. 8. AES functional unit diagram<br />
AES block interfacing with the RISC-V pipeline is shown in Fig. 9.<br />
VII. EXPERIMENTAL RESULTS AND COMPARISON<br />

A. Software Packages Used<br />

The AES algorithm described in [1] was implemented in C++. This implementation is based on the<br />
Botan library [4]. Conventional implementations of AES benefit from well-known optimization<br />
tricks, for example, combining the ShiftRows and MixColumns transformations into a single<br />
operation. The Botan library supports cryptographic extensions for various architectures,<br />
including Intel AES-NI and the ARMv8 cryptographic extension.<br />

The correctness of the resulting application was tested using reference vectors from [1]. A small<br />
cross-platform benchmarking library was implemented in C++11 for correctness verification and<br />
performance testing. The implementation was compiled with the MSVC 2015 compiler for Windows 10<br />
(64-bit), g++ 5.4.1 for Linux (64-bit), g++ 5.2.0 for RISC-V Linux, and g++ 6 for Raspberry Pi 3.<br />

B. Reference Implementations<br />

The Intel AES-NI extension supports an AES instruction subset with 128-bit register/memory<br />
operands [5]:<br />

• AESENC. This instruction performs a single round<br />

of encryption. The instruction combines the four<br />

transformations of the AES algorithm, namely:<br />

ShiftRows, SubBytes, MixColumns &<br />

AddRoundKey (as described in [1]) into a single<br />

instruction.<br />



• AESENCLAST. Instruction for the last round of<br />

encryption. Combines the ShiftRows, SubBytes,<br />

and AddRoundKey steps into one instruction.<br />

• AESDEC. Instruction for a single round of<br />

decryption. This combines the four steps of AES<br />

transformations InvShiftRows, InvSubBytes,<br />

InvMixColumns, AddRoundKey into a single<br />

instruction.<br />

• AESDECLAST. Performs last round of decryption.<br />

It combines InvShiftRows, InvSubBytes,<br />

AddRoundKey into a single instruction.<br />

• AESKEYGENASSIST assists in generating the round keys used for encryption.<br />

• AESIMC is used for converting the encryption<br />

round keys to a form suitable for decryption using<br />

the Equivalent Inverse Cipher.<br />

ARMv8 also has a cryptographic extension with 128-bit coprocessor SIMD register operands [3]:<br />

• AESD – AES single round decryption.<br />

• AESE – AES single round encryption.<br />

• AESIMC – AES inverse MixColumns<br />

transformation.<br />

• AESMC – AES MixColumns transformation.<br />

C. AES Operation-to-Instruction Mapping<br />

In this section, we compare mappings of the AES cipher operations in the designed extension with<br />
functionally similar extensions in selected contemporary ISAs.<br />

The AES encryption operation mapping is shown in Fig. 10. One can observe that Intel AES-NI and<br />
the proposed custom AES extension for RISC-V have similar instruction functionality. The number of<br />
executed instructions is equal to the number of round keys in both cases. The corresponding ARMv8<br />
extension is different and requires twice as many instructions per round.<br />

For decryption, both AES-NI and ARMv8 use an equivalent inverse cipher algorithm, which requires a<br />
separate set of round keys that differs from the encryption keys (Fig. 11). Again, as in the<br />
cipher algorithm, the number of executed instructions in the RISC-V extension is equal to the<br />
number of round keys. It equals the number of instructions in the AES-NI instruction set, whereas<br />
ARMv8 requires twice as many instructions to be executed.<br />
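The instruction-count relationship described above can be written down as a rough model (loads and stores excluded; the function names are ours). For Nr rounds there are Nr + 1 round keys; AES-NI and the proposed RISC-V extension use one data-processing instruction per round key, while ARMv8 pairs an AESE with an AESMC for all but the final round and closes with an EOR.

```cpp
#include <cassert>

// Data-processing instructions per encrypted block, per the mappings in
// Fig. 10. Nr is the number of AES rounds (10 for AES-128, 14 for AES-256).
int insns_aesni_or_rv(int nr) { return nr + 1; }  // xor + (nr-1) enc + enclast
int insns_armv8(int nr)       { return 2 * nr; }  // (nr-1) x (aese+aesmc) + aese + eor
```

For AES-128 this gives 11 versus 20 instructions, which is the "twice as many per round" difference visible in the listings.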

The advantages of a hardware implementation of the AES transformations are described in detail in<br />
the chapter “Software Side Channels and the AES Instructions” of [5].<br />
However, the proposed custom RISC-V extension for AES, in contrast to AES-NI, decomposes the main<br />
AES Inverse Cipher algorithm, which uses the same set of round keys as the Cipher algorithm does.<br />

D. Instruction Scheduling Comparison<br />

In this section, we compare disassembled code<br />

corresponding to the main encryption and decryption loops. A<br />

well-scheduled sequence of instructions may significantly<br />

improve computational performance of an algorithm.<br />

Fig. 10. AES encryption instructions for different ISAs<br />

Fig. 11. AES decryption instructions for different ISAs<br />



The following samples contain a block of C++ code followed by its disassembly. Round keys are<br />
assumed to be stored in the m_EK array. Each code fragment applies 11 round keys to each 128-bit<br />
block.<br />
Listing 1 contains a code fragment that uses the Intel AES-NI instruction set for single-block<br />
encryption. We can see instructions loading round keys from memory; the number of registers is not<br />
sufficient to preload the keys into registers outside the main loop. The total number of<br />
instructions in the loop is 28, and the total number of used ‘xmm’ registers is 3 out of 8 for the<br />
AES-128 example below.<br />

Listing 1. Block encryption using Intel AES-NI instruction set<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

__m128i B =<br />

_mm_loadu_si128(std::addressof(plain_text_block));<br />

B = _mm_xor_si128(B, _mm_loadu_si128(&m_EK[0]));<br />

for (int i = 1; i < number_of_round_keys - 1; ++i)<br />

{<br />

B = _mm_aesenc_si128(B,<br />

_mm_loadu_si128(&m_EK[i]));<br />

}<br />

B = _mm_aesenclast_si128(B,<br />

_mm_loadu_si128(&m_EK[number_of_round_keys - 1]));<br />

_mm_storeu_si128(std::addressof(cipher_block), B);<br />

}<br />

...<br />

.L217:<br />

movdqu (%rdx), %xmm0<br />

addq $16, %rdx<br />

addq $16, %r8<br />

movdqu (%rax), %xmm2<br />

pxor %xmm2, %xmm0<br />

movdqu 16(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 32(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 48(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 64(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 80(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 96(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 112(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 128(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 144(%rax), %xmm1<br />

aesenc %xmm1, %xmm0<br />

movdqu 160(%rax), %xmm1<br />

aesenclast %xmm1, %xmm0<br />

movups %xmm0, -16(%r8)<br />

cmpq %rsi, %rdx<br />

jne .L217<br />

The code fragment in Listing 2 uses the ARMv8 cryptographic instructions. Again, we can see<br />
instructions loading round keys from memory; the number of registers is again not sufficient to<br />
preload the keys outside of the main loop. The total number of instructions in the loop is 40, and<br />
the total number of used ‘q’ registers is 7 (14 ‘d’ registers out of 16) for AES-128.<br />

Listing 2. Block encryption using ARMv8 cryptographic instructions<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

auto const p_inp = reinterpret_cast&lt;uint8_t const*&gt;(&plain_text_block);<br />

auto const p_out = reinterpret_cast&lt;uint8_t*&gt;(&cipher_block);<br />

auto data = vld1q_u8(p_inp);<br />

{<br />

auto const& keys = m_EK;<br />

for (int i = 0; i < number_of_round_keys - 2;<br />

++i) {<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[i]);<br />

auto const rkey = vld1q_u8(p_key);<br />

data = vaeseq_u8(data, rkey);<br />

data = vaesmcq_u8(data);<br />

}<br />

{<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[number_of_round_keys - 2]);<br />

auto rkey = vld1q_u8(p_key);<br />

data = vaeseq_u8(data, rkey);<br />

}<br />

{<br />

auto const p_key = reinterpret_cast&lt;uint8_t const*&gt;(&keys[number_of_round_keys - 1]);<br />

auto const rkey = vld1q_u8(p_key);<br />

data = veorq_u8(data, rkey);<br />

}<br />

}<br />

vst1q_u8(p_out, data);<br />

}<br />

...<br />

.L164:<br />

sub r0, fp, #1012<br />

vld1.8 {d20-d21}, [r0:64]<br />

ldr r0, [fp, #-1060]<br />

vld1.8 {d18-d19}, [r0:64]<br />

ldr r0, [fp, #-1076]<br />

vld1.8 {d16-d17}, [r3:128]!<br />

cmp r3, ip<br />

aese.8 q8, q10<br />

vld1.8 {d22-d23}, [r1:64]<br />

vld1.8 {d20-d21}, [r2:64]<br />

aesmc.8 q8, q8<br />

vld1.8 {d28-d29}, [lr:64]<br />

aese.8 q8, q9<br />

vld1.8 {d26-d27}, [r10:64]<br />

vld1.8 {d18-d19}, [r8:64]<br />

aesmc.8 q8, q8<br />

vld1.8 {d24-d25}, [r4:64]<br />

aese.8 q8, q11<br />

vld1.8 {d22-d23}, [r5:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q10<br />

vld1.8 {d20-d21}, [r6:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q9<br />

vld1.8 {d18-d19}, [r7:64]<br />

aesmc.8 q8, q8<br />

aese.8 q8, q14<br />

aesmc.8 q8, q8<br />



aese.8 q8, q13<br />

aesmc.8 q8, q8<br />

aese.8 q8, q12<br />

aesmc.8 q8, q8<br />

aese.8 q8, q11<br />

aesmc.8 q8, q8<br />

aese.8 q8, q10<br />

veor q8, q9, q8<br />

vst1.8 {d16-d17}, [r0:128]!<br />

str r0, [fp, #-1076]<br />

bne .L164<br />

For AES-256:<br />

.L185:<br />

sub r3, fp, #1012<br />

vld1.8 {d16-d17}, [r2:128]<br />

.L184:<br />

vld1.8 {d18-d19}, [r3:64]!<br />

cmp r9, r3<br />

aese.8 q8, q9<br />

aesmc.8 q8, q8<br />

bne .L184<br />

add r2, r2, #16<br />

cmp r2, r0<br />

vld1.8 {d20-d21}, [r9:64]<br />

vld1.8 {d18-d19}, [r10:64]<br />

aese.8 q8, q10<br />

veor q8, q9, q8<br />

vst1.8 {d16-d17}, [r1:128]!<br />

bne .L185<br />

Listing 3 demonstrates efficient unrolling of the main encryption loop. It can be noticed that no<br />
key-loading operations interfere with the round transformation sequence. All keys are preloaded<br />
into the ‘f’ registers as loop-invariant variables. The total number of instructions is 18, and<br />
the total number of used ‘f’ registers is 24 out of 32 in the AES-128 example below.<br />

Listing 3. Block encryption using the proposed RISC-V AES custom instruction<br />

set extension<br />

void<br />

encrypt(block_type const& plain_text_block,<br />

block_type& cipher_block) const<br />

{<br />

uint128_type B = reinterpret_cast&lt;uint128_type const&&gt;(plain_text_block);<br />

{<br />

B ^= m_EK[0];<br />

for (int i = 1; i < number_of_round_keys - 1; ++i)<br />

B.aesenc(m_EK[i]);<br />

}<br />

B.aesenclast(m_EK[number_of_round_keys - 1]);<br />

reinterpret_cast&lt;uint128_type&&gt;(cipher_block) = B;<br />

}<br />

...<br />

.L183:<br />

fld fa3,0(a5)<br />

fld fa5,8(a5)<br />

add a4,a4,16<br />

add a5,a5,16<br />

sc_xor128 fa3, fa5, fs6, fs8<br />

sc_aesenc fa3, fa5, ft6, fs1<br />

sc_aesenc fa3, fa5, fs2, fs4<br />

sc_aesenc fa3, fa5, ft4, ft8<br />

sc_aesenc fa3, fa5, fs3, fs0<br />

sc_aesenc fa3, fa5, ft2, ft10<br />

sc_aesenc fa3, fa5, fs7, fs5<br />

sc_aesenc fa3, fa5, ft0, fs10<br />

sc_aesenc fa3, fa5, fs11, fs9<br />

sc_aesenc fa3, fa5, fa2, fa6<br />

sc_aesenclast fa3, fa5, fa0, fa4<br />

fsd fa3,-16(a4)<br />

fsd fa5,-8(a4)<br />

bne a5,a2,.L183<br />

Tables IV and V summarize the number of instructions and registers needed for block<br />
encryption/decryption with 128-bit and 256-bit keys, respectively.<br />

TABLE IV. AES-128 RESOURCES USED<br />
ISA | Instructions per block | Number of used registers | Load/store memory operations (per loop)<br />
Intel x86 AES-NI | 28 | 3 xmm (from 8) | 12 load + 1 store<br />
ARMv8 aarch64-crypto | 39 | 7 vector ‘q’ (or 14 FPU ‘d’ from 16) | 15 load + 1 store<br />
Custom RV32 AES extension (single invocation (a)) | 40 | 4 from 32 FPU ‘d’ | 24 load + 2 store<br />
Custom RV32 AES extension (inlined in a loop, round keys are preloaded (b)) | 18 | 24 from 32 FPU ‘d’ | 2 load + 2 store<br />
a. Use case: standalone “deep” invocation.<br />
b. Use case: steady-state block encryption/decryption with the same set of round keys.<br />

TABLE V. AES-256 RESOURCES USED<br />
ISA | Instructions per block | Number of used registers | Load/store memory operations (per loop)<br />
Intel x86 AES-NI | 36 | 3 xmm (from 8) | 16 load + 1 store<br />
ARMv8 aarch64-crypto | 75 | 3 vector ‘q’ (or 6 FPU ‘d’ from 16) | 16 load + 1 store<br />
Custom RV32 AES extension (single invocation) | 52 | 4 from 32 FPU ‘d’ | 32 load + 2 store<br />
Custom RV32 AES extension (inlined in a loop, round keys are preloaded) | 22 | 32 from 32 FPU ‘d’ | 2 load + 2 store<br />

As can be seen from the tables above, the compiler-produced code generated for block encryption<br />
using the Intel AES-NI and ARMv8 crypto extensions does not depend on the invocation type and<br />
stays the same for both inlined functions and “deep” standalone calls. The number of available<br />
registers is a known limiting factor on these platforms: it is not enough to hold all the<br />
loop-invariant variables in registers, which forces key reloading for every new data block.<br />

For the proposed RISC-V custom extension, key loading is only required for standalone,<br />
context-free “deep” calls. In contrast, for block encryption operations called in a loop, the<br />
compiler hoists the key-loading sequence out of the loop as loop-invariant code, so it is executed<br />
only once. This reduces the number of instructions required per block by more than 2x and removes<br />
unnecessary key loads.<br />
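The hoisting effect described above can be sketched in plain C++ (simplified to 64-bit XOR "rounds"; names and structure are ours, not the paper's code). When the key material fits in registers, only the per-block load and store remain inside the loop.

```cpp
#include <cassert>
#include <cstdint>
#include <vector>

// keys[] stands in for preloaded, loop-invariant round keys.
uint64_t encrypt_block(uint64_t block, const uint64_t keys[4]) {
    for (int r = 0; r < 4; ++r) block ^= keys[r];
    return block;
}

// The keys argument is read once per call site; a compiler with enough
// registers keeps it register-resident across all loop iterations instead
// of reloading it for every block.
std::vector<uint64_t> encrypt_all(const std::vector<uint64_t>& blocks,
                                  const uint64_t keys[4]) {
    std::vector<uint64_t> out;
    out.reserve(blocks.size());
    for (uint64_t b : blocks)
        out.push_back(encrypt_block(b, keys));
    return out;
}
```

This is the shape the RISC-V compiler achieves in Listing 3: key loads outside the loop, 2 loads + 2 stores per block inside it.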

It should also be noted that the experimental RISC-V CPU implementation only supports 64-bit loads<br />
and stores, while the other considered CPUs can operate with 128-bit operands. We intend to extend<br />
this exercise to the RV64 and RV128 platforms upon availability.<br />

E. Performance Comparison<br />

In this section, we compare the resulting instruction sequences for 128-bit block encryption using<br />
the two ISAs. Summary results are included in Table VI below; reference implementations in C++ for<br />
128-bit block encryption and the corresponding assembly listings are included in Section VII.D,<br />
and the full sources are published in [6].<br />



TABLE VI. AES-128 ENCRYPTION BENCHMARKS<br />
Platform | MHz | Without extensions MB/MHz | Without extensions clocks/block | Using extensions MB/MHz | Using extensions clocks/block | Speedup<br />
Ubuntu 16.04, Intel 6800K | 3400 | 0.100 | 80.06 | 1.140 | 7.02 | 11.4<br />
Windows 10, Intel Core i7 | 2500 | 0.086 | 186.42 | 0.943 | 16.97 | 10.99<br />
Ubuntu 16.04, Intel Core i3 | 2300 | 0.092 | 173.727 | 0.312 | 51.293 | 3.39<br />
SC RISC-V FPGA implementation | 20 | 0.015 | 1077.44 | 0.725 | 22.07 | 48.82<br />

VIII. CONCLUSION AND FUTURE WORK<br />

In this work, we considered the case of a custom instruction-level RISC-V ISA extension for the<br />
familiar AES cryptography suite.<br />

We review the AES algorithms and propose a functional decomposition and a possible operational<br />
basis for AES acceleration, and then describe the design of the corresponding ISA extension as<br />
well as the execution unit, which covers the full requirements of the AES standard [1].<br />

The proposed custom AES ISA is implemented as an extension of the SCR5 RV32GC processor core [7]<br />
using the basic extensibility features of the RISC-V ISA. The extended core exists as a real-time<br />
FPGA prototype with a full end-to-end AES-based SW stack integrated.<br />

The resulting implementation has been benchmarked using the open-source SW library, and the<br />
results have been compared with the AES extensions of other modern ISAs, based on contemporary CPU<br />
implementations.<br />

We have demonstrated that, even with the basic extensibility features provided by the RISC-V ISA,<br />
staying “compatible” with the standard instruction formats and using only the standard<br />
architectural register file, it is possible to produce a high-quality ISA extension even for the<br />
RV32GC baseline architecture, comparable in all aspects with modern commercial systems.<br />

The proposed extension is minimalistic and includes only 6 new instructions, which can be easily<br />
deployed by the standard RISC-V compiler to produce high-quality results, competitive with the<br />
best contemporary AES implementations on other ISAs, both in code density and performance. It<br />
should be noted that the RISC-V extension performance results have been obtained using a 32-bit<br />
baseline architecture over 64-bit register files, while the referenced architectures are both<br />
64-bit with 128-bit register capabilities.<br />

Although the AES algorithm is not memory-bound, some additional benefits can be expected from<br />
deploying wider, 64-bit or 128-bit baseline architectures (although, for the latter, support in<br />
the C/C++ languages is somewhat behind, as none of the current standards has introduced portable<br />
int128_t or uint128_t integer types). This is part of the follow-up work: the authors intend to<br />
advance the proposed custom AES extension implementation to the RV64 and RV128 baseline RISC-V<br />
architecture cases to evaluate the incremental benefits of wider architectures. Additionally, the<br />
recently published V standard extension proposal [10] is currently under review for potential<br />
applicability.<br />

As a minor side result, we have noticed some improvement opportunities in the extensibility-related<br />
aspects of the current SW infrastructure. We have accumulated these and communicated them to the<br />
RISC-V GCC toolchain maintainers.<br />

REFERENCES<br />

[1] NIST FIPS Publication 197, “Advanced Encryption Standard (AES),” Federal Information<br />
Processing Standards, 2001.<br />

[2] C. Moore, “Data processing in exascale class computer systems,” The Salishan Conference on<br />
High Speed Computing, 2011. Available:<br />
http://www.lanl.gov/conferences/salishan/salishan2011/3moore.pdf<br />

[3] ARM® Cortex®-A57 MPCore Processor Cryptography Extension.<br />

Technical Reference Manual (2015). Available:<br />

http://infocenter.arm.com/help/topic/com.arm.doc.ddi0514g/DDI0514G<br />

_cortex_a57_mpcore_cryptography_trm.pdf .<br />

[4] Botan: Crypto and TLS for C++11 URL: https://botan.randombit.net/<br />

[5] S. Gueron, “Intel® Advanced Encryption Standard (AES) New Instructions Set.” Available:<br />
https://software.intel.com/sites/default/files/article/165683/aes-wp-2012-09-22-v01.pdf<br />

[6] Software and hardware implementation of AES using custom instructions.<br />

URL: https://github.com/syntacore/aes-paper<br />

[7] Syntacore | custom cores and tools. URL: https://syntacore.com/<br />

[8] T. N. Theis and H.-S. P. Wong, “The end of Moore’s Law: A new<br />

beginning for information technology,” Computing in Science &<br />

Engineering, vol. 19, no. 2, pp. 41–50, 2017.<br />

[9] “The RISC-V Instruction Set Manual. Volume 1: User-Level ISA,<br />

Document Version 2.2”, Editors: A. Waterman, K. Asanović, RISC-V<br />

Foundation, May 2017.<br />

[10] K. Asanovic, R. Espasa, “The RISC-V Vector ISA”. Available:<br />

https://content.riscv.org/wp-content/uploads/2017/12/Wed-1330-<br />

RISCVRogerEspasaVEXT-v4.pdf<br />



A RISC-V Based Open Hardware Platform for<br />

Wearable Smart Sensors<br />

Manuel Eggimann*, Stefan Mach*, Michele Magno*°, Luca Benini*°<br />

*Dept. of Information Technology and Electrical Engineering, ETH Zurich, Switzerland<br />

°Dept. of Electrical, Electronic, and Information Engineering, Università di Bologna, Italia<br />

Abstract—Wearable smart sensing is a promising technology to enhance user experience in<br />
sport/fitness, as well as health monitoring. Wearable sensing systems, like Internet of Things<br />
(IoT) systems, not only provide continuous data monitoring and acquisition but are also expected<br />
to filter, process, and extract meaningful information from the acquired data in similar ways as<br />
human experts do. Supporting continuous “smart” operation on ultra-small batteries poses unique<br />
challenges in energy efficiency. In this work, we present an ultra-low-power open embedded<br />
platform that hosts a scalable array of analogue-to-digital converters for biomedical and inertial<br />
sensors, and can parallel-process data on-board with machine learning algorithms (i.e. SVM, KNN,<br />
Neural Networks) with orders-of-magnitude higher computing power and energy efficiency compared to<br />
standard state-of-the-art microcontrollers. The platform’s compute engine is a heterogeneous<br />
multi-core parallel ultra-low power (PULP) processor based on RISC-V, capable of delivering up to<br />
2.5 GOPS. These features are provided under 55 mW power consumption, which makes the platform<br />
ideal for the battery-powered activities typical of wearable applications, with a peak 38.3x<br />
energy efficiency increase (0.7 V, 85 MHz) compared to standard microcontrollers (MCUs) with<br />
similar power budgets. The wearable platform can be interfaced with “electronic skin” (e-skin)<br />
arrays of tactile sensors with up to 64 channels and ECG/EMG sensors with up to 8 channels.<br />
Moreover, the platform includes a Bluetooth Low Energy 5.0 module for energy-efficient wireless<br />
connectivity.<br />

Keywords—Wearable Devices, Ultra Low Power, Parallel Architecture, RISC-V, Energy Efficiency,<br />
Smart Sensing.<br />

I. INTRODUCTION<br />

Recent advances in electronics performance and miniaturization, in combination with the<br />
availability of sensors and reliable connectivity, are enabling the design of more intelligent and<br />
miniaturized devices that embed a huge amount of computational performance compared to only a few<br />
years ago. Among other smart devices, wearable devices are tightly coupled to the human body [1].<br />
Concurrently, there has been a significant increase of interest in monitoring human health during<br />
daily activities and using smart wearable devices for medical applications, aimed at reducing the<br />
hospitalization of users with chronic diseases. One of the most desirable features of wearable<br />
devices is their capability to autonomously process data from the sensors and take action through<br />
actuators [1].<br />

Machine Learning (ML) methods are playing an important role in wearable devices, and they are<br />
particularly interesting tools for various emerging applications [2]. In fact, they are used for<br />
data analysis in many domains, and they are the core of the perception units in our<br />
integrating-intelligence era. Embedding machine learning methods is expected to enable machine<br />
intelligence in lightweight wearable devices as well as robotics and prosthetics. Researchers in<br />
embedded machine learning are putting emphasis on designing specialized architectures to deal with<br />
the demand for large computational and storage capability [3]. Due to algorithm complexity and<br />
large datasets, embracing typical machine learning methods for machine intelligence is still a<br />
challenging task in battery-powered wearable devices with limited hardware capability, especially<br />
when dealing with real-time functionality.<br />

Today, microcontrollers, especially from the ARM Cortex-M family, achieve a good tradeoff between<br />
power consumption (in the order of mW) and computational resources (in the order of MOPS) [4]. A<br />
power consumption of a few tens of mW, or lower, is a must for operating battery-powered devices<br />
without too-frequent battery recharges; however, the computational resources of microcontrollers<br />
tend to be insufficient to perform on-board processing for the complex algorithms and sensors<br />
targeting biomedical/wearable applications [2]. In recent years, there have been many research<br />
efforts to design new processors that match the computational resources required for on-board data<br />
processing with state-of-the-art machine learning algorithms, as required for smart wearables.<br />
Two approaches to improve the performance of ultra-low-power processors have shown promise [5][6].<br />
The first one is to exploit parallelism as much as possible. Parallel architectures for<br />
near-threshold operation, based on multi-core clusters, have been explored in recent years with<br />
different application workloads for an implementation in a 90nm technology [3]. A second very<br />
prolific research area is exploiting low-power fixed-function hardware accelerators coupled with<br />
programmable parallel processors to retain flexibility while improving energy efficiency for<br />
specific workloads [7]. Such near-threshold parallel heterogeneous computing approaches hold great<br />
promise.<br />

In this work, we present a hardware platform that includes a heterogeneous parallel ultra-low<br />
power (PULP) processor based on four RISC-V cores. RISC-V is an open-source instruction<br />



set architecture achieving core density and performance similar to commercial and academic<br />
microprocessors based on a proprietary ISA, such as the ARM Cortex-M. The version of PULP employed<br />
in this work is capable of delivering up to 2.5 GOPS with power consumption in the range of mW.<br />
The whole platform is designed for battery-powered applications such as wearable electronics and<br />
to be interfaced with tactile sensors [7]. Evaluations in terms of power consumption,<br />
functionality, and energy efficiency are presented with experimental measurements.<br />

The rest of the paper is organized as follows: Section 2 describes the proposed wearable platform<br />
and presents the whole system architecture with the ultra-low-power multicore parallel platform<br />
(PULP) SoC, Section 3 shows the experimental results, and Section 4 concludes the paper.<br />
II. SYSTEM ARCHITECTURE<br />
Fig. 1 shows the block diagram of the proposed wearable and wireless device that targets e-health<br />
applications. In this work, we focus on the platform architecture exploiting an ultra-low-power<br />
and energy-efficient parallel processor (PULP) based on RISC-V. The platform includes two sensor<br />
interfaces: an 8-channel ECG/EMG analog front-end and a 64-channel current-input ADC intended to<br />
interface a piezoelectric tactile sensor matrix for e-skin applications [9]. Communication between<br />
the wearable platform and external devices (i.e. smartphones) is enabled by a Bluetooth Low Energy<br />
5.0 SoC with an ARM Cortex-M4F core embedded on the SoC.<br />
Fig. 1 Overview of the Platform<br />
A. Honey Bunny: A RISC-V Parallel Ultra-Low Power Processor (PULP)<br />
The core of the designed platform is a Parallel Ultra-Low Power (PULP) processor 1 with four<br />
RISC-V cores. PULP is an open-hardware platform effort by ETH Zurich and the University of Bologna<br />
featuring near-threshold multicore processing with tightly coupled memory, as presented in [7].<br />
The PULP SoC employed here, called Honey Bunny, features four RISC-V cores and is manufactured in<br />
GlobalFoundries’ 28nm CMOS process. Fig. 2 shows the block diagram of the processor, including the<br />
4 RISC-V cores and the shared memory. Honey Bunny offers a range of standard peripheral interfaces<br />
such as SPI, I2C, or UART that allow interfacing it with commercial sensors and MCUs.<br />
Fig. 2 Block Diagram of PULP Honey Bunny<br />

B. Wireless Interface<br />

The wearable platform presented here can communicate with<br />

external devices through Bluetooth Low Energy 5.0. The<br />

NRF52832 from Nordic Semiconductor was selected due to its<br />

ultra-low power consumption in transmission mode (7.1 mA<br />

current during 1 Mbit/s transmission at 0dBm) and a reasonable<br />

number of peripherals for sensor connection. A Bluetooth Low<br />

Energy custom service has been implemented to send the<br />

processed data to a smartphone where it can be stored, plotted in<br />

real time, or forwarded to the cloud. In this version of the platform, we also use the NRF52832, whose embedded ARM Cortex-M4F core interfaces both sensor front-ends via SPI. In<br />

this preliminary work, we selected this configuration to save<br />

energy when we need to acquire data, as PULP can stay in deep<br />

sleep when its processing power is not needed, and the data just<br />

needs to be transmitted directly to the BTLE client. However, a<br />

configuration where the PULP processor is directly interfaced<br />

with the sensors can also be envisioned.<br />

C. Sensor subsystems<br />

As mentioned, the platform targets e-health applications, so<br />

two analog-to-digital front-ends are included in the design. For<br />

the ECG/EMG electrodes, an ADS1298 from Texas Instruments<br />

has been chosen. The front-end integrated circuit (IC) supports eight channels, which enable conventional 12-lead ECG measurement at up to 32 kSPS per channel. Aside from the eight<br />

24-bit sigma-delta converters the chip also includes<br />

programmable gain amplifiers and circuitry to generate the<br />

necessary reference voltage for the unipolar leads.<br />

Besides ECG measurements, the platform is intended to<br />

interface tactile sensor matrices based on piezoelectric polymers<br />

for e-skin applications. This kind of sensor separates electrical charge proportionally to the applied force. It has been shown that<br />

the DDC current integrator ADC family from Texas Instruments<br />

1<br />

http://www.pulp-platform.org/<br />



is capable of providing sufficiently high sensitivity (20-bit and a<br />

maximum charge range of 150pC) and frequency response (up<br />

to 3.1 kSPS per channel) for tactile sensing applications [9]. Our<br />

platform uses the DDC264 which contains 64 channels. The<br />

DDC264 uses two integrators per channel to allow continuous<br />

integration of the input current. While one of them is<br />

accumulating charge, the other one is connected to the ADC. An<br />

external signal multiplexes between both integrators. To meet the strict timing constraints of the DDC264 multiplex signal and<br />

the SPI transactions despite the frequent high priority interrupts<br />

of the Bluetooth stack, a special feature of the NRF52832 was<br />

used. The NRF52832 can trigger and perform SPI<br />

transactions, GPIO transitions and other tasks without the need<br />

of an interrupt service routine. The tasks can be triggered by<br />

events from other peripherals, e.g. a timer compare event. In<br />

conjunction with DMA, this made it possible to generate the necessary<br />

clock signals and SPI transactions without any involvement of<br />

the core.<br />
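The ping-pong behavior of the two integrators can be modeled in software. The sketch below is ours and purely illustrative (names, units and the reset-after-conversion behavior are assumptions, not taken from the DDC264 datasheet): one side integrates while the other is read out, so charge accumulation is continuous.<br />

```c
#include <assert.h>
#include <math.h>

/* Hypothetical per-channel model of the DDC264 double-integrator scheme:
 * two integrators alternate under an external multiplex signal. */
typedef struct {
    double integ[2];   /* accumulated charge of side A and side B, in pC */
    int    active;     /* side currently integrating */
} ddc_channel_t;

/* One multiplex period: accumulate the input current on the active side,
 * toggle the mux, and return the charge of the side handed to the ADC. */
double ddc_mux_cycle(ddc_channel_t *ch, double current_pA, double t_int_s)
{
    int side = ch->active;
    ch->integ[side] += current_pA * t_int_s; /* Q = I * t (pA * s = pC) */
    ch->active = side ^ 1;                   /* other side integrates next */
    double q = ch->integ[side];              /* finished side goes to the ADC */
    ch->integ[side] = 0.0;                   /* cleared for its next turn */
    return q;
}
```

With a constant 100 pA input and a 1 ms integration window, every period yields 0.1 pC, regardless of which side is being read.<br />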

III. EXPERIMENTAL RESULTS<br />

To demonstrate the computational capabilities and energy efficiency of the data processing unit of the platform, a comparison of Honey Bunny and an ARM Cortex-M4F microcontroller has been conducted. Furthermore, an application utilizing the whole system pipeline has been tested in practice to highlight the functionality of the whole platform.<br />

A. Exploration of Honey Bunny PULP<br />

Honey Bunny operates at a nominal core supply voltage of 1 V and an I/O voltage of 1.8 V. In this regime, up to 2.5 GOPS can be achieved as the four cores can run at up to 625 MHz, greatly exceeding the computational capabilities of low-power microcontrollers. The STM32F407 from STMicroelectronics is one of the most popular ARM Cortex-M4F microcontrollers, commonly used for the processing of medical sensor data, and is our target microcontroller for performance comparisons. The STM32F407 performs 168 MOPS at its fastest operating point.<br />

TABLE I. FILTERING KERNEL UNDER NOMINAL CONDITIONS<br />

                               STM32F407  Honey Bunny PULP<br />
                               168 MHz    625 MHz    40 MHz<br />
Power                          79.7 mW    54.6 mW    5.0 mW<br />
Speed (normalized)             1.00       15.8       1.01<br />
Energy efficiency (normalized) 1.00       23.0       16.13<br />

To compare the two processors, a kernel used in FIR filtering applications for ECG data has been executed on both processors. For this comparison, the performance (application speed in ms) of the STM32F407 when operating at 168 MHz – its fastest operating point – serves as the baseline. Table I lists the comparison of power, normalized application speed and normalized energy efficiency of the ARM Cortex-M4F microcontroller and Honey Bunny PULP. Honey Bunny delivers almost 16x the performance at only two thirds of the power drawn, resulting in an energy efficiency that is 23x higher than the microcontroller's. As PULP achieves a much higher application speed, power can be saved by reducing the frequency of Honey Bunny to match the application speed of the STM32F407. This shrinks power draw to merely 5 mW. However, energy efficiency suffers when a processor runs below the maximum possible frequency at a given supply voltage.<br />

Energy efficiency can be increased by scaling down the core supply voltage – a feature not available on the STM32F407 due to its fixed built-in core supply – while keeping the operating frequency as high as possible. Fig. 3 shows the effects of the applied voltage scaling on processor power, application speed and energy efficiency. As such, we are able to boost PULP's energy efficiency compared to the microcontroller by up to 38x while still handling the filtering workload over 2x faster. The resulting operating point at 0.7 V core supply voltage dissipates 4.5 mW, which confirms PULP's suitability for battery-operated scenarios. Thus, utilizing PULP as a processing unit enables handling much more complex workloads under a much tighter power envelope than the microcontroller examined.<br />

Fig. 3 Processor power dissipation and application characteristics of Honey Bunny PULP during scaling of supply voltage. The fastest possible operating frequency was used for each supply voltage point.<br />
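The normalized energy-efficiency figures above can be checked by hand: efficiency is relative application speed divided by relative power, with the STM32F407 row as the baseline. A minimal check (the helper name is ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Normalized energy efficiency as in Table I: relative speed divided by
 * relative power, baselined on the STM32F407 at 168 MHz (79.7 mW). */
double norm_efficiency(double speed_norm, double power_mw)
{
    const double base_power_mw = 79.7;   /* STM32F407 baseline power */
    return speed_norm * (base_power_mw / power_mw);
}
```

Plugging in the Honey Bunny columns reproduces the table: 15.8 * 79.7 / 54.6 is about 23.0, and 1.01 * 79.7 / 5.0 is about 16.1.<br />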

B. ECG Monitoring and Processing<br />

One application of the platform tested in practice is ECG<br />

measurement: electrodes were attached to the left arm (LA) and<br />

right arm (RA) to measure the corresponding ECG signal Lead<br />

I (LA-RA).<br />

The data is then sent to PULP, where power-line noise and baseline wander are removed through FIR filtering. A simple band-pass<br />

filter with a lower cut-off frequency of 0.5 Hz, an upper<br />

cut-off frequency of 30 Hz and a filter order of 1138 was used.<br />
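The filtering step can be sketched as a standard fixed-point FIR convolution. The paper's 1138 band-pass taps are not listed, so the function below accepts any Q15 tap set; the function name and calling convention are ours.<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Illustrative Q15 fixed-point FIR step in the spirit of the ECG filtering
 * described above. Called once per incoming sample. */
int16_t fir_q15(const int16_t *x, const int16_t *h, size_t ntaps)
{
    /* x points at the newest sample; x[-(ntaps-1)] is the oldest. */
    int32_t acc = 0;
    for (size_t k = 0; k < ntaps; k++)
        acc += (int32_t)h[k] * x[-(ptrdiff_t)k]; /* Q15 multiply-accumulate */
    return (int16_t)(acc >> 15);                 /* renormalize to Q15 */
}
```

With the real 1138-tap band-pass, the same call runs once per sample; the parallel version splits the tap loop across the four cores.<br />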

The filtered samples are sent back to the NRF52832 and are<br />

then transmitted via Bluetooth to a connected smartphone<br />

together with the unprocessed samples. The Android<br />

application decodes the incoming data and plots it in real time<br />

(see Fig. 2). In future applications the preprocessed data could<br />

also be used to perform further ECG analysis (e.g. QRS-<br />



complex detection, atrial fibrillation detection) directly on PULP.<br />

Fig. 2 Realtime ECG plot on the connected smartphone. The upper plot shows the unprocessed raw sensor data. The lower plot shows the same signal FIR-filtered by PULP.<br />

IV. CONCLUSIONS<br />

We presented an energy-efficient wireless platform to process data from ECG and EMG sensors. The core of the platform is a Honey Bunny PULP processor with four RISC-V cores that boost the available computational resources within a power envelope of a few milliwatts. Thanks to this parallel ultra-low power processor, it is possible to achieve the up to 2.5 GOPS needed for emerging smart wearable applications while keeping power in the mW range. In this work we compared the energy efficiency of the novel parallel processor against an ARM Cortex-M4F microcontroller and showed that up to 38.3x higher energy efficiency is achievable. The developed platform targets long-lasting wearable e-health applications, so the design needs to meet stringent computational requirements to perform the classification algorithms while coping with limited energy resources when battery powered.<br />

ACKNOWLEDGMENT<br />

This work was in part funded by the Swiss National Science Foundation project 'MicroLearn: Micropower Deep Learning' (Nr. 162524).<br />

REFERENCES<br />

[1] Soh, Ping Jack, et al. "Wearable wireless health monitoring: Current<br />

developments, challenges, and future trends." IEEE Microwave Magazine<br />

16.4 (2015): 55-70.<br />

[2] Conti, Francesco, et al. "Accelerated visual context classification on a<br />

low-power smartwatch." IEEE Transactions on Human-Machine Systems<br />

47.1 (2017): 19-30.<br />

[3] Cavigelli, L., Magno, M. and Benini, L., 2015, June. Accelerating real-time embedded scene labeling with convolutional networks. In<br />

Proceedings of the 52nd Annual Design Automation Conference (p. 108).<br />

ACM.<br />

[4] Magno, M., Pritz, M., Mayer, P. and Benini, L., 2017, June. DeepEmote:<br />

Towards multi-layer neural networks in a low power wearable multi-sensors bracelet. In Advances in Sensors and Interfaces (IWASI), 2017 7th<br />

IEEE International Workshop on (pp. 32-37). IEEE.<br />

[5] Z. Wang, Y. Liu, Y. Sun, Y. Li, D. Zhang and H. Yang, "An energy-efficient heterogeneous dual-core processor for Internet of Things," 2015<br />

IEEE International Symposium on Circuits and Systems (ISCAS),<br />

Lisbon, 2015.<br />

[6] Ghasemzadeh, H.; Jafari, R.; "Ultra low-power signal processing in<br />

wearable monitoring systems: A tiered screening architecture with<br />

optimal bit resolution." ACM Transactions on Embedded Computing<br />

Systems (TECS), 2013<br />

[7] M. Gautschi et al., "Near-Threshold RISC-V Core With DSP Extensions<br />

for Scalable IoT Endpoint Devices," in IEEE Transactions on Very Large<br />

Scale Integration (VLSI) Systems, vol. 25, no. 10, pp. 2700-2713, Oct.<br />

2017.<br />

[8] L. Seminara et al., "Electronic skin and electrocutaneous stimulation to<br />

restore the sense of touch in hand prosthetics," 2017 IEEE International<br />

Symposium on Circuits and Systems (ISCAS), Baltimore, MD, 2017, pp.<br />

1-4.<br />

[9] L. Pinna, A. Ibrahim, and M. Valle, “Interface Electronics for Tactile<br />

Sensors Based on Piezoelectric Polymers,” IEEE Sens. J., vol. 17, no. 18,<br />

pp. 5937–5947, Sep. 2017<br />



Using RISC-V in high computing, ultra-low power,<br />

programmable circuits for inference on battery<br />

operated edge devices<br />

Eric Flamand<br />

CTO<br />

GreenWaves Technologies<br />

Villard-Bonnot, France<br />

eric.flammand@greenwaves-technologies.com<br />

Abstract— Current ultra-low power edge devices operating<br />

for years on a battery are limited to relatively data-poor sensors<br />

such as temperature and pressure. Allowing the next generation<br />

of edge devices to process data from richer sensors such as audio,<br />

image or motion/vibration enables many exciting new<br />

applications but also poses some serious challenges.<br />

1) How does one transform a large amount of input data into<br />

something that is several orders of magnitude smaller?<br />

2) Much more input data implies much more processing<br />

capability. How does one support algorithms requiring multi-giga operations per second while keeping power consumption in<br />

the mW range?<br />

3) Edge devices tend to have irregular activity patterns. How<br />

does one remain energy efficient in a range of workload that goes<br />

from 0 to multi GOPs?<br />

4) Finally, and just as importantly, how to do all of this while<br />

retaining a simple programming model in a context where, for<br />

the sake of energy efficiency, hardware complexity must be kept<br />

minimal?<br />

In this paper we show how a combination of architectural innovation, design trade-offs and tools innovation makes it possible to tackle these challenges.<br />

We will show how the RISC-V’s extendable ISA allows<br />

specific optimizations for energy efficiency and enables<br />

architectural innovation.<br />

We will use several real-life examples from the image and<br />

audio domain to illustrate how an actual multi-core RISC-V<br />

processor implementation can perform on these applications and<br />

what the path is to efficient implementation.<br />

Keywords—RISC-V; PULP; GAP8; CNN; IoT;<br />

I. INTRODUCTION<br />

During the last few years we have seen rapid progress in<br />

the field of data analytics thanks to a large variety of robust<br />

learning techniques combined with the availability of training<br />

sets. Common sources of data are those produced from sensors<br />

probing the environment for data such as images, sounds,<br />

vibrations. It is possible to connect sensors directly to cloud servers that carry out the analysis; however, if the device needs wireless operation, state-of-the-art data links do not deliver sufficient energy efficiency to allow for battery operation as soon as the data volume becomes significant.<br />

Performing all or part of data analytics on the edge device<br />

can dramatically reduce the amount of data transmitted over<br />

the air since what needs to be carried is a qualitative view of<br />

the raw data, for example the presence of a given object, that is<br />

at least 5 orders of magnitude smaller than the original image.<br />

The challenge becomes how to deliver peak processing capabilities well above 1 giga operations per second while operating on a battery with a reasonable battery lifetime expectation (a year or more).<br />

We introduce GAP8, a multi-core programmable device derived from the PULP open-source project [1][2]; the PULP project itself is built on top of the RISC-V project [3]. We examine how GAP8 uses the flexible attributes of the RISC-V ISA to deliver a state-of-the-art microcontroller (MCU), rich peripherals, ease of programming and security, associated with a powerful programmable parallel processing structure for heavy-duty workloads, which includes a dedicated hardware accelerator to offload the compute-intensive part of convolutional neural networks (CNNs). These two key building blocks are supported by aggressive on-chip power management to minimize the amount of energy needed for a given task. The architecture can deliver up to 8 fully software-programmable giga operations per second (GOPS), or 12 GOPS when the CNN accelerator is used, while consuming only 1 milliwatt for 0.17 GOPS.<br />
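The quoted low-power operating point implies an efficiency figure that is easy to verify: 0.17 GOPS at 1 mW corresponds to 170 GOPS per watt. A one-line check (the helper name is ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Compute operations-per-watt from a GOPS figure and a power figure in mW:
 * GOPS / (mW * 1e-3 W/mW) = GOPS per watt. */
double gops_per_watt(double gops, double power_mw)
{
    return gops / (power_mw * 1e-3);
}
```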

Ease of programming is a challenge in a context where<br />

several compromises in hardware architecture have to be made<br />

to keep power under strict control. To alleviate the impact of<br />

these tradeoffs we examine appropriate software automation<br />



tools that greatly simplify the development of highly optimized computing kernels by automating the generation of the glue code that sits between a compute kernel and its data, allocated across an un-cached data memory hierarchy.<br />

II. ARCHITECTURE<br />

Figure 1 provides a top-level view of the GAP8 architecture.<br />

Fig. 1. The GAP8 Architecture<br />

A. Fabric Controller<br />

The Fabric Controller, a micro controller unit (MCU), is<br />

located on the left side of figure 1. The fabric controller<br />

operates in its own independent power and frequency domain.<br />

It contains one RISC-V ISA programmable core equipped with an instruction cache and a fast access-time data memory. It includes a set of peripherals enabling parallel capture of images, sounds and vibrations, as well as connectivity to an external radio transceiver through an LVDS link, plus a 4-channel PWM interface for motor control in applications such as domestic robotics. Most of the peripherals are shielded by a multi-channel micro DMA to minimize the number of interactions with the controlling core when performing I/O. This is illustrated in Figure 2.<br />

Fig. 2. uDMA and I/O Architecture<br />

The L2 memory is located within the fabric controller perimeter. It is dimensioned to 512 kilobytes, optionally extendable via a DDR HyperBus interface. This area also contains a ROM holding the primary boot code, including secure-boot support through eFUSE-stored keys. The last important block is dedicated to power management, including an on-chip programmable DC/DC converter, an LDO regulator, internal clock generation and a real-time clock.<br />

B. Cluster<br />

On the right side of figure 1 is the cluster domain. The cluster is in a separate voltage and frequency domain and is turned on and adjusted to the right voltage and frequency only when the software application running on the fabric controller needs it. It contains 8 cores based on the RISC-V ISA, identical to the core used in the fabric controller. This allows the SoC to run the same binary code on either the fabric controller or the cluster. These 8 cores are served by a shared data memory, making the cluster friendly to all variants of shared-memory parallel programming models, OpenMP being a good example. The shared data memory can serve all memory access requests in parallel with a very short access-time latency that is completely absorbed by the core pipeline, and with a very low contention rate. This is enabled by a highly optimized interconnect located between the cores' load/store units and the memory banks. The program cache is also shared, to benefit from the high occurrence of situations where all cores execute instructions within a relatively small window of code, with the result that a fetched instruction is very likely to be used by several cores at different points in the same time window. Event servicing, parallel thread dispatching, and synchronization are supported by a dedicated hardware block (HW Sync). Fast event servicing is one of the key parameters for efficient parallel execution, since any cycles wasted in forking and joining tasks on the cluster add to the serial part of the application being run, limiting the cluster's ability to scale performance linearly with the number of cores involved in the task. Ultra-low overhead parallel dispatching and synchronization also allows very fine-grained parallelism. The HW Sync block controls the top-level clock gating of every single core in the cluster. A core waiting for an event (attached to a synchronization barrier or general event) is instantly brought into a fully clock-gated state, zeroing its dynamic power consumption. Figure 3 illustrates dispatch from the master core of a C function Foo(Arg) on all 8 cores.<br />

Fig. 3. Dispatch on cluster cores<br />
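Conceptually, dispatching one function over the cluster is a shared-memory fork/join. The sketch below uses OpenMP, which the text names as a supported model; it is a generic illustration, not GAP8's native dispatch API, and it falls back to a correct serial run when OpenMP is not enabled.<br />

```c
#include <stddef.h>
#include <assert.h>

/* Fork/join sketch: each worker processes a disjoint slice of a shared
 * buffer. On the cluster, HW Sync makes the fork and the implicit join
 * at the end of the parallel region nearly free. */
void square_all(int *data, size_t n)
{
    #pragma omp parallel for   /* ignored (serial) without -fopenmp */
    for (long i = 0; i < (long)n; i++)
        data[i] = data[i] * data[i];  /* per-element work */
}
```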

GAP8’s memory hierarchy is organized as a single name<br />

space: every single core in the chip can see all memory<br />

locations, unless they are protected by the Memory Protection<br />

Unit (MPU), with an access time which increases when the<br />



target address is in L2 memory or in external memory (L3). To<br />

hide the access cost of L3 and L2 memory the cluster contains<br />

a multi-channel DMA capable of 1D and 2D memory accesses.<br />
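A 2D transfer copies a rectangular tile out of a larger row-major buffer in one descriptor. The following is a software model of that access pattern only (the DMA's actual programming interface is not described here):<br />

```c
#include <stdint.h>
#include <string.h>
#include <assert.h>

/* Software model of a 2D DMA transfer: copy a width x height tile out of
 * a larger row-major image (src_stride = full source row length in bytes),
 * packing it contiguously, as when staging an L2/L3 tile into L1. */
void dma_copy_2d(uint8_t *dst, const uint8_t *src,
                 int width, int height, int src_stride)
{
    for (int row = 0; row < height; row++)
        memcpy(dst + row * width, src + row * src_stride, width);
}
```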

When the cluster runs CNN based applications it can<br />

offload compute intense convolutional layers to a dedicated<br />

accelerator, the Hardware Convolution Engine (HWCE) [4].<br />

This block can evaluate a full 5x5 convolution or three 3x3<br />

convolutions on 16-bit operands in a single cycle. It is directly<br />

connected to the cluster's shared L1 memory through several<br />

load store units similar to the ones used in the cluster’s<br />

programable cores. Since the HWCE shares its memory with<br />

the cores and has access to the synchronization resources, a<br />

HW accelerated convolution can be freely mixed with activities<br />

running on the cores. Besides boosting performance, the<br />

HWCE plays an essential role in improving the energy<br />

efficiency of the overall system when running CNN based<br />

applications. The fact that it internally maximizes data and<br />

coefficient reuse leads to a 4 to 5 times energy efficiency<br />

improvement compared with a pure software parallel and<br />

vectorized implementation.<br />
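For reference, the operation the HWCE evaluates in a single cycle is a plain 25-term multiply-accumulate. In portable C (this is our reference formulation, not the accelerator's interface):<br />

```c
#include <stdint.h>
#include <assert.h>

/* Reference for one HWCE output: a full 5x5 convolution on 16-bit
 * operands, i.e. 25 multiply-accumulates into a 32-bit accumulator. */
int32_t conv5x5(int16_t in[5][5], int16_t coeff[5][5])
{
    int32_t acc = 0;
    for (int r = 0; r < 5; r++)
        for (int c = 0; c < 5; c++)
            acc += (int32_t)in[r][c] * coeff[r][c];
    return acc;
}
```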

C. Processor<br />

One of the key building blocks of GAP8 is its cores. The elementary core is a simple in-order, 4-stage pipeline, compliant with the RISC-V ISA subsets I, M, C. Since the<br />

RISC-V ISA is architected to be extendable we have used<br />

extended instructions [5] to boost performance for DSP centric<br />

kernels manipulating integer or complex numbers represented<br />

as vectors of short integers. Dedicated support has been added for zero-overhead hardware loops, pointer post-modified load and store, single-cycle multiply-accumulate, single-cycle complex multiplication, as well as dedicated instructions for efficient rounding, normalization and clipping. To increase instruction-level parallelism (ILP), single instruction multiple data (SIMD) support has been added, enabling vectors of four 8-bit elements or two 16-bit elements. SIMD operations can produce<br />

either vectors or scalars in operations such as dot products or<br />

accumulation of dot products. Finally, some bit-manipulation<br />

oriented instructions, like bit insertion or extraction, are also<br />

added to make control-oriented code more compact and more<br />

cycle efficient. The elementary core complies with the RISC-V<br />

privileged instructions specification to enable the execution of secured code, assisted by a built-in programmable memory protection unit (MPU).<br />
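A typical DSP inner loop that these extensions target looks, in portable C, like the following. On the extended ISA the loop becomes a zero-overhead hardware loop with SIMD dot products on pairs of 16-bit elements, and the final clipping maps to the dedicated clip support; that mapping is our reading of the text, while the C itself is generic.<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <assert.h>

/* Q15 dot product with renormalization and saturation, the kind of kernel
 * the hardware loop, MAC, SIMD and clip extensions accelerate. */
int16_t dot_q15_sat(const int16_t *a, const int16_t *b, size_t n)
{
    int32_t acc = 0;
    for (size_t i = 0; i < n; i++)
        acc += (int32_t)a[i] * b[i];   /* multiply-accumulate */
    acc >>= 15;                        /* renormalize Q30 -> Q15 */
    if (acc >  32767) acc =  32767;    /* clip to the int16 range */
    if (acc < -32768) acc = -32768;
    return (int16_t)acc;
}
```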

Fig. 4. Baseline RISC-V versus extended ISA code comparison<br />

Fig. 4 gives an example of the difference between native<br />

RISC-V assembly code (on the right) and extended RISC-V<br />

assembly code (bottom left) for the same C code. In both cases<br />

the code is automatically generated by the compiler.<br />

Fig. 5. Baseline RISC-V versus extended ISA<br />

In Fig. 5 the performance improvement of the extended<br />

ISA versus the RISC-V base ISA is illustrated. ISA extensions<br />

are organized in 2 groups: V2 is DSP centric, V3 is V2 plus<br />

SIMD, bit manipulation and more DSP instructions. The<br />

performance figures are obtained on a group of representative<br />

kernels containing convolution, FFT, filtering, and ciphering.<br />

Beyond performance improvements these extensions bring<br />

additional energy efficiency since for the same performance on<br />

a given size of workload, the clock frequency and, in some<br />

cases, the supply voltage can be lowered.<br />

D. Power Management<br />

To achieve power efficiency and to minimize the number of<br />

external components, the SoC contains an internal DC/DC converter that can be directly connected to an external battery.<br />

It can deliver voltages in the range of 1.0V to 1.2V when the<br />

circuit is active. When the circuit is in sleep mode this<br />

regulator is turned off and a low-dropout (LDO) regulator is used to power the real-time clock, which controls programmed wake<br />

up and, optionally, part of the L2 memory allowing retention of<br />

application state for fast wakeup. When in deep sleep the<br />

current consumption is reduced to 70nA (assuming the real<br />

time clock is active and no data retention). The two main<br />

domains have their own separate clocks. Special attention has<br />

been paid to the time needed to turn on and turn off the cluster.<br />

The typical turn-around time is between 100 µs and 150 µs,<br />

allowing for agile power state transitions.<br />
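Under duty cycling, the deep-sleep figure dominates average consumption. A back-of-the-envelope model (only the 70 nA figure comes from the text; the function name and example numbers are ours):<br />

```c
#include <assert.h>
#include <math.h>

/* Average current of a node active for t_active out of every t_period
 * seconds and otherwise in the 70 nA deep-sleep state. Currents in uA. */
double avg_current_ua(double active_ua, double t_active_s, double t_period_s)
{
    const double sleep_ua = 0.07;              /* 70 nA deep sleep */
    double duty = t_active_s / t_period_s;     /* fraction of time active */
    return duty * active_ua + (1.0 - duty) * sleep_ua;
}
```

For example, a hypothetical 10 mA burst for 1 ms every second averages to roughly 10.07 µA, so the sleep floor contributes almost nothing.<br />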

E. Software development flow<br />

Code generation is supported by GCC (7.1). The mainstream RISC-V GCC compiler and binutils suite have been modified to natively support all ISA extensions described earlier,<br />

including vector support. Applications are written in C/C++.<br />

To exploit parallelism and since the cluster is designed for<br />

shared memory programming models OpenMP can be used but<br />

a simple and efficient API is also available providing very low<br />

overhead access to task dispatch and synchronization<br />

resources. Platform resource management (peripherals, time,<br />

memory, power) is supported either by the PULP-OS (event<br />

based) or through an ARM Mbed port (thread or event based).<br />

Other RTOS ports are planned in the future. In both cases the<br />



cluster is used with a component-based model: a software<br />

component is installed onto the cluster after it has been turned<br />

on and configured. Interfaces are negotiated between the RTOS<br />

running on the MCU and the component in the cluster.<br />

Synchronization and a small number of system services are<br />

local to the cluster. More sophisticated services, involving<br />

peripherals for example, are delegated to the MCU part of the<br />

system. This approach makes the software part running on the<br />

cluster independent of the RTOS running on the MCU.<br />

Programs are always fetched from L2 memory through the<br />

previously described instruction caches and can be statically or<br />

dynamically linked. In the latter case dynamic relocation is<br />

performed through a light weight model to limit code<br />

expansion in L2. Combining dynamic relocation with the<br />

user/machine mode support in the core and memory protection<br />

(MPU) eases the deployment of secured kernels and<br />

applications.<br />

Data is not cached to avoid the significant power penalty<br />

associated with data caches. To use the architecture at its best<br />

the preferred approach is to always try to keep data as close as<br />

possible to the cores using it. Data is efficiently moved from<br />

and to L2 and L1 by the DMA or uDMA engines, but code<br />

restructuring and organization can prove to be error prone and<br />

time consuming. To ease development a tool has been<br />

developed to automate this process. Basic kernels are first<br />

written without taking into consideration where data is located.<br />

These kernels can be optimized and parallelized without the<br />

programmer being concerned with data placement. The basic<br />

kernels are then combined into what we call user kernels. A<br />

user kernel is described as a multi-dimensional iteration space.<br />

It contains a collection of connected basic kernels that can be<br />

inserted into the user kernel iteration space at pre-defined<br />

locations (depth, prologue, body, epilogue). The user kernel<br />

definition contains a series of argument definitions for the basic<br />

kernel arguments. Each argument defines which sub space of<br />

the kernel iteration space it is concerned with, a tiling direction,<br />

a home location of the data (L2 memory, external memory) and<br />

a set of properties. Using this model and a given L1 memory<br />

budget the tool infers a tiling structure for each argument<br />

fitting within the L1 memory budget and satisfying the set of<br />

constraints put on all the arguments. Once the tiling structure<br />

has been computed it generates a C program that takes care of<br />

providing tiles to the basic kernels in a pipelined manner to<br />

keep all the cores continuously working. The tool is delivered<br />

as a C library which exposes an API to create models. The<br />

models themselves are written in C and once linked with the<br />

library become executable. Running a compiled model<br />

produces C code wrappers containing calls to the basic kernels<br />

(either sequential or parallel) as well as DMA transactions.<br />

With this approach it is possible to build generators for specific<br />

algorithms and then combine these together. We have<br />

developed generators that, for example, automatically generate<br />

different types of CNN layer. Adding a layer to a CNN graph is<br />

simplified to defining the nature of the layer (convolution,<br />

pooling, rectification, and so on), dimensionality and location<br />

of the coefficients (L2, external memory). These parameters<br />

are all captured in a single call to a generator. Since generators<br />

can be combined, the whole network can be easily built and<br />

executing the model will produce all the wrappers calling the<br />

optimized basic kernels and managing memory movements.<br />
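The tile-size inference the generator performs can be illustrated with a deliberately simplified model; the real tool handles multi-dimensional iteration spaces and per-argument constraints, and the function name and double-buffering factor here are our assumptions.<br />

```c
#include <assert.h>

/* Simplified tiling inference: the largest number of image rows whose
 * double-buffered input and output tiles fit a given L1 budget.
 * All sizes in bytes; two buffers per direction let the DMA refill one
 * tile while the cores compute on the other. */
int rows_per_tile(int row_in_bytes, int row_out_bytes, int l1_budget)
{
    int per_row = 2 * (row_in_bytes + row_out_bytes);
    int rows = l1_budget / per_row;
    return rows > 0 ? rows : 0;   /* 0 means the budget is too small */
}
```

For a hypothetical 320-pixel-wide 16-bit image with same-size output and a 40 KB L1 budget, this yields 16 rows per tile.<br />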

Fig. 6. Input to code generation tool (MNIST CNN)<br />

In Fig. 6 we show how a CNN generator is instantiated for<br />

a MNIST network. Offloading to the convolutional accelerator<br />

(HWCE) fits nicely within this approach since the accelerator<br />

consumes and produces data from and to shared L1 memory<br />

and as such it conforms perfectly to the definition of a basic<br />

kernel. Beyond the domain of CNNs other generators have<br />

been developed: 2D FFT, various feature extractors (Histogram<br />

of gradients, difference of gradients, various estimators, image<br />

resize, HOG + weak-predictor based object recognition, and so<br />

on).<br />

This automatic tiling tool is used as a backend for higher-level tools that support direct export from CNN frameworks such as TensorFlow into optimized C code ready to be compiled for the chip.<br />

Fig. 7. Software development flow<br />

Fig. 7 summarizes the key modules involved in the<br />

software development flow.<br />



F. Application Results<br />

TABLE I.<br />

APPLICATION RESULTS<br />

Application<br />

Cores<br />

1 2 4 8<br />

1D FFT1024<br />

Radix4<br />

28.2 14.3 7.8 4.7<br />

2D FFT 256 x<br />

256 Radix4<br />

78.9 41.9 22.6 13.3<br />

Byte 5x5 Conv 18.5 9.3 4.7 2.2<br />

Short 5x5 Conv 37.8 18.9 9.5 4.6<br />

Binary 5x5<br />

Conv<br />

20.8 10.5 5.3 2.8<br />

Short<br />

MaxPool2x2<br />

8.2 4.2 2.1 1.1<br />

Short MatMult<br />

32x32<br />

41.9 20.9 14.0 5.2<br />

Short 2048 to 1<br />

Fully<br />

3112.0 1616.0 847.0 495.0<br />

Connected<br />

CannyEdge 99.5 50.9 26.2 12.7<br />

AES-CTR 128b 15.3 7.7 4.0 2.1<br />

64 Mel<br />

Coefficients<br />

542.7 299.4 176.7 101.3<br />

HoG, 8x8<br />

Cells,<br />

2x2Blocks, 9<br />

Bins<br />

65.0 35.0 18.0 9.0<br />

In Table I, cycles per produced elementary output are given as a function of the number of cores used when running<br />
sample test applications. It should be noted that the exact same binary is used when running on 1, 2, 4 or 8 cores: the code is<br />
automatically dispatched onto the number of cores dynamically passed to the hardware dispatcher. All cycle counts are<br />
obtained using a hardware timer to capture time before and after the application, so they capture all activity,<br />
not just CPU time but also DMA and distant-level memory accesses. The list below provides a detailed description of<br />
each test application. The test applications have been implemented to optimally benefit from the parallelization and<br />
vectorization opportunities provided by the architecture.<br />
<br />
1. 1D FFT 1024 Radix 4: a fixed-point (real and imaginary in Q15) single-dimension radix-4 FFT;<br />
1024 outputs are produced, and cycles are reported for 1 output.<br />
2. 2D FFT 256x256 Radix 4: a fixed-point (real and imaginary in Q15) bi-dimensional radix-4 FFT;<br />
65536 outputs are produced, and cycles are reported for 1 output.<br />
3. 5x5 convolutions: key kernel for CNNs and variants; 25 sums of products. The byte variant handles byte inputs,<br />
short handles 16-bit inputs, and binary performs binary convolution. A single output is produced, and cycles<br />
are reported for it.<br />
4. Short MaxPool 2x2: key kernel for CNNs; operands are 16 bits, one output is produced, and cycles<br />
are reported for it.<br />
5. Short MatMult 32x32: performs a matrix multiplication between two 16-bit 32x32 matrices; 1024<br />
outputs are produced, and the cycle count is for 1 output.<br />
6. Short 2048 to 1 Fully Connected: a CNN fully connected layer with 2048 inputs, 2048 coefficients, and 1<br />
output, all 16-bit. The cycle count is for 1 output.<br />
7. Canny Edge Detector: Gaussian smoothing (5x5), gradient magnitude and orientation, non-max<br />
suppression, blob extraction. Average cycles per output image pixel is reported.<br />
8. AES-CTR 128: AES encryption/decryption, CTR mode, 128-bit key; cycles are reported for 1 output bit.<br />
9. 64 Mel Coefficients: sub-band analysis on 64 bands (64 mel coefficients). Input is 16 kHz, 16-bit PCM;<br />
frame: 400 samples; frame overlap: 10 ms. Steps: pre-emphasis, Hamming window, radix-2 FFT 512, mel and<br />
mel-derivative extraction. Cycles are reported for 1 mel-and-derivatives coefficient.<br />

Fig. 8. Baseline RISC-V versus extended ISA<br />

Fig. 8 shows the factor of performance increase as a<br />

function of the number of cores used. The speedup factor<br />

indicates the architecture’s ability to efficiently scale in<br />

performance without being impaired by elements such as<br />

synchronization overhead and memory contention. For all the<br />

reported applications the geometric mean for the speedup<br />



factor when using 8 cores compared to a single core is 7.1. This<br />

shows that for an application set with enough diversity the<br />

architecture scales very well.<br />

TABLE II. CNN TOPOLOGIES<br />
<br />
Layer              In     W     H   Out   Arithmetic Ops<br />
CIFAR10<br />
Conv5x5/1           1    32    32     8           313600<br />
MaxPool2x2/2        8    28    28     8             4704<br />
Conv5x5/1           8    14    14    12           480000<br />
MaxPool2x2/2       12    10    10    12              900<br />
FullyConnected    300     1     1    10             6000<br />
Total                                             805204<br />
MNIST<br />
Conv5x5/1           1    28    28    32           921600<br />
ReLU               32    24    24    32            18432<br />
MaxPool2x2/2       32    24    24    32            13824<br />
Conv5x5/1          32    12    12    64          6553600<br />
ReLU               64     8     8    64             4096<br />
MaxPool2x2/2       64     8     8    64             3072<br />
FullyConnected   1024     1     1    10            20480<br />
Total                                            7535104<br />
TEXT RECO<br />
Conv3x3/1           1   128   128    32          9144576<br />
ReLU               32   126   126    32           508032<br />
MaxPool2x2/2       32   126   126    32           381024<br />
Conv3x3/1          32    63    63    32         68585472<br />
ReLU               32    61    61    32           119072<br />
MaxPool2x2/2       32    61    61    32            89304<br />
Conv3x3/1          32    30    30    32         14450688<br />
ReLU               32    28    28    32            25088<br />
MaxPool2x2/2       32    28    28    32            18816<br />
FullyConnected   6272     1     1    64           802816<br />
ReLU               64     1     1    64               64<br />
FullyConnected     64     1     1    13             1664<br />
Total                                           94126616<br />

To give more insight into the architecture’s performance on more complex applications involving<br />
Convolutional Neural Networks (CNNs), we provide performance evaluations on three networks. The first two,<br />
CIFAR10 and MNIST, are well known. The third one is significantly larger, with 421,263 trainable parameters and<br />
1,511,904 neurons; it is used to perform text recognition from images. The characteristics of the networks are<br />
provided in Table II. The Arithmetic Ops column gives the total number of dyadic arithmetic operations needed when<br />
performing inference, excluding memory accesses.<br />

TABLE III. CNN PERFORMANCE (TOTAL CYCLES PER INFERENCE)<br />
<br />
Network       1 core      2 cores     4 cores     8 cores   8 cores + HWCE   Speedup<br />
CIFAR10        711042      415838      254988      178458            65033      10.9<br />
MNIST         7620725     4099793     2359415     1559166           816731       9.3<br />
TextReco     97823730    51461201    28198788    17274720          8325727      11.7<br />

Table III shows the total number of cycles required when running a full inference on these three networks. Cycle counts here<br />
include all operations; for example, for the TEXT RECO network, access to coefficients stored in an external memory is<br />
included. Cycles are reported for five configurations: 1, 2, 4 and 8 cores, and 8 cores with the HWCE accelerator running the<br />
convolutional layers while all the other layers run on the 8 cores. For the 8-cores-plus-HWCE configuration, comparing<br />
total cycles against total arithmetic operations gives a good measure of how the architecture behaves.<br />

G. Conclusion<br />

We have presented an ultra-low-power programmable platform derived from two major open-source initiatives: PULP<br />
and RISC-V. Content understanding, and in particular CNN-based solutions, are the primary focus of this platform. We<br />
have provided evidence that when this platform operates on real networks, the combination of parallelism and a hardware<br />
accelerator leads to a 10x improvement versus a single-core model while also improving energy efficiency. Through a set<br />
of kernels and real-life applications we have shown the capability of this platform to scale efficiently with the number<br />
of cores used. We have explained how the architecture is organized and, in particular, the trade-offs we have chosen to<br />
improve energy efficiency. We have shown the importance of high-level tools for efficiently mapping complex applications onto a<br />
parallel architecture which, by necessity, is not equipped with hardware assistance to hide the complexity of explicit<br />
memory-hierarchy management.<br />

The architecture has been taped out in GAP8 using the TSMC 55LP process. It shares power and size characteristics<br />
with state-of-the-art ultra-low-power MCUs but, at the same time, thanks to aggressive parallel and vector computing<br />
capabilities, is capable of delivering several giga-operations per second within a very small power envelope. We have shown how this<br />
architecture enables new applications for battery-operated edge devices with rich-data sensing capabilities.<br />

REFERENCES<br />

[1] Davide Rossi, Francesco Conti, Andrea Marongiu, Antonio Pullini, Igor<br />

Loi, Michael Gautschi, Giuseppe Tagliavini, Philippe Flatresse, Luca<br />

Benini “PULP: A Parallel Ultra-Low-Power Platform for Next<br />

Generation IoT Applications”<br />

[2] http://www.pulp-platform.org<br />

[3] https://riscv.org/<br />



[4] Francesco Conti, Luca Benini, “A Ultra-Low-Energy Convolution<br />

Engine for Fast Brain-Inspired Vision in Multicore Clusters”,<br />

Proceedings of the 2015 Design, Automation & Test in Europe<br />

Conference & Exhibition, 2015<br />

[5] Michael Gautschi, Pasquale Davide Schiavone, Andreas Traber, Igor<br />

Loi, Antonio Pullini, Davide Rossi, Eric Flamand, Frank K. Gurkaynak,<br />

Luca Benini, “Near-Threshold RISC-V Core With DSP Extensions for<br />

Scalable IoT Endpoint Devices”, IEEE Transactions on Very Large<br />

Scale Integration Systems (TVLSI)<br />



Precisely engineered RISC-V embedded processors in 30 days<br />

Keith A. Graham<br />

Electrical, Computer & Energy Engineering<br />

University of Colorado<br />

Boulder, USA<br />

Keith.A.Graham@Colorado.EDU<br />

Abstract— RISC-V can transform embedded systems by providing precisely engineered processors that satisfy performance,<br />
cost, and power requirements unavailable in conventional IP processor core designs. These precisely designed embedded<br />
processors meet the application’s requirements by including only the necessary hardware resources. To make this<br />
realization happen, today’s embedded processor designer must become a core processor engineer. In this paper, we explore<br />
engineering, in 30 days, an embedded processor that can process data in real time<br />
at 80 megasamples per second (MSPS), utilizing a highly abstracted CPU processor development strategy.<br />

Keywords—RISC-V, Embedded Processor Design, Vector<br />

Processor, Custom CPU, Soft Processor Core, IP core processor<br />

I. INTRODUCTION<br />

Over the last two to three decades, custom CPU design has been replaced by IP processor cores. A major motivation<br />
towards these IP cores has been the availability of open-source software that minimizes development costs as well as long-term<br />
support costs. Today, if you design a system using an ARM processor, there are open-source operating systems,<br />
drivers, and applications available to the embedded solution. RISC-V is changing the balance towards custom<br />
solutions by providing an open-source Instruction Set Architecture (ISA) as well as a platform for open-source code.<br />
By removing the requirement of custom software for non-critical application code, as well as long-term support costs,<br />
RISC-V design teams can focus on custom CPU resources and software to provide a unique customer experience.<br />

RISC-V is a modular ISA in which specific extensions are defined. The benefit of modularity is that a<br />
solution only needs to support the hardware resources required for a specific ISA extension or solution, thus minimizing<br />
power, die area, and cost [1]. The RISC-V extensions can be considered base starting points for a custom embedded<br />
solution. For example, the 80 MSPS signal processing application only requires 32-bit integer math and minimal<br />
division, so it will be based on an RV32I ISA standard core. There are extensions that define 64-bit and 128-bit<br />
address widths, multiply/divide, and atomic instructions, as examples. It should be noted that, to be RISC-V certified, each<br />
hardware instantiation must support all base RISC-V instructions, either via direct hardware execution or through software<br />
exception handling.<br />

In today’s IP-core-based embedded solutions, if an application does not specifically match an IP core, choosing an<br />
IP core is analogous to shopping in a grocery store where you must purchase the next larger box to make your<br />
favorite recipe. The next larger box leaves leftovers that you paid for, in terms of power and silicon. Most designs<br />
must balance development time and risk management. You would expect IP cores to have an advantage in both<br />
time and risk management, but modern, highly abstracted CPU processor development tool chains can change the balance<br />
towards customized cores. If application requirements increase at the later stages of a project, the ability to<br />
provide an altered CPU core reduces risk compared to a static IP core solution.<br />

For this paper, we will develop a customized RISC-V processor to meet an 80 MSPS real-time processing project.<br />
The design is based on a 5-stage RV32I core designed at the University of Colorado at Boulder for instructional<br />
and research purposes. The CPU processor development environment, Codasip Studio 7.0, enables the development of<br />
an optimized processor by adding CPU resources while generating a C compiler to access these additional resources.<br />
Tools like Codasip Studio release the power of RISC-V by enabling RISC-V cores that execute open-source RISC-V<br />
binaries to also execute application-specific code compiled to utilize the optimized CPU.<br />

II. GOAL SETTING<br />

Like all projects, goal setting defines the end objective of the embedded solution. In our project, the instrument<br />
development team defined a goal of 80 MSPS of 14-bit incoming data that must be processed in real time. Breaking<br />
down the algorithm into its sub-components, we will focus on the Kurtosis algorithm. The goal is to process<br />
10,000 sets of 8,192 samples in 0.10 seconds. As a further goal, based on limiting the FPGA in size, power, and cost, these<br />
10,000 Kurtosis operations can be spread over a maximum of 40 CPUs on a single FPGA, resulting in a Kurtosis<br />
computation completing every 400 microseconds.<br />



Fig. 1. Kurtosis algorithm<br />

Each RISC-V core will have a local buffer, and the average of each incoming data block will be calculated as it is stored<br />
from the ADC: the incoming data stream generates the sum of each incoming block in hardware as it is brought into<br />
the RISC-V local memory buffer. With the Kurtosis block data average done in hardware, the number of cycles to<br />
calculate the Kurtosis algorithm will be the difference between the CPU cycle count upon exiting the Kurtosis routine and the<br />
cycle count upon entering it.<br />
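For reference, the Kurtosis computation being benchmarked can be sketched as below. This is an illustrative C version (the function and variable names are ours, not the project's code), with the block mean assumed to be already available from the hardware summation described above:<br />

```c
#include <stdint.h>

/* Sample kurtosis of one block: n * sum((x-mean)^4) / (sum((x-mean)^2))^2.
 * The block mean is assumed to be pre-computed by the streaming hardware. */
double kurtosis(const int16_t *x, int n, int32_t mean) {
    double m2 = 0.0, m4 = 0.0;
    for (int i = 0; i < n; i++) {
        int32_t d  = x[i] - mean;          /* 14-bit samples: difference fits easily */
        double  d2 = (double)d * (double)d;
        m2 += d2;                          /* running sum of squared deviations */
        m4 += d2 * d2;                     /* running sum of fourth powers */
    }
    return (double)n * m4 / (m2 * m2);
}
```

The inner loop is dominated by multiplications, which is exactly the "hot spot" the profiling in the following sections targets.<br />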

III. EXECUTION TIME<br />

There are three key variables that make up the execution time of the Kurtosis algorithm, or any CPU execution time:<br />
the number of instructions, the average cycles per instruction, and the clock period [2].<br />

Time(secs) = # of Instructions * Cycles per Instruction * Clock Period(secs/cycle)<br />
<br />
Time(secs) = Total Cycles * (1 / Frequency(cycles/sec))   (1)<br />

Using data-driven techniques in building the embedded processor, all three of these variables will be considered. Adding<br />
CPU resources to the base RV32I core will have the greatest impact on instruction count and cycles per instruction,<br />
but how the resources are architected can also impact the frequency, i.e., the clock period.<br />
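Equation (1) is simple enough to express directly in C; the small helper below (an illustrative sketch, not the project's code) is the calculation used for every design point in this paper:<br />

```c
#include <stdint.h>

/* Equation (1): execution time in microseconds from the total cycle
 * count and the clock frequency in Hz. */
double exec_time_us(uint64_t total_cycles, double freq_hz) {
    return (double)total_cycles / freq_hz * 1e6;
}
/* For example, the baseline of Section IV, 2,191,817 cycles at 75 MHz,
 * gives approximately 29,224 us. */
```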

IV. GETTING A BASELINE<br />

First, let’s determine whether the base RV32I core can achieve the desired Kurtosis performance without<br />
modification. All simulations will be done using the Codasip Studio C compiler generated for the CU RISC-V<br />
implementation, with an optimization setting of -O1, on a cycle-accurate model. This optimization was chosen for the<br />
ease of setting breakpoints upon entering and exiting the Kurtosis routine for benchmarking. To more closely<br />
emulate -O3 optimization, the algorithm is written with the loop unrolling that -O3 would otherwise perform.<br />
The cycle-accurate model represents a 5-stage pipeline hardware design from which RTL can be generated and<br />
then synthesized into an FPGA.<br />
<br />
The CU RV32I FPGA design is expected to operate at 75 MHz; we will use this frequency as our<br />
benchmark frequency.<br />

Time(secs) = Total Cycles * (1 / Frequency)<br />
Time(secs) = 2,191,817 * (1 / 75,000,000)<br />
Time = 29,224 us   (2)<br />

The base core executed the Kurtosis routine in a simulated 2,191,817 cycles, representing 29,224 us, which does not meet<br />
the application’s requirement of 400 us. To achieve the goal of minimal CPU resources satisfying the application requirements,<br />
resources will be added one at a time, starting with reducing instruction count through the addition of an instruction. To<br />
determine a possible instruction to add, the integrated Codasip Studio profiler tool highlights the routine’s “hot spots” in red.<br />
These “hot spots” indicate the largest concentration of clock cycles per line of C code.<br />

Fig. 2: CU RISC-V cycle accurate Kurtosis profile<br />

Fig. 3: Profiled assembly instructions<br />

Based on the Kurtosis routine’s profiled “hot spot,” replacing the multiply math function with a multiply instruction may<br />
significantly reduce the number of instructions, which should reduce the execution time per equation (1).<br />

V. MULTIPLY<br />

The RISC-V ISA defines an “M” extension which includes multiply and divide operations. This extension defines two<br />
multiply instructions: the first returns the lower 32 bits of a 32-bit by 32-bit multiply, and the second returns the upper 32<br />
bits. The ISA allows a microarchitecture to fuse these instructions into a single operation to obtain the full 64-bit result [3].<br />
In the Kurtosis application, the “hot spot” is the multiply instruction and not the divide. With the focus on minimal CPU<br />
resources to provide the required performance, only the multiply operation will be added to the CPU. To further refine the<br />
requirement, the incoming data stream will not result in a multiply whose return value is wider than 32 bits, which<br />
narrows the desired multiplication to the RISC-V MUL instruction.<br />

Fig. 4: RISC-V MUL instruction format<br />

For a two-source-operand and destination-register operation (R-type), the complete instruction opcode comprises three<br />
fields: the opcode, bits 6-0; funct3, bits 14-12; and funct7, bits 31-25. The plan is to implement the RISC-V MUL<br />
instruction: opcode = 0b0110011, funct3 = 0b000, and funct7 = 0b0000001 [3].<br />
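The R-type field layout can be checked with a few lines of C. This encoder is illustrative (it is not part of the Codasip flow), but it produces the exact 32-bit MUL encoding defined by the ISA:<br />

```c
#include <stdint.h>

/* Assemble a RISC-V R-type MUL: funct7=0b0000001, funct3=0b000, opcode=0b0110011. */
uint32_t encode_mul(uint32_t rd, uint32_t rs1, uint32_t rs2) {
    return (0x01u << 25)   /* funct7 = 0000001         */
         | (rs2  << 20)    /* source register 2        */
         | (rs1  << 15)    /* source register 1        */
         | (0x0u << 12)    /* funct3 = 000             */
         | (rd   << 7)     /* destination register     */
         |  0x33u;         /* opcode = 0110011 (OP)    */
}
/* encode_mul(5, 6, 7) yields 0x027302B3, i.e. "mul x5, x6, x7". */
```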

Now, the power of a highly abstracted processor<br />

development tool begins to become apparent. To add the<br />

Multiply instruction, only four segments of code need to be<br />

added. First, adding the MUL opcode to the CPU model in<br />

opcodes.hcodal.<br />

Fig. 5: MUL definitions in opcodes.hcodal<br />
<br />
Second, the instruction must be added to the instruction accurate model so that the compiler becomes aware of the<br />
MUL operation. To add this instruction, two code segments are required in the instruction accurate model to enable the<br />
assembler, disassembler, and C compiler generation. In the isa.codal file, we use DEF_OPC to define mnemonics for the<br />
assembler and the associated complete opcode, OPC_MUL. By adding opc_mul to the set of i_alu, we have added the<br />
MUL to the R-type instructions.<br />
<br />
Fig. 6. Adding MUL in isa.codal<br />
<br />
For readability in the instruction accurate model, we have created a routine, alu(opc, src1, src2), that the instruction<br />
accurate model calls to determine which ALU operation to complete based on the incoming opcode [4]. The ALU routine<br />
is a large switch statement that returns the result of the operation based on the complete opcode provided to the<br />
routine’s switch statement. The MUL implementation is a signed multiplication based on casting both src1 and src2 to<br />
(int32).<br />

Fig. 7. MUL define in ia_utils.codal<br />

It took five minutes to add the MUL to the instruction accurate model. To verify correct operation at the<br />
instruction-accurate level, Codasip generates an assembler(ia) and simulator(ia), respectively, from the Codasip<br />
Tasks window. Compiling took a total of 2.11 minutes: 1.11 minutes for the assembler(ia) and 1.00 minute for the<br />
simulator(ia).<br />



Fig. 8: Codasip Task window<br />

Using standard RISC-V assembly format, the following<br />

test code was added to an assembly routine that performs a<br />

regression test on all CPU instructions. The regression test<br />

includes tests for data forwarding and load data hazard<br />

detection as well.<br />

Fig. 9: MUL added to regression test<br />

In under 15 minutes, an instruction was added to the CPU model and verified through simulation. With the instruction<br />
integrated into the model, it must now be added to the cycle accurate model, from which RTL can be generated and synthesized<br />
into an FPGA or ASIC. Like the instruction accurate model, minimal code is required to add the MUL to the 5-stage<br />
pipeline physical model. First, the MUL must be added to the instruction decoder and then to the ALU. It can be<br />
added to the general ALU or as a separate execution unit; for this paper, it will be added to the general ALU.<br />

Fig. 11: MUL added to execute pipeline stage<br />

With these two files updated, we can now generate a cycle accurate simulation by clicking on simulator(ca) in the<br />
Codasip Tasks window. Generation of the cycle accurate simulator took 1.81 minutes, after which the regression<br />
test was run on the cycle accurate model, verifying correct execution of the MUL instruction. In under 30<br />
minutes, the multiply instruction was added and verified in a 5-stage cycle accurate model.<br />

VI. MUL SPEED UP<br />

With the cycle accurate model completed, the cycle count to complete the Kurtosis algorithm went from<br />
2,191,817 to 68,211 cycles. At 75 MHz operation, that would appear to get close to the target: 909 us versus the goal<br />
of 400 us. This speed-up is due to instruction count reduction. We now need to add the impact, on the clock period, of<br />
placing a multiplier in the ALU. At the time of this paper, the FPGA has not been chosen, and we will use an estimated<br />
clock frequency, set by the multiply instruction, of 25 MHz. With the MUL instruction now setting the clock frequency<br />
(the clock period), going back to the execution time in equation (1), the simulated Kurtosis time is 2,728 us. There is<br />
a dramatic increase in performance, but with the MUL operation now dictating the clock period of all the<br />
instructions, the non-MUL instructions are negatively impacted by the addition of the MUL instruction.<br />

Fig. 10: MUL added to cycle accurate decoder<br />



Rerunning the Kurtosis simulation with the common case sped up, the cycle count increased from 68,211<br />
to 109,181 clock cycles, but with the clock frequency back at 75 MHz instead of 25 MHz, the execution time is<br />
now 1,456 us, a speed-up of 183% over the implementation where the MUL instruction set the clock period (frequency).<br />

VII. MULTIPLY ACCUMULATE (MAC)<br />

Through data analysis, the CPU performance has been increased, in a few hours, by 20 times, but it still does not achieve the<br />
goal of 400 us execution time. Going back to the profiler, the instruction sequence for the first line inside the Kurtosis loop<br />
is a MUL followed by an ADD. There is a possibility of further instruction count reduction by realizing this sequence<br />
of MUL and ADD as a multiply-accumulate function: the first line of code would drop from five<br />
to four instructions.<br />
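Semantically, the fusion replaces a two-instruction sequence with a single operation; in C terms (an illustrative sketch, not the generated code):<br />

```c
#include <stdint.h>

/* What the compiler currently emits for acc += a * b: a MUL then an ADD. */
int32_t mul_then_add(int32_t acc, int32_t a, int32_t b) {
    int32_t prod = a * b;   /* MUL */
    return acc + prod;      /* ADD */
}

/* What a fused multiply-accumulate (MAC) instruction would compute as
 * one operation, eliminating the separate ADD. */
int32_t mac(int32_t acc, int32_t a, int32_t b) {
    return acc + a * b;
}
```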

Fig. 12: Kurtosis loop disassembled<br />

From the screenshot of the Kurtosis loop, only 8 of the 23 instructions are MUL instructions; the other 15 instructions<br />
could be operating at 75 MHz. Going back to one of the Great Ideas of Computer Architecture, “make the common case<br />
fast,” let’s change the cycle accurate model so that the MUL operation is a multiple-cycle instruction taking three clock<br />
cycles (25 MHz) while all other instructions take one clock cycle [2]. By adding these resources, the performance of the non-<br />
MUL instructions will be back at 75 MHz.<br />

Fig. 13: Adding resource to define multiple cycle instructions<br />

Fig. 15: Kurtosis loop profile with MUL instruction<br />

Learning from the previous performance optimization, both instruction count and clock frequency impact execution<br />
time. If the MAC operation reduced the instruction count but did not decrease the cycle count, no performance increase would<br />
be realized. Going back to the 5-stage pipeline implementation in Fig. 16, the pipeline stage after the execute<br />
stage is the memory stage; for a non-load or non-store operation, this stage is a pass-through. To gain the benefit of<br />
reducing the instruction count, the accumulate portion of the MAC instruction will therefore be performed in the memory pipeline<br />
stage. Adding the MAC instruction is like adding the MUL instruction, with the addition of an accumulate<br />
operation in the memory pipeline stage. It took 60 minutes to implement and verify the addition of the MAC routine.<br />

Fig. 16: CU RISC-V 5-stage pipeline<br />

Fig. 14: Multi-cycle delay function in execute stage<br />

Adding these changes to the cycle accurate model took 15 minutes, and another 15 minutes were needed to regenerate the software<br />
development kit: the assembler, the compiler, the C libraries, etc. In 30 minutes, the performance data on speeding up the<br />
common case was available.<br />

The penalty of completing an instruction over two stages is that the result cannot be forwarded back to the ALU for an<br />
additional clock cycle. The hardware must ensure proper execution by inserting a bubble if the instruction immediately<br />
following the MAC requires the result of the MAC operation. If the compiler generated this data-dependent series of<br />
instructions, the benefit of moving the accumulate operation to the memory stage would be lost. To instruct the compiler to<br />
organize the instructions so that the instruction immediately following the MAC has no data dependency,<br />



codasip_compiler_schedule_class(sc_mac) is set to two instructions in the instruction accurate model.<br />
<br />
Fig. 17: Adding instruction scheduling<br />
<br />
The cycle count for the MAC implementation increased by 15,752 to 124,933 clock cycles over the multi-cycle MUL<br />
implementation, increasing execution time to 1,666 us. In analyzing the data, this increase stems from the code generated<br />
for the Kurtosis loop not always being able to insert a useful operation immediately following the MAC operation. When<br />
the compiler cannot find a useful instruction, it inserts a NOP (addi x0, x0, 0) instruction.<br />
<br />
The Codasip Studio profiler includes a relative area and power estimator. With the design not synthesized at this<br />
time, the estimates are only relative. Adding the MAC resources increased the estimated area by 17% and the power<br />
by 23% over the multi-cycle MUL implementation. With no realized performance benefit from adding the MAC, further<br />
design optimizations will focus on the multi-cycle multiplier implementation.<br />
<br />
At this point, I have completed my first day of the project: I have sped up the performance by 20.08 times, increased<br />
estimated area by 24%, and decreased power by 95% compared to the base core implementation.<br />

VIII. IMPROVING BRANCHES AND JUMPS<br />

Going back to the basic 5-stage pipeline design, branches and jumps are decided in the execute pipeline stage [2]. If the<br />
branch is false, the code flows linearly without a penalty; by default, branches are predicted not taken. In a loop, however,<br />
branches are normally taken. In the basic model, if a branch is taken, the instructions in the decode and fetch stages must be<br />
discarded, effectively making a taken branch cost three clock cycles instead of one. To eliminate the performance penalty<br />
of a taken branch, a dynamic branch prediction buffer will be added to the fetch pipeline stage. This buffer acts<br />
as a cache: if a branch has been previously taken, the instruction fetched after the branch will be from the branch target address<br />
instead of the address immediately following the branch, effectively making taken branches one cycle. When the<br />
branch is then not taken, there is now a penalty to recover from the misprediction. As in the above examples, adding<br />
resources for the branch prediction buffer is quite easy. I chose to implement the buffer using the Codasip register file<br />
component, broken into three elements: the valid bit, the tag, and the branch target address.<br />
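The mechanics of such a buffer can be sketched in plain C (our approximation of the structure described above; the field and function names are illustrative, not the CodAL source):<br />

```c
#include <stdint.h>
#include <stdbool.h>

#define BPB_SIZE       2   /* depth found sufficient for the Kurtosis loop */
#define BPB_INDEX_BITS 1   /* log2(BPB_SIZE), sets the tag width */

/* Direct-mapped dynamic branch prediction buffer: valid bit, tag, and
 * branch target address, mirroring the three register-file elements. */
typedef struct {
    bool     valid[BPB_SIZE];
    uint32_t tag[BPB_SIZE];      /* upper PC bits */
    uint32_t target[BPB_SIZE];   /* predicted branch target address */
} bpb_t;

/* In fetch: on a hit, redirect the next fetch to the stored target. */
bool bpb_lookup(const bpb_t *b, uint32_t pc, uint32_t *next_pc) {
    uint32_t idx = (pc >> 2) & (BPB_SIZE - 1);
    uint32_t tag = pc >> (2 + BPB_INDEX_BITS);
    if (b->valid[idx] && b->tag[idx] == tag) {
        *next_pc = b->target[idx];
        return true;             /* predict taken */
    }
    return false;                /* predict not taken */
}

/* On a taken branch resolved in execute: record it for next time. */
void bpb_update(bpb_t *b, uint32_t pc, uint32_t target) {
    uint32_t idx   = (pc >> 2) & (BPB_SIZE - 1);
    b->valid[idx]  = true;
    b->tag[idx]    = pc >> (2 + BPB_INDEX_BITS);
    b->target[idx] = target;
}
```

Keeping BPB_SIZE a #define is what makes the size experiments described below cheap to run.<br />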

Fig. 18: Kurtosis disassembly with MAC instruction<br />

Hand assembly could generate more efficient code in this example. The compiler used an additional MAC operation per<br />
line of code that could have been an ADD instruction; with the MAC instruction equating to a three-cycle instruction and<br />
the ADD to a single-cycle instruction, the cycle count increased.<br />

Fig. 19: Dynamic Branch Prediction Buffer resources<br />



The use of #define statements enables the power of the high-level abstraction design flow for your processor<br />
design. By changing the definition of BPB_SIZE, you can quickly recompile and experiment to find an optimal<br />
branch prediction buffer size. To allow the cycle accurate model to automatically reconfigure for a change in the branch<br />
prediction buffer size, BPB_INDEX_SIZE is used to determine the width of the tag, per Fig. 20.<br />

Fig. 20: Dynamic Branch Prediction Buffer in cycle accurate model<br />

Obtaining execution data for buffer sizes of 2, 4, 8, and 16 took 45 minutes. It was demonstrated that<br />
the Kurtosis routine did not experience any performance increase for a buffer size greater than 2. With the dynamic<br />
branch buffer, each Kurtosis block is now executed in 1,347 us. The overall test program had a performance benefit<br />
at a buffer size of 16. For the analysis of this test loop, we will set the dynamic branch prediction buffer to a depth of 2,<br />
while realizing that the buffer size may change as additional elements of the overall algorithm are developed.<br />
<br />
At this point, I have completed my second day of the project: I have sped up the performance by 21.69 times,<br />
increased estimated area by 44%, and decreased power by 95% compared to the base design.<br />

IX. VECTOR PROCESSOR<br />

A significant tuning of the base processor was accomplished in two days, but to reach sub-400 us per<br />
Kurtosis block of 8,192 elements, the test data for the optimized processor indicates that a performance increase of<br />
between 3x and 4x is required. To obtain this level of performance increase, data-level<br />
parallelism will be implemented using a 4-lane vector processor. Adding a vector processor is a significantly greater<br />
effort than the resources added up to this point, but still very manageable. Using the same CPU design methodology,<br />
the instruction accurate model should be developed first, to verify the instructions in an assembly routine as well as to<br />
develop the compiler. After verifying the instruction accurate model, the cycle accurate model is created based on the<br />
defined instruction accurate model. From Fig. 21, you will notice that many files are replicated for the vector processor<br />
additions, indicated by simd (Single Instruction, Multiple Data).<br />

Fig. 21: Vector processor files indicated by simd<br />

To further aid vector processor development, Codasip Studio has many built-in functions that ease development of the vector processor instruction set, such as codasip_select_v4u32(v_cond, v_src1, v_src2). This routine selects, per lane, which vector data is stored in the destination vector register based on the value in the v_cond vector register. This built-in is one example of the many highly abstracted functions available to the designer.
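Assuming the per-lane semantics just described (take the src1 lane where the condition lane is nonzero, otherwise the src2 lane), the built-in can be modeled in scalar C as follows. This sketch models the behavior only; it is not the Codasip implementation.

```c
#include <stdint.h>

/* 4-lane vector of 32-bit unsigned values. */
typedef struct { uint32_t lane[4]; } v4u32;

/* Scalar model of per-lane select: where the condition lane is nonzero,
 * take src1's lane, otherwise src2's. */
v4u32 select_v4u32(v4u32 v_cond, v4u32 v_src1, v4u32 v_src2) {
    v4u32 dst;
    for (int i = 0; i < 4; i++)
        dst.lane[i] = v_cond.lane[i] ? v_src1.lane[i] : v_src2.lane[i];
    return dst;
}
```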

When developing the vector processor, its instruction set must include both the functions required by the target application and the instructions the compiler needs to generate code. For example, the compiler requires a vector move instruction to copy the value of one vector register to another. This move can be a unique instruction or an alias instruction that effectively performs the move (Fig. 22).



Fig. 22: Vector MV alias / pseudo instruction<br />

Supporting the development of the compiler will enable the<br />

compiler to automatically take advantage of the vector<br />

processor in loops when appropriate. The effort to support the<br />

compiler will be worth the portability and performance<br />

enhancements of the algorithm.<br />

At the time of this paper, I am working on the compiler to optimize the use of a secondary buffer memory, a dedicated vector memory. To obtain simulation results, I utilized the compiled inner loop in an assembly routine. The execution time of the Kurtosis test loop is now 389 µs, achieving the target of sub-400 µs.
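For reference, the computation being tuned can be sketched in plain C as below. This is an illustrative formulation (population kurtosis over one block), not the project's optimized assembly routine; the second loop is the hot accumulation that a vectorizing compiler can spread across the 4 lanes.

```c
#include <math.h>
#include <stddef.h>

/* Population kurtosis of one block: m4 / m2^2 over n samples. */
double kurtosis(const float *x, size_t n) {
    double sum = 0.0;
    for (size_t i = 0; i < n; i++)
        sum += x[i];
    double mean = sum / (double)n;

    double m2 = 0.0, m4 = 0.0;
    for (size_t i = 0; i < n; i++) {   /* vectorizable accumulation */
        double d  = (double)x[i] - mean;
        double d2 = d * d;
        m2 += d2;
        m4 += d2 * d2;
    }
    m2 /= (double)n;
    m4 /= (double)n;
    return m4 / (m2 * m2);
}
```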

The effort to design, implement, and verify the vector<br />

processor took ten days. At the end of the twelfth day, I have<br />

sped up the performance by 75 times, increased estimated area<br />

by 612% and decreased power by 95% compared to the base<br />

design.<br />

X. CACHES<br />

Once the FPGA selection has been narrowed, a cache may need to be implemented to reach the target frequency of operation, either for performance or to remove the structural hazard that arises when an instruction fetch and a data access contend for the same memory in a single cycle [2]. In a highly abstracted design environment, cache implementation can be reduced to six lines of code specifying size, latencies, number of sets, block size, replacement policy, and non-cacheable addresses. The complexity of comparing and managing the cache tags, valid bit, and dirty bit is handled by the Cache component. As with evaluating different depths of the dynamic branch prediction buffer, experiments on the cache size can be turned around in 10 to 15 minutes. The ability to obtain data easily and quickly enables the design to be developed and optimized on schedule with minimal risk.
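The six cache parameters listed above can be pictured as a simple configuration record. The struct and field names below are hypothetical stand-ins chosen for this sketch, not the Codasip Cache component's actual syntax.

```c
#include <stdint.h>

/* Hypothetical configuration record mirroring the six cache parameters
 * listed above. */
typedef struct {
    uint32_t size_bytes;      /* total cache size                 */
    uint32_t hit_latency;     /* cycles on a hit                  */
    uint32_t miss_latency;    /* cycles on a miss                 */
    uint32_t num_sets;        /* associativity                    */
    uint32_t block_size;      /* line size in bytes               */
    enum { REPL_LRU, REPL_FIFO, REPL_RANDOM } replacement;
    struct { uint32_t lo, hi; } non_cacheable;  /* uncached range */
} cache_cfg_t;

/* Example: a small 2-way data cache with one uncached IO window. */
static const cache_cfg_t dcache_cfg = {
    .size_bytes    = 4096,
    .hit_latency   = 1,
    .miss_latency  = 10,
    .num_sets      = 2,
    .block_size    = 32,
    .replacement   = REPL_LRU,
    .non_cacheable = { 0x40000000u, 0x4FFFFFFFu },
};
```

Sweeping a cache-size experiment then amounts to editing one field and regenerating the model.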

Fig. 23: Codasip Data Cache component<br />

XI. CONCLUSION

An open hardware architecture, RISC-V, provides a platform on which the designer can customize a solution to match the end application while retaining the ability to execute open source software. Highly abstracted development tools provide the mechanism to tailor the solution to the end application's performance, power, area, and cost goals through simulation and data analysis. The ability to optimize the processor at any point in the development cycle reduces project risk and enhances time to market. The goal of the design is realization in silicon; these development tools generate an RTL soft core that can be synthesized into an FPGA or ASIC. Fig. 24 shows, as an example, the time Codasip Studio took to generate RTL for the RISC-V 4-lane vector processor.

Fig. 24: CU RISC-V Vector Processor RTL generation<br />

Based on these data-driven techniques, the project's embedded processor is designed with a multi-cycle MUL instruction, no MAC instruction, a 2-deep dynamic branch prediction buffer, and a 4-lane vector processor. The same techniques can be used to continually optimize the processor as the signal processing algorithm is developed, minimizing the risk of choosing the wrong IP core processor or of paying for unused capacity in power and silicon area.



XII. RESULTS<br />

ACKNOWLEDGMENTS<br />

I would like to acknowledge the efforts of Pavan Suresh<br />

Dhareshwar, Aravind Venkitasubramony, and Shivasankar<br />

Gunasekaran, who have made this project possible. I would<br />

like to give special gratitude to Zdenek Prikryl for his teaching<br />

and assistance with Codasip Studio.<br />

REFERENCES<br />

[1] David Patterson, Andrew Waterman, The RISC-V Reader, An Open<br />

Architecture Atlas, Berkeley, CA: Strawberry Canyon, 2017.<br />

[2] David A. Patterson, John L. Hennessy, Computer Organization and<br />

Design, The Hardware/Software Interface, RISC-V Edition, Cambridge,<br />

MA: Morgan Kaufmann, 2017.<br />

[3] The RISC-V Instruction Set Manual, Volume I: User-Level ISA,<br />

Document Version 2.2, RISC-V standard, 2017<br />

[4] Codasip Instruction Accurate Model Tutorial, Version 7.0.0, 2017<br />



Securing RISC-V Machines Dynamically<br />

with Hardware-Enforced Metadata Policies<br />

Steve Milburn<br />

Chief Technical Officer<br />

Dover Microsystems, Inc.<br />

Waltham, MA, USA<br />

steve@dovermicrosystems.com<br />

Greg Sullivan<br />

Chief Scientist<br />

Dover Microsystems, Inc.<br />

Waltham, MA, USA<br />

gregs@dovermicrosystems.com<br />

I. INTRODUCTION<br />

At the highest, most simplified level, the ideal approach to<br />

securing a system (hardware or software) consists of three<br />

inter-related activities:<br />

1. Specification: Precise definition of the required behavior<br />

of the system, including both acceptable and disallowed<br />

behaviors. This specification subsumes strictly “security-related”<br />

concerns and includes basic functional<br />

correctness. Groups within the RISC-V community are<br />

currently working in this area.<br />

2. Implementation: Implementation of hardware in some<br />

HDL (hardware description language), e.g. Verilog, and<br />

software in programming languages, such as C/C++, Java,<br />

etc. Presumably the implementation will be done with<br />

requirements, including security, in mind.<br />

3. Verification: Proof, or argument, that the implementation<br />

satisfies the specification. For simplicity, we will focus on<br />

verification of security properties, which can be viewed as<br />

the definition of “bad” behaviors and the proof that any<br />

“bad” behaviors cannot happen at runtime.<br />

The verification process can be divided into two<br />

timeframes:<br />

1. Static analysis: Examine the implementation artifacts and<br />

prove (through reasoning about possible runtime<br />

behaviors given the semantics of the implementation<br />

languages) that disallowed behaviors can never occur.<br />

The form of a proof statement is generally “for all<br />

possible inputs, the system will never exhibit this bad<br />

behavior.”<br />

2. Dynamic analysis: While executing the system, monitor<br />

for bad behavior and interrupt execution and recover if<br />

bad behavior is detected.<br />

Testing is a type of dynamic analysis in that it demonstrates<br />

the absence of bad behavior in some finite number of system<br />

executions. Testing can be used to increase confidence in the<br />

correctness of a system, but, for complex systems, testing is far<br />

short of a proof of correctness.<br />

In summary, a system cannot be considered fully trusted<br />

unless every component of the system has been subjected to<br />

either formal static verification or non-subvertible,<br />

comprehensive dynamic analysis. We summarize this<br />

methodology as follows:<br />

1. Prove what you can, before deployment: Using formal<br />

verification (described more fully later), verify as many<br />

properties as possible, for as many elements of the system as possible. It is beyond our current capabilities to

prove complete correctness and security for all hardware<br />

and software elements involved in complex systems. But<br />

we can prove many critical facts about many critical<br />

elements of systems, and we should.<br />

2. Enforce at runtime any properties not completely<br />

proven: For remaining misbehaviors that we cannot<br />

formally rule out, we need to detect those misbehaviors at<br />

runtime, before they cause harm, and recover.<br />

In the following sections, we focus on formal specification<br />

of security policies using micro-policies, and hardware-based<br />

dynamic enforcement of those micro-policies.<br />

It should be noted that the current state of the art in formal<br />

verification of hardware and software implementations falls far<br />

short of complete system-wide verification. However, great<br />

strides are being made in this area every year, especially on<br />

RISC-V.<br />

It is a daunting task to formalize the correctness and<br />

security of a complex system. It is even more difficult to<br />

attempt to prove that a large, complex implementation<br />

precisely matches the formal specification to which it is<br />

supposed to adhere. Indeed, what constitutes a “complete”<br />

specification of a system is not well-defined. Nonetheless, our<br />

goal is to promote an incremental approach where every<br />

additional specification of some of the properties of a part of<br />

the overall system increases overall trust and security of the<br />

system.<br />



II. FORMAL SPECIFICATION OF SECURITY POLICIES USING MICRO-POLICIES

As previously described, our goals are:<br />

1. Enable incremental, formal specification of desired<br />

security properties of a system,<br />

2. Formally prove, when possible, whether some elements<br />

(ideally, all) of the system adhere to some (ideally, all) of<br />

the security properties, and<br />

3. Dynamically enforce any policies, on any elements of the<br />

system, where static compliance (goal #2) has not been<br />

proven. There is ongoing research on the interaction<br />

between statically verified code and untrusted code [1].<br />

We want the same security specifications to apply to both formal proof and dynamic checking.

Our general approach, which we call micro-policies,<br />

formalizes the collection, propagation, and combination of<br />

metadata during program execution. The semantics of micro-policies<br />

leads to a natural implementation as runtime monitors.<br />

A. Micro-Policies<br />

Micro-policies were introduced in [2]. The formal model of<br />

a micro-policy is a function of five arguments (inputs) that<br />

returns three values (outputs), and that is invoked once for<br />

every instruction executed on the host machine.<br />

Inputs. The five inputs of a micro-policy capture the<br />

instantaneous state relevant to the execution of a single<br />

instruction on a host computer. Note that the values below are<br />

metadata about the corresponding elements of the host<br />

computer state:<br />

1. PC (Program Counter): Metadata associated with the<br />

distinguished PC register in the machine, that is updated<br />

at every cycle to point to the next instruction word to be<br />

executed. The PC metadata is a convenient place to track<br />

dynamic information flow properties, such as “have we<br />

branched on sensitive data?” or “are we in the process of<br />

a control flow transfer?”<br />

2. CI (Current Instruction): Metadata associated with the<br />

word containing the instruction currently being executed.<br />

At compile time, static analysis infers security-relevant<br />

properties of instructions, and that metadata is stored in<br />

the binary along with the instructions. When an<br />

application binary is loaded into memory, the information<br />

collected at compile time is associated, as metadata, with<br />

instruction words.<br />

3. OP1 (1st operand): Metadata associated with the first<br />

argument/operand of the current instruction. For a<br />

STORE instruction, this might be the register containing<br />

the address to be dereferenced. For an ADD instruction,<br />

OP1 will be the metadata associated with the first of two<br />

summands.<br />

4. OP2 (2nd operand): Metadata for the second operand.<br />

For a STORE instruction, this might be the register<br />

containing the data to write to memory. For an ADD<br />

instruction this will be the second summand.<br />

5. MEM (referenced memory): For memory operations<br />

(LOAD, STORE), metadata associated with the word in<br />

memory being referenced.<br />

Outputs: A policy first indicates whether the five inputs above represent an allowed instruction or a policy violation. If the inputs are allowed by the policy, then a

policy can update the metadata associated with any outputs of<br />

the instruction. The PC is always an output, but different<br />

instructions have different updates. For example, a STORE<br />

instruction will update the value at the referenced address,<br />

whereas an ADD instruction will update the value in a<br />

destination register. For our current purposes, we will assume<br />

that each instruction updates the PC and a single location<br />

(either a register or a memory address). Thus, the outputs of a<br />

policy, per instruction, are:<br />

1. Allowed?: If false, an interrupt is thrown on the host<br />

processor.<br />

2. PC’ (updated program counter): Updated value of<br />

metadata associated with the PC.<br />

3. RES (result): Updated value of metadata associated with<br />

a register or memory location modified by the current<br />

instruction.<br />

A policy can therefore be considered as having two<br />

components:<br />

1. A predicate that determines whether the current<br />

instruction, based on metadata associated with relevant<br />

state, is allowed, and<br />

2. A flow rule that updates the metadata on the PC and any<br />

data referenced by the instruction.<br />

B. Metadata Initialization<br />

The abstract model of micro-policies assumes that every<br />

word and register in a system has metadata associated with it.<br />

But where does this metadata come from? There are two<br />

possible scenarios for data entering the live memory of a<br />

running system:<br />

1. Unknown / untrusted. With no other information, we<br />

can only label data according to the route through which it<br />

entered the system. That is, we can label data according to<br />

which IO mechanism brought it into main memory (serial,<br />

network, DMA region, etc.)<br />

2. Labeled. We can devise trusted mechanisms for data to<br />

arrive in a system already labeled with some metadata. An<br />

example of this process is loading an application binary<br />

from untrusted persistent storage. A trusted analysis of<br />

source code, during the compilation process, produces a<br />

signed, encrypted collection of metadata indexed to the<br />

executable produced by the compilation process. When<br />

the operating system loader loads instruction words into<br />

memory, policy code concurrently loads metadata<br />



describing the application instruction words. Policy<br />

metadata loading checks the signature to verify that:<br />

a) The metadata was produced by a trusted analysis, and<br />

b) The binary described by the metadata has not been<br />

altered since the metadata was generated.<br />

With a mechanism for labeling words entering the system<br />

with initial metadata, we outline the sorts of properties that can<br />

be expressed and enforced using micro-policies.<br />

C. Expressible Micro-Policies<br />

Given the general framework outlined above, what sorts of<br />

properties can be enforced using micro-policies? Since the<br />

description so far of micro-policies has been abstract, the<br />

following are example policies that show how to state both<br />

metadata update rules and security predicates against metadata.<br />

Taint Tracking Policy: “Taint tracking” can be used to enforce both confidentiality (where data is allowed to travel) and integrity (tracking trusted data) policies. For this example,

imagine that we want to ensure that confidential data must go<br />

through a designated encryption routine before being copied to<br />

a memory-mapped IO region. As a first step, we define the<br />

type of metadata maintained by taint tracking:<br />

• Tainted?: Every word in memory will have a Boolean<br />

metadata attribute indicating whether it is tainted. In our<br />

example, tainted corresponds to being considered<br />

confidential. For simplicity, we assume that all confidential data in an application has its Tainted? metadata field set to True, and all other data has its Tainted? metadata field set to False.

• Outflow: Memory-mapped IO locations to which we<br />

want to apply our taint tracking policy will have<br />

Outflow set to True; all other words will have Outflow<br />

set to False.<br />

• UntaintInstr: A particular instruction 1 within the<br />

designated encryption routine will have the metadata<br />

field UntaintInstr set to True. All other words will have<br />

UntaintInstr as False.<br />

After defining the fields of metadata being maintained, we<br />

define the Flow Rules for propagating and updating those<br />

fields:<br />

• If any input to an operation is tainted, the output of the<br />

operation becomes tainted.<br />

• If the current instruction has UntaintInstr set to True,<br />

the output of the operation is untainted.<br />

Finally, we define the Security Predicate for the taint<br />

tracking policy:<br />

• If the output location of a memory operation has<br />

Outflow set to True, and the input value for the memory<br />

operation has Tainted set to True, then raise a security<br />

policy violation exception.<br />

1 In fact, we want to ensure that the entire encryption routine is executed,<br />

but we will keep it simple here and designate a single instruction as<br />

representing execution of the entire function.<br />
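The taint-tracking micro-policy above can be sketched as a single C function over metadata bit-flags that returns the allowed?/PC'/RES triple. The flag names, encoding, and function shape are illustrative assumptions for this sketch, not Dover's policy language.

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata bit-flags (hypothetical encoding). */
#define TAINTED      (1u << 0)  /* Tainted?     */
#define OUTFLOW      (1u << 1)  /* Outflow      */
#define UNTAINT_INST (1u << 2)  /* UntaintInstr */

/* The three micro-policy outputs: allowed?, PC', RES. */
typedef struct { bool allowed; uint32_t pc_out; uint32_t res; } policy_out_t;

/* Taint-tracking policy over the five metadata inputs. is_store flags a
 * STORE, where op2 carries the data value's metadata and mem the
 * referenced word's metadata. */
policy_out_t taint_policy(uint32_t pc, uint32_t ci, uint32_t op1,
                          uint32_t op2, uint32_t mem, bool is_store) {
    policy_out_t out = { true, pc, 0 };

    /* Security predicate: tainted data may not be written to an
     * Outflow (memory-mapped IO) location. */
    if (is_store && (mem & OUTFLOW) && (op2 & TAINTED)) {
        out.allowed = false;          /* raises a policy violation */
        return out;
    }
    /* Flow rules: the untaint instruction clears taint on its output;
     * otherwise taint propagates from any tainted input. */
    if (ci & UNTAINT_INST)
        out.res = 0;
    else if ((op1 | op2 | mem) & TAINTED)
        out.res = TAINTED;
    return out;
}
```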

Every micro-policy has these three components: metadata<br />

type structure, flow rules, and a predicate. We can define flow<br />

rules that are used by multiple micro-policy predicates.<br />

Control Flow Integrity Policy: Another example of a<br />

micro-policy is a simple control flow integrity (CFI) policy 2 .<br />

This CFI policy will label all instructions in a loaded<br />

application as either legal targets of control flow instructions<br />

(branches, jumps), or as not targets. The security predicate<br />

simply checks that all control flow instructions land at a legal<br />

target. We will assume that there is metadata on each<br />

instruction word indicating whether or not it is a control flow<br />

instruction (i.e. a branch, jump, or call). The metadata fields<br />

are:<br />

• InstrJump? – True for words containing instructions that<br />

are control flow instructions; False otherwise.<br />

• Target? – True for words containing instructions that<br />

are legitimate targets of control flow instructions; False<br />

otherwise.<br />

• Jumping? – This field exists only in the metadata for the PC (Program Counter) and is True when the preceding instruction was a control flow transfer instruction.

There is only a single Flow Rule:<br />

• If the current instruction has InstrJump? set to True, set<br />

the PC Jumping? field to True. Otherwise, set the PC<br />

Jumping? field to False.<br />

The Security Predicate is:<br />

• If the PC Jumping? field is True, and the current<br />

instruction metadata does not have Target? set to True,<br />

raise a Policy Violation exception.<br />
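Under the same assumed bit-flag encoding as before, this CFI micro-policy reduces to a few lines of C; the names are again illustrative, not a real policy implementation.

```c
#include <stdbool.h>
#include <stdint.h>

/* Metadata bit-flags (hypothetical encoding). */
#define INSTR_JUMP (1u << 0)  /* InstrJump?             */
#define TARGET     (1u << 1)  /* Target?                */
#define JUMPING    (1u << 2)  /* Jumping? (PC metadata) */

/* Returns false on a policy violation. Otherwise applies the single
 * flow rule to the PC metadata and returns true. */
bool cfi_policy(uint32_t *pc_meta, uint32_t ci_meta) {
    /* Security predicate: a just-taken control transfer must land on
     * an instruction marked as a legal target. */
    if ((*pc_meta & JUMPING) && !(ci_meta & TARGET))
        return false;
    /* Flow rule: record whether this instruction transfers control. */
    if (ci_meta & INSTR_JUMP)
        *pc_meta |= JUMPING;
    else
        *pc_meta &= ~JUMPING;
    return true;
}
```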

Hopefully at this point the reader gets a sense of the process<br />

of defining flow rules and security predicates, even though we<br />

have elided many technical details.<br />

D. Composition of Micro-Policies<br />

Multiple micro-policies can be combined to create a<br />

Composed Policy. Flow rules compose in a straightforward<br />

manner, as long as they are concerned with updating disjoint<br />

metadata fields. There can be dependencies between flow<br />

rules; one flow rule may depend on metadata calculated by<br />

another flow rule, inducing an execution order on rules.<br />

Composing security predicates is also straightforward: each<br />

micro-policy has “veto power.” That is, if any individual<br />

security predicate raises an exception, then the composed<br />

policy raises that exception.<br />
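The veto-power composition of security predicates can be sketched as iterating over predicates until one rejects. The five-input predicate signature follows the micro-policy model above; the two helper predicates are toy examples invented for this sketch.

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Predicate signature following the five-input micro-policy model. */
typedef bool (*predicate_fn)(uint32_t pc, uint32_t ci, uint32_t op1,
                             uint32_t op2, uint32_t mem);

/* Toy example predicates (illustrative only). */
static bool always_allow(uint32_t pc, uint32_t ci, uint32_t op1,
                         uint32_t op2, uint32_t mem) {
    (void)pc; (void)ci; (void)op1; (void)op2; (void)mem;
    return true;
}
static bool deny_tainted_op1(uint32_t pc, uint32_t ci, uint32_t op1,
                             uint32_t op2, uint32_t mem) {
    (void)pc; (void)ci; (void)op2; (void)mem;
    return (op1 & 1u) == 0;   /* bit 0 = "tainted" in this toy encoding */
}

/* Composed predicate: every micro-policy has veto power. */
bool composed_allowed(const predicate_fn preds[], size_t n,
                      uint32_t pc, uint32_t ci, uint32_t op1,
                      uint32_t op2, uint32_t mem) {
    for (size_t i = 0; i < n; i++)
        if (!preds[i](pc, ci, op1, op2, mem))
            return false;     /* one rejection vetoes the instruction */
    return true;
}
```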

III. HARDWARE-BASED DYNAMIC ENFORCEMENT OF MICRO-POLICIES

Micro-policy enforcement at runtime can be viewed as a<br />

sort of Reference Monitor 3 and the form (data type, flow rules,<br />

predicate) outlined in the previous section lends itself to<br />

2 We are not implying that this simple control flow integrity policy will block<br />

sophisticated control flow hijacking attacks.<br />

3 Reference Monitor on Wikipedia:<br />

https://en.wikipedia.org/wiki/Reference_monitor (checked August 2017)<br />



implementation as an Inline Reference Monitor [3].<br />

Unfortunately, inline reference monitors suffer from two<br />

serious drawbacks in practice:<br />

1. The monitoring code happens in-line with the application<br />

code, adding substantial runtime overhead.<br />

2. If the attacker is assumed to have some control over the<br />

control flow of the target application, the attacker may be<br />

able to jump around or otherwise subvert the inline<br />

reference monitors.<br />

Dover’s CoreGuard solution implements dynamic<br />

enforcement of security policies in hardware, with<br />

configuration of that hardware done using updatable software<br />

running on an independent processor. For each instruction the<br />

host processor executes, CoreGuard hardware gathers the<br />

relevant metadata tags for the instruction’s inputs, checks a<br />

hardware policy rule cache, and applies the appropriate security<br />

predicate flow rule to the instruction’s outputs. If a policy rule<br />

for the inputs is not found in the hardware policy rule cache,<br />

CoreGuard’s included RISC-V processor core checks the<br />

installed policy software to determine the needed rule, and<br />

provides it to the hardware for application and caching. This<br />

hybrid approach enables CoreGuard to run at-speed with the<br />

host processor checking every single executed instruction,<br />

while also allowing for the benefits of software-defined<br />

policies, such as updateability, arbitrary complexity, and<br />

composition of multiple micro-policies.<br />

CoreGuard’s hardware uses three main components: A)<br />

Hardware Interlock, B) Rule Cache, and C) Policy Executor<br />

(PEX).<br />

A. Hardware Interlock<br />

The hardware interlock controls communication between<br />

the host processor and the rest of the system to ensure that<br />

nothing is written to external memory or peripherals without<br />

first flowing through CoreGuard.<br />

B. Rule Cache<br />

The CoreGuard hardware uses a rule cache to optimize<br />

performance by storing rule processing data so that future<br />

requests for that data can be served faster. The rule cache stores<br />

a number of metadata combinations for allowed instructions—<br />

that is, for instructions that complied with micro-policies and<br />

were therefore allowed to execute. Each rule cache entry<br />

corresponds to a unique set of metadata tags for the<br />

instruction's inputs and outputs (described in the earlier Micro-<br />

Policies section). When CoreGuard processes the current<br />

instruction, it looks to see if the instruction's input metadata tag<br />

combination exists in the rule cache for that instruction.<br />

The rule cache is a multi-way skew associative cache,<br />

which aims at reducing misses by using different indices<br />

through hashing. By default, CoreGuard will evict the least<br />

recently added cache entry when it needs to make room for a<br />

new tag combination; this eviction policy, however, is<br />

configurable via micro-policies. With each instruction that it<br />

processes, CoreGuard updates the output metadata for that<br />

instruction.<br />
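The lookup-miss-install flow just described can be sketched as below. The real CoreGuard rule cache is multi-way skew associative with a configurable eviction policy; this illustrative version is direct-mapped and shows only the flow, with all names invented for the sketch.

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

#define RC_ENTRIES 256

typedef struct {
    bool     valid;
    uint32_t key[5];        /* PC, CI, OP1, OP2, MEM metadata tags */
    bool     allowed;
    uint32_t pc_out, res;   /* cached flow-rule outputs            */
} rc_entry_t;

static rc_entry_t rc[RC_ENTRIES];

/* FNV-1a hash over the five input metadata tags. */
static uint32_t rc_hash(const uint32_t k[5]) {
    uint32_t h = 2166136261u;
    for (int i = 0; i < 5; i++) { h ^= k[i]; h *= 16777619u; }
    return h % RC_ENTRIES;
}

/* Hit: copy the cached decision into *hit and return true.
 * Miss: return false; the PEX then evaluates the installed policy
 * software and calls rc_install() with the computed rule. */
bool rc_lookup(const uint32_t k[5], rc_entry_t *hit) {
    const rc_entry_t *e = &rc[rc_hash(k)];
    if (e->valid && memcmp(e->key, k, sizeof e->key) == 0) {
        *hit = *e;
        return true;
    }
    return false;
}

void rc_install(const uint32_t k[5], bool allowed,
                uint32_t pc_out, uint32_t res) {
    rc_entry_t *e = &rc[rc_hash(k)];
    e->valid = true;
    memcpy(e->key, k, sizeof e->key);
    e->allowed = allowed; e->pc_out = pc_out; e->res = res;
}
```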

C. Policy Executor (PEX)<br />

The Policy Executor (PEX) is the RISC-V processor core<br />

included with CoreGuard to execute micro-policy code. Having<br />

a separate processor enables a clean separation between policy<br />

processing and host processing, which gives CoreGuard greater<br />

control of rule processing and better ability to optimize<br />

performance.<br />

When the system is first initialized, the PEX initializes<br />

metadata for all the words in memory available to the system.<br />

It then loads the application and sets up all application-specific<br />

metadata in memory.<br />

When there is a rule cache miss, the PEX crosschecks the<br />

metadata for the current instruction against the micro-policies<br />

installed on the system. Based on input metadata, the PEX<br />

updates and creates new output metadata.<br />

For complete protection, Dover Microsystems also recommends a mechanism in the SoC that prevents the host processor from accessing the metadata and policy software regions of memory used by the CoreGuard hardware. This ensures that host software cannot manipulate the metadata or policy software to bypass the protection. Such enforcement can be realized with technologies typically already present in SoC network fabrics.

IV. CONCLUSION

By combining formal verification of system components<br />

where the technology exists to enable it, and dynamic analysis<br />

with Dover CoreGuard utilizing Software Defined Metadata<br />

Processing, an execution environment can approach complete<br />

protection. Formal verification of the metadata policies<br />

running in the CoreGuard system completes the security<br />

solution, and is the subject of ongoing research.<br />

REFERENCES<br />

[1] Y. Juglaret, C. Hritcu, A. Azevedo, B. C. Pierce, A.<br />

Spector-Zabusky and A. Tolmach, "Towards a fully<br />

abstract compiler using Micro-Policies: Secure<br />

compilation for mutually distrustful components," arXiv,<br />

2015.<br />

[2] A. Azevedo de Amorim, M. Denes, N. Giannarakis, C.<br />

Hritcu, B. C. Pierce, A. Spector-Zabusky and A. Tolmach,<br />

"Micro-Policies: Formally Verified, Tag-Based Security<br />

Monitors," in IEEE Symposium on Security and Privacy,<br />

SP 2015, San Jose, CA, USA, 2015.<br />

[3] U. Erlingsson, "The inlined reference monitor approach to<br />

security policy enforcement.," 2003.<br />



Cycle Approximate Simulation of RISC-V<br />

Processors<br />

Lee Moore, Duncan Graham, Simon Davidmann<br />

Imperas Software Ltd.<br />

Oxford, United Kingdom<br />

simond@imperas.com<br />

Felipe Rosa<br />

Universidad Federal Rio Grande Sul<br />

Brazil<br />

Abstract— Historically, architectural estimation, analysis and optimization for SoCs and embedded systems has been done using manual spreadsheets, hardware emulators, FPGA prototypes, or cycle approximate and cycle accurate simulators. The precision of these approaches comes at the cost of performance and modeling flexibility. Instruction accurate simulation models in virtual platforms have the speed necessary to cover the range of system scenarios, can be available much earlier in the project, and are typically an order of magnitude less expensive than cycle approximate or cycle accurate simulators. Previously, because of a lack of timing information, virtual platforms could not be used for timing estimation. We report here on a technique for dynamically annotating timing information onto instruction accurate software simulation results. This has achieved an accuracy of better than +/-10%, which is appropriate for early architectural exploration and system analysis. This Instruction Accurate + Estimation (IA+E) approach is constructed from Open Virtual Platforms (OVP) processor models plus a library that can introspect the running system and calculate an estimate of the cycles taken to execute the current instruction. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can also annotate the calculated instruction cycle timing back into the simulation and affect its timing.

Keywords—RISC-V, virtual platform, instruction accurate,<br />

processor models, timing estimation<br />

I. INTRODUCTION<br />

Performance and power consumption are two key attributes<br />

of any SoC and embedded system. Systems often have hard<br />

timing requirements that must be met, for example in safety<br />

critical systems where reaction time is of paramount<br />

importance. Other systems, particularly battery powered<br />

systems, have power consumption limitations.<br />

Because of the importance of these characteristics, many<br />

techniques have been developed for estimation of performance<br />

and power consumption. Recently, with the explosion of<br />

system scenarios that must be considered, this job has become<br />

much more difficult.<br />

Instruction accurate simulation has previously not been<br />

considered as a potential technique for timing and power<br />

estimation, because it is instruction accurate and does not<br />

model processor microarchitecture details: there is no<br />

information about timing or power consumption of instructions<br />

and actions in instruction accurate models and simulators.<br />

Recently some universities, using the Open Virtual Platforms<br />

(OVP) models and OVPsim simulator [1], have experimented<br />

with adding this information into the instruction accurate<br />

simulation environment as libraries, with no changes to the<br />

models or simulation engines [2]. These efforts have shown<br />

great promise, with timing estimation results within +/- 10% of<br />

the actual timing results for the hardware for limited cases.<br />

We report here on the further development of this<br />

technique, and the extension of this technique for RISC-V ISA<br />

based processors. This is critical for the RISC-V ecosystem,<br />

since for RISC-V semiconductor vendors to win embedded<br />

system sockets, their customers are going to want to know<br />

about the timing and power consumption of those SoCs when<br />

running different application software.<br />

II. CURRENT STATE OF THE ART<br />

Historically, SoC architectural estimation, analysis and<br />

optimization has been done using either manual spreadsheets,<br />

hardware emulators, FPGA prototypes, cycle approximate<br />

simulators, cycle accurate simulators, or performance<br />

simulators such as Gem5 [3]. These all have significant<br />

drawbacks: insufficient accuracy, high cost, dependence on RTL<br />

availability (meaning the technique only becomes usable late in<br />

the project, once the RTL design is complete), low performance,<br />

limited ability to support a wide range of system scenarios, or<br />

great complexity in use and in obtaining good results. Table 1 provides a<br />

summary of the strengths and weaknesses of each technique.<br />

www.embedded-world.eu<br />



TABLE I. STRENGTHS AND WEAKNESSES OF CURRENTLY USED<br />

TECHNIQUES FOR TIMING AND POWER ESTIMATION<br />

Technique | Strength | Weaknesses<br />

Manual spreadsheets | Ease of use | Lack of accuracy; inability to support estimations with real software<br />

Hardware emulators | Cycle accurate | High cost (millions USD); needs RTL; < 5 MIPS performance<br />

FPGA prototypes | Cycle accurate | High cost (hundreds of thousands USD); needs RTL<br />

Cycle approximate simulation | Good performance | Lack of accuracy; lack of availability of models<br />

Cycle accurate simulation | Cycle accurate | High cost (hundreds of thousands of USD); lack of availability of models<br />

Gem5 | Microarchitectural detail | A lot of work to develop a model of the specific microarchitecture and to get realistic traces of the SoC<br />

III. INSTRUCTION ACCURATE SIMULATION<br />

Instruction set simulators (ISSs) have long been used by<br />

software engineers as a vehicle for software development.<br />

Over the last 20 years, this technique has been extended to<br />

support not only modeling of the processor core, but also<br />

modeling of the peripherals and other components on the SoC.<br />

The advantages of these simulators are their performance,<br />

typically hundreds of millions of instructions per second<br />

(MIPS), and the relative ease of building the necessary models.<br />

However, the simulator engines and models are instruction<br />

accurate, and are not built to support timing and power<br />

estimation.<br />

The performance of these simulators comes from the use of<br />

Just-In-Time (JIT) binary translation engines, which translate<br />

the instructions of the target processor (e.g. Arm) to<br />

instructions on the host x86 PC. This enables users to run the<br />

same executables on the instruction accurate simulator as on<br />

the real hardware, such that the software does not know that it<br />

is not running on hardware. Peak performance with these<br />

simulators can reach billions of instructions per second. A<br />

more typical use case, such as booting SMP Linux on a<br />

multicore Arm processor, takes less than 10 seconds on a<br />

desktop x86 machine.<br />
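The translation-cache mechanism behind this JIT speed can be illustrated with a toy sketch: blocks of target code are translated once on a cache miss, then re-executed many times from the cache. All names here (`tcache_fetch`, `run_loop`) and the block granularity are invented for the example; this is not the OVPsim implementation.

```c
#include <stdint.h>

/* Illustrative sketch of a JIT translation cache mapping target PCs to
 * translated blocks. Structure and names are invented for this example;
 * a real JIT simulator such as OVPsim is far more sophisticated. */

#define TCACHE_SIZE 256

typedef struct {
    uint32_t pc;      /* target PC this entry translates */
    int      valid;
} tcache_entry_t;

static tcache_entry_t tcache[TCACHE_SIZE];
static long translations;  /* blocks translated (slow path, runs once)  */
static long executions;    /* blocks executed (fast path, runs often)   */

/* Fetch a translated block, translating on a cache miss. */
static tcache_entry_t *tcache_fetch(uint32_t pc)
{
    tcache_entry_t *e = &tcache[(pc >> 2) % TCACHE_SIZE];
    if (!e->valid || e->pc != pc) {  /* miss: translate once */
        e->pc = pc;
        e->valid = 1;
        translations++;
    }
    executions++;
    return e;
}

/* Run a 3-block "loop" n times: each block is translated once, then
 * re-executed from the cache -- the source of JIT performance. */
static long run_loop(int n)
{
    for (int i = 0; i < n; i++)
        for (uint32_t pc = 0x1000; pc < 0x1030; pc += 0x10)
            (void)tcache_fetch(pc);
    return executions;
}
```

Iterating the loop 1000 times yields 3000 block executions but only 3 translations, which is why translated simulation approaches native speed for hot code.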

There are also significant libraries of models available, and<br />

it is easier to build instruction accurate models than models<br />

with timing or power consumption information, or real<br />

implementation details. One such library and modeling<br />

technology is available from OVP. The OVP processor model<br />

library includes models of over 200 separate processors (e.g.<br />

Arm, MIPS, Power, Renesas, RISC-V), plus a similar number<br />

of peripheral models. Most of these models are available as<br />

open source. The C APIs for building these models are also<br />

freely available as an open standard from OVP.<br />

IV. INSTRUCTION ACCURATE SIMULATION PLUS<br />

ESTIMATION<br />

Instruction accurate simulation holds the promise of faster<br />

simulation performance to support examination of more system<br />

scenarios, plus lower cost and earlier availability. With the<br />

Imperas APIs and dynamic model introspection it is easy to<br />

add timing and power estimation capabilities to the<br />

instruction accurate simulation environment.<br />

The approach of adding these capabilities as libraries combines<br />

annotation techniques with the binary interception<br />

libraries used with JIT simulation engines. Annotation<br />

techniques can be imagined as a full instruction trace which is<br />

then annotated with the timing or power information.<br />

However, just using annotation requires significant host PC<br />

memory, and can slow the simulation.<br />

Binary interception libraries are used with the Imperas JIT<br />

simulators to enable the non-intrusive addition of tools, such as<br />

code coverage and profiling, to the simulation environment.<br />

Combining these techniques maintains the high simulator<br />

performance with minimal memory costs. This combined<br />

technique is being called Instruction Accurate + Estimation<br />

(IA+E).<br />

In the Imperas simulation products, which require the use<br />

of OVP models, it is possible to create a standalone library<br />

module with entry points that are called when instructions are<br />

executed. This library can introspect the running system and<br />

calculate an estimate for the cycles taken to execute the current<br />

instruction, and can take into account overhead of different<br />

memory and peripheral component latencies. Not only can<br />

these add-on libraries dynamically inspect the running system<br />

and estimate timing effects, they can annotate calculated<br />

instruction cycle timing back into the simulation and affect (i.e.<br />

stretch) timing of the simulation. An overview of the<br />

simulation architecture is shown in Figure 1.<br />

Fig. 1. Overview of the Imperas IA+E simulation environment.<br />

For processors, the instruction estimation algorithm<br />

includes:<br />

• a mixture of table look ups for simple instructions<br />

• dynamic calculations for data dependent instructions<br />

• adjustments due to code branches taken<br />

• taking into account effects of memory and register<br />

accesses<br />

A view of the timing estimation mechanism is shown in<br />

Figure 2.<br />

40


Fig. 2. Simplified view of the timing estimation mechanism.<br />
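The estimation algorithm listed above — table look-ups for simple instructions, dynamic calculations for data-dependent ones, taken-branch adjustments, and memory access effects — can be sketched in C. This is a hand-written illustration: the cycle counts, instruction classes and structure names are invented for the example, not taken from the Imperas library.

```c
#include <stdint.h>

/* Illustrative cycle estimator combining the mechanisms described in the
 * text: a base-cycle table for simple instructions, a dynamic term for
 * data-dependent ones, a taken-branch penalty, and back-annotated memory
 * latency. All cycle numbers are invented for the example. */

typedef enum { OP_ALU, OP_MUL, OP_DIV, OP_LOAD, OP_STORE, OP_BRANCH } opclass_t;

static const unsigned base_cycles[] = {
    [OP_ALU] = 1, [OP_MUL] = 3, [OP_DIV] = 0,   /* DIV computed dynamically */
    [OP_LOAD] = 1, [OP_STORE] = 1, [OP_BRANCH] = 1,
};

typedef struct {
    opclass_t op;
    uint32_t  operand;     /* for data-dependent timing (e.g. the divisor) */
    int       taken;       /* for branches                                 */
    unsigned  mem_latency; /* extra cycles charged by the memory model     */
} insn_info_t;

/* Called once per executed instruction by the interception layer. */
unsigned estimate_cycles(const insn_info_t *ii)
{
    unsigned c = base_cycles[ii->op];            /* simple table look-up */
    if (ii->op == OP_DIV)                        /* data-dependent: one
                                                    cycle per result bit */
        c = 2 + (32 - __builtin_clz(ii->operand | 1));
    if (ii->op == OP_BRANCH && ii->taken)
        c += 2;                                  /* pipeline refill penalty */
    if (ii->op == OP_LOAD || ii->op == OP_STORE)
        c += ii->mem_latency;                    /* back-annotated latency */
    return c;
}
```

Summing these estimates over a run gives the cycle count, which the library can also annotate back into the simulation to stretch simulated time.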

For memory subsystems and peripheral components, table<br />

lookup and dynamic estimation can be made, and timing<br />

back-annotated into the simulation to simulate the delay effects of<br />

slow memories and other components.<br />

With this Instruction Accurate + Estimation (IA+E)<br />

approach, there is a separation of processor model functionality<br />

and timing estimation. This means while building a functional<br />

model there is no need to worry about any timing or cycle<br />

complexity. It is only when more detailed timing is needed<br />

that the extra timing data must be added to enable the<br />

Imperas IA+E timing tools to provide cycle approximate<br />

timing simulation for the RISC-V processors.<br />

This extra timing data is added in two steps. First, the cycle<br />

information is added to the library. Second, the time per cycle,<br />

which is dependent upon the specific semiconductor process<br />

and physical implementation details, is added.<br />
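As a worked illustration of this two-step split, the cycle counts from step one become time once the step-two clock period is supplied. The function and the numbers below are hypothetical examples, not values from the paper.

```c
#include <stdint.h>

/* Step 1 supplies cycle counts from the timing library; step 2 supplies
 * the implementation-specific time per cycle, which depends on process
 * and physical design. Example numbers are hypothetical. */
static uint64_t cycles_to_ns(uint64_t cycles, uint64_t period_ps)
{
    return cycles * period_ps / 1000;  /* picoseconds -> nanoseconds */
}

/* e.g. 1,000,000 cycles at 1 GHz (1000 ps per cycle) -> 1,000,000 ns */
```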

The approach of providing the timing data as a separately<br />

linked dynamic program enables RISC-V processor designers<br />

to create a cycle approximate timing simulation for their<br />

specific processor implementation, without sharing any<br />

internal information.<br />

IA+E simulation runs slower than normal instruction<br />

accurate simulation, with a typical overhead of about 50%.<br />

Still, this puts IA+E simulation<br />

performance at 100-500 MIPS.<br />

IA+E does have some limitations. This technique has<br />

currently been proven only for simple processors with a single<br />

core, no cache, and an in-order pipeline.<br />

V. RESULTS<br />

This IA+E technique was first tested with Arm Cortex-M4<br />

based processors. The results were much better than expected,<br />

with an average estimation error of +/- 5% as compared to the<br />

actual device. The device was an STMicroelectronics<br />

STM32F on a standard development board, running the<br />

FreeRTOS real time operating system, with 39 different<br />

benchmark applications used. Almost all timing estimation<br />

errors were within +/- 10% of actual timing values. Figure 3<br />

shows these results.<br />

Fig. 3. Timing estimation results for IA+E simulation show average errors of<br />

better than +/- 5% over 39 different benchmarks for Arm Cortex-M4.<br />

IA+E was recently extended to support RISC-V processors,<br />

by using publicly available information (from the processor<br />

vendors' data books) to build the cycle data libraries.<br />

In the data below, showing processor implementations from<br />

Andes Technology, Microsemi and SiFive, only the cycle data<br />

is presented, since comparing timing for the various<br />

implementations would not be an accurate comparison. Also,<br />

in keeping with this theme, different benchmark applications<br />

were used for each of the different processors. All benchmarks<br />

were run with a range of compiler optimization settings, and<br />

estimated cycles were reported first assuming 1 cycle per<br />

instruction, i.e. using IA, then using the IA+E technique.<br />

These results are shown in Figs. 4-6.<br />

VI. CONCLUSIONS<br />

The Instruction Accurate + Estimation (IA+E) technique<br />

developed here has shown excellent results for timing<br />

estimation of in-order processors. It also has the benefits of<br />

easy model building, high performance to enable examination<br />

of multiple benchmarks and system scenarios, and lower cost<br />

than other techniques. In this paper, the IA+E technique has<br />

been extended to support RISC-V processors. Further work is<br />

needed to apply this technique to power estimation, and to<br />

more complex processors.<br />

ACKNOWLEDGMENTS<br />

The authors would like to thank Andes Technology,<br />

Microsemi, and SiFive for access to their processor datasheets<br />

and/or databooks.<br />

REFERENCES<br />

[1] Open Virtual Platforms (OVP), www.OVPworld.org<br />

[2] Felipe Da Rosa, Luciano Ost, Ricardo Reis, Gilles Sassatelli.<br />

Instruction-Driven Timing CPU Model for Efficient Embedded Software<br />

Development Using OVP. ICECS: International Conference on<br />

Electronics, Circuits, and Systems, Dec 2013, Abu Dhabi, United Arab<br />

Emirates.<br />

[3] Gem5, www.gem5.org<br />



Fig. 4. IA+E cycle estimation results for the Andes N25 processor.<br />

Fig. 5. IA+E cycle estimation results for the Microsemi Mi-V RV32IMA processor.<br />

Fig. 6. IA+E cycle estimation results for the SiFive E31 processor.<br />



A RISC-V Based Heterogeneous Cluster with<br />

Reconfigurable Accelerator for Energy Efficient<br />

Near-Sensor Data Analytics<br />

Davide Rossi<br />

DEI, University of Bologna<br />

Bologna, Italy<br />

davide.rossi@unibo.it<br />

Abstract- The end-nodes of the IoT require high performance<br />

and energy efficiency to match stringent constraints of complex<br />

near-sensor data analytics algorithms. Processing on multiple<br />

near-threshold processors is an emerging paradigm which<br />

combines the energy efficiency of low-voltage operation with the<br />

performance of parallel execution. In this work, we present a<br />

near-threshold heterogeneous architecture which extends a<br />

RISC-V based parallel processor cluster with a reconfigurable<br />

Integrated Programmable Array (IPA) accelerator. While the<br />

homogeneous cluster delivers high-performance when executing<br />

data-parallel kernels, offloading control-intensive kernels to the<br />

IPA leads to much higher system-level performance and energy<br />

efficiency, thanks to the exploitation of instruction-level<br />

parallelism rather than data-level parallelism. Results show that<br />

the heterogeneous architecture outperforms an 8-core cluster by<br />

up to 4.8x in performance and 4.5x in energy efficiency when<br />

executing a mix of control-intensive and data-intensive kernels<br />

typical of near-sensor data analytics applications.<br />

Keywords-RISC-V processor, parallel architecture, near-threshold<br />

computing, heterogeneous computing, reconfigurable<br />

computing.<br />

I. INTRODUCTION<br />

High performance and extreme energy efficiency are strict<br />

requirements for many deeply embedded near-sensor<br />

processing applications such as wireless sensor networks,<br />

end-nodes of the Internet of Things (IoT) and wearables. One of the<br />

most traditional approaches to improving the energy efficiency of<br />

deeply embedded computing systems is to exploit<br />

architectural heterogeneity by coupling general-purpose<br />

processors with application- or domain-specific accelerators in<br />

a single computing fabric [1][2]. On the other hand, most<br />

recent ultra-low power designs exploit multiple homogeneous<br />

programmable processors operating in near-threshold [3]. Such<br />

an approach, which joins parallelism with low-voltage<br />

computing, is emerging as an attractive way to combine<br />

performance scalability with high energy efficiency.<br />

In this paper, we present a heterogeneous architecture<br />

which integrates a near-threshold tightly-coupled cluster of<br />

processors [3] augmented with the Integrated Programmable<br />

Array (IPA) presented in [4]. This approach joins the<br />

programming legacy of instruction processors with the flexible<br />

performance and efficiency boost of Coarse Grain<br />

Reconfigurable Arrays [4][5] (CGRA). A similar approach was<br />

adopted in [6], which introduced an ultra-low power<br />

heterogeneous system featuring a Single Instruction Multiple<br />

Data (SIMD) CGRA as reconfigurable accelerator for biosignal<br />

analysis. With respect to this domain-specific<br />

architecture, where the computational kernels are mapped<br />

manually on the CGRA, the system proposed in this work is<br />

meant for general-purpose near-sensor data analytics, also<br />

relying on an automated compilation flow that allows<br />

generating the configuration bitstream for the CGRA starting<br />

from a general-purpose ANSI-C code [4].<br />

We synthesized the architecture in a 28nm FD-SOI<br />

technology, and we carried out a quantitative exploration<br />

combining physical synthesis results (i.e. frequency, area, and<br />

power) and benchmarking of a set of signal processing kernels<br />

typical of end-nodes IoT applications. Two interesting findings<br />

of the proposed exploration show that (1) the performance of<br />

the IPA is much less sensitive to memory bandwidth than<br />

parallel processor clusters and that (2) the simpler nature of its<br />

architecture allows the IPA to run twice as fast as the rest of the<br />

system. Exploiting these two features of the architecture, we<br />

show that the heterogeneous cluster achieves significant<br />

performance and energy improvement for both compute and<br />

control intensive benchmarks with respect to the 8 core<br />

homogeneous cluster, achieving up to 4.8x speed-up and up to<br />

4.4x better energy efficiency.<br />

II. HETEROGENEOUS CLUSTER ARCHITECTURE<br />

The proposed heterogeneous cluster architecture is based<br />

on the PULP (Parallel Ultra Low Power) platform [3],<br />

featuring a configurable number of RI5CY processors [7]. The<br />

cores are based on an in-order pipeline with four balanced<br />

stages optimized for energy efficient operation, which share a<br />

multi-banked scratchpad memory through a low-latency<br />

logarithmic interconnect [8]. The original RISC-V ISA is<br />

extended with instructions targeting energy efficient digital<br />

signal processing, such as hardware loops, memory accesses<br />

with automatic pointer increment, SIMD operations, and bit<br />

manipulation instructions. The cores share a latch-based<br />



Fig. 1. Heterogeneous PULP Cluster Architecture.<br />

instruction cache to boost performance and energy-efficiency<br />

over traditional SRAM-based private instruction caches. A<br />

lightweight multichannel DMA optimized for energy-efficient<br />

operation manages data transfers between the L1 memory and<br />

the off-cluster L2 memory. Both the I$ and the DMA<br />

converge on an AXI4 cluster bus connected to dual-clock<br />

FIFOs featuring level shifters, enabling the cluster to operate at<br />

the desired voltage and frequency independently of the rest of<br />

the SoC. A peripheral interconnect connects the processors to<br />

the cluster peripherals such as timers, an event unit, and other<br />

memory mapped peripherals or accelerators integrated into the<br />

cluster, such as the IPA.<br />

The IPA is built around an array of 16 processing elements<br />

(PEs) communicating through a 2D torus interconnect. Each<br />

PE features a 32-bit ALU, supporting a reduced instruction set<br />

that includes arithmetic and logic operations, 16-bit to 32-bit<br />

multiplications and control flow operations such as jumps and<br />

branches. The PEs fetch instructions from the Instruction<br />

Register File (IRF), which stores the program. A Regular<br />

Register File (RRF) stores temporary variables, while a<br />

Constant Register File (CRF) stores immediates. The ALU<br />

features two input operands coming from neighboring PEs or the<br />

internal register files (RRF and CRF). A parametric number of<br />

PEs, defined at design time, can be instrumented with a<br />

load-store unit employing the request-grant protocol of the PULP<br />

logarithmic interconnect [8]. This protocol allows the<br />

integration of the IPA into the heterogeneous cluster just as any<br />

other programmable processor, sharing the same multi-banked<br />

memory. The configuration bitstream for the IPA is generated<br />

automatically by a compilation flow starting from ANSI-C<br />

description of the computational kernels [4]. Since PEs may<br />

not all operate at the same time, to reduce dynamic power<br />

consumption in idle mode, the IPA integrates a tiny Power<br />

Management Unit (PMU) responsible for clock gating PEs<br />

when idle.<br />

The heterogeneous PULP cluster described in this work is<br />

based on 8 RI5CY processors, 64kB of shared data memory,<br />

4kB of shared instruction memory, and is extended with the<br />

Integrated Programmable Array accelerator (Fig. 1). Fig. 2<br />

shows a detailed block diagram of the subsystem including the<br />

Fig. 2. Block Diagram of the IPA subsystem.<br />

IPA array. The configuration bitstream is stored into a global<br />

context memory (GCM) loaded into the IPA PEs through a<br />

dedicated controller (IPAC). The GCM is connected through a<br />

DMA-capable AXI-4 port to the cluster bus, enabling<br />

prefetching of IPA contexts from L2 memory. The GCM is<br />

sized at twice the worst-case size of the IPA configuration<br />

bitstream, allowing a ping-pong buffering policy in which a new<br />

bitstream is fetched from L2 while the current one<br />

is being loaded on the array, completely hiding the<br />

reconfiguration time. A set of memory-mapped control<br />

registers is used to<br />

load a new context to the IPA array, trigger execution of kernels<br />

and synchronize with the programmable processors in the<br />

cluster.<br />
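The ping-pong policy can be sketched as follows. The buffer size, structure and function names are invented for illustration, and the copy loop stands in for the DMA transfer from L2 into the global context memory (GCM).

```c
/* Illustrative ping-pong buffering of IPA configuration contexts: the
 * GCM holds two bitstream slots, so the next context can be prefetched
 * into the idle slot while the other is active on the array. Sizes and
 * names are invented for the example. */

#define CONTEXT_WORDS 64

typedef struct {
    unsigned slot[2][CONTEXT_WORDS];
    int active;  /* slot currently loaded on the array */
} gcm_t;

/* Prefetch the next context into the idle slot (stand-in for the DMA
 * transfer from L2 memory). */
void gcm_prefetch(gcm_t *g, const unsigned *bitstream)
{
    int idle = 1 - g->active;
    for (int i = 0; i < CONTEXT_WORDS; i++)
        g->slot[idle][i] = bitstream[i];
}

/* Swap buffers: the prefetched context becomes active with no wait,
 * hiding the reconfiguration time behind the previous kernel. */
const unsigned *gcm_swap(gcm_t *g)
{
    g->active = 1 - g->active;
    return g->slot[g->active];
}
```

Because the swap is just an index flip, reconfiguration costs no array idle time as long as the prefetch completes before the previous kernel finishes.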

As opposed to many CGRA architectures, the IPA is<br />

capable of accessing a multi-banked shared memory through 8<br />

master ports connected to the low-latency interconnect. This<br />

eases data sharing with the other processors of the cluster,<br />

following the computational model described in [4]. The<br />

optimal number of ports has been chosen to optimize the<br />

tradeoff between the size of the interconnect and the bandwidth<br />

requirements of the IPA. Since the IPA can operate twice as<br />

fast as the processors, we have extended the architecture of the<br />

cluster so that the IPA can run at twice the<br />

frequency of the rest of the cluster. This approach allows<br />

each component in the cluster to operate at its optimal frequency,<br />

without paying the overhead of dual-clock FIFOs, which require a<br />

significant amount of logic and synchronization overhead. On<br />

the contrary, the hardware support for the dual-frequency mode<br />

includes a clock divider to generate the two different<br />

edge-aligned clocks, and two modules needed to adapt the<br />

request-grant protocol of the low-latency interconnect [8] to deal with<br />

the frequency domain crossing.<br />



TABLE I. EXECUTION TIME OF KERNELS RUNNING ON THE HETEROGENEOUS CLUSTER (NS)<br />

Kernel | Single core | Multi core | IPA | Gain<br />

MatMul | 3.3 M | 435 K | 432 K | 1.0x<br />

Conv. | 9.7 M | 1.5 M | 1.5 M | 1.0x<br />

FFT | 767 K | 142 K | 94 K | 1.5x<br />

FIR | 182 K | 33 K | 33 K | 1.0x<br />

Sep. Filter | 39 M | 6.4 M | 6.3 M | 1.0x<br />

Sobel Filter | 117 M | 40 M | 28 M | 1.4x<br />

GCD | 2.9 M | 2.9 M | 610 K | 4.8x<br />

Cordic | 9 K | 7 K | 3.6 K | 1.9x<br />

Manh. Dist. | 244 K | 164 K | 70 K | 2.3x<br />

III. EXPERIMENTAL RESULTS<br />

In this section, we present the implementation and<br />

benchmarking results of the heterogeneous PULP cluster. The<br />

SoC was synthesized with Synopsys Design Compiler<br />

2013.12-SP3 on a STMicroelectronics 28nm UTBB FD-SOI<br />

technology library, while Synopsys PrimePower 2013.12-SP3<br />

was used for timing and power analysis at the supply voltage of<br />

0.6V, 25°C temperature, in typical process conditions. The<br />

benchmarks are implemented in fully portable C, using the<br />

OpenMP programming model to parallelize the applications on<br />

the PULP cluster. The three operating modes considered in<br />

these comparisons are: (a) single-core: running applications on<br />

a single core, (b) multicore: running applications on 8 parallel<br />

cores (c), IPA: running applications in the IPA.<br />

Table I reports the execution time of several near-sensor<br />

processing kernels running on a single-core, on 8 cores and on<br />

the IPA. Compared to single-core execution,<br />

the accelerator achieves a maximum speed-up of 8x. The<br />

performance gain in the accelerator for the compute intensive<br />

kernels like matrix multiplication, convolution, FIR and<br />

separable filters is limited if compared to the performance of<br />

parallel-cores. However, the relatively small performance gain<br />

compared to the parallel cluster is compensated by the gain in<br />

energy efficiency as shown in Table II. The gain in energy<br />

efficiency is mainly due to (i) the simpler nature of the<br />

compute units of the IPA with respect to full processors, (ii)<br />

the smaller number of power-hungry load/store operations, and<br />

(iii) the fine-grained power management architecture that<br />

allows clock gating of the inactive PEs during execution. On the<br />

other hand, a control-intensive kernel like GCD does not<br />

exhibit significant data-level parallelism, hence parallel<br />

execution over multiple cores does not improve performance of the<br />

homogeneous cluster. In contrast, execution on the IPA<br />

improves the performance by almost 5x and energy efficiency<br />

by almost 4.5x, by exploiting instruction-level parallelism<br />

rather than only the data-level parallelism available to the homogeneous<br />

processor cluster. More precisely, although data-parallel<br />

applications can be effectively parallelized on homogeneous<br />

clusters, the exploitation of the IPA results in a more efficient<br />

utilization of the hardware resources for control-intensive<br />

kernels, which otherwise cause a huge performance bottleneck in several<br />

near-sensor analytics applications.<br />

TABLE II. ENERGY OF KERNELS RUNNING ON THE HETEROGENEOUS CLUSTER (µJ)<br />

Kernel | Single-core | Multi-core | IPA | Gain<br />

MatMul | 1.2 | 0.3 | 0.2 | 1.5x<br />

Convolution | 2.8 | 1.1 | 0.65 | 1.7x<br />

FFT | 0.3 | 0.09 | 0.04 | 2.25x<br />

FIR | 0.08 | 0.03 | 0.025 | 1.2x<br />

Sep. Filter | 16.6 | 4.6 | 4.3 | 1.1x<br />

Sobel Filter | 51.5 | 29.5 | 12.7 | 2.3x<br />

GCD | 1.1 | 1.1 | 0.25 | 4.4x<br />

Cordic | 0.004 | 0.003 | 0.001 | 3x<br />

Manh. Dist. | 0.1 | 0.1 | 0.03 | 3.3x<br />

IV. CONCLUSION<br />

In this paper, we present a novel approach towards<br />

heterogeneous computing, augmenting the PULP multi-core<br />

cluster with an ultra-low-power reconfigurable accelerator. The<br />

experiments integrating the IPA in the PULP platform suggest<br />

that architectural heterogeneity is a powerful approach to<br />

improving the energy profile of computing systems. We have<br />

presented three possible executions of the benchmarks on the<br />

IPA-integrated PULP platform. The heterogeneous cluster<br />

achieves up to 4.8x speed-up and up to 4.4x better<br />

energy efficiency with respect to an 8-core homogeneous<br />

cluster.<br />

REFERENCES<br />

[1] F. Conti, A. Marongiu, and L. Benini. Synthesis-friendly techniques for<br />

tightly-coupled integration of hardware accelerators into shared-memory<br />

multi-core clusters. CODES+ISSS ’13, pages 5:1–5:10, Piscataway, NJ,<br />

USA, 2013. IEEE Press.<br />

[2] M. B. Taylor. Is dark silicon useful? harnessing the four horsemen of the<br />

coming dark silicon apocalypse, In Design Automation Conference<br />

(DAC), 2012 49th ACM/EDAC/IEEE , pages 1131–1136. IEEE, 2012.<br />

[3] D. Rossi, A. Pullini, I. Loi, M. Gautschi, F. K. Grkaynak, A. Teman, J.<br />

Constantin, A. Burg, I. Miro-Panades, E. Beign, F. Clermidy, P.<br />

Flatresse, and L. Benini. Energy-efficient near-threshold parallel<br />

computing: The pulpv2 cluster. IEEE Micro, 37(5):20–31, September<br />

2017.<br />

[4] S. Das, K. J. M. Martin, P. Coussy, D. Rossi, and L. Benini. Efficient<br />

mapping of cdfg onto coarse-grained reconfigurable array architectures.<br />

In 2017 22nd Asia and South Pacific Design Automation Conference<br />

(ASP-DAC), pages 127–132, Jan 2017.<br />

[5] B. De Sutter, P. Raghavan, and A. Lambrechts. Coarse-grained<br />

reconfigurable array architectures. In S. S. Bhattacharyya, E. F.<br />

Deprettere, R. Leupers, and J. Takala, editors, Handbook of Signal<br />

Processing Systems, pages 449–484. Springer US, 2010.<br />

[6] L. Duch, S. Basu, R. Braojos, G. Ansaloni, L. Pozzi, and D. Atienza.<br />

Heal-wear: An ultra-low power heterogeneous system for bio-signal<br />

analysis. IEEE Transactions on Circuits and Systems I: Regular Papers,<br />

2017.<br />

[7] M. Gautschi, P. D. Schiavone, A. Traber, I. Loi, A. Pullini, D. Rossi, E.<br />

Flamand, F. K. Grkaynak, and L. Benini. Near-threshold risc-v core with<br />

dsp extensions for scalable iot endpoint devices. IEEE Transactions on<br />

Very Large Scale Integration (VLSI) Systems , PP(99):1–14, 2017.<br />

[8] A. Rahimi, I. Loi, M. R. Kakoee, and L. Benini. A fully-synthesizable<br />

single-cycle interconnection network for shared-l1 processor clusters. In<br />

2011 Design, Automation & Test in Europe , pages 1–6. IEEE, 2011.<br />



OpenWrt 101: How to Build a Linux Embedded<br />

System in Just 30 Minutes<br />

Cesare Garlati<br />

prpl Foundation<br />

Santa Clara, CA USA<br />

cesare@prplFoundation.org<br />

Luka Perkov<br />

Sartura<br />

Zagreb, Croatia<br />

luka.perkov@sartura.hr<br />

Abstract — OpenWrt is the de-facto standard Linux<br />

distribution for embedded devices. Originally developed for<br />

Internet routing devices, such as home gateways and wireless<br />

routers, OpenWrt is now widely used in many applications<br />

ranging from laptops to mobile devices to the ever increasing<br />

number of IoT devices. The core design philosophy of minimal<br />

footprint, broad platform support and ease of customization has<br />

made OpenWrt the option of choice for developers, end users,<br />

and commercial service providers. This paper offers a practical<br />

introduction to OpenWrt: it is the basis for a 30-minute class<br />

teaching how to set up, compile, and run a complete OpenWrt<br />

system on a commercial router of choice.<br />

Keywords—OpenWrt; embedded; Linux; operating system;<br />

software distribution; open source software; router; home gateway;<br />

Wi-Fi; IoT; Internet of Things.<br />

I. INTRODUCTION<br />

OpenWrt is a highly extensible open source GNU/Linux<br />

distribution for embedded devices. Although primarily targeted<br />

to home gateways, OpenWrt runs on wireless routers, pocket<br />

computers, laptops and many classes of IoT devices. Since its<br />

inception in 2007, the goal of the OpenWrt project has been to<br />

provide free open source tools to build and customize firmware<br />

images for numerous embedded platforms. In creating a novel<br />

embedded distribution for networking applications, the<br />

OpenWrt project followed the three main principles of small<br />

footprint, portability and customizability.<br />

These elements have made OpenWrt a desirable choice for<br />

a vast amount of projects and products varying in size and<br />

applications. A 2017 survey presented at OpenWrt/LEDE<br />

Summit 2017 [1] points out that OpenWrt is used across a<br />

broad variety of commercial Wi-Fi routers including TP-Link’s<br />

Archer C7, TL-WR1043ND, TL-WR841ND, TL-WDR3600,<br />

Ubiquiti Networks NanoStation AC and PicoStation, VPN<br />

appliances, IoT development boards, wireless printers, TOR<br />

servers, file and media sharing appliances, mesh network nodes<br />

and even wind turbines.<br />

Table 1 shows a functional comparison between OpenWrt<br />

and two other popular embedded distributions: Yocto and<br />

Buildroot.<br />

TABLE 1: COMPARISON - OPENWRT, BUILDROOT, YOCTO<br />

Component | OpenWrt | Buildroot | Yocto<br />

menuconfig | Kconfig | Kconfig | Kconfig<br />

C libraries | uClibc, glibc, musl | glibc, uClibc-ng, musl | EGLIBC<br />

File Systems | OverlayFS, tmpfs, SquashFS, JFFS2, UBIFS, ext* | cramfs, JFFS2, romfs, cloop, ISO 9660, cpio, UBI, UBIFS, SquashFS, ext* | Btrfs, cpio*, cramfs, ELF, ext*, ISO, JFFS2, multiubi, SquashFS, UBI, UBIFS<br />

Root Necessary | Yes | No | Yes<br />

Init Systems | procd, BusyBox | systemV, BusyBox, systemd | SysVinit, systemd<br />

Package Manager | opkg | - | smart<br />

A. Small footprint<br />

Internet routing devices share many constraints of typical<br />

embedded devices: small processors, tiny memory and low<br />

power. The OpenWrt system architecture is optimized in size to<br />

generate firmware images that fit the limited memory available<br />

in most commercial routers with little or no overhead.<br />

Traditional Linux distributions require large software<br />

libraries that introduce many additional dependencies, such as<br />

C standard library glibc, D-Bus inter-process communications<br />

facilities and heavyweight network-management applications. By<br />
contrast, OpenWrt relies on simpler – and more robust –<br />



general-purpose components such as the musl C standard library<br />
(which replaced uClibc [2]), the ubus RPC daemon, which is similar to D-<br />

Bus but has a more user-friendly API, and an RPC-capable<br />

daemon netifd to manage complex network interface<br />

configurations. Through these and many other lightweight<br />

components, a heavily-stripped OpenWrt build can run even on<br />

devices with 16 MB or even 8 MB of main memory – in fact the<br />

authors were able to experiment with images as small as 4 MB.<br />

Along with preserving a small footprint, OpenWrt also<br />

implements a single configuration and access point philosophy<br />

through UCI, or Unified Configuration Interface. UCI<br />

centralizes and eases the configuration of crucial system<br />

settings and stores them in a single configuration directory<br />

(/etc/config/). In addition, a large number of third-party<br />

libraries have also been made UCI-compatible to facilitate the<br />

management of their configuration files within OpenWrt.<br />
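For illustration, every UCI file shares the same section/option syntax; a minimal, hypothetical /etc/config/network fragment might look like this (interface name and addresses are example values only):<br />

```
config interface 'lan'
        option ifname 'eth0'
        option proto 'static'
        option ipaddr '192.168.1.1'
        option netmask '255.255.255.0'
```

Settings can then be read and changed uniformly from the command line, e.g. with uci set network.lan.ipaddr='192.168.2.1' followed by uci commit network.<br />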

B. Portability<br />

A key element of OpenWrt is the wide support of many<br />

hardware platforms. CPU architectures supported by OpenWrt<br />

include ARM, MIPS, x86, x86_64 and many SoC platforms<br />

produced by Broadcom, Atheros/Qualcomm, Lantiq/Intel,<br />

Marvell and others.<br />

As of this writing, the official OpenWrt Table of Hardware [3]<br />

(Fig 1.) lists 680 devices, including devices from leading<br />

networking device manufacturers such as Zyxel, Netgear,<br />

Linksys, TP-Link, Ubiquiti Networks, D-Link and others.<br />

Fig 1. OpenWrt - Table of Hardware<br />

C. Customizability<br />

OpenWrt firmware images can be obtained in two ways:<br />

either by downloading pre-built firmware through the OpenWrt<br />

build infrastructure – see https://downloads.openwrt.org - or by<br />

configuring and building the image with the help of the<br />

OpenWrt Build System. Pre-built images are easy to obtain and<br />

install and are the option of choice for standard applications<br />

targeting specific devices. On the other hand, images can be<br />

built manually using OpenWrt’s Build System, a set of<br />

Makefiles and patches for generating a cross-compilation<br />

toolchain and a root file system for embedded systems.<br />

Configuring and building firmware images using OpenWrt<br />

Build System provides users with menuconfig, the OpenWrt<br />
Build System's default configuration interface. This user-friendly<br />

application allows the configuration of a wide array of options<br />

including: chipset architecture, router hardware model, root file<br />

system, application packages, and kernel options. The menu-driven<br />
interface allows quick and painless configuration and<br />

generation of the firmware image according to user<br />

requirements.<br />

Once OpenWrt is successfully booted, its package manager<br />

opkg enables users to install and remove several thousand<br />

packages from the OpenWrt package repository. As opposed to<br />

traditional Linux-based firmware relying on read-only file<br />

systems, opkg gives users the possibility to modify the installed<br />

software without rebuilding and reflashing a completely new<br />

image. By using the large amount of available packages, users<br />

can utilize their OpenWrt routers for a wide variety of<br />

networking applications. To name a few, these include setting<br />

up a VPN, installing a BitTorrent client, performing traffic<br />
shaping and quality of service, creating guest networks,<br />

running server software, and many other popular applications<br />

that are not typically bundled with commercial off-the-shelf<br />

devices.<br />
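As an illustration, a typical opkg session on a running device looks like the following (the package name is just an example):<br />

```
# opkg update                  # refresh the package lists
# opkg install luci-app-sqm    # install a package (example name)
# opkg remove luci-app-sqm     # remove it again
```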

II. BUILDING OPENWRT FIRMWARE<br />

To show how to build an OpenWrt firmware image in just<br />

30 minutes, we are going to use a pre-compiled OpenWrt Build<br />

System environment tailored for the Marvell ESPRESSObin<br />

[4], an ARMADA 88F3700 SoC-powered commercial router<br />

designed primarily for computing, storage and networking<br />

applications. The prebuilt environment is available as a<br />

downloadable Docker container based on Ubuntu 16.04. It<br />

allows instant setup and generation of an installable OpenWrt<br />

firmware for the ESPRESSObin board.<br />

First we need to setup the Docker container on the local<br />

machine. The Docker platform runs natively on Linux x86-64,<br />

Linux/ARM and Windows x86-64. The Docker image is pulled<br />

with:<br />

$ docker pull sartura/build_openwrt_ubuntu_16.04:espressobin<br />

The large size of this Docker image (around 12 GB) is to<br />

accommodate the space needed for downloading the OpenWrt<br />

build system, OpenWrt feeds and source packages, and finally<br />

for building (cross-compiling) OpenWrt and generating the<br />

OpenWrt firmware image.<br />

After the image is pulled, a local folder needs to be<br />

prepared where the build artifacts will be copied to:<br />

$ mkdir ~/espressobin<br />

$ chmod 777 ~/espressobin<br />

The downloaded Docker image is run with:<br />

$ docker run -it -v ~/espressobin:/opt/espressobin --name espressobin sartura/build_openwrt_ubuntu_16.04:espressobin<br />

Inside the Docker container, Marvell-specific repositories<br />

required for building OpenWrt firmware are openwrt-dd/ and<br />



openwrt-kernel/, both located under the /home/build directory.<br />
Already-built OpenWrt firmware images for the<br />
ESPRESSObin board are located in the /home/build/openwrt-dd/bin/mvebu64/<br />
directory:<br />

$ cd /home/build/openwrt-dd/<br />

$ ls -1 bin/mvebu64/<br />

armada-3720-community.dtb<br />

openwrt-armada-ESPRESSObin-Image<br />

openwrt-armada-ESPRESSObin-Image-initramfs<br />

openwrt-armada-ESPRESSObin-Image.gz<br />

openwrt-mvebu64-armada-espressobin-rootfs.tar.gz<br />

openwrt-mvebu64-vmlinux.elf<br />

packages<br />

sha256sums<br />

Modifying and rebuilding images simply requires running<br />

standard make commands. Within Docker, remember to issue<br />

all OpenWrt Build System commands as a non-root user and to<br />
issue them in the directory where the OpenWrt sources have<br />
been cloned, in this case /home/build/openwrt-dd. First,<br />

invoke the menuconfig interface with:<br />

$ make menuconfig<br />

This menu-driven interface (Fig 2.) is OpenWrt’s main<br />

configuration interface.<br />

Fig 2. OpenWrt - make menuconfig<br />

Here it is necessary to configure the target system, the<br />

target profile and target images, and lastly to set the path of the<br />

openwrt-kernel directory as an external kernel tree:<br />

Target System ---><br />
    Marvell 64b Boards<br />
Target Profile ---><br />
    ESPRESSObin (Marvell Armada 3700 Community Board)<br />
Target Images ---><br />
    [x] ramdisk ---><br />
        * Root filesystem archives *<br />
        [x] tar.gz<br />
        * Root filesystem images *<br />
        [x] ext4 ---><br />
[x] Advanced configuration options (for developers) ---><br />
    (/home/build/openwrt-kernel) Use external kernel tree<br />

OpenWrt also features a highly modular Web User<br />

Interface called LuCI, which can be enabled by selecting the<br />

luci package:<br />

LuCI ---><br />

1. Collections ---><br />

luci<br />

Once these options are set, save your configuration and exit<br />

the interface. Now issue the rebuild with make:<br />

$ make<br />

Multiple cores can be utilized to speed up the build process:<br />

$ make -j$(($(nproc)+1))<br />

Again, the build artifacts are stored in<br />

/home/build/openwrt-dd/bin/mvebu64/, so copy the needed<br />

contents of this directory (device tree file, OpenWrt image and<br />

root file system) to the local directory and exit the container:<br />

$ cp bin/mvebu64/*armada*<br />

/opt/espressobin/<br />

$ exit<br />

The ESPRESSObin board uses a micro SD card as its<br />

main storage and booting environment. The last step of booting<br />

OpenWrt on ESPRESSObin consists of preparing the micro<br />

SD card, transferring the build files and setting the U-Boot<br />

parameters for the board to boot from micro SD card. After<br />

inserting the microSD card (listed here as /dev/sdX), first clear<br />

everything from it:<br />

$ sudo dd if=/dev/zero of=/dev/sdX bs=1M<br />

count=100<br />

Then create a new partition (sdX1 in our example):<br />

$ (echo n; echo p; echo 1; echo ''; echo<br />

''; echo w) | sudo fdisk /dev/sdX<br />

Followed by formatting this partition as ext4 with:<br />

$ sudo mkfs.ext4 /dev/sdX1<br />

Mount the micro SD card on your Linux machine (e.g.<br />

to /mnt) and change into that directory:<br />

$ sudo mount /dev/sdX1 /mnt<br />

$ cd /mnt<br />



Once here, transfer the necessary ESPRESSObin build files<br />

from the ~/espressobin/ directory. First, extract the root file<br />

system:<br />

$ sudo tar -xzf ~/espressobin/openwrt-mvebu64-armada-espressobin-rootfs.tar.gz -C .<br />

Then create a boot directory where the device tree file and<br />

OpenWrt image will be copied to:<br />

$ sudo mkdir -p boot/<br />

$ sudo cp ~/espressobin/armada-3720-community.dtb boot/<br />
$ sudo cp ~/espressobin/openwrt-armada-ESPRESSObin-Image boot/<br />

Exit the mounted directory and unmount the micro SD card<br />
from the local machine. Plug the micro SD card into the SD card<br />
slot on the ESPRESSObin and connect to the board via a micro<br />
USB cable. Using serial-connection software of choice (e.g.<br />
C-Kermit, Minicom), access the console on the ESPRESSObin.<br />

Once the boot starts, hit any key to stop autoboot and to<br />

access the Marvell U-Boot prompt:<br />

Hit any key to stop autoboot:<br />

Marvell>><br />

Now set the necessary U-Boot parameters for the name and<br />

location of the device tree file and OpenWrt image:<br />

Marvell>> setenv fdt_name 'boot/armada-3720-community.dtb'<br />
Marvell>> setenv image_name 'boot/openwrt-armada-ESPRESSObin-Image'<br />

Finally, set the bootmmc variable which will be used to<br />

boot from the micro SD card, save the defined environment<br />

parameters and boot using run bootmmc:<br />

Marvell>> setenv bootmmc 'mmc dev 0; ext4load mmc 0:1 $kernel_addr $image_name; ext4load mmc 0:1 $fdt_addr $fdt_name; setenv bootargs $console root=/dev/mmcblk0p1 rw rootwait; booti $kernel_addr - $fdt_addr'<br />

Marvell>> save<br />

Marvell>> run bootmmc<br />

OpenWrt should now successfully boot on the<br />

ESPRESSObin. Once connected to the ESPRESSObin, access<br />

LuCI through the browser by typing the IP address of the board<br />

(set to 192.168.1.1 by default) in the URL bar.<br />

III. CONCLUSION<br />

Throughout this brief paper – and the accompanying 30-minute<br />
class – we have shown how to configure, build, install and run<br />

a typical OpenWrt system. The OpenWrt project is supported<br />

by a vibrant global community of open source developers,<br />

industry leaders and non-profit organizations [7]. We invite the<br />

reader to explore the many opportunities to be involved with<br />

OpenWrt to help shape the technology that powers the<br />

embedded devices for the Internet of Things (IoT) and the<br />

smart society of the future.<br />

REFERENCES<br />

[1] OpenWrt/LEDE 2017 Summit Survey: https://openwrtsummit.files.wordpress.com/2017/11/summit-survey-2017.pdf<br />
[2] Transitioning From uClibc to musl for Embedded Development: https://elinux.org/images/e/eb/Transitioning_From_uclibc_to_musl_for_Embedded_Development.pdf<br />
[3] OpenWrt Table of Hardware: https://wiki.openwrt.org/toh/start<br />
[4] ESPRESSObin website: http://espressobin.net/<br />
[5] OpenWrt Wiki: https://wiki.openwrt.org/doc/techref/architecture<br />
[6] OpenWrt website: https://openwrt.org/<br />
[7] prpl Foundation: https://prplfoundation.org/prplwrt/<br />



Live Hacking: Hardware-enforced Virtualization of a<br />

Linux Home Gateway<br />

Michael Hohmuth, Adam Lackorzynski<br />

Kernkonzept GmbH<br />

Dresden, Germany<br />

michael.hohmuth@kernkonzept.com<br />

Cesare Garlati<br />

prpl Foundation<br />

Santa Clara, CA, USA<br />

cesare@prplFoundation.com<br />

Abstract — Trust and security are central to embedded computing<br />

as network devices - such as home gateways - have become<br />

the first line of defense for the IoT devices connected to the smart<br />

home. In this paper, we present a virtualization-based approach<br />

to securing home gateway while preserving functionality and<br />

performance.<br />

Keywords—home gateway; router; virtualization; security; live<br />
hacking; Linux; hypervisor; microkernel; IoT; Internet<br />

I. INTRODUCTION<br />

Trust and security have never been more important to the<br />

embedded computing world, especially when it comes to network<br />

devices, such as home gateways, that are the first line of<br />

defense for the IoT devices connected to the smart home [4]. In<br />

2017, a plethora of incidents confirmed that these devices<br />

are fundamentally broken from a security perspective.<br />

At embedded world 2017, we hosted a successful live<br />

demonstration showing attendees how the prpl Foundation’s<br />

new approach to embedded computing security works in an<br />

industrial Internet scenario – that is, secure remote control of a<br />

robotic arm. We are back this year with a new demonstration<br />

designed to show the application of the new capabilities of the<br />

prplSecurity Framework 2.0 – as implemented in the<br />
open-source L4Re hypervisor – to a different real-world scenario: a<br />

typical Linux-based Internet router, deployed as a home gateway,<br />

that connects home computers, smartphones, IoT devices<br />

and other smart devices to the Internet.<br />

Linux is the dominant operating system for Internet<br />

routing devices. Optimized Linux distributions, like OpenWrt,<br />

add to the vanilla kernel a configuration system, additional<br />

applications including IP-telephony, network-printing services,<br />

VPN, media streaming and a browser-based administrative UI.<br />

Although optimized for minimal system footprint, many components<br />

of the resulting software stack are complex and inevitably<br />

enlarge the attack surface. The Linux kernel alone is<br />

composed of millions of lines of code. And a large part of the<br />

code runs in privileged CPU mode or with elevated OS rights.<br />

This is a major security concern especially because many<br />

home-gateway vendors have shown marginal attention to securing<br />

devices in the field. Availability of security updates is<br />

sporadic and the patching process is not fully automated, as it<br />

typically requires end-user intervention. As a result, home<br />

routers present a large attack surface and many exploitable<br />

vulnerabilities. This puts the security of personal data, smart-home<br />
applications and IoT devices at risk. Given the sheer<br />

number of connected devices, it also represents a great risk for<br />

the Internet infrastructure itself, as shown by recent DDoS<br />
attacks such as those launched by the Mirai botnet.<br />

This unsatisfying state of home-router security has led the<br />

telecom industry to look for solutions that guarantee availability,<br />

security, and remote patching of home routers independently<br />

of the Linux operating system itself. One such approach is<br />

to use a software partition that can restart or even update the<br />

main OS from a clean state, and that is isolated from the main<br />

OS kernel and software. This isolation can be implemented in<br />

hardware using a separate CPU to run the update/restart process,<br />

or more cost-effectively in software using an array of<br />

hardware/software virtualization technologies.<br />

This paper shows the application of a light-weight type-1<br />

hypervisor to isolate the router software, including the Linux<br />

kernel and the user-land applications, into a virtual machine<br />

(VM). The secure update/restart process runs in a separate VM<br />

completely isolated from the rest of the system. Our work is<br />

based on the open-source L4Re hypervisor. This hypervisor<br />

leverages the hardware virtualization support of modern CPUs<br />

to provide isolation and efficiency and, most importantly, the<br />

ability to run unmodified Linux Software.<br />

The live demonstration starts by downloading the necessary<br />

code from various open source repositories. We then configure,<br />

build and install a new hardened firmware image to create<br />

multi-domain security via hardware virtualization. We then<br />

launch in real time several network attacks to exploit known<br />

vulnerabilities of the Linux instance. This shows how the<br />

breach is contained to the target VM, while the system critical<br />

components remain unaffected. This session is aimed at security<br />

architects, penetration testers and anyone who wants to see<br />



how a real-world attack is conducted and how hardware virtualization<br />

can effectively mitigate the overall impact on the<br />

system.<br />

This paper is organized as follows. In Section II, we discuss<br />

virtualization as a security mechanism and introduce virtualization<br />

concepts such as full virtualization,<br />

paravirtualization, and containerization. Section III introduces<br />

the open source L4Re hypervisor. Section IV explains the<br />

home-router setup referenced throughout this paper. Section V<br />

outlines the live-hacking scenario we present during the live<br />

demonstration, before we conclude the paper in Section VI.<br />

II. VIRTUALIZATION AND SECURITY<br />

In general, virtualization is the concept of abstracting away<br />

from the specifics of an underlying hardware mechanism. For<br />

example, most OSes offer virtual memory as an abstraction of<br />

physical memory, providing programs with the illusion of an<br />

abstract computer with a separate, isolated memory space. In<br />

this paper, we use virtualization in a narrower, more specific<br />

sense: as a mechanism for providing virtual machines<br />

(VMs) that provide user software with the illusion of running<br />

on a separate physical computer.<br />

Virtualization can be provided at various levels. The Linux<br />
kernel already comes with several virtualization mechanisms, including<br />

control groups and containers, which provide the illusion<br />

of several isolated Linux instances although all instances share<br />

the same Linux kernel, and the Kernel-based Virtual Machine (KVM),<br />
which provides VMs that look like physical computers and that<br />

run unmodified OS kernels (such as another Linux kernel) as<br />

unprivileged VM guests. These mechanisms all share the property<br />

that all VMs must trust the hosting Linux kernel and<br />

Linux-based OS, which can be undesirable from a security<br />

perspective.<br />

The alternative solution is to deprivilege all Linux instances<br />

by running them on top of a small hypervisor such as the L4Re<br />

hypervisor. Depending on which critical services these Linux<br />

guests provide, it is possible to remove the Linux OS from the<br />

critical trusted path, or trusted computing base (TCB), of a<br />

security-sensitive application. If the hypervisor is small, the<br />

critical application’s TCB can be several orders of magnitude<br />

smaller than the Linux kernel alone.<br />

Full virtualization (allowing unmodified OS kernels to run<br />

in VMs) can greatly benefit from virtualization assists provided<br />

in hardware by the platform’s CPU. Fortunately, modern server,<br />

desktop, and embedded CPUs all provide hardware-assisted<br />

virtualization features (i.e., nested paging, interrupt-controller<br />

virtualization, and I/O-MMUs). Where these hardware features<br />

are not available, either hypervisors need to resort to costly<br />

emulation of VM features that allow unmodified guest OSes to<br />

run; or, guest OSes need to be modified to be able to run as<br />

VM guests on top of the hypervisor. In the latter case, the guest<br />

OS kernel is said to be paravirtualized; of course, this is feasible<br />

only for OSes for which source code is available. There is<br />

also a hybrid approach in which the OS kernel proper runs<br />

unmodified, using hardware-assisted virtualization, but certain<br />

device drivers use hypervisor APIs directly (instead of emulated<br />

device interfaces) for performance reasons. The VirtIO set<br />

of APIs is a well-known example for such a paravirtualized<br />

device API.<br />

Hardware-assisted virtualization and paravirtualization are<br />

conceptually similar when implemented in the same software<br />

layer. Minor differences in complexity and attack surface mostly<br />

stem from additional emulation layers needed to provide the<br />

physical-machine illusion for full, hardware-assisted virtualization.<br />

III. THE L4RE HYPERVISOR AND OPERATING SYSTEM<br />

The L4Re system is a light-weight, microkernel-based, real-time<br />

operating system with support for hardware-assisted<br />

virtualization and paravirtualization [1,2]. The system components<br />

include:<br />

- The L4Re Hypervisor<br />
- The L4Re Runtime Environment, a POSIX-like programming environment for implementing native, trusted L4Re microapps<br />
- A VMM component for hardware-assisted virtualization of unmodified guest OSes (e.g., Linux and FreeRTOS)<br />
- L4Linux, a paravirtualized Linux kernel<br />

The L4Re system supports many platforms including x86,<br />

ARM and MIPS architectures in 32-bit and 64-bit mode.<br />

Hardware-assisted virtualization and device-memory virtualization<br />

(IOMMUs) are also supported if available. Additionally,<br />

experimental support is available for PowerPC and Sparc.<br />

L4Re is easily portable: developing a board-support package<br />

(BSP) for a new platform usually takes a few developer-days.<br />

L4Re is a mature OS that has been in development since<br />

1997. Originally developed at TU Dresden, it has recently<br />

seen broad commercial uptake and support. The L4Re software<br />

is licensed under GNU GPLv2 and it is available for<br />

download at https://www.kernkonzept.com. A dual-licensing<br />

schema is available for commercial applications if desired.<br />

The L4Re system aims at minimizing each application's or<br />

VM's TCB. The hypervisor is a classic L4 microkernel as it<br />

implements only those OS mechanisms that are required to<br />

securely implement isolation (i.e. address spaces/VMs,<br />

threads/virtual CPUs, scheduling/clocks, and inter-process<br />

communication) and leaves implementing all other typical<br />

operating-system services (such as resource or file management)<br />

to user-level programs.<br />

One such user-level component is L4Re's Virtual Machine<br />

Monitor (VMM), which is used to emulate the virtual platform<br />

that is made available to (hardware-assisted) virtual machines.<br />

Components that do not need virtualization do not have to<br />

depend on the VMM which then yields a smaller TCB. In fact,<br />

each VM can have its own (custom) VMM, further reducing<br />
the trust relationships among different, mutually untrusting<br />
VMs.<br />

For virtualization-friendly guest OS kernels such as Linux,<br />

the L4Re system provides a special, tiny VMM called uvmm.<br />

This VMM does little more than providing a boot API for guest<br />

OSes, providing virtual interrupts and CPUs, connecting the<br />



VM to VirtIO-based virtual devices (such as a virtual network<br />

interface), and passing through physical devices the VM is<br />

allowed to access.<br />

Apart from the VMM, the L4Re system provides components<br />

for memory and CPU management, for setting up VM<br />

and application resources such as physical memory and communications<br />

relationships, for bus virtualization and platform<br />

and device management, and for securely multiplexing a GUI.<br />

The loader component can be scripted in the Lua language and<br />

allows static or dynamic device (pass-through), memory, and<br />

communication-relation assignments.<br />

For more information on the L4Re system, please refer to<br />

our EW2016 paper [3].<br />

IV. VIRTUALIZATION OF A LINUX-BASED ROUTER OS<br />

Our demonstration and evaluation vehicle for running a<br />

router OS in a virtual machine has been implemented on the<br />

NXP FRDM platform and uses two hardware-assisted VMs.<br />

NXP’s QorIQ FRDM-LS1021A board uses an LS1021A<br />

SoC that provides two ARM Cortex-A7 cores. These<br />

CPUs provide ARM’s virtualization technology, which allows<br />

using hardware-assisted full virtualization on this board. To<br />

provide Wi-Fi routing capability, we have attached a Wi-Fi<br />

dongle to the board’s USB interface.<br />

Using uvmm, we run the following two VMs on top of the<br />

L4Re hypervisor:<br />

Router OS—This VM runs a copy of OpenWrt with an unmodified<br />

Linux kernel. This VM drives the Wi-Fi device,<br />

which is passed through into this VM, and exposes its configuration<br />

interface over the Wi-Fi interface. As its Internet uplink,<br />

Router OS has a virtual network connection to Telco OS:<br />

Telco OS—This VM runs a simple, FreeRTOS-based system<br />

with two main functions: It has pass-through access to one<br />

Ethernet interface that serves as the uplink port and passes all<br />

traffic on to the Router OS via the virtual-network interface—<br />

except that it accepts commands on a “telco” service it implements<br />

itself. This service is intended for use by the Internet<br />

provider (or telco operator) to trigger reboots of the Router OS<br />

from its boot image, which is invisible to, and unmodifiable by,<br />

the Router OS, and therefore always represents a clean state<br />

from which the first VM can restart. Reboots of the Router-OS<br />

VM do not require a platform reset or reboot. (In the future,<br />

this service may also update the Router OS’s boot image.)<br />

This architecture provides only a minimal attack surface for<br />

the Telco OS because it does not inspect data intended for the<br />

Router OS, and does not implement any application or configuration<br />

services.<br />

This architecture has the property that any compromises of<br />

the Router OS, initiated either externally (from the Internet) or<br />

internally (by a rogue or cracked IoT device) can be undone<br />

from within Telco OS, without having to trust Router OS at all.<br />
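The paper does not publish the telco service's wire protocol; as a purely hypothetical sketch of how such a command interface could be structured, the names handle_command and restart_router_vm below are invented for illustration:<br />

```python
# Hypothetical sketch of the Telco OS command service described above.
# The real protocol and APIs are not published; all names are invented.

def restart_router_vm() -> str:
    """Stub: would ask the hypervisor/VMM to reset the Router-OS VM
    and reload it from the read-only, tamper-proof boot image."""
    return "router-vm: rebooted from clean image"

def handle_command(cmd: str) -> str:
    """Dispatch a single command received on the telco service port."""
    cmd = cmd.strip().upper()
    if cmd == "REBOOT":
        return restart_router_vm()
    if cmd == "PING":
        return "pong"
    # Anything else is rejected, keeping the attack surface minimal.
    return "error: unknown command"

print(handle_command("reboot"))
```

The key design point mirrored here is that the service parses only a handful of fixed commands and never inspects Router-OS traffic, which keeps its attack surface small.<br />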

V. LIVE DEMONSTRATION<br />

In our live demo session, we will run an exploit against a<br />

known bug in OpenWrt.<br />

At first, we will demonstrate how an instance of regular<br />

OpenWrt running natively (without a hypervisor) will become<br />

unresponsive once the attack is performed.<br />

Then, we’ll run OpenWrt in a virtual machine as described<br />

in the preceding section. We will show that attacks on<br />

OpenWrt are still possible, but can be mitigated by the telco by<br />

remotely rebooting OpenWrt from a clean state, and possibly<br />

even updating OpenWrt from Telco OS.<br />

VI. CONCLUSION<br />

The security benefits of virtualization are no longer confined<br />

to big iron datacenter applications. Virtualization can<br />

effectively be implemented in resource-constrained embedded<br />

systems such as home routers. It makes it possible to separate complex<br />

operating system software, such as the Linux-based OpenWrt,<br />

from the trusted computing base in critical applications.<br />

In preparation for the hands-on workshop, please download<br />

the software from https://l4re.org/download.html. The authors<br />

will provide additional download links and instructions for the<br />

demo during the class.<br />

REFERENCES<br />

[1] The L4Re System, https://l4re.org/<br />

[2] A. Lackorzynski and A. Warg. Taming subsystems: Capabilities as Universal Resource Access Control in L4. In Proceedings of the Second Workshop on Isolation and Integration in Embedded Systems (IIES ’09), Eurosys affiliated workshop, pages 25–30. ACM, March 2009. ISBN 978-1-60558-464-5.<br />
[3] M. Röder, M. Hohmuth and A. Lackorzynski. Tux Airborne: Encapsulating Linux — real-time, safety and security with a trusted microhypervisor. In Proceedings of the Embedded World Conference 2016.<br />
[4] Security Guidance for Critical Areas of Embedded Computing – prpl Foundation, January 2016, https://prpl.works/security-guidance/<br />



Achieving Ultra Low Power in Embedded Systems<br />

Understand where your power goes and what you can do to make things better<br />

Herman Roebbers<br />

Embedded Systems<br />

Altran Netherlands B.V.<br />

Eindhoven, The Netherlands<br />

Herman.Roebbers@altran.com<br />

Abstract— Over the course of the last years the need to reduce<br />
energy consumption has grown. This article focuses on the<br />

possibilities for reduction of energy consumption in embedded<br />

systems. We argue that energy consumption is a system issue and<br />

therefore a matter of making compromises. Energy consumption<br />

can be reduced by software, but only so far as hardware allows.<br />

There are many things that can be done to reduce energy<br />

consumption. The goal is to define an approach for achieving less<br />

energy consumption. Also criteria for the selection of an<br />

appropriate MCU are presented. Conclusion: Many (unexpected)<br />

things can have a big impact on your achievable battery lifetime.<br />

Look beyond just the CPU/processor and software in order to<br />

achieve better results.<br />

Keywords— Ultra Low Power; approach; embedded; system<br />

issue; reducing energy consumption<br />

I. INTRODUCTION<br />

In recent years the need to reduce energy consumption has been<br />
growing. On the one hand this is instigated by governments<br />

(e.g. EnergyStar), on the other hand by the need to do more with<br />

the same or less energy (think mobile telephone battery lifetime,<br />

Internet-of-Things node battery lifetime). In this article we will<br />

focus on the backgrounds of energy consumption in embedded<br />

systems and how to reduce this consumption (or its effect). This<br />

article covers a part of a two-day Ultra Low Power workshop<br />

about this subject which is available via the High Tech Institute<br />

(http://www.hightechinstitute.nl), T2prof and Altran.<br />

The fact that energy consumption is an important issue is<br />

illustrated by the fact that chip manufacturers heavily promote<br />
their energy-efficient chips. There are even benchmarks<br />

for the energy efficiency of embedded processors: the EEMBC<br />

ULPMark TM (http://www.eembc.org/ulpbench) CP (Core<br />

Profile) and PP (Peripheral Profile), IoTMark-BLE and the<br />

soon-to-be-released SecureMark.<br />

Energy consumption is an important point in all sorts of systems. It gets more and more important in the IoT world, where the biggest consumer is usually the radio. All sorts of solutions are tried to keep the radio active for as short a time as possible. This has led to non-standard protocols that use much less energy than standard protocols.

It is important to realize that energy consumption is a system issue, and a matter of weighing one thing against another and making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. It is also a multidisciplinary effort, because both the software and the hardware disciplines must be involved in the design in order to achieve the desired goal.

For this article we limit ourselves to smaller embedded systems like sensor nodes. These systems are typically asleep for a large proportion of the time. Depending on what functionality is required during sleep and how fast the system must wake up, the system can sleep lighter or deeper.

There are many measures that can reduce energy consumption. The goal is to define an approach that should lead to less energy consumption. That approach is detailed in this article as well as in the workshop.

II. CATEGORIES OF MECHANISMS FOR ENERGY REDUCTION

The mechanisms for energy reduction fall into three main categories. TABLE 1 lists commonly used mechanisms per category. This list is not exhaustive. Different vendors may use different names for the same mechanism.

A. Software only (includes compiler)

The energy reduction mechanism is implemented solely in the software domain.

B. Software and hardware combined

Hardware and software together implement an energy reduction mechanism.

C. Hardware only

The energy reduction mechanism is implemented at the hardware level.

Each of the hardware mechanisms mentioned in the table below may or may not be available in your system. If the hardware does not support a mechanism, then software cannot use it.



Overview of power management mechanisms: power management works at all of these levels.

TABLE 1. POWER MANAGEMENT MECHANISMS

Level            | Mechanism                                                  | Category (Domain)
Application      | Event driven architecture; use low power modes;            | A (Software)
                 | select radio protocol; ...                                 |
Operating System | Power API; Operating Performance Points API;               | A (Software)
                 | tickless operation                                         |
Driver           | Use DMA; use HW event mechanisms; suspend / resume API     | B (Software & Hardware)
Board            | Power gating via I/O pin; controlling voltage regulator    | B (Software & Hardware)
                 | via I/O pin; controlling device shutdown pins via I/O pin  |
Chip             | Dynamic Voltage and Frequency Scaling; clock frequency     | B (Software & Hardware)
                 | management; power gating; offer low energy modes           |
IP block / chip  | (Automatic) clock gating; clock frequency management;      | C (Hardware)
                 | Dynamic Power Switching; Adaptive Voltage Scaling; Static  |
                 | Leakage Management; Power Gating State Retention           |
IP block / RTL   | Automatic power / clock gating                             | C (Hardware)
Transistor       | Body bias; FinFET; TriGate FET; sub-threshold operation    | C (Hardware)
Substrate        | SOI, FD-SOI                                                | C (Hardware)

III. SIMPLE THINGS TO DO

A. Look at the OS configuration (if there is an OS)

Operating systems use a periodic scheduler invocation ('tick') to check whether the currently executing process is still allowed to use the processor or whether it should be descheduled in favor of some other process. This periodic invocation can take quite some time, and it also happens when no processes are ready for execution. In that case a so-called idle task is executed, which usually consists of a simple while (1) {}; loop, just burning energy.

Some operating systems (e.g. Linux and FreeRTOS) offer what is known as a tickless configuration to make the CPU sleep until either a timer expires or an interrupt occurs. The standard scheduler tick timer (default 100 Hz for Linux versions prior to 3.10) is then no longer necessary. In versions before 3.10 the #define CONFIG_NO_HZ configures this behavior; in later versions it is the #define CONFIG_NO_HZ_IDLE. For FreeRTOS to be used in this way, the #define configUSE_TICKLESS_IDLE must be set. When applicable, this is a very simple way to (possibly substantially) reduce power.
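For FreeRTOS, enabling tickless mode is a build-time setting. A minimal FreeRTOSConfig.h fragment might look as follows (configEXPECTED_IDLE_TIME_BEFORE_SLEEP is optional and shown here with its default value of 2):

```c
/* FreeRTOSConfig.h (fragment): enable the built-in tickless idle mode.
   The port then calls vPortSuppressTicksAndSleep() from the idle task
   instead of letting the tick interrupt fire while nothing is ready. */
#define configUSE_TICKLESS_IDLE                1

/* Only enter tickless sleep when the expected idle time is at least
   this many ticks (optional; 2 is the FreeRTOS default). */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP  2
```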

B. Look at the architecture of the application

If we look at the architecture of the application software we can distinguish two major types: super loop or event driven. The super loop goes around one big loop all of the time, often never sleeping. In order to reduce energy consumption we would like the system to sleep as long as possible between successive passes through the loop. It depends on the application whether sleeping is allowed at all and what the maximum sleeping time can be. It may, however, be quite possible to do some sleeping at the end of the loop without causing any problem, and in doing so save substantial energy.
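As a sketch of this idea, the loop below sleeps whenever no event is pending. The names get_event() and enter_sleep() are placeholders invented for illustration; on real hardware enter_sleep() would map to something like a WFI instruction, but here both are simulated so the sketch is self-contained:

```c
#include <stdbool.h>

/* Simulated environment: three events are pending, then the queue is empty. */
static int sleep_count = 0;
static int events_left = 3;

static bool get_event(void) {
    if (events_left > 0) { events_left--; return true; }
    return false;
}

static void enter_sleep(void) {
    sleep_count++;            /* real code: __WFI() or a low power mode entry */
}

/* Super loop that sleeps at the end of every pass with nothing to do. */
int run_super_loop(int iterations) {
    int handled = 0;
    for (int i = 0; i < iterations; i++) {
        if (get_event())
            handled++;        /* do the work for this event */
        else
            enter_sleep();    /* nothing pending: sleep instead of spinning */
    }
    return handled;
}
```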

IV. APPROACH FOR OBTAINING ULTRA LOW POWER

We will now describe our approach toward achieving ultra-low power in a step-by-step fashion. Basically the strategy is: use the facilities the hardware offers. We can do this in steps, roughly in the order these features were offered over time.

A. In the beginning

In the beginning there was only one bus master in the system: the CPU. It could read data from instruction memory and read from and write data to data memory and peripherals. In order to check for an event the CPU had to resort to polling:

    while (!event_occurred())
    {};

This piece of code keeps the CPU busy, as well as the code memory and the bus. Both the CPU and the code memory (usually flash) are big contributors to the total energy consumption, especially when code memory isn't cached.

B. Phase 2: Introducing Direct Memory Access (DMA)

At some point in time a second bus master is introduced: the DMA unit. After being programmed by the CPU, it is capable of accessing memory and peripherals autonomously. It can also generate an interrupt to the CPU to signal completion of its task, e.g. copying of peripheral data to memory or vice versa. The DMA unit can operate in parallel with the CPU, but they cannot access the bus simultaneously. While the DMA is copying data, the CPU can check a variable in memory for DMA completion. Pseudocode of the Interrupt Service Routine (ISR):

    void ISR_DMA_done(void)
    {
        ... /* clear interrupt */
        ready = true;
    }



The main program:

    volatile bool ready = false;

    setup_peripherals_and_DMA();
    start_DMA();
    while ( ! ready )
    {
        __delay_cycles(CHECK_INTERVAL);
    }

Here we check another variable, but not continuously. The __delay_cycles() function executes NOP instructions during CHECK_INTERVAL. This keeps the data bus free so that the DMA unit isn't hindered by the CPU's data accesses, and so it may complete its assignment quicker. The CPU is still fetching code from instruction memory, though.

C. Stop the CPU clock when possible

A relatively recent addition to the CPU's capabilities is stopping the CPU clock until an interrupt occurs, saving power by doing so. This can be in the form of a WAIT_FOR_INTERRUPT instruction, which removes the clock from the CPU core until an interrupt occurs. ARM CPU cores offer the WFI instruction for this purpose; others, such as the MSP430, set a special bit in the processor status register to achieve the same effect. This does not affect our interrupt service routine. Our main program code changes thus:

    volatile bool ready = false;

    setup_peripherals_and_DMA();
    start_DMA();
    while ( ! ready )
    {
        __WFI(); /* special insn, CPU sleeps */
    }

In the new situation the CPU is stopped by disabling its clock until the interrupt occurs. This saves energy in several ways: the CPU is not active, instruction memory is not read, and both the data bus and the instruction bus are completely available for the DMA unit to use. Most new processors know this trick.

D. Events

Later CPUs have the notion of events, which can also be used to wake the CPU from sleep. This mechanism is quite similar to using the interrupt, except that no ISR gets invoked. This saves some overhead if the ISR didn't have to do anything other than wake the CPU. Using this mechanism requires that the CPU have a wait-for-event instruction. ARM Cortex processors have the WFE instruction; others, such as the MSP430, don't have it.

E. Passing events around: Event router

When this event mechanism is coupled with peripherals that can produce and consume events via some programmable event connection matrix ('event router'), a very powerful system emerges. In the case of the Silabs EFM32 series the mechanism is referred to as the Peripheral Reflex System; Nordic has another name for it. The MSP430 has something a bit simpler than the other two.

This mechanism allows quite complex interactions between peripherals to take place without CPU involvement. This allows the CPU to go into a deeper sleep mode and save more energy. As an example we can configure a system to do the following without any CPU interaction: on a rising edge on a given I/O pin an ADC conversion is started. The conversion-done event triggers the DMA to read the conversion result and store it into memory, incrementing the memory address after each store. After 100 conversions the DMA transfer is done, generating an event to the CPU to start a new acquisition series and to process the buffered data.
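To make the control flow of this chain concrete, it can be modeled in plain C. Everything below is simulated (the function and variable names are ours, and adc_convert() just returns a fixed value); on real silicon the chain is set up once in the event-router and DMA registers and then runs with the CPU asleep:

```c
#include <stdint.h>
#include <stdbool.h>

#define N_SAMPLES 100

static uint16_t buf[N_SAMPLES];       /* DMA destination buffer            */
static int      dma_index = 0;        /* DMA write position                */
static bool     cpu_event = false;    /* set when the CPU should wake up   */

static uint16_t adc_convert(void) { return 42; }  /* stand-in for real ADC */

/* Models the hardware chain: pin edge -> ADC -> DMA store -> CPU event. */
static void on_pin_rising_edge(void) {
    uint16_t sample = adc_convert();  /* the edge event starts a conversion  */
    buf[dma_index++] = sample;        /* 'conversion done' triggers the DMA  */
    if (dma_index == N_SAMPLES) {     /* transfer complete after 100 samples */
        cpu_event = true;             /* event wakes the CPU to process data */
        dma_index = 0;
    }
}

/* Drive n edges through the chain; returns whether the CPU was woken. */
bool simulate_edges(int n) {
    for (int i = 0; i < n; i++)
        on_pin_rising_edge();
    return cpu_event;
}
```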

F. Controlling power modes

The latest ULP processors have a special hardware block that manages the system's energy modes and the transitions between them, combined with managing clocks and power gating peripherals in certain energy modes: the Energy Control Unit in the EFM32, or the Power Management Module in the MSP430, for instance. These blocks can save a lot of time otherwise required to program many registers when going to or coming out of sleep. They can also manage retaining peripheral register content at a retention voltage (lower than the operational voltage), such that the peripheral can immediately resume operation when power is restored. This hardware mechanism is called State Retention Power Gating.

The main program is now:

    setup_hw_for_event_generation();
    configure_sleep();  /* This is the extra */
    start_DMA();
    __WFE();            /* CPU sleeps, low power mode */

Using a deeper sleep can make a difference of more than a factor of a thousand!

We have just seen what stepwise refinements we can implement to reduce energy consumption. Each step can be implemented as a logical successor to the previous one.

V. WHAT TO LOOK FOR WHEN SELECTING AN MCU

There are a number of parameters that one can look at and compare to select the best MCU for the application at hand. Here is one set of parameters:

1) What is the active current (µA/MHz), and at what voltage?
2) What is the performance of the CPU (CoreMark/MHz)?
3) What is the sleep current in each of the low power modes intended to be used?
4) What is the wake-up time from each of these low power modes?
5) What is the power consumption of each of the peripherals used?
6) Which peripherals are available in which low power modes?
7) Can peripherals operate autonomously (e.g. be controlled by a DMA engine)?
8) Is there a hardware event mechanism to orchestrate hardware-based event production and consumption?
9) Do the available low power modes fit well with the application?
10) Are the peripherals designed for ultra low power operation (e.g. Low Energy UART, Low Power Timer)?
11) Can sensors be operated with low energy consumption (e.g. Low Energy sensor interfaces)?
12) Are there "on-demand oscillators"?

The answers to these questions serve as a guide to an informed selection of the MCU type to use for the best performance in the given application. They can be used as input for a power model of the application and, together with a battery model, can help predict the battery/charge lifetime for the application.
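As a sketch of such a power model, the helpers below average the current over a duty cycle and divide the battery capacity by the result. The numbers in the usage note are purely illustrative (a node drawing 5 mA for 2 ms every second, 2 µA asleep, on a 220 mAh cell):

```c
/* First-order battery-life estimate from a duty-cycled current profile.
   Self-discharge, regulator efficiency and temperature effects are
   deliberately ignored in this sketch. */
double avg_current_uA(double active_uA, double t_active_s,
                      double sleep_uA, double period_s) {
    double t_sleep_s = period_s - t_active_s;  /* time asleep per period */
    return (active_uA * t_active_s + sleep_uA * t_sleep_s) / period_s;
}

double battery_life_h(double capacity_uAh, double avg_uA) {
    return capacity_uAh / avg_uA;              /* hours until the cell is empty */
}
```

With the illustrative numbers, avg_current_uA(5000, 0.002, 2, 1) comes out near 12 µA, i.e. roughly two years on a 220 mAh cell; note how the sleep current, not the active burst, dominates the result.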

VI. WHAT ELSE CAN ONE DO?

There are still many more factors that can all play a role in the overall energy consumption. These are factors not obvious to many people, such as:

- Regulator efficiency
- Switching sensors off when not in use: prepare your hardware to be able to do so
- Clocks: how to set them for the lowest energy consumption
- Voltages: lower is better, the fewer the better
- Compiler: can make a 50 % difference
- Compiler settings: can make a 50 % difference
- Where to locate critical code / data
- How to measure the consumption
- I/O pin settings
- Battery properties in relation to the energy consumption profile
- Possibilities to make use of energy harvesting to prolong battery lifetime

During the workshop many of these issues and others will be addressed and illustrated through hands-on sessions.

VII. CONCLUSIONS

Ultra-low power is a system thing. Hardware alone or software alone cannot achieve the lowest consumption. We have shown a stepwise approach to reducing energy consumption. In order to realize the maximum energy reduction one has to understand the details of the hardware and write the software to use the available features. Energy savings can be found in unexpected places. It is possible to reduce consumption by more than a factor of a thousand in certain scenarios.

ACKNOWLEDGMENT

The author wishes to thank Altran for the opportunity to investigate this subject matter, and his colleagues for helpful feedback during the development of the workshop and for reviewing related publications [1].

REFERENCES

[1] H. Roebbers, "Hoe spaar je energie in een embedded systeem?" ("How do you save energy in an embedded system?"), Bits & Chips 08, pp. 34-39, October 2015.



Top Misunderstandings about Functional Safety

Christian Dirmeier, Claudio Gregorio
Rail Automation
TÜV SÜD Rail
Munich, Germany
christian.dirmeier@tuev-sued.de

Abstract—TÜV SÜD has more than 20 years of experience in testing and certifying Functional Safety related systems and components. The presentation summarizes the issues most often experienced during this time due to misunderstandings of key concepts in Functional Safety.

Keywords—functional safety, certification, safety systems

I. MOTIVATION

TÜV SÜD has more than 20 years of experience in testing and certifying Functional Safety related systems and components. During this time our employees have observed recurring issues arising from misunderstandings of some concepts of functional safety. The sections below highlight some of the most interesting errors. The presentation addresses people approaching functional safety for the first time, as well as experienced safety engineers, safety managers and project leaders who are familiar with Functional Safety topics and would like to see some unconventional aspects of functional safety.

II. PFH/PFD: NECESSARY BUT NOT SUFFICIENT FOR A SIL

Often manufacturers just calculate a PFD/PFH value for their system or subsystem and afterwards claim a SIL for it. The PFH/PFD express the probability of a dangerous failure of a safety related system (or subsystem) per hour (PFH) or on demand (PFD). Both values address random hardware faults and are usually calculated with the use of FMEDAs. Fulfilling a specific Safety Integrity Level (SIL) requires not only the control of random hardware failures but also the avoidance and control of systematic failures in hardware and software. The latter is expressed as Systematic Capability (SC, values from 1 to 4, corresponding to the four SIL values) and reflects the methods and techniques used during development of the safety related system. Therefore a SIL always consists of both: a PFD/PFH and a determination of the robustness of the development process, i.e. the SC.
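For the random-hardware half of this picture, a widely used simplified formula for a 1oo1 (single-channel) architecture in low-demand mode is PFDavg ≈ λ_DU · T_proof / 2. The sketch below applies it with illustrative numbers; as this section stresses, landing in a SIL band this way is necessary but not sufficient, since the Systematic Capability must be demonstrated separately:

```c
/* PFDavg for a 1oo1 architecture (simplified IEC 61508 formula):
   lambda_du_per_h = dangerous undetected failure rate per hour,
   t_proof_h       = proof-test interval in hours. */
double pfd_avg_1oo1(double lambda_du_per_h, double t_proof_h) {
    return lambda_du_per_h * t_proof_h / 2.0;
}

/* IEC 61508 low-demand SIL bands: SIL n covers 10^-(n+1) <= PFDavg < 10^-n.
   Returns 0 above the SIL 1 band; values below 1e-5 are reported as 4 here. */
int sil_band_low_demand(double pfd) {
    if (pfd >= 1e-1) return 0;
    if (pfd >= 1e-2) return 1;
    if (pfd >= 1e-3) return 2;
    if (pfd >= 1e-4) return 3;
    return 4;
}
```

For example, λ_DU = 1e-7/h with a yearly proof test (8760 h) gives PFDavg ≈ 4.4e-4, which falls in the SIL 3 band; without SC 3 the subsystem still cannot claim SIL 3.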

III. SIL DOES NOT MEAN RELIABILITY OF THE CONTROL SYSTEM

Sometimes system integrators and plant manufacturers require their suppliers to deliver control systems for normal operation with a SIL, assuming that this will ensure a certain reliability of the control system and/or utility. The aim of a safety function (which is performed by a safety related system) is to put an Equipment Under Control (EUC) into a safe state.

The safe state of an EUC is a result of the hazard and risk analysis and depends on its different operational modes. In this context we frequently observe misunderstandings about strategies and concepts regarding fail-safe scenarios (e.g. shutting down the EUC in case of a failure) and fail-operational scenarios (e.g. keeping the EUC in operation as much as possible).

Random hardware failures (and reliability) are calculated based on failure rates. In terms of Functional Safety, failure rates are split into safe and dangerous failures. Only dangerous failures (which prevent the safety function from performing as intended) are considered in the calculation of the PFD/PFH values. A SIL therefore is only a degree of reliability that the safety function will perform as intended when it is required to put the EUC into a safe state.

IV. WATCHDOGS AND MICROCONTROLLERS

Microcontroller (µC) watchdogs often just reset the controller but do not control any outputs in a direct, independent way. In case of a fault inside the µC, deterministic behavior of the outputs is required. It is usually not possible to prove that a defective microcontroller can trigger a watchdog in a correct way.

V. PROVEN IN USE SOFTWARE

We still observe approaches claiming systematic capability for existing SW based on operational experience (Route 2S in IEC 61508). Since IEC TS 61508-3-1, all individual failures of the SW need to be detected and reported during the observation period, and all combinations of input data, sequences of execution and timing relations must also be documented. This approach is usually not possible from a practical point of view.



Verification of Memory Interferences in Automotive Software: A Practical Approach

Ludovic Pintard, Abdelillah Ymlahi-Ouazzani
VALEO
GEEDS Safety Department
Créteil, France
[name].[surname]@valeo.com

Abstract— Freedom From Interferences (FFI) is a main concern in software safety. According to the ISO26262 standard on automotive functional safety, FFI means that a fault in a software component shall not propagate into a more critical one. Many projects involve mixed-criticality architectures, i.e., they run applications with different Automotive Safety Integrity Levels (ASIL) on the same microcontroller.

While architectural solutions are known to ensure FFI – using software partitioning, hardware memory protection, or other safety mechanisms – verification and debugging can be difficult. Indeed, the implementation of a Memory Protection Unit (MPU) often reveals the weaknesses of the design, and it is time consuming to understand all the exceptions late in the process.

This paper discusses how the design of a complex application can be verified with regard to memory interferences, and illustrates it with a case study on an Advanced Driving Assistance System. The focus is on the process to verify FFI at different steps of the development cycle and how to improve it using tooling.

The results obtained on the project have demonstrated that memory interferences can be efficiently detected in the early phase of architectural design.

Keywords— Functional Safety; Freedom From Interferences; Mixed Criticality; Memory Interferences; ISO26262

INTRODUCTION

With the introduction of new Advanced Driver-Assistance Systems (ADAS) and autonomous vehicles, the complexity of automotive systems has increased. In automotive software development, safety is a critical objective, and the introduction of the ISO26262 standard for functional safety of road vehicles has helped the industry to adopt a common state of the art and improve practices.

One of the important properties for ensuring safety at the software architectural level is Freedom From Interference (FFI). This property is important in automotive systems as it enables developing software with mixed ASILs on one microcontroller, instead of having a monolithic ASIL solution. Hence, the effort is put on the development of the modules with a direct impact on safety requirements, and not on all the modules.

However, even if solutions are now known to ensure FFI – such as safety analyses, safety mechanisms (software partitioning, Memory Protection Unit (MPU), OS timing protection, watchdog, etc.) and testing with fault injection – these solutions lead to big regression loops as errors are found during testing.

The contribution of the paper is to propose an efficient way to implement and verify memory interferences on a project, with the help of the TASKING Safety Checker tool [5]. This approach is exemplified with the results of an evaluation done by Valeo.

Section I introduces fundamental notions of ISO26262 and FFI. Section II describes the current state of the art of the software process in the automotive industry. Section III points out drawbacks identified in the process described in Section II, and then describes a more robust process. Section IV gives the results of our experimentation with the TASKING Safety Checker tool on several internal projects. Finally, Section V is a summary of the strengths and limitations of the approach.

I. BACKGROUND

A. ISO26262

The ISO26262 standard for functional safety in road vehicles, introduced in 2011, has helped the industry by presenting state-of-the-art methods in development as opposed to current practices. ISO26262 pushes recommendations toward methods and techniques to ensure that "no unreasonable risk is due to hazards caused by malfunctioning behavior of electrical and electronic systems". The standard has 10 parts, but we focus on Part 6: "product development at the software level". The standard follows the well-known V model for engineering.

B. Freedom From Interferences

With the increase in processing capabilities of the microcontrollers used for automotive systems, software designs now integrate applications with different criticality levels, i.e., different Automotive Safety Integrity Levels (ASIL).



According to ISO26262, one method consists in developing all the modules at the highest ASIL. Hence, lower-ASIL applications would also have to be developed with a higher effort, as higher-ASIL modules require applying more techniques and methods.

The alternative method is to integrate modules with different ASILs, but this requires that the Freedom From Interferences (FFI) is ensured. FFI is defined as the "absence of cascading failures between two or more elements that could lead to the violation of a safety requirement". If a lower-ASIL component fails, the failure should not propagate to a higher-ASIL component and make it fail. As an example, in a software design, a Quality Management (QM) module can be integrated with ASIL-C software modules only if it can be proven that it does not interfere with the ASIL-C modules through any source like:

- Memory interferences: these correspond to corruption of content, or read/write access to memory allocated to another element.
- Exchange of information interferences: these are errors between a sender and a receiver caused by repetition of information, loss of information, delay of information, insertion of information, blocking of a communication channel, etc.
- Timing and execution interferences: these occur during runtime if a safety-relevant software element is blocked by another software element, or the system is in a deadlock, livelock, etc.

II. CURRENT SOFTWARE PROCESS AND ARCHITECTURE TO ENSURE SAFETY

In this paper, we take as a hypothesis that we are developing a system with safety-critical requirements at the software level and mixed-ASIL software modules. In this section, we describe the different activities performed in the software development process.

A. Preliminary Software Architecture

1) Software Requirements

To design the preliminary software architecture, the process starts with the allocation of system requirements to the software and hardware levels. Both functional and safety requirements are used to define a first allocation and definition of the components needed to implement the functionalities. Today, most automotive projects follow the AUTOSAR architecture depicted hereafter.

2) AUTOSAR: AUTomotive Open System ARchitecture

AUTOSAR [2] is a standard for automotive E/E software architecture developed by major OEMs and suppliers. It is a major enhancement to software development in the automotive industry. AUTOSAR brings the realization of an application-specific approach to automotive software development, as opposed to an ECU-specific one. The AUTOSAR architecture mainly encompasses an application layer (comprising Software Components (SWC)), a Run-Time Environment (RTE) and the Basic Software (BSW).

B. Software Safety Analyses

Based on the Technical Safety Concept (TSC) produced at system level and on the preliminary software architecture, it is possible to start software safety analyses, such as:

- Software Fault Tree Analysis (SwFTA)
- Software Failure Mode and Effect Analysis (SwFMEA)
- Software Critical Path Analysis (SwCPA)

These analyses aim at determining how faults propagate through the software architecture, in order to allocate an ASIL to each software component in the critical path. With this allocation of critical modules, the safety mechanisms to ensure FFI can be defined. The state-of-the-art measures to ensure this property are to implement:

- End-to-End protections, in order to protect sender/receiver communications, with Cyclic Redundancy Checks (CRC), timeouts, counters, etc.
- Timing protections from the OS: task and interrupt execution budget protection, etc. Timing protection for FFI can also be performed with the AUTOSAR Watchdog Manager (WdgM) module. The WdgM module [4] is a key SW module in AUTOSAR-based architectures to ensure the application works safely, detecting violations of timing and logical constraints. The WdgM is part of the System Services layer and is responsible for error detection, isolation and recovery. It provides three supervision mechanisms: alive supervision, deadline supervision, and control flow supervision.
- Software partitioning and use of a Memory Protection Unit on the microcontroller, in order to mitigate memory interferences.
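As an illustration of the end-to-end idea, the sketch below guards a 2-byte payload with a rolling counter and a CRC-8 over the SAE J1850 polynomial 0x1D. This is our own minimal example, not the actual AUTOSAR E2E profile, whose frame layouts and checks are more elaborate:

```c
#include <stdint.h>

/* Illustrative E2E-style frame: a counter detects repetition/loss,
   the CRC detects corruption of the counter or payload. */
typedef struct {
    uint8_t counter;
    uint8_t crc;
    uint8_t data[2];
} e2e_frame;

/* CRC-8 with the SAE J1850 polynomial 0x1D, init value 0xFF. */
static uint8_t crc8(const uint8_t *p, int n) {
    uint8_t crc = 0xFF;
    for (int i = 0; i < n; i++) {
        crc ^= p[i];
        for (int b = 0; b < 8; b++)
            crc = (uint8_t)((crc & 0x80) ? (crc << 1) ^ 0x1D : crc << 1);
    }
    return crc;
}

/* Sender side: stamp the frame with a checksum over counter + payload. */
void e2e_protect(e2e_frame *f) {
    uint8_t buf[3] = { f->counter, f->data[0], f->data[1] };
    f->crc = crc8(buf, 3);
}

/* Receiver side: the checksum must match and the counter must be the
   one expected next, otherwise the frame is corrupted or repeated. */
int e2e_check(const e2e_frame *f, uint8_t expected_counter) {
    uint8_t buf[3] = { f->counter, f->data[0], f->data[1] };
    return f->crc == crc8(buf, 3) && f->counter == expected_counter;
}
```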

C. Software Partitioning and Memory Protection Unit

In order to implement the partitioning (see Fig. 1), the following AUTOSAR concepts and modules are used.

OS-Application: The AUTOSAR OS offers the possibility to group different OS objects (tasks, ISRs, alarms, schedule tables, counters, etc.) into so-called OS-Applications. All objects within one OS-Application share the same memory protection scheme and access rights. According to the AUTOSAR OS specifications [3], OS-Applications can either be trusted or non-trusted. Trusted OS-Applications are allowed to run in CPU Supervisor Mode without restrictions, while non-trusted ones run in CPU User Mode with limited access to OS and HW resources.



[Figure: AUTOSAR software stacks of a HeadLights Switch ECU, showing ASIL B modules (Light Switch and HeadLights runnables, I/O and CAN communication stacks, E2E, OS) separated from QM modules (Dashboard) by partitioning, with a shared memory area between the partitions.]

Fig. 1. Software Partitioning

MMU/MPU: The basic memory protection requirement that shall be fulfilled by the OS is to protect the data, code and stack sections of each OS-Application. In the AUTOSAR OS standard, this protection is activated during the execution of non-trusted OS-Applications in order to prevent corruption of the trusted OS-Applications' memory sections. Moreover, it can also be used to protect private data and stack within the same OS-Application if necessary.<br />
Memory protection requires hardware support in the microcontroller in the form of a Memory Protection Unit (MPU) and/or a Memory Management Unit (MMU).<br />
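The principle can be sketched in C (the types and names below are illustrative, not the AUTOSAR OS API): each OS-Application is granted a set of memory regions with associated rights, and any access outside them would raise an MPU exception.<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Access rights encoded as bit flags, as on a typical MPU. */
typedef enum { MPU_R = 1u, MPU_W = 2u, MPU_X = 4u } MpuRights;

/* One protection region: [start, end) with a rights mask. */
typedef struct {
    uintptr_t start;
    uintptr_t end;
    uint8_t   rights;
} MpuRegion;

/* Returns true if the access is covered by at least one region
 * granted to the currently running OS-Application. */
bool mpu_access_allowed(const MpuRegion *regions, size_t n,
                        uintptr_t addr, uint8_t requested)
{
    for (size_t i = 0; i < n; ++i) {
        if (addr >= regions[i].start && addr < regions[i].end &&
            (regions[i].rights & requested) == requested) {
            return true;
        }
    }
    return false; /* no region matches: the MPU would raise an exception */
}
```

On a real microcontroller these regions live in the MPU descriptor registers and the check is done in hardware; the sketch only illustrates the semantics.<br />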

AUTOSAR Inter OS-Application Communicator (IOC): The communication between two OS-Applications must also be protected. Since OS-Applications create memory protection boundaries, dedicated communication mechanisms are needed to cross them. This feature is implemented in the AUTOSAR OS and called the IOC. It is the dedicated communication means between OS-Applications, whether or not they are allocated to the same core (the communication can take place between two OS-Applications on the same core, or between OS-Applications allocated to different cores in multicore architectures). Its main function is to ensure the integrity of the messages transmitted via a buffer. These messages can be data structures or notifications (activation of a task, a callback, etc.).<br />
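As an illustration, a minimal last-is-best channel of this kind can be modeled as follows (the names and buffer layout are ours; the real IOC API is generated from the AUTOSAR configuration):<br />

```c
#include <stdbool.h>
#include <stdint.h>
#include <string.h>

/* One last-is-best channel: a buffer owned by the trusted side
 * that both OS-Applications reach only through these two calls. */
typedef struct {
    uint8_t data[8];
    uint8_t len;
    bool    updated;
} IocChannel;

bool Ioc_Send(IocChannel *ch, const void *msg, uint8_t len)
{
    if (len > sizeof ch->data) return false; /* length check preserves integrity */
    memcpy(ch->data, msg, len);
    ch->len = len;
    ch->updated = true;  /* a real IOC could also activate a task or callback */
    return true;
}

bool Ioc_Receive(IocChannel *ch, void *msg, uint8_t len)
{
    if (!ch->updated || len < ch->len) return false;
    memcpy(msg, ch->data, ch->len);
    ch->updated = false;
    return true;
}
```

Because only these two functions touch the buffer, the lower-ASIL partition never obtains a raw pointer into the higher-ASIL partition's memory.<br />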

D. Verification and Validation<br />

The final part of the process is the verification and<br />

validation testing.<br />

It is important to note that this whole process is incremental: if errors are detected in one of the activities, the previous ones may be impacted. Indeed, testing may reveal that a software module should be redesigned, or reallocated and designed with a higher ASIL.<br />

Fig. 2. MPU Functional Description<br />

III. IMPROVEMENT OF SOFTWARE PROCESS<br />

In the previous section, we described the automotive software development process for the implementation of memory protection. In this section, we explain the limitations and difficulties linked to this process and propose solutions.<br />

A. Identification of Gaps in the Development of Memory Protection<br />

This process is implemented on most automotive projects deployed today. However, it often requires several iterations to obtain a final, stable version of the software.<br />



Indeed, the development of such systems leads to the following observations:<br />
• The preliminary software architecture is a starting input for software safety, but it often needs modification; therefore, software safety analysis is performed directly on the code to save time.<br />

In many projects, it is a challenge to keep the software safety design completely in line with the actual code implementation. As the code best reflects the requirements, the software safety engineer performs the analysis directly on the code. This makes the analysis difficult to perform systematically, because the code alone is more complex. Hence, the main problem is to find a systematic way to perform the analysis.<br />

• Software safety analyses are performed by hand.<br />

Today, no tools are widely adopted by safety practitioners. Even if internal tools have been developed based on company know-how and chosen solutions, there is no known alternative to performing the software safety analysis by hand.<br />

• Verifying that the code ensures FFI before integration tests is very difficult.<br />

Once the safety analysis has been performed and a first version of the code is available, it is not easy to verify by hand that the code fulfills the FFI requirements. Indeed, the call graph of every lower-ASIL function should be inspected to verify that it cannot corrupt the data of a higher-ASIL module.<br />
Hence, we rely either on the expertise of the analyst each time the analysis is performed, or on late integration tests that raise MPU exceptions due to architectural faults.<br />

• MPU integration leads to a lot of debugging and may require new architectural design.<br />

Anybody who has integrated an MPU in a project knows that it will expose remaining architectural problems and can take a long time to debug. It may also require significant modifications to the software architecture.<br />

• MPU integration needs high testing coverage.<br />

In particular, for the verification of memory interferences, the objective is to perform complete fault injection tests to verify that the MPU protects the data of the higher-ASIL partition from corruption by lower-ASIL modules. Indeed, in most implementations the MPU description registers are configured dynamically by the OS on context switches (whenever a new task or interrupt starts, the MPU configuration is changed).<br />

To improve testing coverage, the objective is to verify in each context (execution of each task or interrupt) that the correct access rights are activated. In particular, for the verification of FFI, a fault injection test makes a task in a lower-ASIL partition attempt to write into the higher-ASIL data sections and checks that the write is blocked (see Fig. 3).<br />

Fig. 3. Fault Injection tests to verify FFI<br />
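The principle of such a fault injection step can be sketched as follows (a host-side simulation with illustrative names; on the target, the write would be trapped by a real MPU exception instead of the software check):<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Simulated MPU state for the running context: the QM task's
 * writable window. A real test would program the MPU and expect
 * a hardware exception instead of setting a flag. */
typedef struct { uintptr_t w_start, w_end; bool violation; } Context;

/* Attempted write: allowed inside the window, trapped outside. */
void try_write(Context *ctx, uintptr_t addr)
{
    if (addr < ctx->w_start || addr >= ctx->w_end)
        ctx->violation = true; /* the MPU exception handler would run here */
}

/* Fault injection step: from the QM context, write into the
 * ASIL section and check that the protection actually fired. */
bool inject_write_to_asil(Context *qm_ctx, uintptr_t asil_addr)
{
    qm_ctx->violation = false;
    try_write(qm_ctx, asil_addr);
    return qm_ctx->violation;  /* the test passes only if trapped */
}
```

Running one such injection per task and interrupt context gives the coverage of the dynamic MPU reconfiguration described above.<br />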

B. Importance of Tooling<br />

In automotive software, another challenge comes from the number of parties involved in the development of one Electronic Control Unit (ECU). Indeed, the development requires interactions between OEMs (Renault, PSA, Daimler, GM, BMW, etc.), Tier 1 suppliers (Bosch, Delphi, Continental, Valeo, etc.), hardware suppliers that provide the microcontroller and the Microcontroller Abstraction Layer (MCAL) (NXP, Renesas, Infineon, etc.), and the suppliers of the real-time operating system (ETAS, Vector, Elektrobit, etc.). As different suppliers develop parts of the final code, it is a challenge to integrate everything and ensure FFI.<br />

Also, new hardware technologies are based on complex heterogeneous multicore architectures, which makes it even more complicated to implement and verify that the correct access rights are granted for the execution of the software (different shared memory regions, cores shared between different applications). All this may introduce even more memory interferences.<br />

One of the major breakthroughs in software development was the introduction of automatic static code analyzers to improve code robustness and quality. These tools check rules from standards such as MISRA C [6]; Polyspace, for example, also verifies dynamic properties such as stack usage.<br />

Hence, the development of such tools for software safety purposes is needed.<br />

C. TASKING Safety Checker<br />

The TASKING Safety Checker [5] is an automated analysis tool that statically analyzes the components' source code for FFI verification and for access violations that would trigger MPU exceptions. Indeed, it detects wrong accesses as if the code were executed with a well-configured MPU.<br />

The Safety Checker framework is similar to a compiler environment, which eases the integration of the tool into any project. As input, the analyst provides, on the one hand, all the sources of the project (C files and header files) and, on the other hand, a file describing the allocation of all the files to different classes. The definition of the classes is up to the analyst, who can create the ones needed; as a starting point for the analysis, they reflect the ASILs of the modules of the project: QM, ASIL_A, ASIL_B, ASIL_C, ASIL_D.<br />

The definition of such classes then makes it possible to define the access rights between two classes, as on an MPU when you restrict access to a region. Hence, you have to define the allowed Read, Write, and Execute accesses from one class to another.<br />
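Conceptually, this configuration is an access-rights matrix between classes, as in the following sketch (illustrative C; the Safety Checker's actual configuration syntax differs):<br />

```c
#include <stdbool.h>

typedef enum { CLS_QM, CLS_ASIL_A, CLS_ASIL_B, CLS_COUNT } SafetyClass;
typedef enum { ACC_R = 1, ACC_W = 2, ACC_X = 4 } Access;

/* rights[from][to]: what code in class `from` may do with objects
 * allocated to class `to`. Writes up the ASIL ladder are forbidden;
 * reads down the ladder are allowed but must be justified. */
static const unsigned char rights[CLS_COUNT][CLS_COUNT] = {
    /* to:          QM                   ASIL_A               ASIL_B            */
    /* QM     */ { ACC_R | ACC_W | ACC_X, ACC_X,               ACC_X },
    /* ASIL_A */ { ACC_R | ACC_X,         ACC_R | ACC_W | ACC_X, ACC_X },
    /* ASIL_B */ { ACC_R | ACC_X,         ACC_R | ACC_X,       ACC_R | ACC_W | ACC_X },
};

bool access_allowed(SafetyClass from, SafetyClass to, Access a)
{
    return (rights[from][to] & (unsigned)a) == (unsigned)a;
}
```

Every access found in the call graph is then checked against this matrix, and each entry that is denied corresponds to a reported violation.<br />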

The last configuration step of the tool is the allocation of functions, global variables, and local variables to the classes. This is done by assigning C files, or parts of them, to a class, much like a linker script does in a build process. The same allocation can also be performed on static memory regions by giving a start and end address and assigning the region to a class.<br />

The complete process of the tool is shown in Fig. 5.<br />

The Safety Checker can then be run and produces the following outputs:<br />
• The call graph of all functions, with the variables or addresses accessed.<br />
• The list of all read/write/execute violations found by the tool, based on the configuration.<br />

D. Process for Robust Implementation<br />

Consequently, to improve our software development process for memory protection integration, the following activities are recommended (see Fig. 4).<br />

In the early phases of development of a trial project, once the safety analyses had been performed and a first implementation of the code was available, we decided to perform a first-round analysis with the Safety Checker in order to verify the architecture that had been designed. The tool first checks the ASIL allocation of the software modules and then verifies the interferences. In addition, it can help adjust the MPU configuration for certain sections of the code by helping to decide whether the Read/Write/Execute rights of the current implementation are suitable with regard to the safety requirements.<br />

Moreover, we decided to use the tool on every new software release of this project to verify that the code modifications did not introduce any new interference. It should be highlighted that, from a quality perspective, it is important to verify each version of the software so as to cover the maximum number of systematic faults.<br />

[Fig. 4 depicts the proposed flow: Preliminary Software Architecture → Software Safety Analyses → Verification of ASIL allocation → Software Partitioning & MPU Implementation → Verification of Implementation on each release → Verification & Validation → MPU testing with Fault Injection. The legend distinguishes standard software development activities, new activities with the Safety Checker, and new fault injection activities.]<br />
Fig. 4. Proposed Development Process for Memory Protection<br />

The main objective is to start this activity as soon as possible in order to detect errors early. The analysis can then be rerun more quickly on new versions, and the new results can be analyzed more easily by focusing on new findings.<br />

In spite of all these features, using the tool does not replace the integration of an MPU to catch interferences at run time. When an MPU is implemented, the protection of the memory sections must still be tested via fault injection. These tests should also make it possible to evaluate the error detection time, the error reaction time, and the reaction mechanisms.<br />

IV. EXPERIENCE RESULTS ON VALEO PROJECTS<br />

We started integrating this new process nine months ago on several projects.<br />

A. Characteristics of Targeted Projects<br />

We decided to target mainly complex ADAS projects, such as automatic parking assistance and front camera systems. These projects are mostly based on AUTOSAR; we also evaluated our approach on a non-AUTOSAR project. The highest ASIL of these projects at software level is ASIL B. One of the projects has mixed ASIL requirements, with three parts (ASIL A, ASIL B and QM), while the other projects are a mix of ASIL B and QM software modules.<br />

Fig. 5. TASKING Safety Checker Process<br />



The different projects where this process has been applied are composed of around 100 software modules (operating system, basic software, and application layer) comprising around 300 to 600 C files.<br />

The chosen projects also target several microcontrollers representative of the automotive domain: PowerPC microcontrollers from STMicroelectronics or the RH850 from Renesas. There are also different build environments, with Green Hills or Wind River toolchains.<br />

B. Tool Experimentation<br />

The first evaluation criterion for the Safety Checker, in the described process, is the configuration time. After a first integration on a mockup project to understand the tool, the different projects have shown that about one week is needed to configure the tool on a new project.<br />

This step has two challenges:<br />
• The allocation of all variables and functions to the correct class is done manually; thus, the allocation of hundreds of C files has to be done by hand.<br />
• The tool does not analyze assembly code, so this part of the code must be removed in order to run the tool (this is discussed in the following section).<br />

Once the project has been configured, a new version of the software (with minor modifications, if the analysis is done regularly) can be assessed; about three days are needed to rerun the analysis on the project.<br />

When the tool is configured, it is easy to obtain different results by testing different allocations of the software modules, in order to verify the preliminary allocation as well as variants of it. Even for such complex software, the tool runs in less than one hour and reports all read, write, and execute safety violations.<br />

On the projects, to verify FFI, we focused on the detection of write accesses from a lower-ASIL module to a higher-ASIL module, and also on read accesses from a higher-ASIL module to a lower-ASIL module. Read accesses are not prohibited, but the objective is to be able to justify each of them by means of checks (CRC, range check, plausibility check…).<br />
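For example, a read of a QM vehicle-speed signal used in an ASIL B context could be justified by a check of this kind (the values, limits, and names are illustrative):<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Vehicle speed read from a QM module, used in an ASIL B context.
 * The raw value is accepted only after range and rate-of-change
 * (plausibility) checks; otherwise the last valid value is kept. */
#define SPEED_MAX_KMH      300u
#define SPEED_MAX_STEP_KMH  20u  /* max plausible change per cycle */

uint16_t validate_speed(uint16_t raw, uint16_t last_valid, bool *ok)
{
    bool in_range = raw <= SPEED_MAX_KMH;
    uint16_t step = raw > last_valid ? raw - last_valid : last_valid - raw;
    *ok = in_range && step <= SPEED_MAX_STEP_KMH;
    return *ok ? raw : last_valid; /* fall back to the last valid value */
}
```

Such a check, traced to a software safety requirement, is what justifies the read access reported by the tool.<br />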

The next section describes and analyzes the results obtained with the tool on these projects.<br />

C. Results<br />

a) Tool Findings<br />

In the different projects, the Safety Checker found several weaknesses in the preliminary architectures; this was taken into account to modify the architecture and the implementation wherever write accesses from a lower ASIL to a higher ASIL were detected. In most of the projects, the tool found a few dozen such violations.<br />

In addition, the tool checks the code for read accesses from a higher ASIL to a lower one. These accesses can then be traced against the software safety requirements to see whether they were designed, followed by a manual check in the code to verify that the plausibility checks or safety mechanisms have been implemented. This is important feedback for the software safety analyst, who obtains an automatic evaluation of the current implementation of safety on the project.<br />

Example of a finding:<br />
Error: ["xxx.c" 3845] safety violation writing "pos" (ASIL_B) from "XXX_GetGlobalPosition32" (ASIL_A)<br />
Info 1: ["yyy.c" 951] the address of "pos" is passed from function "CalcTravelDist" as parameter #1 to function "ZZZ_GetGlobalVehiclePosition"<br />
Info 2: ["zzz.c" 9430] parameter #1 containing the address of "pos" is passed from function "ZZZ_GetGlobalVehiclePosition" as parameter #1 to function "XXX_GetGlobalPosition32"<br />

In this write access violation, the error is that the ASIL B data "pos" is accessed, via pointers, from the ASIL A function "XXX_GetGlobalPosition32". In this case, the correction is to read the position value directly from the ASIL B context instead of providing it via ASIL A, and to add a plausibility check on the global position value in the ASIL B context.<br />

Next, all generated findings must be analyzed to assess the severity of each access and its real impact. The Safety Checker only reports access violations, without severity assessment, so the results must be analyzed: each error must be justified, and/or a ticket must be created in the chosen bug-tracking tool if the architecture must be modified.<br />

After the modifications have been incorporated in the final version, good practice is to compare the old and new files in order to verify that the errors linked to tickets have been fixed and no longer exist, and to check for regressions, i.e., new access violations. The justified errors that remain can be kept if their justification is still valid.<br />

b) Safety Management<br />

One of the interesting results of the process is that it eases communication between the software safety analyst and the software development team. It is a systematic way to perform part of the safety analysis, and it is more exhaustive and quicker than analyzing safety by reading the code for each new version. Moreover, the tool provides the call graphs and the lines of code causing each error. Besides being helpful for the software safety engineer, it also helps software engineers get more involved in reviewing the code and understanding FFI requirements.<br />

c) Misuse and Limitations<br />

However, this process may be misused and has limitations:<br />
• Misuse: Using such a tool does not remove the need to implement software partitioning and an MPU.<br />



• Limitation 1: The tool does not analyze assembly code. It focuses on C code, which provides enough abstraction, but since the instructions depend on the microcontroller architecture, the tool does not go into these details. Hence, assembly code should be peer-reviewed to make sure it does not lead to memory interferences.<br />

• Limitation 2: The tool is not a replacement for the software safety analyst, as it only tackles memory interferences. A safety engineer is still needed to allocate ASILs to components based on the results of a safety analysis (SwFMEA, SwCPA, SwFTA). Moreover, the tool does not help with other interferences: communication, execution, or timing. It does not replace the complete process, but it eases it.<br />

V. CONCLUSION<br />

This paper first describes the state-of-the-art methods used for the development of mixed-criticality automotive systems, for which, according to ISO 26262, FFI must be ensured. This process relies on software safety analysis performed by hand, which is error prone and causes inconsistencies depending on the analyst. The integration of memory access protection, i.e., an MPU, is also very difficult, as it exposes all the wrong accesses in the code in the final phase of integration, when it is most difficult to modify the software architecture.<br />

The paper then defines improvements to this development process: on the one hand, the use of the TASKING Safety Checker tool as a systematic and automatic way to assess FFI on software code; on the other hand, the use of fault injection techniques to achieve higher testing coverage of the MPU.<br />

Finally, the paper discusses the results obtained on different projects inside Valeo where this new process has been used. These results are promising: with minimal effort, the tool assesses FFI in the early phases of development, thereby reducing the MPU debugging time caused by software architectural errors and systematic faults. The introduction of fault injection for the MPU also makes the implementation of such mechanisms more robust.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank their colleagues from<br />

Valeo for their insightful comments.<br />

REFERENCES<br />

[1] ISO 26262 – Road Vehicles – Functional Safety, 10 November 2011. http://www.iso.org/iso/home/news index/news archive/news.htm?refid=Ref1499 [Online; accessed 17-Jan-2018].<br />
[2] AUTOSAR Development Cooperation, http://www.autosar.org [Online; accessed 17-Jan-2018].<br />
[3] AUTOSAR Specification of Operating System, Release 4.3.1, https://www.autosar.org/fileadmin/user_upload/standards/classic/4-3/AUTOSAR_SWS_OS.pdf [Online; accessed 17-Jan-2018].<br />
[4] AUTOSAR Specification of Watchdog Manager, Release 4.3.1, https://www.autosar.org/fileadmin/user_upload/standards/classic/4-3/AUTOSAR_SWS_WatchdogManager.pdf [Online; accessed 17-Jan-2018].<br />
[5] TASKING Safety Checker, http://www.tasking.com/content/safetychecker-asil-verification [Online; accessed 17-Jan-2018].<br />
[6] MISRA C, https://www.misra.org.uk/Activities/MISRAC/tabid/160/Default.aspx [Online; accessed 17-Jan-2018].<br />



Functional Safety in AI-controlled Vehicles<br />

If not ISO 26262, then what?<br />

Joseph Dailey<br />

Global functional safety manager<br />

Mentor, a Siemens Business<br />

Phoenix, Arizona, USA<br />

joe_dailey@mentor.com<br />

Abstract— Since its establishment in 2011, the ISO 26262<br />

international functional safety standard has rapidly emerged as<br />

the definitive guideline for automotive engineers looking to<br />

optimize the safety of electrical and/or electronic (E/E) automotive<br />

systems. But because ISO 26262 implies strict adherence to<br />

analyzing architecture and its effect on safety, the shift toward<br />

machine learning for critical driving decisions in self-driving cars<br />

threatens to break the standard’s direct link between functional<br />

safety and the requirements for how these new concepts should be<br />

fulfilled.<br />

After presenting a short history of automotive functional safety<br />

standards up to ISO 26262, this paper outlines the standard’s<br />

specific deficiencies related to artificial intelligence (AI)-controlled<br />

vehicles. Particular attention will be paid to challenges faced by<br />

existing standards bodies grappling with a more autonomous<br />

future. Whereas ISO 26262 specifies requirements for eliminating<br />

safety hazards in the presence of an E/E system fault, this paper<br />

suggests that new standards for the AI era must address so-called<br />

safety of the intended functionality (SOTIF), which means helping<br />

to validate that advanced automotive functionalities are<br />

engineered into the vehicle to avoid safety hazards even in the<br />

absence of a fault. The paper will draw on my experience working<br />

on the world’s first SOTIF standard, ISO/WD PAS 21448, which<br />

is under development.<br />

Keywords— autonomous vehicles, ISO 26262, functional safety,<br />

self-driving cars<br />

I. STANDARDS: A SHORT HISTORY<br />

Since the beginning of the Industrial Revolution, standards<br />

have been helping commerce by breaking down trade barriers.<br />

The first standard was created in 1841 and concerned screw<br />

thread measurements. In 1900 during the Paris International<br />

Electrical Congress, discussions between the British and<br />

American electrical engineering professional associations<br />

established the International Electrotechnical Commission. The<br />

IEC held its first meeting on June 26, 1906, and is still operating<br />

today. In 1926, the International Federation of the National<br />

Standardizing Associations (ISA) was established, focusing<br />

heavily on mechanical engineering. During World War Two<br />

(WWII), the ISA became the International Organization for<br />

Standardization (ISO).<br />

While standardizing products facilitated commerce within a<br />

country, there was no legal requirement between countries to<br />

adopt the same standards. Differing national approaches to<br />

standards soon created barriers to international commerce.<br />

After WWII there was a need to promote international trade<br />

by reducing or eliminating trade barriers. As a result, with the<br />

help of the newly formed United Nations, the General<br />

Agreement on Tariffs and Trade (GATT) was signed by 23<br />

nations on October 30, 1947, and went into effect January 1,<br />

1948. In the 1994 Uruguay Round Agreements, the World Trade<br />

Organization (WTO) was established as GATT’s successor.<br />

More than 120 countries took part in the Uruguay Round,<br />

producing the binding Agreement on Technical Barriers to<br />

Trade (TBT), administered by the WTO.<br />

The TBT agreement aimed to further the 1994 GATT<br />

objectives by:<br />

Recognizing the important contribution that<br />

international standards and conformity assessment systems<br />

can make in this regard by improving efficiency of<br />

production and facilitating the conduct of international<br />

trade [1]<br />

In one of its opening paragraphs, the agreement goes on to<br />

give a rationale for embracing standards as a means to ensure<br />

safety, defined as the protection of human, animal or plant life<br />

or health:<br />

Recognizing that no country should be prevented from<br />

taking measures necessary to ensure the quality of its<br />

exports, or for the protection of human, animal or plant life<br />

or health, of the environment, or for the prevention of<br />

deceptive practices, at the levels it considers appropriate,<br />

subject to the requirement that they are not applied in a<br />

manner which would constitute a means of arbitrary or<br />

unjustifiable discrimination between countries where the<br />

same conditions prevail or a disguised restriction on<br />

international trade, and are otherwise in accordance with<br />

the provisions of this Agreement [1]<br />



Since the WTO was formed, member nations have generally<br />

adhered to TBT’s original intent in creating mandatory and<br />

voluntary standards and guidelines with international standards<br />

organizations like ISO, IEEE and IEC. Standards now exist for<br />

everything from electrical and mechanical hardware, software<br />

and systems, to manufacturing processes and occupational<br />

hazards for hundreds of different applications and products.<br />

These standards furthered goals like consistency and reliability,<br />

but assumptions that a reliable product was also safe were called<br />

into question, particularly as systems became larger and more<br />

complex.<br />

In the 1990s, to create a guild of functional safety-minded<br />

engineers, the IEC led comprehensive studies of the process of<br />

creating both hardware- and software-based systems. The<br />

objective was to provide a standard so that hardware and<br />

software developers could claim their systems were safe. These<br />

studies led to the release of IEC 1508, which after public<br />

comments and a few years of further revisions, became the<br />

world’s first functional safety standard, IEC 61508, in 1998; the last four parts of the standard were released in 2000. IEC 61508 has<br />

spawned similar standards in a range of industries, including<br />

automotive (ISO 26262), rail software (IEC 62279), process<br />

industries (IEC 61511), nuclear power plants (IEC 61513),<br />

machinery (IEC 62061) and more.<br />

Though our deliberations are by necessity private, as a<br />

member of the committee working on the update, I can say that<br />

we are struggling with the typical normative approach of<br />

standards, which simply doesn’t make sense in addressing the<br />

coming revolution of ever more powerful advanced driver<br />

assistance systems (ADAS) technologies and autonomous<br />

vehicles. As a result a new group was formed to address safety<br />

of the intended functionality (SOTIF) for autonomous vehicles.<br />

Despite our best efforts, standardizing ADAS and autonomous<br />

systems will remain controversial given the challenge of<br />

identifying all possible driving scenarios, and creating relevant<br />

normative standards, and then validating those standards.<br />

II. QUESTIONS ACCOMPANY THE RISE OF AI<br />

How is it possible to standardize autonomous automotive<br />

outcomes which rely heavily on machine learning and AI? How<br />

does a standards committee factor in all of the unsafe conditions<br />

and possible AI-based responses that may occur? Today ISO is<br />

wrestling with a new standard, ISO/WD PAS 21448, to address<br />

the rise of ADAS and AI in vehicles [2]. The basic goal of<br />

functional safety in ADAS/AI is the same as always — to avoid<br />

unintended system behavior, in the absence of a fault, resulting<br />

from technological and system shortcomings and reasonably<br />

foreseeable misuse.<br />
Figure 1. A Google self-driving car in 2013. Waymo (formerly Google) says it logs 25,000 miles of road tests weekly for its autonomous fleet. In 2016, the company drove a billion miles in simulation, as well. Still, no matter how deep the pockets or good the methodology, testing will always be beset by inherent limitations. Image courtesy Becky Stern on Flickr under terms of Creative Commons (CC BY-SA 2.0).<br />
A joint effort of ISO and the Society of Automotive Engineers, ISO 26262 addresses E/E systems in passenger cars with a maximum gross weight of up to 3,500 kg. Though less than a decade old, ISO 26262 has become one of the most important standards in the automotive industry today. And this evolution of E/E functional safety is not over; the ISO 26262 committee is back at work developing the next revision, expected to be released in the third quarter of 2018. The revision adds motorcycles, and it addresses trucks and buses, which eliminates any weight restriction. Now the only exclusion is for mopeds.<br />
But even if an autonomous system is built<br />

from technologies aligned with this goal, does that guarantee<br />

driver and public safety? Or does terminology like ‘foreseeable’<br />

emphasize that safety is always a relative notion? Perhaps the<br />

proper goal is just to be safer than yesterday, or as Elon Musk<br />

and other AI proponents would say, safer than notoriously bad<br />

human drivers. Will society eventually accept entirely<br />

unforeseen accidents of robot cars? The issue with any standard<br />

is that real life situations never restrict themselves to foreseeable<br />

events. So the struggle and deficiency in any forthcoming ADAS<br />

and autonomous standards is how to make the safety of the<br />

66


intended functionality (SOTIF) — which is all that can<br />

reasonably be addressed given the basic facts about the<br />

underlying technologies — actually safe enough for societal<br />

acceptance.<br />

So let’s start by looking at some of the common ADAS<br />

technologies and AI techniques.<br />

III. ADAS AND AI

It's helpful to recall that, whether it's an algorithm or an adolescent at the wheel, driving is a learned behavior. I remember my dad always telling me to keep my distance and that you should always be able to see the rear tires of the car in front of you. Such instruction is fine, but all humans learn by confronting actual experiences while driving, say merging into freeway traffic, then using the brain's prodigious capabilities to quickly evaluate scenarios and decide on the most likely safe outcome, and finally taking action to accomplish that outcome. (In getting onto the freeway, perhaps the answer is to accelerate to get in front of the doddering senior in the merge lane; or if it's a barreling semi instead, maybe it's best to hang back until the truck has passed.)

However, even experienced drivers come across situations they have never observed or expected. The reaction to these events must be quick. Have you ever hit the brakes to avoid an animal on the road? Did you look behind you? If not, did you get hit from behind? Did you swerve? How about coming up to a truck blocking the road: did you break the law by crossing the double yellow lines to go around? Any attempt to consider all scenarios, including road and weather conditions, driver error or misuse, or acts of God, in all combinations, would result in a massive list of driving rules, one that inevitably would also be incomplete. Whether for human or computer, the only approach is to 'look' or 'sense' the environment around a vehicle, predict the most likely safe outcome and then act. AI increasingly is making such predictions, both for individual ADAS technologies and for entire autonomous systems.

A. A look back at AI

A media darling today, AI has been around since the 1950s and simply means a machine that can perform an intelligent task generally performed by humans. The criteria for what comprises intelligence include the ability to reason, represent knowledge, plan, learn, communicate and integrate these skills toward a common goal.

AI has progressed through the years. In the 1980s, machine learning started with supervised learning techniques, which used known data structures to make decisions. Among the first examples many of us encountered were email rules, alerts, and spam filters. These rules recognized specific words, websites, addresses, or other known information, and then placed suspect emails into a specific folder. Users could change and improve the rules by either identifying something new to add to the suspect list or marking certain data that was causing messages to be flagged by mistake. (Of course, Google's application of deep-learning AI across its mail services has by now mostly rendered such manual maintenance unnecessary, at least when it comes to Gmail, the world's largest email service.)

Figure 2. How a convolutional neural network distinguishes between a lion and a tiger. For a more detailed description, see http://bit.ly/2EPuPA7, a blog post by Facebook software engineer Shafeen Tejani.

Next came unsupervised learning, which used unknown data sets and determined actions based on probabilities. These techniques were used in anomaly detection algorithms and regression analysis. Machines got smarter, reinforcing their decision-making processes with previous outcomes and probabilities. By the 2010s, advanced neural networks were making possible surprising feats of deep learning, such as the Google AI triumph over one of the world's best Go players.
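The progression from hand-written rules to learned behavior can be illustrated with a toy spam filter. Everything below is invented for illustration (the word list, the scoring scheme, the function names); it simply contrasts a fixed rule with a model that learns from labeled examples.

```python
# Hypothetical rule list for the hand-written filter (illustrative only).
SUSPECT_WORDS = {"winner", "prize", "free"}

def rule_based_is_spam(message: str) -> bool:
    """Supervised-era email rule: flag messages containing known suspect words."""
    return bool(set(message.lower().split()) & SUSPECT_WORDS)

def train_word_scores(labeled):
    """Learn per-word spam/ham counts from (text, is_spam) examples."""
    scores = {}
    for text, is_spam in labeled:
        for word in set(text.lower().split()):
            counts = scores.setdefault(word, [0, 0])  # [spam count, ham count]
            counts[0 if is_spam else 1] += 1
    return scores

def learned_is_spam(message: str, scores) -> bool:
    """Flag a message if its words were seen more often in spam than in ham."""
    spam = ham = 0
    for word in message.lower().split():
        if word in scores:
            spam += scores[word][0]
            ham += scores[word][1]
    return spam > ham
```

The rule-based filter never changes unless a human edits `SUSPECT_WORDS`; the learned filter improves automatically as more labeled examples arrive, which is the shift the paragraph above describes.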

Still, the ability of AI to solve any problem with a near-human level of intelligent action is far off in the future and may stay in the realm of science fiction forever. But since increasingly autonomous vehicles are here now, the question for standards committees is how to make AI workable for an entire industry, not just its deep-pocketed giants. Doing so means understanding how AI does what it does, at least at a cursory level.

B. Deep learning<br />

Deep learning uses multiple layers of nonlinear processing<br />

units for feature learning and classification of its inputs. Each layer is trained under supervision and, once it has learned, is placed in an unsupervised state. This learning technology has

proved successful in natural language processing, computer<br />

vision, speech recognition, audio recognition and social network<br />

filtering.<br />



Deep learning techniques consist of artificial neural<br />

networks (ANNs) and deep neural networks (DNNs). ANNs and<br />

DNNs are systems inspired by the neural circuitry in the brain;<br />

such systems progressively improve their ability to do tasks<br />

through accumulated examples of rewards from distinct actions<br />

done correctly. A DNN uses hidden layers that can model complex non-linear relationships.

The challenge with this technique is that it determines outcomes by fitting a pattern. If the pattern does not fit all of the available data, then a system based on such a DNN will probably make the wrong decision. These systems also take a lot of computing power, which is costly in competitive markets like the relatively low-margin automotive industry.
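As a rough sketch of "multiple layers of nonlinear processing units," here is a two-layer network in plain Python. The weights and the tiny two-input example are arbitrary stand-ins chosen for illustration, not trained values from any real system.

```python
import math

def relu(x):
    """Nonlinear activation: rectified linear unit."""
    return max(0.0, x)

def layer(inputs, weights, biases, act):
    """One fully connected layer: act(W @ x + b), computed by hand."""
    return [act(sum(w * x for w, x in zip(row, inputs)) + b)
            for row, b in zip(weights, biases)]

def forward(x):
    """Forward pass: hidden ReLU layer, then a sigmoid output unit."""
    hidden = layer(x, [[1.0, -1.0], [-1.0, 1.0]], [0.0, 0.0], relu)
    out = layer(hidden, [[1.0, 1.0]], [0.0],
                lambda v: 1.0 / (1.0 + math.exp(-v)))
    return out[0]  # probability-like score in (0, 1)
```

Stacking such layers (and learning the weights from data rather than writing them down) is what gives deep learning its representational power, and also what makes it compute-hungry at automotive scale.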

C. Deep reinforcement learning (DRL)<br />

Deep reinforcement learning (DRL) is another type of<br />

machine learning algorithm. This technique uses its experience<br />

to determine the ideal behavior. Instead of rewards, DRL<br />

classifies what amount to punishments and pain into what is<br />

called a Q value, which it seeks to optimize as it makes<br />

predictions from its inputs. Accordingly, DRL learns from<br />

examples to create new and ever more sophisticated models.<br />

(DRL systems for visually analyzing imagery are often based on<br />

so-called convolutional neural networks, which have proven<br />

specifically useful in creating ADAS technologies that sense and<br />

act on the world from the point of view of a driver.)<br />
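The reward/penalty ("Q value") bookkeeping behind reinforcement learning can be sketched with tabular Q-learning on a toy two-lane road where the agent must change lanes to avoid an obstacle. The road layout, rewards and hyperparameters are invented for illustration; deep RL replaces this small table with a neural network.

```python
import random

LENGTH, OBSTACLE = 5, (2, 0)          # obstacle at step 2 in lane 0
ACTIONS = ("keep_lane", "change_lane")

def step(pos, lane, action):
    """Advance one road step; return (next state, reward, done)."""
    lane = 1 - lane if action == "change_lane" else lane
    pos += 1                                   # the car always moves forward
    if (pos, lane) == OBSTACLE:
        return (pos, lane), -10.0, True        # "pain": crash ends the episode
    if pos == LENGTH - 1:
        return (pos, lane), +10.0, True        # reached the end of road safely
    return (pos, lane), -0.1, False            # small cost per step

def train(episodes=300, alpha=0.5, gamma=0.9, eps=0.2, seed=1):
    """Learn Q values by epsilon-greedy exploration of the toy road."""
    rng = random.Random(seed)
    q = {(p, l): [0.0, 0.0] for p in range(LENGTH) for l in (0, 1)}
    for _ in range(episodes):
        state, done = (0, 0), False
        while not done:
            a = (rng.randrange(2) if rng.random() < eps
                 else q[state].index(max(q[state])))
            nxt, reward, done = step(*state, ACTIONS[a])
            target = reward if done else reward + gamma * max(q[nxt])
            q[state][a] += alpha * (target - q[state][a])   # Q-value update
            state = nxt
    return q
```

After training, the Q value for "change lane" just before the obstacle exceeds the value for "keep lane," which is exactly the learned-from-experience behavior the paragraph describes.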

D. Cognitive computing<br />

Cognitive computing is a technique where computing<br />

systems try to simulate human thought processes. Whereas<br />

typical AI systems follow complex algorithms to solve a<br />

problem, cognitive computing tries to mimic the very social<br />

human brain that is always interacting with humans, be they<br />

passengers, other drivers or pedestrians.<br />

The potential is huge if AI, deep learning, reinforcement<br />

learning and cognitive computing can be applied together in an<br />

automotive context. But one fact still remains a problem for standards committees: AI actions do not follow a deterministic and normalized model. In other words, AI-driven actions are not directly linked to a system's inputs, parameters, initial conditions, or prescribed rules. There simply is no way to define a standard reaction, since most actions are now based on probabilistic models and reinforced with previous actions and outcomes.

The path forward for a rule-making committee may be<br />

creating standards that themselves adapt to the way we think and<br />

act in the world, though perhaps this is an impossible task as it’s<br />

akin to defining those processes in the human brain. (The<br />

committees, of course, are generally full of engineers, not<br />

neuroscientists.) We need to provide guidelines that fit the way<br />

AI is evolving — increasingly, self-driving cars will use<br />

accumulated previous knowledge to determine actions in<br />

specific scenarios. That is, cars will teach themselves, and as<br />

long as we provide the correct guidance, our collective safety<br />

should be enhanced as the vehicles convey that learning into new<br />

actions in the world.<br />

IV. THE RESPONSE OF REGULATORS, LAWMAKERS AND STANDARDS BODIES (SO FAR)

The rise of AI doesn’t obviate the need for existing standards<br />

regimes, which will remain the same for supporting processes,<br />

hardware and software development. However, the looming<br />

stumbling block is dealing with the SOTIF idea, particularly<br />

relevant in ADAS and autonomous applications. Engineering<br />

teams across the supply chain, especially at relatively small<br />

companies, will be looking for guidance on how to conduct a<br />

state of the art development cycle that includes advanced<br />

concepts in artificial intelligence.<br />

As the ISO 26262 committee struggles with the new SOTIF standard, ISO/PAS 21448, organizations like the U.S. National Highway Traffic Safety Administration (NHTSA)

are similarly working through the new issues posed by<br />

autonomy. On September 12, 2017, NHTSA released the new<br />

federal guidelines for Automated Driving Systems called A<br />

Vision for Safety 2.0, which includes a normative statement that<br />

“the process shall describe design redundancies and safety<br />

strategies for handling automated driving system (ADS)<br />

malfunctions.” [3] But there is no guidance on how to determine those redundancies and strategies; the document instead refers readers to ISO 26262.

The NHTSA publication is non-regulatory and covers<br />

autonomous safety elements from autonomous driving levels<br />

three to five, placing significant emphasis on software<br />

development, verification and validation. It also provides<br />

guidance on ADS safety elements covering system safety,<br />

operational design domain (road type, geographic area, speed<br />

range and other constraints), object and event detection and<br />

response (crash avoidance capability), fallback or minimal risk<br />

conditions, validation methods and cybersecurity. Though it<br />

provides a means for self-assessment, the document is not meant<br />

to be a legal requirement.<br />

In the United States, aside from the NHTSA guidance,<br />

momentum is apparent in state legislatures and governors’<br />

offices when it comes to passing laws on self-driving. Since<br />

Nevada authorized the first autonomous vehicles in 2011,<br />

twenty additional states have passed similar legislation related<br />

to autonomous vehicles. At least a dozen more have introduced<br />

laws on the subject, while governors in five states have bypassed<br />

the legislative process altogether and issued executive orders<br />

related to autonomous vehicles. [4]<br />

The international situation is just as scattershot. Rules differ<br />

by country, though generally are based on the U.N. Economic<br />

Commission for Europe (UNECE) Convention on Road Traffic,<br />

commonly known as the Vienna Convention on Road Traffic.<br />

First agreed upon during the Global Forum for Road Traffic<br />

Safety in November 1968, the agreement has 36 signatories and<br />

has been ratified by 75 parties, including most of Europe, and<br />

parts of the Americas, Asia, the Middle East, Africa, Russia and<br />

Indonesia. The major countries that are not a part of this<br />

agreement include the United States, Canada, China, Japan,<br />

Australia and India.<br />

In March 2016, the UNECE passed a regulatory milestone<br />

towards the deployment of automated vehicle technologies with<br />

an amendment to the 1968 rules, allowing automated driving<br />

technologies that transfer driving responsibilities to the vehicle<br />



in traffic, provided these technologies can be overridden or<br />

switched off by the driver. The amendment includes discussions<br />

on self-steering systems that take over the control of the vehicle<br />

under the permanent supervision of the driver, like lane-assist,

self-parking or highway autopilots.<br />

Despite this flurry of activity, there is a conspicuous lack of<br />

focus on what it means to achieve ‘safety.’ Regulations describe<br />

at great length what and how a vehicle is to be tested, and who<br />

can conduct those tests, and even the road and environmental<br />

conditions during testing. California’s Senate Bill No 1298 is<br />

representative of most other legislation in the amorphous way it<br />

references safety, with a stated goal to “… [create] appropriate<br />

rules intended to ensure that the testing and operation of<br />

autonomous vehicles in the state are conducted in a safe<br />

manner.” Little guidance is given on defining ‘a safe manner’<br />

except to ensure that a human driver can take over when<br />

necessary. The law states that “[t]he autonomous vehicle shall<br />

allow the operator to take control in multiple manners, including,<br />

without limitation, through the use of the brake, the accelerator<br />

pedal, or the steering wheel, and it shall alert the operator that<br />

the autonomous technology has been disengaged.” [5]<br />

This guidance is problematic on many levels. A driver is<br />

required to be the fallback to dynamic driving tasks, which<br />

assumes that the driver is aware of all unsafe situations and able<br />

to recover into a safe state when pressed to do so. Already a<br />

significant percentage of accidents are caused by distracted<br />

drivers, a group unlikely to be ready to take over in a crisis and<br />

that seems poised to grow in number as technology mediates<br />

more of the driving experience and life in general. The more<br />

glaring problem is that the human fallback option is only applicable

for level three and below; the language doesn’t seem to include<br />

fully autonomous level four and five vehicles, where the system<br />

itself is the ‘fallback’ option.<br />

V. THE LIMITS OF BRUTE-FORCE TESTING<br />

Another solution for providing safe autonomous systems is<br />

an abundance of testing. Waymo (previously Google) is the<br />

flagship example here, logging approximately 25,000 miles<br />

every week with its fleet of autonomous test vehicles operating in

four U.S. cities. Granted, testing will be critical, requiring both<br />

actual road miles and simulation. Waymo, arguably the leader in<br />

real-world testing, notes that it also drove a billion miles in<br />

simulation in 2016 alone. (Tass International, a Siemens<br />

business, offers a range of simulation and validation solutions<br />

for automated driving, including a platform called PreScan for<br />

simulating traffic and road environments, and support for<br />

hardware in the loop testing for various sensor and<br />

communication systems.)<br />

Still, no matter how deep the pockets or how good the<br />

technology, testing will always be beset by inherent limitations.<br />

The most obvious one is the difficulty in testing all edge cases.<br />

Even after millions or billions of city-street miles and virtual<br />

testing, an autonomous system might still react in an unsafe<br />

manner when confronted with an unforeseen set of inputs. (And<br />

yes, driving, like all tech-mediated human activity, will always

involve unforeseen inputs.)<br />

A second and related problem is determining what is correct<br />

in terms of autonomous decision-making. Invariably, robot car<br />

decisions can only be measured in degrees of correctness. Who<br />

determines what is correct enough? If the car is going to hit<br />

something on the road, does it run over the object? Stop and risk<br />

a rear-end collision? Swerve? Cross the double yellow lines? Or<br />

do we allow a simpler result — that is, an autonomous

vehicle “passes” the test if it responds to a situation so that no<br />

one gets hurt? In functional safety, there is much attention given<br />

to testing and evidence that builds confidence that a given<br />

system is safe, but this notion is murky when confidence is not<br />

tied to yes/no outcomes but instead to degrees of probability or<br />

correctness.<br />

The only certainty is that a combination of simulation,<br />

laboratory testing and real-world testing is the only practical<br />

method, so it’s up to committees to normalize or standardize<br />

these methods.<br />

VI. LIABILITY AND INSURANCE

The question of liability shadows the work of all standards<br />

committees. If deterministic requirements for autonomous<br />

vehicles are published, and a developer follows that standard to<br />

the strictest degree and an accident still occurs, who assumes<br />

liability?<br />

Standards, of course, are generally not legally binding<br />

though are often used by the courts to settle legal disputes,<br />

especially in product liability cases. Invariably accountability<br />

for accidents will shift from drivers and their insurance<br />

companies to carmakers, and the tech vendors increasingly<br />

prominent in the auto supply chain. (An example: California<br />

recently scrapped a planned rule that would have let carmakers off the hook when their autonomous vehicle crashed, if it was determined that the car hadn't been maintained to spec. That is,

a carmaker might have been spared liability if a vehicle in a<br />

fender bender had muddy sensors, even if the accident actually<br />

stemmed from sloppy code. Regulators said no way.) Standards<br />

committees comprised mostly of carmakers and their<br />

hardware/software suppliers might be disinclined to accelerate<br />

this tectonic shift in how blame is apportioned for accidents,<br />

particularly in environments still mostly devoid of consistent<br />

national regulation.<br />

What makes the most sense is for these committees to<br />

provide requirements on a V-cycle development process,<br />

including historical reconstruction of such processes. Indeed,<br />

this is already required in ISO 26262. But when it comes to<br />

autonomous architecture and AI development, the committee<br />

might need only to provide information or guidelines in detailing<br />

the ongoing operational parameters, algorithm use cases and<br />

probability-based evaluation of outcomes which lead up to an<br />

unsafe situation.<br />

VII. AUTONOMOUS STANDARDS TODAY AND BEYOND<br />

The good news is that we have robust standards, notably ISO<br />

26262, for processes, and hardware/software development. And<br />

governments are increasingly issuing guidelines and laws<br />

concerning self-driving cars for real-world testing, slowly<br />

clarifying or at least populating the previously barren regulatory<br />

landscape.<br />

But since AI development points the way to more and more<br />

nondeterministic applications, instead of focusing on proving<br />



inputs, parameters, and logical paths that lead to a safe self-driving application, we must shift our focus to proving the safety

of both the process of creating the AI in the first place and then<br />

to its eventual outcomes when it makes decisions and mistakes<br />

on public roads. It won’t be easy given all the vagaries around<br />

the notion of safety which, according to the philosophers, is<br />

often fundamentally in conflict with freedom and free will.<br />

For now, a standard like the forthcoming ISO/PAS 21448 will likely be the best guideline our industry is capable of producing. The guidance cannot fully pin down the nondeterministic nature of AI, but it will at least provide some

normalization and information. The committee members I’m<br />

serving with represent a breadth of impressive technological<br />

expertise in the ADS field. Among other outcomes, expect the<br />

standard to provide a much needed common vocabulary so we<br />

can all begin communicating effectively on autonomous safety.<br />

And it’s clear that the standard will provide guidance even on<br />

slightly opaque issues such as how to consider known and<br />

unknown use cases, dependencies, the limitation of<br />

countermeasures, automation authority and warning strategies.<br />

There also will be information on more conventional topics like<br />

verification and validation.<br />

VIII. CONCLUSION<br />

Starting more than a century ago with an effort to make screw

threads uniform, standards have boosted commerce by breaking<br />

down trade barriers between companies and countries. The<br />

existing ISO 26262 and forthcoming ISO/PAS 21448 standards<br />

are no different. But as products and even the design flows<br />

themselves become more autonomous and non-deterministic,<br />

standards committees will need to wrestle with outcomes<br />

determined from probabilities.<br />

Standards bodies and society at-large will need to accept the<br />

reality that there will be unsafe situations, accidents, injuries and<br />

deaths in the autonomous future. The goal of functional safety is<br />

not to eliminate these events since they will always happen;<br />

instead, the goal is to limit their likelihood.<br />

More to the point, in our era of measuring and optimizing<br />

metrics of all kinds, can we make unsafe situations significantly<br />

less likely than they are today? The answer is, of course! Despite<br />

the difficulty in describing precisely how a robot car will make<br />

decisions, standardization efforts like ISO 26262 and the<br />

forthcoming SOTIF work by ISO provide much needed<br />

guidance on how to determine how safe is safe enough as these<br />

cars proliferate.<br />

And that’s an excellent outcome, any way you describe it.<br />

ACKNOWLEDGMENT<br />

Thanks to Robert Bates and Andrew Macleod for their<br />

review of this paper, and to Geoff Koch for editing and layout<br />

assistance.<br />

REFERENCES<br />

[1] “Agreement on Technical Barriers to Trade,” accessed January 2018,<br />

http://bit.ly/2FIDA0n.<br />

[2] ISO/WD PAS 21448, "Road vehicles -- Safety of the intended<br />

functionality," ISO Standards Catalogue,<br />

www.iso.org/standard/70939.html.<br />

[3] "Automated Driving Systems 2.0: A Vision for Safety," NHTSA,<br />

accessed January 2018, www.nhtsa.gov/manufacturers/automated-driving-systems.

[4] "Autonomous Vehicles | Self-Driving Vehicles Enacted Legislation,"<br />

National Conference of State Legislatures (NCSL), accessed January<br />

2018, http://bit.ly/2ELZ4YI.<br />

[5] California SB-1298, "Vehicles: autonomous vehicles: safety and<br />

performance requirements," (2011-2012), http://bit.ly/2EPXDc1.

FURTHER READING<br />

[1] Robert Bates, “Is it Possible to Know How Safe We Are in a World of<br />

Autonomous Cars,” 2017 Mentor Graphics whitepaper,<br />

http://go.mentor.com/4VxbP.<br />

[2] A. G. Foord and W. G. Gulland, 4-Sight Consulting, UK; C. R. Howard,<br />

Istech Consulting Ltd, UK, "Ten Years of IEC 61508; Has It Made Any<br />

Difference?" IChemE Symposium Series No. 156, 2011,<br />

http://bit.ly/2FIGBxD.<br />

[3] "An Introduction to Functional Safety and IEC 61508," MTL Instruments<br />

Group plc, 2002, http://bit.ly/2FKpj3a.<br />

[4] "Global status report on road safety," World Health Organization, updated<br />

July 2017, http://bit.ly/2FLmcb4.<br />



Designing Embedded Systems for Autonomous<br />

Driving with Functional Safety and Reliability<br />

David Lopez<br />

NXP Semiconductors,<br />

Marketing and application manager,<br />

Safety and power management<br />

Toulouse, France<br />

Jean-Philippe Meunier<br />

NXP Semiconductors,<br />

Functional safety architect,<br />

Safety and power management<br />

Toulouse, France<br />

Maxime Clairet<br />

NXP Semiconductors,<br />

Systems and applications engineer,<br />

Safety and power management<br />

Toulouse, France<br />

Abstract—Societal changes and policy regulations are driving

automotive requirements for electrification, connectivity and<br />

autonomy. Embedded systems need safety-defined and safety-designed solutions, as well as extended robustness, to accompany this

transition. This paper addresses a methodology applied to power<br />

management devices that require the highest level of functional<br />

safety; how those processes extend robustness and are<br />

instrumental to reducing and preventing systematic failures with<br />

dedicated hardware management strategies. Extended<br />

qualification tests that assess reliability robustness demonstrate<br />

how a device can operate under different environments, with<br />

different grade levels representing the qualification. This paper<br />

will discuss the results of some Grade 0 tests performed on power<br />

management solutions, to secure the extended temperature operating range.

Keywords—functional safety, power management architectures,<br />

fail silent, fault tolerant, grade 0, robustness.<br />

I. INTRODUCTION<br />

Today’s society is faced with considerable challenges<br />

towards the impending energy transition. Concerns about<br />

climate change, urbanization and austerity measures due to<br />

shortage of resources are combining with the need for<br />

increasing safety on the roads. The automotive industry’s role<br />

in contributing to this transition is clear-cut. The dominant<br />

drivers will be electrification, connectivity and autonomy<br />

enabling a driverless system that improves the mobility<br />

experience and helps to reduce fatalities.<br />

However, other industries, such as aeronautics, are also dominant, and collaboration will be essential to the rapid

development of future systems. In fact, the aeronautics industry<br />

set the precedent for embedded systems with full redundancy to<br />

reach higher levels of autonomy. This is what the automotive<br />

industry aspires to, to assist or replace drivers. Systems such as<br />

these, albeit adapted to the automotive market, would still<br />

require high dependability, with cost effective architectures and<br />

solutions. This article will introduce the evolution of system<br />

architectures from fail safe to fail operational, with a highlight<br />

on power management solutions developed to simplify system<br />

design and secure safety assessments. Each market uses its own<br />

standard, methodologies and certification, but the evolution of<br />

embedded systems required for future mobility (road or air)<br />

requires closer collaboration.<br />

Finally, the consumer market is permeating the automotive<br />

market with solutions for artificial intelligence. However, these<br />

solutions need to be adapted to the constraints of the automotive<br />

environment. The international SAE standard sets out different<br />

levels of automation. In this model, the vehicle operates in fail<br />

safe mode in levels 0 through 2 and in fail operational mode in<br />

levels 3 through 5. These levels are essential for setting out<br />

minimum requirements in terms of functional safety for<br />

autonomous vehicles.<br />
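The mapping just described, fail safe for SAE levels 0 through 2 and fail operational for levels 3 through 5, can be written as a trivial lookup. The function name and error handling are ours; the levels and labels come from the text.

```python
def required_failure_mode(sae_level: int) -> str:
    """Return the failure-handling mode the text associates with an SAE level."""
    if sae_level not in range(6):
        raise ValueError("SAE J3016 defines levels 0 through 5")
    # Levels 0-2: driver is the fallback; levels 3-5: the system must be.
    return "fail-safe" if sae_level <= 2 else "fail-operational"
```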

Robustness, lifetime, quality and reliability are the key<br />

challenges to adapting these technologies to the new mobility<br />

requirements. The next section will discuss the features of a failsafe<br />

architecture with examples of typical system<br />

implementations.<br />

II. FAIL-SAFE ARCHITECTURE

The safety architecture for a system such as electrical power steering (EPS) is illustrated in "Fig. 1." This architecture, traditionally based on a fail-safe topology, aims to deactivate the driver assistance functionality in the case of a failure.

Fig. 1. EPS fail-safe architecture. [Block diagram: a battery-supplied (VBAT) FS6500 ASIL D safety power-management device provides the core, VDDIO, ADC-reference and sensor supplies to an ASIL D safety MCU; the MCU is monitored via voltage monitors (VCOREMON, VMON1, VMON2), FCCU1/FCCU2 inputs and a watchdog challenger over SPI, while RSTb and the FS0b/FS1b (delayed) safety outputs drive a safety switch and the motor gate driver; CAN connects the ECU to the vehicle.]


In this specific implementation, when a failure occurs, the safety switches are opened so that the system remains controllable. A functional safety system basis chip, such as the FS6500, plays an important role here because it is the only component able to reset the microcontroller and transition the system into a safe state in case of hardware or software issues. Microcontroller hardware failures are monitored by the FS6500 via the Fault Collection and Control Unit (FCCU) inputs, and software plus temporal aspects are monitored via the watchdog challenger. However, this architecture limits availability: if a failure occurs in the system, availability is reduced or even lost, forcing the system into a safe state and therefore losing the driver assistance functionality.
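The challenge-response ("challenger") watchdog idea can be sketched as a small simulation. The 8-bit LFSR question generator, the error limit and all names below are invented for illustration; they do not reflect the actual FS6500 register map or protocol details.

```python
class ChallengerWatchdog:
    """Safety-chip side of a question/answer watchdog (toy model)."""

    def __init__(self, seed=0xB2, error_limit=3):
        self.lfsr = seed              # current challenge ("question")
        self.error_limit = error_limit
        self.errors = 0
        self.safe_state = False

    @staticmethod
    def _next(value):
        # Toy 8-bit LFSR standing in for the real question generator.
        bit = ((value >> 7) ^ (value >> 5) ^ (value >> 4) ^ (value >> 3)) & 1
        return ((value << 1) | bit) & 0xFF

    def challenge(self):
        return self.lfsr

    def answer(self, response):
        """Verify the MCU's answer; escalate to safe state on repeated errors."""
        expected = self._next(self.lfsr)
        if response == expected:
            self.errors = 0                  # good answer: reset error counter
        else:
            self.errors += 1
            if self.errors >= self.error_limit:
                self.safe_state = True       # e.g. assert FS0b / reset the MCU
        self.lfsr = expected                 # advance to the next challenge

def mcu_compute_answer(challenge):
    """MCU side: proves liveness and sanity by computing the same function."""
    return ChallengerWatchdog._next(challenge)
```

A healthy MCU keeps answering correctly and the error counter stays at zero; a hung or corrupted MCU fails the exchange repeatedly, and the safety chip drives the system to its safe state.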

The Automotive Safety Integrity Level (ASIL) of the FS6500, as defined by ISO 26262, the functional safety standard for road vehicles, is "ASIL D".

“Fig 2,” shows a simple representation of a fail operational<br />

architecture.<br />

TABLE II.<br />

Fig. 2. Fail-Operational Unit<br />

This “ASIL D” level dictates the highest integrity<br />

requirements which are basically reported on the individual<br />

components that compose the system including the Power<br />

Supply. Technical safety assumptions were taken during the<br />

development phase because the System Basis Chip was<br />

developed as a Safety Element out of Context (SEooC).<br />

Several technical safety requirements derived from these<br />

safety assumptions, that highlighted the independencies of the<br />

safety monitoring unit of the product. This includes the<br />

independencies and redundancies of the reference voltages,<br />

current references, clocks and state machines compared to the<br />

power management domain (SMPS, LDOs, system features).<br />

This independent safety monitoring unit rated “ASIL D”<br />

provides a set of fully configurable built-in safety mechanisms<br />

to the system integrator.<br />

In this architecture, two fail-silent units are used. The independence of the power sources is ensured by VBAT1 and VBAT2. Redundant and independent power supplies (FS65) and processing (MCU) are provided, and both ECUs can drive the actuator. In case of a failure in one of the units, the second, backup ECU is able to take control. In addition to the full redundancy, both ECUs can cross-check critical information with each other to increase the global diagnostic coverage of the system.<br />

A. Fail operational concept applied to the Electrical Power<br />

Steering use case.<br />

If we apply this high-level concept to the electrical power steering architecture, we end up with the implementation shown in “Fig. 3.”<br />

III.<br />

FAIL OPERATIONAL ARCHITECTURES<br />

In accordance with the SAE standard, the highest levels of automated driving (Level 4/Level 5) require new fail-operational architecture implementations to deliver functionality in the vehicle.<br />
Fail-operational systems guarantee the full or degraded operation of a function in case of failure, so a single fail-safe ECU can no longer be used, for the reasons explained above.<br />
To satisfy the requirements of a very high-availability system, redundant fail-silent units are envisaged. However, a complete safety analysis must be done to ensure diversity in the information channel and to eliminate common-cause failures.<br />




Fig. 5. Fault tolerant central fusion<br />


Each sensor fusion module can be decomposed as shown in “Fig. 6.” The FS6500 is rated “fit for ASIL D” and is an ideal companion to a safety MCU in the ASIL D(D) domain.<br />

Fig. 3. Fail operational EPS (degraded operation)<br />

In this configuration, two FS6500 devices supply two different MCUs. The full chain is independent, from the power (VBAT1 and VBAT2) to the gate drivers (GD1, GD2). Only one electrical motor is used (6 phases). If a failure occurs in one of the channels, the relevant channel is switched off and operation continues using the backup channel. This option loses roughly 50% of the torque assist, but the system continues to work in degraded operation.<br />

B. Fail operational concept applied to central fusion use<br />

case.<br />

The central fusion system takes data from various sensors in the vehicle, such as radar, camera and lidar, then merges and computes that information to command actuators such as braking and steering.<br />
“Fig. 4” represents a high-level block diagram of a central fusion system; the ASIL allocation for each function is also represented. As a general comment, in the central fusion module it is very common to find ASIL D(D) elements coexisting with ASIL B(D) elements on the same module.<br />


Fig. 4. High Level Central Fusion Block Diagram<br />


Then, with a view to a fault-tolerant system for driving automation starting at Level 3, all the different parts that compose the system must be redundant and independent, as described in “Fig. 5.”<br />

Fig. 6. Central Fusion Unit<br />


FS6500 integrates a “safety island” where all safety mechanisms are designed. This safety island is based on a fully redundant architecture, completely independent from the power management side. A specific focus was also put on isolation, eliminating perturbations from switchers (e.g. negative substrate injection).<br />

To achieve these safety architectures, different measures are taken to assess the failure probability of the IC. The attach strategy, pairing power management ICs with microcontrollers, means that products are defined from the outset to go together. This greatly simplifies the safety strategy and design of the system and means that integrated measurements can be implemented. However, in order to analyze the risk at system level, it is necessary to be able to quantify the risk of each individual IC failure.<br />



IV.<br />

QUANTITATIVE ANALYSIS: FROM<br />

RELIABILITY TO FUNCTIONAL SAFETY<br />

Functional safety metrics are calculated based on the Failure In Time (FIT) metric, which quantifies the risk of failure during the lifetime of an application according to the IEC TR 62380 standard [2]. The FIT rate is calculated with (1) below:<br />
FIT = λdie + λpackage + λoverstress    (1)<br />
where λdie, λpackage and λoverstress are respectively the risk of failure related to the integrated circuit die, to all the parts constituting the package, and to system stress during operation.<br />
The parameter λdie is calculated with (2) below:<br />
λdie = (λ1 · N · e^(−0.35·a) + λ2) · [ Σ i=1..y (πt)i · τi ] / (τon + τoff)    (2)<br />
where λ1 is the per-transistor base failure rate of the integrated circuit family, λ2 is the failure rate related to the technology mastering of the integrated circuit, N is the number of transistors of the integrated circuit, a is the year of manufacturing minus 1998, (πt)i is the i-th temperature factor related to the i-th junction temperature of the integrated circuit mission profile, τi is the i-th working time ratio of the integrated circuit for the i-th junction temperature of the mission profile, τon is the total working time ratio of the integrated circuit and τoff is the time ratio for the integrated circuit being in storage.<br />
The parameter λpackage is calculated with (3) below:<br />
λpackage = 2.75·10⁻³ · πα · [ Σ i=1..z (πn)i · (ΔTi)^0.68 ] · λ3    (3)<br />
where πα is the influence factor related to the difference in thermal expansion coefficients between the mounting substrate and the package material, (πn)i is the i-th influence factor related to the annual number of thermal variation cycles seen by the package with the amplitude ΔTi, ΔTi is the i-th thermal amplitude variation of the mission profile and λ3 is the base failure rate of the integrated circuit package.<br />
Based on hardware deterioration, the FIT rate calculation helps to determine the following ISO 26262 metrics. The FIT rate of the device is distributed over the device functions based on their representative die size, and for each function it is equally distributed over all possible failure modes. If a failure mode of a safety-related function violates one of the application safety goals, a safety mechanism is required to detect it. One FIT represents one failure in 10⁹ device-hours (for example, 1,000 devices operating for roughly 114 years).<br />
This FIT rate is an input to a SafeAssure tool developed by NXP. The Dynamic FMEDA (Failure Mode Effect and Diagnostic Analysis) calculates three ISO 26262 [1] metrics required to qualify for an ASIL level.<br />
The SPFM (Single Point Fault Metric) represents the coverage of the failure rate which violates an application safety goal: >99% for ASIL D. Depending on the diagnostic coverage of the safety mechanism (low: 60%, medium: 90%, high: 99%), the residual FIT of the undetected failure mode enters (4):<br />
SPFM = 1 − Σ λRF / Σ λSR    (4)<br />
where λSR is equal to the FIT rate of the safety-related functions and λRF is the residual FIT of the undetected failure modes.<br />
The LFM (Latent Fault Metric) covers failures in the safety detection mechanism (also called monitoring) that can lead to the violation of the application safety goal in conjunction with a single point fault: >90% for ASIL D. The same approach is applied to the LFM, and the residual FIT of the undetected failure mode (for example, undetected by BIST) is calculated with (5):<br />
LFM = 1 − Σ λMPF / Σ (λSR − λRF)    (5)<br />
where λMPF is equal to the residual FIT of latent faults.<br />
The PMHF (Probabilistic Metric for random Hardware Failures) concerns the residual probability of breaching a safety goal (&lt; 10⁻⁸ per hour, i.e. &lt; 10 FIT, for ASIL D).<br />


TABLE III.<br />

LEVEL OF METRICS FOR TYPE OF ASIL<br />

V. APPLICATION MISSION PROFILE AND<br />

AUTOMOTIVE QUALIFICATION REQUIREMENT<br />

The automotive market is moving towards the convergence of electrification and autonomous driving to lower emissions, optimize traffic congestion and reduce other hazards. This trend requires more and more electronic systems capable of acting in place of a human driver, such as for steering, braking or transmission. However, these systems also need to manage efficient monitoring and charging of the battery to optimize vehicle autonomy and battery lifetime over 15 years. In addition to these automotive mission profiles, there are qualification requirements for standard products that need to be validated for use in an automotive context.<br />

A. Grade 0 requirement<br />

New steering, braking and transmission powertrain applications are increasingly highly integrated and are sometimes combined with lower housing thermal performance to reduce production cost. The consequence is a higher working temperature range, requiring an AEC-Q100 Rev. H [3] Grade 0 qualification (Ta = 150°C and Tj = 175°C). The +25°C delta on both Ta and Tj compared to a standard Grade 1 qualification is an important gap to satisfy, with additional qualification stress to perform.<br />

Table IV below shows an automotive mission profile<br />

requiring Grade 0 qualification.<br />

TABLE IV.<br />

GRADE 0 MISSION PROFILE<br />

Ambient temperature Ta (°C) | Junction temperature Tj (°C) | Operation time (Hrs)<br />
-40 | -15 | 260<br />
-15 | 10 | 450<br />
5 | 30 | 550<br />
45 | 70 | 700<br />
75 | 100 | 800<br />
85 | 110 | 900<br />
95 | 120 | 1200<br />
105 | 130 | 3100<br />
115 | 140 | 2100<br />
125 | 150 | 1600<br />
135 | 160 | 330<br />
145 | 170 | 10<br />
Average Ta: 73°C | Average Tj: 98°C | Total operation time: 12000 Hrs<br />

The FS6500 fit-for-ASIL-D system basis chip family has been qualified for Grade 0 according to the above mission profile: 2000 hours of High Temperature Operating Life test (HTOL) performed at Tj = 175°C, 3000 Temperature Cycles (TC) performed from -55°C to +150°C, and 2000 hours of High Temperature Storage Life test (HTSL) performed at +150°C. The FS6500 portfolio, extended with the Grade 0 MC35FS6500 family, offers outstanding reliability performance to support the high-temperature applications required by the harshest automotive environments and market trends.<br />

B. Extended Grade 1 requirement<br />

New battery management applications for electrical vehicles require longer device operation time, up to 30% of the device lifetime. Indeed, compared to a traditional Internal Combustion Engine (ICE) vehicle, batteries for an Electrical Vehicle (EV) require the electronics in charge of battery management to be active even during the charging phase, while the car is parked.<br />

Table V below shows an Electrical Vehicle mission profile, where we can see the long operation time around 60°C. This takes into account the charging phase of the batteries, with a total operation time of 40,000 hours, between three and four times longer than a typical ICE mission profile.<br />

TABLE V.<br />

EV MISSION PROFILE<br />

Ambient temperature Ta (°C) | Junction temperature Tj (°C) | Operation time (Hrs)<br />
-40 | -25 | 2400<br />
63 | 78 | 19703<br />
70 | 85 | 2656<br />
90 | 105 | 5201<br />
100 | 115 | 6440<br />
105 | 120 | 2407<br />
115 | 130 | 1093<br />
125 | 140 | 100<br />
Average Ta: 79°C | Average Tj: 94°C | Total operation time: 40000 Hrs<br />

The FS6500 fit-for-ASIL-D family developed by NXP has successfully passed 4200 hours of High Temperature Operating Life test (HTOL) performed at Tj = 150°C, covering the mission profile above.<br />

C. FIT rate impact<br />

These mission profiles, extended in temperature range<br />

or in operation time, have to be carefully analyzed by the<br />

semiconductor supplier to determine the appropriate reliability<br />

stress conditions and durations to be performed during the<br />

qualification of the device.<br />

Moreover, the Failure In Time (FIT) rate of the electronic devices selected for safety automotive applications is calculated with the mission profile as input. As described in Section IV, the application mission profile influences the total FIT rate of the device and consequently proportionally affects the PMHF metric, the output of the FMEDA analysis, as can be seen in Table VI below.<br />



TABLE VI.<br />

FS6500 FIT AND PMHF COMPARISON<br />

               | Grade 1 | EV   | Grade 0<br />
FIT rate (FIT) | 53.6    | 75.3 | 70.1<br />
PMHF (FIT)     | 0.72    | 1.02 | 0.96<br />

A tough mission profile increases the FIT rate and the PMHF. This is where a redundant device architecture between the function and its monitoring makes the difference. The FS6500 safety-related functions and their monitoring are physically and electrically independent, limiting the PMHF impact of such a mission profile and facilitating the development of safety automotive applications up to ASIL D compliance.<br />

On the other hand, for the final pillar of automotive qualification requirements, where consumer-grade semiconductors are used in vehicles, ZVEI and other industry partners developed a framework to handle this scenario. The guidelines help facilitate the use of products created without automotive processes in applications that require more stringent reliability and robustness. Several systems from the consumer, gaming and networking markets are crossing over to the automotive market, making mobility connected, efficient and autonomous. These components require the adaptation of technologies to the automotive environment, or at least an evolution of the design and qualification process.<br />

VI.<br />

CONCLUSION<br />

Convergence between different markets means that embedded systems not specifically designed for automotive environment conditions are being used in vehicles, with the associated guarantees and performance, mostly to respond to demands for high-performance infotainment, radar and camera driver assistance technologies.<br />

With the move towards electrification and automation, more<br />

stringent test and reliability stresses need to be performed to<br />

ensure the high level of quality and robustness required at the<br />

component level for specific environment conditions.<br />

As this paper has demonstrated, the combination of functional safety measures with IC robustness improvements applied at component and system level helps to reduce the risk of failure in vehicles, and complies with the evolution of the automotive environment.<br />

At a system level, the safety architecture and system design<br />

aim to enable full redundancy and therefore facilitate higher<br />

levels of autonomous driving to achieve fault tolerance in the<br />

case of failure.<br />

VII.<br />

REFERENCES<br />

[1] ISO 26262, “Road vehicles – Functional safety”, International Organization for Standardization, 2011<br />
[2] IEC TR 62380:2004, “Reliability data handbook – Universal model for reliability prediction of electronics components, PCBs and equipment”<br />
[3] AEC-Q100 Rev. H, “Failure mechanism based stress test qualification for integrated circuits”<br />



Safety Architectures on Multicore Processors –<br />

Mastering the Time Domain<br />

Thomas Barth<br />

Department of Electrical Engineering<br />

Hochschule Darmstadt – University of Applied Sciences<br />

Darmstadt, Germany<br />

thomas.barth@h-da.de<br />

Prof. Dr.-Ing. Peter Fromm<br />
Department of Electrical Engineering<br />
Hochschule Darmstadt – University of Applied Sciences<br />
Darmstadt, Germany<br />
peter.fromm@h-da.de<br />
Abstract— A key principle for building safe architectures is a strict separation of normal application code (also referred to as QM code) and safety function code, considering separation not only in the memory and peripheral domains but also in the time domain. Whereas hardware features like memory or bus protection units allow a comparably simple protection of the memory domain, supervision of the time domain is a lot more complex. Race conditions on a multicore system are far more likely and complex than on a single-core system, as there is true parallel execution of code and more asynchronous architectural patterns. Most safety standards, such as IEC 61508 [1] and ISO 26262 [2], require:<br />
• Alive monitoring<br />
• Real-time monitoring<br />
• Control flow monitoring<br />
In this paper we describe a typical signal flow on a multicore safety system and, based on this architecture, introduce an innovative second-level monitoring layer which supervises the real-time constraints of the safety and functional monitoring functions. We demonstrate the use of selected hardware features of the Infineon AURIX and the TLF watchdog chip together with the SafetyOS PxROS from the company HighTec, and show how they can be used in the context of a safety architecture. Furthermore, we demonstrate the use of a combined watchdog / smart power module, which supports not only an emergency switch-off but also the control of multiple power domains and defined reboot sequences in case of system errors.<br />
Keywords—Timing, Control Flow, Functional Safety, Safety Architectures, Multicore, Runtime Environment<br />
I. SAFETY ARCHITECTURE<br />
A very common design pattern for the implementation of a safety architecture is the use of redundant and independent channels. By monitoring and comparing the input, control and output values of both channels, single errors can be detected and the system can be switched into a safe state, as shown in Figure 1.<br />
Figure 1 - Dual channel fail safe architecture<br />
Transferring this architecture onto a multicore system by simply replacing the ECUs with cores will not lead to the same level of reliability, as the probability of common cause failures is higher compared to the discrete setup, due to shared resources, a common power supply and similar factors [3]. Using a safe operating system providing separation techniques will help, but the risks caused by a wrong configuration still remain.<br />
A possible approach to overcome these weaknesses is the introduction of a multi-layer monitoring architecture [4]. The first layer of monitoring functions supervises the coherency of the sensor input data, the calculated control variables and the correct transfer to the actuators, which still can and should be implemented in a multi-channel structure. As long as the monitoring functions work as intended, the system can be assumed to be in a safe operational state [5].<br />



Figure 2 – Multi-layer monitoring architecture<br />

However, what happens if a bug in one of the units impacts the functionality of the monitors? In this case, the system might end up in a dangerous state, as the correct operation of the Input-Logic-Output channel is no longer supervised. Therefore, a second layer of monitoring functions is introduced, which monitors the health state of the two layers below. This health state needs to be actively and periodically reported to an external safety device such as a watchdog chip, which in case of a failure will bring the system into a safe state [6].<br />
The external device is required because, in case of a system error, the main controller might not be able to reach the safe state by itself, e.g. due to an output task which is acting incorrectly or due to misconfigured or frozen safety ports.<br />

The health-monitoring layer can be divided into four major blocks:<br />
• System supervision covers hardware errors reported e.g. by a lockstep core, memory bit flips and similar.<br />
• Memory and bus supervision focuses on the separation in the memory domain, by detecting illegal memory accesses reported by the memory protection unit or access violations on shared busses.<br />
• Timing supervision ensures that critical software components are executed in predefined intervals. Furthermore, possible violations of the agreed real-time constraints are checked, as well as the correct execution order of safety functions. This block is the focus of this paper.<br />
• Last but not least, the peripheral supervision block ensures that all peripheral modules work as expected. Often, access violations can be detected and handled by the core’s safety logic using the MPU and bus protection. In addition, the physical operation of a pin can be checked by reading it back or by using external supervision modules.<br />

II. WHY TIMING SUPERVISION?<br />

The supervision of the time domain is a typical requirement<br />

in most safety standards in order to detect system malfunctions<br />

and to take corrective actions before a system failure might<br />

harm humans or the environment.<br />

Alive monitoring checks if critical functions are executed at all. This is typically done by introducing a watchdog, which needs to be triggered in predefined intervals. If the watchdog is not triggered, the system responds with a hardware reset or a similar action. This supervision is comparably easy to implement; however, the error handling scenarios are limited and usually quite harsh. This technique is often used in fail-safe systems: a failure of the alive monitoring check leads to a transition into a safe state, e.g. by using an external emergency shut-off unit.<br />

Real-time monitoring measures the execution time of safety functions and checks if the defined timing gates are met. This approach can be used to detect the exceeded runtime of a function, caused e.g. by a buggy algorithm, resulting in a late update of data required by a following process and a subsequent system failure.<br />

Control flow monitoring addresses the correct execution order of code. On the level of single functions, this is ensured to a certain extent by using a qualified compiler. In the following code sequence, we can assume that the assignment in line 1 will be executed before the if-statement in line 2, which is followed by the function call in line 4 in case funcA returned the value 4.<br />

1 a = funcA(); //a is global<br />

2 if (4 == a)<br />

3 {<br />

4 funcB();<br />

5 }<br />

Figure 3 - Code snippet control flow<br />

However, what happens if an interrupt occurs between lines 1 and 2, modifying the value of the global variable a? The behavior is probably not as expected. The same might happen if we use a preemptive operating system: here, a higher-priority task may interrupt a lower-priority task when it is activated. This can lead to wrong behavior if not all critical sections are correctly identified and secured. It becomes even more of an issue on multicore systems, where the cores execute code completely independently but share memory and other resources.<br />



III.<br />

ALIVE MONITORING<br />

Alive monitoring is the most basic check: it detects whether a system is alive or not. Being alive does not mean that a system is operational; it simply means that user code is being executed and that the system is not locked inside an infinite loop, ISR, trap or similar.<br />
As alive monitoring aims to check whether software is executed at all, the monitor itself cannot be implemented in software; hardware features have to be utilized to ensure that errors can be detected even if no code is executed. A common hardware feature for alive monitoring is the watchdog timer, which needs to be triggered in predefined intervals. If a watchdog timer is not triggered as expected, it causes a hardware event which can be used for error escalation.<br />

Hardware vendors provide different watchdog timers. The most basic is a hardware counter which automatically counts down from a timeout value; if it reaches zero, a hardware event is triggered. Software has to ensure that the counter value is reset before the counter reaches zero, making it possible to detect whether software has reached the retriggering sequence within a certain interval. However, it is not possible to check whether the watchdog has been triggered multiple times during an interval. Hence, it is only possible to check whether user code is executed within a maximum time, but not whether it is executed with the correct frequency.<br />

A window watchdog features a time window in which it expects to be triggered. Only if it is triggered within the window is it properly reset; if it is triggered outside of the window, it reports an error. Window watchdogs therefore not only allow checking whether user code is executed within a maximum time, but also introduce a minimum time. With window watchdogs, it becomes possible to monitor whether software is executed within defined timing constraints.<br />

On bare-metal systems with a super-loop architecture, the watchdog can be triggered in each iteration of the super loop. However, most systems run an operating system where the alive state is only given if certain tasks are executed periodically. In this case, every periodic task has to be monitored. This can be solved by defining a background or watchdog task, which triggers the watchdog only if all tasks report execution. With this approach, the task triggering the watchdog needs to have a longer cycle time and a lower priority than any of the monitored tasks.<br />

As a major drawback, only cyclic tasks can be monitored using alive monitoring. Event-driven tasks do not have a constant start time; more advanced concepts like deadline monitoring or control flow monitoring are required to secure such tasks.<br />

While the alive state of a single-core controller can be defined quite easily, the alive state of a multicore controller with multiple independent CPUs might be more complex. However, shared resources and similar mechanisms can be used for inter-core communication. In this scenario, a watchdog task on one core can gather information about all the tasks executed, even those on remote cores. Alive monitoring on multicore controllers is manageable, but requires a well-designed overall architecture which considers alive monitoring and inter-core communication.<br />

IV. REALTIME MONITORING<br />

Real-time monitoring measures the execution time of a<br />

software function and compares it against a given design goal.<br />

The following picture shows the most important timing gates<br />

of a software function. The release time R of a function<br />

determines the earliest point of time a function can start. The<br />

start time S is the true time the function will start and the end<br />

time E is the true time the function does end. The deadline D is<br />

the latest time the function may end. The computation time C is<br />

the time, the function is active. As long as the execution of the<br />

function is not interrupted, C=E-S.<br />

Figure 4 - Timing gates<br />

Functional Watchdogs extend the trigger mechanism by<br />

introducing a protocol. Only if the protocol is adhered the<br />

watchdog is triggered, otherwise an error is reported. An<br />

example for a functional watchdog is the question and answer<br />

watchdog, where the watchdog provides a question, which has<br />

to be answered by software. In the most basic fashion, there is<br />

a limited set of questions and the answers are stored in a<br />

constant table. Functional watchdogs add a certain complexity<br />

and allow checking not only if the watchdog is triggered but<br />

also if basic mechanisms of the system are operational.<br />

As long as the conditions R ≤ S and E ≤ D hold for all functions, the system fulfills its real-time requirements.<br />

A very simple and commonly used solution is to measure the idle time of a low-priority background task. As long as the background task is executed, the system is not working at full capacity, at least if all runnables that should have been called in this cycle have been activated.<br />
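The idle-time measurement can be illustrated as follows (a sketch under the assumption of a spinning lowest-priority task and a calibrated idle baseline; all names are our own):

```cpp
#include <cstdint>

// Counter incremented by the lowest-priority task, which runs only
// when no other task needs the CPU.
static volatile uint32_t g_idleCounter = 0;

// Body of the background (idle) task.
void idleTaskTick() { g_idleCounter++; }

// Called once per monitoring cycle. Returns the estimated CPU load in
// percent, given the idle count an otherwise unloaded system reaches
// in one cycle (determined by calibration).
uint8_t cpuLoadPercent(uint32_t idleBaseline) {
    uint32_t idle = g_idleCounter;
    g_idleCounter = 0;  // restart the measurement for the next cycle
    if (idle >= idleBaseline) return 0;  // fully idle (or calibration off)
    return static_cast<uint8_t>(100u - (100u * idle) / idleBaseline);
}
```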

To get a more detailed picture, we can also start a measurement at the entry point of the function and stop it at the end, which yields values for the timing gates S and E. Safety operating systems like PXROS-HR [7] provide special services to abort a function if a timing gate is missed.<br />

www.embedded-world.eu<br />

79


1 abortEventMask = PxExpectAbort(ev, func);<br />

2 if (abortEventMask.events != 0)<br />

3 {<br />

4 //Do some error handling<br />

5 }<br />

Figure 5 - Realtime monitoring using abort functions<br />

In line 1, the function func is called together with an event that will terminate func if the event is activated. A possible configuration would be, for example, a 1 ms timing event. If the function requires more than 1 ms of runtime, it is terminated and error handling can be initiated. Compared to traditional timing measurement, this approach has the advantage that the worst-case execution time of the function is known, allowing accurate detection of timing violations. If a timing violation is caught at the function level, aggressive error escalation at the system level can be avoided.<br />

Another advantage is that we can start the aborting event at the release time R and set it to the maximum cycle time (D - R) to protect all runnables that will be called during this period.<br />

V. CONTROL FLOW MONITORING<br />

In a multicore environment that introduces asynchronous, non-blocking messaging mechanisms between the cores, deadline monitoring alone reaches its limits, as the following example shows.<br />

In order to avoid unintended data corruption on communication channels, only one task in the system is allowed to physically access the communication ports, e.g. CAN. All other tasks that want to use this port send a message containing the data to be transmitted to this service task. The service task queues and transmits the data over the bus and returns an answer protocol to the requesting task using another message.<br />

For data transmission, the requester task thus requires two runnables: one for sending the message and one that is activated upon message receipt to process the return data. Let us assume the following valid sequence of operations:<br />

Figure 6 - Using abort event for deadline monitoring<br />

A more data-centric approach is to store the age of the data together with the data payload. The age metadata is set to 0 whenever a new value is written and is incremented at cyclic intervals, e.g. every 1 ms. Before using the data, the age, and implicitly the call that updated the data, can be verified.<br />

template <class T><br />
class data {<br />
    uint32_t m_age;<br />
    T m_data;<br />
};<br />

The disadvantage of this solution is the comparatively high runtime overhead required for the cyclic data update. This can be limited by increasing the update cycle time, but doing so decreases the precision of the age information.<br />

An alternative approach is to store the absolute time whenever the data is updated. When using the data, the current time is subtracted to obtain the age. As this typically happens only a few times per cycle, the overhead is comparably low and a higher timer resolution can be applied. Furthermore, the absolute time can be used to analyze the control flow to a certain extent.<br />
Figure 7 - Asynchronous communication<br />

Runnable run2() transfers data to a service task on core 2 using an asynchronous message. The service is executed in runnable run3(), and the return value is sent back using another message, which activates runnable run4() on core 1. The received data is stored in shared memory and is used by runnable run5(). Runnable run5() assumes that the data has been updated in the current cycle.<br />

What happens if the service runnable run3() is delayed? If the delay is short, the timeout event will fire, and a correct detection and handling of the error, as shown in Figure 8, is possible.<br />

80


Figure 8 - Asynchronous messaging and deadline monitoring<br />

If the delay increases, we might run into the situation where the update of the data in run4() happens in the next cycle. Core 2 might detect the deadline violation of run3() (at least if the service is executed in a blocking way), but core 1 might not be aware of what has happened, because the timing event supervising the execution of the runnables has been reset at the deadline D. As no code has been executed at that time, the behavior appears to be correct.<br />

Figure 9 - Undetected error with asynchronous messaging<br />

What can be done? One option would be to escalate a possible deadline violation on core 2 to core 1, but this results in a complex error-handling hierarchy.<br />

The key problem is the following: whereas the runnables run1() and run2() are called synchronously, the event-driven runnables run4() and run5() have a rather stochastic start time. Obviously, we have to verify that all runnables are executed in the expected order and within the expected timeline to ensure proper operation of the system.<br />

A comparatively easy approach, which is a first step toward control flow monitoring, is to use the update-time metadata concept introduced in the previous section. By comparing the update time of the message data in run4() with the request time of run2(), the significant delay becomes visible.<br />
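The update-time metadata concept can be sketched like this (hypothetical names; the age-counter variant described earlier stores an incrementing counter instead of an absolute timestamp):

```cpp
#include <cstdint>

// Payload stored together with the absolute time of its last update.
// Field names and the millisecond interpretation are assumptions.
template <typename T>
struct TimedData {
    uint32_t updateTime = 0;  // absolute time of last write, e.g. in ms
    T        payload{};
};

// Writing costs only one additional store per update.
template <typename T>
void writeData(TimedData<T>& d, const T& value, uint32_t now) {
    d.payload = value;
    d.updateTime = now;
}

// Before using the data, verify that it was produced recently enough,
// i.e. that the implicit update call actually happened in this cycle.
template <typename T>
bool isFresh(const TimedData<T>& d, uint32_t now, uint32_t maxAge) {
    return (now - d.updateTime) <= maxAge;
}
```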

Alive and deadline monitoring mainly focus on runtime constraints, whereas control flow monitoring checks whether code is executed in a valid sequence. How can we describe a valid sequence? In the example above, the flow 1-2-3-4-5 is obviously valid. This, however, is only true if all runnables are executed in the same cycle, which adds another condition. Furthermore, runnable run1() is independent of the other runnables and may be executed at any time, i.e. 2-1-3-4-5 would also be a valid sequence. This trivial example shows that describing and validating all rules for a real system becomes very complex and time-consuming.<br />

A compromise is to monitor only those critical sequences and conditions which cannot be detected using the simpler and more robust alive and real-time monitoring approaches.<br />

An alternative implementation is to use token passing. Tokens can be sent from one runnable to another to ensure the correct order of execution. However, tokens also have limitations if there are multiple valid sequences or if the data path is reconfigured at runtime.<br />
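A token-passing check for a single expected chain might look as follows (illustrative only; the runnable IDs and the single-chain layout are assumptions):

```cpp
#include <cstdint>

// Token currently held; 0 marks the start of a cycle.
static uint8_t g_token = 0;

// Called at the entry of runnable `id`, where `pred` is its required
// predecessor. Returns false (a control-flow error) if the token held
// does not match, i.e. the expected execution order was broken.
bool passToken(uint8_t pred, uint8_t id) {
    if (g_token != pred) return false;
    g_token = id;
    return true;
}

// Reset at the start of every cycle.
void resetTokenCycle() { g_token = 0; }
```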

For a multicore system, a key requirement for control flow monitoring is synchronous operation of the cores, which is typically hard to achieve, as the cores might have different boot times. Using a common timer to store the update time of data signals is a possible way to solve this problem.<br />

VI. ERROR ESCALATION AND REACHING THE SAFE STATE<br />

Occasional violations detected by real-time or control flow monitoring might be handled locally without negative impact on the safety function. However, frequent timing violations, as well as violations of the alive monitoring, indicate a severe malfunction of the system and need to be escalated.<br />

One approach to satisfy these needs is the introduction of a warning counter with a threshold. A detected timing violation increases the warning counter, while correct timing decreases it. If a predefined threshold is reached, the timing violation is escalated.<br />
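The warning counter with threshold can be sketched directly (the threshold value and the names are illustrative assumptions):

```cpp
#include <cstdint>

// Warning counter: a timing violation increments the count, a correct
// cycle decrements it (never below zero), and reaching the threshold
// escalates the violation.
class WarningCounter {
public:
    explicit WarningCounter(uint32_t threshold) : m_threshold(threshold) {}

    // Returns true if the violation must be escalated.
    bool reportViolation() {
        ++m_count;
        return m_count >= m_threshold;
    }

    void reportCorrectTiming() {
        if (m_count > 0) --m_count;
    }

    uint32_t count() const { return m_count; }

private:
    uint32_t m_count = 0;
    uint32_t m_threshold;
};
```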

The escalation of critical timing violations needs to be handled in hardware, as it can no longer be guaranteed that software is executed properly. Microcontrollers used for safety applications, such as the Infineon AURIX, feature a Safety Management Unit (SMU), which collects hardware error signals and defines the system reaction to an error. All watchdog error events cause so-called SMU alarms. The SMU reaction to an alarm is configurable; it is possible to send an interrupt/NMI request, to stop certain cores, or to perform a reset. To implement a safety architecture, the SMU needs to be combined with an external watchdog such as the Infineon TLF35584, a multiple-output power supply for safety-relevant applications. In addition to power supply functionality, it provides functional safety features such as voltage monitoring, external watchdogs, and error monitoring.<br />

A companion chip reduces the probability of common-cause failures, as it is equipped with its own vital components such as power supply and clock generation. By creating its own time domain on an external chip, the reliability of the watchdog concept is increased. Standards such as ISO 26262 require the utilization of an external monitor in order to reach higher safety integrity levels.<br />



Furthermore, all power domains are permanently monitored for over-voltage and over-current conditions. In our architecture, the module also performs alive monitoring for all CANopen nodes. The implementation of the logic is based on a Cypress PSoC, where the safety functions are realized in software and programmable hardware. This allows easy adaptation of the system to different user requirements.<br />

Figure 10 - AURIX Microcontroller with TLF companion chip<br />

With the presented methods, it is possible to detect and escalate timing violations within the scope of the microcontroller. However, as the controller is usually part of a larger system, it must be ensured that errors detected within the scope of the controller do not lead to unintended behavior of attached modules, such as actuators.<br />

A possible solution is the implementation of a safe power supply unit. A prototype of such a system has been developed within the publicly funded ZIM project "Future Technology Multicore", which focuses on providing design patterns and solutions for safe multicore applications. The safe power supply "SmartPower" provides the supply voltage as well as boot-up, reboot, and shutdown sequences for three safety domains: ECU, logic modules, and actuators. "SmartPower" is connected to the safe-state pin of the TLF and turns off the system as a last line of defense in case of fatal errors.<br />

VII. REFERENCES<br />

[1] IEC 61508, Part 3: Software requirements / Functional safety of electrical/electronic/programmable electronic safety-related systems, VDE, 2001 (3 July 2001).<br />
[2] ISO 26262, Part 6: Product development: software level / Road vehicles - Functional safety, Geneva, 2011.<br />

[3] J. Barth, et al., 10 Schritte zum Performance Level,<br />

Bosch Rexroth Group, 2014.<br />

[4] Thomas Barth, Peter Fromm, A Monitoring Based Safety<br />

Architecture for Multicore Microcontrollers, Nürnberg,<br />

2017.<br />

[5] Thomas Barth, Peter Fromm, Functional Safety on<br />

Multicore Microcontrollers for Industrial Applications,<br />

Nürnberg, 2016.<br />

[6] Prof. Dr.-Ing. Peter Fromm, Thomas Barth, Mario<br />

Cupelli, Sicherheit auf allen Kernen - Entwicklung einer<br />

Safety Architektur auf dem AURIX TC27x,<br />

Sindelfingen, 2015.<br />

[7] HighTec EDV Systeme GmbH, Tricore Development<br />

Platform User Guide v4.6.5.0, Saarbrücken, 2015.<br />

Figure 11 - Complete safety architecture<br />



Developing Medical Device Software to be<br />

compliant with IEC 62304-Amendment 1:2015<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

Paraphrasing European Union Directive 2007/47/EC of the European Parliament and of the Council 1 , a medical device can be defined as:<br />

“Any instrument, apparatus, appliance, software, material or<br />

other article, whether used alone or in combination … to be<br />

used for human beings for the purpose of:<br />

• Diagnosis, prevention, monitoring, treatment, or<br />

alleviation of disease<br />

• Diagnosis, monitoring, treatment, alleviation of, or<br />

compensation for an injury or [disability]<br />

• Investigation, replacement, or modification of the<br />

anatomy or of a physiological process<br />

• Control of conception”<br />

Given that such definitions encompass a large majority of<br />

medical products other than drugs, it is small wonder that<br />

medical device software now permeates a huge range of<br />

diagnostic and delivery systems. The reliability of the<br />

embedded software used in these devices and the risk<br />

associated with it has been an ever-increasing concern as that<br />

software becomes ever more prevalent.<br />

As an initial response to that concern, the functional safety<br />

standard IEC 62304 3 “Medical device software – Software life<br />

cycle processes” emerged in 2006 as an internationally<br />

recognized mechanism for the demonstration of compliance<br />

with the relevant local legal requirements 4 . The set of<br />

processes, activities, and tasks described in this standard<br />

established a common framework for medical device software<br />

life cycle processes as shown in Figure 1.<br />

FDA's Center for Devices and Radiological Health (CDRH) is<br />

responsible for regulating firms who manufacture, repackage,<br />

relabel, and/or import medical devices sold in the United<br />

States. The FDA’s introduction to its rules for medical device<br />

regulation states 2 :<br />

“Medical devices are classified into Class I, II, and III.<br />

Regulatory control increases from Class I to Class III. The<br />

device classification regulation defines the regulatory<br />

requirements for a general device type. Most Class I devices<br />

are exempt from Premarket Notification 510(k); most Class II<br />

devices require Premarket Notification 510(k); and most Class<br />

III devices require Premarket Approval.”<br />

Figure 1: Overview of software development processes and<br />

activities according to IEC 62304:2006 +AMD1:2015 5<br />

1<br />

"Directive 2007/47/ec of the European parliament and of the council".<br />

Eur-lex Europa. 5 September 2007.<br />

2 https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/<br />

Overview/<br />

3<br />

IEC 62304 International Standard Medical device software – Software<br />

life cycle processes Edition 1 2006-05<br />

4<br />

IEC 62304 International Standard Medical device software – Software life<br />

cycle processes Consolidated Version Edition 1.1 2015-06<br />

5<br />

IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device software<br />

- Software life cycle processes Figure 1 – Overview of software<br />

development PROCESSES and ACTIVITIES<br />



On June 15, 2015, the International Electrotechnical<br />

Commission, IEC, published Amendment 1:2015 to the IEC<br />

62304 standard “Medical device software – software life cycle<br />

processes” 6 . The amendment complements the 1st edition<br />

from 2006 by adding and amending various requirements,<br />

including those relating to safety classification, the handling of<br />

legacy software, and software item separation.<br />

In practice, for all but the most trivial applications, compliance<br />

with IEC 62304 can only be demonstrated efficiently with a<br />

comprehensive suite of automated tools. This paper describes<br />

the key software development and verification processes of<br />

the standard, and shows how automation both minimizes the<br />

cost of development and verification, and provides a sound<br />

foundation for an effective maintenance system once the<br />

product is in the field.<br />

Work on the second, updated edition of IEC 62304 is ongoing.<br />

The 2nd edition will possibly be published in 2018. It seems<br />

very likely that the changed requirements included in<br />

Amendment 1:2015 will be integrated into the updated edition.<br />

II. CLASSIFICATION<br />

One of the more significant changes concerns the new risk-based approach to the safety classification of medical device software. The previous concept was based exclusively on the severity of the resulting harm. Downgrading the safety classification of medical device software from C to B, or from B to A, used to be possible by adopting hardware-based risk mitigation measures external to the software. The amendment replaces this concept with the safety classification shown in the decision tree of Figure 2.<br />

The three classes are defined in the standard as follows:<br />

Class A<br />

The software system cannot contribute to a hazardous<br />

situation, or the software system can contribute to a hazardous<br />

situation which does not result in unacceptable risk after<br />

consideration of risk control measures external to the software<br />

system.<br />

Class B<br />

The software system can contribute to a hazardous situation<br />

which results in unacceptable risk after consideration of risk<br />

control measures external to the software system, but the<br />

resulting possible harm is non-serious injury.<br />

Class C<br />

The software system can contribute to a hazardous situation<br />

which results in unacceptable risk after consideration of risk<br />

control measures external to the software system, and the<br />

resulting possible harm is death or serious injury.<br />

III. PARTITIONING OF SOFTWARE ITEMS<br />

The classification assigned to any medical device software has a tremendous impact on the code development process, from planning, developing, testing, and verification through to release and beyond. It is therefore in the interest of medical device manufacturers to invest the effort to get it right the first time, minimizing unnecessary overhead by resisting over-classification, but also avoiding expensive and time-consuming rework resulting from under-classification.<br />

IEC 62304:2006 +AMD1:2015 helps to minimise<br />

development overhead by permitting software items to be<br />

segregated. In doing so, it requires that “The software<br />

ARCHITECTURE should promote segregation of software<br />

items that are required for safe operation and should describe<br />

the methods used to ensure effective segregation of those<br />

SOFTWARE ITEMS”<br />

Figure 2: Safety classification according to IEC 62304:2006<br />

+AMD1:2015 7<br />

Amendment 1 clarifies the position on software segregation by stating that segregation is not restricted to physical separation, but instead permits "any mechanism that prevents one SOFTWARE ITEM from negatively affecting another", suggesting that separation in software is similarly valid.<br />

6 IEC 62304:2006/AMD1:2015 AMENDMENT 1 - MEDICAL DEVICE SOFTWARE<br />

- SOFTWARE LIFE CYCLE PROCESSES<br />

7<br />

IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device software<br />

- Software life cycle processes Figure 3 – Assigning software safety<br />

classification<br />




Figure 3: Example of partitioning of software items according<br />

to IEC 62304:2006 +AMD1:2015 Figure B.1 8<br />

Figure 3 shows the example used in the standard. In it, a software system has been designated Class C. That system can be segregated into one software item that deals with functionality of limited safety implications (software item X), and another that handles the highly safety-critical aspects of the system (software item Y).<br />

That principle can be repeated in a hierarchical manner, such<br />

that software item Y can itself be segregated into software<br />

items W and Z, and so on – always on the basis that no<br />

segregated software item can negatively affect another. At the<br />

bottom of the hierarchy, software items such as X, W and Z<br />

that are divided no further are defined as software units.<br />

IV. CLAUSE 5. SOFTWARE DEVELOPMENT PROCESS<br />

In practice, any company developing medical device software will carry out verification, integration, and system testing on all software regardless of the safety classification, but the depth to which each of those activities is performed varies considerably. Figure 4, which is based on Table A.1 of the standard, gives an overview of what is involved.<br />

For example, sub-clause 5.4.2 of the standard states that "The MANUFACTURER shall document a design with enough detail to allow correct implementation of each SOFTWARE UNIT." Reference to Figure 4 shows that this requirement applies only to Class C code.<br />

8 IEC 62304:2006/AMD1:2015 Amendment 1 - Medical device<br />

software - Software life cycle processes Figure B.1 – Example of<br />

partitioning of SOFTWARE ITEMS<br />

Software Development PROCESS requirements by software safety CLASS<br />
<br />
Clause | Sub-clauses | Class A | Class B | Class C<br />
5.1 Software development planning | 5.1.1, 5.1.2, 5.1.3, 5.1.6, 5.1.7, 5.1.8, 5.1.9 | X | X | X<br />
5.1 (cont.) | 5.1.5, 5.1.10, 5.1.11, 5.1.12 | - | X | X<br />
5.1 (cont.) | 5.1.4 | - | - | X<br />
5.2 Software requirements analysis | 5.2.1, 5.2.2, 5.2.4, 5.2.5, 5.2.6 | X | X | X<br />
5.2 (cont.) | 5.2.3 | - | X | X<br />
5.3 Software ARCHITECTURAL design | 5.3.1, 5.3.2, 5.3.3, 5.3.4, 5.3.6 | - | X | X<br />
5.3 (cont.) | 5.3.5 | - | - | X<br />
5.4 Software detailed design | 5.4.1 | - | X | X<br />
5.4 (cont.) | 5.4.2, 5.4.3, 5.4.4 | - | - | X<br />
5.5 SOFTWARE UNIT implementation and verification | 5.5.1 | X | X | X<br />
5.5 (cont.) | 5.5.2, 5.5.3, 5.5.5 | - | X | X<br />
5.5 (cont.) | 5.5.4 | - | - | X<br />
5.6 Software integration and integration testing | All requirements | - | X | X<br />
5.7 SOFTWARE SYSTEM testing | All requirements | X | X | X<br />
5.8 Software release | 5.8.1, 5.8.2, 5.8.4, 5.8.7, 5.8.8 | X | X | X<br />
5.8 (cont.) | 5.8.3, 5.8.5, 5.8.6 | - | X | X<br />

Figure 4: Summary of the software safety classes assigned to each requirement of the development lifecycle, with clause 5.4.2 as an example 9 .<br />

IEC 62304 is essentially an amalgam of existing best practice<br />

in medical device software engineering, and the functional<br />

safety principles recommended by the more generic functional<br />

safety standard IEC 61508 10 , which has been used as a basis for<br />

industry specific interpretations in a host of sectors as diverse<br />

9<br />

Based on IEC 62304:2006/AMD1:2015 Amendment 1 - Medical<br />

device software - Software life cycle processes Table A.1 – Summary<br />

of requirements by software safety class<br />

10<br />

IEC 61508:2010 Functional safety of<br />

electrical/electronic/programmable electronic safety-related systems<br />



as the rail industry, the process industries, and earth moving<br />

equipment manufacture.<br />

A process-wide, proven tool suite has been shown to help ensure compliance with such software safety standards (in addition to security standards) by automating both the analysis of the code from a software quality perspective and the required validation and verification work. Equally important, such a tool suite enables life-cycle transparency and traceability into and throughout the development and verification activities, facilitating audits by both internal and external entities.<br />

The V diagram in Figure 5 illustrates how tools can help through<br />

the software development process described by IEC 62304. The<br />

tools also provide critical assistance through the software<br />

maintenance process (clause 6) and the risk management process<br />

(clause 7). Clause 5 of IEC 62304 details the software<br />

development process through eight stages ending in release.<br />

Notice that the elements of Clause 5 map to those in Figure 1<br />

and Figure 5.<br />


Software Requirements Analysis (Sub-clause 5.2) involves<br />

deriving and documenting the software requirements based on<br />

the system requirements.<br />

Achieving a format that lends itself to bidirectional traceability will help in demonstrating compliance with the standard. Bigger projects, perhaps with contributors in geographically diverse locations, are likely to benefit from an application lifecycle management tool such as IBM ® Rational ® DOORS ®11 or Siemens ® Polarion ® PLM ®12 . Smaller projects can cope admirably with carefully worded Microsoft ® Word ® or Microsoft ® Excel ® documents, written to facilitate links up and down the development process model.<br />

This bidirectional traceability of requirements 13 (Figure 6) would be easily achieved in an ideal world. But most projects suffer from unexpected changes of requirements imposed by a customer. What is then impacted? Which requirements need re-writing? Which elements of the code design? What code needs to be revised? And which parts of the software will require re-testing?<br />

Figure 5: Mapping the capabilities of the LDRA tool suite to<br />

the guidelines of IEC 62304:2006 +AMD1:2015<br />

Sub-clause 5.1 Software Development Planning outlines the<br />

first objective in the software development process, which is<br />

to plan the tasks needed for development of the software in<br />

order to reduce risks and communicate procedures and goals<br />

to members of the development team.<br />

The foundations for an efficient development cycle can be<br />

established by using tools that can facilitate structured<br />

requirements definition, such that those requirements can be<br />

confirmed as met by means of automated document (or<br />

“artefact”) generation.<br />

The preparation of a mechanism to demonstrate that the requirements have been met will involve the development of detailed plans. A prominent example would be the software verification plan, which includes the tasks to be performed during software verification and their assignment to specific resources.<br />
Figure 6: An Illustration of the principles of Bidirectional<br />

Traceability<br />

Requirements rarely remain unchanged throughout the lifetime of a project, and that can turn the maintenance of a traceability matrix into an administrative nightmare. Furthermore, connected systems extend that headache into the maintenance phase, requiring revision whenever a vulnerability is exposed.<br />

A requirements traceability tool alleviates this concern by<br />

automatically maintaining the connections between<br />

requirements, development, and testing artefacts and activities.<br />

Any changes in the associated documents or software code are<br />

automatically highlighted such that any tests required to be<br />

revisited can be dealt with accordingly (Figure 7).<br />

11<br />

http://www-03.ibm.com/software/products/en/ratidoor<br />

12<br />

https://polarion.plm.automation.siemens.com/<br />

13<br />

http://www.compaid.com/caiinternet/ezine/westfall-bidirectional.pdf<br />

Bidirectional Requirements Traceability, Linda Westfall<br />



Figure 7: Automating requirements traceability with the<br />

TBmanager component of the LDRA tool suite<br />

Software Architectural Design (Sub-clause 5.3) requires the<br />

manufacturer to define the major structural components of the<br />

software, their externally visible properties, and the<br />

relationships between them. Any software component<br />

behaviour that can affect other components should be<br />

described in the software architecture, such that all software<br />

requirements can be implemented by the specified software<br />

items. This is generally verified by technical evaluation.<br />

Developing the architecture means defining the interfaces<br />

between the software items that will implement the<br />

requirements. Any third-party software integration must be in<br />

accordance with Sub-clause 4.4, “Legacy Software”.<br />

If a model-based approach is taken to software architectural<br />

design using tools such as MathWorks ® Simulink ®14 , IBM ®<br />

Rational ® Rhapsody ®15 , or ANSYS ® SCADE 16 , then their<br />

integration with test tools will make for seamless analysis of<br />

generated code and ensure traceability to the models.<br />

Software Detailed Design (Sub-clause 5.4) involves the<br />

specification of algorithms, data representations, and<br />

interfaces between different software units and data structures<br />

to implement the verified requirements and architecture.<br />

Later in the development cycle, tools can help by generating<br />

graphical artefacts suited to the review of the implemented<br />

design by means of walkthroughs or inspections. One<br />

approach is to prototype the software architecture in an<br />

appropriate programming language, which can also help to<br />

find any anomalies in the design. Graphical artefacts like call graphs and flow graphs are well suited for use in the review of the implemented design by visual inspection (Figure 8).<br />

Figure 8: Diagrammatic representations of control and data<br />

flow generated from source code by the LDRA tool suite aid<br />

verification of software architectural and detailed design<br />

Software Unit Implementation and Verification (Sub-clause 5.5) involves the translation of the detailed design into source code. To consistently achieve the desirable code characteristics, coding standards should be used to specify a preferred coding style, aid understandability, apply language usage rules or restrictions, and manage complexity. The code for each unit should be verified using a static analysis tool to ensure that it complies in a timely and cost-effective manner.<br />

Verification tools offer support for a range of coding standards such as MISRA C and C++, JSF++ AV, HIS, CERT C, and CWE. The better tools will be able to confirm adherence to a very high percentage of the rules dictated by each standard, and will also support the creation of, and adherence to, in-house standards from both user-defined and industry-standard rule sets.<br />

IEC 62304 also requires strategies, methods, and procedures<br />

for verifying each software unit. Amongst the acceptance<br />

criteria are considerations such as the verification of the<br />

proper event sequence, data and control flow, fault handling,<br />

memory management and initialization of variables, memory<br />

overflow detection and checking of all software boundary<br />

conditions.<br />

Unit test tools offer a graphical user interface for the<br />

specification of requirements-based tests and for presenting a list<br />

of all such defined test cases with appropriate pass/fail status.<br />

By extending the process to the automatic generation of test<br />

vectors, such tools provide a straightforward means to analyse<br />

boundary values without creating each test case manually.<br />

Test sequences and test cases are retained so that they can be<br />

repeated (“regression tested”), and the results compared with<br />

those generated when they were first created.<br />
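As a sketch of what such requirements-based boundary tests look like (the unit under test and its limits are invented for illustration), consider a function that clamps a reading to the range 0–100; the test cases sit at the limits and one step either side:<br />

```c
#include <assert.h>

/* Hypothetical unit under test: clamp a raw reading into [0, 100]. */
static int clamp_percent(int raw)
{
    if (raw < 0)   { return 0; }
    if (raw > 100) { return 100; }
    return raw;
}

/* Boundary-value cases: each limit, just inside, and just outside. */
static void test_clamp_boundaries(void)
{
    assert(clamp_percent(-1)  == 0);    /* just below lower bound */
    assert(clamp_percent(0)   == 0);    /* lower bound            */
    assert(clamp_percent(1)   == 1);    /* just above lower bound */
    assert(clamp_percent(99)  == 99);   /* just below upper bound */
    assert(clamp_percent(100) == 100);  /* upper bound            */
    assert(clamp_percent(101) == 100);  /* just above upper bound */
}
```

A tool generating such vectors automatically would derive the same six cases from the declared range without manual effort.<br />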

14 https://uk.mathworks.com/products/simulink.html<br />

15 http://www-03.ibm.com/software/products/en/ratirhapfami<br />

16 http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />


Thorough verification also requires static and dynamic data<br />

and control flow analysis. Static data flow analysis produces a<br />

cross reference table of variables, which documents their type,<br />

and where they are utilized within the source file(s) or system<br />

under test. It also provides details of data flow anomalies,<br />

procedure interface analysis and data flow standards<br />

violations.<br />

Dynamic data flow analysis builds on that accumulated<br />

knowledge, mapping coverage information onto each variable<br />

entry in the table for current and combined datasets and<br />

populating flow graphs to illustrate the control flow of the unit<br />

under test.<br />
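For example (a contrived C fragment, not taken from any real code base), the kinds of anomaly such analysis reports, and a corrected unit, might look like this:<br />

```c
/* Anomalous fragment, as a static data flow analyser would report it:
 *
 *   int limit;            UR anomaly: 'limit' Referenced while Undefined
 *   return limit + 1;
 *
 *   int cached = x * 2;   DU anomaly: value Defined, then never Used
 */

/* Corrected unit: every variable is defined before use,
 * and there are no dead stores.                          */
static int bounded_increment(int limit)
{
    int next = limit + 1;          /* define, then use          */
    return (next > 100) ? 100 : next;  /* single, live data path */
}
```

The cross reference table would then show `limit` and `next` with their types, definition sites, and every use within the unit.<br />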

Software Integration and Integration Testing (Sub-clause<br />

5.6) focuses on the transfer of data and control across a<br />

software module’s internal interfaces and external interfaces<br />

such as those associated with medical device hardware,<br />

operating systems, and third party software applications and<br />

libraries. This activity requires the manufacturer to plan and<br />

execute integration of software units into ever larger<br />

aggregated software items, ultimately verifying that the<br />

resulting integrated system behaves as intended.<br />

Integration testing can also be used to demonstrate program<br />

behaviour at the boundaries of its input and output domains<br />

and to confirm program responses to invalid, unexpected, and<br />

special inputs. The program’s actions are revealed when given<br />

combinations of inputs or unexpected sequences of inputs are<br />

received, or when defined timing requirements are violated.<br />

The test requirements in the plan should include, as<br />

appropriate, the types of white box testing and black box<br />

testing to be performed as part of integration testing.<br />

To show which parts of the code base have been exercised<br />

during testing, the LDRA tool suite has the capability to<br />

perform dynamic structural coverage analysis, both at system<br />

test level and at unit test level. Mechanisms for structural<br />

coverage such as statement, branch, condition,<br />

procedure/function call, and data flow coverage vary in<br />

intensity, and so are specified by the standard depending on<br />

classification.<br />
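The difference in intensity between these coverage criteria can be seen on a small decision (a hypothetical interlock, invented here for illustration): branch coverage only needs the decision to evaluate both true and false, while condition-level criteria such as MC/DC additionally require each operand to be shown to independently affect the outcome.<br />

```c
#include <stdbool.h>

/* Hypothetical interlock: heater may run only if power is good
 * and either the door is closed or an override is active.      */
static bool heater_allowed(bool power_ok, bool door_closed, bool override)
{
    return power_ok && (door_closed || override);
}

/* Branch coverage: two tests suffice (decision true once, false once).
 * MC/DC: a minimal set also varies one operand at a time, e.g.
 *   (T,T,F) vs (F,T,F)  -> power_ok independently flips the result
 *   (T,T,F) vs (T,F,F)  -> door_closed flips it
 *   (T,F,T) vs (T,F,F)  -> override flips it
 */
```

A tool reporting condition coverage would show which of these operand combinations a given test run actually exercised.<br />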

A common approach is to operate unit and system test in<br />

tandem, so that (for instance) coverage can be generated for<br />

most of the source code through a dynamic system test, and<br />

complemented using unit tests to exercise constructs such as defensive<br />

code. It is advisable to re-run (or “regression test”) these test<br />

cases as a matter of course and perhaps automatically, to<br />

ensure that any changed code has not affected proven<br />

functionality elsewhere.<br />

Software System Testing (Sub-clause 5.7) requires the<br />

manufacturer to verify that the requirements for the software<br />

have been successfully implemented in the system as it will be<br />

deployed, and that the performance of the program is as<br />

specified.<br />

V. CLAUSE 6. SOFTWARE MAINTENANCE PROCESS<br />

With the advent of the connected device and the Internet of<br />

Things, system maintenance takes on a new significance.<br />

For any connected systems, requirements don’t just change<br />

in an orderly manner during development. They change<br />

without warning - whenever some smart Alec finds a new<br />

vulnerability, develops a new hack, compromises the<br />

system. And they keep on changing throughout the lifetime<br />

of the device.<br />

For that reason, the ability of next-generation automated<br />

management and requirements traceability tools and<br />

techniques to create relationships between requirements,<br />

code, static and dynamic analysis results, and unit- and<br />

system-level tests is especially valuable for connected<br />

systems. Linking these elements already enables the entire<br />

software development cycle to become traceable, making it<br />

easy for teams to identify problems and implement solutions<br />

faster and more cost effectively. But they are perhaps even<br />

more important after product release, presenting a vital<br />

competitive advantage in the ability to respond quickly and<br />

effectively whenever security is compromised.<br />

Many software modifications will require changes to the<br />

existing software functionality – perhaps with regards to<br />

additional utilities in the software. In such circumstances, it is<br />

important to ensure that any changes made or additions to the<br />

software do not adversely affect the existing code.<br />

Automatically maintaining the connections between the<br />

requirements, development, and testing artefacts and activities<br />

helps alleviate this concern – not just during development, but<br />

onwards into deployment and the maintenance phase.<br />

VI. CONCLUSION<br />

A software functional safety standard such as that<br />

prescribed by IEC 62304 with its many sections, clauses<br />

and sub-clauses may at first seem intimidating. However,<br />

once broken down into digestible pieces, its guiding<br />

principles offer sound guidance in the establishment of a<br />

high quality software development process, not only<br />

leading up to initial product release but into maintenance<br />

and beyond. Such a process is paramount for the assurance<br />

of true reliability and quality—and above all the safety and<br />

effectiveness of medical devices. When used with a<br />

complementary and comprehensive suite of tools for<br />

analysis and testing, it can smooth the way for development<br />

teams to work together to effectively develop and maintain<br />

large projects with confidence in their quality.<br />

VII. WORKS CITED<br />

"Directive 2007/47/ec of the European parliament and of the<br />

council". Eur-lex Europa. 5 September 2007.<br />

US Food and Drug Administration website<br />

https://www.fda.gov/MedicalDevices/DeviceRegulationandGuidance/Overview/<br />



IEC 62304 International Standard Medical device software –<br />

Software life cycle processes Edition 1 2006-05<br />

IEC 62304 International Standard Medical device software –<br />

Software life cycle processes Consolidated Version Edition<br />

1.1 2015-06<br />

IEC 61508:2010 Functional safety of<br />

electrical/electronic/programmable electronic safety-related<br />

systems<br />

IBM Rational DOORS website<br />

http://www-03.ibm.com/software/products/en/ratidoor<br />

Siemens Polarion ALM website<br />

https://polarion.plm.automation.siemens.com/<br />

Object Management Group Requirements Interchange Format<br />

website http://www.omg.org/spec/ReqIF/<br />

Bidirectional Requirements Traceability, Linda Westfall<br />

http://www.compaid.com/caiinternet/ezine/westfallbidirectional.pdf<br />

MathWorks SIMULINK website<br />

https://uk.mathworks.com/products/simulink.html<br />

IBM Rational Rhapsody family website<br />

http://www-03.ibm.com/software/products/en/ratirhapfami<br />

ANSYS SCADE Suite website<br />

http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail: info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Writer<br />

E:mark.pitchford@ldra.com<br />



Certifying Linux: Lessons Learned in Three Years of<br />

SIL2LinuxMP<br />

Andreas Platschek<br />

OpenTech EDV Research GmbH<br />

Augasse 21<br />

2193 Bullendorf, AUSTRIA<br />

andi@opentech.at<br />

Nicholas Mc Guire<br />

OSADL eG<br />

Am Neuenheimer Feld 583<br />

D-69120 Heidelberg, GERMANY<br />

hofrat@osadl.org<br />

Lukas Bulwahn<br />

BMW Car IT GmbH<br />

Moosacher Straße 86<br />

80809 Munich, GERMANY<br />

Lukas.Bulwahn@bmw-carit.de<br />

Abstract—When the SIL2LinuxMP project was started about<br />

three years ago, many non-safety-critical systems using Linux<br />

were already built and in operation. Industry chose this design<br />

mostly due to Linux’s tremendous security capabilities<br />

as well as its unmatched support for modern hardware. Both<br />

requirements are important for modern industrial applications<br />

and can be met using Linux on contemporary multi-core CPUs.<br />

However, the question of whether a safety argumentation for systems<br />

based on Linux can be provided and maintained was still open.<br />

While the ultimate goal of certifying a system based on Linux<br />

has still not been achieved as of today, it definitely is in reach<br />

for the basic components (Linux, glibc, busybox).<br />

The SIL2LinuxMP project was started as an industrial<br />

research project with the goal to find out whether or not it<br />

is possible to build complex software-intensive safety-related<br />

systems using the Linux operating system as its foundation.<br />

During the course of those last years, a number of potential<br />

issues that were seen in the early days turned out to be mostly<br />

manageable, while other problems took us by surprise. The most<br />

striking one is that, to this day, no certified multi-core<br />

CPU (with four or more cores) seems to be available.<br />

This paper not only presents the issues encountered and<br />

status achieved during the last three years, it also discusses the<br />

approaches currently being proposed to resolve them.<br />

These approaches cover all aspects of the system safety lifecycle.<br />

At the system engineering level, we devised appropriate<br />

processes to tailor the safety process, moving from a development<br />

to a controlled selection process. We developed a layered<br />

system hazard analysis to systematically derive adequate safety<br />

properties and demonstrated the capabilities of this analysis on<br />

a use case. We covered the Linux development process with an<br />

assessment for which we devised data mining methods, to quantify<br />

software quality, utilizing the available development data. Based<br />

on this data, we derived statistical arguments to demonstrate the<br />

suitability of the development process. To address residual uncertainty<br />

in the area of Linux source code and its assessment, we<br />

combined the quality assessments for multiple semi-independent<br />

Linux features, capable of mitigating the same systematic fault,<br />

into a single safety argumentation. This multilayer handling of<br />

residual faults is based on a software layers-of-protection analysis.<br />

These approaches provide the necessary means for a Linux<br />

qualification route suitable for up to safety integrity level 2<br />

according to IEC 61508 (SIL2).<br />

Keywords—Linux, Safety, Qualification of Pre-Existing Software<br />

I. INTRODUCTION<br />

Over recent years, industries have announced new developments<br />

that rely on highly complex systems. Examples of this<br />

are autonomous vehicles or shared working environments for<br />

industrial robots and humans. While these new developments<br />

promise great improvements for everyone’s private and work<br />

life, they have so far mostly been tackled from a functional<br />

side. However, beyond that functional side, it will also<br />

be necessary to consider the non-functional properties,<br />

such as safety and security, before these systems can<br />

be used by the general public.<br />

This paper focusses on the safety properties of such highly<br />

complex systems and gives insight into the experience that was<br />

gained during the first three years of the SIL2LinuxMP [1]<br />

project, managed by the Open Source Automation Development<br />

Lab (OSADL) and a number of partner companies from<br />

various industries.<br />

The goal of the SIL2LinuxMP project is to create a framework<br />

that can be used for providing a safety argumentation of<br />

a mainline Linux kernel that guides the certification process<br />

and reduces the effort of the qualification process as far as<br />

possible. We achieve this reduction by automating as much of<br />

the process as possible and by making the quality assessment of<br />

Linux repeatable for the continuously evolving Linux kernel.<br />

In order to verify that the framework is actually usable, we<br />

perform the qualification of Linux for a specific use case.<br />

Before investigating solutions, this paper shows the implications<br />

of the huge step up on the complexity scale, compared<br />

to previously existing safety-critical systems. These<br />

implications, discussed in Section II, justify the need for the<br />

SIL2LinuxMP project and the (for a safety-critical system)<br />

excessively large software stack that it involves.<br />

Then, this paper introduces the most important problems<br />

that are tackled in the project and presents the currently<br />

intermediate results. We split the discussion of these problems<br />

into two sections, depending on whether they have been<br />



anticipated at the beginning of the project (Section III), or<br />

whether they were unexpectedly discovered in the course of<br />

the project (Section IV).<br />

Note that, while this paper presents the more interesting<br />

and novel approaches, a significant part of the work still<br />

involves traditional safety engineering activities, which are not<br />

mentioned explicitly here.<br />

A. Goals and Common Misunderstandings of SIL2LinuxMP<br />

Before we dive into discussion of the technical aspects<br />

in Sections II-IV, we present the goals of the SIL2LinuxMP<br />

project. These goals are described here as clear counter<br />

statements to recurring misconceptions that have caused<br />

confusion with discussion partners. This makes clear what to<br />

expect and what not to expect and it summarizes how the<br />

SIL2LinuxMP project handles the particular issue.<br />

This clarification of goals is essential as it requires a major<br />

paradigm shift for companies that are used to buying and treating the<br />

operating system as a black box. While Linux comes with the<br />

advantage of royalty-free licensing for each deployed system,<br />

the costs are shifted towards building up knowledge about<br />

proper safety engineering in the respective companies.<br />

• Goal: Establish a framework that gives guidance on<br />

how to build Linux-based safety-related systems.<br />

Common Misconception: Some people perceive that<br />

SIL2LinuxMP is creating a product that will be available<br />

in a shrink-wrapped package without additional<br />

engineering effort.<br />

The idea that SIL2LinuxMP creates a packaged product<br />

is a very common misunderstanding, and unfortunately<br />

one that seems to keep some companies from<br />

actively participating in the SIL2LinuxMP project.<br />

The authors do not think that there can and will ever<br />

be a shrink-wrapped package with a Linux version<br />

that can be executed on arbitrary hardware without<br />

any restrictions and still be used for safety-critical<br />

applications.<br />

This is a dream many seem to have (some try to name<br />

that dream SEooC), but unfortunately, this dream is<br />

unrealistic for a variety of reasons. The most important<br />

one is that the interface to the Linux kernel is<br />

rather big. It is just under 1500 API calls, and this<br />

does not even include other interfaces into the kernel,<br />

e.g. the proc or sys pseudo filesystems.<br />

Analyzing it down to the last system call may be<br />

doable in theory, but it definitely is not maintainable<br />

and thus not economical. At the time of this writing,<br />

the list of system calls used in the SIL2LinuxMP use<br />

case is about 30-35 API calls (system calls as well as<br />

library calls such as memset()), with additional<br />

restrictions on their parameters, i.e., only certain flags<br />

are allowed, and on their usage, i.e., some calls may<br />

only be used during system initialization. All of these<br />

API calls are considered commonly used in most of<br />

the existing applications that run on Linux, none of<br />

the more esoteric and rarely used calls are used.<br />

In addition, their usage is restricted to specific combinations,<br />

e.g., allocation of memory requires a combination<br />

of malloc(), mlockall() and memset()<br />

and is further only allowed during system initialization.<br />

Furthermore, we only implement mechanisms to<br />

counter-act use-case specific faults that were revealed<br />

during the hazard analysis.<br />

In contrast to the non-implementable care-free shrink-wrapped<br />

pre-certified Linux package, the goal of the<br />

SIL2LinuxMP project is to create a framework that<br />

guides the safety engineer through the process of<br />

certifying a system based on Linux.<br />

That means it will be necessary to do a specific<br />

analysis for every Linux-based safety-critical system.<br />

Of course with time progressing, there will be certain<br />

patterns that emerge and using Linux in safety-critical<br />

systems will become easier, but nevertheless, an operating<br />

system of this complexity can never be expected<br />

to be resilient against all credible faults in all current<br />

and future applications.<br />
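The restricted allocation pattern mentioned above (malloc(), mlockall() and memset(), permitted only during system initialization) can be sketched as follows; the pool size, function name and error codes are invented for illustration and are not the project's actual coding rules:<br />

```c
#include <stdlib.h>
#include <string.h>
#include <sys/mman.h>

#define POOL_BYTES (64u * 1024u)   /* illustrative pool size */

static unsigned char *g_pool = NULL;

/* Called exactly once during system initialization, never at run time. */
static int init_memory(void)
{
    g_pool = malloc(POOL_BYTES);
    if (g_pool == NULL) {
        return -1;
    }
    /* Touch every byte now so no demand-paging faults occur later. */
    memset(g_pool, 0, POOL_BYTES);

    /* Pin all current and future pages into RAM; this may legitimately
     * fail without sufficient privileges or RLIMIT_MEMLOCK headroom.  */
    if (mlockall(MCL_CURRENT | MCL_FUTURE) != 0) {
        return -2;  /* in a deployed system: fatal initialization error */
    }
    return 0;
}
```

After initialization completes, only the pre-allocated, locked pool is used, so no allocation call appears on any safety-relevant execution path.<br />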

• Goal: Qualify Linux as pre-existing software element<br />

following Route 3S in IEC 61508-3 [2].<br />

Common Misconception: Since Linux is widely in<br />

use, a Proven-in-use strategy can be done.<br />

A proven-in-use argument (named Route 2S in<br />

IEC 61508-3 [2]) seems to be the first naive attempt<br />

for everyone who has never thought about the<br />

issue of Linux certification before. Unfortunately,<br />

proven-in-use qualifies as unusable as soon as one<br />

studies the pre-requisites for the collected historic<br />

data of such an argument in the relevant standards<br />

(e.g. in IEC 61508-7, C.2.10.1 [2]).<br />

In contrast, the SIL2LinuxMP project provides the<br />

safety qualification argument for the pre-existing software<br />

elements (Linux kernel, glibc, busybox) following<br />

IEC 61508, Route 3S. This means we provide an<br />

argument explaining why the development process of<br />

those pre-existing software elements satisfies the high<br />

standards of IEC 61508.<br />

Nevertheless, the SIL2LinuxMP project makes use of<br />

the popularity and wide usage of Linux by considering<br />

it in the selection process (see Section III-A for<br />

details)—but only as an additional parameter for (de-)selection<br />

and not as the sole or main argument.<br />

• Goal: Provide a minimal run-time environment for<br />

safety-critical applications up to a level of SIL 2.<br />

Common Misconception: A full-fledged distribution,<br />

e.g. Debian, Yocto, will be available.<br />

Unfortunately, running a full-fledged distribution with<br />

bells and whistles is not seen as doable (or practical,<br />

for that matter), as the code base of all packages in a<br />

distribution is just too big.<br />

Therefore the SIL2LinuxMP project is restricted to<br />

the Linux kernel, a minimum set of standard libraries<br />

and a minimum run-time environment (based on<br />

busybox). While some out there may still hope for<br />

a SIL2-certified Android or Yocto-based system,<br />

including graphics stack and everything—this is<br />

certainly not our goal.<br />



Co-location of a SIL0 container in an overall mixed-criticality system is being considered. While this will allow somewhat more non-safety functionality to run in parallel, it also will not provide a full-fledged Linux distribution without any restrictions.<br />

The SIL2LinuxMP approach is to keep those software elements in the QM/SIL0 1 container, where all kinds of non-safety applications can be executed (see Figure 3), and to keep the safety-critical application to a bare minimum. Integration of applications with mixed criticality is done using isolation mechanisms in the Linux kernel, as described later in Section III-C.<br />

1 Without going into detail, please note that QM/SIL0 does not mean arbitrarily crappy code may be executed in the QM/SIL0 container!<br />

II. IMPLICATIONS OF THE COMPLEXITY INCREASE<br />

As already mentioned in the introduction, a significant increase in complexity is happening in many industries. This increase in complexity is driven by the applications that are being developed for the near (and not so near) future. We derive a number of implications from these anticipated applications that have a significant impact on the non-functional properties of the systems.<br />

1) Computing Performance – The performance needs of these highly complex applications are much higher than in traditional systems. This implies not only much higher energy consumption [3], but also that CPUs that are to date not used in industrial applications but, e.g., in the server market, need to be used, as otherwise the necessary performance cannot be provided.<br />

2) Concurrent Computation Capabilities – The above-mentioned need for state-of-the-art processors due to performance requirements also mandates an operating system that is able to manage such modern multi-core CPUs and to take advantage of as many performance-enhancing features as possible.<br />

3) Security – Another common need across all industries is the need to connect systems to the outside world—often through the internet. This inevitably leads to security issues. While SIL2LinuxMP does not focus on security, the project has set itself the goal of checking every design decision on the architecture in order to assure that the design does not conflict with generally applied security concepts.<br />

These properties of a computing platform for up-coming high-complexity applications make Linux the prime suspect as the basis for such a computing platform, since Linux has been used in high-performance as well as security-demanding applications for many years. Therefore it provides a number of mechanisms that allow an optimal utilization of the given resources while providing outstanding security capabilities (protection and monitoring).<br />

However, the usage of Linux for safety-critical systems has so far been restricted to some very specific cases [4], [5], and a general approach to certifying an unmodified mainline kernel is not available, even though it has been discussed in the past [6]. To close this gap and allow the usage of Linux-based computing platforms that satisfy the performance as well as safety needs of future applications, the SIL2LinuxMP project was started.<br />

III. ANTICIPATED RESEARCH QUESTIONS<br />

This section gives a summary of the research questions that we anticipated from the start of the project. This does not mean that the solutions were fully clear, only that the problems were recognized in principle, with some concepts in place on how to tackle them.<br />

A. From Implementation to Selection<br />

The main difference between the SIL2LinuxMP platform and the approach commonly used is that the basic software elements are pre-existing software that has not been subject to dedicated development.<br />

This means that there are no provisions in the lifecycle for fault elimination, and this implies a strong concentration on fault mitigation—a fundamentally wrong approach to system safety. Thus the safety lifecycle is adjusted to mitigate this flaw. Specifically, the V-model is split into two parts, where the upper part is the system specification and architecture. This part is developed for this particular system and thus follows the regular Route 1S development as defined in IEC 61508 [2]. The bottom part is where, usually, the dedicated software would be designed, developed and integrated. Since pre-existing software elements are used, this bottom part is replaced by a software selection process, as shown in Figure 1.<br />

Fig. 1. Safety Lifecycle: Selection Process for Pre-Existing Elements.<br />

This adapted lifecycle model describes the workflow of how elements are selected. Depending on the element, the possible selection items vary. For the elements in the SIL2LinuxMP project, the following variables are up for selection:<br />

• Kernel Version – A new stable version of the Linux kernel is released approximately every two months. In<br />


addition about once a year one of these stable versions<br />

is made a long-term stable (LTS) version.<br />

Not every version is as stable as the others, thus an<br />

important part of the SIL2LinuxMP projects selection<br />

process is to select stable kernels (see Section III-D<br />

for details on how to use development data to identify<br />

a stable version), that are long-term supported (LTS)<br />

and ideally used in a very broad context (e.g. used by<br />

major distributions) and thus well tested.<br />

• Kernel Configuration – The goal of the<br />

SIL2LinuxMP project is not to provide a certificate<br />

of the full kernel and allow it to be configured<br />

however one might want to, but rather to establish<br />

a framework that allows the certification of one<br />

specific configuration, the assumption being that this<br />

configuration is reduced to functions that are:<br />

◦ Needed by the application to satisfy functional<br />

or non-functional requirements, while their selection<br />

is driven by the maturity and quality of<br />

the respective candidate, or<br />

◦ Established as de-facto standard, i.e., a kernel<br />

configuration without that configuration item<br />

would be far away from every other kernel<br />

configuration in use.<br />

Thus the kernel configuration also allows the (de-)selection<br />

of certain subsystems based on criteria such<br />

as their development history, novelty of design, size,<br />

and known bug rate.<br />

• Non-Kernel Elements – In addition to the kernel,<br />

other pre-existing elements will be used, e.g., C library,<br />

and math library. Usually, different variants<br />

for these libraries are available. For these non-kernel<br />

elements, the selection process thus starts with the selection<br />

of the variant that shall be used. This selection<br />

can—amongst the criteria used for the kernel itself—<br />

also include the extent of deployment and the activity<br />

of the development. Using a C library that is only<br />

deployed in a few systems and was not maintained<br />

for some years just does not make any sense. Instead<br />

the advantage of a broad and active community has<br />

to be considered by selecting an active and healthy<br />

project.<br />

B. Adapting Methods<br />

One issue in many safety-related projects is tailoring<br />

methods or arguing entirely new methods. The SIL2LinuxMP<br />

project with the outlined selection process of pre-existing<br />

software elements (see Section III-A) requires that not only<br />

the software (i.e. source code) itself be considered, but<br />

also the development environment, including all tools that are<br />

used in the development of the pre-existing software elements.<br />

This investigation entails an analysis<br />

of the contribution to safety by the tools (if any).<br />

Furthermore, the increased complexity requires that methods<br />

not suitable for such applications are replaced by state-of-the-art<br />

methods that can handle this increased complexity.<br />

Unfortunately the methods suggested by standards, such as<br />

IEC 61508 [2] are in large parts rather antiquated and proper<br />

replacement needs to be argued.<br />

Finally, the application of methods to pre-existing elements<br />

also changes the properties of applied established methods,<br />

e.g., the retrospective specification of a pre-existing element<br />

cannot provide a contribution to “Freedom from intrinsic<br />

specification faults, ...” as suggested in the measures and<br />

techniques properties of Annex C, i.e., 61508-3 Table C.1.<br />

Thus the list of methods that need to be tailored or newly<br />

introduced is quite significant and we considered it as good<br />

practice to not try and wildly argue methods, but rather build<br />

up a systematic process that not only allows the argumentation<br />

of tailored as well as new methods, but also shows that<br />

the original intent of IEC 61508 [2] is addressed with regard<br />

to the covered properties by the new set/combination of used<br />

methods.<br />

Fig. 2. Example of Tailoring a Method to the Context of Modern Computing Platforms.<br />

In Figure 2, the workflow for tailoring a method is outlined.<br />

Basically, it shows the transition (represented by the black<br />

arrow with the dotted line) from a current method that shall<br />

be tailored/replaced by a new/tailored method.<br />

The approach taken (see the red path through area (1)<br />

in Figure 2) is to reverse-engineer the rationale in IEC 61508-7<br />

in order to find out what the intent of the original method was. Then<br />

IEC 61508-3, Annex C is used to retrieve the properties that<br />

the method contributes to the system development. Now the<br />

contribution to those properties by the newly introduced (or<br />

tailored) method is evaluated.<br />

Next (following the green path going through area 2○ in Figure 2), the contribution to the properties by the new methods is compared to the contribution of the original method. Note that at this point, multiple new methods could be used to replace one original method. If this is the case, the contributions of all the (relevant) new methods are compared to the contribution of the original method, at least at a semi-quantitative level.<br />

The last step (following the blue path going through area 3○ in Figure 2) is to perform a gap analysis between the contributions to the properties by the new and original method(s) and, if necessary, to adjust the method set to cover those gaps. This iterative process is repeated until appropriate coverage is reached.<br />
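The tailoring loop described above can be sketched as a simple set computation: each method is represented by the set of Annex C properties it contributes to, and the gap analysis is the set difference between the original method's contribution and the combined contribution of its replacements. This is a minimal illustration only; the property names are hypothetical, and the real comparison is at least semi-quantitative rather than a pure yes/no set membership.

```cpp
#include <algorithm>
#include <cassert>
#include <iterator>
#include <set>
#include <string>
#include <vector>

using PropertySet = std::set<std::string>;

// Union of the Annex C properties contributed by a set of new/tailored methods.
PropertySet combinedContribution(const std::vector<PropertySet>& methods) {
    PropertySet all;
    for (const auto& m : methods)
        all.insert(m.begin(), m.end());
    return all;
}

// Gap analysis: properties of the original method not yet covered by the
// combination of new/tailored methods; the iteration stops once this is empty.
PropertySet coverageGap(const PropertySet& original,
                        const std::vector<PropertySet>& replacements) {
    const PropertySet covered = combinedContribution(replacements);
    PropertySet gap;
    std::set_difference(original.begin(), original.end(),
                        covered.begin(), covered.end(),
                        std::inserter(gap, gap.begin()));
    return gap;
}
```

A non-empty result of `coverageGap` corresponds to area 3○ in Figure 2: the method set has to be adjusted and the analysis repeated.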

C. Isolation Properties<br />

One of the key capabilities of the Linux kernel that is widely used is its set of isolation mechanisms. Previously, these were mainly used in security-aware systems; but nowadays, with the rise of containers used for simplified system deployment, system administration and application development, these isolation mechanisms are used in virtually every running system.<br />

These isolation mechanisms are used to build containers that are isolated from the rest of the system, in order to make the failure of the contained applications independent of the core system, to limit the impact of a security breach to the container, and to assure that unrelated applications cannot crash each other or the core system.<br />

The availability of multiple independently designed isolation and protection mechanisms makes it possible to build up layered isolation architectures. Figure 3 illustrates how independent applications of mixed criticality are isolated using multiple layers of protection.<br />

Fig. 3. SIL2LinuxMP – One Possible Architecture (SIL 2 safety applications with glibc and seccomp on dedicated CPUs and RAM banks, a SIL 0 Debian container, and a monitored busybox/glibc base system).<br />

Notably, there are two boundaries on which these isolation<br />

properties are used in SIL2LinuxMP:<br />

• to achieve isolation between independent applications<br />

(of different criticality), and<br />

• to assure that API constraints specified by the result<br />

of the hazard analysis (see Section IV-B) are honored.<br />

While this immediately sounds intriguing (having unrelated applications, even of mixed criticality, without the worry of inter-dependence between those applications), the old problem that appears when investigating the usability for safety is always how to verify that this is adequately safe, and that the isolation properties that allow independence of application failures are trustworthy.<br />

In this particular case, it was realized that this problem is similar to one well known in the process industry, well formulated by Audrey Canning:<br />

”Yet further concerns relate to whether a consequence can<br />

be so severe that the frequency of the hazardous situation<br />

should not be taken into account, thus negating the concept<br />

for ’risk’ in selecting the appropriate set of implementation<br />

techniques. In order to address this concern IEC 61511 formalized<br />

the concept of ’layers of protection’ requiring diversity<br />

between the different layers.” [7]<br />

The situation in this particular case is similar insofar as the risk cannot be evaluated because the frequency of the hazardous situation cannot be obtained, at least not with reasonable effort. For that reason, our solution leans on the basics of layers of protection analysis (LOPA). The intention is to assign multiple layers of protection for each class of hazards. The (usually already very unlikely) event of the hazardous situation will then only happen undetected if all the layers of protection fail at the same time, making the event of the hazardous situation even less likely, arguably extremely unlikely.<br />

In order to employ a LOPA and truthfully conclude the previous assertion, the isolation mechanisms must satisfy the basic properties of independent protection layers (IPL), cf. [8, Section 1.3]:<br />

• Independence: a LOPA only comes to a correct risk assessment if the protection layers are sufficiently independent from each other. If physical isolation is the target, then this might be problematic with software; therefore the focus is shifted towards logical isolation.<br />

• Effectiveness: It needs to be assured that the functionality of the layer protects against or mitigates the studied consequences and works even if the hazard happens. Each layer of protection shall provide sufficient protection or mitigation on its own; the use of multiple layers is meant to add an additional safety net for weaknesses in the argumentation or analysis of the individual layers. This does not mean that the developers are allowed to simply skip an analysis; it only refers to situations where the exact quantitative frequency of failure is simply not available or carries an uncertainty that is too high for a proper argumentation.<br />

• Auditability: It shall be possible to inspect the design<br />

and development of the IPLs as well as the IPLs<br />

themselves to assure the safety of the individual IPLs.<br />

Since the SIL2LinuxMP project uses only free/libre<br />

open-source software (FLOSS) the source code of<br />

the IPLs themselves is available for analysis, as is<br />

a plethora of documentation, and the development<br />

history (revision control system, mailing lists, bug<br />

reports, etc.).<br />

With this LOPA executed, we show that the isolation<br />

mechanisms are sufficient to provide a proper logical isolation<br />

of the different applications running on the same, shared,<br />

Linux-based system.<br />
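Under these independence assumptions, the quantitative core of a LOPA is simple: the frequency of an undetected hazardous event is the initiating frequency multiplied by the probability of failure on demand (PFD) of each independent protection layer. The sketch below is illustrative only; the numbers in the usage note are hypothetical and not taken from the project.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Under the independence assumption of a LOPA, an event passes undetected
// only if every protection layer fails at the same time, so the resulting
// frequency is the product of the initiating frequency and all layer PFDs.
double undetectedEventFrequency(double initiatingFrequencyPerYear,
                                const std::vector<double>& layerPfds) {
    double f = initiatingFrequencyPerYear;
    for (double pfd : layerPfds)
        f *= pfd;  // each independent layer must fail for the event to pass
    return f;
}
```

With a hypothetical initiating frequency of 1e-3 per year and two layers with a PFD of 1e-2 each, the undetected frequency drops to 1e-7 per year, which is what makes the hazardous event "arguably extremely unlikely".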



D. Statistical Modelling<br />

Traditionally, safety-critical software development follows a rigorous development process guided by the relevant standard. The assumption is that this rigorous development process leads to a residual bug rate that satisfies the targeted SIL. Alternatively, we can ask the question: What process is responsible for the presence of a software fault? We anticipate answering this question with statistical methods in SIL2LinuxMP. In other words, we aim to provide statistical statements about faults introduced by a stochastic (human) process in the development lifecycle activities.<br />

Traditional safety-related systems had, not too surprisingly, no means of quantifying systematic faults in software, due to the small software size and the statistically low number of iterative re-designs. Instead, a qualitative defense with deeper analysis was considered. Essentially, this pans out to taking sets of methods, having these sets applied by qualified teams of engineers, and wrapping all of this into a controlled process for which metrics can serve as indicators of the systematic capabilities bestowed on the software elements.<br />

Provided adequate trace data and development metadata are available for such a pre-existing element, we are able to infer adequate process compliance on a statistical basis through an indirect metric on the non-compliant development. The basic model is depicted in Figure 4.<br />
Fig. 4. Principle of statistical modelling using development data (CMMI-style process: review, testing, audit, ... versus defects over time; FLOSS process: review, testing, usage, ... versus defects over time).<br />
Establishing process compliance for a highly complex software element, e.g., the Linux kernel, is a two-step process:<br />
1) Establish the principle existence of mandatory activities; essentially this is what route 3S in IEC 61508-3 Ed 2 7.4.2.13 a-i encodes, and<br />
2) Establish the actual effectiveness of these methods based on statistical analysis of process metadata.<br />
The Linux kernel developers define a quite rigorous development process [9] which, in principle, can address most of the requirements for a structured and managed process set forth in IEC 61508 Ed 2 part 3. But clearly this FLOSS project lacks the safety management structure to claim any particular rigor of the applied methodology, even though rigor R1 (see IEC 61508-3 Ed 2 Annex C.1.1: ”R1: without objective acceptance criteria, or with limited objective acceptance criteria. E.g., black-box testing based on judgement, field trials.”) might seem to call for very little. If we statistically establish a clear relation between the activities called for and effective findings, we can establish an overall claim of achievement, in principle, of the objectives of IEC 61508; specifically, the fifth objective of clause 7.4 is being addressed here [2, Part 3, 7.4]: ”The fifth objective of the requirements of this sub-clause is to verify that the requirements for safety-related software (in terms of the required software safety functions and the software systematic capability) have been achieved.”<br />
Such a predictive model is an evaluation of the presumed stable underlying process of development, not an assessment of the systematic faults in a particular version itself. To achieve this, we model successive cycles of the kernel DLC and deduce continuity and trends of improvement for the overall kernel as well as for a particular selected configuration. At the heart, these are regression analyses of the -stable releases for Long-Term-Stable (LTS) kernels using negative-binomial regression models bootstrapped on the development trace data of the Linux kernel.<br />

Fig. 5. Linux 4.4 patches development over SUBVERSIONS (Use-Case config).<br />

Modeling the development of patches over stable kernel releases, based on the development of patches shown in Figure 5 and on the analysis of the specific hunks of applicable patches shown in Figure 6 for the selected kernel feature set, allows estimating the residual bugs in the kernel as well as making an overall judgement of process robustness. The goal of such models is not to imply that we know the number of yet-to-be-discovered bugs in the kernel; rather, they allow judging whether the development is comparable to a bespoke development and, equally important, whether the rate of reported bugs can be managed in a safety lifecycle.<br />

Fig. 6. Linux 4.4 applicable hunks development over SUBVERSIONS (Use-Case config).<br />
From the current, still quite limited set of root-cause analysis data, in which bug fixes in -stable kernels were analyzed, we estimate that ≤ 1/30 of the bugs are safety-related for our specific use case, and thus the expected number of bugs is manageable.<br />
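As a back-of-envelope illustration of this estimate (only the ≤ 1/30 fraction comes from the analysis above; the fix counts in the usage are hypothetical):

```cpp
#include <cassert>
#include <numeric>
#include <vector>

// Scale the observed bug-fix counts of the -stable subversions by the
// estimated fraction of safety-related bugs (<= 1/30 in the use case above)
// to obtain an expected number of safety-related bugs to be managed.
double expectedSafetyRelatedBugs(const std::vector<int>& fixesPerSubversion,
                                 double safetyFraction) {
    const int totalFixes = std::accumulate(fixesPerSubversion.begin(),
                                           fixesPerSubversion.end(), 0);
    return totalFixes * safetyFraction;
}
```

For example, 300 fixes across the considered subversions with a 1/30 safety-related fraction yield on the order of ten bugs that the safety lifecycle actually has to handle.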

The regressions, though, assume that bugs are field findings and thus discovered by a time-dependent process; naturally, this is not true for all bugs. The recently emerging Meltdown and Spectre bugs, which led to a very significant update of critical kernel elements, demonstrate that such predictions have their limits. Nevertheless, this is a first quantification of potential impact and thus an important metric for selecting a particular kernel version and configuration.<br />

IV. UNEXPECTED TROUBLES<br />

While the topics presented above were known from the beginning, there were some further issues that we did not anticipate at the start of the SIL2LinuxMP project.<br />

A. Impact Analysis<br />

A question that is raised quite often in discussions is: How do you plan to certify an operating system with 19 million lines of code? The simple answer is: this was never the plan.<br />

As already mentioned in Section III-A, the selection of the configuration items is part of the safety development lifecycle. Essentially, the selection is not only a step to eliminate faults and minimize residual faults, but also a step that dramatically reduces the code base.<br />

The Linux kernel configuration ensures that only a fraction of the code base is actually used. In addition, every analysis that is done (e.g., the statistical models presented in Section III-D) should ultimately focus on those commits that have an impact on the specific configuration.<br />

In order to be able to do this, the SIL2LinuxMP project uses two tools developed in the context of the project:<br />

• The minimization [10] tool was developed by Hitachi in the context of the SIL2LinuxMP project. Based on the kernel configuration, it produces a code base from which all unused code is stripped, based on the C macros used for configuration.<br />

• The patch impact tester (PIT) is a tool that is currently still under development (but shows very promising results). In contrast to the minimization tool, it does not work on a single version of the kernel; rather, it tests whether a given patch has an impact in a given configuration. This way, the development data of the individual changes is preserved, and the number of commits that have to be considered is reduced. While this problem may seem trivial at first, there are quite a number of cases where it is not.<br />

The PIT itself is based on a GCC plugin. This plugin is used when compiling the kernel with the configuration that shall be used. The information provided by this plugin is collected in a database, which can then be used to check whether a given patch has an impact on the configuration or not.<br />
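The core question the PIT answers can be sketched as follows. This is a deliberate simplification: the real tool works on the preprocessor-resolved hunks recorded by the GCC plugin in its database, not on whole files, and the file names below are hypothetical.

```cpp
#include <cassert>
#include <set>
#include <string>
#include <vector>

// A patch is considered to have impact on a configuration if it touches
// at least one source file that the compiler plugin recorded as being
// part of the configured build.
bool patchHasImpact(const std::set<std::string>& filesInBuild,
                    const std::vector<std::string>& filesTouchedByPatch) {
    for (const auto& file : filesTouchedByPatch)
        if (filesInBuild.count(file) != 0)
            return true;  // the patch changes code that is actually compiled
    return false;
}
```

Patches with no impact can be dropped from the analyses described in Section III-D, which is what shrinks the set of commits that have to be considered.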

B. HD³ – Hazard-driven Decomposition, Design and Development<br />

While the complexity of the use case considered in the SIL2LinuxMP project is far below the complexity of intended future applications, e.g., autonomous driving, it became obvious during hazard analysis that this kind of complexity is not controllable with traditional hazard analysis methods. For that reason, a new approach based on the hazard and operability study (HAZOP) method was investigated. This new approach is called Hazard-driven Decomposition, Design and Development (HD³).<br />

The primary premise of HD³ is that the design of the system shall be driven by the identified hazards. The idea is to use the hazards as design input and to eliminate them at the design level where possible. This way, the need for mitigation mechanisms is minimized, preventing the system complexity from increasing unnecessarily.<br />

The general procedure of HD³ is to start with the basic functionality of the system. In the SIL2LinuxMP use case, this was ”Measure the quality of water.” Based on this basic functionality, a technology-agnostic process was derived, i.e., the process as if done by a biochemist in a laboratory setup. This technology-agnostic process is then subjected to the first hazard analysis. A traditional HAZOP is conducted on the technology-agnostic process, revealing the hazards at this highly abstract level. Elimination conditions and mitigation capabilities are recorded in the form of Safety Application Conditions (SACs) for each of the analysis levels. These SACs are then consolidated into a set of derived items, still at a technology-agnostic level.<br />

The results of the hazard analysis at the technology-agnostic level are used as input for a technology-aware design while still staying technology-unspecific. That means that an automated system is designed by allocating unspecific devices (motors, pumps, valves, sensors, etc.) to perform the actions that are performed by the biochemist in the technology-agnostic process. At this level, no specific device is yet allocated (e.g., only a ”pump” is used, without knowing whether it is a diaphragm pump, radial pump, peristaltic pump, etc.).<br />

This technology-aware unspecific design is then fed into yet another hazard analysis. The result of this second round of hazard analysis is used to create a more detailed technology-aware design using specific devices. The important part here is that, based on the hazards at the higher level of abstraction, it was possible to select specific devices that are inherently safe against a number of the specifically identified faults. This leaves a limited (ideally minimized) set of hazards that cannot be eliminated this way. Only for this limited set does a mitigation mechanism have to be introduced into the system design to assure a low residual probability of failure.<br />

The actual allocation of mitigations can finally be at the level of the specific safety-related application or at the unspecific (generic) level of selected elements (see LOPA).<br />

The result of the second hazard analysis is then used to<br />

go into a third level of hazard analysis where the unspecific<br />

devices are replaced by the selected specific devices.<br />

It is important to note that the HD³ approach is then completed with further intermediate layers of derived requirements that are necessary to obtain the hazard information needed for the next level of design.<br />

Furthermore, each hazard analysis also emits Safety Application Conditions (SACs) that put conditions on the system that have to be met. The HD³ approach results in these SACs showing hierarchical properties, as SACs from higher levels of abstraction map to more fine-grained SACs at the detailed level. For example, the SAC ”Critical data must be verified after write.” at the technology-agnostic level is refined as follows: at the unspecific technology-aware level, it is reflected in the form of ”Grey-channel the storage media.”, and at the specific technology-aware level it becomes: ”Written data must be read back.”, ”Individual measurement values shall be timestamped.”, ”A CRC shall be stored for individual measurement values.”<br />
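The refined SACs at the specific level map almost directly to code. The sketch below uses hypothetical types, and the CRC-32 merely stands in for whatever checksum the real system prescribes; it shows a stored measurement that carries its own timestamp and CRC so that written data can be verified on read-back:

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Bitwise CRC-32 (reflected, polynomial 0xEDB88320), as a stand-in checksum.
std::uint32_t crc32(const unsigned char* data, std::size_t len) {
    std::uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return crc ^ 0xFFFFFFFFu;
}

// "Individual measurement values shall be timestamped." and
// "A CRC shall be stored for individual measurement values."
struct StoredMeasurement {
    double value;
    std::uint64_t timestampUs;
    std::uint32_t crc;  // CRC over value and timestamp only
};

std::uint32_t measurementCrc(const StoredMeasurement& m) {
    unsigned char buf[sizeof m.value + sizeof m.timestampUs];
    std::memcpy(buf, &m.value, sizeof m.value);
    std::memcpy(buf + sizeof m.value, &m.timestampUs, sizeof m.timestampUs);
    return crc32(buf, sizeof buf);
}

StoredMeasurement makeMeasurement(double value, std::uint64_t timestampUs) {
    StoredMeasurement m{value, timestampUs, 0};
    m.crc = measurementCrc(m);
    return m;
}

// "Written data must be read back.": re-verify the CRC after reading back.
bool readBackVerified(const StoredMeasurement& readBack) {
    return measurementCrc(readBack) == readBack.crc;
}
```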

This crude example shows how SACs are refined while conducting the hazard analysis at the various levels of abstraction. Similarly, the required API manifests itself. What cannot be seen in this example is that the SACs and the used API calls that are part of the result constitute<br />

• a minimum set of parameter-constrained API calls,<br />

and<br />

• a maximum set of constraints.<br />

This reduces the functional subset that needs to be analyzed, so that the complexity of the software components is handled with minimized effort.<br />

While the experience with HD³ is still limited due to its novelty, the first results look very promising, and a thorough analysis from the high-level design down to the implementation was possible for the SIL2LinuxMP use case. A traditional hazard analysis might have been possible for this level of complexity, but from our experience the effort to do so would have been significantly higher; more importantly, honoring the important rule of ”first eliminate, then mitigate” would not have been possible to the same extent.<br />

V. CONCLUSION<br />

While the goal of a SIL2-certified platform has not been<br />

reached within the first three years of the SIL2LinuxMP<br />

project, partly due to the lack of certified multi-core CPU<br />

hardware, it has been shown that this goal is not out of reach,<br />

especially for our main investigation subject, the Linux kernel.<br />

The above sections present the progress that has been made in various parts of the safety lifecycle. The biggest issue in the project's endeavor was to find ways to handle the complexity. First, this was achieved using the HD³ method (introduced in Section IV-B) to perform the hazard analysis and the system design. Second, the software LOPA (Section III-C) provides argumentation for the partitioning of the problem by separating applications of the same and of different criticality and allowing them to be handled separately. Furthermore, the impact analysis described in Section IV-A allows the automatic reduction of the Linux kernel code base to those lines of code that have a direct impact on the specific configuration in use, reducing the effort since everything else can be discarded.<br />

For the re-use of pre-existing open-source elements, the most important steps were the transition from a traditional V-model for software development to the selection process outlined in Section III-A, as well as the systematic process for arguing the use of new methods and the tailoring of existing methods, discussed in Section III-B.<br />

In summary, the overall progress of the SIL2LinuxMP<br />

project is at a point where the authors are confident that the<br />

goal can be achieved by completing these sketched activities.<br />

VI. ACKNOWLEDGEMENTS<br />

The SIL2LinuxMP project is organized by the Open Source Automation Development Lab (OSADL), as well as the SIL2LinuxMP partner companies. We thank them for their support and their funding.<br />

REFERENCES<br />

[1] OSADL, SIL2LinuxMP Webpage, https://www.osadl.org/SIL2LinuxMP.sil2-linux-project.0.html, 2016<br />

[2] IEC 61508 Edition 2, Functional safety of electrical/electronic/programmable electronic safety-related systems, IEC, 2010<br />

[3] Bloomberg, Driverless Cars Giving Engineers a Fuel Economy<br />

Headache, https://www.bloomberg.com/news/articles/2017-10-<br />

11/driverless-cars-are-giving-engineers-a-fuel-economy-headache,<br />

October 2017<br />

[4] Andreas Gerstinger, Heinz Kantz and Christoph Scherrer, TAS Control<br />

Platform: A Platform for Safety-Critical Railway Applications,<br />

https://publik.tuwien.ac.at/files/PubDat 167529.pdf<br />

[5] Peter Sieverding and Detlef John, SICAS ECC – die Plattform für Siemens-ESTW für den Nahverkehr, Signal und Draht, May 2008<br />

[6] CSE International Limited for the Health and Safety Executive, RESEARCH REPORT 011: Preliminary assessment of Linux for safety related systems, 2002<br />

[7] Audrey Canning, Functional Safety: Where have we come from? Where are we going?, in Proceedings of the Twenty-fifth Safety-critical Systems Symposium, Bristol, UK, 2017<br />



[8] Guidelines for Initiating Events and Independent Protection Layers<br />

in Layer of Protection Analysis, Center for Chemical Process Safety<br />

(CCPS), 2015, Published by Wiley&Sons<br />

[9] A guide to the Kernel Development Process, https://www.kernel.org/doc/html/latest/process/development-process.html, 2018<br />

[10] GIT Repository of the Minimization Tool, https://github.com/Hitachi-India-Pvt-Ltd-RD/minimization, 2017<br />



A Multi-Platform Modern C++ Framework for<br />

Safety-Critical Embedded Software<br />

Daniel Tuchscherer (Author)<br />

Automotive Systems Engineering<br />

Hochschule Heilbronn, Germany<br />

daniel.tuchscherer@gmail.com<br />

Ingmar Troniarsky (Author)<br />

ITronic GmbH<br />

Erdmannhausen, Germany<br />

i.tron@itgroup-europe.com<br />

Markus Hinse (Author)<br />

ITronic GmbH<br />

Erdmannhausen, Germany<br />

m.hinse@itgroup-europe.com<br />

Frank Tränkle (Author)<br />

Automotive Systems Engineering<br />

Hochschule Heilbronn, Germany<br />

frank.traenkle@hs-heilbronn.de<br />

Abstract—The choice of a programming language and its<br />

idioms have a critical impact on reliability, safety and efficiency<br />

of the embedded software under development. In the automotive<br />

and robotics domains, the C programming language as well as<br />

model-driven tools are well established for safety-critical<br />

software. However, automated driving and innovative robotics applications are both examples of the emerging complexity of safety-critical software. Both domains contribute to the increasing popularity of modern approaches alongside the established ones to increase flexibility, such as Modern C++ with the ISO standards C++11 and C++14.<br />

This paper discusses experiences in applying Modern C++ as efficiently and as effectively as possible for developing safety-critical software. A multi-platform and simple-to-use framework for safety-critical software in Modern C++ is developed and applied to a concrete industrial application in the area of human-robot collaboration. On the one hand, Modern C++ is used to realize the speed control of the collaborative robotic system, which includes a proximity sensor system that measures distances between the robot and humans. On the other hand, safety mechanisms are realized with Modern C++ in order to monitor system entities and communication channels for failures. In case of real-time violations or failures, the safety-control software in Modern C++ must ensure safety stops in order to protect humans from hazards and resulting injuries. In concrete terms, this paper discusses in which way Modern C++ enhances usability, reliability and safety for the implementation of a bus-independent safety-communication protocol, which is used to provide message-based real-time monitoring, dual-channel utilities and actuation monitoring in a maintainable, extensible way.<br />

Keywords—Modern C++, embedded safety software, reliability, human-robot collaboration, reliable communication, IEC 61508, ROS<br />

I. INTRODUCTION<br />

Cooperating and collaborating robots interacting with humans without any barrier can be classified as so-called safety-critical systems. Safety-critical systems are embedded systems in which malfunctions may lead to hazards that potentially result in severe injuries for humans [17], [18]. In case of a fault that may escalate into a failure, harm to humans must be prevented in the first place. This is why measures to<br />

detect, avoid and handle malfunctions play a crucial role from<br />

the earliest phase of development. These measures relate to the<br />

topic of functional safety and include in-depth work with safety<br />

standards [28]. In conformance with these safety standards, it<br />

shall be verified and validated that safety-critical systems such<br />

as applications in the fields of human-robot collaboration<br />

(HRC) work as specified and maintain their intended<br />

functionality [7].<br />

At the same time, every safety project is limited by factors<br />

such as budget, time and resources. The programming language<br />

and tools along the development process have significant<br />

impact on these factors as Binkley states in an article about<br />

C++ for safety-critical software [1]. The efficiency of the development process and the reliability of safety-critical software depend on the programming language and the programming idioms utilized; this is especially the case for safety-critical software with additional safety requirements.<br />

Modern programming languages and tools shall support embedded software developers in building reliable, safe, maintainable and simple code with the highest efficiency. One example of a programming language that is gaining popularity in the domains mentioned is Modern C++, including the ISO standards C++11, C++14 and C++17. C++ is powered by these modern standards to facilitate time- and cost-effective development of high-quality software for features such as<br />



communication protocols and control functions, by providing<br />

paradigms for holistic views on the system and embedded<br />

software under development. A tool like Robot Operating<br />

System (ROS) supports developers to maximize flexibility in<br />

addition. ROS is a middleware for the development of<br />

autonomous systems and robots. It provides reusable utilities to<br />

visualize, communicate, test, simulate, trace as well as control<br />

robots to speed up development. All these utilities are<br />

accessible via C++ APIs.<br />

However, despite its popularity, Modern C++ for safety-critical software leaves room for discussion as to if and how it is applicable in detail. In this work, Modern C++ and ROS are utilized for an application in the area of HRC. This shall demonstrate and discuss in which way Modern C++ can be applied as efficiently and effectively as possible for the development of safety-critical embedded software in general.<br />

MODBAS-Safe is presented in this work: a multi-platform and simple-to-use framework for safety-critical embedded software written in Modern C++. The framework is applied to<br />

a concrete industrial application in the area of HRC as a<br />

demonstration use-case with respect to safety standards such as<br />

ISO 10218, ISO 13849 and IEC 61508. Modern C++ is used to<br />

implement the functional requirements of the collaborative<br />

robotic system, which includes a speed control of the robot in<br />

collaborative mode. For the speed control, a proximity sensor<br />

system measures the distances between the robot's tool center<br />

point (TCP) and the human worker. The embedded software in<br />

Modern C++ evaluates these distances. If the distance is less than a given tolerance, the software shall stop any motion by driving the robot's actuators, protecting humans from harm. The<br />

robotics system, including the system architecture and the<br />

speed control is described in Section III while the safety<br />

architecture is presented in Section IV.<br />
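The decision just described can be sketched as a small pure function (hypothetical names and thresholds, not the actual MODBAS-Safe API): below the tolerance, any motion is stopped, and between the tolerance and a slowdown radius the programmed speed is scaled down linearly.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>

enum class MotionCommand { Run, SafetyStop };

struct SpeedDecision {
    MotionCommand command;
    double speedFactor;  // 0.0 .. 1.0 scaling of the programmed speed
};

// distanceM: measured distance between the robot's TCP and the human (meters).
SpeedDecision decideSpeed(double distanceM, double toleranceM, double slowdownM) {
    if (distanceM < toleranceM)
        return {MotionCommand::SafetyStop, 0.0};  // stop any motion
    if (distanceM < slowdownM) {                  // collaborative slowdown band
        const double f = (distanceM - toleranceM) / (slowdownM - toleranceM);
        return {MotionCommand::Run, std::min(f, 1.0)};
    }
    return {MotionCommand::Run, 1.0};             // full programmed speed
}
```

Keeping the decision a pure function of the measured distance makes it straightforward to unit-test exhaustively, which matters for the verification activities discussed later.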

In addition, safety mechanisms are also realized in Modern<br />

C++ in order to monitor system entities and communication<br />

channels for faults and failures. In case of real-time violations<br />

or failures, the safety-control software in Modern C++ must<br />

ensure safety-stops to protect humans from hazards and resulting injuries. The safety-control software makes extensive use of the MODBAS-Safe framework features, including the bus-independent safety-communication protocol presented in Section V. The safety-communication protocol provides means<br />

for the realization of real-time violation monitoring, dual-channel functionality or actuation monitoring. The safety-control software and the MODBAS-Safe framework are the topic of Section VI. This section demonstrates in which way features of C++11 and C++14 can be used to boost reliability and prevent incorrect usage by utilizing compile-time checks,<br />

computations and transformations. Code examples of this<br />

section show in which way the multi-paradigm approach of<br />

Modern C++ helps to reduce the overall complexity and makes<br />

it simple to transform mental models, functional and safety<br />

requirements directly into code.<br />

The C++ programming language is only a part of the<br />

toolchain: ROS as a tool is used for visualization, verification<br />

and validation within this project. In this work, ROS is clearly<br />

separated from the embedded software deployed. None of the<br />

ROS components used for testing are part of the production<br />

embedded code. In other words: The embedded software in<br />

Modern C++ is fully functional without the ROS ecosystem<br />

and its dependencies on third-party software. The embedded code shall not contain any build-time or runtime dependencies on ROS, which reduces certification effort and makes the<br />

embedded software portable. Instead, MODBAS-Safe provides a decoupling mechanism: the safety-communication protocol and a dedicated ROS gateway transfer runtime data from the target to the ROS ecosystem for visualization and verification purposes, so that the possibilities that come with ROS can still be exploited.<br />

Section VII gives a conclusion about the usage of Modern C++ for the application in HRC and, from the experiences made, about how Modern C++ can be used for deploying high-quality embedded software in general. Section VIII gives a brief outlook on MODBAS-Safe and the open challenges of applying Modern C++ to safety-critical software.<br />

II. RELATED WORK<br />

Writing safety-critical software in C++ is often still limited to the use of older language standards up to the 2003 release (C++03). One example of the use of C++03 is the Joint Strike Fighter (JSF) development program [8]. A helpful compilation on the use of C++ and the intentions for using this programming language within the JSF projects is presented by<br />

Stroustrup and Carroll [32]. From these projects, the JSF AV<br />

C++ coding standard emerged. Based on JSF++, additional<br />

programming guidelines such as MISRA C++ were published<br />

in 2008. Other publications that relate to writing safety-critical embedded software in C++ are by Binkley [1] and Williams [34]. Binkley targets design patterns for the safe handling of fixed-point and floating-point arithmetic by using C++ classes. Reinhardt provides a detailed and extensive view on C++ for safety-critical systems [27]. The interesting point about<br />

this work is the close and direct relation to the relevant safety<br />

standard IEC 61508 and the comparison of C++ with other<br />

programming languages for safety-critical software<br />

development.<br />

Still, there are concerns and issues raised when writing embedded software in C++. These concerns are remnants of a time when no modern C++ language features existed and C++ tool support was limited; in particular, (cross-)compilers were not available in the wide variety that they are today. A rebuttal of common concerns and guidelines<br />

on how to write efficient embedded software in C++ is<br />

provided by Goldthwaite in the Technical Report on C++<br />

Performance [10] and by Stroustrup [31]. The recommendations in the technical report represent the basis for the general usage of C++ in this work. This includes recommendations on which C++ paradigms are applicable to boost efficiency and effectiveness for embedded systems where real-time constraints<br />

matter.<br />

Individual reports and books point to the modern standards<br />

C++11 and C++14 in the context of embedded software<br />

programming. Extensive examples about C++ for real-time<br />

systems are presented in the book written by Kormanyos, with<br />

reference to the automotive industry and AUTOSAR [19]. A<br />

compact overview is given by Grimm [11]. Our work heavily<br />

relies on the experiences documented in the related work. This<br />

includes the interaction with new language features like<br />



`constexpr` for compile-time computations,<br />

`static_assert()` for compile-time checks and the<br />

effective use of the C++ STL in parts suitable for safety-related<br />

software.<br />

Figure 1: System Architecture of the HRC application.<br />

III. APPLICATION<br />

This work is located in the innovative field of human-robot collaboration (HRC), which relates to Industry 4.0 - the so-called next industrial revolution [29], [20]. In the context of HRC, human and robot work together without any locking guards [25]. A compact overview of the current status of HRC is given by Huelke [13]. There are a number of robots on the market designed for collaboration. Known examples are the robot series UR3, UR5 and UR10 by the Danish company Universal Robots, the KUKA LBR iiwa 4 or the collaborative robot CR-35iA by FANUC. In order to avoid or mitigate severe injuries, all these robotic systems have an embedded collision detection. The collision detection is realized either by torque monitoring or, as in the case of the FANUC CR-35iA, by force monitoring. On exceedance of a mostly configurable torque limit, a collision is detected and the robot executes a safety-stop. However, this implies a collision before the system even stops. Thus, these systems are currently limited to a tool center point (TCP) speed of 250 mm/s by the safety standard ISO 10218. Due to the reduced speed, severe injuries are mitigated. For higher speeds exceeding the current limit, a locking guard is still mandatory [6]. The limited use-cases under speed limitation lead to an increasing interest in collision-free HRC, in order to be able to operate collaborative systems at higher speeds. Proximity sensors support a collision-free collaboration. If the distance between human and robot is less than a given tolerance, a sensor system shall help to execute a safety-halt before robot and human collide. Once the distance is greater than the critical limit again, the robot shall continue its work without confirmation. The basis for a sensor-based, collision-free HRC is presented by Ostermann [24], [25].<br />

The proximity sensor system (sensorhead in Figure 1 and Figure 2) for contactless object detection has been developed by the ITSoft company (member of ITGroup). This sensorhead is attached to the last joint of the robot and is designed to be used with the Universal Robots UR3, UR5 and UR10 or the FANUC robot family. The sensing range is from 0 to 1500 mm, based on the ultrasonic reflection principle. It contains 12 detection sensors composed of 2 coupled sensor modules for complete redundancy. For fail-safe reasons a crosscheck between the proximity sensor modules is implemented, checking signal integrity two times for each sensor module. Also, a sensor check is performed in every measurement. Interferences of the measurements are detected by the sensorhead and reported to the evaluation unit.<br />

The sensorhead power supply is 24 volts DC. Status LEDs in the sensorhead indicate obstacle recognition, operation mode and fail-signaling. The detection spread is 360 degrees in the vertical axis and 60 degrees in the horizontal axis at all times. For the development of the sensorhead software a TÜV-certified compiler (armcc.exe V5.04 update 2 build82) is applied.<br />

Figure 2: Ultrasonic sensorhead of the safety system.<br />

The following work is based on the idea of a completely<br />

collision-free collaboration by measuring distances with this<br />

ultrasonic proximity sensor system. To prevent collisions as effectively as possible, the intended function is to ensure the execution of a safety-halt (SS2) if a certain minimum allowed distance between human and robot is violated [4]. As a collaborative robot, the Universal Robots UR3 (3 kg payload)<br />

is used. Figure 1 shows the general architecture of the system<br />

under development in a block diagram. The collaborative<br />

system consists of the proximity sensor system, an evaluation<br />

unit for safe speed control and the UR3. The proximity sensor<br />



system is mounted on the UR3's tool head and measures the<br />

distances to environmental objects within a certain range based<br />

on the system's performance. The evaluation unit is an<br />

embedded system used to realize the speed control of the<br />

robot's TCP during collaboration mode based on the proximity<br />

sensor distances. The distance samples are sent from the sensor<br />

system to the evaluation unit via CAN; the robot's current state<br />

(including actual pose and speed) is sent periodically every 8 ms<br />

via Ethernet TCP/IP.<br />

On violation of certain minimum distances, the evaluation<br />

unit shall either reduce the robot's speed or even stop any<br />

motion of the robot, as soon as the distance between robot and human is too low. This paper proposes two identical instances of this evaluation unit with cross-monitoring, using two CAN bus<br />

channels for communication. Missing, wrong or implausible<br />

events in the measurement or in the communication between<br />

the sensor system and the evaluation units are detected and<br />

published on all channels in order to trigger a safety-stop of the<br />

complete system.<br />

The processing of sampled sensor distances and the safe<br />

speed control is realized by an evaluation unit application<br />

software executed on the target. This embedded software is<br />

completely developed in Modern C++. Its development and execution take place under a Linux OS with the PREEMPT_RT patch to meet the required real-time constraints. Developing under Linux allows the execution of the compiled and linked software on the development PC for rapid prototyping and fast<br />

feedback. At the same time, it is possible to deploy the same<br />

software to an embedded target running an Embedded Linux<br />

(including PREEMPT_RT) without any adaptations needed<br />

(e.g. no specific build defines / settings necessary). From the<br />

earliest phase of development, the software is executed on the<br />

development PC and is later deployed on an embedded target<br />

without additional effort. One possibility is to execute the<br />

evaluation software for the safe speed control in Software-in-the-Loop (SiL) simulation mode, in which the sensor system<br />

and UR3 are simulated. It is also possible to run the software<br />

on the development PC and communicate with the real system<br />

entities, the sensor system and the UR3. For the realization of the SiL test, ROS and V-REP are applied. ROS rviz is used for<br />

visualization. ROS rviz provides means to display sensor<br />

distances and the robot's pose in a three-dimensional view for<br />

rapid-prototyping and fast feedback on testing new features. In<br />

order to simulate the robot's environment for SiL tests of the<br />

evaluation unit application software, V-REP as a multi-rigid-body simulation tool is used to simulate the robot's dynamics,<br />

the behavior of the proximity sensor system and the dynamic<br />

obstacles such as humans. The evaluation unit software either<br />

communicates directly via virtual TCP/IP and virtual<br />

SocketCAN with the simulation or with the real system<br />

entities. A switch between virtual and physical interfaces does<br />

not require any adaptations to the software under test.<br />

The collaboration space / working area is split into three spatial zones as depicted. These zones can be imagined as fixed to the tool head's frame, moving together with the robot's sensor system. The distances to adhere to are calculated according to DIN EN ISO 13855 and ISO TS 15066.<br />

If a dynamic obstacle occurs in the permissible zone, the evaluation unit application software may operate the robot with a maximum speed of v_TCP = 1 m/s. Within the tolerance zone the robot's TCP speed is limited to the safety limited speed (SLS) of v_TCP = 250 mm/s. In the safe zone, a dynamic obstacle within the specified distance must lead to a safety-halt triggered by the evaluation unit to stop any motion of the robot as long as the obstacle remains in this range. For the<br />

evaluation unit application software to be able to detect dynamic obstacles like humans, the evaluation unit must distinguish between static and dynamic objects. Before any collaboration, a reference drive to record the static environment is mandatory. Within this automated teach mode, the robot is driven at constant speed along the paths it will take during collaboration, without any dynamic objects present in the collaboration space - only static objects of the environment exist during teach mode. The distances measured by the sensor<br />

system and the robot pose are sampled periodically in fixed<br />

time steps. The evaluation unit application software stores<br />

every record consisting of the distances of the sensor system<br />

and the robot's pose in a table-based, application-specific data<br />

model. Over each record entry a CRC is computed to ensure<br />

data integrity. In collaboration mode, this persistent data model<br />

is accessed to compare the reference samples with the current<br />

sensor distances at a specific robot pose. The deviation from the current dynamic environment can be evaluated. In collaboration mode, however, the challenge is that the speed is not constant as it is in teach mode; in other words, the time information from the teach mode is lost, because the robot is slowed down or accelerated during collaboration. In order to relate corresponding record entries of the collaboration and teach modes at a given robot pose, Dynamic Time Warping [23] is applied.<br />

This section specifies the general concept of the system to<br />

be developed under the use of Modern C++ in order to realize<br />

the functional requirements and some of the safety<br />

requirements. The intent is not to give an in-depth look on<br />

functional safety of the system, but to provide a basic<br />

understanding of the features needed to be realized in Modern<br />

C++. In the following section, the safety architecture is<br />

described. This represents the basis for the safety control<br />

software that monitors the evaluation unit for real-time<br />

violations, for instance. This clear separation of the application<br />

logic of the evaluation unit application software and the safety<br />

control software, including fault-detection and safety mechanisms, is obligatory. This partitioning is highly<br />

recommended according to the literature [2], [5]. Avoiding<br />

mix-ups and the clear distinction between the application<br />

software and the safety-related software enable easier testing<br />

and verification of the individual software components. In<br />

addition, this also leads to smaller, simpler and thus<br />

maintainable software units.<br />



IV. SAFETY ARCHITECTURE<br />

The safety-limited speed function in the tolerance zone and the safety-halt for the safe zone cover aspects of functional safety. The safety mechanism for the HRC application in case of a malfunction is accomplished by transferring the system to a safe state. The system is not required to be fail-operational. In case of a deviation from the intended functionality, the complete system is transferred into this safe state. The system will remain in this safe state until there is a manual reset by an operator. In this context, a safe state means to stop any robot motion to reduce risks for the worker that collaborates with the robot. As a recap, the following safety functions are realized by the evaluation unit application software:<br />

- If there is a dynamic object within the tolerance zone, the application software shall reduce the robot's TCP speed to the safety limited speed (SLS) of v_TCP = 250 mm/s.<br />
- If there is a dynamic object within the safe zone, the application software shall execute a safety-halt (SS2) as long as the object remains in the zone.<br />

These two safety functions are sufficient to reduce risks for the human. However, this is only the case if the system operates with its intended functionality. Since no system is completely free of runtime faults, additional safety functions need to be specified. These safety functions are executed if the intended functionality of the system cannot be guaranteed. First, possible faults need to be identified. Examples of faults and failures that need to be detected are:<br />

- Real-time violations of sensor or robot communication: e.g., the periodical update of the robot state or sensor data is not transmitted in time.<br />
- Failure of the evaluation unit<br />
- Invalid sensor distances<br />
- Invalid robot state<br />
- Actuator command from the evaluation unit not executed by the robot<br />
- Invalid record entries from the teach mode<br />

Figure 3: Safety architecture category 3 required according to ISO 10218-1.<br />

For each of these faults or failures the complete system<br />

shall be transferred into a safe state by a safety function the<br />

UR3 robot provides. The UR3 provides the following built-in<br />

safety functions: Safety Limited Speed (SLS), Safety-Halt<br />

(SS2) and Safety-Stop (SS1). All of these can be triggered<br />

externally through digital inputs of the robot. In order to detect and handle the faults and failures described above, an appropriate safety architecture must be chosen. According to ISO 10218-1, safety-relevant entities of the robot's safety control shall reach Performance Level d (PL d) with a category 3 architecture [15]. According to category 3, an HRC system must provide two independent channels. In addition, the architecture must provide means for both logic devices of the two-channel system to cross-monitor computations and the actuator command. This cross-monitoring is used to determine any deviation from the intended functionality of one of the logic devices. In this case one logic device is represented by one evaluation unit.<br />

According to the category 3 architecture, actuator monitoring is<br />

recommended. For instance, this monitoring ensures that an<br />

executed safety-stop is really maintained as long as the system<br />

is exposed to a hazard.<br />

Figure 3 depicts the safety architecture of the HRC system<br />

to achieve category 3. With this architecture malfunctions<br />

described above can be detected. The safety architecture<br />

represents an extension of the initial system architecture shown<br />

in Section III that meets the recommended category 3<br />

architecture from ISO 13849 [26]. In direct comparison to the<br />

reference from ISO 13849, the one for the HRC is also based<br />

on two channels. An inter-process communication (IPC)<br />

mechanism is used to send safety-relevant data from the<br />

application process to the safety control software for<br />



monitoring. This IPC-mechanism uses the safety protocol<br />

specifically developed for the demonstration of Modern C++ in<br />

the context of safety-critical systems. Both evaluation units<br />

cross-monitor each other's results using the same safety<br />

protocol to detect deviations. Actuator monitoring can be<br />

achieved by feeding back the current robot state to the<br />

evaluation units. The proximity sensor system is redundant. In<br />

each evaluation unit the application software and the safety<br />

control software (SafetyMaster) are executed. The SafetyMaster is responsible for monitoring malfunctions like real-time violations and incorrect actuator commands, as well as failures of<br />

individual system entities such as the robot or the proximity<br />

sensor system.<br />

V. SAFETY PROTOCOL<br />

A reliable inter-process communication (IPC) system forms the basis for the correct operation of this HRC application and for achieving the safety measures of Section IV. The IPC mechanism allows the data exchange between the application software process and the safety control process. For a reliable and safe communication, a safety protocol is specified and implemented in Modern C++. This safety protocol is an elementary part of the Modern C++ safety framework<br />

MODBAS-Safe. The safety protocol's design is focused on the<br />

detection and control of common transmission errors, because<br />

there is no guarantee for an error-free data transmission. The<br />

detection of transmission errors and an adequate reaction are important to prevent violations of safety integrity. Important measures for error detection and control in the context of reliable communication for safety-critical systems are presented in DIN 61784-3 and in the literature [3], [12]. These<br />

measures are also applied to the safety protocol developed. A<br />

safety header within a safety frame of the protocol consists of<br />

the following elements: a 32-bit CRC, the message length of<br />

the payload in bytes, a timestamp, a numeric identifier to<br />

distinguish between safety-related and non-safety-related<br />

messages and a message counter. The identifier is also used as<br />

the priority of a frame.<br />

For the transmission of safety-relevant data, established,<br />

standardized protocols such as Ethernet UDP/IP or field bus<br />

systems like CAN or FlexRay are used. These communication<br />

channels are not safe in themselves. For the exchange of safety-relevant data, additional measures on higher OSI layers are<br />

mandatory. The majority of real-time communication protocols<br />

utilized for safety-critical systems like FlexRay, EtherCAT,<br />

Ethernet POWERLINK or PROFINET are bound to specific<br />

field bus systems. In contrast to this, bus-independent safety<br />

protocols like openSAFETY or End-to-End-Protection (E2E)<br />

known from AUTOSAR are available. The safety protocol in<br />

development is also bus-independent by applying the black-channel principle. The black-channel principle allows the<br />

transmission of safety-relevant data and non-safety data over<br />

the same communication channel. In general, the safety<br />

protocol in development shall support the following use-cases:<br />

- Reliable and safe transmission of sensor data and actuator commands.<br />
- Monitoring of sensors and actuators for real-time violations and connection losses.<br />
- Cross-monitoring of computed results for system entities in multi-channel architectures.<br />
- Rapid extension with additional communication nodes for extended visualization and diagnosis (passive read-only nodes).<br />

Based on these use-cases, the Modern C++ implementation shall fulfill the following requirements:<br />

- Easy-to-use API: The API for the application developer that uses the safety protocol shall be as simple as possible to prevent incorrect usage. The creation, transmission and reception of message objects shall be unambiguous.<br />
- Compile-time checks: Frame and packet lengths shall be configurable only at compile time. Used types and configurations shall be checked at compile time to maximize reliability and prevent errors from the earliest phase of application development.<br />

Several nodes use the safety protocol and communicate<br />

with each other over the same communication channel, sharing<br />

safety-critical and non-safety-critical data. Nodes can be either<br />

individual processes running within the operating system that<br />

communicate over a virtual network device (IPC) or<br />

distributed, embedded systems that access the same physical<br />

communication channel. In this work, the safety protocol is<br />

based on Linux SocketCAN. SocketCAN allows the usage of<br />

CAN Flexible Datarate (CAN FD) both for communication via<br />

virtual interfaces as well as for real CAN channels. A switch between virtual interfaces, in a simulation for instance, and real physical interfaces is possible without code adaptations.<br />

The safety control software assembles information about the system entities it interacts with by using the safety protocol and<br />

thus is able to detect malfunctions. The messages sent from the<br />

application process to the safety control process are event-based and time-based. An example of event-based messages is<br />

the cyclic transmission of the robot's state. As soon as the<br />

application receives the current robot's state from the UR3, an<br />

event message is sent to the safety control. If the application<br />

process does not receive data, no event message is transmitted<br />

to the safety control. Consequently, the safety control process<br />

will detect a timeout / real-time violation if a specified deadline<br />

is not met. The same applies to the cyclic transmission of the<br />

distances to obstacles measured by the proximity sensor<br />

system. If no samples are received in the application process,<br />

no event message is sent to the safety control process - a real-time violation is detected by the safety control process.<br />

VI. APPLYING MODERN C++<br />

Modern C++ as a programming language is used to realize<br />

both the evaluation unit application software and the safety<br />

control software. Before the development of the concrete<br />

application and safety control, patterns for the development of<br />

safety-critical software are assembled into one Modern C++<br />

framework named MODBAS-Safe. MODBAS-Safe is a safety<br />

framework implemented in Modern C++ (C++11 and C++14).<br />

This framework is developed under consideration of<br />

programming and development guidelines for safety-critical<br />



software. Particularly this includes guidelines from IEC 61508-3 as a basis and recommendations from established<br />

programming guidelines such as MISRA C++, JSF AV C++,<br />

HIC++, the NASA JPL guidelines as well as CERT C++ to<br />

develop a high-quality safety-framework in Modern C++. Also,<br />

current developments in the Modern C++ community like the<br />

C++ Core Guidelines and the methodology of defining a<br />

superset instead of a language subset are considered [33], [31].<br />

Keeping in mind these guidelines during development helps to<br />

maximize reliability, maintainability, readability, portability<br />

and robustness.<br />

Figure 4: Modern C++ STL features and idioms that facilitate high-quality embedded software development.<br />

This framework provides the following generic features independent of any application:<br />

- An easy-to-use safety protocol to exchange safety-critical and non-safety-relevant data.<br />
- Application monitoring: By utilizing the developed message-based monitoring and safety protocol, applications can be monitored.<br />
- Real-time monitoring: Timeout / deadline violation detection.<br />
- Fault management: Persistent fault storage, logging, fault handling, safety-function execution and fault-detection messaging.<br />

In relation to the concrete application these features can be used and configured. The intent of MODBAS-Safe is to provide a collection of proven-in-use and verified solutions in C++11 and C++14, applicable for safety-critical software applications in the domains of automated driving and human-robot collaboration, that speed up development and at the same time boost quality measures such as reliability and flexibility. Recurring challenges for safety-critical software development, such as developing reliable communication, modeling real-time constraints and fault handling in code, or transforming safety requirements directly into C++ code, can be solved efficiently by using the modules of MODBAS-Safe.<br />

The central point during development was to create a simple-to-use safety framework completely written in C++, focusing on specific language features and generic parts from the current standards and the STL implementation that maximize readability, maintainability and thus effectiveness. At the same time, MODBAS-Safe hinders the use of idioms that may lower readability or raise overall complexity and verification effort (static code analysis and testing), by providing reusable generic, compile-time configurable software modules. In a nutshell, the design rule of making interfaces easy to use correctly and hard to use incorrectly is promoted [22]. This makes certain unsafe C++ features for embedded programming seem less relevant. MODBAS-Safe was developed with the following constraints and requirements kept in mind, to maximize reliability in the first place and to lower verification effort in consequence:<br />

- No RTTI (`-fno-rtti` option for gcc and clang) and no virtual methods: Virtual methods can lower efficiency, especially when called with a high frequency. In addition, RTTI raises difficulties for WCET estimations. Dynamic polymorphism in general undermines certain aspects of static code analysis [27]. From the experience made during research and development, the additional complexity introduced by RTTI is not worth the benefit of using it.<br />

No exceptions (`-fno-exceptions` option for<br />

gcc and clang): The same applies for exceptions as for<br />

RTTI. WCET estimations are not that simple and rely<br />

on the application. From the experiences made during<br />

development of the HRC system the benefit<br />

introduced by using C++ exceptions in embedded<br />

software is not worth the additional effort needed for<br />

verification and WCET estimation [10]. But<br />

exceptions are not only a problem because of the<br />

hidden control path. Using exceptions raises memory<br />

consumption [21].<br />



No compiler optimizations: Admittedly, the full performance and power of the C++ programming language rely on compiler optimizations. At the same time, optimizations are a major challenge for developing safety-critical software. For the demonstration use-case, compiler optimizations are disabled because it is easier to inspect and verify what the compiler generates.


No dynamic memory used within the application: During runtime, no dynamic memory allocations and de-allocations shall occur within the code [10].

Full, pedantic warnings and warnings as errors (`-Wall -Wextra -Wpedantic -Werror` options): From the earliest stage of development, the complete code is compiled with full warnings and warnings-as-errors enabled, which maximizes reliability by detecting, for instance, unused or uninitialized variables. The warnings-as-errors flag (`-Werror`) can also help detect faulty optimizations in case compiler optimizations are enabled.

In general, meeting the constraints above is not mandatory for every embedded software project developed in C++; they should be seen as guidelines recommended on the basis of the experience gained and the literature cited. Potentially, everything can be used. However, it must always be clear which additional effort, complexity and side-effects the use of a certain idiom introduces, and whether the benefit is worth the verification effort. MODBAS-Safe instead focuses on language features that provide the most effectiveness. Such language features with a small footprint but high effectiveness for safety-critical software development in the HRC domain are depicted in Figure 4. MODBAS-Safe is based on the three C++ paradigms shown in Figure 4 that make the framework effective: generic programming using templates, object-orientation, and the use of C++11 STL features such as `std::array`, `std::chrono`, `std::tuple` and `<type_traits>`.

MODBAS-Safe itself consists of generic parts based on these C++ language features and idioms shown in Figure 4. For a concrete application, these generic parts must be instantiated and configured accordingly before compilation. Depending on the application, three steps must be configured for the integration:

1. Definition of a periodic update rate as well as the scheduling priority and policy of module `SafetyMaster`. With this period the safety control software is invoked periodically.

2. Implementation of individual monitoring units that fulfill a certain monitoring function in conformance with the safety specification. These monitoring units are collected within the `SafetyModules` container and called for monitoring in each update step.

3. Definition of possible malfunctions in the module `FaultManager` and of an appropriate reaction (e.g. execution of a safety-function or fault-handling).

In reference to the first point, Listing 1 shows the instantiation, initialization and cyclic update of one `safety_master` object to realize fault and failure monitoring by utilizing the safety protocol. First, the CAN FD interface `"vcan1"` is passed at construction. With a call to `SafetyMaster::init()` the POSIX scheduling priority and policy are set. Both parameters are specified by template parameters in the backend and checked for valid ranges of the parametrized policy and priority at compile-time with the C++11 feature `static_assert()`.

    SafetyMaster safety_master{"vcan1"};

    int main() noexcept {
      AR::boolean ok = safety_master.init();
      while (ok) {
        safety_master.update();
        ok = safety_master.is_ok();
        safety_master.idle();
      }
      return 0;
    }

Listing 1: Initialization of the SafetyMaster node for monitoring.

During startup, the

communication binding is also being initialized. In the specific<br />

use-case SocketCAN is used to transmit and receive messages.<br />

Within a loop `SafetyMaster::update()` accesses the<br />

communication binding first to receive incoming safety-related<br />

messages and routes them to the registered monitoring units. On an incoming message, the `safety_master` object broadcasts the message to all registered monitoring units of this application; each monitoring unit decides on its own whether the message is relevant and, if so, processes it further. If no messages are pending to route or process, the `safety_master` is still in charge of triggering all registered monitoring units periodically. With a call to `SafetyMaster::idle()` the single-threaded process sleeps for the configured period to achieve a fixed time-step. The sleep is internally implemented with `clock_nanosleep()`, using `CLOCK_MONOTONIC` as the clock.

Each monitoring unit is based on the reception of messages using the safety protocol of the safety framework. With event-based messages, timeouts can be detected. Periodic messages holding the current actuator command as data can be used to detect data-integrity violations. Application messages are modeled as C++ objects containing data; for the user it shall be as simple as possible to create messages for transmission and reception.

Listing 2 shows the description of a safety-related message as an example. The message `RobotDataUpdateMsg` is derived from the base class `SafetyMessage`. This signals to the backend / stack implementation of the safety protocol that this is a safety-related message and which checks must be performed on the structure / class with the help of the C++11 header library `<type_traits>`, which allows type checks and transformations at compile-time. First, the backend of the safety protocol implementation checks with the help of the C++11 feature `static_assert()` that a



given type of the message object represents a *concrete* object and not, for instance, a pointer. Thus, passing a pointer to the `SafetyProtocol::Send()` method will raise an error at compile-time. Also, the priority `kId` can be checked at compile-time to be within the valid range for safety-critical

messages. All safety-relevant messages shall have a priority in the range of 0 (highest) to 100 (lowest), for instance. In a further step, `std::conditional`, available in `<type_traits>`, is used to select the type of frame to be sent. If the application-specific message is derived from `SafetyMessage`, `std::conditional` yields the type `SafetyFrame`, which includes the `SafetyHeader`.

    struct RobotDataUpdateMsg : public SafetyMessage {
      /// High priority of safety-related message
      static constexpr auto kId{10U};
      /// Send the robot joints for visualization purposes
      std::array<float32, 6U> joints_;
      /// Send the TCP speed for actuation-monitoring
      float32 tcp_speed_;
    };

Listing 2: Description of safety message objects. The safety protocol implementation checks for constraints at compile-time.

If the message type is not derived from `SafetyMessage`, a `StdFrame` (non-safety) is selected as the frame to be sent. This selection happens automatically at compile-time in the backend of the safety protocol. Based on the application-specific message declared, `static_assert()` provides a compile-time check that the message does not exceed the maximum transmission unit (MTU) of the communication binding used by the safety protocol to send and receive frames.

    // monitoring the robot update interval for real-time violations
    RTMonitor<FaultManager> robot_data_rt{RobotDataUpdateMsg(), 50ms, kUpdatePeriod};

    // monitoring the sensorhead update interval for real-time violations
    RTMonitor<FaultManager> sensorhead_rt{SensorheadUpdateMsg(), 100ms, kUpdatePeriod};

Listing 3: Description of monitoring objects in C++.

Application messages such as the example in Listing 2 can be used to realize real-time violation / timeout detection. Listing 3 shows two examples of describing real-time monitoring objects in C++. Real-time constraints are directly transformed into C++ with the template class `RTMonitor` provided by MODBAS-Safe. The first template parameter of `RTMonitor` specifies the module to inform if a violation occurs; in this case, the `FaultManager` is the one to inform about a real-time violation, and it handles any violation as configured by the user. As the first constructor parameter, the message to listen for is passed to `RTMonitor`. As soon as this message is received by `safety_master`, the timeout timer is reset. The second constructor parameter specifies the deadline within which a message of this type must be received. The third constructor argument is a constant expression (constexpr) value specifying the period at which the `RTMonitor` object is updated. As can be seen from the listing, all time units are defined with C++11 `std::chrono` to make time conversions and measurements simple. In addition, C++14 `std::chrono_literals` allows time units to be used directly in C++ (s, ms, us), which enhances readability.

Individual monitoring objects, as shown in Listing 3, are assembled into a safety-modules collection and accessed by the `safety_master` object in each update step and in case of a notification. The safety-modules collection is easily extensible through the use of C++11 variadic templates. Within this variadic template class, named `SafetyModules`, a C++11 `std::tuple` contains all the monitoring units defined. Besides the real-time monitoring units depicted in the example, an additional unit for actuation monitoring could be registered with one line of code. The C++11 STL template class `std::tuple` is a heterogeneous container of static size. The `SafetyMaster` accesses this tuple object on each update call and iterates over all monitoring units within the container. The same pattern of having a tuple of units to manage is also applied to the `FaultManager`, which is in charge of executing safety functions in case of a fault / failure reported by one of the monitoring units and of transmitting fault messages over the safety protocol in order to inform other nodes about an active safety function.

VII. CONCLUSION<br />

Modern C++ for innovative safety-critical applications in the fields of human-robot collaboration and highly automated driving enables simple, holistic views of the system under development by leveraging modern language features and the multi-paradigm methodology. In these domains of research and development, it is of great importance to get fast feedback and to get things done. Within a very short amount of time, the embedded application code and the safety control in the HRC domain were completely written in Modern C++, demonstrating its efficiency. Driven by the superset-of-a-subset methodology, the safety framework MODBAS-Safe provides generic, simple solutions in Modern C++. On the one side, this leads to flexibility; on the other side, C++ allows close alignment with safety standards like IEC 61508 and ISO 13849. Generic programming / C++ templates and C++11/C++14 (STL) features such as static_assert(), chrono, variadic templates, std::tuple, std::array, decltype, type_traits and user-defined literals are used to transfer mental models, functional and



safety requirements directly into C++ code. Moreover, the toolchain of and around the C++ programming language supports this efficient development workflow for producing safety-critical embedded software rapidly. Compilers like clang or gcc, as well as tools such as clang-tidy, act as a first stage of static code analysis and enhance reliability from the earliest development phase. Tools like the ROS middleware, along with C++, support the developer in rapid prototyping. ROS, while not part of the productive embedded code, is used for SiL and HiL simulation, verification, tracing and replay with rosbag, and visualization in ROS rviz. The result of all these measures is simple, readable, maintainable and thus high-quality embedded code for an HRC application in the shortest amount of time, even under the additional load of satisfying safety requirements.

VIII. FUTURE WORK<br />

MODBAS-Safe is currently used for the development of control and monitoring functions in the field of highly automated driving. However, two major challenges need to be solved when using C++ for developing productive safety-critical embedded code. First, an extensive analysis of C++ compiler optimizations is needed. Second, template instantiations must somehow be made clearly visible to the application developer, to simplify verification and to avoid unintended behavior: which template instantiation is called at which time. Another challenge that arises when developing embedded software in Modern C++ is security. Secure code in Modern C++ will take a major role during development, since growing connectivity and security issues are likely to have an impact on functional safety.

REFERENCES<br />

[1] Binkley, David W. (1997). "C++ in safety critical systems". In: Annals of Software Engineering 4.
[2] DGUV (2008). BGIA-Report 2/2008 - Funktionale Sicherheit von Maschinensteuerungen - Anwendung der DIN EN ISO 13849. Ed. BGIA.
[3] DGUV (2014). Grundsätze für die Prüfung und Zertifizierung von "Bussystemen für die Übertragung sicherheitsbezogener Nachrichten". Tech. rep. Deutsche Gesetzliche Unfallversicherung.
[4] DGUV (2015). DGUV Information 209-074 - Industrieroboter. Tech. rep. DGUV.
[5] Douglass, Bruce P. (1998). Safety-Critical Systems Design. Tech. rep. i-Logix.
[6] Dürr, Klaus and Jochen Vetter (2014). Auf die Applikation kommt es an.
[7] Dunn, William R. (2003). "Designing Safety-Critical Computer Systems". In: IEEE Computer Society. URL: https://pld.ttu.ee/IAF0530/01244533.pdf.
[8] Emshoff, Bill (2014). Using C++ on Mission and Safety Critical Platforms. CppCon. URL: https://channel9.msdn.com/Events/CPP/CPP-Con-2014/010-Using-C-on-Mission-and-Safety-Critical-Platforms.
[9] Holzmann, Gerard J. (2006). "The Power of 10: Rules for Developing Safety-Critical Code". In: Computer 39.6, pp. 95-97. DOI: 10.1109/MC.2006.212.
[10] Goldthwaite, Lois (2004). Technical Report on C++ Performance. Tech. rep. ISO/IEC. URL: http://www.open-std.org/Jtc1/SC22/WG21/docs/papers/2004/n1666.pdf.
[11] Grimm, Rainer (2014). Embedded programming with C++11.
[12] Hannen, Heinrich-Theodor (2012). "Beitrag zur Analyse sicherer Kommunikationsprotokolle im industriellen Einsatz". PhD thesis. University of Kassel.
[13] Huelke, Michael (2014). Kollaborierende Roboter - Zum Stand von Forschung, Normung und Validierung. URL: http://www.suqr.uni-wuppertal.de/fileadmin/site/suqr/Kolloquium_Download/Huelke_2014-01-14.pdf.
[14] IEC (2010). IEC 61508 - Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 3: Software requirements. Tech. rep. IEC.
[15] ISO (2011). Industrieroboter - Sicherheitsanforderungen - Teil 1: Roboter (ISO 10218-1:2011). Tech. rep. International Organization for Standardization.
[16] ISO (2016). ISO/TS 15066 - Robots and robotic devices - Collaborative robots. Tech. rep. ISO.
[17] Kalinsky, David (2005). "Architecture of safety-critical systems". Embedded Systems Programming.
[18] Knight, J. C. (2002). "Safety critical systems: challenges and directions". In: IEEE Software Engineering.
[19] Kormanyos, Christopher (2013). Real-Time C++. Efficient Object-Oriented and Template Microcontroller Programming. DOI: 10.1007/978-3-642-34688-0.
[20] KUKA Aktiengesellschaft (2015). Hello Industrie 4.0 - we go digital. KUKA Robots. URL: https://www.kuka.com/-/media/kuka-corporate/documents/press/broschuereindustrie40de.pdf.
[21] LLVM Compiler Infrastructure (2016). LLVM Coding Standards.
[22] Meyers, Scott (2014). The Most Important Design Guideline. URL: https://www.youtube.com/watch?v=5tg1ONG18H8&t=1729s.
[23] Müller, Meinard (2007). Information Retrieval for Music and Motion.
[24] Ostermann, Björn (2014). "Entwicklung eines Konzepts zur sicheren Personenerfassung als Schutzeinrichtung an kollaborierenden Robotern". PhD thesis. Bergische Universität Wuppertal.
[25] Ostermann, Björn, Michael Huelke and Anke Kahl (2010). Von Zäunen befreit - Industrieroboter mit Ultraschall absichern.
[26] Pilz GmbH & Co. KG (2017). EN ISO 13849-1: Performance Level (PL). URL: https://www.pilz.com/de-DE/knowhow/law-standards-norms/functional-safety/en-iso-13849-1.
[27] Reinhardt, Derek W. (2004). "Use of the C++ Programming Language in Safety Critical Systems". Thesis. University of York. URL: https://pdfs.semanticscholar.org/c7d1/ca2b4aade2c7d5a8784dddaf401f17e06853.pdf.
[28] Rolle, Ingo (2013). "Funktionale Sicherheit programmierbarer elektronischer Systeme". In: Funktionale Sicherheit - Echtzeit 2013.
[29] Rossi, Ben (2017). The Fourth Industrial Revolution: Technology alliances lead the charge. URL: http://www.information-age.com/fourth-industrial-revolution-technology-alliances-lead-charge-123465633/.
[30] Schwan, Ben (2013). Kollege Roboter: BMW testet Zusammenarbeit von Mensch und Roboter. URL: https://www.heise.de/newsticker/meldung/Kollege-Roboter-BMW-testet-Zusammenarbeit-von-Mensch-und-Roboter-1972138.html.
[31] Stroustrup, Bjarne (2005). "A rationale for semantically enhanced library languages". In: LCSD. URL: http://www.stroustrup.com/SELLrationale.pdf.
[32] Stroustrup, Bjarne and Kevin Carroll (2006). C++ in Safety-Critical Applications: The JSF++ Coding Standard.
[33] Stroustrup, Bjarne and Herb Sutter (2017). C++ Core Guidelines.
[34] Williams, Stephen (1997). "Embedded Programming with C++". In: Third USENIX Conference on Object-Oriented Technologies and Systems.



Challenges in Virtualizing Safety-Critical<br />

Cyber-Physical Systems<br />

Alessandro Biondi, Mauro Marinoni,<br />

and Giorgio Buttazzo<br />

Scuola Superiore Sant’Anna<br />

Pisa, Italy<br />

{alessandro.biondi, mauro.marinoni,<br />

giorgio.buttazzo}@santannapisa.it<br />

Claudio Scordino and Paolo Gai<br />

Evidence SRL<br />

Pisa, Italy<br />

{claudio, pj}@evidence.eu.com<br />

Abstract — Embedded computing platforms are evolving towards heterogeneous architectures that require new software support for simplifying their usage, optimizing the available resources, and providing predictable runtime behavior for managing concurrent safety-critical applications. This paper describes the main challenges in providing such software support through virtualization techniques, while taking into account safety requirements, security issues, and real-time performance. An automotive application is considered as a case study to illustrate some of the presented concepts.

Keywords — Heterogeneous platforms, embedded computing,<br />

real-time systems, virtualization, hypervisor.<br />

I. INTRODUCTION<br />

The design of computing infrastructures for modern cyber-physical systems faces two major trends that are significantly steering the development process of embedded software. On the one hand, recent years have been characterized by a continuous increase in software complexity to meet ever richer functional requirements and to support new technologies. At the same time, computing platforms are evolving toward heterogeneous designs that integrate multiple components such as multicore processors, general-purpose graphics processing units (GPGPUs), and field programmable gate arrays (FPGAs), which allow power-efficient parallel execution of multiple software systems at the cost of a paradigm shift in their development.

These two trends are increasingly pushing software designers to integrate a higher number of functions on the same hardware platform, typically resorting to methodologies such as component-based software design (CBSD) and also facing the problem of incorporating legacy software. Furthermore, in many industrial fields, integration is considered the most affordable solution to problems related to space, weight, power, and cost (SWaP-C).

Virtualization of computational resources has established itself as a de-facto technique to address these needs while efficiently exploiting the processing power of modern platforms.

Virtualization is typically achieved via hypervisors (also called<br />

virtual machine monitors), which allow executing multiple<br />

software domains upon the same platform, each of them<br />

possibly executing a different operating system (OS). The<br />

domains benefit from the illusion of disposing of a dedicated<br />

computing platform, while in reality the access to the shared<br />

computational resources is regulated by the hypervisor, which<br />

typically offers to the domains sets of virtualized memory<br />

address spaces, CPUs, and possibly peripherals. Nowadays, this<br />

technology is increasingly adopted to realize multi-OS<br />

solutions [22] for mixed-criticality systems, integrating a<br />

mission-critical real-time operating system (e.g., to perform<br />

sensing, control, and actuation tasks), with rich, non-critical<br />

operating systems such as Linux, which exploit a large<br />

availability of drivers, libraries, and connectivity stacks.<br />

Realistic designs possibly also include the integration of legacy<br />

software systems as-a-whole, i.e., with their original operating<br />

system, drivers, and configurations, thus favoring the evolution<br />

of cyber-physical systems towards centralized schemes with<br />

few but powerful computing platforms.<br />

Orthogonally to these major trends, designers of new-generation embedded software cannot neglect safety and security needs, which inevitably affect the functionality provided by virtualization stacks. The former are driven by increasingly stringent legal regulations and certifiability requirements, while the latter are becoming of paramount importance due to the exposure of embedded computing platforms through network connections. The integration of components with different safety and security levels (also known as MILS systems) may pose hazards in guaranteeing key requirements of the critical software such as timing constraints and data integrity and confidentiality. For instance, if no proper isolation mechanisms are provided by the hypervisor, a malfunction or an attack affecting a low-criticality domain may arbitrarily delay the execution of critical tasks, thus compromising the system behavior or strongly jeopardizing its performance.

The joint consideration of all these aspects poses several challenges in the development of suitable virtualization layers. The scope of this short paper is to discuss some of these challenges, with a particular focus on temporal and spatial isolation of software domains, timing predictability, resource



contention, and the management of hardware-based security<br />

technologies.<br />

II. BACKGROUND

A. Hypervisors<br />

The concept of hypervisors dates back to the 1960s [13], but it became significant in the last decade as a fundamental solution to harness the complexity of modern hardware platforms and of the multiple applications executing concurrently on top of them. This need for isolation can take different forms depending on the specific application requirements and the underlying platform.

The platform on which the hypervisor executes is denoted as the host machine, and each virtual machine managed by the hypervisor is called a guest. The two main features on which the classification of a hypervisor is based concern the type of implementation and the abstraction provided to the guest virtual machine. There are two types of hypervisor:

● Type-1, also called native or bare-metal, which runs directly on the host hardware to control it and to handle guest operating systems;

● Type-2, also called hosted, where the hypervisor is provided as an extension of an operating system executing on the host, while the guests run as tasks.

Another element of distinction comes from the API exposed by the host to the generic guest OS:

● In fully virtualized solutions the guest executes transparently and without software modifications, while the hypervisor provides the API to emulate the underlying platform;

● In a paravirtualized implementation the guest is aware of the presence of virtualization; thus it uses an API similar, but not identical, to that of the underlying hardware. This makes it possible to create specific solutions and reduce the overhead.

Due to the advantages of higher flexibility and of requiring no modifications in the guest domains, hardware manufacturers started providing virtualization extensions to support full virtualization, which minimize the overheads resulting from the emulation of the underlying platform.

B. Existing solutions<br />

The wide range of application scenarios and platforms<br />

fostered the creation of a significant number of hypervisors,<br />

each of them with a focus on a subset of the several issues<br />

concerning virtualization. Moreover, the profound interaction<br />

between the hypervisor and the hardware platform leads to a<br />

considerable effort when porting the hypervisor to a new<br />

architecture, also due to the extensive use of specific platform<br />

features to improve performance. The result is a reduced set of<br />

hypervisors available for each particular platform.<br />

Since some application fields, like mainframes, cloud infrastructures, and virtualized network infrastructures, highly benefit from virtualization and rely massively on Linux, several hypervisors pivoting on the latter have been developed. Among the first and most famous is Xen [14], which executes Linux in a privileged domain called dom0. Its wide range of supported platforms is considered one of its main advantages, but also a drawback, because it has led to a considerable codebase. A similar approach is followed by KVM [15], a virtualization infrastructure available in the mainline Linux kernel that turns it into a type-1 hypervisor. Jailhouse [16] is a type-1 partitioning hypervisor, more concerned with isolation than with virtualization, aiming at a small and lightweight hypervisor targeting industrial-grade applications. Like Xen, Jailhouse requires Linux to provide the management interface, which allowed keeping the size of the source code small. Like KVM, it is loaded from a regular Linux system, but once started it takes full control of the hardware and splits the hardware resources into isolated compartments (called cells) that are entirely dedicated to guest software programs (called inmates). One cell runs the Linux OS and is known as the root cell; it is similar to dom0 in Xen, but does not assert full control over hardware resources as dom0 does.

When dealing with embedded systems and their possible safety and security requirements, it is essential to exploit solutions characterized by a small codebase, both for SWaP and for certification issues. Xvisor [17] is a type-1 hypervisor aiming at providing an entirely monolithic, lightweight and portable virtualization solution. Its most appealing characteristic is that it provides full virtualization, and therefore supports a wide range of unmodified guest operating systems. NOVA [18] is an academic hypervisor designed at TU Dresden. It follows the micro-kernel approach and has been developed in the C++ programming language. Another significant feature is its fixed-priority preemptive scheduler with execution-time budgets and priority inheritance. XtratuM [19] is a hypervisor specially designed for real-time embedded systems, providing fixed-priority scheduling and relying on paravirtualization. Fiasco [20] is a hypervisor based on the L4 ABI and is implemented in the C++ programming language. The Fiasco kernel is enriched by a broad set of user-space components, collectively called the L4 Runtime Environment (L4Re). Attempts have been made to exploit the TrustZone security features available on modern ARM processors in hypervisors; an example is the SierraVisor [21] hypervisor. Despite all the effort from these and other projects, there are still significant issues to be addressed before a considerable level of isolation and virtualization can be provided for modern heterogeneous platforms. The next section outlines some of the most significant ones.

III. MAJOR CHALLENGES<br />

A. Achieving effective isolation on multicores<br />

Isolation capabilities are of paramount importance for a<br />

hypervisor to be used within a mixed-criticality system. Two<br />

types of isolation can be identified: spatial and temporal. Most<br />

(if not all) solutions provide support for spatial isolation of<br />

memory spaces, which is typically achieved by means of<br />

memory virtualization leveraging memory management units<br />

(MMU). Temporal isolation is generally realized by reserving<br />



dedicated CPUs to a domain, or by implementing bandwidth<br />

reservation schemes for the CPU time, e.g., by reserving a<br />

budget of execution time that is periodically provided to a<br />

domain by the hypervisor scheduler.<br />
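The budget-based CPU reservation just described can be sketched as follows. This is an illustrative model only (names and the replenishment policy are not taken from any particular hypervisor): each domain receives a budget of execution time per period, and the scheduler suspends the domain once the budget is exhausted.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Illustrative sketch of a periodic CPU budget server: each domain is
 * granted `budget_us` microseconds of CPU time every `period_us`
 * microseconds; when the budget is exhausted the hypervisor scheduler
 * suspends the domain until the next replenishment. */
typedef struct {
    uint64_t period_us;      /* replenishment period */
    uint64_t budget_us;      /* budget granted per period */
    uint64_t remaining_us;   /* budget left in the current period */
    uint64_t next_replenish; /* absolute time of next replenishment */
} domain_budget;

/* Charge `used_us` of execution to the domain; returns true if the
 * domain may keep running, false if it must be suspended until the
 * next replenishment. */
bool budget_charge(domain_budget *d, uint64_t now_us, uint64_t used_us)
{
    if (now_us >= d->next_replenish) {          /* a new period began */
        d->remaining_us = d->budget_us;
        d->next_replenish += d->period_us *
            ((now_us - d->next_replenish) / d->period_us + 1);
    }
    d->remaining_us = (used_us >= d->remaining_us)
                        ? 0 : d->remaining_us - used_us;
    return d->remaining_us > 0;
}
```
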

Although these features are fundamental, and in fact are widely supported by open-source and commercial hypervisors, they are not enough to guarantee effective isolation on commercial<br />

off-the-shelf (COTS) multicore platforms. Indeed, even if the<br />

domains access separate memory regions, and execute upon<br />

disjoint sets of CPUs, mutual interference is still possible due<br />

to the implicit contention of architectural resources such as<br />

caches and memory banks. These resources are typically not<br />

under the control of the hypervisor, but rather they are<br />

transparently managed by chip subsystems (e.g., the memory<br />

controller) that in most cases are not conceived to enforce<br />

isolation nor to guarantee timing predictability [5][6].<br />
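Cache coloring, one of the software techniques surveyed in this section, partitions a physically-indexed shared cache by restricting which page frames each domain may receive: the "color" of a frame is derived from the physical-address bits that select the cache set. A minimal sketch follows; the cache geometry values are illustrative, not taken from any specific platform.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative geometry: 2 MiB 16-way shared L2 with 64-byte lines
 * and 4 KiB pages.  Set-index bits above the page offset determine
 * the page "color"; frames of different colors can never conflict
 * in the shared cache. */
#define LINE_SIZE    64u
#define NUM_SETS     (2u * 1024u * 1024u / (LINE_SIZE * 16u)) /* 2048 */
#define PAGE_SIZE    4096u

#define SET_BITS     11u  /* log2(NUM_SETS)  */
#define OFFSET_BITS  6u   /* log2(LINE_SIZE) */
#define PAGE_BITS    12u  /* log2(PAGE_SIZE) */

/* Colors = set-index bits not covered by the page offset. */
#define NUM_COLORS   (1u << (SET_BITS + OFFSET_BITS - PAGE_BITS)) /* 32 */

/* Color of a physical page frame. */
static inline unsigned page_color(uint64_t paddr)
{
    return (unsigned)((paddr >> PAGE_BITS) & (NUM_COLORS - 1));
}

/* A colored allocator hands a domain only frames whose color falls in
 * its reserved range, statically partitioning the shared cache. */
static inline int color_allowed(uint64_t paddr,
                                unsigned first, unsigned count)
{
    unsigned c = page_color(paddr);
    return c >= first && c < first + count;
}
```

With this mapping, a domain confined to colors 0-3 can never evict lines belonging to a domain confined to colors 4-31, regardless of its memory access pattern.<br />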

Figure 1 - Inter-core interference in accessing a shared level of cache<br />

For instance, consider a quad-core platform with private level-1 caches for each core and a shared level-2 cache, as illustrated in Figure 1. Suppose that a critical real-time operating system is executing upon the first core, while the remaining three cores are dedicated to a general-purpose Linux domain. The execution of the critical domain results in fetching data and code from the main memory, consequently populating the level-2 shared cache (green box in the figure). In parallel, the Linux domain can also populate the same cache, with the result that the content stored by the critical domain can be evicted, provoking cache misses at the next access. This phenomenon may generate large and unpredictable interference across domains, thus breaking isolation by introducing a strong coupling of their timing properties. Conversely, if the Linux domain is subject to an attack or a malfunction such that it floods the system with memory transactions, proper isolation mechanisms should shield the critical domain.<br />

To further complicate the problem, inter-domain interference can also arise when accessing the main memory, e.g., upon cache misses. The access to DRAM memories is subject to highly variable delays that depend on the actual memory location to be accessed and on simultaneous pending memory transactions. Furthermore, DRAM memory controllers generally resort to scheduling algorithms that reorder memory accesses with the aim of improving throughput. While these algorithms provide benefits in the average case, they leave room for pathological scenarios that lead to high worst-case latencies, hence harming system predictability.<br />

In the literature, several clever solutions have been proposed to solve these kinds of issues in non-virtualized multicore systems. Software-based approaches such as cache coloring or cache lockdown [7] can be employed to partition the amount of cache used by a core or, more generally, by a set of software tasks. Reservation of memory bandwidth [5] and bank-aware memory allocators [6] have also been proposed to control the contention in accessing the main memory. Nevertheless, to the best of our knowledge, adequate support for such techniques is limited in commercial hypervisors.<br />

Modica et al. [8] realized effective isolation mechanisms for shared caches and main memories in an open-source hypervisor targeting ARM platforms. The authors developed a new virtual memory allocator that employs cache coloring to statically isolate the amount of shared cache reserved to each domain. Furthermore, a bandwidth reservation mechanism for accesses to main memory has been integrated with the hypervisor scheduler. Their experimental results showed that inter-domain interference can increase the execution time of state-of-the-art benchmarks by up to 50%, while the realized mechanisms can restore isolation at the price of degrading average-case performance.<br />

B. Virtualization of FPGAs and GPGPUs<br />

Heterogeneous platforms that include FPGAs and/or<br />

GPGPUs represent very attractive and powerful solutions to<br />

implement modern cyber-physical systems, but at the same time<br />

they introduce new problems in terms of resource management.<br />

Concerning virtualized systems, FPGAs and GPGPUs should<br />

also be controlled by the hypervisor and made available to<br />

domains in a controlled manner.<br />

Modern FPGAs offer dynamic partial reconfiguration<br />

(DPR) capabilities, which allow reprogramming a portion of the<br />

FPGA area while the rest continues to operate. This feature may be used to virtualize the FPGA area, supporting in time sharing several hardware modules and accelerators whose overall area consumption exceeds the area actually available on the platform. A framework [11] has also been proposed to ensure that the reconfiguration and area contention delays are predictable, thus making the adoption of this technique realistic in the context of critical systems. Static FPGA<br />

virtualization is also possible by controlling its configuration<br />

phases. Unfortunately, no integration within a hypervisor is<br />

today available.<br />

Similarly, work has also been dedicated to the development of<br />

software mechanisms to integrate the advantages of GPGPU<br />

into the virtualization paradigm. Hong et al. [23] provided an<br />

overview of the state-of-the-art of virtualization techniques,<br />

hardware supports, and scheduling mechanisms for multiple<br />

concurrent requests. They also outlined a list of challenges that<br />

still need to be addressed to improve the exploitation of<br />



GPGPUs, ranging from overhead reduction to energy<br />

management, from scalability and space optimization to<br />

security.<br />

Another issue is that modules deployed<br />

onto the FPGA and GPGPUs can typically act as memory<br />

masters on the system bus, hence (i) generating additional<br />

memory interference (e.g., see [10]) that complicates the<br />

problems discussed in the previous section, and (ii) potentially<br />

exposing memories to uncontrolled accesses that may bypass<br />

the spatial isolation. The first problem needs to be addressed<br />

with adequate support, such as specialized software-based<br />

memory bandwidth controllers, or in the case of FPGAs with<br />

the development of hardware bandwidth controllers deployed<br />

onto the FPGA and managed by the hypervisor. The second<br />

problem requires dealing with virtualization techniques and<br />

components such as I/O MMUs.<br />

C. Supporting hardware-based security technologies<br />

Due to the external exposure by means of network and bus<br />

connections, security issues have become central aspects in the<br />

design and development of modern embedded computing<br />

systems. Although a rich set of software-based techniques has<br />

been developed to increase the security level of a software<br />

system, cyber attacks are becoming increasingly complex, defeating most attack mitigation techniques<br />

and/or exploiting incorrect software configurations. With the<br />

intent of providing a robust support to implement security<br />

features, chip makers are moving towards architectures that<br />

offer hardware-based solutions to realize trusted execution<br />

environments (TEEs). TEEs must be strictly isolated from the<br />

normal execution environment and should also have<br />

dedicated computing resources.<br />

One of the most popular of such technologies is TrustZone<br />

developed by ARM. TrustZone provides hardware-based<br />

isolation of two execution worlds: secure, conceived to support<br />

the execution of a TEE, and non-secure, which is provided to<br />

host the execution of a rich (classical) operating system.<br />

TrustZone-enabled chips may also include support for secure<br />

boot, i.e., cryptographic validation of the firmware to be<br />

executed, and cryptographic hardware accelerators. The<br />

introduction of such features poses new challenges when<br />

realizing a security-aware virtualization stack.<br />

First, there is the need to virtualize such hardware-based<br />

security technologies to allow the coexistence of multiple<br />

domains each potentially comprising a TEE running in a<br />

virtualized secure world. Initial attempts in this direction have<br />

been made by Cicero et al. [9], who proposed an open-source<br />

dual-hypervisor solution where two jointly-configured<br />

hypervisors are employed to virtualize secure and non-secure<br />

worlds, respectively, both orchestrated by a monitor firmware<br />

that handles world switches and dispatches interrupt signals.<br />

This solution avoids the existence of a single point of failure<br />

and aims at containing the run-time overhead. Remarkable<br />

efforts have also been spent by Hua et al. [12], who proposed<br />

a centralized solution to virtualize TrustZone by building upon<br />

the Xen hypervisor.<br />

Second, hypervisors should offer the virtualization of<br />

cryptographic hardware resources, possibly guaranteeing strict<br />

integrity and confidentiality of data even in the presence of side-channel attacks. Built-in support for software-based attack<br />

mitigation techniques such as data execution prevention (DEP),<br />

address-space layout randomization (ASLR), and control flow<br />

integrity (CFI) is also desirable. These techniques require careful<br />

attention when integrated with virtualization mechanisms.<br />

Third, in order to support component-based software<br />

design and possibly open environments, hypervisors should<br />

provide software authentication mechanisms also at the level of<br />

domains, paying particular attention to rollback-based attacks.<br />

The authors believe that the list is not limited to the above-mentioned challenges and that security-related aspects will<br />

likely steer the design of future virtualization software.<br />

IV. THE AUTOMOTIVE CASE<br />

As a proof of concept, this section describes a realistic<br />

scenario related to the automotive domain in which<br />

virtualization is applied. The described solution, from the<br />

RETINA project [1], aims at providing an AUTOSAR-compliant<br />

software stack for next-generation automotive<br />

systems. The stack allows the integration of components with<br />

different criticality levels onto modern multi-core SoCs,<br />

reducing the overall time-to-market and manufacturing costs.<br />

At the lowest level, the stack consists of a hypervisor to<br />

enforce isolation (thus, reliability and safety) between the guest<br />

operating systems. The RETINA project relies on Jailhouse [2],<br />

a small and lightweight type-1 hypervisor developed by<br />

Siemens and released as open-source software. The hypervisor<br />

supports both x86-64 and ARM-based platforms, provided that hardware virtualization support is available. Rather than<br />

providing resource virtualization and scheduling (like the<br />

Xen hypervisor), Jailhouse focuses on isolation and resource<br />

partitioning. For this reason, there is no intra-core scheduling<br />

(i.e., each core cannot run more than one guest OS) and<br />

resources are statically assigned to only one guest. This static<br />

approach makes it possible to:<br />

● provide average latencies and jitter similar to bare-metal<br />

solutions, due to the low run-time overhead;<br />

● ease potential certification processes in the future,<br />

thanks to a very small codebase.<br />
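The static-partitioning idea behind this approach can be made concrete with a small sketch. Note that this is NOT the actual Jailhouse configuration format, only a conceptual illustration: every CPU and memory region belongs to exactly one cell, and the hypervisor only has to check that the cells are pairwise disjoint.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Conceptual sketch of static partitioning in the style of Jailhouse:
 * CPUs and memory are assigned to exactly one cell at configuration
 * time, so no intra-core scheduling is ever needed.  This is not the
 * real Jailhouse configuration format, only an illustration. */
typedef struct {
    const char *name;
    uint32_t    cpu_mask;  /* CPUs owned exclusively by this cell */
    uint64_t    mem_base;  /* start of the cell's memory region   */
    uint64_t    mem_size;
} cell_config;

/* Two cells must share neither CPUs nor memory. */
static int cells_disjoint(const cell_config *a, const cell_config *b)
{
    int cpus_overlap = (a->cpu_mask & b->cpu_mask) != 0;
    int mem_overlap  = a->mem_base < b->mem_base + b->mem_size &&
                       b->mem_base < a->mem_base + a->mem_size;
    return !cpus_overlap && !mem_overlap;
}
```
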

On top of the hypervisor, the RETINA project runs two<br />

guest OSs with different criticality levels. The real-time and<br />

safety-critical tasks are run by the ERIKA Enterprise RTOS [3].<br />

ERIKA Enterprise is a tiny RTOS (i.e., a footprint of a few KB)<br />

designed and certified for the automotive market. It is<br />

developed by Evidence Srl and released as open-source<br />

software under a dual licensing model.<br />

The less critical tasks (e.g., HMI, logging, etc.), instead, are<br />

executed on a Linux guest, enhanced with the<br />

PREEMPT_RT real-time patch [4] when needed. The<br />

communication between the two OSs is done by means of a<br />

library exposing an API similar to the one specified by the<br />

AUTOSAR COM standard. The library is meant to be used by<br />

an AUTOSAR Run-Time Environment (RTE) generator<br />

developed by Evidence Srl for its RTOS. The most critical tasks are<br />

run using the SCHED_DEADLINE Linux scheduler [17].<br />
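A SCHED_DEADLINE reservation is configured through the Linux sched_setattr(2) system call; the sketch below only builds the attribute block (the struct layout follows the man page, since glibc does not export it; actually applying it requires the syscall and appropriate privileges, which is why the call itself is left as a comment).<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* SCHED_DEADLINE policy number as defined by the Linux kernel. */
#define SCHED_DEADLINE 6

/* Layout of struct sched_attr as documented in sched_setattr(2);
 * glibc does not export it, so programs define it themselves. */
struct sched_attr {
    uint32_t size;
    uint32_t sched_policy;
    uint64_t sched_flags;
    int32_t  sched_nice;
    uint32_t sched_priority;
    uint64_t sched_runtime;   /* ns of budget per period  */
    uint64_t sched_deadline;  /* relative deadline in ns  */
    uint64_t sched_period;    /* reservation period in ns */
};

/* Fill a deadline reservation: runtime <= deadline <= period.
 * Applying it would be: syscall(SYS_sched_setattr, 0, &attr, 0),
 * which needs CAP_SYS_NICE or root. */
void make_deadline_attr(struct sched_attr *a, uint64_t runtime_ns,
                        uint64_t deadline_ns, uint64_t period_ns)
{
    memset(a, 0, sizeof *a);
    a->size           = sizeof *a;
    a->sched_policy   = SCHED_DEADLINE;
    a->sched_runtime  = runtime_ns;
    a->sched_deadline = deadline_ns;
    a->sched_period   = period_ns;
}
```
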



Figure 2 summarizes the main components of the automotive software stack described above.<br />

Figure 2 - Multi-OS automotive software stack developed for the RETINA project.<br />

V. CONCLUSIONS<br />

This paper presented some of the major challenges in providing software support for exploiting modern heterogeneous platforms in complex safety-critical systems consisting of several interacting components with real-time requirements. Virtualization techniques, successfully used to isolate the behavior of software components running on the same processor, are being considered for extension to the management of other architectural resources, such as shared memories, and other computational units, such as FPGAs and GPUs. Issues concerning safety, security, and real-time performance were also discussed and illustrated using a case study taken from the automotive domain.<br />

REFERENCES<br />

[1] RETINA EUROSTARS project, http://retinaproject.eu/<br />
[2] Siemens, Jailhouse hypervisor, https://github.com/siemens/jailhouse<br />
[3] Evidence Srl, ERIKA Enterprise RTOS, http://www.erika-enterprise.com/<br />
[4] The Linux Foundation, Real-Time collaborative project, https://wiki.linuxfoundation.org/realtime<br />
[5] H. Yun, G. Yao, R. Pellizzoni, M. Caccamo, and L. Sha, “MemGuard: Memory bandwidth reservation system for efficient performance isolation in multi-core platforms,” in Proc. of the 19th IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), 2013, pp. 55-64.<br />
[6] H. Yun, R. Mancuso, Z. P. Wu, and R. Pellizzoni, “PALLOC: DRAM bank-aware memory allocator for performance isolation on multicore platforms,” in Proc. of the IEEE Real-Time and Embedded Technology and Applications Symposium (RTAS), April 2014.<br />
[7] G. Gracioli, A. Alhammad, R. Mancuso, A. A. Fröhlich, and R. Pellizzoni, “A survey on cache management mechanisms for real-time embedded systems,” ACM Comput. Surv., vol. 48, no. 2, Nov. 2015.<br />
[8] P. Modica, A. Biondi, G. Buttazzo, and A. Patel, “Supporting temporal and spatial isolation in a hypervisor for ARM multicore platforms,” in Proc. of the IEEE International Conference on Industrial Technology (ICIT 2018), Feb. 2018.<br />
[9] G. Cicero, A. Biondi, G. Buttazzo, and A. Patel, “Reconciling Security with Virtualization: A Dual-Hypervisor Design for ARM TrustZone,” in Proc. of the IEEE International Conference on Industrial Technology (ICIT 2018), Feb. 2018.<br />
[10] B. Forsberg, A. Marongiu, and L. Benini, “GPUguard: Towards supporting a predictable execution model for heterogeneous SoC,” in Proc. of Design, Automation & Test in Europe (DATE 2017), Lausanne, 2017, pp. 318-321.<br />
[11] A. Biondi, A. Balsini, M. Pagani, E. Rossi, M. Marinoni, and G. Buttazzo, “A framework for supporting real-time applications on dynamic reconfigurable FPGAs,” in Proc. of the IEEE Real-Time Systems Symposium (RTSS 2016), December 2016, pp. 1-12.<br />
[12] Z. Hua, J. Gu, Y. Xia, H. Chen, B. Zang, and H. Guan, “vTZ: Virtualizing ARM TrustZone,” in Proc. of the 26th USENIX Security Symposium, 2017.<br />
[13] R. Adair, R. Bayles, L. Comeau, and R. Creasy, “A virtual machine system for the 360/40,” Technical Report 320-2007, IBM Corporation, Cambridge Scientific Center, May 1966.<br />
[14] Xen project, https://www.xenproject.org/<br />
[15] Linux Kernel Virtual Machine, http://www.linux-kvm.org/page/Main_Page<br />
[16] Jailhouse project page, https://github.com/siemens/jailhouse<br />
[17] J. Lelli, C. Scordino, L. Abeni, and D. Faggioli, “Deadline scheduling in the Linux kernel,” Software: Practice and Experience, 46(6): 821-839, June 2016.<br />
[18] NOVA hypervisor, http://www.hypervisor.org<br />
[19] XtratuM project page, http://www.xtratum.org<br />
[20] Fiasco project page, https://l4re.org/fiasco/<br />
[21] SierraVisor, http://www.openvirtualization.org<br />
[22] PikeOS hypervisor, https://www.sysgo.com/products/pikeos-hypervisor/<br />
[23] C.-H. Hong, I. Spence, and D. Nikolopoulos, “GPU Virtualization and Scheduling Methods: A Comprehensive Survey,” ACM Computing Surveys, vol. 50, pp. 1-37, 2017.<br />



Security In Manufacturing<br />

Closing the Backdoor in IoT Products<br />

Josh Norem<br />

Ass. Staff Systems Engineer<br />

Silicon Labs<br />

joshua.norem@silabs.com<br />

Abstract— It is common for system developers to devote a lot of<br />

time and attention to developing secure products and ensuring<br />

that their devices are difficult to exploit in the field. Unfortunately,<br />

security in the build process and supply chain receives much less<br />

consideration. In this paper, we discuss the various attack vectors<br />

present in the process of designing, building and testing IoT<br />

systems as well as methods for preventing these attacks.<br />

Keywords— IoT, Security, Manufacturing, Assembly<br />

I. INTRODUCTION<br />

It’s well understood that any secure system is only as strong<br />

as its weakest component. Unfortunately, it’s all too common<br />

to forget that every step in the manufacturing process is a<br />

component in that system. While much has been written about<br />

the security of wireless protocols, ICs, and deployed systems,<br />

securing the manufacturing process for those systems is often<br />

forgotten.<br />

To illustrate this, let’s examine how we might attack an<br />

embedded system. In our case we’ll use a smart lock as an<br />

example. First, it’s important to note that if we are serious about<br />

attacking this system, we probably don’t want to compromise<br />

just one lock. We want to create a systematic exploit that can<br />

be used against any lock and then sold to others who want to<br />

bypass one specific lock in the field.<br />

The manufacturer of this lock has anticipated our attack and<br />

spared no expense creating a secure product. From multiple<br />

code reviews, to anti-side-channel-attack hardware, to<br />

extensive penetration testing, the product is well designed and<br />

well protected. This would be a problem if we were going to<br />

attack the lock itself, but luckily, we have another option.<br />

Instead, we’re going to attack the contract manufacturer (CM)<br />

that assembles and tests the lock. It is almost universally<br />

required for firmware images to be transferred, stored, and<br />

programmed in plain text. All we need to do is bribe one of the<br />

CM employees to give us the image, and then swap it out with<br />

an image we modified. The firmware will be nearly identical,<br />

but with a backdoor we can exploit whenever we wish. The CM<br />

will then be manufacturing fundamentally compromised<br />

devices for us.<br />

Our exploit requires no special hardware and only a<br />

moderate amount of sophistication to develop, which makes it<br />

extremely cheap to create. It also completely bypasses all the<br />

time and effort the manufacturer spent to make their product<br />

secure.<br />

A. Protecting Firmware Integrity<br />

The fundamental problem in manufacturing is that with<br />

current embedded processors it’s very difficult to guarantee the<br />

integrity of a firmware image. If the firmware is programmed<br />

in plain text, we can easily modify it on the test system as shown<br />

in block diagram 1 of Figure 1, where the red marker indicates<br />

code vulnerable to attack.<br />

If the manufacturer decides to encrypt their code and load it<br />

via a secure boot loader, we attack the boot loader, which had<br />

to be stored and programmed in plain text. This is shown in<br />

block diagram 2 of Figure1. If the manufacturer uses external<br />

test hardware to verify the firmware after it’s programmed, we<br />

attack both the firmware and the code that checks it, as shown<br />

in block diagram 3 of Figure 1. No matter how many layers are<br />

added, we ultimately reach something that had to be<br />

programmed in plain text and can be attacked.<br />

Figure 1: Points of attack in firmware programming<br />



It’s also worth noting that manufacturing is not the only time<br />

code can be modified. For example, an exploit that results in<br />

arbitrary code execution becomes much more valuable if it can<br />

permanently install itself by reprogramming the device. A<br />

complete solution to the problem of code integrity in<br />

manufacturing also addresses other sources of firmware image<br />

corruption.<br />

B. Protecting Firmware Confidentiality<br />

In addition to ensuring that a system is programmed with the<br />

intended firmware, it may sometimes be necessary to protect<br />

the confidentiality of that firmware. For example, if there is a<br />

proprietary algorithm we want to ensure competitors don’t have<br />

access to, we need to ensure that the code can’t be obtained by<br />

simply copying a file from our CM test/programming system.<br />

Implementing firmware confidentiality can be done in a<br />

variety of ways and benefits from other hardware-based<br />

security features. However, any confidential boot loading<br />

process that takes place at an untrusted CM will ultimately<br />

follow the same pattern. First the device is locked so that an<br />

untrusted manufacturing site can no longer access or modify the<br />

contents of the device. Then the device performs a key<br />

exchange with a trusted server using a private key that the<br />

manufacture never has access to, normally generated on the<br />

device after it is locked. Once the key exchange is complete,<br />

information can be passed confidentially between the trusted<br />

server and device.<br />

Confidentiality requires integrity. If an attacker can modify<br />

the device’s firmware to generate a known private key, then<br />

they can trivially decrypt the image sent to that device.<br />

While this paper focuses on providing firmware integrity,<br />

which is only one of the components needed to provide<br />

firmware confidentiality, more information on firmware<br />

confidentiality can be obtained from many sources including<br />

the author of this paper.<br />

C. Secure Debugging<br />

Another historic issue in the manufacturing process is the<br />

ability to diagnose issues in the field or when products are<br />

returned. For both the IC manufacturer and system developer,<br />

there is a need to gain access to locked devices to perform this<br />

analysis. Historically this has been done by introducing<br />

backdoor access, which is by definition a security hole.<br />

The most common approach to this problem is to allow<br />

unlock + erase such that a device can be unlocked but all flash<br />

is erased during the unlock process. This process has several<br />

drawbacks. First, in some cases access to the current contents<br />

of flash may be needed for debug purposes and will not be<br />

available. Second, this opens a security hole for attacks centered<br />

on erasing and reprogramming the device with modified code.<br />

Other approaches provide an unrestricted back door that<br />

unlocks without erasing, or offer a permanent lock that will<br />

protect the part but makes debug of failure impossible. Both<br />

options have some well understood drawbacks.<br />

II. FIRMWARE INTEGRITY HALF MEASURES<br />

There are some things we can do today to address this<br />

problem and make attacking our manufacturing process more<br />

difficult and less profitable.<br />

A. Sampling<br />

The simplest solution is to implement a sampling<br />

authentication program in another site. For example, we could<br />

pick systems at random (say, one out of every 1000 we build)<br />

and have them sent to an engineering/development site where<br />

we read out the firmware and validate it. If someone tampers<br />

with our CM, this sampling will indicate that this has happened.<br />

To circumvent this check, the attacker must either compromise<br />

our engineering site in addition to the CM, or be able to know<br />

which devices will be sent for verification and exclude them<br />

from the attack.<br />

There is still a technical problem here. To authenticate the<br />

code at our engineering site, we need to be able to read that code<br />

out. Typically, MCUs are locked after production to ensure that<br />

memory cannot be modified or read out, which will also prevent<br />

us from checking that the contents are correct.<br />

It’s important to remember that our method of checking needs<br />

to assume any code on the device may be compromised. For<br />

example, one option is to have a verification function that<br />

computes a simple checksum or hash of the image that we can<br />

read out through a standard interface (UART, I2C).<br />

Unfortunately, that option relies on code that may be<br />

compromised to generate the hash. If an attacker has replaced<br />

our image, they can also replace our hashing function to return<br />

the expected value for a good image instead of re-computing it<br />

based on the contents of flash.<br />

To make this authentication work, we need to find an<br />

operation that can only be accomplished if the entire correct<br />

image is present in the device. One way of doing this would be<br />

to have our verification function simply dump out all the code.<br />

An even better idea is to have our function generate a hash of<br />

the image based on a seed the test system randomly generates<br />

and passes in. Now the attacker can’t simply store a<br />

precomputed hash because the hash value changes based on the<br />

seed. To respond with the correct result, the attacker’s code<br />

must now have access to the entire original image and correctly<br />

compute the hash.<br />
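The seeded-hash check described above can be sketched as follows. FNV-1a is used here purely for illustration (a production design would use a keyed cryptographic MAC such as HMAC-SHA-256), and the function names are ours, not a vendor API: the tester picks a random seed, the device hashes its entire flash image starting from that seed, and the tester compares the reply against the same hash computed over the golden image.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Device side: hash the whole flash image, mixed with the seed, so a
 * precomputed answer is useless -- the correct reply requires access
 * to the entire original image. */
uint64_t seeded_image_hash(uint64_t seed, const uint8_t *image, size_t len)
{
    uint64_t h = seed ^ 0xcbf29ce484222325ull;  /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= image[i];
        h *= 0x100000001b3ull;                  /* FNV prime */
    }
    return h;
}

/* Tester side: recompute over the golden image and compare. */
int image_authentic(uint64_t seed, uint64_t device_reply,
                    const uint8_t *golden, size_t len)
{
    return device_reply == seeded_image_hash(seed, golden, len);
}
```
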

B. Dual Site Manufacturing<br />

Similar to the sampling program, board assembly and<br />

programming could be carried out at one site and the boards<br />

tested at another. This has the benefit of catching an attack<br />

immediately and preventing any compromised units from being<br />

shipped. It also has all the drawbacks of the sampling method<br />

since it requires some way to authenticate the firmware during<br />

the test phase. It also has a higher cost to implement than the<br />

sampling method.<br />

It may be tempting to program but not lock the device<br />

during manufacturing and then lock after test. This would<br />

eliminate the need for special verification code since the<br />

contents of the device can simply be read out. However, for<br />

most embedded processors, leaving debug unlocked also leaves<br />

programming unlocked. In this case, attackers could<br />



compromise only the second (test) site and simply program<br />

their modified firmware there.<br />

C. Over the Air/Field Updates<br />

Another way to mitigate an attack on a connected system,<br />

as well as several other unrelated security issues, is to<br />

implement and use over-the-air (OTA) updates or some other<br />

periodic style of firmware update.<br />

In most OTA systems, any manufacturing-time<br />

modifications will be discovered or overwritten with the next<br />

OTA update. For a system that regularly rolls out updates,<br />

quarterly for example, the value of a factory compromise is<br />

greatly reduced if it’s only available for that short time. This is<br />

an excellent example of the value of in-field updates for secure<br />

systems.<br />

III. A FULL SOLUTION FOR FIRMWARE INTEGRITY<br />

The fully secure solution to this problem relies on hardware.<br />

Specifically, hardware must contain a hard-coded public<br />

authentication key and hard-coded instructions to use it. For this<br />

purpose, ROM is an excellent solution. Though ROM is<br />

notoriously easy to read through physical analysis, it is difficult<br />

to modify in a controlled and non-destructive way.<br />

Any firmware loaded into the device must then be signed.<br />

Out of reset, the CPU begins execution of ROM and can<br />

validate that the contents of flash are properly signed using the<br />

public authentication key, which is also stored in ROM. If an<br />

attacker attempts to load a modified version of the firmware,<br />

authentication will fail, and the part will not boot. To get a<br />

modified image to boot, the attacker would need to provide a<br />

valid signature for their modified firmware, which can only be<br />

generated using a well-protected private key.<br />

With a real IC, security measures are a bit more complex to<br />

support numerous use cases and to avoid security holes. The<br />

hard-coded public key (Manufacturer Public Key) will be the<br />

same for all devices since it is not modifiable. This makes it<br />

incredibly valuable, providing the root of trust for all devices.<br />

The associated private key (Manufacturer Private Key) must be<br />

closely guarded by the IC manufacturer and will never be<br />

provided to users to sign their own code.<br />

When booted, the Manufacturer Public Key will be used to<br />

validate any code provided by the IC manufacturer that resides<br />

in flash. This gives the ability to ensure that code or other<br />

information provided by the manufacturer is not tampered with<br />

as shown in step 1 of Figure 2.<br />

Users of the device will need to have their own key pair (User<br />

Private Key and User Public Key) for signing and<br />

authenticating their firmware images. To link the User Public<br />

Key into the root of trust, the IC manufacturer must sign the<br />

User Public Key with the Manufacturer Private Key creating a<br />

User Certificate. A certificate is simply a public key and some<br />

associated metadata that has been signed. When booted, the part<br />

authenticates the User Certificate using the Manufacturer<br />

Public Key, as shown in step 2 of Figure 2.<br />

Finally, the user firmware can be authenticated with the User<br />

Public Key in the known-valid User Certificate. This is shown<br />

in step 3 of Figure 2.<br />

Figure 2: A secure boot process<br />

It’s important to note that one additional step is required to<br />

lock a device to a specific user. The system described in steps<br />

1-3 can only ensure that the User Certificate was signed by the<br />

IC manufacturer. This does prevent a random person from<br />

reprogramming the part, but another legitimate customer of the<br />

IC manufacturer could write their code and their legitimately<br />

signed certificate onto the part, and it would boot. This<br />

effectively means that if an attacker can convince the IC<br />

manufacturer they are a legitimate customer and can generate a<br />

signed User Certificate, they can get the device to boot their<br />

code.<br />

To lock a part to a specific end user, the User Certificate<br />

needs to contain not only the User Public Key but also a user<br />

ID so that changing either the key or the ID will invalidate the<br />

certificate. The IC manufacturer will program the user ID into<br />

the Manufacturer Code area where it is protected by the<br />

Manufacturer Public Key. Finally, at boot time, in addition to<br />

verifying the signature of the User Certificate, the boot process<br />

also compares the user ID in the User Certificate against the one<br />

in the Manufacturer Code, as shown in step 4 of Figure 2.<br />



Let’s see what happens when someone attempts to modify<br />

each part of the system.<br />

• If the user identifier in the Manufacturer Code is changed, the signature of that space is no longer valid and the part does not boot.<br />

• If the user ID in the User Certificate is changed, it will not match the one in the Manufacturer Code and the part will not boot.<br />

• If either the User Public Key or the user ID in the User Certificate is modified, the certificate will be invalid and the part will not boot.<br />

• Finally, if the user firmware image is changed, the signature will be invalid and the part will not boot.<br />

We now have a system that will only boot firmware properly signed by the customer who ordered the part from the IC manufacturer. Furthermore, this entire system relies on only two secrets, the Manufacturer Private Key and the User Private Key, both of which are only ever accessed to sign new images and can be extremely well protected due to the infrequency of that process.<br />
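The boot-time checks described above can be sketched in C. This is a structural illustration only: `toy_sign` is a keyed checksum standing in for a real asymmetric signature scheme (e.g. ECDSA), so a single key value here represents a public/private key pair, and all names (`toy_sign`, `user_cert_t`, `secure_boot_ok`) are hypothetical, not from any real boot ROM.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Toy "signature": a keyed FNV-1a-style checksum standing in for a real
 * asymmetric signature. It only illustrates the structure of the checks. */
static uint32_t toy_sign(const uint8_t *data, size_t len, uint32_t key)
{
    uint32_t h = 2166136261u ^ key;
    for (size_t i = 0; i < len; i++)
        h = (h ^ data[i]) * 16777619u;
    return h;
}

/* A certificate binds a public key and a user ID to a signature made
 * with the Manufacturer Private Key. */
typedef struct {
    uint32_t user_public_key;  /* hypothetical key material */
    uint32_t user_id;
    uint32_t signature;
} user_cert_t;

/* Boot-time validation mirroring steps 1-4: check the User Certificate
 * against the ROM-resident Manufacturer Public Key, compare the user ID
 * with the one stored in the Manufacturer Code area, then check the
 * firmware image against the User Public Key. */
static int secure_boot_ok(const user_cert_t *cert, uint32_t manufacturer_key,
                          uint32_t expected_user_id,
                          const uint8_t *fw, size_t fw_len, uint32_t fw_sig)
{
    uint8_t body[8];
    memcpy(body, &cert->user_public_key, 4);
    memcpy(body + 4, &cert->user_id, 4);

    if (toy_sign(body, sizeof body, manufacturer_key) != cert->signature)
        return 0;  /* certificate tampered or not from the manufacturer */
    if (cert->user_id != expected_user_id)
        return 0;  /* part is locked to a different customer */
    if (toy_sign(fw, fw_len, cert->user_public_key) != fw_sig)
        return 0;  /* firmware image tampered */
    return 1;      /* all checks passed: boot the firmware */
}
```

Changing any byte of the firmware, the certificate, or the stored user ID makes one of the three checks fail, matching the failure cases listed above.<br />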

A. Additional Mitigations<br />

Of course, even this system requires correct construction of the certificate hierarchy. As defined above, the Manufacturer Private Key is extremely valuable, as it applies to every device the IC manufacturer ever builds. It is also accessed far too frequently, since it is constantly being used to sign User Certificates.<br />

This can be addressed by creating a different Manufacturer<br />

Public/Private Key pair for each die so that compromising one<br />

Manufacturer Private Key only exposes that die. Similarly,<br />

instead of directly signing User Certificates with the<br />

Manufacturer Private Key, a hierarchy of sub-keys can be<br />

developed and used for that operation such that a sub-key can be<br />

revoked by a Manufacturer Code update if compromised. The<br />

details of such schemes are beyond the scope of this paper, but<br />

many things are possible with the fundamental hardware root of<br />

trust established.<br />

IV. PROVIDE SECURE UNLOCK<br />

Providing secure debug unlock turns out to be a simple task.<br />

First, each system developer generates a key pair for debug<br />

access and programs the Public Debug Key onto the device. The<br />

integrity of that key can be established in the same manner as<br />

the user’s firmware, preventing anyone from tampering with the<br />

Public Debug Key. This is shown in step 5 of Figure 3. Each<br />

device is also provided with a unique ID, which is almost<br />

universally available on MCUs today.<br />

To unlock the part, its unique ID is read out (1) and signed<br />

with the Private Debug Key (2), creating an Unlock Certificate.<br />

The Unlock Certificate is then fed into the device for<br />

authentication against the Public Debug Key (3). If it<br />

authenticates, the part is unlocked. This ensures only those with<br />

access to the Private Debug Key may generate an Unlock<br />

Certificate, and only those with an Unlock Certificate may<br />

unlock the part.<br />

Figure 3: A method of secure debug unlock<br />

The Private Debug Key can be stored on a secure server and<br />

be extremely well protected. Note that since the ID used by the<br />

device does not change, the process of generating an Unlock<br />

Certificate happens only once, and then that certificate may be<br />

used to unlock the part as long as is desired.<br />

A benefit of this method is that it generates Unlock<br />

Certificates on a per-device basis. That means it’s possible to<br />

grant unlock privileges to field service personnel or the IC<br />

manufacturer on only the device they are trying to diagnose.<br />

A drawback to this method is that once a valid Unlock<br />

Certificate is created, anyone with access to that certificate may<br />

unlock the device. To mitigate the risk of a valid Unlock Certificate being obtained by an attacker, a counter can be<br />

added to the end of the unique ID so that after an Unlock<br />

Certificate is no longer needed, it can be revoked by<br />

incrementing the counter via a debugger command. This will<br />

cause a new ID to be generated, and the old certificate will no<br />

longer be valid.<br />
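The unlock and revocation flow can be sketched as follows. As before, a keyed checksum (`toy_sign64`) stands in for the real asymmetric signature, so one key value represents the Debug Key pair here, and all function names are hypothetical.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Toy keyed checksum standing in for an asymmetric signature. */
static uint32_t toy_sign64(uint64_t message, uint32_t key)
{
    uint32_t h = 2166136261u ^ key;
    for (int i = 0; i < 8; i++)
        h = (h ^ (uint32_t)((message >> (8 * i)) & 0xff)) * 16777619u;
    return h;
}

/* The effective unlock ID is the device's fixed unique ID with a
 * revocation counter appended; bumping the counter invalidates every
 * previously issued Unlock Certificate. */
static uint64_t effective_id(uint32_t unique_id, uint32_t counter)
{
    return ((uint64_t)unique_id << 32) | counter;
}

/* Server side: sign the effective ID to create an Unlock Certificate. */
static uint32_t make_unlock_cert(uint32_t unique_id, uint32_t counter,
                                 uint32_t debug_key)
{
    return toy_sign64(effective_id(unique_id, counter), debug_key);
}

/* Device side: unlock only if the certificate matches the current
 * effective ID under the Debug Key. */
static int unlock_ok(uint32_t unique_id, uint32_t counter,
                     uint32_t debug_key, uint32_t cert)
{
    return toy_sign64(effective_id(unique_id, counter), debug_key) == cert;
}
```

In a real device, the check corresponding to `unlock_ok` would run against the authenticated Public Debug Key, and the counter would live in non-volatile memory so that revocation survives a reset.<br />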

Finally, it’s important to note that the more devices a private<br />

key gives access to, the more valuable it becomes. As a result,<br />

system developers may want to change debug unlock keys<br />

periodically to limit the number of devices affected in the event<br />

that a Private Unlock Key is compromised.<br />

V. OTHER MANUFACTURING CONSIDERATIONS<br />

A. Test-based Security Holes<br />

It is extremely common for the needs of manufacturing to<br />

result in the intentional or unintentional introduction of security<br />

holes.<br />

An example of an unintentional hole is when a system<br />

manufacturer forgets to disable the debug interface as part of<br />

their board test and ships units with a wide-open debug port.<br />

Even more common are intentional security holes. For<br />

example, a developer may want to provide a way to reopen<br />

debug access after locking and put in a ‘secret’ command or pin<br />

state to unlock the part. If discovered, this allows any attacker<br />

the same unlock capability in the field.<br />

Developers should always take care to implement<br />

manufacturing and test processes in a secure way. This includes<br />



avoiding any intentional security holes and conducting reviews<br />

to catch unintentional ones.<br />

B. The Offshore Process<br />

Any system that firmware images or test programs pass through as part of the manufacturing flow may be vulnerable to attack.<br />

attack. Having a secure system and secure manufacturing<br />

process won’t help if files are transferred to the CM through an<br />

FTP or email server that hasn’t been patched in ten years. Every<br />

place files are stored should be considered part of the system<br />

and secured.<br />

C. Product Development<br />

Just as product manufacturing is often an afterthought, product development is also often overlooked. The<br />

measures discussed in this paper will not be helpful if an<br />

attacker can commit unnoticed changes to the source code<br />

repository. Sometimes this takes the form of an external<br />

penetration (electronically or physically walking into the<br />

building), and sometimes by compromising an employee.<br />

Standard IT system security practices and standard coding<br />

practices play a huge role in preventing this type of attack.<br />

These practices include ensuring all PCs automatically lock<br />

when not in use, requiring user logins to access code<br />

repositories, performing code reviews on all repository<br />

commits, and performing test regressions on release candidates.<br />

VI. CONCLUSION<br />

Security is increasingly important in embedded systems.<br />

Products that were once standalone are now part of a network,<br />

increasing both their vulnerability and value. Much has been<br />

published in the past few years about securing IoT devices<br />

themselves, but not enough attention has been focused on<br />

ensuring security throughout the design and manufacturing<br />

processes.<br />

This paper has demonstrated how historical manufacturing<br />

processes can be easily compromised and has explored some<br />

simple steps that can be taken today in both design and<br />

manufacturing to make attacking a CM or engineering site more<br />

difficult and less profitable. In addition, we have presented<br />

some hardware improvements that can ensure firmware<br />

integrity and provide secure access for failure analysis and field<br />

debugging.<br />

Effective security requires everyone, from silicon vendors to<br />

design firms to OEMs, to work together to ensure that supply<br />

chain security receives the time and attention it deserves. The<br />

good news is that not only are new hardware features being<br />

developed to address these issues, but there are some simple<br />

measures system developers can start implementing today to<br />

create a more secure manufacturing environment.<br />



Rowhammer - a survey assessing the severity of<br />

this attack vector<br />

Norbert Wiedermann ∗ , Sven Plaga ∗<br />

∗ Fraunhofer Institute AISEC, Garching bei München, Germany<br />

{norbert.wiedermann, sven.plaga}@aisec.fraunhofer.de<br />

Abstract—Dynamic random access memory (DRAM) is a cheaply manufacturable main memory architecture widely used in consumer and professional Information Technology (IT) systems. In March 2015, Seaborn et al. presented sample code [1] demonstrating how an already known technical issue of this memory architecture can be exploited by making use of insights from Kim et al. [2]. This work proved that the issue can be abused to compromise current IT systems. Using this knowledge as a starting point, other research teams continued the work; a JavaScript-based approach, for example, was presented by Gruss et al. in [3]. As the presented exploits gained high media attention in the non-scientific press, the rowhammer bug [4] and mitigation strategies [5] are still objects of research. In this paper, the hardware-related circumstances are reviewed and analysed to provide an understanding of the technical aspects which led to this bug. Based on related work, our own test setup was used to comprehend the steps of the attack. The challenges in creating this independent and functional setup based on x86 and Linux are introduced. Additionally, constraints in mounting an attack on current Linux distributions and possible mitigation strategies are presented. This paper summarises the current state of the art and provides insight into this severe though complex attack vector. With the presented results it is possible to estimate future refinements of rowhammer and identify mitigation strategies for one's own designs.<br />

I. Introduction<br />

The term “rowhammer bug” refers to a hardware-related flaw which can be utilised to create bit flips in a computer's main memory. This issue was first discussed in the paper by Yoongu Kim et al. in 2014 [2] and gained great public attention after a blog post by Mark Seaborn [1]. Together with Thomas Dullien, he presented two working exploits and gave a talk at the Black Hat conference in 2015 [6].<br />

The bug can be exploited by performing high-frequency, uncached read accesses to dynamic random access memory (DRAM), which eventually cause a bit flip in a memory area next to the accessed one. This flaw can be abused to gain full memory access, even to privileged areas, and thereby obtain full control over the attacked system. As Seaborn demonstrated with his exploits, this hardware-related flaw can be abused even from user-space applications to escape sandboxes and obtain kernel privileges. From that point, the work was continued by other researchers. One example of these follow-up investigations are the results of Daniel Gruss and Clémentine Maurice, who presented a JavaScript-based approach published in a paper [3] and also presented in a talk at the 32c3 in December 2015.<br />

Due to the fact that DRAM is widely used in consumer as well as professional IT systems, and the presented exploits had a high impact, the topic dominated publications in the technical and even non-scientific press for a while. Nevertheless, the circumstances of this bug are quite complex and require good in-depth knowledge of IT systems, such as the Central Processing Unit (CPU) architecture and its associated chip-set, which are closely linked to main memory management. Although numerous high-level summaries based on the corresponding scientific publications were published by the specialised press, these often showed a certain level of simplification in order to meet the readers' level of knowledge. On the one hand, this led to wide awareness of the topic. On the other hand, however, these simplifications also caused uncertainty on how to classify the related risks.<br />

This motivated the current work, in which the bug was<br />

studied in more detail to comprehend its circumstances and<br />

effects. The following Section II gives a short overview of memory architectures; with this background information, the hardware-related reasons behind the bug are clarified. Further on, related work is discussed, providing context to other research results (Section III). The theory behind the rowhammer attack is discussed in detail in Section IV. Thereafter, Section V describes the test setup used to replicate the results and outlines the test approaches. Building on the results summarised in the preceding section, possible mitigation strategies are presented in Section VI. In Section VII the paper concludes by putting the rowhammer attack into context with other common issues. Finally, a small guideline for risk estimation is provided, followed by ideas for possible follow-up work.<br />

II. Background<br />

This section provides basic background information on how main memory is organised in current IT systems. The following Section II-A describes physical characteristics and how memory cells are organised in hardware. Thereafter, Section II-B outlines the basic logical approaches used by the operating system (OS) to map memory to real hardware.<br />

A. Physical Memory Organisation<br />

Since modern IT systems are quite complex and the executed tasks require large amounts of main memory [7], the market demands cheap memory modules providing sufficient storage capacity. Addressing these requirements, manufacturing processes are further refined, enabling smaller semiconductor scales. By shrinking the semiconductor structures, more components can be placed on a chip, thus providing more memory capacity. On the other hand, the transistors and capacitors in these structures also get smaller, resulting in digital information being represented by tiny electrical capacitances in the range of a few femtofarads [2].<br />



The main memory of computer systems is organised in<br />

a multi-level hierarchy which is employed to manage the<br />

high quantity of memory cells. Each cell is made from a<br />

transistor (T) and a capacitor (C) which is the actual store for<br />

a single bit. Such a basic DRAM memory cell is depicted in<br />

Figure 1.<br />

Figure 1. Basic DRAM memory cell circuitry: a transistor T, gated by the wordline, connects the bitline to the storage capacitor C_Sp (cell voltage U_Sp).<br />

Several of these basic cells are then arranged in rows which<br />

then form matrices. This structure is referred to as a memory<br />

bank. Additional hardware components are necessary to<br />

translate a memory address to select a specific row in such a<br />

bank. A simplified memory bank is illustrated by Figure 2.<br />


Interference between these signals occurs when the signals are toggled between the “low” and “high” states. This causes weak magnetic induction in adjacent wordlines, resulting in further electrical leakage from capacitors attached to transistors controlled by the affected wordlines [8]. To preserve the saved information, the memory rows are refreshed periodically; typical refresh cycles are performed every 64 ms.<br />

To retrieve information, the memory cell is discharged and restored after each access. For management reasons, it is only possible to access the rows of a memory bank by address. As a result, reading information causes the discharge of all capacitors of a memory row. To maintain the stored information during the process, it is written to the row buffer; afterwards, the electrical charge is restored to the original row. This process of discharging and charging can interfere with memory cells of adjacent rows. Consequently, high-frequency read operations on the same memory row within the refresh timeframe can cause significant electrical leakage in the memory cells of adjacent rows. When the amount of charge falls below a certain threshold, this leads to misinterpretations by the row buffer, resulting in flipped bits which are subsequently provided to the requesting software layer. Since the physical fundamentals of the attack require high-frequency accesses to certain memory rows, the technique is referred to as “hammering”, which is also the origin of the attack's name.<br />
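At its core, the hammering access pattern is a tight loop of reads and cache flushes. The following sketch assumes an x86 system where the clflush instruction is available via the `_mm_clflush` intrinsic (with a no-op fallback elsewhere); whether bit flips actually occur depends on the DRAM module, on the two addresses mapping to different rows of the same bank, and on enough iterations completing within a refresh interval.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <stdlib.h>

#if defined(__SSE2__)
#include <emmintrin.h>
#define FLUSH(p) _mm_clflush(p)
#else
#define FLUSH(p) ((void)(p))  /* no cache flush available; illustration only */
#endif

/* Classic two-aggressor hammer loop: read both addresses, then flush
 * them from the cache so that the next iteration hits DRAM again. */
static void hammer(volatile uint8_t *a, volatile uint8_t *b, long iters)
{
    for (long i = 0; i < iters; i++) {
        (void)*a;                 /* read of aggressor row 1 */
        (void)*b;                 /* read of aggressor row 2 */
        FLUSH((const void *)a);
        FLUSH((const void *)b);
    }
}
```

A real exploit would additionally scan victim rows between hammer rounds for flipped bits.<br />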

B. Logical Memory Organisation<br />

Main memory is organised using virtual addresses on the software layer. Virtual address spaces are an abstraction mechanism to separate the memory areas of each application being executed. The mapping from virtual to physical memory addresses is handled by the underlying OS and the Memory Management Unit (MMU) of the hardware. In cases where the physically available memory is smaller than the provided virtual memory, these two instances resolve the mismatch using memory mapping and swapping to other storage, such as hard drives. This mapping is illustrated in Figure 3.<br />


Figure 2. Simplified DRAM memory bank.<br />

Memory chips containing eight memory banks are then placed on the Printed Circuit Board (PCB) of the respective DRAM module. Usually, eight chips are used on a single-rank DRAM module, or 16 chips for a double-rank module.<br />

Due to the small semiconductor scales and physics, the transistors in a memory cell tend to leak the electrical charge of the attached capacitors. Because of the manufacturing process, it is unavoidable that some of the wordlines used to access the rows are routed parallel to each other, which causes interference between them.<br />


Figure 3. Simplified view of memory mapping. Virtual memory is mapped by the operating system (OS) and Memory Management Unit (MMU) to the physically available memory.<br />

Virtual memory layout is organised in different hierarchies,<br />

such as page directories (PDs), page tables (PTs), and pages.<br />



To access this structure, virtual addresses are used with fixed offsets to gain access to each level of this memory management hierarchy. On the lowest level, the physical address of the value is stored. These aspects are depicted in Figure 4, using a 32-bit architecture as reference. On current 64-bit IT systems, memory hierarchies are organised using analogous concepts but with more deeply nested structures.<br />


Figure 4. Mapping of virtual to physical addresses. The management hierarchy uses page directories (PD), page tables (PT), and pages. Sample based on a 32-bit architecture.<br />
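The 10/10/12-bit split of the 32-bit scheme can be reproduced with a few shifts and masks; the type and function names below are illustrative, not taken from any OS:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Decompose a 32-bit virtual address into the indices of the two-level
 * paging scheme: 10 bits of page-directory index, 10 bits of page-table
 * index, and a 12-bit offset within the 4 KiB page. */
typedef struct {
    uint32_t pd_index;  /* bits 31..22: entry in the page directory */
    uint32_t pt_index;  /* bits 21..12: entry in the page table */
    uint32_t offset;    /* bits 11..0:  byte offset within the page */
} va_parts_t;

static va_parts_t split_va(uint32_t va)
{
    va_parts_t p;
    p.pd_index = (va >> 22) & 0x3ffu;
    p.pt_index = (va >> 12) & 0x3ffu;
    p.offset   = va & 0xfffu;
    return p;
}
```

On 64-bit systems the same idea applies with more levels, e.g. four 9-bit indices plus a 12-bit offset for x86-64 4-level paging.<br />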

The final mapping of physical addresses to actual cells on the memory module depends on the architecture of the CPU. Each generation features different approaches to optimise this mapping. Furthermore, the underlying CPU architecture also specifies the applied caching schemes. The algorithms implementing this mapping and caching are not publicly documented by the CPU vendors; hence, reverse engineering is inevitable to understand them.<br />

Seaborn started off with Intel's Ivy Bridge and Sandy Bridge CPU architectures as test platforms. While the Ivy Bridge architecture contained somewhat more complexity, he successfully reverse engineered the mapping for Sandy Bridge and documented his results in [9]. Further insights on how the memory mapping is conducted are documented in [10], also covering the Ivy Bridge architecture. Additional CPU architectures were reverse engineered by Pessl et al. in [11]. Approaches to circumvent the various levels of caches on the CPU were researched by Hund et al. in [12]: by measuring timing differences of memory accesses, the actual mapping can be deduced. Invoking the clflush instruction to clear the caches forces the hardware to reload the accessed row. An algorithm describing such timing-based analysis is presented by Liu in [13].<br />

This detailed knowledge of how to find the location of specific physical addresses in DRAM is essential in order to mount a rowhammer attack. Based on the documented memory mappings and the algorithms to circumvent caches, the physical addresses of memory rows located adjacent to each other can be retrieved and correlated to virtual addresses. Thus, it is possible to identify so-called aggressor rows next to the targeted victim row. The aggressor rows must satisfy the following requirement: same bank, but a different row than the victim row. Without the reverse engineered memory mapping, or without sufficient knowledge about the target platform's memory organisation, the rowhammer attack can only be performed by randomly selecting virtual addresses. Applying algorithms like the one presented by Liu [13], these aspects can be reverse engineered, making the attack more efficient. For example, after retrieving the specific mapping of virtual addresses to physical memory rows, attacking a victim row with an adjacent and a subjacent aggressor row becomes possible.<br />

III. Related Work<br />

Corruption of information stored in random access memory (RAM) is not a new phenomenon. These issues have been well known since Intel introduced its first commercial DRAM chips [14]. Traditionally, these drawbacks were understood and handled as a reliability issue. Usually, memory corruption occurs on a random basis, caused by environmental influences such as radiation or significant variations in temperature [15], [16]. The reliability of main memory can be increased by using specialised memory modules: employing error correction code (ECC) memory can help to reduce the risk of data corruption, since it is capable of correcting single-bit errors. An error caused by two corrupt bits can at least be detected and usually results in a system crash.<br />
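The single-bit-correcting behaviour can be illustrated with the classic Hamming(7,4) code, a toy relative of the codes used in ECC modules (which typically protect 64-bit words and add an overall parity bit, SECDED, so that double-bit errors are detected rather than miscorrected):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hamming(7,4): encode 4 data bits (d1..d4 = LSB..bit 3) into a 7-bit
 * codeword laid out as positions 1..7 = p1 p2 d1 p3 d2 d3 d4. */
static uint8_t hamming74_encode(uint8_t d)
{
    uint8_t d1 = d & 1, d2 = (d >> 1) & 1, d3 = (d >> 2) & 1, d4 = (d >> 3) & 1;
    uint8_t p1 = d1 ^ d2 ^ d4;   /* covers codeword positions 1,3,5,7 */
    uint8_t p2 = d1 ^ d3 ^ d4;   /* covers codeword positions 2,3,6,7 */
    uint8_t p3 = d2 ^ d3 ^ d4;   /* covers codeword positions 4,5,6,7 */
    return (uint8_t)(p1 | (p2 << 1) | (d1 << 2) | (p3 << 3) |
                     (d2 << 4) | (d3 << 5) | (d4 << 6));
}

/* Decode a 7-bit codeword, correcting at most one flipped bit. The
 * parity checks form a syndrome that equals the position of the error. */
static uint8_t hamming74_decode(uint8_t c)
{
    uint8_t b[8];
    for (int i = 1; i <= 7; i++)
        b[i] = (c >> (i - 1)) & 1;
    int syndrome = (b[1] ^ b[3] ^ b[5] ^ b[7])
                 | ((b[2] ^ b[3] ^ b[6] ^ b[7]) << 1)
                 | ((b[4] ^ b[5] ^ b[6] ^ b[7]) << 2);
    if (syndrome)
        b[syndrome] ^= 1;        /* flip the erroneous bit back */
    return (uint8_t)(b[3] | (b[5] << 1) | (b[6] << 2) | (b[7] << 3));
}
```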

Coupling effects between bitlines and wordlines were researched by Redeker et al. in a paper presented in [8] (2002). The effects described in this work are also present in the rowhammer bug, where coupling effects between wordlines add to the charge leakage. In extreme conditions, such as several thousand row accesses, bit flips are caused.<br />

Environmental influences have also been exploited to affect the reliability of memory. Govindavajhala et al. presented a paper [17] (2003) in which they applied strong temperature variations to memory modules. They used a custom-designed adapter to influence and control the temperature, resulting in flipped bits. This work demonstrated how to abuse randomly flipped bits to escape virtual machine environments, such as Microsoft .NET or the Java Virtual Machine (JVM).<br />

Until then, errors in DRAM had been investigated predominantly in laboratory setups with controlled environments. Further research revealed, however, that data corruptions are day-to-day business for the operators of large-scale data centers. Schroeder et al. used measurement data from commodity server operators collected over a period of 2.5 years to research DRAM errors under real-world conditions. The findings are presented in [18] (2009). Schroeder was able to clarify two aspects of the state of the art at the time. First, it was identified that temperature changes have less effect on the integrity of main memory than assumed. Second, error rates in DRAM are significantly higher for real-world applications than documented by scientific research.<br />

Another large study on the reliability of DRAM was conducted by Sridharan [19] (2012). Using a dataset comprising 11 months of measurements for a data cluster, he was able to identify typical reasons why memory modules stop working. Nevertheless, both studies emphasise the reliability of main memory as an essential requirement for IT systems.<br />

Researching a wide range of different DRAM memory modules from three large manufacturers led to a publication with significant insights. In their work, Kim et al. [2] (2014) identified how bits in adjacent memory cells can be affected on purpose. Using certain access patterns, they were able to cause targeted memory corruptions and hypothesised how to apply this as an attack vector. The name “rowhammer attack” was derived from the high-frequency access pattern on adjacent rows in RAM.<br />



Building on these research results, Seaborn presented Proof-of-Concept (PoC) implementations of two exploits. The first PoC showcased how to use the rowhammer attack to cause targeted bit flips to escape Google's isolated Native Client (NaCl) sandboxing environment. The second PoC is capable of gaining root privileges on a system [1] (2015). The PoCs by Seaborn made use of instructions to circumvent caching in modern CPUs. Gruss et al. continued this work and ported the PoCs presented by Seaborn to JavaScript [3] (2015). In such higher-level languages there is no possibility to directly influence caching. The presented JavaScript approach proved that cache eviction is sufficiently fast if the RAM access patterns are adapted and optimised.<br />

The combination of smaller semiconductor scales and malicious access patterns opened a new field of possible attacks on current IT systems. The basics have been researched and documented by various research teams, and finding novel schemes preventing or mitigating such attacks was identified as a new challenge. The work by Seyedzadeh [20] (2017) addressed this open issue by proposing approaches to identify and mitigate crosstalk in DRAM. Several encoding techniques were researched and compared in this paper. By applying them in upcoming memory modules, the influence of the malicious access patterns of the rowhammer attack could be limited.<br />

On the other hand, there are still a lot of legacy systems in operation. Providing a suitable fix for them is the objective of the research findings published by Brasser et al. [5] (2017). This work follows an idea proposed by Gruss and suggests a separation of memory into areas with different privileges. By that, user-space applications causing rowhammer-typical accesses to main memory are not able to gain access to memory areas of other applications or areas reserved for higher privileged processes such as the OS kernel.<br />

IV. Attack Details in Theory<br />

In order to utilise the rowhammer bug for a successful attack on a target system, several preconditions have to be fulfilled by the attacker. The most essential one is the ability to execute code on the target platform. Further requirements are discussed in Section IV-A.<br />

If an attacker satisfies all preconditions, the rowhammer bug can be exploited. The actual attack can be separated into different stages, which are outlined in Section IV-B.<br />

A. Required Preconditions<br />

First of all, the attacker needs to be able to execute their own code on the target platform. Documented working samples are available in C/C++, but languages like JavaScript have also been shown to be suitable for performing this kind of attack [21], [22].<br />

Since this attack is very specific to the hardware it is executed on, detailed knowledge about the target is essential to fine-tune the memory accesses. Specifically, the attacker has to have information about the CPU architecture used and which kind of main memory is installed. Architectures such as Intel's Sandy Bridge, Ivy Bridge or Haswell were found to be affected by the rowhammer bug. Successful attacks were also conducted on AMD's Piledriver-based systems [2]. The underlying CPU architecture influences the implemented algorithms used for memory access optimisations and caching approaches. Since the rowhammer bug requires direct access to the main memory, varying solutions for memory access need to be considered in the attacking application.<br />

Currently, only DRAM based memory is known to be<br />

affected by this bug. Comparing DRAM to other memory<br />

architectures, such as Static Random Access Memory (SRAM),<br />

there is a significant difference in structure. For instance, SRAM-cell-based systems are not affected by rowhammer. However, since SRAM cells are composed of larger structures on the die, the available memory capacity is reduced and manufacturing<br />

costs are increased. Consequently, it is not economical to<br />

replace DRAM cells with other memory architectures for a<br />

straightforward mitigation approach.<br />

Most of these technical insights are not publicly documented. The necessary documents are only available from the hardware manufacturers after signing Non-Disclosure Agreements (NDAs) and/or paying large amounts of money. Examples of this practice are the notorious, individually watermarked, yellow- and red-covered specifications issued by Intel only to a small, selected circle of partners.<br />

To circumvent this issue, most researchers reverse engineer the target systems to understand their inner workings. Their documentation, though, is not necessarily complete, correct, or easy to understand for someone starting to work in this area. Therefore, having code fragments or some released samples is often not enough to get started. Tedious rework and guesswork are needed to complete the code at hand in order to catch up to the state of the art.<br />

B. Attack Stages<br />

The rowhammer attack exploits the memory organisation<br />

with PDs, PTs, and pages of current OSs. An example is used<br />

to illustrate the attack stages. The assumed and simplified<br />

memory hierarchy is illustrated in an abstract tree structure<br />

which is shown in Figure 5. There are two applications,<br />

each having a PD, PTs and pages assigned. Access paths are<br />

indicated by arrows.<br />

Figure 5. Simplified memory hierarchy as abstract tree before a rowhammer attack is conducted. (Figure not reproduced; it shows CR3 pointing to PD 0 and PD 1, with PD 0 → PT 0:0 → Pg 0:0:0 and PD 1 → PT 1:0, PT 1:1 → Pg 1:1:0.)<br />

Step 1): The hierarchy from Figure 5 is translated into a representation describing a simplified view of the DRAM<br />

layout, illustrated in Figure 6. The unallocated memory is<br />

highlighted by dotted areas. Analogous to Figure 5, access<br />

paths are indicated by arrows.<br />



Since each user space application has its own virtual memory area, applications are separated and access is only possible to pages of their own area. A direct write access to PTs is not possible for an application. A key aspect of the rowhammer attack is to allocate memory using the mmap syscall. By iteratively invoking mmap across all possible virtual addresses, each call results in a newly generated PT. This process is denoted as memory spraying.<br />

It is assumed that the application belonging to the memory structure described by PD 1 has conducted this spraying. In the subsequent step, it accesses its memory areas with high frequency. Basically, it is now performing the rowhammer attack.<br />

Figure 7. The rowhammer attack caused a bit flip in PT 1:1. The manipulated address points to PT 1:0. By that, PT 1:0 is treated like a page and write access is possible. (Figure not reproduced; it shows the DRAM address layout containing Pg 0:0:0, PT 0:0, PD 0, PT 1:0, PT 1:1, PD 1, Pg 1:1:0, and CR3.)<br />

Figure 6. Simplified memory layout with two applications organized in PD 0 and PD 1. Dotted areas are unallocated memory. Arrows indicate access paths. (Figure not reproduced.)<br />

Step 2): The first step generated many PTs, which are now present in memory. As a result of the high-frequency access to the application's allocated memory, it is assumed that eventually a bit of PT 1:1 is flipped. Since memory is full of PTs, there is some probability that this bit flip results in an address pointing to another PT. This situation is illustrated in Figure 7.<br />

Accessing this address treats PT 1:0 like a normal page of<br />

the application. Thereby, PT 1:0 becomes writeable for the<br />

malicious user space application.<br />

Step 3): This manipulation of PT 1:1 enables the user space application to write any address to PT 1:0. With this acquired privilege, the system can be exploited. Writing an arbitrary address to PT 1:0 allows full access to the complete main memory. The numbers in Figure 8 indicate the sequence of the access path.<br />

Figure 8. PT 1:0 is treated like a page. Thereby, it is writeable and any address can be used to access other memory areas. (Figure not reproduced.)<br />

The situation after a successful rowhammer attack can be summarized in an abstract tree model, which is illustrated by Figure 9. The black arrows indicate the access path utilising PT 1:0 to access any desired memory area, symbolised by the gray box. For an adversary, full memory access is a very comfortable situation, as it can be used to systematically search the RAM for sensitive information. By seeking characteristic patterns, it is possible to identify cryptographic key material such as private Secure Shell (SSH) keys or other kinds of sensitive information.<br />

Figure 9. After a successful rowhammer attack, PT 1:0 is treated like a page. By writing any address to PT 1:0, full memory access from user space is possible. (Figure not reproduced.)<br />

V. Result Replication<br />

This section documents the insights gained while working on replicating the findings documented by other researchers [1]–[3]. For the test-bed, a laptop with a “rowhammer-friendly” hardware configuration was used. For the identification of appropriate components, the documents referenced in related work provided a good orientation.<br />

The employed test-bed is described in Section V-A. Thereafter, available software tools to test a given hardware for potential vulnerability are discussed. The section concludes with a discussion of the findings.<br />

A. Test-Bed<br />

Based on related work, laptop computers were identified to work well as a hardware platform for researching the rowhammer issue. Testing different laptops from various manufacturers with a selection of different RAM modules is a tedious task. In order to identify whether a given configuration is potentially affected by the rowhammer bug, the corresponding test in the memory testing software MemTest86 [23] was conducted.<br />

A Lenovo X230 laptop based on the Ivy Bridge architecture was identified as an affected platform. The used configuration included an Intel Core i5-3322M CPU in combination with Hynix RAM modules (PC3-10600 @ 1333 MHz). This confirmed the findings of Seaborn that Ivy Bridge based systems are affected by the rowhammer bug.<br />

Cross-checking with DRAM manufacturers and their market shares yields the insight that Hynix is one of the top three producers of DRAM. An additional comparison of these findings with the research on affected memory modules conducted by Kim et al. in [2] further supported the plausibility of the created test-bed.<br />

B. Test Approach<br />

Identifying whether a given hardware configuration is affected by the rowhammer bug is not trivial. Running software dedicated to testing for this issue does not necessarily report a result, whereas other test applications may indicate a vulnerability.<br />

In his initial blog post, Seaborn published example C code to demonstrate the inner workings of the identified issue [1]. This code was published on GitHub [21] for others to use, refine, and continue the work.<br />

Gruss et al. used this sample code in their research and adapted the available C code to other CPU architectures, such as Intel's Haswell and Skylake. They further developed a JavaScript-based implementation as a PoC to demonstrate how to make use of the bug even without direct influence on the caches of the CPU. This ported C code and the JavaScript version are also publicly available on GitHub [22] to be used for further research.<br />

Finally, a test for rowhammer was also included in the well-known application MemTest86 [23] by the company PassMark Software.<br />

As part of this research, the adapted C code versions from Gruss were used. This sample application served as a starting point, e.g., to understand the platform-specific instructions to circumvent caches. Establishing detailed knowledge of how memory is organized in hardware as well as managed by the OS required many resources. Based on the C code and the available documentation of the Linux kernel, the relationship between virtual memory management using Page Directories, Page Tables, and pages was reworked [24]–[26]. These structures are very complex but hold interesting aspects for further work.<br />

C. Discussion on Test Results<br />

This experience yields the insight that the ability to cause a bit to flip does not automatically result in an exploit or even root privileges. The provided sample source code can be used as a first starting point to test a system for this bug. It is not yet a fully working exploit. An attacker has to invest considerably more resources to develop malicious applications that make use of rowhammer. The fact that very hardware-specific aspects are used and that detailed knowledge about memory management is necessary increases the difficulty of mounting such an attack.<br />

Current software is patched, e.g., calls to clflush are now restricted to privileged users only. But a final solution needs to be included in upcoming hardware versions.<br />

VI. Mitigation Strategies<br />

As stated by different results of related work, there are some recommendations to mitigate the rowhammer attack vector. Shortening the timing between refresh cycles in DRAM modules is one approach often proposed. But the results of Kim et al. [2] clarify that, even with this approach, memory corruptions are still possible. Even at half of the default refresh time (32 ms instead of the 64 ms specified by the DRAM standard), flipped bits have been documented for one-sided hammering of a target row. Taking a double-sided attack on a row into account, the results get even worse. On the other hand, shorter refresh cycles also cause the memory module to spend even more time on a maintenance task (refreshing rows), impairing the system's performance. Finally, not all currently deployed Basic Input/Output System (BIOS) versions support the configuration of significantly shorter refresh cycles, necessitating updates. Some hardware vendors provide these updates [27],<br />



but applying them to the respective platform is potentially error-prone.<br />

In the work of Seaborn and Dullien [1], the command clflush was used to clear cache entries and force the system to access the values directly from RAM. The utilization of this low-level command enabled the first implementations of the rowhammer attack. Restricting calls of this command to privileged users is ineffective, however, as the follow-up work by Gruss demonstrated [3] that the same behaviour can be achieved using higher-level languages. Therefore, it can be concluded that even platforms without support for such cache-sanitizing functionality are vulnerable to rowhammer attacks. Instead of actively cleaning the cache entries, the caching mechanisms are outperformed, resulting in cache eviction, which was shown to be fast enough for the rowhammer attack. For optimizing the generation of cache misses, the related work by Oren et al. demonstrated suitable approaches [28].<br />

More sophisticated solutions have been presented by Kim et al. in [2]. Probabilistic adjacent row activation (PARA) is proposed to refresh adjacent rows with a very low probability each time an access is performed. In the case of rowhammer, one row is accessed several thousand times within a short time frame. Over time, this probability-based solution causes the neighboring rows to be refreshed. Something similar is discussed by Seaborn in [6], called “Target Row Refresh (TRR)”. This solution refreshes adjacent rows based on an access counter. However, an update to the memory controller is necessary, or the mechanism can be included in future memory generations, such as DDR4 chips. In his talk, Gruss proposed a software-based approach utilizing memory hierarchies with different access privileges. This would restrict the impact to the performing application. This idea could be incorporated into the already available memory organisation using PDs and PTs.<br />

One insight gained in the course of this ongoing research is that mitigation strategies can be applied more easily to architectures using some runtime environment. There, the hardware access is abstracted and patches can be realised to restrict access to certain functions, e.g., clflush. However, related work showed various ways to circumvent such limitations.<br />

VII. Conclusions<br />

Corrupt memory is, in general, not a new phenomenon. The issue has been known to the hardware developers of DRAM modules since the introduction of the first commercial DRAM module by Intel [14].<br />

For a long time, the issue of flipping bits in main memory was seen as just a reliability issue. In reaction to these issues, developers included error correction codes and combined them with redundancy (e.g., ECC RAM) or memory remapping, as is the case when faulty RAM rows are detected and compensated.<br />

The findings of Kim et al. [2] indicate that memory corruptions can also be used to specifically manipulate a computer's main memory and affect executed software. With a PoC, the exploitability was demonstrated by Seaborn [1].<br />

This proves that hardware reliability issues influence software on an IT system. In the context of IT security, such influences undermine basic security algorithms utilised to implement the Confidentiality, Integrity, Availability (CIA) triad for certain applications. Recent findings show how performance optimizations for code execution on CPUs can be abused to gain access to sensitive data [29]. In this attack, branch prediction is exploited to retrieve data the CPU preprocessed in expectation of a soon-occurring access. In cases where this preprocessed branch is not required, potentially sensitive data is mapped into caches, from which it can be extracted by malicious applications.<br />
extracted by malicious applications.<br />

Related work shows that hindrances such as closed source, Non-Disclosure Agreements (NDAs), hidden or undocumented functions, or the lack of working code samples do not stop people from developing PoC exploits. In the case of rowhammer, released code is used to illustrate the issue and to test hardware for potential vulnerabilities. This helps developers and end users to establish mitigation strategies, but a malicious attacker still has to invest resources to fill essential gaps.<br />

Following the methodology of responsible disclosure allows vendors to develop patches and notify their customers, e.g., through security advisories [30]. This reduces the overall risk to end users, since they can prepare before the issue is publicly released. But such findings again emphasise the importance of maintained and supported platforms. Especially among embedded systems, even recent devices are built upon legacy OS versions, such as Linux with kernel 2.6, or unsupported and outdated libraries [31].<br />

Yet unclear is the situation for embedded systems based on closed source OSs, such as Windows Embedded. As these OSs often practice security by obscurity, it is not known to the public how they organise the RAM. Therefore, it is hard to estimate whether an attack like rowhammer affects platforms using a proprietary OS. Without the possibility of producer-independent research, this issue and the possible impact of rowhammer on these platforms are hard to assess.<br />

A. Am I affected?<br />

Assessing whether a certain hardware configuration is potentially affected by the rowhammer bug requires detailed investigation. Considering embedded platforms, one can perform a rough assessment to get an idea of whether this specific attack is of relevance for further research.<br />

It needs to be clarified whether the essential preconditions of Section IV-A are fulfilled. Essential questions are:<br />

1) What kind of memory architecture is installed?<br />

2) Does an attacker have the opportunity to execute their own code?<br />

Since this kind of attack is very hardware specific, the possibility of influencing main memory is not yet a working exploit. Rather, it should be seen as a reliability issue with impact on the integrity of an IT system. Furthermore, memory corruptions affect the availability of a system. Finally, such an attack might have an impact on confidentiality. Nevertheless, in combination with vulnerabilities in outdated libraries, the rowhammer attack can be used to gain higher privileges [1]. This emphasises the need for maintained software and hardware components that are kept up to date with the latest patch level.<br />

As another aspect, the potential attacker model needs to be considered. What resources is an attacker assumed to be capable of investing to mount such a specific attack? Are there any other attack paths which might be easier to realize? By establishing an attacker model describing the resources and capabilities of an expected adversary, attacks like rowhammer can be put into proportion. Hardware-based security issues are significant, but often there are other, less complex attack vectors with the same or even higher impact.<br />

VIII. Future Work<br />

This work can be continued by developing a user-friendly tool to test a given hardware setting for the rowhammer issue. Current solutions focus on expert usage; e.g., it is necessary to compile the test tool from C code samples. This demands knowledge about the target platform the test tool is used on, since hardware-specific adjustments need to be considered while compiling.<br />

Additionally, the theoretical basics of memory management were identified as a valuable topic for further work. Based on insights gained from this research, hardware-related attacks like rowhammer can be better understood. Recent findings like Meltdown [29] or Spectre [32] also originate from hardware characteristics. Here, the optimisation in branch prediction for preprocessing likely required statements can be exploited to retrieve sensitive data from caches. Understanding the underlying concepts supports comprehending such issues and makes finding secure solutions easier.<br />

Project Funding<br />

The presented work is part of the German national IT-<br />

Security reference project IUNO (https://www.iuno-projekt.de).<br />

The project is funded by the German Federal Ministry of<br />

Education and Research, funding № KIS4ITS0001. IUNO<br />

aims to research and provide building-blocks for IT-Security<br />

in the emerging field of Industry 4.0.<br />

References<br />

[1] M. Seaborn and T. Dullien. (Mar. 2015). Exploiting<br />

the DRAM rowhammer bug to gain kernel privileges,<br />

[Online]. Available: https://googleprojectzero.blogspot.de/2015/03/exploiting-dram-rowhammer-bug-to-gain.html (visited on 01/13/2018).<br />

[2] Y. Kim, R. Daly, J. Kim, C. Fallin, J. H. Lee, D. Lee,<br />

C. Wilkerson, K. Lai, and O. Mutlu, “Flipping Bits in<br />

Memory Without Accessing Them: An Experimental<br />

Study of DRAM Disturbance Errors”, in Proceeding of<br />

the 41st Annual International Symposium on Computer Architecture, ser. ISCA ’14, Minneapolis, Minnesota,<br />

USA: IEEE Press, 2014, pp. 361–372, isbn: 978-1-<br />

4799-4394-4. [Online]. Available: http://dl.acm.org/<br />

citation.cfm?id=2665671.2665726.<br />

[3] D. Gruss, C. Maurice, and S. Mangard, “Rowhammer.Js:<br />

A Remote Software-Induced Fault Attack in JavaScript”,<br />

in Proceedings of the 13th International Conference on<br />

Detection of Intrusions and Malware, and Vulnerability<br />

Assessment - Volume 9721, ser. DIMVA 2016, San<br />

Sebastián, Spain: Springer-Verlag New York, Inc., 2016,<br />

pp. 300–321, isbn: 978-3-319-40666-4. doi: 10.1007/<br />

978-3-319-40667-1_15. [Online]. Available: http:<br />

//dx.doi.org/10.1007/978-3-319-40667-1_15.<br />

[4] K. S. Yim, “The Rowhammer Attack Injection Methodology”,<br />

in Proceedings of the IEEE Symposium on<br />

Reliable Distributed Systems (SRDS), 2016, pp. 1–10.<br />

[5] F. Brasser, L. Davi, D. Gens, C. Liebchen, and A.-R.<br />

Sadeghi, “CAn’t Touch This: Software-only Mitigation<br />

against Rowhammer Attacks targeting Kernel Memory”,<br />

in 26th USENIX Security Symposium (USENIX Security<br />

17), Vancouver, BC: USENIX Association, 2017,<br />

pp. 117–130, isbn: 978-1-931971-40-9. [Online].<br />

Available: https://www.usenix.org/conference/usenixsecurity17/technical-sessions/presentation/brasser.<br />

[6] M. Seaborn and T. Dullien. (2015). Exploiting the<br />

DRAM rowhammer bug to gain kernel privileges,<br />

[Online]. Available: https://www.blackhat.com/docs/us-<br />

15/materials/us-15-Seaborn-Exploiting-The-DRAM-<br />

Rowhammer-Bug-To-Gain-Kernel-Privileges.pdf.<br />

[7] R. Isaac, “The Remarkable Story of the DRAM Industry”,<br />

IEEE Solid-State Circuits Society Newsletter,<br />

vol. 13, no. 1, pp. 45–49, Winter 2008, issn: 1098-<br />

4232. doi: 10.1109/N-SSC.2008.4785692.<br />

[8] M. Redeker, B. F. Cockburn, and D. G. Elliott, “An<br />

investigation into crosstalk noise in DRAM structures”,<br />

in Proceedings of the 2002 IEEE International Workshop<br />

on Memory Technology, Design and Testing<br />

(MTDT2002), 2002, pp. 123–129. doi: 10.1109/MTDT.<br />

2002.1029773.<br />

[9] M. Seaborn. (Apr. 2015). L3 cache mapping on<br />

Sandy Bridge CPUs, [Online]. Available: http://lackingrhoticity.blogspot.de/2015/04/l3-cache-mapping-on-sandy-bridge-cpus.html (visited on 01/14/2018).<br />

[10] ——, (May 2015). How physical addresses map to<br />

rows and banks in DRAM, [Online]. Available: http://lackingrhoticity.blogspot.de/2015/05/how-physical-addresses-map-to-rows-and-banks.html (visited on 01/13/2018).<br />

[11] P. Pessl, D. Gruss, C. Maurice, and S. Mangard,<br />

“Reverse engineering intel DRAM addressing and<br />

exploitation”, CoRR abs/1511.08756, 2015.<br />

[12] R. Hund, C. Willems, and T. Holz, “Practical Timing<br />

Side Channel Attacks Against Kernel Space ASLR”, in<br />

Proceedings of the 2013 IEEE Symposium on Security<br />

and Privacy, ser. SP ’13, Washington, DC, USA: IEEE<br />

Computer Society, 2013, pp. 191–205, isbn: 978-0-<br />

7695-4977-4. doi: 10.1109/ SP.2013.23. [Online].<br />

Available: http://dx.doi.org/10.1109/SP.2013.23.<br />

[13] L. Liu, Z. Cui, M. Xing, Y. Bao, M. Chen, and<br />

C. Wu, “A Software Memory Partition Approach<br />

for Eliminating Bank-level Interference in Multicore<br />

Systems”, in Proceedings of the 21st International<br />

Conference on Parallel Architectures and Compilation<br />

Techniques, ser. PACT ’12, Minneapolis, Minnesota,<br />

USA: ACM, 2012, pp. 367–376, isbn: 978-1-4503-<br />

1182-3. doi: 10.1145/ 2370816.2370869. [Online].<br />

Available: http://doi.acm.org/10.1145/2370816.2370869.<br />

[14] J. H. Saltzer and M. F. Kaashoek, Principles of Computer<br />

System Design: An Introduction. Morgan Kaufmann,<br />

2009, isbn: 978-0123749574. [Online]. Available:<br />

https://booksite.elsevier.com/9780123749574/<br />

casestudies/00~All_Chapters(7-11).pdf.<br />



[15] S. Khan, D. Lee, Y. Kim, A. R. Alameldeen, C.<br />

Wilkerson, and O. Mutlu, “The efficacy of error<br />

mitigation techniques for DRAM retention failures: A<br />

comparative experimental study”, in ACM SIGMET-<br />

RICS Performance Evaluation Review, ACM, vol. 42,<br />

2014, pp. 519–532. [Online]. Available: https://dl.acm.<br />

org/citation.cfm?id=2592000.<br />

[16] J. Liu, B. Jaiyen, Y. Kim, C. Wilkerson, and O. Mutlu,<br />

“An experimental study of data retention behavior in<br />

modern DRAM devices: Implications for retention time<br />

profiling mechanisms”, in ACM SIGARCH Computer<br />

Architecture News, ACM, vol. 41, 2013, pp. 60–71.<br />

[Online]. Available: https://dl.acm.org/citation.cfm?id=<br />

2485928.<br />

[17] S. Govindavajhala and A. W. Appel, “Using memory<br />

errors to attack a virtual machine”, in 2003 Symposium<br />

on Security and Privacy, 2003., May 2003, pp. 154–165.<br />

doi: 10.1109/SECPRI.2003.1199334.<br />

[18] B. Schroeder, E. Pinheiro, and W.-D. Weber, “DRAM<br />

Errors in the Wild: A Large-Scale Field Study”, in<br />

SIGMETRICS, 2009. [Online]. Available: https : / /<br />

research.google.com/pubs/pub35162.html.<br />

[19] V. Sridharan and D. Liberty, “A Study of DRAM Failures<br />

in the Field”, in Proceedings of the International<br />

Conference on High Performance Computing, Networking,<br />

Storage and Analysis, ser. SC ’12, Salt Lake City,<br />

Utah: IEEE Computer Society Press, 2012, 76:1–76:11,<br />

isbn: 978-1-4673-0804-5. [Online]. Available: http:<br />

//dl.acm.org/citation.cfm?id=2388996.2389100.<br />

[20] S. M. Seyedzadeh, D. Kline Jr, A. K. Jones, and<br />

R. Melhem, “Mitigating Bitline Crosstalk Noise in<br />

DRAM Memories”, in Proceedings of the International<br />

Symposium on Memory Systems, ser. MEMSYS ’17,<br />

Alexandria, Virginia: ACM, 2017, pp. 205–216, isbn:<br />

978-1-4503-5335-9. doi: 10.1145/3132402.3132410.<br />

[Online]. Available: http://doi.acm.org/10.1145/3132402.3132410.<br />

[21] M. Seaborn. (Jun. 2015). Test DRAM for bit<br />

flips caused by the rowhammer problem., [Online].<br />

Available: https://github.com/google/rowhammer-test<br />

(visited on 01/19/2018).<br />

[22] D. Gruss. (Jul. 2015). Rowhammer.js - A Remote<br />

Software-Induced Fault Attack in JavaScript, [Online].<br />

Available: https://github.com/IAIK/rowhammerjs<br />

(visited on 01/16/2018).<br />

[23] P. Software. (Jul. 2017). MemTest86 V7.4 Free<br />

Edition Download, [Online]. Available: https://www.<br />

memtest86.com/download.htm (visited on 01/16/2018).<br />

[24] W. R. Stevens and S. A. Rago, Advanced programming<br />

in the UNIX environment. Addison-Wesley, 2013, isbn:<br />

978-0321637734.<br />

[25] M. Kerrisk, The Linux programming interface. No<br />

Starch Press, 2010, isbn: 978-1593272203.<br />

[26] M. Gorman, Understanding the Linux virtual memory<br />

manager. Prentice Hall Upper Saddle River, 2004,<br />

isbn: 978-0131453487.<br />

[27] Lenovo. (Sep. 2016). BIOS Update Utility, [Online].<br />

Available: https://download.lenovo.com/ibmdl/pub/pc/<br />

pccbbs/mobiles/8duj26us.txt (visited on 01/17/2018).<br />

[28] Y. Oren, V. P. Kemerlis, S. Sethumadhavan, and<br />

A. D. Keromytis, “The Spy in the Sandbox: Practical<br />

Cache Attacks in JavaScript and Their Implications”, in<br />

Proceedings of the 22Nd ACM SIGSAC Conference on<br />

Computer and Communications Security, ser. CCS ’15,<br />

Denver, Colorado, USA: ACM, 2015, pp. 1406–1418,<br />

isbn: 978-1-4503-3832-5. doi: 10.1145/ 2810103.<br />

2813708. [Online]. Available: http://doi.acm.org/10.<br />

1145/2810103.2813708.<br />

[29] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas,<br />

S. Mangard, P. Kocher, D. Genkin, Y. Yarom, and<br />

M. Hamburg, “Meltdown”, ArXiv e-prints, Jan. 2018.<br />

arXiv: 1801.01207.<br />

[30] (Sep. 2016). Row Hammer Privilege Escalation Vulnerability,<br />

[Online]. Available: https://tools.cisco.com/<br />

security/center/content/CiscoSecurityAdvisory/cisco-sa-20150309-rowhammer<br />

(visited on 01/17/2018).<br />

[31] T. Roth. (Dec. 2017). Gateway to (s)hell, [Online].<br />

Available: https://media.ccc.de/v/34c3-8956-scada_-<br />

_gateway_to_s_hell (visited on 01/17/2018).<br />

[32] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg,<br />

M. Lipp, S. Mangard, T. Prescher, M. Schwarz, and<br />

Y. Yarom, “Spectre Attacks: Exploiting Speculative<br />

Execution”, ArXiv e-prints, Jan. 2018. arXiv: 1801.<br />

01203.<br />

Norbert Wiedermann, MSc. has been employed since 2013 as a scientific researcher at the Fraunhofer Institute for Applied and Integrated Security (AISEC). In his research projects he focuses on IT security aspects for embedded and industrial hardware. By performing risk analyses and developing security concepts he contributes to increasing the protection level of the considered systems.<br />

Sven Plaga received the Dipl.-Ing. (FH) and M. Eng. degrees in electrical engineering and computer science from the Deggendorf University of Applied Sciences (Germany) in 2007 and from the University of Limerick (Ireland) in 2010, respectively. From 2007 to 2013 he was a research fellow at Deggendorf University of Applied Sciences, where he continuously participated in research projects regarding x86 embedded systems. Furthermore, he lectured on Embedded Systems and C Programming. Currently, he is a research fellow at the Fraunhofer Institute for Applied and Integrated Security (AISEC), working toward the Ph.D. degree in the field of secure industrial communications in the context of embedded systems. Additionally, he assists clients with risk analyses, security concepts and secure implementations within the scope of contracted industrial research projects. In his spare time, he loves to share his knowledge and experiences and likes to discuss his findings with others.<br />



You’ve been hacked! Now what?<br />

Haydn Povey<br />

Founder and CTO<br />

Secure Thingz<br />

Cambridge, UK<br />

haydn@securethingz.com<br />

Abstract — You’ve seen the headlines. Whether it's bots<br />

infecting home networks, the destruction of industrial<br />

systems, or the ability to take remote control of<br />

automobiles, the horror stories around Internet of Things<br />

security are starting to mount, like bodies in a bad movie.<br />

The bad guys will keep coming with malicious<br />

intent. The attacks on connected devices are only going to<br />

get worse and more sophisticated. Hardware, software, communications and communication protocols, device commissioning, application layers and other system considerations are just some of the many entry points through which a device can be compromised, fall victim to malware, and lead to data breaches or weaponization. Boundary<br />

protection can be too porous. Systems that may seem<br />

secure today may have weaknesses that will lead to failure<br />

in the future. Failure at some point is almost an<br />

inevitability.<br />

Privacy, corporate reputations and even lives can depend<br />

on the ability to ensure the security of a device.<br />

So it’s time to face reality. There’s a good chance<br />

your device will get hacked. The question is: what are you<br />

going to do about it? How will you recover? And what can<br />

you do to prepare?<br />

This presentation focuses on what you need to do in the<br />

aftermath of an IoT compromise and how you get back to<br />

a trusted system.<br />

Keywords—security; hack; secure boot; secure element; IoT;<br />

IP; attack; system; architecture; cyber security<br />

I. INTRODUCTION<br />

We are becoming increasingly used to the headlines of<br />

hacking across our IT systems. Barely a week goes by where<br />

major flaws aren’t found in the computer infrastructure that<br />

surrounds our digital domain, and whilst these are all<br />

concerning attacks, there are two major corrosive effects.<br />

First, while the global press may publish details on the attacks,<br />

there is a consumer fatigue that “yet another attack has<br />

happened,” minimising the importance of implementing<br />

countermeasures and instilling hygiene amongst users.<br />

Secondly, and more egregiously, there is perhaps a<br />

hopelessness creeping into organizations, both on what they<br />

should do to prevent the impact of an attack on their<br />

businesses and how they can increase the security of their own<br />

products. While there is little we can do about the first, there is<br />

plenty we can do as an industry to solve the second by<br />

providing solid frameworks for responding to attacks, and<br />

building systems which are intrinsically resilient to these<br />

evolving attacks.<br />

The reality for any complex system is that there will always<br />

be flaws in its design, implementation, and management.<br />

We are all human, and we all have to get systems in market<br />

rapidly due to competitive and business pressures. As such we<br />

will never have the time or budget to get any system beyond a<br />

small kernel that’s technically correct and certified by a group<br />

of peers. Specifically, we see industry challenges arising in<br />

three focused areas: 1) architectural specification; 2)<br />

technological inheritance; and, 3) system integration.<br />

A. Architectural Specification<br />

System architectures are the bedrock of technology,<br />

and they are often open to group review and public domain<br />

investigation. And yet we continue to see many fundamental<br />

flaws appearing. We have seen this in the BlueBorne attacks,<br />

where flaws in the Bluetooth specification led to multiple<br />

attack vectors being discovered many years after definition,<br />

and of course we have seen the recent Meltdown and Spectre<br />

compromises, where incorrect architectural definition has led<br />

to multiple side-channel attacks.<br />

B. Technological Inheritance<br />

We all stand on the shoulders of giants, and<br />

leveraging standard hardware and software components is the<br />

bedrock of modern computing. However, we are all at risk<br />

from compromises within these building blocks, creating a<br />

bubbling-up of issues which we are ill prepared to manage. A<br />

recent example of this was the compromises identified in low-level Transport Layer Security (TLS) communication drivers,<br />

which themselves are meant to be highly secure. The nature of<br />

these drivers is they are buried deep within the application and<br />

are not necessarily easy to patch or manage. In fact, the nature<br />

of many communication-level security flaws is that they hold<br />

privileged status within the system and present an Achilles<br />



heel when they go wrong. In this case, the<br />

vendor of the drivers produced a patch when they identified<br />

the flaw, and correctly notified the OEMs who had built on<br />

this software. However, there is little guarantee that the OEMs<br />

and end users had the knowledge or capability to remediate<br />

systems in the field.<br />

C. Systems Integration<br />

Similar to technological inheritance, the integration of<br />

components from numerous vendors is fraught with issues. It<br />

is not always clear how the system will fit together, and if<br />

compromises are introduced when systems are built. A classic<br />

example of this has occurred in the mobile telephone world<br />

where an innocuous incompatibility between baseband<br />

chipsets and application processors led to a major compromise<br />

in 2017, subsequently patched by the baseband chip vendor.<br />

As we build increasingly complex systems, there is a demand<br />

that every system must be protected individually, and that the<br />

system should implement a “zero trust” model between<br />

modules. However, this will necessitate additional cost, and<br />

additional effort.<br />

II. DEVELOPING AN INCIDENT RESPONSE PLAN<br />

As mentioned earlier, every complex system will have<br />

multiple flaws, and hence we must assume that every system<br />

can, and will, become compromised at some point. At some<br />

point, you will be hacked. So what are we going to do about<br />

it? This paper outlines an initial best practice approach and<br />

longer-term mitigation strategy.<br />

Given that being hacked is inevitable, it becomes<br />

imperative that every vendor within the value chain<br />

understand their role within the industry; that they prepare for<br />

having components or systems that become compromised; that<br />

they have a pre-planned response mechanism; and that they<br />

follow through correctly to ensure the industry’s trust in them<br />

is maintained if they do not wish to lose brand value. It is also<br />

imperative that this incident response plan is not only in place<br />

for current products, but that it also covers systems that have<br />

already been released. To assist in this, a number of groups,<br />

including the Internet of Things Security Foundation<br />

(www.iotsecurityfoundation.org), are creating Best Practice<br />

Guidelines that IoT vendors can integrate into their own<br />

processes.<br />

III. BEING PREPARED<br />

Preparing for the inevitable compromises is a multi-faceted<br />

process; however, the following five components represent a good starting point:<br />

1. Scope organizational impact<br />

2. Define internal cyber security policy<br />

3. Clear communications policy<br />

4. Develop a bug-bounty policy<br />

5. Execute deep threat analysis<br />

A. Scope Organizational Impact<br />

It is difficult to judge how much to invest in protection<br />

against abstract threats, when compared against known budget<br />

constraints. Hence the first step is for any organization to judge what the damage to it would be in a worst-case scenario.<br />

For many organizations, this will certainly include brand<br />

and reputational damage, but this is being exacerbated by other market forces, including the willingness of the industry<br />

to sue for impacts in their supply chain. A vulnerability in a<br />

communications stack may leave a process control system,<br />

and subsequently a large processing plant, at risk. Hence, a<br />

relatively small code fix could potentially protect a massive<br />

capital structure. Similarly, where there is the potential for<br />

customer data to leak, the balance of additional time and effort<br />

to minimise any threats is obvious when compared to fines of up to €20M, or 4% of global annual turnover, for a data breach under the GDPR.<br />

Best practice: To enable a comprehensive impact assessment, a specialist organization, external to the business unit, should often be employed to challenge assumptions and ensure the corner cases are explored. This team may be a<br />

central function within large organizations, or may be an<br />

external contractor within smaller operations. Generally, this<br />

team will need to be supported by internal experts running an<br />

insurgency attack to see how they would best compromise<br />

their own products.<br />

B. Define Cyber Security Policy<br />

No battle plan ever survives the first encounter of war, but<br />

you would never wish to go into battle without one. The same<br />

is true of defining a cyber security policy around products for<br />

the IoT. We don’t know exactly when, where or how the<br />

compromises or exploits will be found, but we must have the<br />

framework to deal with them.<br />

• Assuming compromises will occur is the first step in<br />

defining a policy. We have to expect every device<br />

will be attacked, and every device will be<br />

compromised at some point in its life. Every complex<br />

device containing firmware or software will have<br />

exploits. Every device that has a communications<br />

stack will have exploits. Every system containing an<br />

active microcontroller or microprocessor will have<br />

exploits.<br />

Best practice: Assume all devices will be exploited at<br />

some point. Ensure active patch management is<br />

possible, and ensure functionality is available within<br />

the product to update it. Ensure that patch<br />

management is built into the project costs and<br />

lifecycle management, and that development tools<br />

and firmware can support ongoing releases.<br />

• Executive Ownership is critical in developing a cyber<br />

security policy, and responding to an exploit.<br />



However, it is also important to have executive<br />

overview of security within the products being<br />

created. Security must be supported holistically in the<br />

organization, not relegated to the IT department.<br />

Best practice: A main board member (CxO) should<br />

report fortnightly to the board on any cyber incidents and on cyber security policy implementation.<br />

• Engineering leadership engagement is a key<br />

requirement in producing secure IoT devices.<br />

Engineering must decide what level of security is<br />

required for a product with justification, commit to<br />

maintaining software for the lifecycle of the product,<br />

and ensure that security test frameworks are part of<br />

the sign-off criteria.<br />

Best practice: Security must move from being an afterthought to being woven into the fabric of the engineering process. All<br />

architectures, components and sub-components must<br />

be validated against an evolving threat model, and<br />

existing products should be verified against major<br />

new threats.<br />

Active patch and update management are critical in<br />

supporting devices across their lifecycles, with the<br />

ability to recover to a known good state.<br />

• Product leadership must continue to own its products<br />

over their entire lifecycle, and must integrate security<br />

into the fabric of the offering. As such, the transition<br />

from selling commodities to services and lifecycle<br />

support is incredibly important, but also an important<br />

source of additional revenue.<br />

Best practice: Ensure the product mix includes<br />

ongoing lifetime maintenance, or the ability to<br />

manage the device to mitigate exploits. To achieve<br />

this, low-level security services must be instilled in<br />

the device, providing a secure kernel which can be<br />

utilised to support ongoing interaction.<br />

• Rapid escalation of issues is critical in gaining<br />

support from executives for necessary actions, and in ensuring the industry retains confidence in the<br />

organization. This is true whether a flaw has been<br />

identified internally, a compromise has been found<br />

externally, or an ethical hacker approaches the<br />

organization with a new exploit.<br />

Best practice: Implement a flattened management<br />

structure group for exploit investigation and<br />

responsiveness.<br />

• Communication policy is critical, and will be covered<br />

in more depth later. However, no Best Practice would<br />

be complete without a clear communication strategy<br />

for when attacks are found, how they are solved in<br />

partnership with clients, and how they are<br />

communicated publicly.<br />

Best practice: A set of policies should be<br />

implemented ensuring rapid and formalised<br />

communications which can convey urgency while<br />

demonstrating a managed and resolving situation.<br />

• Proper triage of an incoming compromise is critical<br />

to ensure the organization reacts positively without<br />

becoming overwhelmed by every single flaw in its<br />

products. A formal process of evaluation,<br />

investigation, mitigation and communication is<br />

required.<br />

Best practice: A written process for initial engagement with an issue is important, including the<br />

creation of a template to record all aspects of the<br />

incident, ensuring a consistent approach. This enables<br />

simple recording and comparison of events, and<br />

ensures all stakeholders, including technical teams,<br />

suppliers, legal resources and human resources,<br />

alongside business management, are educated<br />

rapidly.<br />

Figure 1. Triage steps.<br />

• Process management should already be a key function of the organization. It is important that version<br />

management and product evolution are managed<br />

aggressively within an organization. This covers both<br />

aspects of versioning, where previous versions may<br />

have known exploits which have been subsequently<br />

fixed. Similarly, product evolution will invariably<br />

enable flaws to creep into the system, and there<br />

should be the ability to roll back to a known-good<br />

version. Identifying where a flaw occurred and how it<br />

was introduced are key mechanisms in developing<br />

better internal processes to manage compromises<br />

over many years.<br />

Best practice: Maintaining clear records of releases<br />

within a mastering system is important in being able<br />

to understand when and how flaws were introduced,<br />

and who holds ultimate responsibility for these.<br />

These records are important from a technical<br />

perspective, but also critical from a legal perspective<br />

if the organization finds itself in court for a GDPR<br />

breach or lawsuit.<br />

• Patching and update management are important<br />

outputs from any cyber security policy, as this will be<br />



the primary response to any compromise, unless a<br />

full recall is required. Ensuring that all systems are manageable requires that low-level services have been fully certified and are small enough to have no known attack vectors; this constrains the update process.<br />

Best practice: Ensure development tools support<br />

version management and interim releases, and that<br />

patches can be signed and encrypted for specific<br />

target devices, or product groups. Assume that an<br />

update mechanism can also be exploited by an<br />

attacker, and therefore provide development<br />

frameworks which enable constrained mastering<br />

releases of software, with authentication and<br />

authorisation. Further, keep updates as small as possible to minimise the impact on network bandwidth and battery power.<br />

• Minimising attack impact through authentication is a<br />

clear goal of any cyber security policy. It should be<br />

the case that all devices have a strong cryptographic<br />

identity associated with them, and that all<br />

communications to and from the device have implicit<br />

authentication. This way, we should be able to both<br />

manage the communications to reduce the attack<br />

surface of the device, where an attacker must shape<br />

their attack to the unique device; and additionally<br />

create a zero-trust framework where devices cannot<br />

easily propagate attacks as they do not have<br />

authentication capability for other devices.<br />

Best practice: Ensure all devices have<br />

cryptographically strong certificates and<br />

authentication mechanisms, including PKI<br />

frameworks, the ability to hold secret information<br />

securely, and unique addressability, such as X.509<br />

(or equivalent).<br />
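As an illustration of the zero-trust, per-device authentication described above, the following C sketch accepts a message only when its tag verifies under a unique per-device key; the toy MAC is a stand-in for a real primitive such as HMAC-SHA256, and all names are hypothetical.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy MAC standing in for a real primitive (e.g. HMAC-SHA256); illustrative only. */
static uint32_t toy_mac(const uint8_t *msg, size_t len, uint32_t device_key) {
    uint32_t m = device_key;
    for (size_t i = 0; i < len; i++)
        m = (m * 31u) + msg[i];
    return m;
}

/* Zero-trust check: accept a message only if its tag verifies under this
 * device's unique key, so one compromised node cannot address another. */
int message_authentic(const uint8_t *msg, size_t len,
                      uint32_t tag, uint32_t device_key) {
    return toy_mac(msg, len, device_key) == tag;
}
```

Because every device holds a different key, a tag forged for one device fails verification on every other, which is what limits lateral propagation of an attack.<br />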

C. Clear communications policy<br />

As mentioned previously, a clear communication policy is<br />

the bedrock of a successful Incident Response Plan. However,<br />

this is often easier stated than done, especially given the<br />

embarrassment and sensitivities traditionally associated with<br />

flaws.<br />

There are three stages to a successful incident communications policy:<br />

1. Confirmation.<br />

In many cases a compromise may be identified by an<br />

ethical hacker. We will come onto bug-bounties in a<br />

moment, however, whether you offer one or not, it is<br />

always best to engage positively when approached. In<br />

the past, companies often buried their heads in the<br />

sand, but today that approach will bring industry<br />

condemnation and far greater reputational damage, as<br />

it gives the impression that they are both ignoring the<br />

issue and disrespecting their customers. Engage<br />

positively and strongly, and if possible, attain the use<br />

of the hacker to confirm the flaw in as much detail as<br />

possible.<br />

2. Notification<br />

Following initial triage, it is important to notify<br />

customers as soon as possible that there is a potential<br />

issue and that the organization is working on a fix. If<br />

the fix is simple, it may be that the organization will<br />

quickly make a patch available when they notify the<br />

clients. However, if the fix is complex, it is important<br />

to notify clients as early as possible as they will need<br />

to clarify how to integrate the workaround into their<br />

products. If the issue is catastrophic in nature, it is<br />

also important to flag this early to clients to ensure<br />

they, and their customers, are aware of any potential<br />

impacts to their businesses. A critical flaw in an<br />

automotive component may mean vehicles are at risk,<br />

and subsequently so are human lives. In this case, a<br />

failure to notify will raise the highest legislative<br />

impact and could put your company’s survival at risk.<br />

3. Publication<br />

Traditionally potential flaws were seen as<br />

embarrassments. Today that approach is changing,<br />

and the open publication of flaws and subsequent<br />

fixes is a prerequisite for trust within the industry. If<br />

you are not publishing flaws, you are seen as trying<br />

to hide mistakes, and this itself is a good reason for<br />

not doing business with an organization. Publication<br />

does not mean self-flagellation, and unless a flaw is<br />

critical, you do not have to publish it on the front<br />

page of your website. However, there should be a<br />

specific publication mechanism, with subscription<br />

notification, for users.<br />

D. Develop a bug-bounty policy<br />

We have touched on bug-bounty in the communication<br />

policy, but it is worthy of additional focus. A company should,<br />

as part of its Best Practice, engage with ethical hackers, and<br />

we are seeing this encoded into the new laws currently coming<br />

through the US Congress, where this practice has been<br />

legitimized.<br />

The nature of ethical hacking is that a third-party<br />

independently finds flaws and effectively ransoms the<br />

information to the organization. This approach is obviously<br />

distasteful, although better than a third party finding the flaws<br />

and actively exploiting the issue. A better approach is to be<br />

aggressive in engaging with the community, to build a<br />

predefined set of rules of engagement, and to predetermine<br />

where specific value is attributed to finding compromises.<br />

This approach also ensures that the bug-bounty explorers are<br />

always operating within the law, and they are far more likely<br />

to be true white-hats.<br />

Best practice: It is suggested that an engagement framework be published on the organization’s website, as part of the Cyber<br />



Response Initiative, and that the executive leader who is<br />

tasked with managing product security has a clear<br />

responsibility for engaging with this process and paying the<br />

hackers. After all, the cost of an exploit going wild is far<br />

higher than the cost of managing it internally.<br />

E. Execute deep threat analysis<br />

Of course, all of the above is great for developing a<br />

process to manage exploits. However, it is only through<br />

running an internal deep threat analysis that the organization will really start to understand how exposed it is.<br />

Best practice: In the first instance, the organization should<br />

create an incident response policy and dry run the process.<br />

This should involve senior business and technical leadership instigating a tear-down attack on one of their own products, identifying all the suspected compromises, and reviewing the potential consequences of a successful attack. In most cases,<br />

organizations receive a nasty shock.<br />

Identifying and ranking any exploits that emerge will help<br />

build a set of priorities which engineering can start to work on.<br />

As this approach is targeted at the first set of products, it is<br />

likely that common threats emerge, such as communication<br />

stacks, lack of prescribed identity, limited patching, and poor<br />

version management. These common threats subsequently build a backlog of issues to mitigate, but also ensure that new<br />

products under development learn from these issues, creating a<br />

positive engagement across the organization.<br />

Following this initial phase, the policy should be<br />

readdressed based on feedback, and then the organization is<br />

probably best tasked with bringing in a third-party security<br />

analysis organization to review the process and focus on<br />

additional areas which may have been missed or which lie outside the organization’s knowledge base.<br />

IV. RESPONDING TO AN ATTACK<br />

If a full Incident Response Policy has been implemented,<br />

then the Response phase to a new attack vector should flow<br />

easily. However, each one will be a learning experience and<br />

the focus should be on investigation of the attack,<br />

identification of the flaw, and ongoing mitigation in both existing and new products through updates and patches.<br />

The following steps are suggested:<br />

• Identify cyber security incident<br />

Although obvious, it is imperative that the<br />

organization understand quickly whether the<br />

compromises found are new issues, or whether these<br />

have been found before and are either unpatched<br />

exploits or a novel leveraging of existing issues.<br />

To achieve this the team must be able to replicate the<br />

system and rapidly understand the impact and<br />

pathology of the attack. This necessitates the use of a<br />

“clean room” system disassociated from the main IT system, where the attack can be replicated without<br />

impacting the company if it propagates.<br />

• Define a clear set of objectives<br />

The outcome of identifying the pathology of the<br />

attack may be manifold.<br />

First, it may be that the attack is leveraging a known<br />

compromise in a new way. This obviously needs to<br />

be understood further. However, it may be that the<br />

organization needs to warn customers to more rapidly<br />

update or apply patches. In many industrial<br />

organizations, the adoption of patches is slow as there<br />

may be unknown consequences which the client<br />

needs to mitigate against.<br />

Secondly, the attack may be a new variant of a<br />

known attack, compromising a new area of the<br />

codebase or system. In this case it may be that a<br />

patch can be rapidly reworked to cover the new hole,<br />

and future products can be coded to mitigate against<br />

that attack.<br />

Thirdly, the attack may be a completely new “zero-day” compromise. In this case the organization needs<br />

to rapidly identify the consequences of the attack, and<br />

develop a mitigation capability. They will need to<br />

judge the impact and potential scale of the exploit,<br />

and how quickly to alert their customer base.<br />

• Recover and remediate<br />

The primary goal of any attack response is to<br />

minimise the impact on the user.<br />

As such, in the first instance the system should be<br />

able to enter a quiescent state where advanced<br />

functionality is switched off, to reduce the attack<br />

surface and inhibit the propagation of any attack.<br />

This capability may be a feature of the Real-Time Operating System (RTOS), or may be reliant on<br />

underlying security services, such as a Secure Boot<br />

Manager. In the worst case, this functionality may be<br />

implemented by a quarantine system at the user’s site,<br />

but this should act as a last resort.<br />

The second phase should be to recover the device to a<br />

known good state. For advanced devices, this<br />

probably means recovering the Secure Boot Manager,<br />

which can be achieved through a soft reset. In this<br />

domain, the system should be able to isolate any<br />

possible attacks and inhibit them, while holding the<br />

main system in safe mode or shut down.<br />

The third phase is to remediate with the release of<br />

patches and upgrades which should be applied<br />

through the low-level security services, again<br />

leveraging the secure boot manager functionality to<br />

ensure the patches are signed and encrypted, and to<br />

ensure that these are version managed to stop any<br />

roll-back attacks.<br />
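The quiesce, recover and remediate phases above can be sketched as a simple state machine; the states and events below are illustrative, not part of any specific product.<br />

```c
#include <assert.h>

/* Illustrative recovery state machine mirroring the three phases described:
 * quiesce, recover to a known good state, then remediate via a patch. */
typedef enum {
    STATE_NORMAL,
    STATE_QUIESCENT,   /* advanced functionality off, attack surface reduced */
    STATE_RECOVERING,  /* secure boot manager restoring a known good image   */
    STATE_PATCHED      /* remediated with a signed, version-checked update   */
} device_state_t;

typedef enum {
    EVT_ATTACK_DETECTED,
    EVT_RESET_DONE,
    EVT_PATCH_VERIFIED
} event_t;

/* Advance only along the legal recovery path; all other events are ignored. */
device_state_t next_state(device_state_t s, event_t e) {
    switch (s) {
    case STATE_NORMAL:     return e == EVT_ATTACK_DETECTED ? STATE_QUIESCENT  : s;
    case STATE_QUIESCENT:  return e == EVT_RESET_DONE      ? STATE_RECOVERING : s;
    case STATE_RECOVERING: return e == EVT_PATCH_VERIFIED  ? STATE_PATCHED    : s;
    default:               return s;
    }
}
```

Keeping the transitions this constrained means an attacker cannot skip the recovery step and move straight from a compromised state to an updated one.<br />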



V. BUILDING RESILIENT SYSTEMS<br />

Previously, we introduced many aspects necessary to<br />

recover from an attack, focusing on executing from a known<br />

good position, the ability to patch identified compromises, and<br />

the desire to update with versioning.<br />

The mechanism to do this within a microcontroller is a<br />

Secure Boot Manager, a small and lightweight security kernel<br />

operating at the lowest execution level, which is capable of<br />

being certified and flexible enough to support the long<br />

lifecycles needed for IoT-centric devices.<br />

The Secure Boot Manager, outlined below, is a modular<br />

framework which can be configured within a commercial IDE,<br />

such as the IAR Embedded Workbench, to deliver a range of<br />

solutions that stretch from a very small codebase to a feature-rich set of security functions. The Secure Boot Manager itself leverages the device boot framework of security-oriented microcontrollers, such as the STMicroelectronics STM32H7 and the Renesas Synergy S5 families, to create a secured domain operating below the RTOS, enabling even low-level drivers to be managed and updated over the lifecycle of the devices.<br />
devices.<br />

A modern SBM, such as that shown in Figure 2, should<br />

implement a rich set of functions to support the next<br />

generation of ultra-long-life devices.<br />

Figure 2. Citadel Edge Secure Boot Manager. (Secure Thingz)<br />

Firstly, it is important that the SBM support the secure key<br />

storage and management resident in the microcontroller.<br />

Fundamentally, to enable secure services, the devices must<br />

have sufficiently secure storage in which to store the private<br />

keys upon which the first secure communications over PKI<br />

asymmetric cryptography rely. Once a secure channel to the<br />

device is established, it is possible to program the device<br />

securely.<br />
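The boot-time measurement this implies can be sketched as follows; the FNV-1a digest is purely a stand-in for a real cryptographic hash such as SHA-256, and the function names are hypothetical.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy digest standing in for a real hash (e.g. SHA-256) over the firmware. */
static uint32_t toy_digest(const uint8_t *img, size_t len) {
    uint32_t d = 2166136261u;               /* FNV-1a, for illustration only */
    for (size_t i = 0; i < len; i++) {
        d ^= img[i];
        d *= 16777619u;
    }
    return d;
}

/* Boot-time check: only jump to the application if its measured digest
 * matches the value provisioned into secure storage. */
int boot_allowed(const uint8_t *img, size_t len, uint32_t provisioned_digest) {
    return toy_digest(img, len) == provisioned_digest;
}
```

The reference digest lives in the secure storage discussed above, so an attacker who modifies the application image cannot also forge the value it is checked against.<br />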

Through leveraging the secure channel, it is possible to<br />

start building the secure foundations a product requires. First,<br />

the identity must be provisioned. The identity may be a<br />

certificate structure, such as the standard X.509 format, or a<br />

more bespoke form tailored to the IoT. The certificate relies<br />

on additional private keys being provisioned into the device to<br />

lock the device to the certificate. This may be extended with<br />

other forms of identity, such as physically unclonable<br />

functions (PUFs) or additional ownership keys. Once the<br />

device has been provisioned, it can be securely programmed.<br />

Secure programming enables OEMs to develop a master of their application, ensuring the code is both signed for specific devices and encrypted so that it remains secure in transit.<br />

Through a secure programming system, it is then possible to<br />

validate the device, via its certificate infrastructure, and<br />

deliver the encrypted image to the device, block by block. The<br />

key generated at mastering is also then exposed to the device,<br />

unlocking the code and enabling a raw image to be written to<br />

flash.<br />
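A minimal sketch of this block-by-block delivery, assuming a trivial XOR keystream in place of a real cipher such as AES, might look like this (all names are illustrative):<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

#define BLOCK_SIZE 16

/* XOR keystream stands in for a real cipher (e.g. AES); illustrative only. */
static void decrypt_block(uint8_t *dst, const uint8_t *src, uint8_t key) {
    for (size_t i = 0; i < BLOCK_SIZE; i++)
        dst[i] = src[i] ^ key;
}

/* Receive an encrypted image block by block, decrypt each block with the key
 * exposed at mastering time, and write the raw image into (simulated) flash. */
void program_image(uint8_t *flash, const uint8_t *enc,
                   size_t nblocks, uint8_t key) {
    for (size_t b = 0; b < nblocks; b++)
        decrypt_block(flash + b * BLOCK_SIZE, enc + b * BLOCK_SIZE, key);
}
```

Processing one block at a time keeps RAM requirements small, which matters on the constrained microcontrollers the paper targets.<br />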

The update framework within the SBM further extends this<br />

capability to enable patches generated by the IDE to be<br />

targeted to groups of devices or individual devices, based on<br />

the identity provisioned into the device, and on the definition<br />

of the Security World within the IDE itself. The update can<br />

take many forms including a single line code fix, a module, or<br />

an entire image depending on the fix being undertaken. In this<br />

way we can keep the impact of updates to a minimum and speed the adoption of patch infrastructure.<br />
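Targeting a patch to a device group or an individual device, as described above, might be modelled like this; the descriptor fields are hypothetical, not the SBM's actual format.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Hypothetical patch descriptor; field names are illustrative. */
typedef struct {
    uint32_t group_id;   /* product group the patch is mastered for */
    uint64_t device_id;  /* non-zero when aimed at a single device  */
} patch_target_t;

/* Apply a patch only if it is addressed to this device's group, or to
 * this exact device, matching the identity provisioned at manufacture. */
int patch_targets_me(const patch_target_t *t,
                     uint32_t my_group, uint64_t my_id) {
    if (t->device_id != 0)
        return t->device_id == my_id;
    return t->group_id == my_group;
}
```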

Additionally, as mentioned earlier in the paper, it is important to integrate version management into our IoT devices. Whilst signed images prove they came from a valid source, they may unfortunately contain known exploits, and as such, it is important that attackers cannot explicitly roll back software to a known-bad version and take control of the<br />

device. This solution obviously requires integration into both<br />

the target device (MCU) and the development environment to<br />

function correctly, alongside the mastering tool which injects<br />

and manages keys. Hence a holistic and well integrated<br />

system is required.<br />
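As a sketch, the anti-rollback check reduces to a monotonic version comparison. accept_update and the plain stored_version variable are illustrative stand-ins; in a real Secure Boot Manager the last-accepted version would live in protected non-volatile storage and the candidate version would come from the image's validated, signed header.

```c
#include <stdbool.h>
#include <stdint.h>

/* Stand-in for a counter held in protected non-volatile storage. */
static uint32_t stored_version = 7;

/* Reject any image whose version is lower than the last one accepted,
 * so an attacker cannot roll back to a signed but known-bad build. */
bool accept_update(uint32_t image_version)
{
    if (image_version < stored_version)
        return false;               /* rollback attempt: refuse */
    stored_version = image_version; /* monotonic update */
    return true;
}
```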

Modular updates are traditionally outside of the scope of<br />

microcontrollers due to the monolithic nature of their memory<br />

systems. However, with the advent of the TrustZone for<br />

Cortex-M devices (ARMv8-M architecture), we are seeing<br />

this change, with multiple modules now lying across<br />

protected and open memory. In this context, we can now focus<br />

on delivering smaller modules, but obviously the system needs<br />

to be compiled and built with this in mind. The advantage to<br />

this approach is that we can more easily constrain modules,<br />

increasing security, but also ensure that we minimise<br />

bandwidth impact alongside conserving battery power, as we<br />

do not have to program so much flash.<br />

VI. SUMMARY<br />

The reality of modern connected systems is that every<br />

system will be compromised, and as such, the impact on the<br />

industry will be long-term and pervasive. The solution to this<br />

inherent insecurity is to provide both business and technical<br />

frameworks that accept flaws and compromises, embrace them<br />



as a way of improving the product; and ultimately drive<br />

patches and updates across the lifecycle of the device.<br />

The end perspective is that while we cannot stop flaws we<br />

can, and should, continually improve the solution. To achieve<br />

this, a Secure Boot Manager delivering low-level secure<br />
services for patching, updates, remediation and foundational<br />
recovery mechanisms is key. Secure<br />
microcontrollers, leveraging advanced technologies such as<br />
ARM TrustZone for Cortex-M, lend themselves to dynamic<br />
remediation and recovery and, coupled with advanced Secure<br />
Boot Managers, provide the foundation for a secure future.<br />

VII. ACKNOWLEDGMENT<br />

All trademarks are the property of their respective<br />

holders.<br />



Hack-Proofing Your C/C++ Code<br />

Copyright 2018 by Greg Davis<br />

Introduction<br />

We are good at working with unreliable machines. At home, I lease a DVR from my<br />

cable company. It often locks up when I fast forward through commercials, so I have<br />

learned to hit a different key that skips rather than runs fast forward, and the DVR<br />

behaves better. When it comes time for the manufacturer of the DVR to release a<br />

product, they are under time pressure, and they focus on fixing the most critical bugs<br />

from a usability standpoint. A company that is not concerned about security is only<br />
concerned about “good enough”.<br />

But, if we are to focus on making our product hack-proof, we must hold ourselves to a<br />

higher standard. Hackers are known to use extreme testing, fuzzing, and static analysis to<br />

look for reliability problems in the product. When these reliability problems are found,<br />

they analyze them to see if they are commonly exploited problems, such as buffer<br />

overflows. The most promising bugs are pushed further as the hackers search for an<br />

exploit that will allow them to take control of the system. It is worth noting that source<br />

code is not necessary for a hacker to do these things. Source code makes their job easier,<br />

but it is not a requirement. The same thing can be said of security mechanisms such as<br />

ASLR, execute bits in the MMU, or stack canaries; their presence merely makes a<br />

hacker’s job harder. So to a hacker, “good enough” is not good enough to keep them out.<br />

You need to achieve a much higher level of reliability.<br />

The architecture of your software is definitely important, but when it comes to<br />

security, the way you write your code is just as important. Try looking over the recent<br />

security vulnerabilities in your browser; by my count, 75% of the critical problems are<br />

due to typical C/C++ program flaws (such as array overruns, use after free, etc.) as opposed<br />

to architectural flaws (such as privilege escalation or security bypasses). Thus, this paper<br />

focuses on tools and techniques that can be used to help prevent these coding flaws.<br />

Coding Standards<br />

An effective tool is to restrict C and C++ to avoid the problematic areas of the language.<br />

Coding standards do just this. When many people think of coding standards, they think<br />

of naming conventions, indentation styles, commenting, and the like. While these things<br />

are important, they’re also a religious issue. These issues are also just as applicable in<br />

other “safer” programming languages.<br />

The coding standards I’d like to discuss make the language safer and easier to<br />

understand. A number of standards exist such as MISRA C, MISRA C++, “The Power<br />

of Ten”, the Joint Strike Fighter C++ Coding Standard, and the CERT standard. I’ll give<br />

a couple of examples of what kinds of rules these standards use.<br />



For example, can you spot the problem in this code?<br />

line_a |= 256; /* set bit 8 */<br />

line_b |= 128; /* set bit 7 */<br />

line_c |= 064; /* set bit 6 */<br />

The problem is that in C and C++, any constant that starts with a 0 is an octal constant.<br />

So, while 64 == 0x40 and is a valid representation of bit 6, 064 == 52 or 0x34. Many<br />

coding standards avoid this problem by making it illegal to use octal constants at all. So,<br />

you’d have to express that last line as either:<br />

line_c |= 64; /* set bit 6 */<br />

or<br />

line_c |= 0x40; /* set bit 6 */<br />

As another example, can you spot the problem in this code? (The code will execute OK,<br />

but not in the way the programmer imagined)<br />

int round_to_nearest(float num)<br />

{<br />

if (num >= 0.0) {<br />

return (int)(num + 0.5);<br />

} else {<br />

return (int)(num - 0.5);<br />

}<br />

}<br />

The problem is that the constants 0.0 and 0.5 are expressed in double precision, while the<br />

input number is just in single precision. This means that all of the floating point in the<br />

function needs to be converted to double precision before it is operated on. While I<br />

specifically chose an example that will run OK, this misunderstanding between the<br />

programmer and the compiler is exactly the sort of thing that can cause reliability and<br />

performance problems later on. Coding standards may prevent this sort of thing by<br />

disallowing implicit casts between types.<br />
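One compliant rewrite is simply to give the constants float suffixes, so no promotion to double occurs; the function is renamed round_to_nearest_f here only to keep it distinct from the original:

```c
/* Same rounding logic, but 0.0f and 0.5f are single-precision
 * constants, so the arithmetic stays in float throughout. */
int round_to_nearest_f(float num)
{
    if (num >= 0.0f) {
        return (int)(num + 0.5f);
    } else {
        return (int)(num - 0.5f);
    }
}
```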

An important distinction when it comes to coding standards is to decide what can be<br />

automatically enforced. It’s one thing to catch problems during code reviews, but it’s<br />

another thing to have the problem pointed out immediately when you first try to compile<br />

the code. Manual code reviews also introduce room for human error or omission; who<br />

would have noticed the implicit conversion in the round_to_nearest() function,<br />

above?<br />



My recommendation when it comes to coding standards is twofold. First, start by reading<br />

up on some of the standards that I mentioned above. Hopefully they’ll give you some<br />

background and will pique your interest. Then, look at some of the tools that are<br />

available. Some are freely available, while others cost anywhere from $100 to thousands<br />

of dollars. What will your budget allow for? Look to configure the tools in a way that<br />

allows you to pick and choose the rules according to what you believe makes sense. You<br />

may agree with some rules in principle, but you may find that they are just too hard to<br />

address in your current code base. Start where you can, and improve your position over<br />

time.<br />

Static Analysis<br />

Static analysis is often seen as the big brother to coding standards. While coding<br />

standards look at code from a syntactic point of view, static analysis works during the<br />

compilation or building stage by simulating the effects of executing the code.<br />

As an example of what static analysis can detect, can you see the problem in the<br />

following code?<br />

int write_it(int dest /*fd*/, uintptr_t srcAddr,<br />

size_t len)<br />

{<br />

unsigned char *buf = (unsigned char *)srcAddr;<br />

int ret;<br />

while (len > 0 && (ret = my_write(dest, buf,<br />
len)) > 0)<br />
{<br />
buf += ret;<br />
len -= ret;<br />
}<br />
return ret;<br />
}<br />

The problem is that if “len” is zero, the value of “ret” will<br />
not have been initialized, so you’ll be returning an essentially random number.<br />

(Technically speaking, the behavior is much worse than this 1 .)<br />

1<br />

Although in many cases, the practical result of reading an uninitialized automatic variable is that the read<br />

may result in an indeterminate value, the result is considered undefined in the C and C++ standards.<br />

Examples exist where compilers will inadvertently optimize away sections of code due to a read of an<br />

uninitialized automatic.<br />



It’s worth noting that some of the coding standards might have prevented this bug by<br />

disallowing side effects on the right-hand side of a short-circuit operator. Forcing the<br />

user to write this code without the short-circuit operator might have been enough for the<br />

programmer to spot the error.<br />
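Independent of coding-standard rules, a defensive fix is to give ret a defined initial value. In this sketch, my_write() is stubbed out (as a stand-in that pretends every byte was written) purely so the fragment is self-contained:

```c
#include <stddef.h>
#include <stdint.h>

/* Stand-in for the paper's my_write(): behaves like a write() call
 * that always succeeds; present only to make the sketch runnable. */
static int my_write(int fd, const void *buf, size_t len)
{
    (void)fd; (void)buf;
    return (int)len;
}

int write_it_fixed(int dest, uintptr_t srcAddr, size_t len)
{
    const unsigned char *buf = (const unsigned char *)srcAddr;
    int ret = 0; /* defined result even when the loop body never runs */

    while (len > 0 && (ret = my_write(dest, buf, len)) > 0) {
        buf += (size_t)ret;
        len -= (size_t)ret;
    }
    return ret;
}
```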

That said, static analysis has a number of advantages over coding standards:<br />

1. Static analysis doesn’t prohibit any coding constructs. You can keep your<br />

existing code.<br />

2. Static analysis points out bugs that are likely to arise in practice, while coding<br />

standards require many changes in cases where there wasn’t actually an existing<br />

problem in the code.<br />

3. Modern static analysis tools look at code globally, allowing for detection of<br />

problems that only occur across procedure boundaries.<br />

Still, static analysis suffers from a number of limitations.<br />

1. Static analysis tools employ a relatively limited number of rules that they check<br />

for. Many sources of bugs are not covered by these rules.<br />

2. Whereas a false positive for a coding standard suggests an obvious rewrite to the<br />

code that will silence the diagnostic, a false positive for static analysis may be<br />

more cumbersome to work around.<br />

3. Some static analysis vendors have found it counterproductive to point out<br />

problems that cannot be easily explained. While this is certainly pragmatic, one<br />

worries about the omitted issues.<br />

Automatic Run-Time Error Checking<br />

Static analysis is called “static” because it works during the compilation process.<br />

Automatic run-time error checking (RTEC for short) works in an entirely different<br />

manner. RTEC looks for specific problems during the execution of your code.<br />

RTEC may be implemented in a number of ways:<br />

1. A compiler may add the checks automatically.<br />

2. Uninstrumented code may be run in a simulation environment, where the<br />

environment performs checks. The open-source “valgrind” project is an example<br />

of this.<br />
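For a concrete, minimal illustration: both approaches observe the program as it runs, so they report only faults that actually execute. The build and run commands in the comments are standard GCC/Clang and valgrind usage; the function itself is a deliberately tiny example.

```c
/* Sketch: with GCC or Clang, building this file with
 *   cc -g -fsanitize=address demo.c
 * makes the compiler insert validity checks around memory accesses;
 * alternatively, the unmodified binary can be run under
 *   valgrind ./a.out
 * Either way, an error is reported only when a faulty access
 * actually executes -- RTEC observes; it does not guess. */
static int array[5];

int get_elem(int index)
{
    /* An out-of-range index here is reported at run time by ASan
     * or valgrind, not at compile time. */
    return array[index];
}
```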

To see the difference between static analysis and RTEC, consider the following fragment<br />

of code:<br />

int array[5];<br />

int get_and_set1(int index, int value)<br />

{<br />

int ret = array[index];<br />

array[index] = value;<br />



}<br />

return ret;<br />

Obviously, any call to get_and_set1() will be invalid when the “index”<br />

argument is not between the values of 0 and 4, inclusive. A static analysis tool won’t<br />

report a problem unless it is reasonably sure that a value outside this range will be used.<br />

RTEC doesn’t guess. It essentially treats the code as if it were written:<br />

int array[5];<br />

int get_and_set1(int index, int value)<br />

{<br />

int ret;<br />

if (index < 0 || index > 4) {<br />
report_error();<br />
}<br />
ret = array[index];<br />
array[index] = value;<br />
return ret;<br />
}<br />

RTEC has the advantage of being able to detect error cases that are not apparent at<br />

compile time. On the other hand, it requires greater run-time resources, while static<br />

analysis runs solely at compile time.<br />

Not all classes of errors require RTEC. For example, consider the code:<br />

int get_and_set2(int index, int value)<br />

{<br />

static int *ptr = NULL;<br />

static int len = 0;<br />

int ret;<br />

if (index >= len) {<br />

int *nptr = calloc(index + 1, sizeof(int));<br />
memcpy(nptr, ptr, len*sizeof(int));<br />
ptr = nptr;<br />
len = index + 1;<br />
}<br />
ret = ptr[index];<br />
ptr[index] = value;<br />

return ret;<br />

}<br />



This code takes care of the array overrun that plagued get_and_set1(), but it<br />

doesn’t check for a NULL pointer return from the calloc() function. For all<br />

practical purposes, any use of a pointer returned from a C memory allocation routine<br />

before checking for a NULL pointer is an error. Static analysis is perfectly capable of<br />

detecting this class of problem. RTEC focuses on problems that are dynamic. RTEC<br />

complements coding standards and static analysis because of its dynamism, but it is<br />

only as good as your test vectors.<br />
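As a sketch of the missing check, using a helper named grow_table (not part of the original code): the pattern is simply to test the calloc() result before any other use of the pointer.

```c
#include <stdlib.h>
#include <string.h>

/* Grow an int table, propagating allocation failure to the caller
 * instead of dereferencing a possibly-NULL pointer. */
int *grow_table(int *old, size_t old_len, size_t new_len)
{
    int *nptr = calloc(new_len, sizeof *nptr);
    if (nptr == NULL) {
        return NULL;    /* the check get_and_set2() forgot */
    }
    if (old != NULL) {
        memcpy(nptr, old, old_len * sizeof *old);
        free(old);      /* also avoid leaking the old buffer */
    }
    return nptr;
}
```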

Assertions<br />

Another pillar of high reliability software is the frequent use of assertions.<br />

A static assertion is an assertion that must be checked at compile-time. It can be used for<br />

defensive programming and to make explicit the assumptions in code.<br />

// The following function assumes that a pointer<br />

// and an int are the same size.<br />

static_assert(sizeof(int) == sizeof(void *), "");<br />

void do_sketchy_pointer_arithmetic(void)<br />

{<br />

// ...<br />

A static assertion can only be used to check compile-time constants. For example:<br />

static_assert(sizeof(header) <= 64, "");<br />

// ...<br />



A run-time assertion, by contrast, generates code that is checked during execution. Typically, assertions are<br />

enabled during development to find as many problems as possible. Then when a product<br />

enters a testing phase, assertions are disabled so that the product will run as fast as<br />

possible.<br />

Most development projects will implement their own form of assertions rather than using<br />

the standard implementation. Since the standard assert.h requires a lot of<br />

strings for each dynamic assert, a custom assert macro may be desired in order to produce<br />

smaller code. Some aspects to consider include:<br />

1. What happens when an assertion fails? Is there some kind of console where the<br />

error can be printed? Other assertion systems will go into an infinite loop when<br />

an assertion fails, expecting the programmer to find the problem in a debugger.<br />

2. What happens to a run-time assertion when run-time assertions are disabled?<br />
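A minimal custom assert macro along these lines might look like the following sketch. The handler here counts failures so the behavior can be observed; a real embedded port might instead print to a console, spin in a loop for the debugger, or reset the device, and the macro compiles away entirely under NDEBUG.

```c
#include <stdio.h>

/* Test hook; a real port might loop forever or reset the device. */
static int g_assert_failures;

static void my_assert_fail(const char *file, int line)
{
    fprintf(stderr, "ASSERT %s:%d\n", file, line);
    g_assert_failures++;   /* or: for (;;) {} to wait for a debugger */
}

/* Only __FILE__/__LINE__ are stored per site -- no stringized
 * condition text -- which keeps the code smaller than assert.h. */
#ifdef NDEBUG
#define MY_ASSERT(cond) ((void)0)
#else
#define MY_ASSERT(cond) \
    ((cond) ? (void)0 : my_assert_fail(__FILE__, __LINE__))
#endif
```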

You should not write code that relies on an assertion in order to be correct. For example:<br />

// Bad example: We need save_to_flash to be<br />

// executed even when the run-time assertion<br />

// macro is set to not do anything.<br />

assert(save_to_flash(data) == err_none);<br />

// Better example:<br />

err_t result = save_to_flash(data);<br />

assert(result == err_none);<br />

It is OK if assertions use conditions involving function calls, so long as these function<br />

calls are not doing anything that the system might come to rely on. For example:<br />

assert(dictionary.size() > 10000); // OK<br />

Like compile-time assertions, run-time assertions are most valuable to you when they<br />

break your system, because fixing the assertion will be easier than finding the problem<br />

when it manifests itself as a downstream glitch.<br />

Conclusion<br />

We have explored a number of tools and techniques that you can use to help your project<br />

become more secure.<br />



What Is an IoT OS?<br />

Christian Légaré<br />

Silicon Labs Inc.<br />

Montréal, Québec, Canada<br />

christian.legare@silabs.com<br />

Abstract— A lot of attention in the Internet of Things (IoT) is<br />

given to the cloud: data, analytics, networking (fog computing),<br />

and mobile devices (tablets and smartphones). Unfortunately, IoT<br />

devices – the devices that produce the data – are often the<br />

neglected element in this system. Little attention is given to the<br />

architecture, design, and implementation of IoT devices. This<br />

paper will cover one aspect of the development of IoT devices: The<br />

IoT OS.<br />

First, we need to differentiate between a real-time kernel and<br />

an operating system. In the embedded space, a kernel is often<br />

referred to as an RTOS (real-time operating system), but that’s<br />

something of a misnomer: it’s not actually a full-fledged operating<br />

system. Rather, a kernel is the basis of a complete operating<br />

system.<br />

The embedded OS, or IoT OS, is composed of an RTOS (real-time<br />
kernel) plus multiple services and middleware stacks that<br />

provide connectivity and security.<br />

Keywords—IoT, OS, RTOS, MCU, Connectivity, Security,<br />

Modularity, Scalability, Machine Learning, Blockchain<br />

Yet at the same time, these devices may require multiple<br />

networking protocols, security (multiple encryption and<br />

decryption algorithms), and the ability for remote firmware<br />

updates (Firmware Over the Air, FOTA). All this requires<br />

resources that push the limits of the average microcontroller.<br />

And we haven’t even mentioned new technologies such as<br />

machine learning, blockchain, and others. So, how do we<br />

address these requirements with such limited resources?<br />

We must architect the IoT system in such a way so that we can<br />

achieve all these requirements and still meet an ownership cost<br />

that makes the system commercially viable.<br />

First, the use of a gateway in many cases is a virtual necessity.<br />

It is impossible to run all the software that an IoT system needs<br />

on a sensor/actuator device. So, which operating system do we<br />

run on the device, and which functions do we enable? And<br />

similarly, which operating system do we run on the gateway,<br />

and which services do we provide?<br />

I. INTRODUCTION<br />

The IoT is often presented as information technology (IT) being<br />
pushed into the realm of operational technology (OT). In practice,<br />
however, the typical set of technologies employed by<br />
IT – especially web technologies – cannot simply be applied to<br />
building IoT edge devices.<br />

Looking at the IoT this way presents a major problem:<br />

specifically, the cost of the hardware used for IoT devices.<br />

Because IoT devices are produced in enormous quantities, we<br />

need them to be produced as cheaply as possible to make the<br />

system affordable and to make the business case ROI-positive.<br />

The average IoT device microcontroller runs between 50 MHz<br />

and 200 MHz, contains between 64 KB and 1 MB of flash<br />

memory (for code space), and has 4 to 512 KB of RAM. By<br />

comparison, processors that run smartphones, tablets, or cloud<br />

servers run at gigahertz speeds, have terabytes of available<br />
storage, and gigabytes of RAM. So, the average IoT device is not<br />

capable of running typical IT software.<br />

Figure 1: Generic IoT system architecture<br />

Typically, the gateway will be running on an application<br />

processor (Cortex-A, Intel Quark or similar) and will use a<br />

general-purpose OS (GPOS) such as Android or Linux.<br />

The majority of the predicted billions of IoT devices will use<br />

microcontrollers (typically a Cortex-M), and so a GPOS is out<br />

of the question. Some software developers may prefer to use<br />

fine-tuned bare metal code to maximize the amount of code that<br />



can run on the device. This can be a solution, but it typically<br />

takes a lot of time to develop a reasonable IoT application using<br />

a single-threaded (bare metal) approach, which increases time<br />

to market.<br />

The essence of IoT is connectivity. Connectivity stacks<br />

(TCP/IP, Wi-Fi, Bluetooth, Thread, Zigbee, Wireless Hart, and<br />

many others) are large pieces of code, and are time-sensitive<br />

(protocols rely on timeouts). Therefore, the use of a real-time<br />

kernel is a good practice. It simplifies the software architecture,<br />

helps achieve performance, and reduces maintenance costs.<br />

Designing a product is all about the system requirements. A<br />

bare-metal (single threaded, super-loop) approach might be<br />

satisfactory for the design. In other situations, an open-source<br />

solution might be the right choice, based on the desired<br />

functions and features.<br />

An IoT-specific OS will increasingly play a significant role in<br />

device design. While there are GPOSs out there that provide<br />

connectivity, and do find their way into embedded systems,<br />

those GPOS-based systems do not meet real-time requirements.<br />

And there are other types of requirements where an IoT OS is a<br />

much better match.<br />

Safety certification is also a concern for IoT systems, as there<br />

are industries where devices and the software running on them<br />

must meet safety-critical regulations. Using an IoT OS (or at<br />

least a kernel) that already has a validation suite available will<br />

save time and money in these markets.<br />

As of today, there is still no industry definition of an IoT OS.<br />

The following sections will lay the foundation of what such an<br />

OS must be. For the industry to build and ship the forecasted<br />

billions of deployed devices, such a definition is mandatory.<br />

II. INDUSTRIAL VS COMMERCIAL<br />

The software requirements for industrial and consumer IoT<br />

devices can differ quite a bit. Although they might share a<br />

common kernel and low-level services, the middleware<br />

required by their applications can be radically different.<br />

Figure 2: A low-power industrial IoT device (left)<br />
and a consumer IoT device (right)<br />
In Figure 2, the left side depicts the software stack for an<br />
industrial IoT device such as a wireless sensor node. This is a<br />
low-power, low-cost device that may run entirely on battery.<br />
Such a device might typically use a Cortex-M0 or<br />
Cortex-M3/M4 MCU. It would use a highly efficient wireless network<br />
protocol such as Zigbee, Thread or Z-Wave to reduce<br />
transmission time and save power. And it would communicate<br />
over short distances wirelessly using Bluetooth or low-power<br />
Wi-Fi, or else use Ethernet, Sigfox, LoRa or NB-IoT when used<br />
as an edge node. This kind of industrial product must never fail<br />
once deployed, and it would be in service for decades.<br />
The right side of Figure 2 illustrates the software stack for a<br />
consumer IoT device. In a consumer environment, web<br />
technologies are more common, and this example includes a<br />
Java virtual machine. Consumer products may also make use of<br />
specific vertical market protocols such as AllJoyn, HomeKit,<br />
HomePlug/HomeGrid, Continua Health Alliance, or 2net. Such<br />
a device typically might use a Cortex-M3/M4 or a Cortex-A<br />
processor.<br />
Consumer products typically have shorter lifespans than<br />
industrial products, and tend to be replaced more frequently.<br />
Consumers also are more accepting of product failures; for<br />
example, how often do you reboot your smartphone? The fact<br />
is that failures in consumer products are tolerable, whereas in<br />
industrial products, failures can endanger people’s lives. And<br />
the lengthy validation process to establish the reliability of<br />
embedded software takes more time and money than the makers<br />
of consumer devices are willing (or even need) to spend.<br />
These requirements will drive your choice of operating system,<br />
as platform choice shouldn't dictate a device's functionality.<br />
III. MODULARITY<br />
IoT devices will also require a modular operating system that<br />
separates the core kernel from middleware, protocols, and<br />
applications. The reasons are ease of development and keeping<br />
the memory footprint of the software to a minimum.<br />
Figure 3: Modular IoT OS.<br />
Common modules in orange, optional stacks in blue.<br />
Using a modular OS simplifies the development process,<br />
especially when developing a family of devices with different<br />
capabilities. Relying on a common core allows the entire family<br />
of devices to share a common code base, while each device is<br />



customized with only the middleware and protocol stacks<br />

required by the application.<br />

This approach also allows for a smaller memory footprint in the<br />

device. Unlike a monolithic operating system that bundles an<br />

entire suite of software together, a modular operating system<br />

allows for tailoring the embedded software for the device,<br />

requiring less RAM and flash memory and reducing costs.<br />

A real-time kernel, used as a simple scheduler, requires about<br />

4–5 KB of code space. In a multithreaded environment, we’ll<br />

need most of the kernel services (semaphores, mutexes,<br />

message queues, event flags, and so on), and the kernel will<br />

require something like 20–25 KB of flash memory for code and<br />

about 2 KB of RAM. Depending on the number of tasks the<br />

kernel needs to manage, RAM usage will grow because each<br />

task requires a stack. We can estimate an average of 1 KB of<br />

RAM per task; the total depends on the application complexity<br />

and the call stack depth.<br />

The total required code space depends on how many stacks are<br />

involved. For example, a TCP/IP stack can require about 50 KB<br />

of code and tens of KB of RAM, depending on how many<br />

connections are opened and the desired performance. Similarly,<br />

the code for Bluetooth requires about 120 KB, and its RAM<br />

usage is about 20 KB.<br />

So, you need to sum up all the code and RAM usage and<br />

evaluate whether the hardware can support the required<br />

software. With an IoT OS, the total is still very small when<br />

compared to a GPOS, but the average microcontroller does not<br />

have the resources of an application processor.<br />
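As a worked example of this bookkeeping, using the rough flash figures above plus an assumed 40 KB application on an assumed 256 KB part (both of those numbers are illustrative, not measurements):

```c
/* Back-of-the-envelope flash budget from the figures in the text:
 * kernel ~25 KB, TCP/IP ~50 KB, Bluetooth ~120 KB. APP_KB and
 * FLASH_KB are assumptions for the sake of the example. */
enum {
    KERNEL_KB = 25,
    TCPIP_KB  = 50,
    BLE_KB    = 120,
    APP_KB    = 40,   /* assumed application size */
    FLASH_KB  = 256,  /* assumed mid-range Cortex-M part */
};

int flash_budget_kb(void)
{
    /* 25 + 50 + 120 + 40 = 235 KB: fits in 256 KB, with little slack */
    return KERNEL_KB + TCPIP_KB + BLE_KB + APP_KB;
}
```

The same exercise must be repeated for RAM (task stacks, socket buffers, and so on) before committing to a part.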

The other important part of Figure 3 is the yellow box, the<br />

common API. If you are using a commercial OS, the API for all<br />

the various functions are likely to have commonalities and be<br />

easy to use, something that is not the case when using an open<br />

source operating system. A common API between different<br />

communication stacks makes it easier and faster for you to<br />

develop, test and validate your application.<br />

IV. SCALABILITY<br />

A flexible, scalable RTOS can help increase return on<br />

investment, cut development costs, and reduce time to market.<br />

Although deeply embedded systems have historically been built<br />

entirely around 8- and 16-bit MCUs, the price of 32-bit MCUs<br />

has been dropping rapidly. As they have become commodity<br />

products, their popularity for embedded devices has<br />

skyrocketed.<br />

A common engineering solution for networked sensor systems<br />

is to use two processors in the device. In this arrangement, an<br />

8- or 16-bit MCU is used for the sensor or actuator, while a 32-<br />

bit processor is used for the network interface. That second<br />

processor runs an IoT OS.<br />

Sales of 32-bit MCUs have exploded in the last decade and have<br />

become the largest segment of the MCU market. The 32-bit<br />

MCU segment alone is expected to grow to 70% of the MCU<br />

market, forecasted to total 100 billion MCUs by 2024. (For<br />

more information: https://goo.gl/PZ9QhY.)<br />

IoT devices will still contain a mixture of small and large MCUs<br />

for years to come. A scalable IoT OS that runs on a variety of<br />

16- and 32-bit MCUs will meet tight memory requirements,<br />

reduce processor demands, and save money.<br />

Even so, it is difficult to run all the security and<br />
market-specific IoT software (AI, blockchain, vertical-market<br />
protocols, and more) on a single microcontroller. The business<br />

case will drive the design. When using a larger microcontroller<br />

is not cost effective, you will need to centralize tasks on a<br />

gateway.<br />

Designing an IoT system to include a gateway has two main<br />

advantages:<br />

• The ability to scale down nodes to have a lighter software<br />

load<br />

• The ability to take cloud software to the edge (i.e., run part<br />

of your device lifecycle management locally)<br />

Depending on the target cost for the system, the gateway<br />

processor could be a microcontroller or an application<br />

processor. It could run either an IoT OS or heavier software<br />

such as Linux or Android, but it should be said that an IoT OS<br />

will allow you to squeeze every MIPS out of the processor.<br />

Using the same operating system for both the edge devices and<br />

the gateway will simplify your software development, ease the<br />

learning curve, and reduce product maintenance costs.<br />

Over time, the preferred CPU for the gateway will likely settle<br />

on an application processor. The edge nodes, on the other hand,<br />

will continue to require highly specialized components. To<br />

reach the desired target cost, the device designers will need to<br />

implement only the features that are strictly required for a given<br />

application. And it is crucial that the IoT OS is scalable across<br />

many varieties of devices, so that code can be portable.<br />

V. RELIABILITY<br />

Many IoT systems will be deployed in safety-critical<br />

environments, or in locations where repair and replacement<br />

are difficult. IoT devices will need to be faultlessly reliable.<br />

In these situations, an IoT OS must have safety-critical<br />

certification. This kind of certification is vital to demonstrate<br />

the reliability and safety of your device. Certifications that you<br />

may require include:<br />

• DO-178B for avionics systems<br />



• IEC 61508 for industrial control systems<br />

• IEC 62304 for medical devices<br />

• IEC SIL3/SIL4 for transportation and nuclear systems<br />

Certifying code is an expensive proposition, but the entire<br />

IoT OS may not require complete certification. The best<br />

practice would be to certify the kernel, its memory protection<br />

unit, and any task that is deemed safety-critical. Other<br />
non-safety-critical tasks can run in separate memory regions,<br />

allowing them to be isolated from the safety-critical parts of the<br />

application.<br />

When building products for use in a safety-critical environment,<br />

software that is already certified can reduce certification time<br />

for a device, and reduce costs. Every safety-critical part of the<br />

device will require certification and extensive documentation.<br />

Validation suites and certification kits, typically available from<br />

third parties, provide thousands of pages of documentation. To<br />

be clear, it is not the components that are certified; it is the<br />

complete product. But using modules that have existing<br />

validation suites or certification kits will save time and reduce<br />

cost.<br />

Even if certification isn't required for the device, knowing that<br />

the OS running within it has been certified can provide<br />

confidence and peace of mind that your product will perform<br />

reliably.<br />

VI. CONNECTIVITY<br />

Network connectivity is essential to the Internet of Things.<br />

Whether we are talking about wireless sensor nodes in a factory,<br />

or networked medical devices in a hospital, the industry now<br />

expects embedded devices to be connected to each other and to<br />

communicate with corporate or public networks.<br />

This fact changes how you think about product development:<br />

there will be less emphasis during development on<br />

multithreading. Your chosen platform must be easy to use and<br />

already hardened, feature robust connectivity out of the box,<br />

and must work with your chosen hardware. Essentially, you<br />

want to focus on your application and trust that the software<br />

platform is usable, rugged, and stable.<br />

To achieve these goals, the IoT OS must support<br />

communications standards and protocols such as Ethernet,<br />

IEEE 802.15.4, Wi-Fi, and Bluetooth. The device must be able<br />

to connect to IP networks using bandwidth-efficient protocols<br />

such as Thread. In Figure 3, the blue vertical stacks show many<br />

of the available connectivity technologies. The list is not<br />

exhaustive.<br />

An IoT OS will allow you to select only the specific protocol<br />

stacks you need, which again means saving memory on the<br />

device, and reducing cost. And it can help retrofit existing<br />

devices with new connectivity options without reworking the<br />

core of the embedded software.<br />
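As a sketch of how such per-device stack selection might look at build time, the fragment below gates each stack behind a compile-time flag so that disabled protocols never reach the image. The CFG_* macro names are invented for illustration; real IoT OSes typically drive this from a configuration system such as Kconfig.<br />

```c
#include <assert.h>

/* Compile-time stack selection: only the protocols enabled below are
 * compiled and linked into the image, saving flash and RAM. The macro
 * names are illustrative, not from any particular IoT OS. */
#define CFG_STACK_ETHERNET 0
#define CFG_STACK_802154   1
#define CFG_STACK_THREAD   1
#define CFG_STACK_BLE      0

int enabled_stack_count(void)
{
    int n = 0;
#if CFG_STACK_ETHERNET
    n++;  /* eth_init() would be called here */
#endif
#if CFG_STACK_802154
    n++;  /* ieee802154_init() */
#endif
#if CFG_STACK_THREAD
    n++;  /* thread_init(); Thread runs over 802.15.4 */
#endif
#if CFG_STACK_BLE
    n++;  /* ble_init() */
#endif
    return n;
}
```

With this scheme, retrofitting a device with a new connectivity option is a matter of enabling one flag and relinking, without touching the core of the application.<br />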

VII. POWER MANAGEMENT<br />

Referring again to Figure 3, an IoT system that uses two or more<br />

stacks with a power management strategy will require a power<br />

management service. Such a service receives the sleep signals<br />

(which specify when the system can enter a sleep mode),<br />

manages the peripherals going into a sleep mode, and wakes up<br />

the processor and peripherals when a wake signal (time-based<br />
or event-based) is detected.<br />

Power management cannot be delegated to any single protocol stack.<br />

Power management must be centralized so that all peripherals<br />

and memory are managed properly when the system is entering<br />

or exiting sleep mode.<br />
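A minimal sketch of such a centralized service is shown below: peripheral drivers register sleep and wake hooks, and only the power-management service sequences entry to and exit from sleep mode. All names are illustrative rather than taken from a particular kernel.<br />

```c
#include <assert.h>

/* Minimal sketch of a centralized power-management service. Peripherals
 * register sleep/wake hooks; only this service sequences sleep entry/exit,
 * so no individual stack owns the transition. */
#define PM_MAX_CLIENTS 8

typedef struct {
    void (*sleep)(void);  /* quiesce the peripheral */
    void (*wake)(void);   /* restore the peripheral */
} pm_client_t;

static pm_client_t pm_clients[PM_MAX_CLIENTS];
static int pm_count;

int pm_register(void (*sleep)(void), void (*wake)(void))
{
    if (pm_count >= PM_MAX_CLIENTS) return -1;
    pm_clients[pm_count].sleep = sleep;
    pm_clients[pm_count].wake = wake;
    pm_count++;
    return 0;
}

/* Called when a sleep signal says the system may enter sleep mode. */
void pm_enter_sleep(void)
{
    for (int i = 0; i < pm_count; i++) pm_clients[i].sleep();
    /* here the kernel would execute WFI or a vendor low-power entry */
}

/* Called on a time-based or event-based wake signal. */
void pm_exit_sleep(void)
{
    for (int i = pm_count - 1; i >= 0; i--) pm_clients[i].wake();
}

/* Example client: a hypothetical UART driver that tracks its state. */
static int uart_active = 1;
static void uart_sleep(void) { uart_active = 0; }
static void uart_wake(void)  { uart_active = 1; }
```

The vendor-specific part, the actual low-power instruction and register writes, stays behind the pm_enter_sleep/pm_exit_sleep boundary, which is exactly the porting surface described above.<br />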

Not so long ago, a real-time kernel did not have to worry about<br />

this kind of feature; it was left to the designer to implement it.<br />

But now, with the growing complexity of edge devices and the<br />

multiplication of battery-operated devices, power management<br />

is becoming a commodity. This service must now be a standard<br />

component of a real-time kernel.<br />

This presents a problem for portability. Each silicon vendor<br />

implements sleep modes differently. Even if the real-time<br />

kernel implements sleep and wake-up functions, these functions<br />

must be ported to each unique hardware architecture.<br />

VIII. SECURITY<br />

Security is the hot-button topic in media coverage of IoT today.<br />

The average lifetime of a consumer IoT device may be years,<br />

but industrial IoT devices must function for decades. We need<br />

to rethink how to protect these devices. Consider the increasing<br />

lifespan of IoT devices, combined with the huge number of such<br />

devices producing an IoT blanket covering the globe, plus the<br />

rapid advances in knowledge and tools used by attackers. It is<br />

simply not feasible to build end-node devices that are supposed<br />

to remain secure throughout their practical lifetimes.<br />

To help protect the software running on these devices, silicon<br />

vendors are adding hardware security features to their<br />

processors. Memory protection units (MPUs) have been used for<br />

a long time in safety-critical applications to isolate code and<br />

data. Now, MPUs are being applied to security. But MPUs<br />

alone are not sufficient. When it comes to cryptography and<br />

security key management, we need to guarantee that the keys<br />

will not be tampered with, and are stored securely.<br />

A complete trusted execution environment (TEE), which is<br />

usually found on application processors, is now available on<br />

microcontrollers. ARM-based microcontrollers now feature<br />

hardware components such as a secure element (which contains<br />

encrypted keys) and/or TrustZone. The inclusion of these new<br />

hardware features requires additional software to configure and<br />

control them. The configuration and management of these<br />



hardware components – MPU, secure elements and TrustZone<br />

– are the responsibility of the real-time kernel and so become<br />

new services available to the application tasks.<br />

IX. AI AND MACHINE LEARNING<br />

With the introduction of artificial intelligence (AI)<br />

technologies, end-users will start perceiving IoT systems as<br />

having human-like qualities. When a user wants to get or set<br />

data on a device, the response time (often referred to as latency)<br />

will be crucial. And if most of the AI processing is done in the<br />

cloud, it only adds to the response time. This is one reason AI<br />

technologies such as deep learning are moving to the edge of<br />

the network. Edge devices will increasingly handle some<br />

portion of the AI processing.<br />

In digital assistant services such as the ones offered by Amazon,<br />

Apple, and Google, cloud computing is being used for natural<br />

language processing. As this technology evolves, the<br />

algorithms could be transferred to the edge, allowing a faster<br />

response time. Applying the same principle to other forms of<br />

AI in devices, we can see how to build larger, connected<br />

systems. As AI decision making moves closer to the edge,<br />

automation becomes faster and therefore applicable to more<br />

scenarios.<br />

As AI and deep learning algorithms improve, and as processors<br />

become more powerful and optimized to run such software,<br />

decision making at the edge will become a commodity service<br />

and provide the framework for a new generation of<br />

applications.<br />

An IoT OS provides the software architecture for these new<br />

algorithms to coexist with sensor/actuator and communication<br />

software. AI algorithms would be implemented as tasks that run<br />

concurrently with other system tasks.<br />
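As a toy illustration of that coexistence, the sketch below runs an "AI inference" task in turn with a sensor task under a trivial cooperative scheduler. A real IoT OS would use preemptive, priority-based scheduling; the round-robin loop here only shows the structural idea of AI work packaged as just another task.<br />

```c
#include <assert.h>

/* Toy cooperative scheduler: an "AI inference" task coexists with a
 * sensor task, each run to completion in turn. The task names and the
 * run-to-completion model are illustrative only. */
#define MAX_TASKS 4

typedef void (*task_fn)(void);
static task_fn tasks[MAX_TASKS];
static int task_count;

int scheduler_add(task_fn fn)
{
    if (task_count >= MAX_TASKS) return -1;
    tasks[task_count++] = fn;
    return 0;
}

/* One scheduling round: give every registered task a slice. */
void scheduler_run_round(void)
{
    for (int i = 0; i < task_count; i++) tasks[i]();
}

static int sensor_reads, inferences;
static void sensor_task(void) { sensor_reads++; }
static void ai_task(void)     { inferences++; }  /* model inference would run here */
```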

X. BLOCKCHAIN<br />

There are many obstacles today slowing down the adoption of<br />

IoT. First, the market for IoT devices and platforms is<br />

fragmented, with many standards and many vendors. The<br />

uncertainty about the technology, the vendors, and the solutions<br />

adds to these obstacles.<br />

Second, there are also concerns about interoperability, and the<br />

solutions implemented often tend to create new data silos.<br />

As I mentioned above in the Security section, data is often<br />

stored in the cloud securely, but these cloud-based security<br />

implementations cannot protect devices against compromised<br />

integrity, nor against tampering with data at the source.<br />

Finally, the centralized architecture of most IoT solutions<br />

means that there are potentially serious issues with resiliency.<br />

When all transactions are processed in the cloud, unavailability<br />

of cloud resources will freeze your business operations.<br />

Blockchain is a technology that could help with system<br />

resiliency. The basic concept of blockchain is quite simple: it is<br />

a distributed database that maintains a continuously growing<br />

list of ordered records. But the term “blockchain” is usually tied<br />

to transactions, smart contracts, or cryptocurrencies. This is<br />

why we need to dissociate blockchain from specific<br />

implementations such as Bitcoin and Ethereum. In fact, the<br />

convergence of blockchain and the IoT is on the agenda for<br />

many companies. And there are existing implementations,<br />

solutions, and initiatives in several areas outside of IoT and<br />

financial services.<br />

The blockchain community believes that KSI (Keyless<br />

Signature Infrastructure) blockchain is a technology that can be<br />

used to provide integrity for all assets, including the cloud and<br />

edge devices.<br />

According to IBM, the three benefits of blockchain for IoT are:<br />

building trust, cost reduction and the acceleration of<br />

transactions. Specifically:<br />

• Building trust between the parties and devices with<br />

blockchain cryptography and reducing the risk of<br />

collusion and tampering<br />

• Reducing cost by removing the overhead associated with<br />

middlemen and intermediaries<br />

• Accelerating transactions by reducing the settlement time<br />

from days to nearly instantaneous<br />

Blockchain can add a commercial dimension to IoT. A block<br />

contains the transaction, but can also contain the contract. So,<br />

an IoT device could buy or sell data from/to another device or<br />

system.<br />

The blending of blockchain and IoT devices is not for the near<br />

future. Blockchain processing tasks are computationally<br />

difficult and time-consuming, and IoT devices are still<br />

relatively underpowered, lacking the processing power to<br />

directly participate in a blockchain. This is for good reason: the<br />

heavy computational load helps protect integrity. As Salil<br />

Kanhere, an associate professor and researcher at the University<br />

of New South Wales presents it: “Standard IoT devices can’t do<br />

this kind of heavy computational work, just like you can’t mine<br />

bitcoins on a standard laptop anymore”. So this type of<br />
application will be seen on high-end gateways first.<br />

But as hardware and software technologies evolve, the<br />

architecture provided by an IoT OS would allow the device<br />

blockchain function to integrate well with all the other system<br />

tasks.<br />

XI. TOOLS<br />

An IoT OS has many more layers of complexity than an RTOS.<br />

And any state-of-the-art commercial offering ought to include<br />

its own customized development environment.<br />



Of course, a toolchain (compiler, linker, loader) is mandatory<br />

for software development. Commercial toolchains, including<br />

IDEs and other advanced debugging tools, are commodity<br />

products, and every developer has his or her own favorite tool.<br />

But from the point of view of integration with the OS itself, they<br />

are not strongly differentiated.<br />

Modern developer tools should provide a GUI-driven interface.<br />

And ideally, they should work in a stand-alone fashion so that<br />

they can be integrated into the customer’s development/test<br />

environment. This requires that they provide command-line<br />

interfaces and perhaps even run-time APIs (i.e., a server/client<br />

interface) for easy integration.<br />

The tools must be as easy to use as possible. Admittedly, it is a difficult design challenge to bring all these tools into a simple, streamlined workflow.<br />

The following are recommended requirements for any development environment that supports an IoT OS:<br />

• It should run on the most popular desktop platforms (Windows, Linux, macOS).<br />
• It should be context-aware, based on the selected processor.<br />
• It should use an attractive, intuitive, and responsive developer portal so that you can easily pull all the code and configuration data you require.<br />
• It should use a secure delivery method for its software and tools.<br />
• It should provide access to online training, such as videos, user guides, and API references.<br />
• The development environment should be agnostic with respect to the toolchain, and so it should provide seamless interoperation with various IDEs.<br />
• It should be web- and cloud-friendly.<br />
• It should be easy to update, as software is a living entity, and is always evolving.<br />

In summary, the IoT OS development tool should allow you to build a platform in a few minutes. It should resolve the software/hardware and software/software dependencies for you, and create a code base that compiles without warnings and errors. It has always been the goal of the embedded industry to provide a comprehensive tool that allows you to concentrate on your application and not the platform. With an IoT OS and proper tooling, this is an achievable goal.<br />

XII. CONCLUSION<br />

Commercial-grade real-time operating systems (RTOSs) are being deployed more and more widely for their determinism, flexibility, portability, scalability, and support.<br />

The use of a kernel, especially an RTOS, plus all the connectivity, security, and market-specific middleware, provides the software architecture and efficiency required for designing IoT devices.<br />

A modern IoT system can feature tiny sensor nodes running on small, low-power MCUs, as well as large gateways running on powerful application processors. A single OS that can run on all these types of processors is a vital component for any IoT system design. While the RTOS has been a commodity component for embedded devices for many years, we are now looking at a full-fledged operating system for a new class of devices: the IoT OS.<br />



Software Architectures for IoT<br />

Rob Oshana<br />

Vice President, Software Engineering<br />

Microcontrollers, NXP Semiconductors<br />

Austin, TX, USA<br />

robert.oshana@nxp.com<br />

Abstract—In this paper we will introduce a reference<br />

software architecture for the Internet of Things. Based on real<br />

world examples, we will define the software architecture<br />

requirements and constraints, discuss the fundamental<br />

structure of the software architecture, talk about the relevant<br />

cross-cutting concepts for IoT architectures such as logging,<br />

error handling, security and recovery, IoT operating systems,<br />

and connectivity stacks. We will discuss what software<br />

components are available in the open source community and<br />

how to leverage the software ecosystem to develop software<br />

architectures for IoT systems. We will discuss the three<br />

software stack architectures for device, gateway and cloud that<br />

together make up the IoT system architecture required for<br />

today's IoT systems. We will demonstrate this with some real<br />

world examples showing IoT software architectures for<br />

specific vertical markets.<br />

Keywords—IoT, Software, Architecture<br />

I. INTRODUCTION<br />

In this paper we will introduce a reference software<br />

architecture for the Internet of Things. Based on real world<br />

examples, we will define the software architecture<br />

requirements and constraints, discuss the fundamental structure<br />

of the software architecture, talk about the relevant cross-cutting<br />
concepts for IoT architectures such as logging, error<br />

handling, security and recovery, IoT operating systems, and<br />

connectivity stacks. We will discuss what software components<br />

are available in the open source community and how to<br />

leverage the software ecosystem to develop software<br />

architectures for IoT systems. We will discuss the three<br />

software stack architectures for device, gateway and cloud that<br />

together make up the IoT system architecture required for<br />

today's IoT systems. We will demonstrate this with some real<br />

world examples showing IoT software architectures for<br />

specific vertical markets.<br />

II. ACCUMULATE LEGO BLOCKS<br />

A. IoT Software Architectures<br />

IoT software architectures should be scalable across<br />

different categories:<br />

• Smart things<br />

• Connected things<br />

• Secure things<br />

• Safe things<br />

We need this Lego block approach because there are<br />

several IoT software architectures:<br />

1. Software stacks for constrained devices<br />

• Lightweight RTOS or bare metal, hardware<br />
abstractions, communications, remote<br />
management<br />

2. Software stacks for gateways<br />

• General purpose operating systems like<br />

Linux, communications/connectivity, data<br />

management, messaging, remote management<br />

3. Software stacks for cloud IoT<br />

• Analytics and applications<br />

We also need to consider cross-cutting concerns:<br />

• Security<br />

• Tools and Software development kits<br />

• Ontologies<br />

B. Key characteristics for IoT<br />

Key characteristics involved in developing software<br />

architectures for IoT include:<br />

• Loosely coupled: IoT stacks exist for small<br />

microcontrollers, edge microprocessors, and cloud<br />

providers. It should be possible to use software<br />

stacks for these different nodes independently,<br />

from different internal and external vendors.<br />

• Modular: when developing a software<br />

architecture for the different levels of IoT, it<br />

should be possible to use different components<br />

(e.g. a security stack) from different vendors to put<br />

together a solution.<br />



• Platform independent: use the appropriate<br />

hardware abstraction layers (HAL) to separate the<br />

software stack from the underlying hardware<br />

device. A mature Software Development Kit<br />

(SDK) should support this for multiple device<br />

families.<br />

• Based on open standards: consider using an open<br />

approach instead of an internal approach when<br />

possible. For example, OpenThread should be<br />

considered for a Thread stack instead of a<br />

proprietary internal Thread stack unless there is a<br />

compelling advantage in performance,<br />

functionality, etc.<br />

• Defined APIs: although IoT standards are evolving, it is important to use standard APIs when available.<br />

III. ENSURE LEGO BLOCKS WORK WELL TOGETHER<br />

Once you have a scalable software architecture for IoT that supports device, edge, and cloud platforms, the next step is to ensure these blocks work well together. Figure 1 below is an example of a software development kit for low cost microcontrollers.<br />

Figure 1. SDK for low cost IoT microcontroller<br />

A. Open Source components<br />

Leverage open source components in your software IoT architecture for increased community support, interoperability, and scalability. Two examples of this are:<br />

• Zephyr (Figure 2): Zephyr is a small real-time operating system for connected, resource-constrained devices supporting multiple architectures and released under the Apache License 2.0.<br />

Figure 2. Zephyr IoT operating system<br />

• Mbed from ARM (Figure 3): The Arm® Mbed IoT Device Platform provides the operating system, cloud services, tools and developer ecosystem to make the creation and deployment of commercial, standards-based IoT solutions possible at scale.<br />

Figure 3. ARM mbed IoT platform<br />

B. Challenges with IoT integration<br />

There are many challenges in getting an IoT system to work effectively “out of the box”. These include:<br />

• The connectivity framework can be difficult to learn, internalize and use<br />
• Documentation is often insufficient<br />
• It can take weeks/months of effort to get an integration to work<br />
• IoT frameworks are architected to reflect a “connectivity centric” view, not an application view. Apps need to be twisted and turned to align with the framework (not simply installed on it).<br />
• IoT integrations can lead to bastardized implementations that end up as unique one-offs suited only to the specific app being built<br />
• End-to-end security framework for IoT configurations<br />

To overcome these challenges, interoperability and stress testing are required based on multiple industry and customer use cases.<br />

IV. BUILD YOUR LEGO BLOCK CASTLE<br />

To build your own “Lego castle” the first step is to select<br />

the platform of choice: device end node, gateway, or cloud.<br />

For example, Figure 4 shows an end-node microcontroller that<br />
supports IoT with connectivity, processing, and security. The<br />



appropriate HAL will abstract the device specifics from the<br />

software architecture.<br />
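One common way to realize such a HAL is an operations table of function pointers that the portable stack calls through, with each device family supplying its own implementation. The sketch below uses an invented radio interface and a stub implementation to show the pattern; the ops and names are not from a specific SDK.<br />

```c
#include <assert.h>
#include <stdint.h>

/* Sketch of a HAL boundary: the portable stack calls through an ops
 * table, and each device family supplies its own implementation.
 * The operations shown are illustrative, not from a specific SDK. */
typedef struct {
    int (*init)(void);
    int (*send)(const uint8_t *buf, int len);
    int (*recv)(uint8_t *buf, int max);
} radio_hal_ops_t;

/* A stub "device family" implementation standing in for real hardware. */
static int stub_init(void) { return 0; }
static int stub_send(const uint8_t *buf, int len) { (void)buf; return len; }
static int stub_recv(uint8_t *buf, int max) { (void)buf; (void)max; return 0; }

static const radio_hal_ops_t stub_radio = { stub_init, stub_send, stub_recv };

/* Portable upper-layer code: depends only on the ops table, never on
 * device-specific registers. */
int stack_send_beacon(const radio_hal_ops_t *hal)
{
    static const uint8_t beacon[] = { 0xB0, 0x01 };
    if (hal->init() != 0) return -1;
    return hal->send(beacon, (int)sizeof beacon);
}
```

Porting to a new device family then means writing one new ops table, while the stack above the HAL is reused unchanged.<br />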

Figure 4. Microcontroller device for IoT<br />

An IoT-based software development kit (SDK) would<br />

support an IoT device like this with a standard set of<br />

enablement software:<br />

• CMSIS-CORE compatible software drivers<br />

• Single driver for each peripheral<br />

• Transactional APIs w/ optional DMA support for<br />

communication peripherals<br />

Integrated RTOS:<br />

• RTOS-native driver wrappers<br />

Integrated Stacks and Middleware:<br />

• USB Host, Device and OTG<br />

• lwIP, FatFS<br />

• Crypto acceleration plus wolfSSL & mbedTLS<br />

• SD and eMMC card support<br />

• Multicore - eRPC<br />

Ecosystems:<br />

• Mbed<br />

Reference Software:<br />

• Peripheral driver usage examples<br />

• Application demos<br />

• FreeRTOS usage demos<br />

License:<br />

• BSD 3-clause for startup, drivers, USB stack<br />

Toolchains:<br />

• KDS, IAR®, ARM® Keil®, GCC w/ CMake<br />

• + MCUXpresso IDE<br />

Quality:<br />

• Production-grade software<br />

• MISRA 2004 compliance<br />

• Checked with Coverity® static analysis tools<br />

Once you have a scalable device SDK, it makes it easier to<br />

integrate a device into a cloud SDK as shown in Figure 5.<br />

Figure 5. Integrating a device SDK with a Cloud SDK<br />

ACKNOWLEDGMENTS<br />

I would like to thank Jason Martin and Constantin Enascuta<br />

for contributing material used in this paper.<br />

REFERENCES<br />

[1] Srivaths Ravi, Anand Raghunathan, Paul Kocher, Sunil Hattangady,<br />
"Security in embedded systems: Design challenges", ACM Transactions on<br />
Embedded Computing Systems (TECS), vol. 3, no. 3, August 2004.<br />
[2] Ala Al-Fuqaha, Mohsen Guizani, Mehdi Mohammadi, Mohammed<br />
Aledhari, Moussa Ayyash, "Internet of Things: A Survey on Enabling<br />
Technologies, Protocols, and Applications", IEEE Communications Surveys &<br />
Tutorials, vol. 17, pp. 2347-2376, 2015, ISSN 1553-877X.<br />



Implementation of a Web Development Platform for<br />

Embedded System Designers<br />

Milan Raj<br />

IoT Software Technologies<br />

National Instruments<br />

Austin, Texas, United States<br />

milan.raj@ni.com<br />

Abstract—Distributed intelligent embedded devices rely on<br />

continuous feature improvements and on-demand user<br />

interactions to meet the evolving requirements of distributed<br />

systems. Technologies like HTML, CSS, and JavaScript are<br />

quickly becoming the de facto standard for developing cross-platform<br />
software. However, most embedded systems developers,<br />

despite proficiency in low-level programming, find web<br />

development unapproachable due to unfamiliar programming<br />

patterns, standards, frameworks and tools. The use of<br />

proprietary technologies or constrained platforms limits the<br />

ability of embedded developers to keep up with changing<br />

requirements in distributed applications. This paper explores the<br />

implementation of a next generation, open HMI development<br />

platform based on web standards and open source codebases.<br />

Keywords—HTML; JavaScript; CSS; WYSIWYG; asm.js; HTTP; HMI; Web<br />

I. INTRODUCTION<br />

Despite the introduction of many new user interface technologies and application development practices over the years, the web browser has maintained consistent availability across practically every device with rich graphical user interfaces. Active efforts by browser developers to improve interoperability [1] and focus on development of low-level primitives [2] have shifted the role of the browser from an interactive document scripting environment to a high-performance application virtual machine.<br />

We demonstrate how to leverage open standards and community driven developments to minimize risk of platform divergence and build an adaptable code base capable of adopting new standards as they stabilize. We present a WYSIWYG web-based HMI development platform for embedded system designers and discuss our strategies for web technology selection.<br />

II. SYSTEM ARCHITECTURE OVERVIEW<br />

A development environment hosting a modern embedded web browser is used to enable the WYSIWYG creation of HTML user interfaces. A user of the development environment can write logic for the corresponding HTML UI using a graphical dataflow programming language to read from and write values to HTML UI elements. The user is also capable of using the graphical programming language to perform HTTP network requests and perform parsing and analysis of plain text and JSON responses. The HTML UI and dataflow programming diagram are both contained in a source file format operating as a functional unit and referred to as a Web Virtual Instrument. Multiple Web Virtual Instrument files can reference each other and are compiled together to generate a complete web application consisting of an application-specific HTML file and a Virtual Instrument Assembly file containing a text representation of the transformed and merged dataflow programming diagrams. In addition to application-specific files, static resources are generated containing JavaScript, CSS, and other media assets that are consistent between each generated web application.<br />

Fig. 1. Compilation of multiple Web Virtual Instrument source files to a<br />
standalone web application.<br />

As an open platform, the development environment creates<br />

web applications capable of communicating with arbitrary<br />

HTTP-based web services. Included with the development<br />

environment is a data services platform to facilitate<br />

communication between generated web applications and<br />

embedded devices. The data services platform consists of<br />

HTTP endpoints queryable from generated web applications<br />

and allows for reading and writing named variables with latest<br />

values and subscribing to named queues of messages. An<br />

embedded device on a network can use HTTP or AMQP to<br />

communicate with the data services platform to publish the<br />

current device state or subscribe to message queues.<br />



A complete HMI solution may consist of a web application<br />

generated by the development environment that communicates<br />

with the data services platform for monitoring and control of<br />

network connected embedded systems. An operator can use<br />

commodity devices with modern web browsers to access the<br />

generated web applications. The focus of this paper will be on<br />

the architecture of the generated web applications which utilize<br />

open standards to implement modern user interfaces and<br />

implement client-side logic driven by a high-level graphical<br />

programming language.<br />

Fig. 2. Example configuration for a complete HMI solution.<br />

III. GENERATED WEB APPLICATION ARCHITECTURE<br />

A. Overview of Deployed Files<br />

Each built web application consists of an HTML file with<br />

configuration used for the user interface of the application and<br />

a Virtual Instrument Assembly text file containing a low-level<br />

text representation of the user-developed dataflow<br />

programming diagram. In addition, there is a static resources<br />

directory containing the JavaScript files for implementing the<br />

HTML UI controls, CSS for HTML UI control theming,<br />

JavaScript implementing a dataflow programming language<br />

runtime, and other resources used such as images and<br />

localization files.<br />

The primary entry point of the web application is the<br />

HTML file. The HTML file contains references to the<br />

resources previously described, an inline stylesheet<br />

corresponding to the HTML UI controls the user interactively<br />

placed in the WYSIWYG editor, and a series of custom HTML<br />

elements with attributes that represent the configured state of<br />

the HTML UI controls.<br />

B. Custom Element Utilization<br />

Custom Elements are a technology that allows user-defined<br />

custom HTML tags to be registered in a web application. These<br />

custom HTML tags enable new types of user interface<br />

elements to be added to a web page by inserting the newly<br />

defined tag in the HTML. The author of a Custom Element can<br />

observe when the element is inserted or removed from the<br />

HTML or when the element's HTML attributes are modified<br />

[3]. Custom Elements behave like other HTML Elements in the<br />

Document Object Model (DOM) of a web page. Like other<br />

HTML Elements natively supported by a web browser, each<br />

Custom Element has HTML attributes, JavaScript properties<br />

and methods, and the ability to fire and listen for events<br />

triggered in the DOM hierarchy.<br />

For web applications generated by the development<br />

environment, the entire state of each HTML UI control is<br />

captured in the respective HTML attributes of their Custom<br />

Element. The benefit of this approach over having HTML UI<br />

control configuration stored in generated JavaScript or other<br />

data formats is improved embeddability and styling<br />

customizability. The Custom Elements can be moved around,<br />

styled, and manipulated like any other HTML element and hold<br />

their configuration information. These properties also make<br />

Custom Element-based HTML UI controls highly reusable in<br />

other web applications.<br />

C. Modeling and Framework<br />

During the startup of a generated web application the<br />

lifecycle events of Custom Elements are monitored by a<br />

JavaScript framework that implements semantics like the<br />

Model-View-ViewModel (MVVM) pattern. After a Custom<br />

Element is connected to the DOM the framework triggers the<br />

creation of a Model object and ViewModel object that<br />

corresponds to the View represented by the Custom Element.<br />

In the MVVM framework, the Model provides an<br />

abstraction over the representation of the properties for an<br />
HTML UI control. The Model representation of a control’s<br />
properties provides a consistent interface for other agents such<br />

as the desktop environment or embedded dataflow runtime to<br />

send updates. When an update occurs to a Model, the<br />

corresponding ViewModel updates a Render object<br />

representing a mutation action for the corresponding HTML UI<br />

control. The Render objects are queued into the Render engine<br />

that is serviced on a requestAnimationFrame [4] callback for<br />

the web page. The net effect is that high frequency Model<br />

updates can be performed which are collated and serviced at<br />

optimal rendering times across all Custom Elements managed<br />

by the MVVM framework.<br />

Fig. 3. Flow of a property update through the MVVM framework to apply to<br />

a Custom Element.<br />
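The update path of Fig. 3 can be sketched roughly as follows. The class names are illustrative, not the framework's real API, and `requestAnimationFrame` is stubbed with a timer so the sketch runs outside a browser.

```javascript
// Sketch of coalesced rendering: each Model update overwrites any pending
// Render action for the same control, and the queue is flushed once per
// animation frame. requestAnimationFrame is stubbed for Node.
const requestAnimationFrame = (cb) => setTimeout(cb, 16);

class RenderEngine {
  constructor() {
    this.queue = new Map();    // at most one pending action per control
    this.scheduled = false;
  }
  enqueue(controlId, applyFn) {
    this.queue.set(controlId, applyFn);   // later updates replace earlier ones
    if (!this.scheduled) {
      this.scheduled = true;
      requestAnimationFrame(() => this.flush());
    }
  }
  flush() {
    this.scheduled = false;
    for (const apply of this.queue.values()) apply();
    this.queue.clear();
  }
}

// The Model holds property values; its "ViewModel" role is reduced here to
// turning each property change into a Render action for the element.
class Model {
  constructor(controlId, element, engine) {
    Object.assign(this, { controlId, element, engine, props: {} });
  }
  set(prop, value) {
    this.props[prop] = value;
    this.engine.enqueue(this.controlId, () => {
      Object.assign(this.element, this.props);   // mutate the "DOM" element
    });
  }
}
```

Because the queue keys by control, a burst of high-frequency `set` calls collapses into a single mutation applied at the next frame, which is the collation behavior described above.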

D. Application Update Service Management<br />

In addition to having Custom Elements to capture the state<br />

and initialize modeling for HTML UI controls, there are also<br />

nonvisible Custom Elements used to store the configuration for<br />

update service management. The update services are state<br />

machines tasked with transitioning the state of a web<br />

application from page load through completion. Different<br />

update services are implemented for the different environments<br />

and expected behaviors of the web application.<br />

Two of the most significant update services are the editor<br />

update service and the local update service:<br />

The editor update service is utilized when the web<br />

application is running inside the development environment and<br />

is expected to respond to WYSIWYG editing operations. As<br />

the user performs editing operations such as changing size,<br />

position, or configuration of a control, messages are sent<br />

asynchronously from the development environment to the<br />

embedded web browser hosting the web application. The editor<br />



update service receives the messages and applies the updates to<br />

the Models targeted by the development environment.<br />

The local update service is used when a user tests the<br />

execution of their web application in the development<br />

environment or when the web application is deployed and<br />

running in a standalone web browser. In this configuration the<br />

local update service has the responsibility of fetching the<br />

application-specific Virtual Instrument Assembly file, passing<br />

the Virtual Instrument Assembly file contents to the bundled<br />

dataflow runtime environment, and mediating control updates<br />

between the dataflow runtime environment and the Models.<br />

E. Virtual Instrument Runtime Engine Object (Vireo)<br />

Vireo is an open source dataflow runtime used in the web<br />

application to execute the instructions stored in Virtual<br />

Instrument Assembly files. The runtime is a compact C++<br />

project capable of managing memory and scheduling execution<br />

of the low-level dataflow programs created by the user in the<br />

development environment. The runtime has been designed for<br />

execution in resource constrained embedded systems giving it a<br />

small size and memory footprint suitable for web applications.<br />

To make Vireo executable in the browser environment we<br />

leveraged the open source Emscripten toolchain [5] to compile<br />

C++ source to a subset of the JavaScript language known as<br />

asm.js [6]. The asm.js subset of JavaScript makes an efficient<br />

target for compilers by restricting code to use primarily math<br />

operations and to perform those operations on one large shared<br />

JavaScript ArrayBuffer. Benchmarking has shown that C++<br />

projects compiled to asm.js and running on modern JavaScript<br />

browser runtimes execute within two-thirds the speed, or<br />

better, of the same C++ projects compiled using Clang or GCC<br />

and executing natively [7].<br />
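A hand-written toy in the asm.js style gives the flavor of the subset: explicit integer (`|0`) and double (`+`) coercions, and reads from one shared heap. Real asm.js is emitted by Emscripten rather than written by hand, and this sketch may not pass strict asm.js validation, but it still executes as ordinary JavaScript.

```javascript
// Toy module in the asm.js style. The module takes the standard library,
// a foreign-function object (unused here), and one shared ArrayBuffer
// heap, and returns its exported functions.
function SumModule(stdlib, foreign, heap) {
  "use asm";
  var f64 = new stdlib.Float64Array(heap);
  function sum(n) {
    n = n | 0;                       // coerce the argument to int32
    var total = 0.0;
    var i = 0;
    for (i = 0; (i | 0) < (n | 0); i = (i + 1) | 0) {
      total = total + +f64[i << 3 >> 3];   // double read from the heap
    }
    return +total;
  }
  return { sum: sum };
}
```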

IV. PERFORMANCE CHARACTERISTICS OF WEB-BASED CONTROLS<br />

A common concern based on historical JavaScript<br />

execution behavior and conflated with poor usage patterns of<br />

the browser DOM API is that HTML UI controls may be<br />

unable to maintain fast and responsive user interfaces. We<br />
have observed that we can achieve desirable<br />
performance characteristics when we approach development of<br />

HTML UI elements with the same rigor we would use in other<br />

user interface environments.<br />

The best demonstration of performance characteristics for<br />

HTML UI controls comes from the graphing and charting<br />

Custom Element implementations. These Custom Elements<br />

were implemented by leveraging the existing open source Flot<br />

charting library [8] and creating an open source fork, known as<br />

engineering-flot, with features and optimizations well-suited<br />

for engineering and scientific applications [9].<br />

In the Custom Elements built using the engineering-flot<br />

codebase we utilize the HTML5 canvas element for drawing,<br />

prevent unnecessary copies of buffers, and implement data<br />

decimation algorithms to avoid unnecessary drawing<br />

operations. Benchmarking of the graph Custom Elements on<br />

modern desktop web browsers has shown the capability of<br />

rendering over 500,000 data points per frame at sixty frames<br />

per second [10].<br />
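Min/max decimation of this general kind reduces each bucket of raw samples to its extremes, so narrow peaks survive while the number of drawn points stays bounded. This is an assumed shape of the technique; the engineering-flot implementation will differ in detail.

```javascript
// Illustrative min/max decimation: split the samples into `buckets`
// buckets and keep only each bucket's minimum and maximum, emitted in
// time order so the drawn polyline does not zig-zag backwards.
function decimateMinMax(samples, buckets) {
  if (samples.length <= 2 * buckets) return samples.slice();
  const out = [];
  const size = samples.length / buckets;
  for (let b = 0; b < buckets; b++) {
    const start = Math.floor(b * size);
    const end = Math.min(samples.length, Math.floor((b + 1) * size));
    let min = Infinity, max = -Infinity, minIdx = start, maxIdx = start;
    for (let i = start; i < end; i++) {
      if (samples[i] < min) { min = samples[i]; minIdx = i; }
      if (samples[i] > max) { max = samples[i]; maxIdx = i; }
    }
    if (minIdx <= maxIdx) out.push(min, max);
    else out.push(max, min);
  }
  return out;
}
```

Rendering then draws at most 2 × buckets points per frame regardless of how many raw samples arrived, which is what keeps high sample rates affordable on a canvas.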

V. CONSIDERATIONS FOR ADOPTION OF NEW WEB<br />

TECHNOLOGIES<br />

Selecting features of the modern web platform to adopt in a<br />

new project has additional considerations compared to<br />

traditional desktop application development or even traditional<br />

web development. Historically, web browser versioning was<br />
highly coupled to the operating system platform and operating<br />
system version for which the browsers were designed. This<br />
coupling made it common to use the browser version of the target<br />

audience as the primary consideration for choosing which<br />

features to adopt during web application development. Many<br />

modern browsers release frequently and can have releases<br />

performed independently from their hosted operating system.<br />

These continuously updated browsers are referred to as<br />

"evergreen" browsers and can result in benefits such as<br />

improved interoperability with other browsers [11].<br />

With browsers updating frequently, decoupled from the<br />

underlying operating system, and containing increasingly<br />

interoperable sets of shared features, it becomes possible to<br />

change the web platform feature selection process for new<br />

application development. Instead of choosing a browser and<br />

opting into all the web platform features that browser<br />

implements, it is possible to choose a feature and see if it is<br />

implemented in all the browsers you choose to support.<br />

If there is a subset of browsers that do not support a feature,<br />

it may be possible to utilize a polyfill for the feature. A polyfill<br />

is code that attempts to implement a feature that might be<br />

missing from a browser where existing browser features can be<br />

used to replicate or closely approximate the missing feature<br />

[12]. From our experience, a well-designed and low-risk<br />

polyfill may have some, or all, of the following characteristics:<br />

• Compact in code size in a shipping web application<br />

• Comparable performance to the feature as natively<br />

implemented in a browser<br />

• Closely implements the native browser feature with few<br />

polyfill-specific exceptions<br />

• Makes well-understood changes to the browser global<br />

environment<br />

• Represents a specification that browsers follow or have<br />

committed to follow in the future<br />

• Delegates execution of the feature to the native<br />

implementation if available<br />

• Removable with little to no changes in source code as<br />

browsers enable the feature natively<br />
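The first and last of these characteristics, delegating to the native implementation and being removable, follow from the standard install pattern, sketched here with `Array.prototype.includes` as a stand-in feature. This is a simplified sketch, not a spec-complete polyfill.

```javascript
// Delegate-if-available polyfill pattern: install only when the native
// feature is missing, so the polyfill becomes a no-op (and can later be
// deleted) once browsers ship the feature.
function installIncludes(proto) {
  if (typeof proto.includes === 'function') return;   // native wins
  Object.defineProperty(proto, 'includes', {
    configurable: true, writable: true, enumerable: false,
    value: function includes(searchElement, fromIndex) {
      const len = this.length >>> 0;
      for (let i = Math.max(fromIndex | 0, 0); i < len; i++) {
        const x = this[i];
        // SameValueZero comparison, so NaN matches NaN
        if (x === searchElement || (x !== x && searchElement !== searchElement)) {
          return true;
        }
      }
      return false;
    },
  });
}

installIncludes(Array.prototype);   // no-op wherever the runtime ships it
```

Because the install is a no-op wherever the native feature exists, removing the polyfill later requires no changes to the code that calls the feature.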

As opposed to consuming libraries that try to abstract over<br />

differences between browsers by providing a nonstandard API<br />

on top of those differences, polyfills attempt to bring up the<br />

baseline usable set of features across browsers. While custom<br />

libraries lead to an increasingly siloed ecosystem of libraries<br />

interdependent on nonstandard APIs, utilizing polyfills that are<br />

backed by open standards leads to the polyfills becoming<br />

removable as they are backed by native browser<br />

implementations over time.<br />



VI. SUMMARY<br />

We have discussed an architecture for enabling<br />

development of web applications supporting WYSIWYG<br />

manipulation using an MVVM style framework to schedule<br />

HTML UI control updates in a performant manner. We<br />
described how Custom Elements are used to hold UI<br />

configuration and form the basis of reusable HTML UI<br />

controls that are highly customizable by users in a deployed<br />

application. In addition, nonvisible Custom Elements also<br />

maintain configuration in HTML attributes creating a<br />

consistent interface for all configuration of the web application.<br />

We demonstrated the ability to use open source tooling to<br />

leverage existing C++ code through compilation to the asm.js<br />

JavaScript subset and described results of highly performant<br />

and responsive HTML UI graph controls. Finally, we presented<br />

an approach for selecting web platform features for use in new<br />

web application development by selecting features backed by<br />

open standards with native browser implementations or with<br />

implementations that can be backed by well-designed polyfills.<br />

REFERENCES<br />

[1] About, Mozilla, webcompat.com/about.<br />

[2] The Extensible Web Manifesto, Extensibleweb,<br />

extensiblewebmanifesto.org/.<br />

[3] Denicola, Domenic. “Custom Elements”. W3C, 13 Oct. 2016,<br />

w3.org/TR/custom-elements/#custom-element-reactions.<br />

[4] “Window.requestAnimationFrame().” Mozilla Developer Network,<br />

Mozilla, 28 Nov. 2017, developer.mozilla.org/en-<br />

US/docs/Web/API/window/requestAnimationFrame.<br />

[5] “Emscripten.” GitHub, github.com/kripken/emscripten.<br />

[6] Herman, David, Alon Zakai, and Luke Wagner. Asm.js, Mozilla, 18<br />

Aug. 2014, asmjs.org/spec/latest/.<br />

[7] Zakai, Alon, and Robert Nyman. “Gap between asm.Js and native<br />

performance gets even narrower with float32 optimizations – Mozilla<br />

Hacks - the Web developer blog.” Mozilla Hacks – the Web developer<br />

blog, Mozilla, 20 Dec. 2013, hacks.mozilla.org/2013/12/gap-between-<br />

asm-js-and-native-performance-gets-even-narrower-with-float32-<br />

optimizations/.<br />

[8] “Flot.” GitHub, github.com/flot/flot.<br />

[9] “Engineering-flot.” GitHub, github.com/ni-kismet/engineering-flot.<br />

[10] “Creating Web Enabled HMIs with LabVIEW NXG.” Performance by<br />

Mark Black, Eli Kerry, and Omid Sojoodi, Creating Web Enabled HMIs<br />

with LabVIEW NXG, National Instruments, 23 May 2017,<br />

youtube.com/watch?v=N4XCNfGapc4&t=1m17s.<br />

[11] Beeman, Hadley. “The evergreen Web.” W3C, 9 Feb. 2017,<br />

w3.org/2001/tag/doc/evergreen-web/.<br />

[12] Lawson, Bruce, and Remy Sharp. Introducing HTML5. New Riders,<br />

2012, pp. 276-277.<br />



Make your industrial device smart using a SaaS IoT<br />

platform<br />

Stefan Vaillant<br />

CTO<br />

Cumulocity GmbH<br />

Dusseldorf, Germany<br />

cumulocity@piabo.net<br />


I. ABSTRACT<br />

Today, and increasingly in the future, machines are being<br />
transformed into smart machines. Pumps, compressors, bikes,<br />
transformers, industrial vehicles, and more need to get smart.<br />
Smart machines provide remote access, preventive and<br />
predictive maintenance, pay-per-use, and other services. The<br />
fastest and lowest-risk approach to making machines "smart"<br />
is to connect them to a Software-as-a-Service (SaaS) IoT<br />
platform. This presentation presents the overall approach and<br />
its advantages, along with many industrial examples from<br />
real-world customers.<br />



Which IoT Protocol Should I Use for My System?<br />

Christian Légaré<br />

Silicon Labs Inc.<br />

Montréal, Québec, Canada<br />

Abstract—Embedded systems using sensors and connectivity<br />

are not new to embedded developers. However, using these<br />

elements with multiple additional internet technologies is.<br />

Internet protocols (IPs) are not new, but dedicated IPs for the<br />

IoT are, and they are used to help shape system capabilities.<br />

There are multiple IP application layer protocols that are<br />

above the TCP/IP sockets. Each one has its advantages and<br />

constraints. Knowing them helps developers make the best<br />

design choices for a product. Bandwidth requirements, real-time<br />
performance, and memory footprint are some of the main<br />

criteria to use in selecting an IoT protocol. Many IoT projects<br />

are being driven by CIOs and IT departments, which are<br />

pushing developers to use the technologies and protocols they<br />

know in IoT devices. However, IoT devices are often closer to<br />

operational technologies (OTs), so, pushing IT technologies<br />

into the OT domain is often not an optimal choice.<br />

I. INTRODUCTION<br />

Developers need to be educated that there are better choices<br />

for IoT devices than IT technologies.<br />

There are multiple categories of IP:<br />

• Consumer vs. industrial<br />

• Web services<br />

• IoT services<br />

• Publish/Subscribe<br />

• Request/Response<br />

All these factors must be considered when designing a new<br />
system. Let’s look at IPs for the IoT and define the selection<br />
criteria.<br />

II. THE INTERNET<br />

The internet is the sum of all network equipment used to route<br />
IP packets from a source to a destination. The world wide<br />
web, by comparison, is an application system that runs on the<br />
internet. The web is a tool built for people to exchange<br />
information, and over the years the web has been developed<br />
and refined so that ordinary, nontechnical people can use the<br />
internet easily and productively. For example, the human<br />
interface for the internet now includes email, search engines,<br />
browsers, mobile apps, Facebook and Twitter, among other<br />
popular social media.<br />

By comparison, in the IoT, the idea is for electronic devices to<br />
exchange information over the internet. But these devices<br />
don’t yet have the machine equivalent of browsers and social<br />
media to facilitate communication. The IoT is also different<br />
from the web because of the speeds, scales, and capabilities<br />
that IoT devices require in order to work together. These<br />
requirements are far beyond what people need or use. We are<br />
at the beginning of the development of these new tools and<br />
services, and this is one of the reasons why a definition for IoT<br />
is difficult to lock down. Many visions about what it can, or<br />
could be, collide.<br />

III. TCP/IP PROTOCOL STACK<br />

The TCP/IP protocol stack is at the heart of the internet and<br />
the web. It can be represented using the OSI seven-layer<br />
reference model, as illustrated below (Figure 1). The top three<br />
layers are grouped together, which simplifies the model.<br />

Figure 1. OSI Seven-layer reference model.<br />



Figure 1. TCP/IP Stack Reference Model<br />

The following is a quick description of the important layers<br />

from the perspective of embedded system integration:<br />

1. Physical and Data Link Layers<br />

The most common physical layer protocols used by<br />

embedded systems are:<br />

• Ethernet (10, 100, 1G)<br />

• Wi-Fi (802.11b, g, n)<br />

• Serial with PPP (point-to-point protocol)<br />

• GSM, 3G, LTE, 4G<br />

2. Network Layer<br />

This is where the internet lives. The internet—short for<br />

inter-network—is named so because it provides<br />

connections between networks, between the physical<br />

layers. This is where we find the ubiquitous IP address.<br />

3. Transport Layer<br />

Above IP, we have TCP and UDP, the two transport<br />

protocols. Because TCP is used for most of our human<br />

interactions with the web (email, web browsing, etc.), it is<br />

widely believed that TCP should be the only protocol<br />

used at the transport layer. TCP provides the notion of a<br />

logical connection, acknowledgment of packets<br />

transmitted, retransmission of packets lost and flow<br />

control—all of which are great things. But for an<br />

embedded system, TCP can be overkill. Therefore, UDP,<br />

even if it has long been relegated to network services such<br />

as DNS and DHCP, is now finding its place in the<br />

domains of sensor acquisition and remote control. If you<br />

need some type of management of your data, you can<br />

even write your own lightweight protocol on top of UDP<br />

to avoid the overhead imposed by TCP.<br />

UDP is also better suited than TCP for real-time data<br />

applications such as voice and video. The reason is that<br />

TCP’s packet acknowledgment and retransmission<br />

features are useless overhead for those applications. If a<br />

piece of data (such as a bit of spoken audio) does not<br />

arrive at its destination in time, there is no point in<br />

retransmitting the packet, as it would arrive out of<br />

sequence and would garble the message.<br />

TCP is sometimes preferred to UDP, because it provides a<br />

persistent connection. So, to do the same thing with UDP,<br />

you must implement this feature yourself in a protocol<br />

layer above UDP.<br />
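One hypothetical shape for such a lightweight layer over UDP is stop-and-wait: a sequence number per datagram, an acknowledgment from the receiver, and sender-side retransmission. All names here are illustrative, and the lossy channel is simulated in-process so the sketch needs no real sockets.

```javascript
// Receiver side: deliver each datagram exactly once, in order, and ACK
// the highest in-order sequence number seen so far. Duplicates are
// re-ACKed but not re-delivered.
class Receiver {
  constructor() { this.expected = 0; this.delivered = []; }
  onDatagram({ seq, payload }) {
    if (seq === this.expected) {
      this.delivered.push(payload);
      this.expected++;
    }
    return { ack: this.expected - 1 };
  }
}

// Sender side: retransmit each datagram until it is acknowledged, up to
// maxRetries attempts. sendOnce models one unreliable UDP send and may
// return null to simulate a lost datagram or lost ACK.
function sendReliably(payloads, receiver, sendOnce, maxRetries = 5) {
  payloads.forEach((payload, seq) => {
    for (let attempt = 0; attempt <= maxRetries; attempt++) {
      const reply = sendOnce({ seq, payload }, receiver);
      if (reply && reply.ack === seq) return;   // acknowledged, next datagram
    }
    throw new Error(`datagram ${seq} lost after ${maxRetries} retries`);
  });
}
```

A real deployment would put timers behind the retransmission loop and carry the datagrams over `dgram` sockets, but the protocol logic above is the whole of what TCP's overhead is being traded away for.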

When you are deciding how to move data from the<br />

“thing’s” local network onto an IP network, you have<br />

several choices. Because the technologies used are<br />

familiar and available from a wide range of sources, you<br />

can link the two networks via a gateway, or you can build<br />

this functionality into the “thing” itself. Many MCUs now<br />

have an Ethernet controller on chip, which makes this an<br />

easier task.<br />

IV. IOT PROTOCOLS<br />

It is possible to build an IoT system with existing web<br />

technologies, even if it is not as efficient as the newer<br />

protocols. HTTP(S) and WebSockets are common standards,<br />

together with XML or JavaScript Object Notation (JSON) in<br />

the payload. When using a standard web browser (HTTP<br />

client), JSON provides an abstraction layer for web developers<br />

to create a stateful web application with a persistent duplex<br />

connection to a web server (HTTP server) by holding two<br />

HTTP connections open.<br />

HTTP<br />

HTTP is the foundation of the client-server model used for the<br />

web. The safest method with which to implement HTTP in<br />

your IoT device is to include only a client, not a server. In<br />

other words, it is safer when the IoT device can initiate<br />

connections to a web server but is not able to receive<br />

connection requests: We don’t want to allow outside machines<br />

to have access to the local network where the IoT devices are<br />

installed.<br />

WebSocket<br />

WebSocket is a protocol that provides full-duplex<br />

communication over a single TCP connection over which<br />

messages can be sent between client and server. It is part of<br />

the HTML 5 specification. The WebSocket standard simplifies<br />

much of the complexity around bidirectional web<br />

communication and connection management.<br />

XMPP<br />

Extensible messaging and presence protocol (XMPP) is a<br />

good example of an existing web technology finding new use<br />

in the IoT space.<br />

XMPP has its roots in instant messaging and presence<br />

information, and has expanded into voice and video calls,<br />

collaboration, lightweight middleware, content syndication,<br />

and generalized routing of XML data. It is a contender for<br />

mass scale management of consumer white goods such as<br />

washers, dryers, refrigerators and so on.<br />

XMPP strengths are its addressing, security and scalability.<br />

This makes it ideal for consumer-oriented IoT applications.<br />

HTTP, WebSocket and XMPP are examples of technologies<br />

being pressed into service for IoT. Other groups are also<br />



working furiously to develop solutions for the new challenges<br />

IoT is presenting us.<br />

Wannabe Generic Protocols<br />

Many IoT experts refer to IoT devices as constrained systems,<br />

because they believe IoT devices should be as inexpensive as<br />

possible and use the smallest MCUs available, while still<br />

running a communication stack.<br />

Table 1. Constrained systems standardization work<br />

Currently, adapting the internet for the IoT is one of the main<br />
priorities for many of the global standardization bodies. Table<br />
1 contains a short summary of the current activities.<br />

If your system does not require the features of TCP, and can<br />
function with the more limited UDP capabilities, removing the<br />
TCP module significantly helps reduce the size of the total<br />
code footprint of your product. This is what 6LoWPAN (for<br />
WSN) and CoAP (light internet protocol) bring to the IoT<br />
universe.<br />

CoAP<br />

Although the web infrastructure is available and usable for IoT<br />
devices, it is too heavy for most IoT applications. In July<br />
2013, IETF released the constrained application protocol<br />
(CoAP) for use with low-power and lossy (constrained) nodes<br />
and networks (LLNs). CoAP, like HTTP, is a RESTful<br />
protocol.<br />

It is semantically aligned with HTTP, and even has a one-to-one<br />
mapping to and from HTTP. Network devices are<br />
constrained by smaller microcontrollers with small quantities<br />
of flash memory and RAM, while the constraints on local<br />
networks such as 6LoWPAN are due to high packet error rates<br />
and a low throughput (tens of kilobits per second). CoAP can<br />
be a good protocol for devices operating on battery or energy<br />
harvesting.<br />

Features of CoAP:<br />

• Because CoAP uses UDP, some of the TCP<br />
functionalities are replicated directly in CoAP. For example,<br />
CoAP distinguishes between confirmable (requiring an<br />
acknowledgment) and nonconfirmable messages.<br />

• Requests and responses are exchanged<br />
asynchronously over CoAP messages (unlike HTTP, where an<br />
existing TCP connection is used).<br />

• All the headers, methods and status codes are binary<br />
encoded, which reduces the protocol overhead. However, this<br />
requires the use of a protocol analyzer to troubleshoot network<br />
issues.<br />

• Unlike HTTP, the ability to cache CoAP responses<br />
does not depend on the request method, but on the response<br />
code.<br />

CoAP fully addresses the need for an extremely light protocol<br />
exhibiting a behavior similar to a permanent connection. It has<br />
semantic familiarity with HTTP and is RESTful (resources,<br />
resource identifiers and manipulation of those resources via a<br />
uniform application programming interface (API)). If you<br />
have a web background, using CoAP is relatively easy.<br />
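The binary header encoding mentioned above can be illustrated with the fixed 4-byte CoAP header from RFC 7252. Options, token bytes, and payload are omitted; this is a sketch, not a full CoAP implementation.

```javascript
// Encoder for the fixed 4-byte CoAP header (RFC 7252): version and type
// share the first byte with the token length, the second byte is the
// request/response code, and the last two bytes are the message ID.
// Nothing here is human-readable text, which is why a protocol analyzer
// is needed for troubleshooting.
const COAP_VERSION = 1;
const TYPE = { CON: 0, NON: 1, ACK: 2, RST: 3 };   // confirmable vs. nonconfirmable etc.

function encodeCoapHeader({ type, tokenLength = 0, codeClass, codeDetail, messageId }) {
  const buf = new Uint8Array(4);
  buf[0] = (COAP_VERSION << 6) | (type << 4) | (tokenLength & 0x0f);
  buf[1] = (codeClass << 5) | (codeDetail & 0x1f);   // e.g. 0.01 = GET
  buf[2] = (messageId >> 8) & 0xff;                  // message ID, big-endian
  buf[3] = messageId & 0xff;
  return buf;
}
```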

MQTT<br />

MQ telemetry transport (MQTT) is an open source protocol<br />

that was developed and optimized for constrained devices and<br />

low-bandwidth, high-latency or unreliable networks. It is a<br />

publish/subscribe messaging transport that is extremely<br />

lightweight and ideal for connecting small devices to networks<br />

with minimal bandwidth. MQTT is bandwidth efficient, data<br />

agnostic and has continuous session awareness, as it uses TCP.<br />

It is intended to minimize device resource requirements while<br />

also attempting to ensure reliability and some degree of<br />

assurance of delivery with grades of service.<br />

MQTT targets large networks of small devices that need to be<br />

monitored or controlled from a back-end server on the<br />

internet. It is not designed for device-to-device transfer.<br />

Neither is it designed to “multicast” data to many receivers.<br />

MQTT is simple, offering few control options. Applications<br />

using MQTT are generally slow, in the sense that the<br />

definition of “real time” in this case is typically measured in<br />

seconds.<br />
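MQTT's publish/subscribe model rests on topic filters, where `+` matches exactly one topic level and `#` matches all remaining levels. The matcher below is a simplified sketch; a real broker per MQTT 3.1.1 adds rules for `$`-prefixed topics and filter validation.

```javascript
// Simplified MQTT topic-filter matching: topics are '/'-separated levels,
// '+' matches any single level, '#' matches the rest of the topic.
function topicMatches(filter, topic) {
  const f = filter.split('/');
  const t = topic.split('/');
  for (let i = 0; i < f.length; i++) {
    if (f[i] === '#') return true;                 // multi-level wildcard
    if (i >= t.length) return false;               // topic ran out of levels
    if (f[i] !== '+' && f[i] !== t[i]) return false;
  }
  return f.length === t.length;                    // no trailing topic levels
}
```

A broker evaluates every subscription's filter against each published topic this way, which is what lets one sensor publication fan out to many monitoring back-ends without the sensor knowing about any of them.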

MQTT vs. CoAP<br />
MQTT publish/subscribe scales well, and MQTT has<br />
demonstrated the advantages of this architecture. The latest<br />
IETF CoAP RFCs have introduced publish/subscribe support<br />
for CoAP.<br />
The lightweight CoAP payload is well-suited for wireless sensor<br />
networks, and MQTT-SN has taken that idea and reproduced it.<br />

So, the two main IoT dedicated protocols are borrowing ideas<br />

from each other. Will these two protocols remain mainstream?<br />

We believe so, for at least five to 10 years.<br />

V. COMPARISON OF POTENTIAL IOT PROTOCOLS<br />

Cisco is at the heart of the internet; its IP equipment is<br />

everywhere. Cisco is now actively participating in the<br />

evolution of IoT. It sees the potential for connecting physical<br />

objects, getting data from our environment and processing this<br />

data to improve our living standards.<br />

Table 2 is drawn from Cisco’s work in IoT standards.<br />

Figure 2. Comparison of web and IoT protocols<br />

Table 2. Beyond MQTT: A Cisco View on IoT Protocols by Paul<br />

Duffy, April 30, 2013<br />

These internet-specific IoT protocols have been developed to<br />

meet the requirements of devices with small amounts of<br />

memory, and networks with low bandwidth and high latency.<br />

Figure 2 provides another good summary of the performance<br />

benefit that these protocols bring to IoT. The source is Zach<br />

Shelby in his presentation “Standards Drive the Internet of<br />

Things.”<br />

VI. CONCLUSION<br />

Connecting sensors and objects opens up an entirely new<br />

world of possible use cases—and it’s precisely those use cases<br />

that will determine when to use the right protocols for the right<br />

applications.<br />

The high-level positioning for each of these protocols is<br />

similar. Apart from HTTP, all these protocols are positioned<br />

as real-time publish/subscribe IoT protocols with support for<br />

millions of devices. Depending on how you define “real time”<br />

(seconds, milliseconds or microseconds) and “things” (WSN<br />

node, multimedia device, personal wearable device, medical<br />

scanner, engine control, etc.) the protocol selection for your<br />

product is critical. Fundamentally, these protocols are very<br />

different.<br />

Today, the web runs on hundreds of protocols. The IoT will<br />

support hundreds more. What you need to do when designing<br />

your system is to define the system requirements very<br />

precisely, and chose the right protocol set to address these<br />

requirements.<br />

The internet protocol is a carrier; it can encapsulate just as<br />

many protocols for the IoT as it does today for the web. Many<br />

industry pundits are asking for protocol standardization. But if<br />

there are so many protocols for the web, why wouldn’t there<br />

be just as many for the IoT? You choose the protocols that<br />

meet your requirements. The only difference is that the IoT<br />

protocols are still young and must demonstrate their reliability.<br />

Remember that when the internet became a reality, IP version<br />

4 was what made it possible. We are now massively deploying<br />

IP version 6, and IoT is the killer application that<br />

telecommunication carriers have been waiting for to justify the<br />

investment required.<br />



Predictive maintenance using a fully compound<br />
material-integrated measuring system<br />

Sven Grunwald, Andy Batzdorf, Steffen Kutter, Bernard Bäker<br />

Chair in Automotive Mechatronics, Dresden University of Technology<br />

George-Baehr-Straße 1C, Germany<br />

Sven.Grunwald@tu-dresden.de, Andy.Batzdorf@tu-dresden.de, Steffen.Kutter@tu-dresden.de, Bernard.Baeker@tu-dresden.de<br />

Abstract— This paper presents the integration of a<br />
measurement system which enables Internet of Things (IoT)<br />
driven predictive maintenance in an industrial environment<br />
without the need for externally mounted sensors. The main<br />
methodology for predictive wear detection is illustrated using the<br />
example of an electromagnetically operated spring-pressure brake.<br />
First real-world application results are shown within this paper.<br />
Furthermore, an outlook on the resulting possibilities of a future<br />
industrial IoT application with a focus on low cost is provided.<br />

For first practical application, a braking rotor with a<br />

corresponding measuring system has been designed. The<br />

measurement system consists of a low-power ARM-based<br />
processor with an integrated wireless interface and additional<br />
MEMS-based sensors. This system is integrated in the friction material<br />

itself by using the hot-pressing method. With the help of this<br />

method an encapsulated sensor and measurement system can be<br />

fabricated turning the conventional brake disc into a rotating<br />

Industrial IoT (IIoT) sensor node which is powered wirelessly over<br />

a resonant inductive link.<br />

Keywords—Industrial IoT, Encapsulated sensor platform,<br />
Rotating sensor node, Electromagnetic safety brake,<br />
Fiber-reinforced material<br />

I. MOTIVATION AND BACKGROUND<br />

The digital network capability of industrial machines enables<br />
the monitoring of the entire plant condition and gives the<br />
possibility to combine this information with additional sensors<br />
installed in the machine. In the predicted expansion stage, such<br />
as Factory 4.0, these machines can communicate via networks<br />
over the Internet. The IoT is also moving into the industrial<br />
sector, with networks that can collect data to monitor and<br />
control production lines, inventories and energy consumption<br />
to ensure sustainable and reliable production [1]. As part of the<br />
Industrial IoT (IIoT), these models are based on nodes such as<br />
thermostats or optical sensors at the edges of the network,<br />
which receive and send data that a central system can analyze<br />
and respond to, for example, to optimize a manufacturing<br />
process.<br />

Today, sensors are typically mounted on the outside of<br />
equipment where access to the device is easier. But another<br />
conceivable approach would be to embed the sensors in the<br />
actual moving parts of the machine, usable during manufacture<br />
or maintenance. The data collected by an embedded sensor<br />
would provide much more detailed information about what is<br />
happening in the machine, in real time, which would<br />
dramatically improve process control and machine<br />
maintenance. With today’s compact, energy-saving and<br />
inexpensive wireless sensors, it is already possible to integrate<br />
such a device into the fabric of a machine and to expect it to<br />
transmit reliable information over months or even years. The<br />
feasibility was investigated in a research project at the<br />
Technische Universität Dresden (TUD).<br />

Fig. 1 Smart rotor concept suitable for industrial electromagnetic safety brakes<br />

The focus was to encapsulate the sensors and the<br />

microprocessor into the brake disc as part of the<br />

electromagnetic safety brake. Those elements are critical for<br />

safety relevant applications in industrial equipment. One<br />

example is in elevators, where the discs are used to control the<br />

speed of the ascending car, protect against unintended<br />

movement, and maintain the car’s position when it stops at each<br />

floor.<br />

Measuring specific system parameters, such as vibration or<br />

wear, traditionally requires the addition of expensive hardware,<br />

such as torque shafts. The usage of embedded wireless sensors<br />

can overcome these disadvantages. The disc is manufactured<br />

out of phenolic-resin-based composite material which is far<br />

superior to conventional, metallic materials in terms of load<br />

www.embedded-world.eu<br />

160


resistance and wear [2]. The following picture points out the basic setup of the braking system within an industrial environment.<br />

Fig. 2 Conventional brake disc within an electromagnetic safety brake (figure labels: Self-Aligning Coupling, Anchor plate, Conventional Rotor, Electromagnetical Safety Brake)<br />

II. PRELIMINARY RESEARCH<br />

The core idea in this project was to integrate the components necessary for monitoring an industrial device into the actual industrial component itself. Before the integration process is started with a highly integrated system equipped with these sensors, however, it is important to point out that no test data is available to verify the components with respect to the hot-pressure method, and to validate the sensor data as well as the entire system behavior for defects after the integration step. Generally speaking, a System on Chip (SoC) is a<br />

complex device consisting of analog and digital circuit<br />

elements that interact on a single silicon chip. A complete<br />

structural test of digital components within the integrated<br />

circuit (IC) is not possible at this point, because on the one hand<br />

a verified Verilog or VHDL netlist of the entire IC is not<br />

available to the customer and on the other hand the accessibility<br />

of the IC after integration into the material is not given.<br />

Fig. 4 Technology carrier Smart-rotor<br />

The sensors include devices for measuring acceleration in all<br />

three axes, three-axis gyroscopes for measuring the angle of<br />

rotation, magnetic field sensors and classic temperature sensors.<br />

These sensors are well suited for the task, as they are cost-effective, small and highly integrated in a compact housing, and<br />

thus provide an ideal choice for integration and measurement<br />

directly from the material.<br />

The data can be recorded from inside the raw braking material<br />

and compressed after the actual measurement and sent to a host<br />

system, e. g. a wireless router, for further processing within the<br />

network. The system properties are highly customizable by the user; e.g. changing the sensor fusion or the filter implementation is still possible even after the integration process.<br />

(Figure labels: Electric engine, Wireless Power Transmitter, Self-Aligning Coupling, Smart-Rotor, Anchor plate, Electromechanical Braking System)<br />

Fig. 3 Oscillation of the analog test structure to validate integration<br />

In order to make an initial assessment of the extent to which it<br />

is possible to integrate an electronic system into fiber composite<br />

material without damaging the electronics, simple and<br />

manageable digital and analog elements (figure 3) were used<br />

for experiments [3], [4]. By using these elements, parasitic<br />

effects caused by the fiber-reinforced material itself can be<br />

taken into account. Using the described method of error-model-driven test structures, the usable component sizes of the passive and active elements were identified. These results are further<br />

documented in an earlier work [6], [7].<br />

Fig. 5 Electric drive equipped with the Smart-rotor<br />

This customizability is achieved by the implemented over-the-air (OTA) update capability. Because the SoC supports Bluetooth 4.2,<br />

it is possible to equip the embedded system with an IP address<br />

via which a remote connection to the embedded system can be<br />

established.<br />

This enables access to the Smart Rotor (figures 4, 5) via the Internet, which allows remote diagnosis of the brake system without the need for a technician to check the system and stop<br />

the machine. Knowing the optimum component sizes and<br />

materials that can withstand the hot-pressure process, it was<br />

possible to successfully integrate the embedded system itself.<br />

The following picture shows parts of the system inside the<br />

fiber-reinforced material after the manufacturing process. This<br />

also includes the post-curing process required for the fiber-reinforced material.<br />



III. MEASUREMENT AND PERFORMANCE RESULTS<br />

The whole integration process is achieved with the hot-pressure method for fiber-reinforced materials. This<br />

methodology is a well-known manufacturing process for this<br />

kind of material. Nevertheless, this method stresses the<br />

components because of the high temperature and the high<br />

pressure within the actual fabrication. Therefore, it is necessary<br />

to analyze the system and sensor behavior after the actual<br />

integration took place.<br />

It is not sufficient merely to integrate the components free of defects; the quality of the sensor signals and the overall system behavior must also be measured and tested. This approach is necessary because MEMS-based sensors are very sensitive to mechanical stress. Even the mounting on the printed circuit board plays an important role. The mechanical and thermal stress to which the elements are exposed can cause a performance degradation leading to insufficient measurements in a future application.<br />

The examined results showed nearly no degradation in<br />

the signal quality attributed to the manufacturing process [8].<br />

One method to analyze a MEMS-based inertial sensor is the<br />

Allan variance, introduced by David W. Allan to measure the<br />

frequency stability in oscillators [10]. This method can be<br />

adopted to characterize MEMS-based sensors and to analyze a<br />

sequence of data in the time domain [9]. In the present work,<br />

the calculation rules for the evaluation of a MEMS-based sensor<br />

on the basis of the Allan variance and their graphical contexts<br />

were applied to determine and quantify the different noise terms<br />

that exist in inertial sensor data.<br />

In general, the Allan variance analysis of a signal in the time<br />

domain consists of computing its Allan deviation as a function<br />

of different averaging times and then analyzing the<br />

characteristic regions and log-log scale slopes of the Allan<br />

deviation curves to identify the different noise modes [8]. The<br />

major noise relevant terms within this example are the<br />

Acceleration-Random-Walk (1), the Bias instability (2) and the<br />

Rate-Random-Walk (3).<br />

σ_ARW = σ(τ₀) ⋅ √τ₀   (1)<br />

σ_Bias = σ(τ₁) ⋅ √(π / (2 ⋅ ln(2)))   (2)<br />
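The Allan deviation computation described in this section can be sketched in a few lines. The following is an illustrative, non-overlapping implementation (not the authors' code), assuming raw sensor samples at a fixed sampling rate:<br />

```python
import numpy as np

def allan_deviation(samples, fs, cluster_sizes):
    """Non-overlapping Allan deviation of a raw sample stream.

    samples: 1-D array of sensor readings
    fs: sampling rate in Hz
    cluster_sizes: cluster lengths m; the averaging time is tau = m / fs
    """
    taus, adevs = [], []
    for m in cluster_sizes:
        n_clusters = len(samples) // m
        if n_clusters < 2:
            continue  # not enough data for this averaging time
        # average each cluster of m consecutive samples
        means = samples[: n_clusters * m].reshape(n_clusters, m).mean(axis=1)
        # Allan variance: half the mean squared difference of successive cluster means
        avar = 0.5 * np.mean(np.diff(means) ** 2)
        taus.append(m / fs)
        adevs.append(float(np.sqrt(avar)))
    return taus, adevs
```

Plotting the deviations over the averaging times on a log-log scale yields curves like those in figures 6 to 8; a slope of −1/2 marks the random-walk region.<br />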

Therefore, the signal quality needs to be evaluated. The plots show that the integration process leads to a slight degradation, especially in the z-axis, due to the hot-pressure integration process. But the sensor readings are still usable and produce valid values, which can further be used for measurement tasks. Those measurement tasks will be introduced in the following.<br />

Fig. 6 Allan deviation plot of the X-axis (σx(τ) in |g| over τ in s, before and after the integration)<br />

Fig. 7 Allan deviation plot of the Y-axis (σy(τ) in |g| over τ in s, before and after the integration)<br />

σ_RRW = σ(τ₂) ⋅ √(3 / τ₂)   (3)<br />
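Equations (1) to (3) translate directly into code. A small sketch follows; the σ readings and τ values in the example below are hypothetical, chosen only to illustrate the calculation:<br />

```python
import math

def noise_terms(sigma_tau0, tau0, sigma_tau1, sigma_tau2, tau2):
    """Noise coefficients from Allan deviation readings per eqs. (1)-(3).

    sigma_tau0 is read in the -1/2-slope region at tau0, sigma_tau1 at the
    flat minimum tau1, and sigma_tau2 in the +1/2-slope region at tau2.
    """
    arw = sigma_tau0 * math.sqrt(tau0)                          # eq. (1): Acceleration-Random-Walk
    bias = sigma_tau1 * math.sqrt(math.pi / (2 * math.log(2)))  # eq. (2): Bias instability
    rrw = sigma_tau2 * math.sqrt(3 / tau2)                      # eq. (3): Rate-Random-Walk
    return arw, bias, rrw
```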

The figures point out a degradation regarding the noise-relevant terms after the integration step as a sensor quality indicator. A defective sensor behavior could be identified via this method. In the worst case, this could introduce problems and wrong readings concerning the sensor resolution in a further, more demanding application.<br />

Fig. 8 Allan deviation plot of the Z-axis (σz(τ) in |g| over τ in s, before and after the integration)<br />

The relevant monitoring parameters of industrial braking<br />

systems are the braking torque over lifetime and also<br />

parameters such as temperature, speed and critical failure<br />

effects, e.g. a broken power supply, which are recorded by the<br />


device. Using the embedded system encapsulated in the fiber-reinforced material, the following measurements were carried out with a custom test bench to simulate degrading effects and long-term damage to the system. Various case studies were carried out with the help of the test stand.<br />

For this purpose, use cases were defined which allow the<br />

methodology to be applied to the field of electromagnetic safety<br />

brakes and industrial electric drives. Simple scenarios such as<br />

the detection of the rotation direction and position of the rotor<br />

were considered and implemented with the help of MEMS-based sensors. Thus, it is possible to detect and quantify the direction of rotation, the angular speed and the position of the rotor without any external sensors by evaluating (4) and (5), with a_r being the radial acceleration, r_a the distance between the sensor and the center of the printed circuit board, and n the rotational speed.<br />

Exceeding or falling below the limits can be detected; when a limit value is reached, a warning about the wear of the rotor-hub connection can be generated. Therefore, it is possible to detect the defect very early while the machine is running.<br />

a_r = (2 ⋅ π ⋅ n)² ⋅ r_a   (4)<br />

n = (1 / (2 ⋅ π)) ⋅ √(a_r / r_a)   (5)<br />

Fig. 9 Smart-rotor based detection of a good rotor hub connection<br />
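Equations (4) and (5) form a simple round trip. A sketch follows; the sensor radius of 50 mm in the usage example is an assumed value, not taken from the paper:<br />

```python
import math

def radial_acceleration(n, r_a):
    """Eq. (4): radial acceleration at speed n (revolutions per second) and radius r_a."""
    return (2 * math.pi * n) ** 2 * r_a

def rotational_speed(a_r, r_a):
    """Eq. (5): speed recovered from the measured radial acceleration."""
    return math.sqrt(a_r / r_a) / (2 * math.pi)
```

For example, rotational_speed(radial_acceleration(10.0, 0.05), 0.05) recovers the assumed 10 rev/s.<br />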

In addition, much more demanding tasks, such as the state<br />

monitoring of the rotor-hub connection, were also considered.<br />

Two of those representative examples and the results will be<br />

presented in the following using only the brake disc equipped<br />

with sensors and microcontroller.<br />

A. Connection between the rotor and the hub<br />

If the braking system is used as an active deceleration component, the rotor-hub connection can deflect due to the transmission of the mechanical torque. According to the current state of technology, the rotors are replaced upon reaching the limit for the air gap; a critical condition within the rotor-hub connection cannot yet be detected.<br />

If the connection fails, the drive shafts can no longer be stopped<br />

by the brake which could result in major damage.<br />

a_t = r_a ⋅ dω/dt   (6)<br />

The radial acceleration a_r, shown as the x-component in the following figure, indicates the loose connection as a shift in the radial direction. The tangential acceleration a_t, the y-component, can be calculated according to (6). It indicates the change of the angular velocity over time and obviously depends on the distance of the sensor from the center.<br />
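Equation (6) can be approximated from sampled angular-velocity data with a finite difference. A minimal sketch, assuming uniformly timestamped ω samples (the function name and sample values are illustrative):<br />

```python
def tangential_acceleration(omega, t, r_a):
    """Eq. (6): a_t = r_a * d(omega)/dt, via finite differences over sampled omega(t)."""
    return [
        r_a * (omega[i + 1] - omega[i]) / (t[i + 1] - t[i])
        for i in range(len(omega) - 1)
    ]
```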

Noticeable for an occurring defect are the peaks in the acceleration values, marked with arrows in Figure 10. The<br />

loose connection during the rotation leads to a radial<br />

displacement, which causes a measurable acceleration as a<br />

peak.<br />

For the detection of this defect, an upper and a lower limit can be taken into account and a tolerance band for the expected acceleration values can be defined. A simple counter then registers how often the measured values leave this band.<br />
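The tolerance-band check with a counter can be sketched as follows; the band limits in the example are hypothetical values, not taken from the measurement:<br />

```python
def count_band_violations(accel_samples, lower, upper):
    """Count acceleration samples outside the tolerance band.

    A growing count over time indicates increasing wear of the rotor-hub
    connection, so a warning threshold can be applied to the counter.
    """
    return sum(1 for a in accel_samples if a < lower or a > upper)
```

Applied to an acceleration trace, this would count peaks like those marked in Figure 10.<br />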

Fig. 10 Smart-rotor based detection of a defective rotor hub connection<br />

The greater the wear occurring, the greater the number of peaks.<br />

The measured clearance between the rotor and the brake disc<br />

was around 0.1 mm during the measurement. With the<br />

presented method, the connection between the rotor and the<br />

attached hub could be monitored on the shaft by using the<br />

smart-rotor. For this safety-critical component, this option is a great added value, and the necessary form fit for optimal deceleration can be monitored during main operation.<br />

B. Status Detection of the brake system<br />

In addition to the speed measurement, the state of the anchor plate is already detectable using state-of-the-art external sensors. So far, only the two states, brake closed or brake released, can be detected, e.g. by using micro buttons. A disadvantage is the wear of the mechanical components inside the button; furthermore, the button itself must be adjusted very accurately to correctly detect the end position of the anchor plate. In comparison, the advantage of the<br />

condition detection via the measuring rotor with a<br />

magnetometer is shown in the following figure. By<br />

continuously evaluating the measured magnetic flux density in<br />

the z-direction B z, it is possible to make further statements<br />

about the spring-applied brake system in addition to the status<br />

determination.<br />



For the measurement, the mounting direction of the rotor was<br />

chosen so that only positive values are recorded. Pictured is the<br />

z-component of the magnetometer for the single release of the<br />

brake over a period of 12 seconds under the influence of<br />

different supply voltages.<br />

This variation of the voltage is intended to simulate a fault<br />

occurring in the energy supply of the brake. The curve for the<br />

rated voltage of 24 V is used as a reference. When opening the<br />

brake, an increase in the magnetic flux density is expected. The<br />

coil in the magnet housing of the spring-pressure brake builds<br />

up a magnetic field when connected to a power supply, which can be detected. After switching off, the field degrades again.<br />

Fig. 11 Smart-rotor based detection of a broken power supply<br />

Fig. 12 Smart-rotor based detection of a jammed anchor plate<br />

For a reliable state determination, the evaluation of the z-component of the magnetometer must satisfy the condition 400 µT < B_z < 500 µT in this case. The nominal<br />

voltage of 24 V was used as a reference for this purpose. The<br />

400 µT result from releasing the brake against the preload force<br />

of the spring. With the upper limit of 500 µT, the jamming of<br />

the brake disc is taken into account so that the digital button<br />

does not generate a wrong status. If the detected flux density is<br />

below the limit of 400 µT, there must be a fault in the power<br />

supply.<br />

A jammed anchor plate can be detected by exceeding the limit.<br />
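The window comparison described above maps to a three-way classification. A minimal sketch using the limits from the text (400 µT and 500 µT); the function and state names are illustrative:<br />

```python
def brake_state(b_z_uT):
    """Classify the brake state from the magnetometer z-component in microtesla.

    Limits per the text: below 400 uT the coil field is missing (power-supply
    fault), above 500 uT the anchor plate is jammed, in between the brake is
    properly released against the spring preload.
    """
    if b_z_uT < 400:
        return "power supply fault"
    if b_z_uT > 500:
        return "anchor plate jammed"
    return "brake released"
```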

In direct comparison to the prior techniques, this method has<br />

the advantage of monitoring the coil and the power supply of<br />

the spring pressure brake in addition to the state detection. If an<br />

error occurs, the cause can be identified. With a simple<br />

pushbutton, it is only possible to check in conjunction with a<br />

control whether the desired control pulse has been converted by<br />

the brake.<br />

By evaluating the absolute values of the magnetic flux density,<br />

this system can determine the cause of the fault. A distinction can be made between a short circuit or a defect in the coil and, additionally, the mechanical clamping of the anchor plate. Using this methodology, a fault tree analysis could be derived that supports the user while diagnosing the system.<br />

IV. CONCLUSION AND OUTLOOK<br />

The presented work shows the successful integration of<br />

microelectronic circuitry into a fiber-reinforced material with<br />

the help of the hot-pressure manufacturing method. Selected measurement tasks for predictive maintenance were presented using the wirelessly powered brake rotor. It should be mentioned that the presented measurement tasks are only an excerpt and are tailored to the area of industrial brake systems.<br />

By using and installing them in a machine, the repair times and<br />

maintenance costs could be optimized instead of constantly<br />

struggling with unplanned repairs of failed machines.<br />

As a vision, companies can reduce costs through scheduled<br />

maintenance when wear and tear is reported by wireless<br />

sensors [11]. In addition, cloud servers could use sensor data in<br />

algorithms to predict high and low demand periods, and the<br />

operation of the system could be adapted to save energy. The<br />

research shows these models can be refined by moving the<br />

sensors into the very heart of the machine, accelerating the<br />

progress of the smart factory.<br />

With cost-effective, wireless SoCs and sensors embedded<br />

directly in components and connected to the cloud via a mesh<br />

network, high-performance servers can, for example, determine<br />

the location of underutilized devices and identify bottlenecks in<br />

processes to adjust process speed accordingly.<br />

This vision will also evolve within the automotive sector for<br />

upcoming important applications turning conventional<br />

automotive parts like wheels into "smart" devices with<br />

integrated sensors - enabling real-time monitoring of proper<br />

functionality - a basic requirement for getting safety relevant<br />

“conventional” hardware of highly automated driving vehicles<br />

compliant with the requirements of functional safety according to<br />

ISO 26262.<br />

Annotation<br />

The work presented in the context of this paper is funded by the<br />

Central Innovation Program for Small and Medium-Sized<br />

Businesses (ZIM).<br />



V. REFERENCES<br />

[1] Caroline Hayes, “IoT extends its reach into the industrial landscape”,<br />

Nordic Semiconductor ULP WQ Summer 2017 page 20.<br />

[2] Gehard: „Verzahnung bis 25 Millionen Lastzyklen verschleißfrei“, Iris<br />

Gehard, freie Journalistin in München, im Auftrag der Rex Industrie-<br />

Produkte Graf von Rex GmbH, Vellberg. Konstruktion 3-2013, s.l.:<br />

Springer VDI Verlag.<br />

[3] H.J. Wunderlich: „Models in Hardware Testing“, Springer Netherlands,<br />

2010.<br />

[4] G. Huertas Sánchez, D. Vázquez García de la Vega, A. Rueda Rueda, J.<br />

Huertas Díaz: “Oscillation-Based Test in Mixed-Signal Circuits”,<br />

Springer Netherlands, 2006.<br />

[5] M.L. Bushnell and V.D. Agrawal: "Essentials of Electronic Testing for Digital, Memory and Mixed-Signal VLSI Circuits", Springer, New York, 2010.<br />

[6] S. Grunwald, B. Bäker: „Integration eines Mess- und Sensorsystems in<br />

einen Integralrotor für elektrische Antriebe unter Verwendung des<br />

Heißpressverfahrens“, 12. VDI/VDE Mechatronik-Tagung, Dresden, 09.-<br />

10. March 2017.<br />

[7] S. Grunwald, B. Bäker: „Integrated measurement units and sensor<br />

systems for harsh industrial applications“, AMA Sensor Conference 2017,<br />

Nuremberg, 30.5.2017 – 1.6.2017.<br />

[8] N. El-Sheimy, H. Hou, X. Niu: „Analysis and modeling of inertial sensors<br />

using allan variance”. IEEE Trans. Instrum. Meas. 2008, 57, 140–149.<br />

[9] A. A. Hussen, I. N. Jleta: „Low-Cost Inertial Sensors Modeling Using<br />

Allan Variance”, World Academy of Science, Engineering and<br />

Technology International Journal of Computer, Electrical, Automation,<br />

Control and Information Engineering Vol: 9, No: 5, 2015.<br />

[10] D. W. Allan: „Statistics of atomic frequency standards”, Proceedings of<br />

the IEEE, vol. 54, no. 2, pp. 221–230, 1966<br />

[11] PricewaterhouseCoopers (PwC), “Industry 4.0: Building the digital<br />

enterprise”, Global Industry 4.0 Survey 2016.<br />



IoT Integration in Machines and Production Facilities<br />

Made Simple!<br />

Dipl.-Ing. (FH) Robert Schachner<br />

CEO<br />

RST Industrie Automation GmbH<br />

Ottobrunn-Riemerling, Germany<br />

r.schachner@rst-automation.com<br />

Abstract— Companies are beginning to retrofit their<br />

production sites to meet the requirements of the future. But still<br />

there are more questions than answers. How does a cloud work,<br />

what does service oriented mean? Are big players always the go-to<br />

solution or do we need SMEs? This presentation intends to<br />

encourage SME machine builders and software providers to<br />

broach the subject and get involved.<br />

Keywords—middleware; production; machine control; IoT;<br />

communication; integration; PLC; digitalization<br />

I. INTRODUCTION<br />

The realization of machines and production facilities in<br />

accordance with methods described in the Reference<br />

Architecture Model Industry 4.0 (RAMI4.0) or the parallel IoT<br />

development towards Industrial Internet Reference Architecture<br />

(IIRA) is currently only actively pursued by a small number of<br />

companies. In a lot of cases, a simple lack of know-how prevents<br />

a successful digitalization.<br />

But who has the necessary skills to implement this kind of<br />

project? The obvious answer would be a big player like IBM<br />

with Watson or Siemens and their Mindsphere program. And<br />

yes, the global players are crucial on the Office Floor level. But<br />

the most important players are the facility owners themselves<br />

with their service personnel and IT departments. They know the<br />

often decades old established (manual) procedures and methods<br />

by heart.<br />

But there is also a need for a third player: small and medium-sized businesses that are tightly networked (like in the German<br />

trade organization Embedded4You e.V.) who, in close<br />

cooperation with the owners, are able to integrate old and new<br />

requirements and processes machine by machine, service by<br />

service, on the Shop Floor.<br />

Combining all three creates the development team that is<br />

needed for a successful implementation.<br />

II. REFERENCE ARCHITECTURE MODEL INDUSTRY 4.0<br />

(RAMI4.0)<br />

Before delving deeper into implementation itself, it is useful<br />

to have a look at the Reference Architecture Model Industry 4.0<br />

as described in DIN SPEC 91345 in order to define the basic<br />

terms and establish a common understanding of reference<br />

models and architectures.<br />

Fig. 1. Participants in the Digitalization of Production Facilities<br />

Fig. 2. RAMI4.0 – Basic Structure of a Modern Production Facility<br />

As the structure diagram shows, at the center of production<br />

is not a (possibly internet based) cloud but two network tiers<br />

spanning a common semantic model between their endpoints.<br />

The so called Enterprise Network, or Office Floor, comprises<br />

the business processes that indirectly also control production<br />

through order management. This is where the main focus is on<br />

the big players like IBM or SAP. When digitalizing existing<br />

production facilities, it is best to start by modernizing the big<br />

player technology that's already in place. We're not going to<br />

linger on this subject as our focus is on successfully<br />

implementing the Real Time Network, or Shop Floor, which is<br />

usually the harder part of modernization.<br />



When digitalizing a factory, a consistent administration shell<br />

has to emerge that is able to integrate the countless different<br />

controls of assets from different manufacturers, often originating<br />

from different technological ages. This necessitates both a<br />

flexible middleware platform and an innovative SME partner<br />

capable of developing that AAS with its respective Shop Floor<br />

services and then integrating the machines one by one.<br />

Fig. 3. Reference Architecture Model Industry 4.0 (RAMI4.0)<br />

The layer model shown above is probably the most well<br />

known representation of the reference architecture, as it<br />

classifies all major parts of a production system, arranges them<br />

on three axes and assigns the relevant terms.<br />

The layer axis classifies the layers of a production system in<br />

more detail, starting with the machines or assets on the Shop<br />

Floor all through the business layer on the Office Floor. The<br />

hierarchy axis on the other hand classifies items ranging from<br />

product to internet.<br />

Of special interest is the life cycle or value stream axis. It<br />

describes procedures starting with receipt of parts all the way<br />

through to the finished product being delivered, or e.g. the sequence of error management. It is interesting to note that error<br />

management is moving towards diagnosing and rectifying<br />

potential errors even before they happen (predictive<br />

maintenance). These procedures are then implemented by<br />

defining appropriate service processes.<br />

Going another layer deeper, we arrive at the so called<br />

"industry 4.0 component". This comprises an often already<br />

present asset, the machine with its existing control and the so<br />

called administration shell (or asset administration shell, or AAS<br />

for short). The AAS surrounds the legacy control and is directly<br />

connected to the Shop Floor communication layer through<br />

service processes.<br />

Fig. 4. Industry 4.0 Component<br />

III. COMMUNICATION METHODS<br />

Now that we have a general overview, let's go another layer<br />

deeper and look at the technologies that are necessary for the<br />

implementation. In most cases control of existing mechanical<br />

assets is handled by PLC control systems as described by IEC<br />

61131-3. They are part of their respective machines and should<br />

generally not be changed due to the enormous effort and loss of<br />

manufacturer support that come with such a change. Here,<br />

process data is cyclically read, processed and output. At the core<br />

of this form of communication is a common memory area<br />

containing all pertinent values. Since these are cyclically<br />

overwritten only the latest values are available at any time.<br />

Fig. 5. Synchronous Communication Model<br />

This model is the optimal implementation for cycle based<br />

and synchronous processes like regulators. On the other hand,<br />

since any data that gets overwritten is lost, its usefulness for<br />

handling status information like error messages or commands is<br />

extremely limited. Nevertheless, we need this model to build a<br />

so called "digital twin" that will be used to mirror data from the<br />

PLC control. More on that later.<br />
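The synchronous process-data model can be pictured as a shared memory area in which each cycle overwrites the previous values. A minimal sketch; the class and method names are illustrative, not from any specific middleware:<br />

```python
class ProcessImage:
    """Cyclically overwritten process data area: only the latest values survive."""

    def __init__(self):
        self._values = {}

    def write_cycle(self, values):
        # each cycle overwrites the previous readings in place; history is lost
        self._values.update(values)

    def read(self, name):
        return self._values.get(name)
```

Writing {"temp": 71.2} and then {"temp": 71.5} leaves only 71.5 readable, which is exactly why status messages and commands need the asynchronous model instead.<br />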

As mentioned, the implementation of flexible and persistent<br />

sequences necessitates the use of another form of<br />

communication. This is ideally solved using serialized<br />

communication of data and commands via brokers that distribute<br />

and organize the messages. This way, any data remains in the<br />

system until fetched by a client. Most middleware based<br />

technologies use this as their only means of communication.<br />

Often they do not support the synchronous process data model<br />

even though it is imperative for cyclical processing and<br />

establishing the digital twin.<br />

Information can now be sent to specific clients through<br />

explicit addressing. Alternatively, messages can be published<br />

to so called topics (semantically addressed virtual channels).<br />

Clients can then subscribe to topics, and any message published<br />

to the subscribed topic will automatically be delivered to them.<br />

Powerful wildcard functions facilitate choosing relevant topics.<br />

Publish / Subscribe is the only communication model that does<br />



not necessitate a direct link between sender and receiver. This is<br />

currently the most important innovation in process<br />

communications since it is the indispensable basis for building a<br />

cloud.<br />
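The publish/subscribe pattern with topic wildcards can be sketched with a tiny in-memory broker. This is illustrative only (real middleware such as MQTT uses `+`/`#` topic wildcards, persistent queues and network transport; here shell-style wildcards stand in), and all topic names are hypothetical:<br />

```python
import fnmatch

class Broker:
    """Minimal publish/subscribe broker: senders and receivers never link directly."""

    def __init__(self):
        self._subscriptions = []  # list of (topic pattern, callback) pairs

    def subscribe(self, pattern, callback):
        self._subscriptions.append((pattern, callback))

    def publish(self, topic, message):
        # deliver to every subscriber whose pattern matches the topic
        for pattern, callback in self._subscriptions:
            if fnmatch.fnmatch(topic, pattern):
                callback(topic, message)
```

A subscription to "shopfloor/press1/*" then receives every message published under that machine's topics, regardless of which client sent it.<br />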

Fig. 6. Asynchronous Message Based Communication<br />

This form of communication isn't slow, either. Control<br />

programs based on "state machines", which are commonly used<br />

in automation systems, can be implemented faster and more<br />

efficiently with messaging services.<br />

The message based model also offers additional quality of<br />

service functions like "last will": a predefined message that is<br />

automatically distributed when a client fails or disconnects from<br />

the network.<br />

A deciding factor in choosing a middleware is the<br />

availability of a multi broker architecture (as illustrated in fig.<br />

7). The clients inside a system can communicate via their own<br />

dedicated broker, which is much faster and independent from<br />

network access. System spanning messages are handled via<br />

broker-to-broker traffic and distributed locally.<br />

Communications become redundant and Industry 4.0<br />

components retain their functionality even if the network goes<br />

down.<br />

Though the Shop Floor is usually not connected to the<br />

internet, one should still take heed to implement adequate<br />

security including certificate based identification and encryption<br />

of transmitted data. It is further recommended to restrict<br />

available communication functions through user roles.<br />

Fig. 7. Multi Broker System with Security Functions<br />

Communication models like the ones described above are<br />

found in numerous products that are collectively called<br />

"middleware". Especially the IoT sphere offers a wide range of<br />

available solutions. Let us compare some exemplary products<br />

and their communications functionality:<br />

Functionality<br />

OPC<br />

UA<br />

Middleware Products<br />

DDS Gamma V MQTT<br />

Asynchronous / Message Based coming yes yes yes<br />

Publish / Subscribe coming yes yes yes<br />

Synchronous / Cyclical yes yes yes no<br />

Real Time Capable coming yes yes yes<br />

Multi Broker no yes yes no<br />

Integrated Security yes yes yes yes<br />

Simple Programming yes no yes yes<br />

IV. SHOP FLOOR SERVICES<br />

Now that we have had a look at the different available<br />

communication strategies and their uses, let's examine how they<br />

work with our Industry 4.0 component.<br />

Fig. 8. Schematic of an Industry4.0 Component Based on Middleware<br />

Architecture<br />

In this example, the PLC stands for the existing control<br />

systems of a machine. It is irrelevant whether we're discussing<br />

designing a new production facility or retrofitting a legacy site<br />

for the digital age. Machines sourced from external suppliers<br />

often come with all kinds of different control systems, interfaces<br />

and field buses. Changing these is not advisable, as doing so risks the loss of<br />
warranty and support (not to mention the effort required).<br />

Therefore we create a digital twin of the PLC control based on<br />

our synchronous communications model. The twin is connected<br />

to the real system through appropriate interfaces (usually field<br />

buses) and process data from the PLC is mirrored in the semantic<br />

process data model. Now all data can be managed and processed<br />

without hindrance and the services of the administration shell<br />

can be connected.<br />
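As a minimal illustration of such a mirror, the sketch below cyclically copies raw PLC values into a semantically named process data model. The field-bus read function, addresses, signal names and scaling factors are hypothetical stand-ins for the real interfaces:

```python
# Minimal digital-twin sketch: cyclically mirror raw PLC process values into a
# semantic data model. read_fieldbus() is a stand-in for the real interface.
def read_fieldbus():
    # In a real system this would poll the PLC via e.g. PROFINET or Modbus.
    return {"DB10.W0": 412, "DB10.W2": 75}

# Hypothetical mapping from raw PLC addresses to semantic names and scaling.
SIGNAL_MAP = {
    "DB10.W0": ("spindle_speed_rpm", 1.0),
    "DB10.W2": ("coolant_temp_c", 0.1),
}

def update_twin(twin: dict) -> dict:
    """One mirror cycle: read raw values and store them under semantic names."""
    raw = read_fieldbus()
    for address, (name, scale) in SIGNAL_MAP.items():
        twin[name] = raw[address] * scale
    return twin

twin = update_twin({})
```

After each cycle the twin holds engineering-unit values under semantic names, and the administration-shell services can operate on it without knowing the PLC's address layout.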



When developing new machines, the same concept can be<br />
used. Instead of a PLC control, a field bus is adapted that<br />
connects I/O signals directly to the process data model. The<br />
control software can then coordinate processes while the overarching<br />
services of the administration shell process the data.<br />

A good middleware should not only provide a diverse range<br />

of tools for control and service development but also for test and<br />

simulation. On-site programming and troubleshooting, which<br />

used to shut down entire production lines until all errors were<br />

eliminated, is no longer a desirable option. Therefore, one<br />

should strive to implement Continuous Integration techniques<br />

with automatic code rollouts.<br />

V. ADMINISTRATION SHELL SERVICES<br />

The services shown in Fig. 8 are utility programs that pool<br />
and process information from one or several machines. In an<br />
Industry 4.0 component, such services form the administration<br />
shell and communicate with services on a higher level through<br />
the resource manager. Thus, information is collected and<br />
coordinated across numerous machines.<br />

An important example for the use of services in a production<br />

facility is error management. A service in the administrative<br />

shell identifies an error based on the available information and<br />

publishes the error state to a predefined topic. Modern<br />

"predictive maintenance" services are able to deduce from the<br />
available information that an error will shortly occur and<br />
publish relevant warnings. Services running on a higher<br />
system level subscribe to these topics and can log the events in a<br />
database or inform a service technician.<br />
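The flow can be illustrated with a deliberately minimal in-process publish/subscribe sketch; the topic name and error record are invented, and a real system would use a broker such as MQTT instead:

```python
from collections import defaultdict

# In-process publish/subscribe sketch standing in for a real message broker.
subscribers = defaultdict(list)

def subscribe(topic, callback):
    subscribers[topic].append(callback)

def publish(topic, message):
    for callback in subscribers[topic]:
        callback(message)

# Higher-level service: log error states (it could equally notify a technician).
error_log = []
subscribe("plant/press1/errors", error_log.append)

# Administration-shell service detects an error state and publishes it.
publish("plant/press1/errors", {"code": "E042", "severity": "warning",
                                "detail": "bearing temperature rising"})
```

The publisher does not know who consumes the error state; adding a database logger or a notification service is just another subscription on the same topic.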

A protocol converter translates data between Office Floor and Shop Floor. This results<br />
in connectivity between arbitrary endpoints and an overarching,<br />
common semantic model.<br />

Fig. 10. Exemplary Overview of a Networked Production Facility<br />

The successful implementation of a digital production site is<br />
only possible when owners, SMEs and global players work hand<br />
in hand to pool their strengths. Implementation strategies as well<br />
as migration strategies have to be developed. Single-handed<br />
attempts or rushed implementations without proper planning, as<br />
are regrettably occurring at the moment, pose a considerable<br />
threat to both the projects themselves and the owners running them.<br />

Fig. 9. Error Management Services<br />

VI. FURTHER IMPLEMENTATION<br />

Once all necessary workflows have been implemented in<br />
accordance with the reference system, all machines on a<br />
production site can be transferred into the new homogeneous<br />
service landscape.<br />

In addition to the message-based communication structures<br />
between machines and their synchronous controls, it is often<br />

necessary to also establish a synchronous network. This<br />

facilitates transferring cyclical process parameters in real time.<br />

Only now does the connection to the overarching Office<br />
Floor become possible and make sense.<br />


Dotdot Unifies IoT Device Networks<br />

Jason Rock<br />

IoT Products<br />

Silicon Labs<br />

Austin, TX USA<br />

jason.rock@silabs.com<br />

Abstract— Dotdot is the universal language of the Internet of<br />

Things (IoT). Today, devices are being connected all around us.<br />

Most of these connected devices, however, cannot interoperate<br />

with each other due to their unique skills, rules and systems. If<br />

cost, power and RF regulatory compliance were not factors, the<br />

simplest solution would be to replace or retrofit every device and ensure<br />

they use a single physical protocol such as Wi-Fi. In reality, each<br />

device’s chosen wireless technology is optimized for its particular<br />

application. Therefore, the better solution to achieve device<br />

interoperability across an array of devices with various protocols<br />

is to embed and use an application library such as Dotdot. This<br />
allows each connected device to be physically optimal for its<br />
intended purpose and enables devices in multiple<br />
ecosystems to communicate with each other.<br />

Keywords—IOT, Dotdot, Zigbee, Thread, Bluetooth, multiprotocol,<br />

software, edge, cloud<br />

I. INTRODUCTION<br />

Devices are being connected all around us due to the growth<br />

and evolution of the IoT. In fact, Gartner predicts, “By 2020, IoT<br />

technology will be in 95% of electronics for new product<br />

designs” [1]. Most of these connected devices, however, cannot<br />

interoperate with each other because each ecosystem requires<br />
unique device attributes and behaviors, which in turn means<br />
these devices, by default, cannot communicate with<br />
devices outside their intended ecosystem.<br />
This effectively locks consumers into branded ecosystems<br />
based on their desired device selection, and vice versa.<br />

One challenge with the walled gardens created by branded<br />

device ecosystems is that it makes it difficult for device<br />

manufacturers to develop end devices or manage product<br />

variants that can communicate across these multiple ecosystems.<br />

Additionally, the desire for a single wireless protocol such as<br />

Wi-Fi will not be the solution as some may predict because<br />

multiple physical protocols will continue to coexist for the<br />

foreseeable future.<br />

Rather than seeking a single physical protocol, a common<br />

application language residing above these protocols will become<br />

the preferred option. Of these, an optimal option is Dotdot.<br />

Developed by the Zigbee Alliance, Dotdot can instead be<br />

installed or upgraded into each set of connected devices to<br />

process interaction commands between each system. The<br />

presence of Dotdot’s device descriptors presents a common<br />

language allowing each device to operate optimally for its<br />

purpose while enabling connected devices to communicate with<br />

other devices within or outside of its branded ecosystem.<br />

II. BACKGROUND<br />

A. Before IoT, mobile phones became smart<br />

Today, the IoT faces a scenario similar to what the mobile phone<br />
industry experienced in the late 1990s when mobile operators<br />

had to choose a protocol such as CDMA or GSM. Based on that<br />

selection, international travelers discovered the challenges of<br />

getting phone service as they roamed into networks with<br />

unregulated cost structures. Consumers choosing a different<br />

service found themselves with phones unable to interoperate on<br />

the chosen network. While some predicted one standard would<br />

prevail, the eventual outcome was the existence of multiple<br />

technologies with multiple physical wireless standards<br />

interoperating with each other. This caused mobile phone<br />

manufacturers to introduce multiband phones that could<br />

interoperate over an array of deployed wireless technologies,<br />

allowing mobile phone operators to negotiate favorable terms<br />

to share their networks. These actions to work together to<br />

contend with what was available and establish a way for mobile<br />

phones to work across competing networks ultimately benefited<br />

the consumer. It also helped facilitate the next evolution of the<br />

smart phone, with Android at the center of this evolution, which<br />

led to the creation of the mobile app industry [2]. The Internet<br />

of Things is beginning to use the same approach.<br />

B. Walled gardens don’t make good networks<br />

Concannon Business Consultant Michael Dorazio<br />

poignantly states the connected device conundrum: “The<br />

Internet of Things will weave a seamless tapestry of connected<br />

devices into your life. Except that it won’t … if things keep<br />

going the way they are” [3]. This means that ‘walled gardens’<br />

that only work with branded devices will likely leave the<br />

consumer in a struggle to determine which new devices work<br />

with the devices they currently own.<br />

C. The woes of the modern building manager<br />

To understand the problem, consider commercial buildings<br />

being managed by today’s building managers. A modern<br />

building will contain connected LED lighting, a state-of-the-art<br />

security system and an efficient HVAC with remote sensors.<br />



Without a standardized application layer, it is likely that none of<br />

these systems can easily communicate with each other.<br />

Further, what if a building manager overseeing a dozen<br />

buildings located in multiple regions needs to apply a uniform<br />

policy to:<br />

• Pre-condition every building starting at 6:00 am,<br />
• Activate all lights at 7:00 am,<br />
• At 7:00 pm, set all buildings to an energy-efficient mode for lighting and HVAC systems, and<br />
• Provide exceptions for weekends and holidays.<br />
To do so, this building manager would likely have to log into 24<br />
different accounts (12 for lighting, 12 for HVAC), set up 12 sets<br />
of lighting policy rules as well as the same number of individual<br />
rules for the HVAC. To complete the change, he/she will then<br />
ask local staff to help verify that each system operates as<br />
programmed.<br />

D. Contending with regulations<br />

While this scenario may seem to be a temporary problem that<br />

only occurs during installation, maintenance or infrequent policy<br />

changes, government and regulatory entities at the federal, state,<br />

and municipal level are taking a more active role by mandating<br />
energy policies that building managers must contend with across<br />
the buildings under their<br />
administration. For example, the State of California’s Title 24’s<br />

2013 standard requires buildings greater than 10,000 square feet<br />

to be capable of automatically reducing lighting power in<br />

reaction to a demand response signal by a minimum of 15%<br />

below the total installed lighting power [4]. These regulations<br />

additionally may require the submission of monthly or yearly<br />

reports. If these systems have no way to<br />
intercommunicate, compiling such reports can be challenging, as<br />
the building manager has to gather the necessary<br />
data manually. Failure to adhere to specific regulations could lead to<br />

penalties such as fines, loss of tax incentives, or possibly<br />

decertification of the building’s occupancy permit.<br />

E. Consider the array of physical protocols<br />

Connected LED lighting best represents the challenges faced<br />
within the IoT evolution due to the variety of adopted physical<br />
protocols. Connected lighting, especially in a commercial setting,<br />
tends to use wired protocols such as DALI (Digital Addressable<br />

Lighting Interface), DMX, Power over Ethernet and Power Line<br />

Communications. Wireless protocols, however, are rapidly<br />

becoming a presence within connected lighting using Zigbee,<br />

Bluetooth, Wi-Fi, EnOcean, Li-Fi, and recently 6LoWPAN and<br />

Thread, to name a few.<br />

HVAC, however, does not have as many deployed protocols,<br />
but these systems use some of the same protocols as lighting, along<br />
with others such as BACnet, Modbus, LonWorks and KNX.<br />

How does a building manager decide which protocol to<br />

deploy, especially when faced with having to choose a specific<br />

brand for lighting and a different brand for HVAC? Each brand<br />

may have its own control software, connected to different cloud<br />

management solutions requiring multiple accounts and multiple<br />

passwords to maintain.<br />

F. Using cloud APIs to patch device interoperability<br />

One method to address these challenges is to use some of<br />

the new cloud-capable platform services being offered by<br />

startups and established companies. Purchasing additional<br />
equipment such as IoT-enabled gateways, together with subscription services,<br />
offers building managers a way to connect their deployed<br />

systems and control them within a unified cloud-centric<br />

platform. In effect, these platforms take advantage of the walled<br />

garden problem by offering a paid service to help alleviate the<br />

struggle building managers are facing. In other cases such as<br />

smart cities and transportation, there are many emerging<br />

companies offering to help via paid services to solve the same<br />

sorts of challenges as more connected devices, which mostly are<br />

not interoperable, appear within their particular industry.<br />

III. MULTI-PROTOCOL WIRELESS IS HERE TO STAY<br />

A. One protocol to rule them all<br />

As in the case of connected lighting, network operators, as<br />

well as device makers hoping to solve their customers’<br />

connectivity problems, find themselves forced to offer<br />

connected devices in multiple physical protocols. If cost, power<br />

or other RF aspects such as operating frequency and range were<br />

not factors, manufacturers could standardize on a single protocol<br />

like Wi-Fi. This is not realistic and would require all device<br />

makers to arrive at the same decision across all devices within<br />

the respective ecosystems.<br />

B. Wi-Fi is great for data, bad for batteries<br />

The reality is each device’s chosen technology is in fact<br />

optimal for the end application, and to select a single protocol<br />

requires a much more complex decision that outweighs the<br />

technological tradeoffs. Replacing or retrofitting every device to<br />

ensure they use a single protocol is not realistic. The reality is a<br />

device’s chosen connectivity is optimal for its intended<br />

application within its targeted industry.<br />

TABLE I. COMMON WIRELESS RADIO STANDARDS [5]<br />
                         Wi-Fi            Z-Wave           Zigbee           Thread          BLE<br />
Launched to the Market   1997             2003             2003             2015            2010<br />
PHY/MAC Standard         IEEE 802.11      ITU-T G.9959     IEEE 802.15.4    IEEE 802.15.4   IEEE 802.15.1<br />
Frequency Band           2.4 GHz          868/900 MHz      2.4 GHz          2.4 GHz         2.4 GHz<br />
Maximum Data Rate        > 1 Gbit/s       40-100 kbit/s    250 kbit/s       250 kbit/s      2 Mbit/s<br />
Topology                 Star             Mesh             Mesh             Mesh            P2P/Mesh<br />
Power Usage              High             Low              Low              Low             Low<br />
Alliance                 Wi-Fi Alliance   Z-Wave Alliance  ZigBee Alliance  Thread Group    Bluetooth SIG<br />

Wi-Fi was designed as a way to wirelessly connect personal<br />

computers within a local network environment. The protocol<br />
allows connected devices to connect securely<br />
as they move nomadically within the given local network.<br />

Given the data transfer requirements, it uses higher overall<br />

energy consumption than other radio technology options<br />

currently available. Combined with an on-board SoC as in the<br />



case of computers and now mobile phones, these Wi-Fi enabled<br />

devices use rechargeable lithium-polymer batteries versus small<br />

disposable batteries. In essence, one would be tossing out<br />

numerous batteries if these devices were not built to be<br />

rechargeable.<br />

On the other hand, wireless protocols such as Zigbee and<br />
Z-Wave are being adopted by home automation and security device<br />

manufacturers as they are better suited for battery-efficient<br />

applications. Bluetooth is widely adopted for direct mobile<br />

phone connectivity across a variety of devices such as headsets,<br />

peripherals, fitness trackers, light bulbs, beacons and other asset<br />

trackers.<br />

IV. DOTDOT: A COMMON DEVICE LANGUAGE<br />

While there is a desire for a single wireless protocol spurred<br />

by the need for device interoperability, each protocol was<br />

conceived to satisfy a particular physical aspect of a specific<br />

market application. As a result, the best way to satisfy the desire<br />

for interoperability is to provide a method for each technology<br />

to be able to communicate with each other in a language<br />

designed for connected devices.<br />

The most widely deployed use of a common device language<br />

is the Zigbee Cluster Library (ZCL) developed by the Zigbee<br />

Alliance, which is running on hundreds of millions of IoT<br />

devices. In 2017, the Zigbee Alliance introduced Dotdot, which<br />

is effectively a rebranding and extension of the ZCL to operate<br />

across other wireless protocols. As one journalist noted: “The<br />

ZCL is mature and comprehensive, representing years of work<br />

defining and cataloging how ‘things’ will interoperate.” [6]<br />
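To give a flavor of the ZCL's cluster model: the Level Control cluster (ID 0x0008) defines, among others, a CurrentLevel attribute and a MoveToLevel command. The dictionary layout and handler below are a simplified illustration of that structure, not the specification's actual encoding:

```python
# Simplified representation of a ZCL-style cluster definition. The Level
# Control cluster and its CurrentLevel attribute exist in the published ZCL;
# this dictionary layout is purely illustrative.
LEVEL_CONTROL_CLUSTER = {
    "cluster_id": 0x0008,          # Level Control
    "attributes": {
        0x0000: {"name": "CurrentLevel", "type": "uint8", "range": (0, 254)},
    },
    "commands": {
        0x00: {"name": "MoveToLevel", "fields": ["level", "transition_time"]},
    },
}

def handle_move_to_level(state: dict, level: int) -> dict:
    """Apply a MoveToLevel command, clamping to the attribute's valid range."""
    lo, hi = LEVEL_CONTROL_CLUSTER["attributes"][0x0000]["range"]
    state["CurrentLevel"] = max(lo, min(hi, level))
    return state

light = handle_move_to_level({"CurrentLevel": 0}, 300)  # clamped to 254
```

Because every device implementing the cluster agrees on these identifiers and semantics, a dimmer and a lamp from different vendors can interpret the same command identically.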

For new devices using Wi-Fi or Thread, Dotdot residing<br />

within these devices can easily control and respond to<br />

commands as Dotdot was designed for IP-centric transport<br />

protocols that support UDP. For IoT legacy systems, Dotdot can<br />

be deployed via firmware upgrade into these devices to give<br />

them the language to communicate with ecosystems that have<br />

existing or future products that use Dotdot.<br />

In the case of Z-Wave and Bluetooth, small language<br />
translation APIs can be added to their embedded applications to<br />
translate protocol commands into the Dotdot language. Such a<br />
translation API requires only a few kilobytes of memory<br />
and minimal processor usage. Further adaptations, such as translating<br />
UDP to TCP or to a RESTful protocol, may require a little more<br />
planning, but the complexity is hardly difficult for most<br />
firmware-savvy engineers to implement on a low-cost wireless<br />
connected device.<br />
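Such a translation layer can be sketched as a small mapping from native commands to a generic application-layer message. The command names and the (cluster, command, payload) message shape below are hypothetical illustrations, not the actual Dotdot wire format:

```python
# Hypothetical translation shim: map a native Z-Wave-style command onto a
# generic Dotdot/ZCL-style (cluster, command, payload) message. The message
# shape is illustrative, not the actual Dotdot wire format.
NATIVE_TO_DOTDOT = {
    "SWITCH_BINARY_SET_ON":  {"cluster": "OnOff", "command": "On",  "payload": {}},
    "SWITCH_BINARY_SET_OFF": {"cluster": "OnOff", "command": "Off", "payload": {}},
}

def translate(native_command: str) -> dict:
    """Translate a native command into the common application language."""
    try:
        return NATIVE_TO_DOTDOT[native_command]
    except KeyError:
        raise ValueError(f"no translation for {native_command!r}")

msg = translate("SWITCH_BINARY_SET_ON")
```

A lookup table of this kind is essentially static data, which is why the memory footprint of such a shim stays small.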

V. SUMMARY<br />

As more devices become connected thanks to advances<br />

within IoT, these connected devices will need to interoperate<br />

with each other regardless of ecosystem to enable the next<br />

generation of services to arise. This was the case when mobile<br />

phones evolved into smart phones, fostering the introduction of<br />

mobile apps.<br />

While the walled garden approach by branded ecosystems<br />

allowed the connected device industry to emerge, the reality is<br />

device manufacturers need to leverage the right wireless<br />

protocol without having to contend with supporting multiple and<br />

ever-changing APIs to ensure their products can operate across<br />

an array of ecosystems.<br />

Given its ability to describe device behavior, Dotdot, with<br />

the backing of a consortium of established IoT vendors, can<br />

provide a way for device manufacturers, with minimal coding<br />
effort, to augment their connected devices to process interaction<br />

commands between each system. The presence of Dotdot’s<br />

device descriptors presents a common language allowing each<br />

device to be optimized for its intended purpose while enabling<br />

devices in multiple ecosystems to communicate with each other.<br />

Fig. 1. Example of a Zigbee attribute specification for level control<br />
Instead of finding the perfect transport protocol, Dotdot can<br />
be augmented into devices as a language to unify the many<br />
products leveraging the varied protocols. It enables an array of<br />
innovations for device control agnostic to the underlying IoT<br />
protocols within each deployed device. Using Dotdot can give a<br />
connected lighting unit a lighting profile, whether it uses Wi-Fi,<br />
DALI or Zigbee as its connectivity technology. Its device behavior<br />
and attributes are specified so that messages from a Wi-Fi<br />
lighting dimmer panel using Dotdot can be properly understood<br />
by a Zigbee light using Dotdot connected to the same local<br />
network via a protocol translation bridge. Adding<br />
Dotdot to each system is relatively simple as its memory<br />
footprint for the particular device is quite small.<br />
REFERENCES<br />

[1] D.C. Plummer et al., “Top Strategic Predictions for 2018 and Beyond: Pace Yourself, for Sanity’s Sake”, Gartner, September 29, 2017, https://www.gartner.com/doc/3803530?srcId=1-6595640685<br />
[2] The Verge staff, “Android: A visual history”, December 7, 2011, https://www.theverge.com/2011/12/7/2585779/android-history<br />
[3] M. Dorazio, “Walled Gardens are Killing the Internet of Things”, December 21, 2015, http://www.concannonbc.com/walled-gardens-are-killing-the-internet-of-things/<br />
[4] California Energy Commission, “2013 Building Energy Efficiency Standards for Residential and Nonresidential Buildings”, November 25, 2013, Section 130.1(e), http://www.energy.ca.gov/2012publications/CEC-400-2012-004/CEC-400-2012-004-CMF-REV2.pdf<br />
[5] C. Liew, “The Smart Home radio protocols war”, August 10, 2015, https://www.iot-now.com/2015/08/10/35653-the-smart-home-radio-protocols-war/<br />
[6] D. Ewing, “Delving deeper into Dotdot -- ZigBee’s new ‘Universal Language for the IoT’”, April 4, 2017, https://www.embedded.com/electronics-blogs/say-what-/4458281/Delving-deeper-into----ZigBee-s-new--Universal-Language-for-the-IoT-<br />



Localizing Analytics<br />

for Speed, Reliability and Reduced Power Consumption<br />

John Milios<br />

Sendyne Corp.<br />

New York, NY, USA<br />

jmilios@sendyne.com<br />

Nicolas Clauvelin<br />

Sendyne Corp.<br />

New York, NY, USA<br />

nclauvelin@sendyne.com<br />

Abstract --- New tools make physics-based analytics possible<br />
in an embedded environment. By computing locally – performing<br />

predictive and prescriptive analytics at the edge of the IoT –<br />

significantly less data must be directed to the cloud. Further, the<br />

data sent are more informative and they are available in serverless<br />

situations. This improves reliability, speeds computation<br />

time and reduces power consumption. In addition, physics-based<br />

models have the ability to assess the internal state of an observed<br />

system. This makes their predictions more accurate. By enabling<br />

physics-based models to operate in real time in small footprint<br />

embedded devices, the resultant robust predictive ability can lead<br />

to a reduction of needed, and often expensive, system monitoring<br />

sensors. To illustrate how embedded model-driven analytics can<br />

be implemented, a real-world example will be demonstrated: an<br />

electric motor health monitor. Each step in the implementation<br />

process will be shown, from model design to the utilization<br />

of embedded scientific computing tools, final real-time model<br />

optimization, and system predictions.<br />

Keywords --- analytics; physical analytics; Edge of the IoT;<br />

model-based analytics<br />

I. Introduction<br />

The vast amount of data generated by Internet of Things (IoT)<br />

devices and sensors is threatening to disrupt the current Internet<br />

infrastructure. The prospect of billions of interconnected devices<br />

and sensors ceaselessly generating data creates unsustainable<br />

requirements in storage, energy and bandwidth. According to<br />

CISCO, Machine to Machine (M2M) traffic alone is growing<br />

at a 49% CAGR and is projected to generate 14 exabytes/month,<br />
up from 3 exabytes/month in 2017 [1]. For<br />
comparison, one exabyte is 10^18 bytes, and some educated guesses<br />
place the storage capacity of Google<br />
somewhere around 10-15 exabytes of data [2]. Industry,<br />

academia and even governments are becoming aware of this<br />

threat and are investigating multiple approaches to address the<br />

impending very big data crisis [3],[4]. It is beyond the scope<br />

of this paper to delve into every aspect of the problem in depth.<br />

Instead we will focus on the role of analytics in reducing the<br />

traffic between IoT devices and the cloud.<br />

II. The Role of IoT Analytics<br />

All IoT generated data can be valuable but only if they are<br />

interpreted in a useful way. This is the role of analytics – one<br />

Figure 1: IoT generated data is growing faster than social & computer generated<br />
data<br />
Figure 2: Performing analytics at the edge reduces storage, energy and bandwidth<br />
requirements<br />



of the most important applications in the IoT, which can be<br />

categorized as predictive and prescriptive analytics.<br />

Predictive analytics aims to identify potential issues<br />

before they occur. The benefits are immediate; for example,<br />

unscheduled down time in a production line can be significantly<br />

reduced or eliminated.<br />

Prescriptive analytics goes one step further by acting on the<br />

data through a feedback system that optimizes a process.<br />

Given the storage, energy and bandwidth concerns that<br />
the very big IoT data is creating, it is optimal to perform the<br />
data analysis as close as possible to its source, thus reducing the<br />
transmission of unnecessarily large amounts of data, as illustrated<br />
in Fig. 2. Prescriptive analytics performed at the edge provides an<br />
added benefit by enabling local operational functionality during<br />
scheduled or unscheduled server-less conditions.<br />
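The bandwidth saving can be made concrete with a deliberately simple sketch: instead of streaming raw samples, the edge device reduces each measurement window to a few summary metrics before transmitting. The metrics chosen here are arbitrary illustrations:

```python
import statistics

def summarize_window(samples):
    """Reduce a window of raw sensor samples to a few summary metrics."""
    return {
        "n": len(samples),
        "mean": statistics.fmean(samples),
        "min": min(samples),
        "max": max(samples),
    }

# 1000 raw temperature samples collapse to one small record per window,
# so only the summary needs to cross the network to the cloud.
window = [20.0 + 0.01 * i for i in range(1000)]
report = summarize_window(window)
```

Here a thousand raw readings shrink to a four-field record, a reduction of more than two orders of magnitude in transmitted data per window.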


III. The Power of Physics in IoT Analytics<br />

Big data generated by IoT devices and sensors are different<br />

than data created by social networks, financial or business<br />

transactions. Most analytics in the latter category are statistical,<br />

dealing for example with frequency of appearance of keywords<br />

or with relating healthcare protocols to patient outcomes. In<br />

contrast, data generated by IoT sensors are measurements of<br />

natural or man-made physical systems (e.g., temperature of a<br />

specific location or velocity of a motor shaft).<br />

These physical systems are by nature deterministic, and their<br />

analysis traditionally has been physics-based. The purpose of<br />
physical systems analytics is primarily to extract information<br />
about the internal state of an observed system. Measurements<br />
are often limited to the observable behavior of the system. In<br />
most cases, though, what we are interested in is the hidden<br />
information behind these data: the internal state of the system.<br />

For example, we can measure the surface temperature of a<br />

battery but what we may be interested in is its core temperature<br />

which we cannot measure directly.<br />

The association between the observables and the hidden<br />

state information is accomplished best through physics-based<br />

and mathematical models which by design relate the inputs<br />

to the outputs through a description of the system’s internal<br />

dynamics. Physics models, once derived and formulated, do not<br />
change, and they are data-independent. If they contain<br />
enough detail they can, at least in theory, accurately predict the<br />
response of an observed system to changing input conditions.<br />

Figure 4: IoT physics-based analytics (model complexity & processing power<br />
versus number of data points) lies between traditional physics models and big<br />
data analytics<br />

So in theory if there is a physical model for a given system and<br />

we know its inputs we could predict its outputs. In practice,<br />

many physical systems are too complicated to solve accurately<br />
within the available time and computing resources.<br />
In addition, there are physical phenomena that escape<br />

first principle approaches, or they are too complicated to model.<br />

Finally, real world observable data are noisy. These are some of<br />

the issues that IoT physical analytics can address.<br />

In the IoT world, observed physical systems reside in the<br />
analog of a continuous experiment. Just as experimentation in<br />
traditional science drives the dynamic process of knowledge<br />
advancement, the co-existence of physics models<br />
and big data can create dynamic, data-dependent models with<br />

predictive power that can potentially provide better physical<br />

insights and advance knowledge. The mixing of physical<br />

models and experimental data is not novel. It has been used<br />

extensively in the semiconductor and other industries for the<br />

creation of compact physical models. In these models physics<br />

laws are combined with experimentally derived parameters and<br />

relationships in order to create small, fast and accurate model<br />

units that can scale well in very large simulations. In the big data<br />

IoT this combination of physics based models and experimental<br />

data can occur dynamically.<br />
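This dynamic combination can be sketched with a one-parameter example of the model-adjustment loop: a model predicts the plant output, compares it with the measurement, and nudges its parameter with an LMS-style step. The "plant" here is a simulated resistor with a known true value; everything about the setup is an illustration:

```python
# One-parameter sketch of the model-adjustment loop: the "plant" is a resistor
# with true resistance 2.0 ohm; the model estimates conductance g so that
# i_model = g * v, and an LMS-style step corrects g from each observation.
def plant_current(v, r_true=2.0):
    return v / r_true            # simulated plant output Y_plant

g = 0.1                          # initial model parameter (1/ohm)
mu = 0.01                        # adaptation step size
for step in range(2000):
    v = 1.0 + (step % 10)        # varying input u_c
    i_model = g * v              # model prediction Y_model
    error = plant_current(v) - i_model
    g += mu * error * v          # adjust the parameter toward the plant
# g converges to the true conductance 1/2.0 = 0.5
```

With noise-free data the parameter converges quickly; with real, noisy measurements the same loop averages the noise out over time, which is exactly the dynamic model/data combination described above.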

Figure 3: Physical systems analysis relates the observables with the hidden<br />
states<br />
Figure 5: The model reads the inputs of the plant u_c and predicts the output<br />
Y_model. After comparing its predictions with the actual outputs Y_plant, the<br />
model adjusts its parameters for a more accurate estimation of the current plant state.<br />





It occupies up to 300 KB of memory with all features enabled.<br />

Benchmark testing against automatically-generated C code<br />

from Matlab Embedded Coder exhibits an order of magnitude<br />

faster execution and similar memory usage when solving the<br />

Van der Pol equations on an ARM Cortex-M4 MCU.<br />

VI. Decentralized Scientific Processing<br />

An example of how this technology can be deployed in the<br />

IoT is the monitoring of a factory floor. In this scenario, multiple<br />
motors are operating, generating a constant flow of voltage,<br />

current, rotor position, angular velocity and acceleration data.<br />

It is desirable to derive from these data, through analytics, the<br />
health condition of each motor in order to schedule maintenance<br />

and avoid unscheduled down time. Instead of transmitting all<br />

these data to a central processing location, a local MCU can<br />

utilize a model to associate the observed electrical and motion<br />

measurements with the device parameters of interest. Such a<br />

simple model is shown in Fig. 8 for a DC motor. Utilizing<br />

this model and measurement data the numerical solver and<br />

optimizer can fit the model parameters to the incoming data<br />

dynamically monitoring the health of the motor.<br />
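A minimal sketch of this kind of on-device parameter fitting (using a generic discretized first-order motor model and synthetic data, not the paper's Fig. 8 model) might look as follows; a drift in the fitted parameters over time is the health signal:<br />

```python
# Fit the parameters of a discretized first-order motor model
#     w[k+1] = a*w[k] + b*u[k]
# from streaming (input u, speed w) samples by solving the 2x2 normal equations.
def fit_motor(u, w):
    """Least-squares estimate of (a, b) from input u[k] and speed w[k]."""
    s_ww = s_uu = s_wu = s_wy = s_uy = 0.0
    for k in range(len(w) - 1):
        x1, x2, y = w[k], u[k], w[k + 1]
        s_ww += x1 * x1; s_uu += x2 * x2; s_wu += x1 * x2
        s_wy += x1 * y;  s_uy += x2 * y
    det = s_ww * s_uu - s_wu * s_wu
    a = (s_wy * s_uu - s_wu * s_uy) / det   # Cramer's rule, 2x2 system
    b = (s_ww * s_uy - s_wy * s_wu) / det
    return a, b

# Synthetic "plant" with true a = 0.90 (friction/inertia) and b = 0.50 (gain).
u = [1.0] * 50
w = [0.0]
for k in range(49):
    w.append(0.90 * w[k] + 0.50 * u[k])

a_hat, b_hat = fit_motor(u, w)
# A gradual drop in a_hat would hint at rising friction (e.g. gearbox wear).
print(round(a_hat, 2), round(b_hat, 2))
```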

A typical scenario in this example would be to detect changes in the parameters describing the motor specifications and therefore detect the onset of faulty behaviors: for example, changes in parameters related to friction and inertia of the shaft could indicate wear in the gearbox, and changes in the motor inductance could indicate that the windings are deteriorating or overheating.<br />

Moreover, within such a setup it would make sense to transmit only metrics related to the health condition of the system rather than the entire set of observed data. The rate at which the health-condition metrics are transmitted can in turn be adapted to the health of the system itself: if the system is operating normally, data can be transmitted at a slow rate, and if a faulty behavior is detected, data can be transmitted at a higher rate so that the central processing location can accurately flag the system.<br />
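The health-adaptive reporting policy described above can be sketched in a few lines (the thresholds and intervals here are purely illustrative):<br />

```python
# Health-adaptive reporting: healthy motors report rarely, suspect motors
# report often. Score scale and thresholds are made up for illustration.
def report_interval_s(health_score):
    """Map a 0.0 (failed) .. 1.0 (healthy) score to a transmit interval."""
    if health_score >= 0.9:
        return 600      # normal operation: one metrics packet per 10 minutes
    if health_score >= 0.6:
        return 60       # degradation suspected: one packet per minute
    return 5            # fault developing: near-real-time updates

print(report_interval_s(0.95), report_interval_s(0.70), report_interval_s(0.20))
```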

There are numerous other applications and methods that can benefit from fast and accurate scientific computing on small MCUs at the edge.<br />

VII. Conclusion<br />

Decentralized scientific processing in the IoT provides a method for reducing the flow of sensor data by processing them right at the source. This IoT platform requires compact models and compact numerical solvers that can operate within today's MCU memory and speed constraints. Through physical analytics, big IoT data can advance our understanding of the physical world and lead to applications that turn big data into bigger returns.<br />

References<br />

[1] CISCO, "The Zettabyte Era: Trends and Analysis," 2017. [Online]. Available: https://www.cisco.com/ [Accessed: 12-Jan-2018].<br />

[2] What-if, "Google's Datacenters on Punch Cards." [Online]. Available: https://what-if.xkcd.com/63/ [Accessed: 12-Jan-2018].<br />

[3] S. Pappas, "How Big Is the Internet, Really?," Live Science, 2016. [Online]. Available: https://www.livescience.com/54094-how-big-is-the-internet.html [Accessed: 12-Jan-2018].<br />

[4] S. Matsuoka et al., "Extreme Big Data (EBD): Next Generation Big Data Infrastructure Technologies Towards Yottabyte/Year," Supercomputing Frontiers and Innovations, vol. 1, no. 2, pp. 89-107, Sep. 2014. ISSN 2313-8734. Available at: <…/article/view/24/120>. Date accessed: 12 Jan. 2018. doi: http://dx.doi.org/10.14529/jsfi140206.<br />

[5] A. Bieswanger, H. F. Hamann, and H. D. Wehle, "Energy efficient data center," IT-Information Technology: Methoden und innovative Anwendungen der Informatik und Informationstechnik, vol. 54, no. 1, pp. 17-23, 2012.<br />

[6] R. Melville, N. Clauvelin, and J. Milios, "A high-performance model solver for 'in-the-loop' battery simulations," in American Control Conference (ACC), 2016, pp. 3119-3125.<br />


Build an Industrial-Strength<br />

Device-to-Cloud IoT Application in 30 Minutes –<br />

No Smoke and Mirrors involved<br />

Delivering IoT Solutions with a Hybrid IoT Platform Approach<br />

Terrence Barr<br />

Head of Solutions Engineering<br />

Electric Imp<br />

Los Altos, CA, USA<br />

Hugo Fiennes<br />

CEO and Co-Founder<br />

Electric Imp<br />

Los Altos, CA, USA<br />

Abstract— Internet of Things (IoT) platforms can provide<br />

critical assistance to companies looking to deliver connected<br />

products and services by reducing risk and time to market for<br />

their IoT solutions while providing the security and flexibility to<br />

adapt to evolving market demands. Outsourcing specialist areas<br />

such as security, operational management, and long-term<br />

maintenance allows companies to focus their resources on<br />

maximizing the business value of the connected products and services<br />

themselves, rather than struggling with the underlying<br />

technology complexity.<br />

This paper discusses key requirements of end-to-end (device-to-cloud)<br />

IoT solutions and the importance and critical benefits<br />

of choosing the right IoT architecture and platform. In the<br />

corresponding conference session the attendees will learn how to<br />

build a secure, scalable, and customizable device-to-cloud IoT<br />

application in 30 minutes by integrating a managed IoT device<br />

connectivity platform with a popular IoT application cloud<br />

service.<br />

Keywords—Internet of Things, IoT, Hybrid Platform, Security,<br />

Connectivity, Devices, Device Management, Edge Intelligence,<br />

Software Platform, Security Maintenance, UL 2900-2-2, Cloud<br />

Applications, Enterprise Integration, End-to-End, Time-to-Market,<br />

Risk Minimization<br />

I. THE CHALLENGE: IOT SOLUTION COMPLEXITY<br />

Many companies now understand the business benefits of<br />

connected products and services in the Internet of Things (IoT).<br />

The value of connected products and services is driven by<br />

IoT business applications, which in turn depend on trustworthy,<br />

accurate, and reliable data from devices (products) in the field.<br />

Therefore, any IoT solution must be created and delivered end-to-end,<br />

from the device to the business application, to generate<br />

the expected business benefits.<br />

However, implementing, deploying, and supporting end-to-end<br />

IoT solutions can be complex and challenging – in<br />

particular meeting the increasing commercial or industrial<br />

requirements for security, reliability, scalability, and longevity.<br />

This complexity is a key factor that is slowing down or holding<br />

back many IoT deployments today. Among some of the<br />

technical challenges are:<br />

• Security (from device hardware through<br />

communications to cloud and management)<br />

• Hardware selection and product design<br />

• Software complexity (device and cloud)<br />

• Multitude of communication technologies<br />

• Device manufacturing and deployment at scale<br />

• Integration of legacy systems<br />

• Protocol conversion and data integration<br />

• Cloud infrastructure and scalability<br />

Many product and services companies do not possess the<br />

technical expertise, resources, or appetite for risk to build and<br />

deliver IoT solutions by themselves. For such companies, it<br />

makes more sense to focus on their core competencies and<br />

leverage IoT offerings from specialized vendors who abstract<br />

much of that complexity away from the customer.<br />

Unfortunately, choosing the right IoT offerings itself is a<br />

challenge. A dizzying number of IoT offerings exist in the<br />

market today, from low-level device hardware components on<br />

one end to powerful cloud-based IoT platforms on the other<br />

end, and any number of components, technologies, standards,<br />

protocols, tools, and services in between.<br />

When approaching the IoT solutions market, there are two<br />

extremes visible:<br />

1. Bespoke IoT solutions, which are customized<br />

from a number of different technology and<br />

service components, and assembled, integrated,<br />

delivered, and supported by the vendor for a<br />

specific customer<br />

2. Off-the-shelf IoT Solutions, which are pre-integrated<br />

technology and services packaged as<br />

© Electric Imp, 2018<br />

177<br />

www.embedded-world.eu


IoT solutions targeting specific markets and use<br />

cases, offered by the vendor with limited<br />

customization<br />

#1 (Bespoke) offers maximum flexibility for IoT solutions<br />

at the expense of complexity, cost, and time-to-market.<br />

Building bespoke IoT solutions from components requires<br />

substantial expertise, incurs technology and execution risk, and<br />

carries the burden of having to support the bespoke IoT<br />

solution over its entire lifetime. In our experience, this burden<br />

is almost always underestimated, especially with regard to security, resulting in projects going over time and over budget.<br />

#2 (Off-The-Shelf) focuses on narrow functionality for<br />

specific applications and market segments, typically trading<br />

fast time-to-market and simplicity against flexibility.<br />

While off-the-shelf offerings may initially seem attractive,<br />

companies often find that these are not flexible enough to<br />

integrate well with existing products and business models, do<br />

not scale across product or business lines, and don’t evolve<br />

well with business needs. The difference between a Proof-of-Concept and a shipping IoT product is often about corner cases<br />

that occur in the real world, and a rigid IoT solution is often not<br />

capable of handling these corner cases efficiently.<br />

Neither extreme is viable for the majority of the companies looking for IoT solutions today. Many companies<br />

are looking for a straightforward but flexible approach to IoT –<br />

the key business requirements can be summarized as follows:<br />

• Fast time-to-market, low execution risk, and<br />

predictable, bounded expenditure<br />

• Well-designed, fully integrated security from<br />

device hardware to cloud, and maintained for the<br />

lifetime of the product<br />

• Flexibility to address unique and evolving<br />

technology and business needs<br />

• Easy integration with, and support for, existing<br />

and future products<br />

• Simplified procurement, integration, and delivery<br />

without sacrificing functionality or flexibility<br />

• Low upfront investment, and ability to<br />

incrementally invest as connected business scales<br />

• Cost effective, timely long-term support of<br />

solution<br />

II. THE SOLUTION: HYBRID IOT PLATFORM APPROACH<br />

In our experience, these requirements are best met with a<br />

Hybrid IoT Platform approach, which combines a<br />

comprehensive IoT device connectivity and management<br />

platform with the IoT application cloud platform that is best<br />

suited for the customer’s needs.<br />

1. IoT Device Connectivity and Management Platform<br />

The IoT Device Connectivity and Management Platform<br />

connects devices to the cloud securely, reliably, and at scale.<br />

Device connectivity and management is a highly specialized<br />

field which requires expertise in device hardware, security<br />

from device to cloud and all layers in-between, robust bi-directional<br />

connectivity (data and control), device<br />

management, software provisioning and OTA updates, protocol<br />

integration and data conversion, cloud integration, massive<br />

scalability, and more. The security, flexibility, and scale of the<br />

IoT Device Connectivity and Management Platform is a<br />

prerequisite to getting trustworthy IoT data into the application<br />

cloud. Without trusted device data there can be no IoT business<br />

value.<br />

2. IoT Application Cloud Platform<br />

The IoT Application Cloud Platform provides massive-scale<br />

device data ingestion, processing and storage, business<br />

applications, and enterprise orchestration. IoT cloud platforms<br />

typically rely on external mechanisms to provide device<br />

security, connectivity, and management, and this is where the<br />

IoT Device Connectivity and Management Platform comes in.<br />

The ease and flexibility of integration between the two<br />

platforms is critically important for real-world IoT solutions as<br />

it enables the organization to optimize the complete solution to<br />

the customer’s requirements.<br />

A Hybrid IoT Platform approach provides important<br />

benefits to companies looking to build IoT solutions:<br />

• Strikes an optimal balance between flexibility and<br />

time-to-market for many companies and their IoT<br />

use cases<br />

• Leverages proven platform implementations for<br />

common IoT functionality while enabling<br />

customization of the IoT solution to meet the<br />

customer’s unique needs<br />

• Simplifies procurement and support of the IoT<br />

solution with only two key vendors and well-defined<br />

responsibilities and integration points<br />

• Provides flexibility to evolve with the customer’s<br />

business needs, both on the device and on the<br />

cloud side, including expanding the solution with<br />

additional connectivity options and a wider range<br />

of cloud services<br />

III. DEMONSTRATION<br />

To demonstrate the Hybrid IoT Platform approach, we<br />

combine the Electric Imp Device Connectivity and<br />

Management Platform with the Microsoft Azure Cloud<br />

Platform to rapidly create a secure, industrial-grade,<br />

customizable device-to-cloud IoT application.<br />

The Electric Imp platform provides fully integrated edge<br />

device-to-cloud security, bi-directional low-latency<br />

connectivity, end-to-end management, device and cloud<br />

application platforms, ongoing security maintenance, and<br />

ready-to-use enterprise integrations to a range of popular IoT<br />

cloud platforms such as Microsoft Azure, Amazon Web<br />

Services, Salesforce IoT, GE Predix, and many others.<br />

The integration between the Electric Imp platform and<br />

Microsoft Azure is accomplished via the unique and powerful<br />

Electric Imp cloud middleware container, while the device<br />

integration, edge processing and data model is implemented via<br />



the Electric Imp managed device application container. This<br />

design makes processing the device data in the IoT business<br />

application straightforward because the data is trustworthy,<br />

appropriately processed, and correctly formatted when it<br />

reaches the IoT application cloud, avoiding impedance<br />

mismatches which are common in real-world IoT.<br />
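The kind of edge-side data shaping described above can be pictured as follows. The field names, schema tag, and sensor scaling in this sketch are purely hypothetical; they are not the Electric Imp or Azure data model:<br />

```python
import json

# Hypothetical edge-side shaping: a raw ADC reading is converted to
# engineering units and a stable record schema before upload, so the
# cloud application never sees device-specific quirks.
def to_cloud_record(device_id, raw_adc, ts):
    temperature_c = raw_adc * 0.125 - 40.0   # made-up sensor scaling
    return json.dumps({
        "deviceId": device_id,
        "ts": ts,
        "temperatureC": round(temperature_c, 2),
        "schema": "telemetry/v1",            # versioned, agreed-upon format
    }, sort_keys=True)

record = to_cloud_record("imp-001", 512, "2018-02-27T10:00:00Z")
print(record)
```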

For the demonstration, these are the high-level steps to build the device-to-cloud IoT application (with approximate time in parentheses):<br />

1. Securely connect and enroll the edge device into the Electric Imp platform (2 min)<br />

2. Create the IoT application template in the Microsoft Azure IoT application builder (5 min)<br />

3. Define the device template, device properties, visualizations, and rules in the Microsoft Azure IoT application builder (10 min)<br />

4. Authenticate the device with the Microsoft Azure IoT application via the Electric Imp platform (1 min)<br />

5. Deploy the Microsoft Azure integration into the Electric Imp cloud middleware container (2 min)<br />

6. Deploy the edge application into the Electric Imp device application container (2 min)<br />

7. Device and cloud applications now execute, and device data is sent to the Azure IoT application, where it is stored, visualized, and rules are applied (1 min)<br />

Total time to create, deploy, and execute: under 30 minutes.<br />

The demonstration can easily be expanded further. Additional data models, bi-directional communication and control, custom processing and filtering, and almost any other application functionality can be implemented by updating and simply re-deploying the business logic to the edge device and cloud (at the push of a button, using the Electric Imp OTA software provisioning).<br />

It is noteworthy that the resulting IoT solution inherits the industrial-grade properties of the underlying platforms, including full security, scalability, and manageability. The demonstration is built on a single 'shard' of a scalable system; scaling is horizontal from this point on, and the architecture of the Electric Imp connectivity platform allows scaling to many concurrent shards to support millions of devices, growing with the business needs of the company and minimizing engineering effort as the solution scales out.<br />

IV. CONCLUSION<br />

Creating, deploying, and supporting device-to-cloud IoT<br />

business applications can be complex and challenging.<br />

Bespoke IoT solutions provide maximum flexibility but are<br />

costly and time-consuming and are not a viable option for<br />

many companies. Off-the-shelf IoT solutions offer simplicity<br />

and fast time-to-market, but often lack the necessary flexibility<br />

to adapt and grow with business needs.<br />

A hybrid IoT platform approach can deliver the best option<br />

for many companies by combining a comprehensive IoT<br />

device connectivity and management platform with the<br />

customer’s preferred IoT application cloud platform. This<br />

approach integrates two proven and ready-to-use platforms to<br />

deliver the necessary common IoT functionality while<br />

providing the flexibility to easily adapt and extend the IoT<br />

solution to the customer’s needs.<br />

The result is a customized device-to-cloud IoT solution that<br />

can be delivered to market quickly and with low risk and which<br />

can evolve over time as the needs of the customer grow. This is<br />

what most companies need to successfully extend their core<br />

product and service business into the connected world of the Internet of Things.<br />

The Electric Imp IoT Edge-to-Enterprise Platform helps<br />

more than 100 customers around the world to build, ship, and<br />

manage their IoT solutions securely, effectively, and at<br />

massive scale, with more than 1 million devices, UL 2900-2-2<br />

Cybersecurity Certification, and pre-built integrations to<br />

leading cloud services like Microsoft Azure, Amazon Web<br />

Services, Salesforce IoT, GE Predix, and more.<br />

For more information, please see www.electricimp.com<br />



Agile Development and ISO26262<br />

Irwin Fletcher<br />

Quality Management, OpenSynergy GmbH<br />

Berlin, Germany<br />

irwin.fletcher@opensynergy.com<br />

Abstract—The Harvard Business Review recently praised Agile<br />

methods saying they have greatly increased success rates in<br />

software development, improved quality and boosted productivity.<br />

Yet there is a persistent belief that Agile is unsuitable for automotive software development, and that it serves as an excuse for undisciplined behavior when applied to developing software for embedded systems in regulated industries. This paper challenges this belief and shows how Agile can be combined with conformance<br />

to standards such as ISO26262, to deliver real business benefits.<br />

However, balancing Agile practices with conformance expectations<br />

is not simple and effortless. It requires a deep understanding of the<br />

intent behind both approaches. This paper describes where Agile<br />

techniques can be applied most usefully to software development,<br />

and importantly, where they should not be used. The primary focus<br />

of this paper is directed towards the development of a Safety<br />

Element out of Context Software Component.<br />

Keywords—Scrum; Safety-critical; Agile; ISO26262;<br />

Embedded Software Development; SEooC; Traceability; User<br />

Stories; Requirements<br />

I. SOFTWARE IN THE DRIVING SEAT<br />

Software is a major component of today’s<br />

automobile. The features that it delivers are an ever<br />

increasing factor in purchasing decisions. The<br />

Boston Consulting Group state that “Consumers<br />

want to purchase cars from companies that bring<br />

new technologies to market and do so quickly” [1].<br />

At the same time there is rising concern about<br />

relinquishing control to software. Many people do<br />

not fully trust the technology and express doubts<br />

about the safety of self-driving cars [2].<br />

The automotive industry is having to satisfy two<br />

differing demands. On the one hand to deliver<br />

innovative technology rapidly and, on the other, to<br />

ensure that software intensive systems are<br />

demonstrably safe and dependable.<br />

Software developers favor Agile methods, which should enable rapid product development, thus satisfying the first demand. Working to ISO26262, meanwhile, is recognized as guaranteeing functional safety, which should satisfy the second [3,4]. Given<br />

that applying ISO26262 raises the cost of<br />

development by a factor of around three to five<br />

times, any way this can be reduced is welcome.<br />

Agile and ISO26262 both bring benefits but since<br />

they are based on different premises a considerable<br />

amount of thought and adaptation is required for<br />

them to work together efficiently. Agile can be<br />

thought of as ‘organic’, whereas ISO26262 is<br />

‘mechanical’.<br />

As an innovation, Agile has “greatly increased<br />

success rates in software development, improved<br />

quality and speed to market and boosted motivation<br />

and productivity in IT teams” [5]. For safety, the<br />

ISO26262 standard for Functional Safety in Road<br />

Vehicles is another advance, but in a different<br />

direction. While Agile methods and ISO26262 both<br />

aim to deliver high quality software that performs<br />

appropriately, the methodologies implied by the<br />

standard and Agile are not interchangeable [6]. This<br />

is highlighted by the fact that the most popular Agile<br />

method, Scrum, is defined in the 20-page Scrum<br />

Guide whereas the ISO26262 standard spans 450<br />

pages and details 600 practice requirements. It is<br />

clear that they approach software development quite<br />

differently [7,8].<br />

The question is, can Agile practices be used to<br />

speed software development, while at the same time<br />

complying with ISO26262, and if so, how?<br />

II. WHAT IS THE DIFFERENCE?<br />

The fundamental difference between ISO26262<br />

and Agile is that ISO26262 can be considered to be<br />



based on traditional mechanical engineering type<br />

practices applied to software development. In this<br />

case using techniques such as top down traceability<br />

provides ‘belt and braces’ safety. Agile, on the other<br />

hand, develops software in an emerging, organic<br />

fashion, which dispenses with some of the<br />

traditional practices that are considered<br />

cumbersome, wasteful and time consuming.<br />

The removal of the ‘braces’ from the ‘belt and<br />

braces’ approach is not a problem in, say, an<br />

infotainment system where failure would not be life<br />

threatening.<br />

When considering Agile in safety-critical projects<br />

there are some Agile techniques that will not<br />

compromise safety and can be used. These are<br />

described in the next section. However, some Agile<br />

techniques are not directly compatible with<br />

ISO26262 but once adapted can yield benefits.<br />

These are described in subsequent sections and<br />

further illustrated by two Use Case studies.<br />

III. WORKING SMART WITH AGILE<br />

Automotive organizations have been slow in<br />

exploiting Agile techniques when compared with<br />

others, such as medical device manufacturers or the<br />

US Department of Defense. The latter two have had<br />

regulator approved guidance on using Agile for<br />

some years [9,10]. These organizations recognize<br />

that Agile methods provide a number of proven<br />

techniques which speed development and improve quality [11,12].<br />

There are some practices, such as pair<br />

programming, that simply offer an alternative way<br />

of working and so are not discussed here. The<br />

following, however, have been demonstrated to<br />

offer major advantages [13,14]. This has also been the experience of the author in diverse regulated<br />

industries.<br />

• Time Boxed Iterations: Short iterations where<br />

each produces a demonstrable update to the<br />

product. Usually 2-4 weeks in duration.<br />

Iterations begin with a planning session and<br />

finish with a demonstration to stakeholders of<br />

what has been completed. Stakeholder<br />

feedback then helps to optimize any future<br />

work.<br />

• No Change Rule: This is a mechanism to<br />

manage the persistent issue of ‘feature creep.’<br />

The rule states that during an increment no<br />

changes are allowed that would endanger the<br />

goal of the increment. Applying this rule<br />

results in more considered change demands,<br />

that are then dealt with during subsequent<br />

increment planning sessions.<br />

• The Product Owner Role: Sometimes called<br />

an Initiative Owner this role provides a single<br />

voice representing all of the project’s<br />

stakeholders. To work effectively the holder<br />

of the role needs to be given the authority to<br />

decide precisely the content of the final<br />

product. A single, clearly defined Product<br />

Owner with real authority buffers developers<br />

from the conflicting demands of multiple<br />

stakeholders, thus enabling the development<br />

team to focus on their job - developing code.<br />

• Daily Stand-ups: These are short, focused,<br />

regular, daily communications meetings which<br />

keep the development team informed of each<br />

member’s progress and issues. Note that<br />

discipline is required to avoid these meetings<br />

drifting into solution discussions.<br />

• Continuous Integration and Regression<br />

Testing: Here developers integrate code into a<br />

shared repository several times daily. Each<br />

check-in is then verified by an automated<br />

build, allowing teams to detect any problems<br />

as they arise.<br />

• Code Refactoring: To ensure safety, it is not<br />

sufficient to simply provide code that works,<br />

rather the objective is to deliver code that is<br />

dependable. It is therefore necessary to<br />

reconsider the specific implementation of code<br />

modules and their designs, as the project<br />

proceeds and understanding evolves. Code<br />

Refactoring, must be planned and controlled.<br />

It can then strengthen the robustness and<br />

safety of a system.<br />

• Test First Development: Here unit tests are<br />

developed alongside every code unit. This<br />

complements Continuous Integration and<br />

builds confidence in the state and stability of<br />

the overall codebase, as development<br />

proceeds. It does not, however, replace the<br />

need for Requirements Based System Testing<br />

to validation the Requirements.<br />

181


IV. AGILE VS ISO26262<br />

The previous section outlined several practices<br />

that can be helpful in improving the speed of<br />

development. Equally important is to appreciate the<br />

fundamental areas of difference between Agile<br />

thinking and that behind ISO26262. Acknowledging<br />

and understanding these differences is necessary<br />

before a blended solution can be considered.<br />

This section highlights five key areas where the<br />

Agile methods and ISO26262 could be considered in<br />

conflict and proposes potential adaptations. This<br />

discussion is particularly relevant to the<br />

development of a Software Component as a Safety<br />

Element out of Context (SEooC). In this case real-life Requirements are not available, and Safety and<br />

functional Requirements are assumed (invented) by<br />

the engineering team.<br />

A. Analysis vs Emergence<br />

In the Agile method decisions as to how much<br />

documentation a project requires are made by the<br />

team itself. This contrasts with ISO26262, which prescribes the types and the content of the<br />

documents it requires. This difference need not of<br />

itself create a problem, as in theory, Agile would<br />

also allow a team to define and create an ISO26262<br />

compliant set of documents.<br />

The challenge, however, arises due to major<br />

differences in underlying philosophy. In Agile the<br />

thinking is that the needs (requirements) of a project<br />

cannot be fully captured at the start of a project,<br />

because at this stage people do not know exactly<br />

what they need. Thus, creating specifications and<br />

upfront documentation is a waste of time, since it is<br />

not known what will, and what will not, work<br />

[15,16]. The Agile way is to allow details to emerge<br />

during iterative development. Agile gurus dictate<br />

that “Scrum projects do not have an upfront analysis<br />

or design phase” [17].<br />

For a safety development, ISO26262 rightly<br />

presumes that a safety analysis and safety<br />

requirements are defined before further development<br />

occurs. Nevertheless, there is merit in the Agile<br />

concern that effort can be wasted defining details<br />

that will inevitably change. In this case, the spirit of<br />

the Agile thinking can be applied by creating the<br />

initial safety concepts iteratively. The key here is to<br />

cover the complete scope of the project in the<br />

documents but to limit the level of detail specified to<br />

help bring stakeholders to consensus as to what is to<br />

be delivered. Detail can be added to the areas as they<br />

are selected for further elaboration and development.<br />

The use of facilitated workshops can be<br />

particularly useful here because the safety concepts<br />

and requirements are based on expert judgement<br />

which can be rapidly explored in workshop settings<br />

[18].<br />

B. Requirements vs User Stories<br />

In the author's experience, Agile can cease to be<br />

Agile when it is elevated to dogma. This can lead to<br />

good Agile practices in one context being<br />

inappropriately applied elsewhere. This is true of<br />

User Stories. As Per Lundholm, an Agile coach,<br />

correctly says “Elephants are not giraffes and User<br />

Stories are not Requirements” [19]. Yes, the animals<br />

in this case are both large four-legged mammals that<br />

live in Africa, but they are not the same beast. In the<br />

same way User Stories are useful, but they are not<br />

Requirements.<br />

User Stories are particularly suitable when<br />

defining Human Computer Interactions (HCI) or for<br />

detailing how diagnostic tools should work for<br />

developers and integrators. They have clearly helped<br />

generate many appropriate and user-friendly<br />

applications.<br />

It is also important to acknowledge that User<br />

Stories are not the only form of Requirements<br />

definition necessary especially in the case of<br />

automotive software development. During<br />

architecture and design new requirements tend to<br />

emerge. These relate to how the software interacts<br />

with the other components and hardware and are<br />

‘Engineering Requirements’ that represent the shift<br />

from the problem in the outer world to the inner world<br />

of the machine [20]. That is, Engineering<br />

Requirements typically define the behaviors of subsystems<br />

and components, such as actuator and<br />

sensor operations or timing constraints for operating<br />

systems.<br />

Recognizing the difference between these<br />

Engineering Requirements and User Stories is<br />

critical to avoid confusion. User Stories, as typically<br />

derived from Agile techniques such as User<br />

Journeys and Personas, are often stated in the form<br />

of “As a … I need…in order to…”. This format does<br />

not work for Engineering Requirements, which are best<br />

specified using the “The system shall…” format<br />

www.embedded-world.eu<br />

advocated by Requirements Engineering standards<br />

such as IEEE 29148.<br />

As an illustration, the author has encountered the<br />

following kind of ‘User Story’ “As an interrupt<br />

handler, I need to prioritize and pass on interrupts<br />

according to the timing criteria, so that the system<br />

can react to interrupts”. This misguided thinking is<br />

what happens if User Stories are applied in a<br />

doctrinaire fashion. Rather than gaining clarity the<br />

opposite is achieved. The ‘story’ has become ‘story<br />

telling’. In fact the appropriate technique is to use<br />

the Engineering Requirements format.<br />

A further problem experienced relates to Design.<br />

If Requirements define ‘what’ the system behavior<br />

should be, then Design defines ‘how’ the<br />

Requirement will be achieved. A valid User Story<br />

might say “As a car owner I want to securely park<br />

my car using my smartphone”. In the Agile method<br />

this User Story is then broken down into smaller<br />

stories, one of which might be “to start the parking<br />

app I need to enter two randomly selected digits of<br />

my password”. Here developers have added a ‘how’<br />

disguised as a ‘what’, and created a User Story that<br />

is highly unlikely to have been requested by any<br />

user.<br />

This arises partially because, as mentioned<br />

before, Scrum does not have a design phase. Given<br />

that architecture and design are where safety<br />

countermeasures are defined, using this ‘pure’ Agile<br />

philosophy is clearly dangerous in a safety-critical<br />

development, and would be considered professional<br />

malpractice [13].<br />

With safety-critical projects, the overall Design<br />

does need to be documented early in the<br />

development. Where Agile thinking can be useful,<br />

just as with Safety Requirements, is in limiting<br />

detail during initial Design. By creating just enough<br />

of a Design Specification at the outset to span the<br />

full scope of the project, without covering the detail,<br />

it is possible to get agreement on the overall<br />

adequacy of the safety countermeasures. Details are<br />

then worked out as the project evolves.<br />

C. Traceability vs Frequent Feedback<br />

What is traceability? Traceability is being able to<br />

demonstrate completeness and go back through the<br />

steps used to develop the solution and thereby<br />

manage changes successfully. Normally Agile<br />

developers do not concern themselves with<br />

traceability at all. They rely on the Stakeholder<br />

Feedback Sessions at the end of iterations to ensure<br />

an optimal solution (which will often differ from the<br />

original conception). When changes are required<br />

User Stories describing the change are created and<br />

implemented. For products such as apps for mobile<br />

phones this is perfectly adequate. However for<br />

ISO26262 traceability is mandatory.<br />

One way of understanding how Agile<br />

development works is to compare it to a fluid, which<br />

is allowed to flow. In order to apply traceability, the<br />

fluid must stabilize and change state from liquid to<br />

solid, from water to ice. Increments can build parts<br />

of the solution but only once a part has reached a<br />

stable state can traceability be applied. Note that not<br />

all parts of an artifact (such as a Design<br />

Specification) will stabilize at the same time. To<br />

apply this in a blended solution requires clear<br />

criteria to be defined in order to know when<br />

stabilization has actually occurred for each type of<br />

artifact or part of an artifact, such as a design<br />

component [21].<br />

Attempting to apply traceability when<br />

development is still at the fluid stage, before<br />

stabilization has occurred, or retrospectively once<br />

the development is finished, expends time and effort<br />

without any appreciable benefit to safety.<br />

D. Agile Tools<br />

Because Agile has been so incredibly successful<br />

in certain applications it has created what might be<br />

termed ‘Agile Dazzle’. In the rush to adopt Agile<br />

organizations tend to buy in Agile toolsets without<br />

fully analyzing the different requirements of the<br />

software in their own field.<br />

Understandably, Agile tools, such as the popular<br />

Atlassian JIRA, are designed to support an Agile<br />

way of working [22,23]. Generally, these tools<br />

provide for the management of User Stories (and<br />

Bugs) but have no separate mechanisms for Design.<br />

The author has experienced how complex<br />

adapting such tools to regulated environments can<br />

be, often requiring additional commercial<br />

applications and plugins and subsequent writing of<br />

scripts to join it all together. In one organization<br />

modification to the JIRA product itself was needed<br />

to create the necessary management reports.<br />

Consequently, Agile toolsets will not work<br />

‘straight out of the box’ for software conforming to<br />

ISO26262. These additions also complicate the<br />

required Tool Confidence Analysis.<br />

Agile tools, used intelligently, do have a useful<br />

place in conformant developments. They are<br />

particularly suitable in managing the allocation of<br />

tasks in iterative development, as well as for defect<br />

resolution. Nevertheless, it is only once<br />

understanding has been acquired of the uses and<br />

limitations of Agile that these tools can be<br />

successfully integrated and applied. Experience<br />

shows that, until this is achieved, such tools risk<br />

complicating rather than simplifying the<br />

developer's task.<br />

E. Process vs People<br />

Automotive organizations have traditionally been<br />

hierarchical in structure, with accountability for<br />

success and failure resting solely on the shoulders of<br />

Management. In contrast, Agile methods promote<br />

flatter structures and push accountability down to<br />

development teams. In other words, Agile places its<br />

trust in people, whilst ISO26262 places it in<br />

processes. But processes don’t think. “Safety is<br />

demonstrated not by compliance with prescribed<br />

processes, but by addressing the hazards, mitigating<br />

those hazards and showing that the residual risk is<br />

acceptable” [25]. True safety arises out of acquired<br />

skills and experience of the Safety Engineers and<br />

Development Teams.<br />

Moving to an Agile way of working requires<br />

people to change how they work. This applies both<br />

to Management and to Development Teams.<br />

Managers need to relinquish some aspects of control<br />

and developers need to take on more responsibility.<br />

This is a significant shift in behavior and not<br />

surprisingly can give rise to resistance. In the<br />

author’s experience dealing with this factor cannot<br />

be overlooked when an automotive organization<br />

wishes to adopt Agile practices.<br />

V. APPROACHING A SOLUTION<br />

The following two examples are use cases<br />

demonstrating how a mindful blending of Agile<br />

methods and ISO26262 can be achieved that<br />

exploits the advantages of both. Firstly, an upgrade<br />

to an established product is examined and secondly a<br />

case of a new development within a known field of<br />

expertise.<br />

A. Upgrading an established product<br />

Here the situation considered is that of an already<br />

developed conformant product which is to be<br />

updated. In this case there is therefore ample<br />

knowledge of the original requirements and designs,<br />

traceability is in place, and the direction of the<br />

changes is already clear.<br />

The development could proceed without using<br />

Agile methods but an Agile approach provides<br />

appreciable gains. By running two cycles<br />

simultaneously the release date can be brought<br />

forward. Additionally, with careful planning,<br />

working interim releases can be made before the full<br />

change is implemented.<br />

In a two-cycle approach both cycles are managed<br />

using time-boxed iterations, stand-ups and show<br />

and tells. See Figure 1.<br />

• Change Management: Initial change planning<br />

begins before starting the update.<br />

• Safety Case: The Safety Case is required by<br />

ISO26262 and provides the evidence of<br />

conformance to the standard. It also contains<br />

technical arguments explaining how the<br />

technical approach delivers safe operation.<br />

• Design Cycle: There can be several design<br />

cycles each providing a stable group of<br />

designs, traceable to requirements that can be<br />

passed to the implementation cycle. Planning<br />

is crucial to create a workflow where these<br />

design groups are independent.<br />

• Implementation Cycle: This takes groups of<br />

finalized designs and constructs the product<br />

iteratively. This cycle includes updates to the<br />

Safety Case and eventually the final release.<br />

B. A new development<br />

When a development undertakes the creation of a<br />

new software product the direction of the final<br />

solution will be subject to uncertainties and stability<br />

will take time to materialize. This is then well suited<br />

to Agile methods that allow requirements, design<br />

and construction to influence each other. (Figure 2)<br />

By running Development Increments interspersed<br />

every so often with Stabilization Increments, a<br />

balance can be struck between Agility and<br />

Conformance. The key activities are:<br />

• Startup: A time-boxed stage usually 3-4 weeks<br />

in duration. Several facilitated workshops are<br />

held to scope and plan the work for the<br />

release. The initial topics such as failure<br />

analysis, requirements and architectures are<br />

created complete in scope but not in detail.<br />

• Development Increments: Each increment is<br />

typically 2-3 weeks in duration. Increments<br />

start with a planning session setting the<br />

increment goal and selecting the work to be<br />

undertaken and end with a demonstration of<br />

the functions developed in a show and tell.<br />

Daily stand-ups are held to provide frequent<br />

team communication with a retrospective<br />

completing the activities before starting the<br />

next iteration.<br />

• Stabilization Increments: When a sufficient<br />

amount of functionality is considered to be<br />

"done," a stabilization increment delivers the<br />

updated set of materials and assets for the<br />

work to date. These materials and assets are<br />

then both detailed and consistent with the<br />

implementation of the system and provide any<br />

conformant deliverables in an efficient<br />

manner. A stabilization increment would be<br />

expected between the fourth and sixth<br />

development increments.<br />

• Consolidation: This is where final testing is<br />

completed. Safety documents such as the<br />

safety manual, test completion report and<br />

safety case are finalized.<br />

• Release and Closure: This short step is where<br />

a decision to release or not is made. Following<br />

release the project materials and assets are<br />

baselined ready for further development and<br />

change management if required.<br />

VI. CONCLUSION<br />

Deep Shift, the World Economic Forum report,<br />

states that “The seamless integration of the physical<br />

and digital worlds through networked sensors,<br />

actuators, embedded hardware and software will<br />

change industrial models. In short, the world is<br />

about to experience an exponential rate of change<br />

through the rise of software and services” [26].<br />

This shift brings both opportunities and dangers<br />

for us all. Those of us working on embedded<br />

automotive software systems are at the forefront of<br />

bringing about these changes that societies will<br />

adapt to over the coming decades.<br />

This paper is an initial response to the challenge<br />

of meeting consumer demands during a time of<br />

accelerating change, whilst maintaining<br />

safety. It proposes that this is possible when it is<br />

recognized that Agile methods are flexible as indeed<br />

is ISO26262. Just as ice and water are one<br />

substance in two different states, in a similar way it<br />

is possible for the two approaches to complement<br />

each other.<br />

People also need to change patterns of behavior<br />

and thinking. Evangelical Agilists will have to let<br />

go of Agile as dogma, automotive engineers of long<br />

standing to recognize the advantages of Agile as<br />

well as the responsibility it entails, managers to<br />

replace reliance on process with trust in their<br />

developers.<br />

Furthermore, the question could usefully be<br />

asked as to whether ISO26262 itself needs to be<br />

more Agile. Meanwhile, using the adaptations outlined<br />

here, it is already possible to derive Agile benefits<br />

while remaining true to ISO26262.<br />

REFERENCES<br />

[1] Xavier Mosquet, Massimo Russo, Kim Wagner, Hadi Zablit,<br />

and Aakash Arora. Boston Consulting Group. Accelerating Innovation:<br />

New Challenges for Automakers, January 22, 2014.<br />

[2] Hillary Abraham, Bryan Reimer, Bobbie Seppelt, Craig Fitzgerald,<br />

Bruce Mehler & Joseph F. Coughlin, Massachusetts Institute of<br />

Technology, Consumer Interest in Automation: Preliminary<br />

Observations. Exploring a Year’s Change. 2017<br />

[3] C. Binder (Microsoft GmbH), T.Hemmer (conolement AG), S.Kukn<br />

(Porsche Consulting GmbH) and C.Mies (Electrobit Automotive<br />

GmbH), Microsoft White Paper. Adaptive Automotive Development.<br />

[4] Sergej Weber. Kugler Maag CIE. May 2015 Agile in Automotive – State<br />

of Practice 2015.<br />

[5] Darrell. K. Rigby, Jeff Sutherland, Hirotaka Takeuchi, Harvard Business<br />

Review. May 2016 Embracing Agile.<br />

[6] Steve Palmquist, Mary Ann Lapham, Suzanne Garcia-Miller, Timothy<br />

A. Chick, Ipek Ozkaya, Software Engineering Institute Parallel Worlds:<br />

Agile and Waterfall Differences and Similarities CMU/SEI-2013-TN-<br />

021<br />

[7] Ken Schwaber and Jeff Sutherland November 2017 The Scrum Guide.<br />

Scrum.org<br />

[8] ISO 26262:2011 Road vehicles -- Functional safety, parts 1-10,<br />

International Standards Organisation.<br />

[9] IR45:2012. Association for the Advancement of Medical<br />

Instrumentation. Guidance on the use of agile practices in the<br />

development of medical device software. ISBN 1570204454.<br />

[10] Kathleen Mayfield, Robert Benito, Michelle Casagni. 2010 Mitre<br />

Corporation, Handbook for Implementing Agile in Department of<br />

Defence, Information Technology Acquisition.<br />

[11] QSM. 2009 Beyond the hype: Measuring and Evaluating Agile<br />

Development, white paper.<br />

[12] Standish Group. 2015 Chaos report<br />

[13] Bertrand Meyer. Agile! The Good, the Hype and the Ugly Springer.<br />

ISBN 9783319051543<br />

[14] Scott. W. Ambler, Mark Lines. Disciplined Agile Delivery: A<br />

Practitioner's Guide to Agile Software Delivery in the Enterprise, 2012<br />

IBM Press, ISBN 9780132810135<br />

[15] Ken Schwaber; Jeff Sutherland . Software in 30 Days: How Agile<br />

Managers Beat the Odds, Delight Their Customers, And Leave<br />

Competitors In the Dust., John Wiley & Sons, 2012<br />

[16] D. Snowden, M.Boone. A leader’s framework for decision making.<br />

Harvard Business Review. November 2007<br />

[17] M. Cohn. User Stories Applied. Addison-Wesley Professional, 2004<br />

[18] Roman Pichler Agile Product Management with Scrum: Creating<br />

Products that Customers Love. Addison-Wesley Professional March 22,<br />

2010<br />

[19] Per Lundholm. User Stories are not Requirements. Blog post, 2016.<br />

Chrisp.se<br />

[20] M Jackson Problem Frames., Proceedings 14th Asia-Pacific Software<br />

Engineering Conference (APSEC 2007)<br />

[21] K. Collyer, J. Manzio Being agile while still being compliant, INCOSE<br />

2013. Annual Systems Engineering Conference 2013 (ASEC2013).<br />

[22] Scrum, documentation and the IEC 61508-3:2010 software standard<br />

[23] Thomas E. Murphy, Mike West, Keith James Mann . Magic Quadrant<br />

for Enterprise Agile Planning Tools. April 2017 Gartner Inc.<br />

[24] VersionOne. 11th Annual State of AgileTM Report. April 2017.<br />

[25] A. Ray Acceptable and residual risk. Quoted in C.Hobbs Embedded<br />

Software Development for Safety Critical systems, CRC Press 2016.<br />

[26] Deep Shift. Technology Tipping Points and Societal Impact. World<br />

Economic Forum. Report. September 201<br />



AUTOSAR –<br />

Development of a New C++ Standard<br />

Dr. Frank van den Beuken<br />

Senior Technical Consultant, Programming Research<br />

Ashley Park House, 42-50 Hersham Road, Walton on Thames<br />

Surrey, KT12 1RZ, United Kingdom<br />

Frank_van_den_Beuken@prqa.com<br />

Abstract—MISRA C++ [1], the most widely adopted C++<br />

coding standard in the automotive industry, has not been updated<br />

since its publication in 2008. It does not cover C++ language<br />

features introduced in the later ISO C++ standards published in<br />

2011 (C++11 [7]) and 2014 (C++14 [2]). In March 2017, the<br />

AUTOSAR (Automotive Open System Architecture) partnership<br />

released a new coding standard, “Guidelines for the use of the<br />

C++14 language in critical and safety-related systems”, which was<br />

updated in October 2017 [4].<br />

The AUTOSAR standard incorporates existing rules and<br />

guidelines from other standards such as MISRA C++ and High<br />

Integrity C++ [5]. These have been reviewed, modified and<br />

extended. The AUTOSAR standard provides a comprehensive set<br />

of guidelines for using the modern C++ language in safety-critical<br />

systems. We will compare the standard with other popular C++<br />

coding standards and explain its relationship with the automotive<br />

functional safety standard ISO 26262 [6].<br />

Keywords—Software engineering; AUTOSAR; MISRA; safety;<br />

coding standards; compliance<br />

I. INTRODUCTION<br />

Software development is increasingly important for<br />

automotive applications. Increasingly demanding safety,<br />

environmental, and convenience requirements have sharply<br />

increased the number of electronic systems found in vehicles.<br />

Ninety percent of all innovations are based on software-driven<br />

electronic components. These components account for up to<br />

forty percent of a vehicle’s development cost. The pace of<br />

development and the continual need to integrate more functions<br />

and control units, pose a significant challenge for vehicle<br />

manufacturers. This paper gives a brief overview of the new<br />

AUTOSAR Coding Guidelines and offers guidance on how to<br />

comply with them.<br />

II. WHAT IS AUTOSAR?<br />

AUTOSAR (AUTomotive Open System ARchitecture)<br />

aims to standardize and future-proof basic software elements,<br />

interfaces and bus systems, to help vehicle manufacturers<br />

manage growing system complexity while keeping costs down.<br />

It develops standardized open software architectures for<br />

automotive Electronic Control Units (ECUs).<br />

As a partnership of over 180 automotive manufacturers,<br />

automotive suppliers, tool vendors and semiconductor vendors,<br />

AUTOSAR’s core members include: BMW, Bosch,<br />

Continental, Daimler, Ford, GM, PSA, Toyota and Volkswagen.<br />

The first open architecture developed by AUTOSAR, the<br />

‘Classic Platform’, is intended for vehicle functions with strict<br />

real-time requirements and safety criticality, implemented on<br />

basic microcontrollers. Now, AUTOSAR has developed a new<br />

standard called the ‘Adaptive Platform’ for connected and<br />

autonomous vehicles. This is intended to meet the rapidly<br />

growing market needs for connected vehicle and highly<br />

autonomous driving technologies. Examples of technologies<br />

driving the adaptive platform standard include: high-powered<br />

32-/64-bit microprocessors with external memory, parallel<br />

processing and high bandwidth communications.<br />



Fig. 1 AUTOSAR platforms – classic and adaptive<br />

Software developed according to the Adaptive Platform<br />

standard can integrate with existing systems built according to<br />

the AUTOSAR ‘Classic Platform’ standard.<br />

The Classic Platform explicitly allowed for implementations<br />

in C, C++ and Java, but C was the dominant programming<br />

language used. Now, the APIs within the Adaptive Platform are<br />

defined in C++, suggesting that AUTOSAR views C++ as the<br />

language of choice for new Adaptive Platform components.<br />

C and C++ are the dominant programming languages used<br />

for automotive embedded systems. This is largely because they<br />

permit direct, deterministic control of hardware, and give<br />

flexibility to the developer. This also brings risk. It is possible to<br />

compile code that has undefined behavior, or code that is not<br />

guaranteed to behave the same way when compiled and run on<br />

different target hardware. Even the most experienced developer<br />

can introduce defects inadvertently.<br />

III. WHAT ARE THE AUTOSAR CODING GUIDELINES?<br />

In order to help ensure the safety and security of the code<br />

written by implementers of AUTOSAR software, AUTOSAR<br />

invited PRQA to become a development partner, and join the<br />

working group to develop the “Guidelines for the use of the<br />

C++14 language in critical and safety-related systems” (the<br />

‘Guidelines’). As the exclusive static analysis<br />

partner in AUTOSAR we have contributed our expertise in the<br />

C++ programming language and best-practice software<br />

development gained over the last 30 years.<br />

The AUTOSAR Guidelines specify 342 coding rules. 154 of<br />

these are adopted directly from the widely adopted MISRA C++<br />

standard. 131 are based on rules defined in other well-known<br />

coding standards, such as PRQA’s High Integrity C++. 57 are<br />

based on research or other resources. The Guidelines permit<br />

some of the language features prohibited by some previous<br />

standards. Examples include dynamic memory management,<br />

exceptions, inheritance, templates and virtual functions. There<br />

are rules to ensure that these language features are used only in<br />

a safe manner.<br />

One of the principles of AUTOSAR development is to<br />

validate specifications in parallel with the standardization. The<br />

Adaptive Platform is validated through an AUTOSAR internal<br />

implementation, written in C++, known as the Adaptive<br />

Platform Demonstrator. AUTOSAR used the advanced<br />

QA·C++ analysis tool from PRQA, the exclusive static analysis<br />

development partner for AUTOSAR, to ensure the quality of the<br />

Demonstrator source code and verify compliance with the<br />

coding guidelines.<br />

IV. WHY ARE THE AUTOSAR CODING GUIDELINES NEEDED?<br />

Prior to the AUTOSAR Guidelines, there was no appropriate<br />

coding standard available for the use of modern C++ standards<br />

(C++11 and C++14) in safety-critical software. Available<br />

standards were either incomplete, written for legacy C++<br />

standards, or were not applicable for safety-critical applications.<br />

The most widespread C++ coding standard in the automotive<br />

industry, MISRA C++:2008 [1] was written for C++03 [7],<br />

which is over 14 years old.<br />

There have been a number of changes since the introduction<br />

of C++03 which have reduced the relevance of the MISRA<br />

standard for the AUTOSAR project:<br />

1. Evolution of C++<br />

2. Compiler improvements<br />

3. Improvements to testing, verification and analysis tools<br />

4. Creation of the ISO 26262 Vehicle Functional Safety<br />

Standard<br />

5. Assimilation of a broader base of safety and security<br />

expertise into additional standards such as:<br />

High Integrity C++ [5]<br />

Joint Strike Fighter Air Vehicle C++ [8]<br />

CERT C++ [9]<br />

C++ Core Guidelines [10]<br />

AUTOSAR designed the Guidelines to be used as an<br />

extension to the existing MISRA C++ standard. It specifies new<br />

rules and updates to MISRA rules as well as stating which<br />

MISRA rules are obsolete.<br />

V. WHO WILL USE THE AUTOSAR CODING GUIDELINES?<br />

The Objectives section of the Guidelines states: “The main<br />

application sector is automotive, but it can be used in other<br />

embedded application sectors. The AUTOSAR C++14 Coding<br />

Guidelines addresses high-end embedded microcontrollers that<br />

provide efficient and full C++14 language support, on 32- and<br />

64-bit microcontrollers, using POSIX or similar operating<br />

systems.”<br />

PRQA recommends, therefore, that any organization<br />

developing embedded software in C++14 should consider using<br />

these Guidelines.<br />

VI. HOW DO THE AUTOSAR CODING GUIDELINES COMPARE TO OTHER CODING STANDARDS?<br />

A. Traceability to existing standards<br />

Appendix A of the AUTOSAR Coding Guidelines document<br />

gives details about the traceability of the guidelines to five<br />

widely adopted C++ coding standards: MISRA C++, High<br />

Integrity C++ 4.0, JSF, SEI CERT C++ and the C++ Core<br />

Guidelines.<br />

For each rule of these standards it is established how it relates<br />

to the AUTOSAR Guidelines. A rule can be categorized as:<br />

1. Identical (only for MISRA C++): the rule text,<br />

rationale, exceptions, code example are identical. Only<br />

the rule classification can be different. There can be<br />

also an additional note with clarifications.<br />

2. Small differences: the content of the rule is included by<br />

AUTOSAR Guidelines rules with minor differences.<br />

3. Significant differences: the content of the rule is<br />

included by AUTOSAR Guidelines with significant<br />

differences.<br />

4. Rejected: the rule in the referred document is rejected<br />

by AUTOSAR Guidelines.<br />

5. Not yet analyzed: at the time of release of the<br />

Guidelines, the review of all standards was incomplete,<br />

so a number of rules are still to be analyzed.<br />

The chart below gives a summary of the comparison.<br />

TABLE 1 SUMMARY OF TRACEABILITY TO EXISTING STANDARDS<br />

[Chart: per-standard counts of rules across the traceability categories 1 - Identical, 2 - Small differences, 3 - Significant differences, 4 - Rejected, 5 - Not yet analyzed. C++ Core Guidelines: 160, 44, 120, 81; SEI CERT C++: 36, 24, 29, 62; JSF: 105, 20, 47, 53; High Integrity C++: 101, 20, 20, 11; MISRA C++: 146, 35, 26, 20.]<br />



Because the Guidelines are based on MISRA C++, it could<br />

be expected that this is where the largest overlap can be seen.<br />

The second largest overlap is with High Integrity C++ followed<br />

by JSF, C++ Core Guidelines and finally SEI CERT C++. It<br />

must be noted, however, that CERT C++ has the largest portion<br />

of rules that still need to be analyzed which may change its<br />

position relative to the other standards. In the following sections,<br />

we will discuss the comparison in more detail for each standard<br />

and also how the AUTOSAR Guidelines relate to ISO 26262.<br />

B. Traceability to MISRA C++<br />

The AUTOSAR Guidelines disagree with MISRA C++ on a<br />

number of topics. One significant topic is “single point of exit”.<br />

The Guidelines argue that this rule often leads to code that is<br />

harder to read, maintain or test. Furthermore, the C++ language<br />

provides exceptions, and throwing an exception that is not<br />

caught in the same function, also stops execution of that function<br />

which constitutes an exit point. MISRA C++ treats this as an<br />

exception to the rule. AUTOSAR Guidelines take the view that<br />

multiple exit points are acceptable, not least because the introduction<br />

of extra variables for result values can cause dataflow anomalies.<br />
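The contrast can be sketched in a few lines of C++14; the function names and logic here are illustrative only, not taken from either standard.<br />

```cpp
#include <string>

// Early-return style (acceptable under the AUTOSAR Guidelines):
// each failed precondition exits the function immediately.
int parse_digit(const std::string& s) {
    if (s.size() != 1) { return -1; }             // exit point 1
    if (s[0] < '0' || s[0] > '9') { return -1; }  // exit point 2
    return s[0] - '0';                            // exit point 3
}

// Single-exit equivalent (MISRA C++ style): the extra `result`
// variable must be tracked through every branch -- the kind of
// dataflow anomaly the Guidelines warn about.
int parse_digit_single_exit(const std::string& s) {
    int result = -1;
    if ((s.size() == 1) && (s[0] >= '0') && (s[0] <= '9')) {
        result = s[0] - '0';
    }
    return result;
}
```

Both functions behave identically; the difference is only in how the control flow is expressed.<br />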

Another topic where the Guidelines deviate from MISRA<br />

C++ is dynamic memory management. This was forbidden by<br />

MISRA C++, but is allowed by the Guidelines. Instead it<br />

introduces new rules to prevent issues that may arise from using<br />

dynamic memory, such as memory leaks, memory<br />

fragmentation, invalid memory access, erroneous memory<br />

allocations and non-deterministic execution time of memory<br />

allocation and deallocation.<br />
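As a hedged illustration of the RAII-based style that such rules point toward (the types and names below are invented for this example, not drawn from the Guidelines), smart pointers address leaks and double-frees, while up-front reservation is one common tactic against non-deterministic allocation timing:<br />

```cpp
#include <memory>
#include <vector>

struct Sample { int value; };

// Ownership via std::unique_ptr: the object is freed automatically
// when the pointer goes out of scope, preventing memory leaks.
std::unique_ptr<Sample> make_sample(int v) {
    return std::make_unique<Sample>(Sample{v});
}

// Reserving capacity up front confines allocation to one known point,
// so the push_back loop performs no further allocations.
std::vector<int> collect(int n) {
    std::vector<int> out;
    out.reserve(static_cast<std::size_t>(n));  // single allocation here
    for (int i = 0; i < n; ++i) {
        out.push_back(i);                      // no allocation in the loop
    }
    return out;
}
```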

C. Traceability to High Integrity C++ 4.0<br />

HIC++ has a number of rules about code metrics and coding<br />

style, on which the Guidelines impose no limitation. In addition,<br />

the Guidelines allow two levels of pointer indirection where<br />

HIC++ only allows one. On the other hand, the Guidelines<br />

require use of noexcept, protected and =delete in more places.<br />
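A minimal sketch of those constructs (the class is hypothetical, chosen only to show the syntax):<br />

```cpp
#include <utility>

// =delete makes copying ill-formed at compile time rather than a
// runtime surprise; noexcept documents and enforces that construction,
// moves and queries cannot throw.
class Buffer {
public:
    Buffer() noexcept = default;
    explicit Buffer(int n) noexcept : size_(n) {}
    Buffer(const Buffer&) = delete;             // copying forbidden
    Buffer& operator=(const Buffer&) = delete;
    Buffer(Buffer&& other) noexcept : size_(other.size_) { other.size_ = 0; }
    int size() const noexcept { return size_; }
private:
    int size_{0};
};
```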

HIC++ heavily restricts use of the preprocessor; it may only<br />

be used for file inclusion and include guards. The Guidelines<br />

allow conditional file inclusion and use of path specifiers in<br />

include statements.<br />
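The permitted uses can be shown in a short header sketch (the file name, guard macro and constant are invented for illustration):<br />

```cpp
// sensor_config.hpp (hypothetical): the include guard and #include
// lines are the only preprocessor uses HIC++ permits; the conditional
// inclusion below is additionally allowed by the AUTOSAR Guidelines.
#ifndef SENSOR_CONFIG_HPP
#define SENSOR_CONFIG_HPP

#if defined(TARGET_POSIX)      // conditional file inclusion
#include <ctime>
#endif

constexpr int kMaxSensors = 8; // constants as constexpr, not #define

#endif // SENSOR_CONFIG_HPP
```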

D. Traceability to JSF<br />

JSF contains a number of rules about source code layout and<br />

naming conventions, which the Guidelines do not address. JSF<br />

also does not allow use of exceptions, for which the Guidelines<br />

provide rules. On the other hand, JSF does allow some form of<br />

multiple inheritance, where the Guidelines only allow it for<br />

implementing multiple interfaces. JSF does not allow type<br />

casting on pointer types, whereas the Guidelines only forbid<br />

casting pointer types to integral types. JSF allows two<br />

levels of pointer indirection where the Guidelines only allow<br />

one.<br />

E. Traceability to CERT C++<br />

CERT C++ provides rules for use of C memory allocation<br />

and IO functions, errno, variadic arguments, and other<br />

functions that are all prohibited by the Guidelines.<br />

CERT C++ allows defining function macros as long as there are<br />

no side effects in the arguments, whereas the Guidelines forbid<br />

defining function macros.<br />
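The hazard behind that prohibition can be demonstrated in a few lines (the macro and function names are the author's own, not taken from either standard): a function-like macro pastes its argument into each occurrence in the expansion, so a side effect in the argument fires more than once, whereas a constexpr function evaluates its argument exactly once and is type-checked.<br />

```cpp
// Macro form: the argument expands twice inside the body.
#define SQUARE_MACRO(x) ((x) * (x))

// Safe replacement: one evaluation, full type checking.
constexpr int square(int x) { return x * x; }

int call_count = 0;

int next_value() {
    ++call_count;   // observable side effect
    return 3;
}

int macro_result()    { return SQUARE_MACRO(next_value()); } // runs next_value() twice
int function_result() { return square(next_value()); }       // runs next_value() once
```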

F. Traceability to C++ Core Guidelines<br />

The Core Guidelines allow the use of multiple inheritance to<br />

represent the union of implementation attributes, where the<br />

AUTOSAR Guidelines prohibit multiple inheritance. The<br />

AUTOSAR Guidelines are also stricter on the use of virtual<br />
inheritance, which they allow only for diamond hierarchies. The<br />
Core Guidelines include rules for the use of concepts, which are<br />
not covered by the Guidelines because concepts are not part of<br />
any ISO C++ language standard.<br />

G. Relationship with ISO 26262<br />

ISO 26262 is a Functional Safety standard, entitled “Road<br />

vehicles – Functional Safety”. The standard is derived from the<br />

Functional Safety standard IEC 61508 titled “Functional safety<br />

of electrical/electronic/programmable electronic safety-related<br />

systems”. As such, it covers all aspects of system development,<br />

and is not a coding standard. Part 6 [6] exclusively covers<br />
software; it does not prescribe the use of any specific<br />
programming language, but merely specifies compliance tables<br />

with recommendations for the use of certain methods in<br />

software development for each automotive safety integrity level<br />

(ASIL). The ASIL can range from A to D where D has the<br />

strictest requirements (highest integrity level) and A has the<br />

least. The ASIL is determined by performing a risk analysis of<br />

a potential hazard by looking at the Severity, Exposure and<br />

Controllability of the vehicle operating scenario. For each ASIL<br />

the recommendation is one of:<br />
“o”: there is no recommendation for or against use of the method;<br />
“+”: the method is recommended;<br />
“++”: the method is highly recommended.<br />

Each compliance table method is identified by a number and<br />
a letter. For each number, a suitable combination of methods<br />
with that number needs to be implemented. So it is possible to be compliant while not (fully)<br />

implementing each listed method, but then a rationale shall be<br />

given that the selected combination of methods complies with<br />

the corresponding requirement.<br />

A number of the compliance table methods can be<br />

implemented by following coding standard rules, so enforcing<br />

a coding standard is an effective means in complying with the<br />

ISO 26262 safety standard. The AUTOSAR Guidelines cover<br />

four compliance tables:<br />

Table 1 — Topics to be covered by modelling and<br />

coding guidelines<br />

Table 3 — Principles for software architectural design<br />

Table 8 — Design principles for software unit design<br />

<br />

and implementation<br />

Table 9 — Methods for the verification of software<br />

unit design and implementation<br />

The coverage is most apparent for table 8 of which all<br />

methods correspond with one or more rules from the<br />

Guidelines:<br />

190


TABLE 2 ISO 26262-6 TABLE 8 — DESIGN PRINCIPLES FOR<br />
SOFTWARE UNIT DESIGN AND IMPLEMENTATION<br />
<br />
Method | ASIL A | B | C | D<br />
1a. One entry and one exit point in subprograms and functions | ++ | ++ | ++ | ++<br />
1b. No dynamic objects or variables, or else online test during their creation | + | ++ | ++ | ++<br />
1c. Initialization of variables | ++ | ++ | ++ | ++<br />
1d. No multiple use of variable names | + | ++ | ++ | ++<br />
1e. Avoid global variables or else justify their usage | + | + | ++ | ++<br />
1f. Limited use of pointers | o | + | + | ++<br />
1g. No implicit type conversions | + | ++ | ++ | ++<br />
1h. No hidden data flow or control flow | + | ++ | ++ | ++<br />
1i. No unconditional jumps | ++ | ++ | ++ | ++<br />
1j. No recursions | + | + | ++ | ++<br />

Note that in the table above, for a number of methods there<br />
are some differences; method 1a requires one exit point, where<br />
the Guidelines allow more, but they do forbid the use of setjmp and<br />
longjmp to bypass the normal function call mechanism. Also,<br />

the Guidelines allow dynamic memory management under<br />

some conditions whereas method 1b forbids dynamic objects,<br />

but it can be argued that the rules provided implement “online<br />

test during creation”. In the other compliance tables there are<br />

also methods regarding limiting complexity and restricting<br />

hierarchy, size and dependencies, but the Guidelines impose<br />

no limitations on code metrics. Similarly, there are methods<br />

recommending use of style guides and naming conventions,<br />

which is not required by the Guidelines.<br />

VII. HOW DO I ENSURE MY CODE COMPLIES WITH THE AUTOSAR GUIDELINES?<br />

Traditionally, engineers conducted laborious manual code<br />

reviews to ensure code had been written according to their<br />

chosen standard. This process was error-prone and did not scale<br />

to handle today’s large, complex code bases. Fortunately, these<br />

checks can now be automated using tools. A ‘static analyzer’ is<br />

a tool designed for this purpose. A static analyzer not only<br />

reports violations of coding rules, but also performs a deep code<br />
inspection to highlight any undefined, unspecified, or compiler-dependent<br />
behavior. It analyzes all the possible execution paths<br />

of the program to flag potential runtime issues. Often it can find<br />

issues that are not found by testing because it is rarely practical<br />

for tests to cover all possible execution paths. A static analyzer<br />

is an essential component of the toolset used for the development<br />

of safe, secure and reliable software.<br />

AUTOSAR’s use of PRQA’s static analysis tool, QA·C++,<br />

to ensure quality of its Demonstrator source code and its<br />

compliance to the coding guidelines has provided valuable<br />

insights. These insights, combined with PRQA’s contribution to<br />

the Guidelines, have enabled the development of the only static<br />

analysis solution that is optimized for AUTOSAR-compliant<br />

software development.<br />

PRQA’s AUTOSAR Compliance Module extends QA·C++<br />

for out-of-the-box compliance with the AUTOSAR Guidelines.<br />

For medium to large development teams the solution may be<br />

further enhanced with PRQA’s code quality management<br />

control center, QA·Verify. This guarantees that all team<br />

members consistently apply the coding guidelines in addition to<br />

tracking and reporting code quality for the duration of the<br />

project.<br />

VIII. SUMMARY<br />

The AUTOSAR standard will serve as a platform upon<br />

which future vehicle applications will be implemented by<br />

minimizing the current barriers between functional domains.<br />

The standard will achieve this by making it possible to map<br />

functions and functional networks to different control nodes in<br />

the system, almost independently from the associated hardware.<br />

Although developed for the automotive industry, these<br />

guidelines can also be used by any other organization or sector<br />

which uses C++14 to develop embedded software. In any<br />

application, the use of the PRQA static analysis tool, QA·C++,<br />

will ensure that the code is error-free and that it complies with<br />

the coding guidelines.<br />

REFERENCES<br />

[1] MISRA C++:2008 Guidelines for the use of the C++ language in critical<br />

systems, The Motor Industry Software Reliability Association, June 2008<br />

[2] ISO/IEC 14882:2011, ISO International Standard ISO/IEC<br />

14882:2011(E) — Programming Language C++, International<br />

Organization for Standardization, September 2011<br />

[3] ISO/IEC 14882:2014, ISO International Standard ISO/IEC<br />

14882:2014(E) — Programming Language C++, International<br />

Organization for Standardization, December 2014<br />

[4] Guidelines for the use of the C++14 language in critical and safety-related<br />

systems, Automotive Open System Architecture, October 2017<br />

[5] High Integrity C++ Coding Standard Version 4.0,<br />

http://www.codingstandard.com, PRQA, October 2013.<br />

[6] ISO 26262-6, Road vehicles — Functional safety - Part 6: Product<br />

development at the software level, International Organization for<br />

Standardization, November 2011.<br />

[7] ISO/IEC 14882:2003, ISO International Standard ISO/IEC<br />
14882:2003(E) — Programming Language C++, International<br />
Organization for Standardization, October 2003<br />

[8] Joint Strike Fighter Air Vehicle C++ Coding Standards for the System<br />

Development and Demonstration Program, Document Number<br />

2RDU00001 Rev C, Lockheed Martin Corporation, December 2005.<br />

[9] SEI CERT C++ Coding Standard,<br />

https://www.securecoding.cert.org/confluence/pages/viewpage.action?pageId=637,<br />
Software Engineering Institute Division at Carnegie Mellon<br />

University, 2017<br />

[10] Bjarne Stroustrup, Herb Sutter, C++ Core Guidelines,<br />

http://isocpp.github.io/CppCoreGuidelines/CppCoreGuidelines,<br />

December 2017<br />



Automotive Software Solutions for<br />

Complex Safety-Certified Designs of the Future<br />

Daniel Bernal<br />

Automotive Business Segment<br />

Arm, Inc.<br />

Chandler, AZ, U.S.A.<br />

Daniel.Bernal@arm.com<br />

Abstract — Automotive Original Equipment Manufacturers<br />

(OEMs) and Tier 1 suppliers have recognized that they are in the<br />

middle of a technology revolution. They spend over $100 billion<br />

in R&D including the training required for highly skilled<br />

software development resources. This paper describes the<br />

necessary elements of a mature software development and runtime<br />

software stack required to meet the strict demands of<br />

functional safety (ISO 26262) and also meet the standards for a<br />

common software infrastructure (AUTOSAR). The breadth of<br />

today’s standards-based, safety-certifiable solutions is well<br />

positioned to support even the most complex automotive<br />

Electronic Control Unit (ECU) use-cases, including the trend to<br />

support ECU consolidation (mixed-criticality).<br />

I. INTRODUCTION<br />

The modern automobile is transforming into a complex<br />

System of Systems (SoS) with many sensors, actuators, and<br />

intelligent compute platforms. Automotive system<br />

architectures are increasing in complexity as OEMs add<br />

Advanced Driver Assistance System (ADAS) features, which<br />

are now approaching autonomous drive functionality.<br />

International functional safety standards, such as ISO 26262,<br />

define the key components of qualifying hardware and<br />

software in automotive equipment. These apply throughout the<br />

defined lifecycle of all the automotive electronics and<br />

electrical safety-related systems. Standards such as<br />

AUTOSAR provide for a common software infrastructure for<br />

automotive systems to achieve modularity, scalability,<br />

transferability, and reusability. This helps OEMs and Tier 1<br />

suppliers preserve their investment in software.<br />

I. COMPLEXITY OF ECU DESIGNS<br />

Vehicle designs are rapidly approaching 100+ ECUs.<br />

This presents a huge challenge in software development and<br />

systems integration. Similarly, over a decade ago the<br />

commercial avionics industry began to standardize the<br />

consolidation of applications of differing criticality levels.<br />

This is referred to as Integrated Modular Avionics (IMA).<br />

The automotive industry is moving in the same direction with<br />

mixed-criticality platform designs. Vehicle cockpit controller<br />

platform designs and autonomous drive compute platforms are<br />

taking advantage of the sophisticated features in hardware and<br />

software to support mixed-criticality computing. This trend of<br />

consolidation will inevitably continue as long as modern<br />

hardware features and the software ecosystem support the ability<br />

to easily consolidate multiple applications. This paper will<br />

explain how today's automotive software ecosystem solutions<br />

are well positioned to support the evolving requirements in the<br />

automotive industry. It will detail how the breadth and depth<br />

of the automotive software ecosystem, including safety-separation<br />
solutions, real-time operating systems (RTOS), and<br />

software tools can support traditional automotive ECU and<br />

consolidated ECU platform (mixed-criticality systems)<br />

designs. [1][2][3]<br />

II. FUNCTIONAL SAFETY AND SECURITY ENGINEERING<br />

ISO 26262 outlines a systems engineering approach for<br />

the design of functionally safe electronic systems in road<br />

vehicles. OEMs and Tier 1 suppliers must account for this as<br />

systems are redesigned or new features are included in new<br />

models through the years. One simple example of this is the<br />

trend to replace side view mirrors with side view cameras and<br />

displays. This type of redesign makes for a cleaner exterior<br />

design but also comes at the expense of additional safety<br />

requirements on the vehicle electronics. These are the type of<br />

trade-offs that manufacturers must consider in new designs.<br />

ISO 26262 outlines a classification scheme for Automotive<br />

Safety Integrity Levels (ASIL). During the hazard analysis and<br />

risk assessment for a new ECU, an OEM will capture and<br />

classify safety goals. A safety goal is determined for every<br />

possible hazardous event. Hazardous events are classified<br />

with an ASIL.<br />

The ISO 26262 functional safety standard adopts a<br />

systems development lifecycle commonly referred to as the<br />

“V” model. Fig. 1 shows the “V” model with classes of tools<br />

and pre-qualified software elements that can be leveraged at<br />

different stages of a safety-qualified design.<br />



Fig. 1 “V”-Model Project Development Lifecycle<br />

The automotive industry, just like the industrial Internet<br />

of Things (IoT), is trending toward an engineering approach<br />

that integrates functional safety and security for product<br />

development throughout the lifecycle of a hardware and<br />

software integrated system. Functionally safe system design and<br />

secure architecture system design are similar in process<br />

discipline. Both are based on a systems engineering approach<br />

with a configuration-managed set of evolving requirements,<br />

models, analysis, designs, and test/validation plans. These<br />

configuration-managed products that are produced as a result<br />

of the process are often referred to as “artifacts”. Both safety<br />

engineering and security engineering organizations follow a<br />

disciplined approach to identify, analyze, reuse, specify, verify<br />

and validate goals and requirements. This is referred to as<br />

requirements engineering. [4]<br />

Having tools to support the development of these artifacts<br />

is instrumental to the efficiency of a process driven<br />

organization. These tools are referred to as “requirements<br />

engineering” or “requirements management tools”. ISO<br />

26262 mandates traceability of testing back to requirements.<br />

The ability to automatically generate reports that show testing<br />

coverage back to requirements helps tremendously in process<br />

efficiency. Tooling can be used to improve efficiency not<br />
only for requirements engineering but also for the<br />
design, test, and validation phases.<br />

III. SIMULATION FOR A “SHIFT-LEFT” STRATEGY<br />

OEMs are increasingly challenged by more complex<br />

electrical/electronic (E/E) vehicle designs and more feature rich<br />

ECUs. In addition, competitive pressure has forced OEMs to<br />

shorten development schedules. As a result, Tier 1 suppliers<br />

have started to rely more on a simulation strategy to reduce risk<br />

in software development. This allows for software<br />

development ahead of silicon and platform availability.<br />

Several levels of simulation models that support software<br />

development are listed below:<br />

• Instruction Set Simulator (ISS) models: ISS models are<br />

cycle accurate models that are compiled directly from<br />

RTL and retain complete functional and cycle accuracy.<br />

This enables users to confidently make architectural<br />

decisions, optimize performance or develop bare metal<br />

software.<br />

• SoC Models: SoC and subsystem models provide an<br />

accurate, flexible programmer's-view model of an SoC’s<br />

design allowing software development such as drivers,<br />

firmware, OS and applications prior to silicon availability.<br />

These models allow full control over the simulation,<br />

including profiling, debug and trace. These models can<br />

typically be exported to allow integration into the wider<br />

SoC design process and Electronic Design Automation<br />

(EDA) tool platforms.<br />

• Virtual Platforms: Virtual platforms allow for software<br />

development without a hardware target. Although<br />

generally not cycle accurate, virtual platforms run at<br />

speeds comparable to the real hardware. Virtual platforms<br />

are complete simulations of physical paltform including<br />

processor, memory and peripherals. Virtual platforms are<br />

much more than just an instruction set simulator. A<br />

processor, memory and peripheral(s) model provides a<br />

good indication how software will execute on the physical<br />

device. In cases where a large team is working on a<br />

“generic” device support, virtual models remove the need<br />

for a large number of hardware targets. A virtual platform<br />

allows for OS bring-up, driver and firmware development<br />

significantly in advance of silicon.<br />



• Software in the Loop (SIL) testing describes a test methodology where<br />

executable code is tested within a modelling environment<br />

that can help test software functionality. It is possible for a<br />

SIL simulation to run faster than real-time. This allows for<br />

comprehensive logic testing with faster than normal testing<br />

times.<br />

• Hardware in the Loop (HIL) testing is used in the<br />
development and test of complex real-time<br />
embedded systems. HIL simulation provides an<br />

effective platform by adding the complexity of the system<br />

under control (vehicle network) to the test platform.<br />

Virtual platform solutions are available that support<br />

embedded real-time system simulation. Simulation technology<br />

is necessary but often not sufficient. SIL and HIL testing can<br />

complement the testing strategy for an ECU development.<br />

Avionics systems development in accordance with DO-<br />

178C, Software Considerations in Airborne Systems and<br />

Equipment Certification, has set a precedent that testing on<br />

virtual platforms is viable for safety use-cases. In the cases<br />

where hardware/software integration testing has been run for<br />

formal credit on a virtual platform, the virtual platform has<br />

been qualified as a test tool. Since this type of tool could fail to<br />

detect an error while testing the hardware/software integration,<br />

DO-178C indicates a specific tool qualification process to be<br />

followed. It is common on a virtual platform that hardware<br />

emulation is not equivalent to the fidelity of the real hardware.<br />

In this case, only software requirements related to the fully<br />

emulated and qualified hardware features can be formally<br />

tested in this test environment for credit. It is likely that we will<br />

see OEMs and Tier 1 suppliers use this type of verification for<br />

credit using virtual platform environments. [5]<br />

Virtual platforms for verification and test have several<br />

advantages:<br />

• Software development in advance of silicon and<br />

platform availability.<br />

• Increased code coverage, functional coverage,<br />

assertion checking.<br />

• Stimulation of the design at different points, exercising<br />
software paths and conditions that are<br />
difficult to replicate on real hardware platforms.<br />

• Fault injection at the hardware level, which makes it easier to<br />
characterize how effectively software handles<br />
special conditions, e.g. radiation-induced bit flips.<br />

Just like a compiler can be a qualified tool to generate<br />

executable code for a safety critical platform, a simulation<br />

model/environment can be qualified as a tool to support an<br />

ASIL development testing effort. [6][7]<br />

IV. SAFETY-QUALIFIED PLATFORM – THE BASIS FOR A<br />

SAFETY-QUALIFIED ECU<br />

A vehicle E/E architecture can be characterized as an SoS.<br />

A Commercial off the Shelf (COTS) ECU will have the<br />

requirement to be compliant with the ISO 26262 functional<br />

safety standard if its malfunctions are deemed to be safety<br />

relevant in a specific vehicle platform. This dictates that the<br />

hardware was developed as a Safety Element out of Context<br />

(SEooC) in accordance with the standard. This indicates that<br />

the hardware safety claims stand on their own merits. The<br />

platform will serve as the basis for the software stack which<br />

will also be required to meet the applicable safety objectives.<br />

The safety-qualified ECU hardware platform provider will<br />

provide a functional safety package with a safety manual that<br />

details the design and verification process, fault detection and<br />

control and assumptions of use. The platform fault detection<br />

and control may include pre-qualified software test libraries<br />

(STL). An STL is an executable program that is periodically<br />

executed by a processing unit to detect that the processing unit<br />

or another part of the integrated circuit is operating as<br />

designed. STLs can be used by safety engineers as a fault<br />

detection mechanism. STLs are typically run as a built-in self-test<br />
(BIST) at power-on or at runtime to detect failure modes that<br />

must be mitigated. STLs are part of the solution to detect a<br />

failure mode in hardware that may require mitigation based on<br />

the safety objectives of the ECU. [8][9]<br />

V. SOFTWARE STACK TO SUPPORT SAFETY-QUALIFIED<br />

ECU DESIGNS<br />

The ECU software stack must support the item safety<br />

case. This is a structured argument, supported with evidence<br />

(artifacts), which justifies why the system meets its safety<br />

goals in its specified operating environment. Safety cases may<br />

be represented as hierarchical. The software safety case<br />
assembled by an ECU developer is composed of input from<br />

each software supplier that provides a COTS software element<br />

that is integrated into the ECU. Each of these software<br />

elements must meet the safety case as a Safety Element out of<br />

Context (SEooC). [10]<br />

A. Composing the Safety Case<br />

Assuming there is an ASIL ECU requirement, designers<br />

must make design decisions on how best to meet the safety<br />

requirements. Often, the best approach is to compose the<br />

ECU software stack from previously safety-qualified software<br />

elements. The elements in the runtime safety software stack<br />

must adhere to the process rigor mandated by ISO 26262. Fig.<br />

2 shows the typical software elements in a runtime software<br />

stack of a safety case.<br />

Automotive ECU engineers have many options in an<br />

ecosystem to help identify and mitigate the failure modes that<br />

must be mitigated to support the safety claims of an ECU. A<br />

comprehensive ecosystem of products (Silicon Device, Board<br />

Level, Software Stacks) and Services will make the job of an<br />

automotive safety engineer easier. Some of the safety related<br />

design patterns supported by the ecosystem of hardware and<br />

software suppliers include:<br />

• Isolation for safety separation in software.<br />

• Redundancy in hardware and software<br />

• Pre-qualified software elements.<br />

• Boot and run-time consistency checks, e.g.,<br />
LBIST, BIST, etc.<br />
• Solutions and tools leveraged from adjacent safety<br />
industries (e.g., DO-178B/C, IEC 61508).<br />



Fig. 2 Software Elements of a Certified System<br />

B. AUTOSAR Compliant SW Elements<br />

ECUs vary in hardware resources and software<br />

architecture. Many OEMs have chosen to standardize their<br />

application interface by complying to the AUTOSAR Classic<br />

platform interface. This allows for greater application<br />

portability from one ECU platform to another. The<br />

AUTOSAR Classic platform for many years has been the<br />

application interface supporting the modular E/E vehicle<br />

architectures where every function has its own ECU. The<br />

evolving need for safety and security features is demanding<br />

that 8/16-bit ECU designs migrate to 32/64-bit. The upgrade<br />

in ECU platform hardware has allowed OEMs and Tier 1s to<br />

consider consolidating similar ECU applications onto one<br />

ECU. AUTOSAR Classic compliant runtimes have typically<br />

run on microcontroller unit (MCU) based ECUs. The ability<br />

to support consolidation of portable ECU functional blocks<br />

provides the greatest flexibility in an E/E vehicle architecture.<br />

Having a software stack that supports multiple AUTOSAR<br />
Classic runtimes on one MCU-based SoC was previously<br />
possible, but until recently it was supported only through safety<br />
separation in software, with no hardware support. This places a<br />

greater burden on the hypervisor safety separation layer.<br />

Features like a 2-level Memory Protection Unit (MPU) allow<br />

MCUs to support multiple guest OSes on the same platform.<br />

This allows vehicle designers flexibility to integrate classic<br />

ECU functional blocks onto one platform. This supports the<br />

ECU consolidation trend to reduce the cost of infrastructure<br />

including wiring and the number of domain gateways.[11]<br />

C. Complex ECU Architectures<br />

Modern vehicle designs with ADAS features require<br />

much more complex safety certifiable platforms. ADAS<br />

platforms have a much more demanding architecture and<br />

platform processing requirement. For this reason, the<br />

AUTOSAR consortium has decided to standardize the API<br />

and services to allow vehicle manufacturers the flexibility of<br />

using a service oriented architecture for their designs. The<br />

requirements that drive these more complex systems include<br />

mixed-criticality, real-time, security and safety. Cockpit<br />

Controller, ADAS and fully autonomous drive ECU platforms<br />

are all examples of this class of use-case. The logical choice is<br />

for OEMs and Tier 1 suppliers to leverage safety separation as<br />

part of their designs. Safety separation with a hypervisor<br />

layer allows a safety engineer to compose a mixed-criticality<br />

ASIL ECU. Fig. 3 is an example of a mixed-criticality<br />

autonomous drive platform.<br />

195


Fig. 3 Autonomous Drive Mixed-Criticality Platform<br />

The autonomous drive platform leverages the features<br />

provided by the AUTOSAR Adaptive software stack for<br />

sensor input and processing but also allows integration of a<br />

much more deterministic real-time application such as steering<br />

and accelerator control provided by an AUTOSAR Classic<br />

software stack. The safety separation in this use-case is<br />

provided by a pre-qualified ISO 26262 ASIL D hypervisor.<br />

An ASIL D safety embedded hypervisor is required to<br />

maintain the ASIL D safety claim of the steering and<br />

accelerator control guest operating systems. The ecosystem of<br />

automotive software solutions provides many options for<br />

qualified embedded hypervisor solutions that support designs<br />

up to ASIL D. Assuming that the integrator follows the<br />

assumptions of use of the qualified hypervisor, this will<br />

guarantee freedom from interference between partitions in<br />
time and space. An additional benefit of using a modular<br />

architecture that is composed of mixed-criticality partitions is<br />

the ability for an OEM or Tier 1 to introduce unqualified<br />

software, such as Linux or other open source software in a<br />

partition that does not have a safety requirement. A mixed<br />

AUTOSAR Classic and Adaptive platform provides the<br />

foundation and flexibility to support different system<br />

architectures. This is required for a software defined vehicle<br />

where OEMs and Tier 1s will have the flexibility to move<br />

encapsulated ECU functionality from ECU to ECU.<br />

VI. ECOSYSTEM SUPPLIER SOLUTIONS<br />

There are many automotive suppliers that offer technologies<br />

which can help meet the safety engineering development and<br />

runtime compute requirements for a vehicle ECU. This<br />

includes complex ECU designs such as digital cockpit,<br />

advanced driver assistance and autonomous drive systems.<br />

These suppliers provide solutions for silicon, software and<br />

services facilitating efficient development of automotive<br />

solutions.<br />

• Silicon Devices and Board Products<br />

• Safety-Qualified Compilers and Tools<br />

• Embedded Virtualization (hypervisors)<br />

• Operating Systems, AUTOSAR Classic/Adaptive<br />

• Testing and Simulation Platforms<br />

• Collaborative Open Source Projects<br />

• Human Machine Interface (HMI) Tools<br />

• Security Software Frameworks<br />

• Middleware and Software Frameworks<br />

• Software Development Tools<br />

VII. CONCLUSION<br />

It is important that the ecosystem of suppliers that support<br />

automotive safety designs provides a breadth of products to<br />

support safety separation, security frameworks, and operating<br />

environments compliant with AUTOSAR Classic and<br />

AUTOSAR Adaptive. OEMs and Tier 1s will use these<br />

technologies to integrate real-time safety-critical ECU<br />
designs with newer, more complex ECU designs such as<br />

ADAS and Autonomous drive ECUs. The pressure to shorten<br />

development times places a strong emphasis on starting<br />

software development early. Software tooling to help improve<br />

process efficiency in addition to high-fidelity simulation<br />

environments help mitigate risk. Lastly, having ecosystem<br />

solution options at all levels of the ECU system architecture<br />

including hardware and software stack helps reduce the risk in<br />

a new design.<br />



REFERENCES<br />

[1] http://ieeexplore.ieee.org/document/7994957/<br />

[2] https://en.wikipedia.org/wiki/Integrated_modular_avionics<br />

[3] https://en.wikipedia.org/wiki/Mixed_criticality<br />

[4] https://resources.sei.cmu.edu/asset_files/Presentation/2010_017_001_23266.pdf<br />

[5] RTCA/DO-178C Software Considerations in Airborne Systems and<br />

Equipment Certification. RTCA, Inc. 2011<br />

[6] https://developer.arm.com/products/system-design/fast-models<br />

[7] https://www.psware.com/using-virtual-environments-for-formalverification-credit/<br />

[8] http://www.armtechforum.com.tw/upload/2017/Hsinchu/B6_Functional<br />

_Safety_What_is_Arm_Doing_to_Support_this_Critical_Capability_HS<br />

U.pdf<br />

[9] http://yogitech.com/sites/default/files/documents/frstl_white_paper_rev1<br />

.2en.pdf<br />

[10] https://community.arm.com/processors/b/blog/posts/white-paper-thefunctional-safety-imperative-in-automotive-design<br />

[11] https://developer.arm.com/products/processors/cortex-r/cortex-r52<br />

197


Achieving ISO 26262 Compliance to Ensure Safety<br />
and Security<br />
Mark W. Richardson<br />
Lead Field Application Engineer<br />
LDRA<br />
Wirral, UK<br />
mark.richardson@ldra.com<br />
I. INTRODUCTION<br />

The bits and bolts of the automotive industry have kept pace<br />

with the latest advances in computing technology. A typical<br />

new car includes over one gigabyte of executable software,<br />

and automotive electronics accounts for about 25% of its<br />

capital cost.<br />

Adapting to such a dramatic increase in software has spawned<br />
process errors that have resulted in the unfortunate loss of<br />
lives, countless recalls, and expensive litigation. Added to<br />

this, the advent of the connected car has served to<br />

dramatically increase the number of possible points of<br />

failure. To curb this level of risk and to enforce software<br />

quality, automotive OEMs increasingly demand ISO 26262 1<br />

compliance from their supply chain.<br />

This ISO 26262 Functional Safety Standard defines a product<br />

safety life cycle based on a risk-oriented approach,<br />

encapsulated by the assignment of an Automotive Safety<br />

Integrity Level (ASIL) to each system or subsystem under<br />

development. For every ASIL, the standard defines the<br />

processes related to requirement definition, implementation,<br />

verification, and validation, with traceability between each of<br />

these phases also being a key factor in the achievement of<br />

compliance.<br />

It is incumbent on suppliers seeking compliance to document<br />

how their mature and safety focused development<br />

environment is in accordance with the standard throughout<br />

this development lifecycle.<br />

This paper describes a tools-based methodology that can be<br />

used to cost-effectively manage an ISO 26262 compliant<br />

product development life cycle by providing an interactive<br />

compliance roadmap to help manage the software planning,<br />

development, verification, and regulatory activities of ISO<br />

26262 Part 6, Product Development: Software Level (ISO<br />

26262-6) 2 .<br />

The methodology guides development teams through the<br />

generation of fully compliant plans, document checklists,<br />

transition checklists, standards and other required lifecycle<br />

documents. When integrated with software verification tools,<br />

the compliance management system can also streamline<br />

verification processes, further reducing thousands of hours of<br />

documentation effort and achieving significant reductions in<br />

planning costs.<br />

II. ISO 26262 PROCESS OBJECTIVES<br />

A key element of ISO 26262-4:2011 3 is the practice of<br />

allocating technical safety requirements in the system design<br />

specification, and developing that design further to derive an<br />

item integration and testing plan. It applies to all aspects of<br />

the system including software, with the explicit subdivision<br />

of hardware and software development practices being dealt<br />

with further through the lifecycle.<br />

The relationship between the system-wide ISO 26262-4:2011<br />

and the software specific sub-phases found in ISO 26262-<br />

6:2011 can be represented in a V-model. Each of those steps<br />

is explained further in the following discussion (Figure 1).<br />

1 International standard ISO 26262 Road vehicles —<br />
Functional safety<br />
2 International standard ISO 26262 Road vehicles —<br />
Functional safety — Part 6: Product development at the<br />
software level<br />
3 International standard ISO 26262 Road vehicles —<br />
Functional safety — Part 4: Product development at the<br />
system level<br />



Software architectural design (ISO 26262-6:2011 section 7)<br />

There are many tools available for the generation of the<br />

software architectural design, with graphical representation<br />

of that design an increasingly popular approach. Appropriate<br />

tools are exemplified by MathWorks ® Simulink ®4 , IBM ®<br />

Rational ® Rhapsody ®5 , and ANSYS ® SCADE. 6<br />

Figure 1 - Software-development V-model with cross-references<br />
to ISO 26262 and standard development tools<br />

System design (ISO 26262-4:2011 section 7)<br />

The products of this system-wide design phase potentially<br />

include CAD drawings, spreadsheets, textual documents and<br />

many other artefacts, and clearly a variety of tools can be<br />

involved in their production. This phase also sees the<br />

technical safety requirements refined and allocated to<br />

hardware and software. Maintaining traceability between<br />

these requirements and the products of subsequent phases<br />

generally causes a project management headache.<br />

The ideal tools for requirements management can range from<br />

a simple spreadsheet or Microsoft Word document to<br />

a purpose-designed requirements management tool such as<br />
IBM Rational DOORS Next Generation or Siemens Polarion<br />
REQUIREMENTS. The selection of the appropriate tools<br />

will help in the maintenance of bi-directional traceability<br />

between phases of development, as discussed later.<br />

Specification of software safety requirements (ISO 26262-6:2011 section 6)<br />

This sub-phase focuses on the specification of software<br />

safety requirements to support the subsequent design phases,<br />

bearing in mind any constraints imposed by the hardware.<br />

It provides the interface between the product-wide system<br />

design of ISO 26262-4:2011 and the software specific ISO<br />

26262-6:2011, and details the process of evolution of lower<br />

level, software related requirements. It will most likely<br />

involve the continued leveraging of the requirements<br />

management tools discussed in relation to the System Design<br />

sub-phase.<br />

Figure 2 - Graphical representation of Control and Data<br />

Flow as depicted in the LDRA tool suite<br />

Static analysis tools contribute to the verification of the<br />

design by means of the control and data flow analysis of the<br />

code derived from it, providing graphical representations of<br />

the relationship between code components for comparison<br />

with the intended design (Figure 2).<br />

A similar approach can also be used to generate a graphical<br />

representation of legacy system code, providing a path for<br />

additions to it to be designed and proven in accordance with<br />

ISO 26262 principles.<br />

Software unit design and implementation (ISO 26262-6:2011 section 8)<br />

Coding rules: The illustration in Figure 3 is a typical example<br />

of a table from ISO 26262-6:2011. It shows the coding and<br />

modelling guidelines to be enforced during implementation,<br />

superimposed with an indication of where compliance can be<br />

confirmed using automated tools.<br />

4 MathWorks ® Simulink<br />
https://uk.mathworks.com/products/simulink.html<br />
5 IBM ® Rational ® Rhapsody ® family<br />
http://www-03.ibm.com/software/products/en/ratirhapfami<br />
6 ANSYS ® SCADE Suite<br />
http://www.ansys.com/products/embedded-software/ansys-scade-suite<br />



These guidelines combine to make the resulting code more<br />
reliable, less prone to error, easier to test, and/or easier to<br />
maintain. Peer reviews represent a traditional approach to<br />
enforcing adherence to such guidelines, and while they still<br />
have an important part to play, automating the more tedious<br />
checks using tools is far more efficient, less prone to error,<br />
repeatable, and demonstrable.<br />
Figure 3 - Mapping the capabilities of the LDRA tool suite to<br />
“Table 6: Methods for the verification of the software<br />
architectural design” specified by ISO 26262-6:2011 7<br />
ISO 26262-6:2011 highlights the MISRA 8 coding guidelines<br />
language subsets as an example of what could be used. There<br />
are many different sets of coding guidelines available, but it<br />
is entirely permissible to use an in-house set, or to adapt and<br />
extend one of the standard sets to make it more appropriate<br />
for a particular application (Figure 4).<br />
Figure 4 - Highlighting violated coding guidelines in the<br />
LDRA tool suite<br />
Software architectural design and unit implementation:<br />
Establishing appropriate project guidelines for coding,<br />
architectural design and unit implementation are clearly three<br />
discrete tasks, but software developers responsible for<br />
implementing the design need to be mindful of them all<br />
concurrently.<br />
These guidelines are also founded on the notion that they<br />
make the resulting code more reliable, less prone to error,<br />
easier to test and/or easier to maintain. For example,<br />
architectural guidelines include:<br />
• Restricted size of software components and restricted<br />
size of interfaces, recommended not least because large,<br />
rambling functions are difficult to read, maintain, and<br />
test – and hence more susceptible to error.<br />
• High cohesion within each software component,<br />
resulting from close linking between the elements<br />
within a module, so that each component performs a<br />
single, well-defined task.<br />
Figure 5 - Output from control and data coupling analysis as<br />
represented in the LDRA tool suite<br />
Static analysis tools can provide metrics to ensure compliance<br />
with the standard, such as complexity metrics as a product of<br />
interface analysis, cohesion metrics evaluated through data<br />
object analysis, and coupling metrics via data and control<br />
coupling analysis (Figure 5).<br />

More generally, static analysis can help to ensure that the<br />

good practices required by ISO 26262:2011 are adhered to<br />

whether they are coding rules, design principles, or principles<br />

for software architectural design.<br />
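By way of illustration, the following C fragment shows the style of construct that MISRA-like guidelines and static analysis typically flag, together with a compliant alternative. It is a hedged sketch: the function names are invented and no specific rule numbers are claimed.<br />

```c
#include <stdint.h>

/* Illustrative only: a construct MISRA-style guidelines tend to flag. */

/* Questionable style: implicit narrowing conversion and a magic number. */
uint8_t scale_raw(uint32_t raw)
{
    return raw / 17;             /* uint32_t silently truncated to uint8_t */
}

/* Cleaner style: named constant, range guard, explicit conversion. */
#define SCALE_DIVISOR 17U

uint8_t scale_checked(uint32_t raw)
{
    uint32_t scaled = raw / SCALE_DIVISOR;
    if (scaled > 255U) {
        scaled = 255U;           /* saturate rather than truncate */
    }
    return (uint8_t)scaled;      /* conversion is now explicit and bounded */
}
```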

In practice, for developers who are newcomers to ISO 26262,<br />

the role of such a tool often evolves from a mechanism for<br />

highlighting violations, to a means to confirm that there are<br />

none.<br />

Software unit testing (ISO 26262-6:2011 section 9) and<br />
Software integration and testing (ISO 26262-6:2011 section 10)<br />

Just as static analysis techniques (an automated “inspection”<br />

of the source code) are applicable across the sub-phases of<br />

coding, architectural design and unit implementation, dynamic<br />

analysis techniques (involving the execution of some or all of<br />

the code) are applicable to unit, integration and system<br />

testing. Unit testing is designed to focus on particular software<br />

procedures or functions in isolation, whereas integration<br />

testing ensures that safety and functional requirements are met<br />

7 Based on table 6 from ISO 26262-6:2011, Copyright © 2015 IEC, Geneva, Switzerland. All<br />
rights acknowledged<br />
8 MISRA – The Motor Industry Software Reliability Association<br />
https://www.misra.org.uk/<br />

200


when units are working together in accordance with the<br />

software architectural design.<br />

ISO 26262-6:2011 tables list techniques and metrics for<br />

performing unit and integration tests on target hardware to<br />

ensure that the safety and functional requirements are met and<br />

software interfaces are verified at the unit and integration<br />

levels. Fault injection and resource tests further prove<br />

robustness and resilience and, where applicable, back-to-back<br />

testing of model and code helps to prove the correct<br />

interpretation of the design. Artefacts associated with these<br />

techniques provide both a reference for their management and<br />

evidence of their completion. They include the software unit<br />

design specification, test procedures, verification plan and<br />

verification specification. On completing each test procedure,<br />

pass/fail results are reported and compliance with<br />

requirements verified appropriately.<br />

ISO 26262:2011 does not require that any of the tests it<br />

promotes deploy software test tools. However, just as for<br />

static analysis, dynamic analysis tools help to make the test<br />

process far more efficient, especially for substantial projects.<br />

Figure 7 - Examples of representations of structural<br />

coverage within the LDRA tool suite<br />

Structural coverage metrics: In addition to showing that the<br />

software functions correctly, dynamic analysis is used to<br />

generate structural coverage metrics. In conjunction with the<br />

coverage of requirements at the software unit level, these<br />

metrics provide the necessary data to evaluate the<br />

completeness of test cases and to demonstrate that there is no<br />

unintended functionality (Figure 7).<br />
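The mechanism behind such metrics can be reduced to a few lines of C: the tool inserts probes into the source, and the proportion of probes hit during test execution yields the coverage figure. This is a deliberately simplified, hand-written sketch; real tools instrument and report automatically.<br />

```c
/* Hand-written sketch of statement-coverage instrumentation. */
#define N_PROBES 3
static int probe_hit[N_PROBES];
#define PROBE(i) (probe_hit[(i)] = 1)

/* Example function with probes on each statement path. */
int clamp_positive(int x)
{
    PROBE(0);
    if (x < 0) {
        PROBE(1);                /* only hit when a negative input is tested */
        x = 0;
    }
    PROBE(2);
    return x;
}

/* Coverage = percentage of probes that were hit during testing. */
int coverage_percent(void)
{
    int hit = 0;
    for (int i = 0; i < N_PROBES; ++i) {
        hit += probe_hit[i];
    }
    return (hit * 100) / N_PROBES;
}
```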

Figure 6 - Performing requirement based unit-testing using<br />

the LDRA tool suite<br />

The example in Figure 6 shows how the software interface is<br />

exposed at the function scope, allowing the user to enter inputs<br />

and expected outputs to form the basis of a test harness. The<br />

harness is then compiled and executed on the target hardware,<br />

and actual and expected outputs compared.<br />
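The table-driven principle behind such a harness can be sketched as follows; the unit under test, its name, and the test values are all invented for illustration and are not taken from any real project or from the tool itself.<br />

```c
/* Invented unit under test: clamps a demand value to a maximum. */
int limit_demand(int demand, int max_demand)
{
    return (demand > max_demand) ? max_demand : demand;
}

/* Minimal harness: tabulated inputs and expected outputs are executed,
   and actual results compared against expectations. */
typedef struct {
    int demand;
    int max_demand;
    int expected;
} test_case_t;

int run_tests(void)
{
    const test_case_t cases[3] = {
        {  50, 100,  50 },   /* nominal value passes through */
        { 150, 100, 100 },   /* clamped at the maximum */
        {   0, 100,   0 },   /* lower boundary */
    };
    int failures = 0;
    for (int i = 0; i < 3; ++i) {
        if (limit_demand(cases[i].demand, cases[i].max_demand)
                != cases[i].expected) {
            ++failures;
        }
    }
    return failures;             /* 0 means every test case passed */
}
```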

Unit tests become integration tests as units are introduced as<br />

part of a call tree, rather than being “stubbed”. Exactly the<br />

same test data can be used to validate the code in both cases.<br />

Boundary values can be analysed by automatically generating<br />

a series of unit test cases, complete with associated input data.<br />

The same facility also allows the definition of<br />

equivalence boundary values such as minimum value, value<br />

below lower partition value, lower partition value, upper<br />

partition value and value above upper partition boundary.<br />
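The derivation of these values can be sketched for a hypothetical parameter constrained to the range [10, 90]; the bounds are invented purely for illustration.<br />

```c
#include <limits.h>

/* Hypothetical valid partition for a parameter: [LOWER, UPPER]. */
#define LOWER 10
#define UPPER 90

/* Fills out[] with the five boundary values enumerated in the text. */
void boundary_values(int out[5])
{
    out[0] = INT_MIN;     /* minimum value of the type */
    out[1] = LOWER - 1;   /* value below lower partition value */
    out[2] = LOWER;       /* lower partition value */
    out[3] = UPPER;       /* upper partition value */
    out[4] = UPPER + 1;   /* value above upper partition boundary */
}
```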

Should changes become necessary – perhaps as a result of a<br />

failed test, or in response to a requirement change from a<br />

customer – then all impacted unit and integration tests would<br />

need to be re-run (regression tested), automatically reapplying<br />

those tests through the tool to ensure that the<br />

changes do not compromise any established functionality.<br />

Metrics recommended by ISO 26262:2011 include<br />

functional, call, statement, branch and MC/DC coverage.<br />

Unit and system test facilities can operate in tandem, so that<br />

(for instance) coverage data can be generated for most of the<br />

source code through a dynamic system test, and then be<br />
complemented using unit tests to exercise constructs – such as<br />
defensive code – which are inaccessible during normal system<br />
operation.<br />
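The difference between these criteria is easiest to see on a small decision. For the invented interlock below, statement coverage needs one test and branch coverage two, whereas MC/DC requires each condition to be shown to independently affect the outcome; one adequate vector set is enumerated in the comment. Everything here is illustrative, not tool output.<br />

```c
#include <stdbool.h>

/* Invented decision with three conditions, for illustration only. */
bool interlock_open(bool door_closed, bool speed_zero, bool override)
{
    return (door_closed && speed_zero) || override;
}

/* One MC/DC-adequate set (4 vectors for 3 conditions):
 *  door_closed  speed_zero  override  ->  result
 *      T            T           F          T   baseline
 *      F            T           F          F   door_closed flips the outcome
 *      T            F           F          F   speed_zero flips the outcome
 *      T            F           T          T   override flips the outcome
 */
```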

Software test and model based development: There are<br />

several vendors of model based development tools, such as<br />

MathWorks Simulink, IBM Rational Rhapsody, and ANSYS<br />

SCADE, many of which are deservedly popular in the<br />

automotive industry. Their integration with test tools becomes<br />

pertinent once source code has been auto-generated from<br />
those models.<br />

Using the MathWorks product as an example, “back-to-back”<br />

testing is approached by first developing and verifying<br />

design models within Simulink. Code is then generated from<br />

Simulink, instrumented by the dynamic test tool, and executed in<br />

either Software in the Loop (SIL or host) mode, or Processor<br />

In the Loop (PIL or target) mode. Structural coverage reports<br />

are presented at the source code level.<br />
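The principle of back-to-back testing can be reduced to a sketch: a reference implementation standing in for the simulated model, and a second function standing in for the generated code, are driven with identical inputs and their outputs compared. Both functions below are invented placeholders, not generated code.<br />

```c
/* "Model": reference implementation of the specified behaviour. */
int model_saturate(int x)
{
    if (x > 100)  return 100;
    if (x < -100) return -100;
    return x;
}

/* Stand-in for auto-generated code implementing the same design. */
int generated_saturate(int x)
{
    return (x > 100) ? 100 : ((x < -100) ? -100 : x);
}

/* Back-to-back comparison over a shared input vector. */
int back_to_back_mismatches(const int *inputs, int n)
{
    int mismatches = 0;
    for (int i = 0; i < n; ++i) {
        if (model_saturate(inputs[i]) != generated_saturate(inputs[i])) {
            ++mismatches;
        }
    }
    return mismatches;           /* 0 = model and code agree on all inputs */
}
```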

In addition to “back-to-back” testing, such an integration<br />

provides facilities to ensure that generated source code<br />

complies with an appropriate coding standard, performs<br />

additional dynamic testing at the source code level, and<br />



complies with requirements. The same facilities can also be<br />

used to ensure that any hand-written additions to the auto-<br />
generated code are adequately tested.<br />

Bi-directional traceability (ISO 26262-4:2011 and ISO 26262-6:2011)<br />

Bi-directional traceability runs as a principle throughout<br />

ISO 26262:2011, with each development phase required to<br />

accurately reflect the one before it. In theory, if the exact<br />

sequence of the V-model is adhered to, then the requirements<br />

will never change and tests will never throw up a problem.<br />

But life’s not like that.<br />

Consider, then, what happens if there is a code change in<br />

response to a failed integration test, perhaps because the<br />

requirements are inconsistent or there is a coding error. What<br />

other software units were dependent on the modified code?<br />

Such scenarios can quickly lead to situations where the<br />

traceability between the products of software development<br />

falls down. Once again, while it is possible to maintain<br />
traceability manually, automation helps a great deal.<br />

Software unit design can take many forms – perhaps in the<br />

form of a natural language detailed design document, or<br />

perhaps model based. Either way, these design elements need<br />

to be bi-directionally traceable to both software safety<br />

requirements and the software architecture. The software<br />

units must then be implemented as specified and then be<br />

traceable to their design specification.<br />

Automated requirements traceability tools are used to<br />
establish links between requirements and test cases of different<br />
scopes, which allows test coverage to be assessed (Figure 8).<br />

The impact of failed test cases can be assessed and addressed,<br />

as can the impact of requirements changes and gaps in<br />

requirements coverage. And artefacts such as traceability<br />

matrices can be automatically generated to present evidence<br />

of compliance to ISO 26262:2011.<br />

Figure 8 - Performing requirement based testing. Test cases<br />

are linked to requirements and executed within the LDRA<br />

tool suite<br />
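The underlying data model of such traceability links is simple, as this sketch suggests; the requirement and test-case identifiers are invented. Each requirement is linked to test cases, and an unlinked requirement is reported as a coverage gap.<br />

```c
#include <stddef.h>

/* Sketch of a traceability link: a requirement tied to a test case.
   A NULL test_id marks a requirement with no linked test yet. */
typedef struct {
    const char *req_id;
    const char *test_id;
} trace_link_t;

/* Counts requirements lacking a linked test case - the gaps that a
   traceability report would highlight for attention. */
int count_untraced(const trace_link_t *links, int n)
{
    int gaps = 0;
    for (int i = 0; i < n; ++i) {
        if (links[i].test_id == NULL) {
            ++gaps;
        }
    }
    return gaps;
}
```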

In practice, initial structural coverage is usually accrued as<br />
part of this holistic process from the execution of functional<br />
tests on instrumented code, leaving unexecuted portions of<br />
code which require further analysis. That ultimately results in<br />

the addition or modification of test cases, changes to<br />

requirements, and/or the removal of dead code. Typically, an<br />

iterative sequence of review, correct and analyse ensures that<br />

design specifications are satisfied.<br />

During the development of a traditional, isolated system, that<br />

is clearly useful enough. But connectivity demands the ability<br />

to respond to vulnerabilities identified in the field. Each<br />

newly discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed and<br />

automatically test only the impacted code becomes<br />

something much more significant.<br />

Connectivity changes the notion of the development process<br />

ending when a product is launched, or even when its<br />

production is ended. Whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it, coupled with the additional<br />

pressure of knowing that in such circumstances, a speedy<br />

response to requirements change has the potential to both<br />

save lives and enhance reputations. Such an obligation shines<br />

a whole new light on automated requirements traceability.<br />

Confidence in the use of software tools (ISO 26262-8:2011 section 11)<br />

This supporting process defines a mechanism to provide<br />

evidence that the software tool chain is competent for the job.<br />

The required level of confidence in a software tool depends<br />

upon the circumstances of its deployment, both in terms of<br />

the possibility that a malfunctioning software tool can<br />

introduce or fail to detect errors in a safety-related element<br />

being developed, and the likelihood that such errors can be<br />

prevented or detected.<br />

Tool qualification by a TÜV organization (“Technischer<br />
Überwachungsverein”, or “Technical Inspection<br />
Association”) for use in ISO 26262 compliant systems<br />

removes considerable user overhead in providing alternative<br />

evidence of that confidence.<br />

Depending on the user’s assessment of their application, test<br />
tools are generally assigned a “Tool Confidence Level” of<br />

either TCL1 or TCL2. In all cases except where the tool suite<br />

is assigned TCL2 and the product is designated ASIL D, the<br />

existence of a TÜV certificate is sufficient to establish<br />

sufficient confidence in the tool. Otherwise, the tool is<br />

required to be subjected to a validation process, to show that<br />

the tool is capable of analysing sample software in the<br />

appropriate target environment.<br />
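The rule as stated in the text reduces to a single predicate, sketched here. The enumerations are simplified for illustration; TCL3 and the detailed tool-confidence determination tables of ISO 26262-8 are deliberately omitted.<br />

```c
#include <stdbool.h>

typedef enum { TCL1 = 1, TCL2 = 2 } tcl_t;
typedef enum { ASIL_A, ASIL_B, ASIL_C, ASIL_D } asil_t;

/* Encodes the rule as stated in the text: a TUV certificate suffices
   except for a TCL2 tool used on an ASIL D product, which additionally
   requires validation in the target environment. */
bool certificate_sufficient(tcl_t tcl, asil_t asil)
{
    return !((tcl == TCL2) && (asil == ASIL_D));
}
```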

III. CONCLUSIONS<br />

There is an ever-widening range of automotive electrical<br />
and/or electronic (E/E/PE) systems such as advanced driver<br />
assistance systems, anti-lock braking systems, steering and<br />



airbags. Their increasing levels of integration and connectivity<br />

provide almost as many challenges as their proliferation, with<br />

non-critical systems such as entertainment systems sharing the<br />

same communications infrastructure as steering, braking and<br />

control systems. The net result is a necessity for exacting<br />

functional safety development processes, from requirements<br />

specification, design, implementation, integration,<br />

verification, validation, and through to configuration.<br />

ISO 26262 “Road vehicles – Functional safety” was published<br />

in response to this explosion in automotive E/E/PE system<br />

complexity, and the associated risks to public safety 9 . Like the<br />

rail, medical device and process industries before it, the<br />

automotive sector based its functional safety standard on the<br />
industry-agnostic standard IEC 61508 10 . The<br />

resulting ISO 26262 has become the dominant automotive<br />

functional safety standard, and its requirements and processes<br />

are becoming increasingly familiar throughout the industry.<br />

Although the standard has a significant contribution to make to<br />

both safety and security, there is no doubt that it brings with it<br />

considerable overhead. The application of automated tools<br />

throughout the development lifecycle can help considerably to<br />

minimize that overhead, whilst removing much of the<br />

potential for human error from the process.<br />

Never has that been more significant than now. Connectivity<br />

changes the notion of the development process ending when<br />

a product is launched, and whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it. Responding to those requirements<br />

places new emphasis on the need for an automated solution,<br />

both during the development lifecycle and beyond.<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

CONTACT DETAILS<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Richardson<br />

Lead Field Applications Engineer<br />

E:mark.richardson@ldra.com<br />

9 https://www.iso.org/news/2012/01/Ref1499.html<br />
10 IEC 61508:2010 Functional safety of electrical/electronic/programmable electronic safety-related<br />
systems<br />



A lean process for Ariane 6 Flight Software<br />

Development<br />

Philippe Gast<br />

Avionics Architecture & Flight Software<br />

ArianeGroup<br />

Les Mureaux, France<br />

philippe.gast@ariane.group<br />

Abstract—This paper addresses the method deployed in<br />

ArianeGroup to master the transition from System Functional<br />

engineering to Flight Software development.<br />

Keywords—Critical Software; realtime; Model Based System<br />

Engineering; SysML; Functional Design; Ada2012<br />

I. INTRODUCTION<br />

The current challenging context of Space Systems<br />
development (more competitors, less budget) requires putting in<br />
place innovative engineering processes and methods in order to<br />
be efficient throughout the development, while keeping these<br />
systems at the necessary quality level. Efficiency means:<br />
To develop right first time,<br />
To reduce the development duration and hence the cost.<br />
This paper shows how ArianeGroup has defined a Model-<br />
Based System Engineering (MBSE) method which eases the<br />
transition from System to Software, improves the consistency<br />
of the system definition and enables early detection of errors.<br />

The objective of this work is to generate, from a single model<br />

shared between the System and the Software teams, parts of<br />

documents (System Definition Files, Flight Software<br />

Specification) and a part of the Flight Software code.<br />

After this introduction, Section 2 will quickly present the<br />

Ariane 6 launcher. Section 3 will present the engineering<br />

method. Section 4 will describe the main Software Design<br />

choices consistent with the System definition concepts. Finally,<br />

Section 5 will show how the method has been implemented<br />

through the usage of “off-the-shelf” and “in-house” tools. Section<br />
6, as a conclusion, will present the current feedback from using<br />

this approach.<br />

II. ARIANE 6 LAUNCHER PRESENTATION AND SPECIFICITIES<br />

Ariane 6 [1] is a 3-stage launcher, versatile (2 or 4 boosters)<br />

for various missions (mono/multi boost on different types of<br />

orbits). Ariane 6 is Fail Operational (FO).<br />

Fig. 1. Ariane 6 overview<br />

The main functions embedded in the Flight Software are:<br />
Flight Control<br />
Propulsion (engine management, tank management)<br />
Mission management<br />
Service functions: pyrotechnic igniter, measurement<br />
acquisition system, telemetry, valve control<br />

with the following specificities:<br />
Mix of cyclic processing (e.g. Flight Control commands)<br />
and sequential processing (engine ignition, stage<br />
separation),<br />
Highest control reactivity less than 10 milliseconds,<br />
Highest system reconfiguration on error less than 20<br />
milliseconds,<br />
Level B Software<br />



III. FROM SYSTEM TO SOFTWARE<br />

The perimeter covered by this paper is the functional<br />
definition of the System, as shown in the figure below.<br />

Fig. 2. Perimeter of “System to Software” engineering approach<br />

The approach is based on our own experience<br />
(development of complex systems such as Ariane 5 and the<br />
Automated Transfer Vehicle (ATV)) and on the results [2] of<br />
ESA analyses related to the software crisis (beginning of the<br />
2000s). The main objective we targeted when putting in place<br />
the “System to Software” engineering process was the<br />
following: to improve the capture of system requirements<br />
allocated to the software. The process is based on the Functional Unit concept<br />
software. The process is based on the Functional Unit concept<br />

as a support to the System functional design activity.<br />

A. The Functional Unit approach<br />

The Functional Unit concept is to perform a breakdown of<br />

the functional architecture into a number of clearly defined<br />

parts, with high internal coherence and low coupling between<br />

them. Each part, designated as “Functional Unit”, represents a<br />

set of Hardware and Software products mutually coherent to<br />

provide dedicated functionalities and services. Functional Unit<br />

are managed by the Launcher Management Software function.<br />

Their control & command is based on “Finite State Machines”<br />

defining modes and configurations of each Functional Unit.<br />

The Functional Unit approach makes it possible:<br />

To map functions on products in a coherent way,<br />

To trace software functional requirements directly<br />

from Functional Unit requirements (System need),<br />

To design the related software with a clear system<br />

design definition,<br />

To manage the development in a modular way,<br />

To build a verification approach based on Functional<br />

Unit requirements with a clear identification of test<br />

objectives,<br />

To facilitate the definition of the related integration<br />

tests and operations at system level based on<br />

Functional Unit design.<br />

B. Functional Design<br />

The functional architecture of the launcher is mapped onto<br />

the Hardware and Software products (physical architecture) as<br />

a result of the Launcher system design activities. This mapping<br />

is organized in the following way:<br />

A set of Functional Units, each of which gathers the<br />
equipment and software items that provide specific<br />
services and capabilities, with an objective of high<br />
consistency within a Functional Unit perimeter and low<br />
coupling (interfaces) between Functional Units.<br />

In addition to the Functional Units, a Launcher<br />

Management function is in charge of the management<br />

of those Functional Units according to the on-going<br />

operation (on-ground control, flight mission). Launcher<br />
Management is in charge of the on-board contribution<br />
to the management of the Launcher operational life, its<br />

hardware and software configuration, together with the<br />

management of the functional interfaces with the<br />

ground.<br />

The Launcher Management acts as “a conductor” for the<br />

different Functional Units.<br />

Fig. 3. General Launcher functional architecture<br />

C. Launcher Management Details<br />

The Launcher Management is itself a specific software<br />
item: it provides specific services and is the conductor of the<br />
Functional Units. It is composed only of software. The<br />
Launcher Management function is in charge of the on-board<br />
contribution to the management of the operations and of the<br />
configuration of the launcher. It is made up of:<br />

1) The Mission Management provides the on board<br />

capability of executing sequences of automatic operations as a<br />

set of pre-defined and scheduled commands stored in the<br />

Mission Plan, meaning:<br />

On board management of the Mission Plan according<br />

to ground commands (if any) for: enabling/disabling a<br />

plan.<br />



Detecting the occurrence of the mission events that<br />
enables the execution of the Mission Plan commands.<br />

Executing the current active plan (nominal,<br />

contingency), generating commands towards the<br />

Launcher Mode Management for:<br />

o Changes of launcher modes (Launcher<br />

commands), as needed by the execution of<br />

specific operations (e.g.: switch between<br />

Hardware capacities)<br />

o Functional Unit commands: to set<br />

Functional Unit modes and configuration<br />

according to the command<br />

o Jump to an alternative plan on request from<br />

Launcher Mode Management for Alarm<br />

recovery or from the TC path if any.<br />

2) The Launcher Mode Management, which provides the capability to:

- Set the launcher mode to the one required by the current operations, according to the launcher commands from the Mission Management when in AUTONOMOUS mode, or from the ground when in MANUAL mode,
- Execute launcher commands: sequencing of the different commands (Functional Unit commands) defined in the launcher sequence related to the launcher mode transition. A launcher command execution cannot be interrupted by another command (e.g. a failure recovery command),
- Execute Functional Unit commands: sequencing of the different steps of the commanding sequence related to the Functional Unit mode transition. The execution of a Functional Unit command cannot be interrupted by another command (e.g. a failure recovery command),
- Monitor the configuration of the launcher.

The Launcher Management and the Functional Unit services together provide all the services required to fulfil the mission, at two different levels, as shown in the figure below.

Fig. 4. Launcher Management<br />
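The atomic, step-by-step command execution described above can be sketched as follows. This is an illustrative sketch in Python (the flight software itself is written in Ada, and none of these names come from the real system): a launcher command runs its Functional Unit commands to completion, one per cycle, and competing requests are queued rather than allowed to interrupt it.

```python
# Hypothetical sketch of non-interruptible launcher command sequencing.
class LauncherCommand:
    def __init__(self, name, fu_commands):
        self.name = name
        self.steps = list(fu_commands)  # ordered Functional Unit commands
        self.index = 0

    def done(self):
        return self.index >= len(self.steps)

    def step(self):
        """Execute one Functional Unit command of the sequence."""
        cmd = self.steps[self.index]
        self.index += 1
        return cmd


class LauncherModeManagement:
    def __init__(self):
        self.active = None   # the launcher command currently executing
        self.pending = []    # commands waiting (e.g. failure recovery)

    def request(self, command):
        # A running launcher command is atomic: new requests are queued,
        # never allowed to interrupt the current sequence.
        if self.active is None:
            self.active = command
        else:
            self.pending.append(command)

    def cycle(self):
        """Called once per basic task cycle; executes at most one step."""
        if self.active is None and self.pending:
            self.active = self.pending.pop(0)
        if self.active is not None:
            executed = self.active.step()
            if self.active.done():
                self.active = None
            return executed
        return None
```

A recovery command requested while a mode transition is running is thus simply deferred until the running sequence completes, which is the behaviour the paper specifies for both launcher commands and Functional Unit commands.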

D. Functional Unit details

Functional Unit features consist of:

- A hardware architecture, meaning equipment or parts of equipment (sensors, avionics processing, actuators) supporting the related Functional Unit services.
- A software part in charge of the Functional Unit management and of the Functional Unit algorithms interfacing with the hardware part (for commands and acquisitions). The Functional Unit also provides software services which are run only when the Functional Unit is in a steady-state mode; these are not part of the Launcher Management, but their scope is nevertheless recalled hereafter for the sake of completeness:
  o Processing of measurements generated by the Functional Unit hardware,
  o Internal regulation loops for the Functional Unit,
  o Processing of commands generated by the Functional Unit,
  o Detection of mission events used by the Mission Management in the execution of the Mission Plan.

Each Functional Unit software part provides the following functionalities:

- "Execute Functional Unit commands", which gathers the acyclic processes of the Functional Unit; it settles the mode/configuration of the Functional Unit on receipt of commands from the Launcher Mode Management,
- "Execute Functional Unit processing", which gathers all the Functional Unit cyclical functions, such as:
  o Processing measurements generated by the Functional Unit hardware,
  o Internal regulation loops for the Functional Unit (e.g. pressurization regulation, thrust positioning, etc.),
  o Processing commands generated by the Functional Unit,
  o Monitoring the Functional Unit (e.g. hardware status, voltage, current, etc.); it generates alarms in case it detects a failure occurring to a piece of equipment or to a software algorithm being part of the function it manages,
  o Generating mission events to the Mission Management,
  o Providing the electrical and functional status of equipment (if any) related to the Functional Unit,
  o Providing telemetry data related to the Telemetry Functional Unit.



Fig. 5. Functional unit<br />

IV. MAIN SOFTWARE DESIGN CHOICES RELATED TO SYSTEM DEFINITION APPROACH

The Ariane 6 Flight Software is built on strong design principles (already applied on the Automated Transfer Vehicle system), fully consistent with the Functional Unit approach applied for the system functional definition.

A. Synchronous design with time-triggered communication system

The multi-tasking scheduling of the Ariane 6 Flight Software uses a Rate Monotonic Scheduling (RMS) policy; this permits a synchronous software design. More precisely, the Ariane 6 Flight Software is composed of a main task (the lowest-priority task, acting as the background task), one basic cyclic task (the task with the lowest activation period and highest priority, whose cyclical activation is synchronized with the communication system) and a set of harmonic cyclic tasks (higher activation period / lower priority tasks whose cyclical activation is controlled by the basic cyclic task). Each task shall fulfil its deadline (processing terminated before the end of the period).

Notice that all acyclic actions (i.e. launcher commands, Functional Unit commands, failure recovery) are executed in the basic cyclic task in a discrete way (one step of the acyclic action is executed in one basic task cycle), with a unique specified maximum execution time for one step (this greatly eases the mastering of the Central Processing Unit (CPU) budget). Flight Software reactivity to an asynchronous event is defined by specifying the maximum allowed number of steps related to an acyclic action.
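The discrete execution of acyclic actions can be illustrated with a short sketch (Python for illustration only; every name here is invented, not taken from the flight software): the basic cycle runs its cyclic processing every time, then executes at most one bounded step of the pending acyclic action, so the worst-case cost of a cycle, and the reactivity in cycles, stay bounded by construction.

```python
# Illustrative sketch: one acyclic step per basic cycle, with a declared
# maximum number of steps acting as the reactivity budget.
class AcyclicAction:
    def __init__(self, name, steps, max_steps):
        # A step list longer than the budget would violate the specified
        # reactivity, so it is rejected at construction time.
        assert len(steps) <= max_steps, "action exceeds its step budget"
        self.name = name
        self.steps = list(steps)


def basic_cyclic_task(cyclic_jobs, action_queue, n_cycles):
    """Run n_cycles of the basic cycle; return the per-cycle acyclic log."""
    log = []
    for _ in range(n_cycles):
        for job in cyclic_jobs:      # cyclic processing, every cycle
            job()
        if action_queue:             # at most ONE acyclic step per cycle
            action = action_queue[0]
            log.append(action.steps.pop(0))
            if not action.steps:
                action_queue.pop(0)  # action completed
        else:
            log.append(None)         # idle slot: margin in the CPU budget
    return log
```

Because each cycle executes a fixed amount of cyclic work plus at most one bounded acyclic step, the CPU budget per cycle can be verified statically, which is the point the paper makes about budget mastering.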

The communication system of Ariane 6 is based on Time-Triggered Ethernet (TTE) technology. The figure below shows the principle of Ariane 6 communication frame building and its synchronisation with the Flight Software to manage avionics input/output:

- The TTE cluster cycle is designed to contain only one occurrence of each possible TT message. In other words, each TT message is supposed to have the same period, equal to the duration of the TTE cluster cycle, even if oversampling may be used (the period of some TT messages will then indeed be a multiple of the TTE cluster cycle). This makes it possible to have a simpler TTE communication network configuration, which is also independent from the definition of the bus frame. Thus, modifications in the bus frame definition will only impact the configuration of the Ariane 6 middleware and/or the configuration of the multi-task sequencer (harmonic cyclic task definitions).
- A major frame made of a finite number of minor frames will be defined, each minor frame being made from a subset of all the possible TT messages, according to the required period and reactivity of the various functions. Moreover, the duration of a minor frame will be equal to the duration of the TTE cluster cycle.
- Thanks to the TTE start-of-cluster-cycle event, the Ariane 6 Flight Software Basic Cyclic Task (BCT) will be activated once per minor frame (strong synchronization between software and hardware). The basic cyclic task will in turn activate a finite number of harmonic cyclic tasks whose activation period will be defined according to the bus frame definition.

A cyclic task will receive/send TT messages according to the Last frame In / Next frame Out (LINO) principle. This principle can be summarized as:

- Every TT input message will be taken from the preceding minor cycle.
- Every TT output message will be addressed to the next minor cycle.
- No operation will be performed on the TT messages transmitted in the current minor cycle.

Note that any cyclic task may access these TT input/output messages using a dedicated task-by-task data access mechanism.
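The LINO principle amounts to a triple-slot buffer per TT message; a minimal sketch, with invented names (the real mechanism is part of the Ariane 6 middleware and is not described in this paper):

```python
# Illustrative sketch of LINO: each TT message has three slots - the frame
# received last cycle (read side), the frame on the bus this cycle (never
# touched by the application), and the frame to send next cycle (write side).
class TTMessagePort:
    def __init__(self):
        self.last_in = None      # input taken from the preceding minor cycle
        self.in_transit = None   # current minor cycle: neither read nor written
        self.next_out = None     # output addressed to the next minor cycle

    def read(self):
        """Application-side read: always the previous cycle's input."""
        return self.last_in

    def write(self, value):
        """Application-side write: goes out in the next cycle."""
        self.next_out = value

    def rotate(self, received):
        """Called by the bus driver at each start-of-cluster-cycle event."""
        self.last_in = received          # what arrived in the previous frame
        self.in_transit = self.next_out  # what goes onto the bus now
        self.next_out = None
        return self.in_transit
```

The benefit is that application tasks never race with the bus: reads and writes always target slots that the communication system is not using in the current minor cycle.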

In the example below:

- One minor frame = one TTE cluster cycle; one major frame = 2 minor frames,
- The Basic Cyclic Task is activated once per minor frame, and there is one Harmonic Cyclic Task (HCT) activated by the BCT once per major frame,
- Inside the BCT minor cycle n, the BCT will take the TT input message D transmitted in the BCT minor cycle n-1 and will address the TT output message A transmitted in the BCT minor cycle n+1,
- Inside the HCT minor cycle n, the HCT will take the TT input message B transmitted in the HCT minor cycle n-1 and will address the TT output message C transmitted in the HCT minor cycle n+1.



Fig. 8. An example of SysML statechart<br />

Fig. 6. Software scheduling and synchronization with communication bus<br />

V. IMPLEMENTATION OF THE METHOD

Several modelling languages and associated tools have been selected or developed to support the engineering process from functional definition to code.

A. Functional definition

The language chosen for the formalization of the Functional Unit definition is SysML [5] (using the Rhapsody tool; cf. [4]). The figure below shows the perimeter covered by SysML: the language is used to formalize the functional design related to the software part of the Functional Unit.

Fig. 7. Models perimeter

More precisely, the following are formalized in SysML:

- The Functional Unit modes and configuration (SysML statechart),
- The Functional Unit commands and associated transitions (SysML statechart),
- The Functional Unit processing/monitoring (SysML blocks) and the associated activation conditions (Domain Specific Language in the SysML model),
- The dataflows (SysML ports, interface blocks + flow properties) between software blocks.

Fig. 9. SysML Internal Block Diagram

The language used to specify the mission is a Domain Specific Language. This textual language permits, at functional level, the specification of mission plans and launcher sequences.

A Mission Plan is built using instructions which permit to:

- Execute commands,
- Jump to another plan,
- Monitor events,
- Wait (for a duration, an event raising, or a Boolean condition),
- Execute "if … then … else" statements.


Fig. 10. A mission plan

A launcher sequence (example provided below) is built using instructions which permit to:

- Execute commands (in parallel or not),
- Wait (for a duration, or for the end of a command execution).

Fig. 11. A launcher sequence

Notice that a launcher sequence looks like a mission plan; the main differences are:

- In a mission plan, commands cannot be executed in parallel, while this is authorized in a launcher sequence,
- A launcher sequence is considered atomic: it cannot be interrupted by another command request (e.g. a failure recovery command).

B. Design

Real-time design is also supported by the VASCO Domain Specific Language. At this level of the engineering process, VASCO is used to formalize the tasking and the order of subprogram calls, as shown in the figure below.

Fig. 12. Real time design model

C. Coding

An in-house suite of tools takes data from the different functional and design models (SysML and DSL) and generates part of the Flight Software code; this covers:

- Implementation of the data flows using data buffers (taking into account the multi-threaded real-time design of the Flight Software),
- Implementation of the different threads and of the associated sequencer of subprograms,
- Instantiation of in-house building blocks (which implement the Functional Unit and Launcher Management generic mechanisms) to implement:
  o Mission plans and launcher sequences,
  o And, for each Functional Unit: the finite state machine (execution of commands and related transitions), the processing/monitoring activation conditions, the interfaces with hardware, and telemetry.

The figure below provides an overview of the automatic code generation chain.

Fig. 13. Automatic Code Generation chain

Finally, the code which shall be hand-written is:

- The algorithms for each processing/monitoring,
- The algorithms for transitions between modes/configurations (sequence of commands to configure hardware, setting of software data).
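The generic finite-state-machine building block that such a generator instantiates for each Functional Unit might look like the following sketch. This is a hypothetical illustration in Python (the generated flight code is Ada, and all names here are invented):

```python
# Illustrative sketch of a generated Functional Unit state machine: command
# execution drives mode transitions, and illegal commands are rejected.
class FunctionalUnitFSM:
    def __init__(self, name, transitions, initial):
        self.name = name
        self.state = initial
        # transitions: (current_mode, command) -> new_mode
        self.transitions = transitions

    def execute(self, command):
        """Execute a Functional Unit command; reject it if no transition is
        defined from the current mode (configuration consistency check)."""
        key = (self.state, command)
        if key not in self.transitions:
            raise ValueError(f"{self.name}: {command} illegal in {self.state}")
        self.state = self.transitions[key]
        return self.state
```

Because the transition table is generated directly from the SysML statechart, the mode logic in the code stays consistent with the functional definition by construction; only the algorithms attached to each mode remain hand-written.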

VI. FIRST FEEDBACK

The method presented in this paper has been applied on the Ariane 6 project for more than one year. It has confirmed that:

- Co-engineering between the system and software teams is a success: the functional design is much more mature and the interfaces are well mastered when starting product development. Several definition problems have been detected early.
- Modifications in the system definition can be quickly implemented, thanks to modelling and automatic code generation.



The lessons learned are the following:

- Training the team on the method is essential, as is providing support all along the development.
- Modelling guidelines and rules shall be defined before starting the project, and tools dedicated to rule checking shall be developed; it is very important to maintain the modelling rules and the associated tools all along the development. Do not underestimate the workload of these activities.
- Considering the long life of the Flight Software (several decades), it is important to implement the method in a way that is as independent as possible of commercial products (e.g. the modelling tool).
- Models shall be managed in configuration control.

To conclude, the method applied on Ariane 6 for Flight Software development has shown its efficiency; and, as this method uses sufficiently general concepts for functional engineering formalization, it can be used for any type of software-intensive system development.

REFERENCES

[1] Ariane 6, https://en.wikipedia.org/wiki/Ariane_6
[2] Software crisis, ESA Board for Software Standardisation and Control (BSSC), ESTEC, 10-11/02/2005
[3] Ada 2012, http://www.ada2012.org/
[4] Rhapsody, http://www-03.ibm.com/software/products/en/ratirhapfami
[5] SysML, www.omgsysml.org



The Infinite Software Development Lifecycle of Connected Systems

Mark W. Richardson
Lead Field Application Engineer
LDRA
Wirral, UK
mark.richardson@ldra.com

I. INTRODUCTION

Anyone familiar with functional safety standards such as DO-178C 1, IEC 61508 2, ISO 26262 3, or IEC 62304 4 will know all about the concept of bi-directional traceability of requirements, and the need to ensure that the design reflects the requirements, that the software implementation reflects the design, and that the test processes confirm the correct implementation of that software.

Anyone used to developing safety-critical software applications will also be familiar with how painful a change of requirements can be, because of the need to identify the code to be changed, and to then identify any testing to be repeated.

Until now, that cycle has concluded with product release. Sure, there might be tweaks in response to field conditions, but the business of development is then essentially over.

Then came the connected car, the Industrial Internet of Things, and the remote monitoring of medical devices. For these or any other connected systems, requirements don't just change in an orderly manner during development. They change without warning - whenever some smart alec finds a new vulnerability, develops a new hack, or puts your car into a ditch. And they keep on changing, not just throughout development but for as long as the product is out in the field, changing the significance and emphasis of the product maintenance phase.

This paper outlines how next-generation automated management and requirements traceability tools and techniques can create relationships between requirements, code, static and dynamic analysis results, and unit- and system-level tests. It demonstrates how linking these elements enables the entire software development cycle to become traceable, making it easy for teams to identify problems and implement solutions faster and more cost-effectively. And it highlights how such linked elements are even more important after product release, presenting a vital competitive advantage in dealing with the sinking feeling that starts with the message "We've been hacked".

II. PROCESS OBJECTIVES

The ISO 26262 automotive functional safety standard serves as an example, but the principles discussed apply equally to the other safety-critical industries and standards. Although terminology varies, a key element common to all such standards is the practice of allocating technical safety requirements in the system design specification, and developing that design further to derive an item integration and testing plan. This applies to all aspects of the system, with the explicit subdivision of hardware and software development practices being dealt with as the lifecycle progresses.

The relationship between the system-wide ISO 26262-4:2011 and the software-specific sub-phases found in ISO 26262-6:2011 can be represented in a V-model (Figure 1). Each of those steps is explained further in the following discussion.

1 RTCA DO-178C "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org
2 IEC 61508-1:2010, Functional safety of electrical/electronic/programmable electronic safety-related systems
3 ISO 26262:2011, Road vehicles - Functional safety
4 IEC 62304, Medical device software - Software life cycle processes, Consolidated Version, Edition 1.1, 2015-06



Figure 1 - Software-development V-model with cross-references to ISO 26262 and standard development tools

System design (ISO 26262-4:2011 section 7)

The products of this system-wide design phase potentially include CAD drawings, spreadsheets, textual documents and many other artefacts, and clearly a variety of tools can be involved in their production. This phase also sees the technical safety requirements refined and allocated to hardware and software. Maintaining traceability between these requirements and the products of subsequent phases generally causes a project management headache.

The tools for requirements management can range from a simple spreadsheet or Microsoft Word document to a purpose-designed requirements management tool such as IBM Rational DOORS Next Generation 5 or Siemens Polarion REQUIREMENTS 6. The selection of appropriate tools will help in the maintenance of bi-directional traceability between phases of development, as discussed later.

Specification of software safety requirements (ISO 26262-6:2011 section 6)

This sub-phase focuses on the specification of software safety requirements to support the subsequent design phases, bearing in mind any constraints imposed by the hardware. It provides the interface between the product-wide system design of ISO 26262-4:2011 and the software-specific ISO 26262-6:2011, and details the process of evolution of the lower-level, software-related requirements. It will most likely involve the continued use of the requirements management tools discussed in relation to the system design phase.

Software architectural design (ISO 26262-6:2011 section 7)

There are many tools available for the generation of the software architectural design, with graphical representation of that design an increasingly popular approach. Appropriate tools are exemplified by MathWorks® Simulink®, IBM® Rational® Rhapsody®, and ANSYS® SCADE.

Static analysis tools contribute to the verification of the design by means of control and data flow analysis of the code derived from it, providing graphical representations of the relationships between code components for comparison with the intended design (Figure 2).

Figure 2 - Graphical representation of Control and Data Flow as depicted in the LDRA tool suite

A similar approach can also be used to generate a graphical representation of legacy system code, providing a path for additions to it to be designed and proven in accordance with ISO 26262 principles.

Software unit design and implementation (ISO 26262-6:2011 section 8)

Coding rules: The illustration in Figure 3 is a typical example of a table from ISO 26262-6:2011. It shows the coding and modelling guidelines to be enforced during implementation, superimposed with an indication of where compliance can be confirmed using automated tools.

5 IBM® Rational® DOORS®, http://www-03.ibm.com/software/products/en/ratidoor
6 Siemens Polarion® REQUIREMENTS™, https://polarion.plm.automation.siemens.com/products/polarion-requirements



These guidelines combine to make the resulting code more reliable, less prone to error, easier to test, and/or easier to maintain. Peer reviews represent a traditional approach to enforcing adherence to such guidelines, and although they still have an important part to play, automating the more tedious checks using tools is far more efficient, less prone to error, repeatable, and demonstrable.

Figure 3 - Mapping the capabilities of the LDRA tool suite to "Table 6: Methods for the verification of the software architectural design" specified by ISO 26262-6:2011 7

ISO 26262-6:2011 highlights the MISRA 8 coding guidelines language subsets as an example of what could be used. There are many different sets of coding guidelines available, but it is entirely permissible to use an in-house set, or to manipulate, adjust and add to one of the standard sets to make it more appropriate for a particular application (Figure 4).

Figure 4 - Highlighting violated coding guidelines in the LDRA tool suite

Software architectural design and unit implementation: Establishing appropriate project guidelines for coding, architectural design and unit implementation are clearly three discrete tasks, but software developers responsible for implementing the design need to be mindful of them all concurrently.

As for the coding guidelines before them, the guidelines relating to software architectural design and unit implementation are founded on the notion that they make the resulting code more reliable, less prone to error, easier to test, and/or easier to maintain. For example, architectural guidelines include:

- Restricted size of software components and restricted size of interfaces, recommended not least because large, rambling functions are difficult to read, maintain, and test - and hence more susceptible to error.
- High cohesion within each software component. High cohesion results from the close linking between the modules of a software program, which in turn impacts how rapidly it can perform the different tasks assigned to it.

Figure 5 - Output from control and data coupling analysis as represented in the LDRA tool suite

Static analysis tools can provide metrics to ensure compliance with the standard, such as complexity metrics as a product of interface analysis, cohesion metrics evaluated through data object analysis, and coupling metrics via data and control coupling analysis (Figure 5).

More generally, static analysis can help to ensure that the good practices required by ISO 26262:2011 are adhered to, whether they are coding rules, design principles, or principles for software architectural design. In practice, for developers who are newcomers to ISO 26262, the role of such a tool often evolves from a mechanism for highlighting violations to a means of confirming that there are none.

Software unit testing (ISO 26262-6:2011 section 9) and software integration and testing (ISO 26262-6:2011 section 10)

Just as static analysis techniques (involving an automated "inspection" of the source code) are applicable across the sub-phases of coding, architectural design and unit implementation, dynamic analysis techniques (involving the execution of some or all of the code) are applicable to unit, integration and system testing. Unit testing is designed to focus on particular software procedures or functions in isolation, whereas integration testing ensures that safety and functional requirements are met when units are working together in accordance with the software architectural design.

7 Based on table 6 from ISO 26262-6:2011, Copyright © 2015 IEC, Geneva, Switzerland. All rights acknowledged.
8 MISRA - The Motor Industry Software Reliability Association, https://www.misra.org.uk/
design.<br />

The ISO 26262-6:2011 tables list techniques and metrics for performing unit and integration tests on target hardware, to ensure that the safety and functional requirements are met and that software interfaces are verified at the unit and integration levels. Fault injection and resource tests further prove robustness and resilience and, where applicable, back-to-back testing of model and code helps to prove the correct interpretation of the design. The artefacts associated with these techniques provide both a reference for their management and evidence of their completion. They include the software unit design specification, test procedures, verification plan and verification specification. On completing each test procedure, pass/fail results are reported and compliance with requirements is verified appropriately.

Should changes become necessary - perhaps as a result of a failed test, or in response to a requirement change from a customer - then all impacted unit and integration tests would need to be re-run (regression tested), automatically re-applying those tests through the tool to ensure that the changes do not compromise any established functionality.

ISO 26262:2011 does not require that any of the tests it promotes deploy software test tools. However, just as for static analysis, dynamic analysis tools help to make the test process far more efficient, especially for substantial projects.

Figure 6 - Performing requirements-based unit testing using the LDRA tool suite

The example in Figure 6 shows how the software interface is exposed at the function scope, allowing the user to enter inputs and expected outputs to form the basis of a test harness. The harness is then compiled and executed on the target hardware, and the actual and expected outputs are compared. Unit tests become integration tests as units are introduced as part of a call tree, rather than being "stubbed". Exactly the same test data can be used to validate the code in both cases.

Boundary values can be analysed by automatically generating a series of unit test cases, complete with associated input data. The same facility also allows the definition of equivalence boundary values such as the minimum value, the value below the lower partition value, the lower partition value, the upper partition value, and the value above the upper partition boundary.

Figure 7 - Examples of representations of structural coverage within the LDRA tool suite
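The idea behind such automatic boundary-value generation can be sketched in a few lines. This is a simplified illustration (the function and parameter names are assumptions, not the LDRA tool suite's API): for an integer input partition [lower, upper], it emits the classic boundary candidates named above.

```python
# Illustrative sketch: generating equivalence-partition boundary test values
# for an integer input range, as a unit test facility might.
def boundary_values(lower, upper, minimum=None):
    """Return the named boundary candidates for the partition [lower, upper]."""
    values = {
        "below lower partition value": lower - 1,
        "lower partition value": lower,
        "upper partition value": upper,
        "above upper partition value": upper + 1,
    }
    if minimum is not None:
        values["minimum value"] = minimum
    return values
```

Each generated value then seeds one unit test case, with the values just outside the partition exercising the defensive handling of out-of-range inputs.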

Structural coverage metrics: In addition to showing that the software functions correctly, dynamic analysis is used to generate structural coverage metrics. In conjunction with the coverage of requirements at the software unit level, these metrics provide the necessary data to evaluate the completeness of test cases and to demonstrate that there is no unintended functionality (Figure 7). Metrics recommended by ISO 26262:2011 include function, call, statement, branch and MC/DC coverage.

Unit and system test facilities can operate in tandem, so that (for instance) coverage data can be generated for most of the source code through a dynamic system test, and then be complemented using unit tests to exercise, for example, any defensive constructs which are inaccessible during normal system operation.

Bi-directional traceability (ISO 26262-4:2011 and ISO 26262-6:2011)

Bi-directional traceability runs as a principle throughout ISO 26262:2011, with each development phase required to accurately reflect the one before it. In theory, if the exact sequence of the V-model is adhered to, then the requirements will never change and the tests will never throw up a problem. But life's not like that.

Consider, then, what happens if there is a code change in response to a failed integration test, perhaps because the requirements are inconsistent or there is a coding error. What other software units were dependent on the modified code?

Such scenarios can quickly lead to situations where the<br />

traceability between the products of software development<br />

falls down. Once again, while it is possible to maintaining<br />

traceability manually, automation helps a great deal.<br />

Software unit design can take many forms – perhaps in the<br />

form of a natural language detailed design document, or<br />

perhaps model based. Either way, these design elements<br />

need to be bi-directionally traceable to both software safety<br />

requirements and the software architecture. The software<br />

units must then be implemented as specified and then be<br />

traceable to their design specification.<br />

Automated requirements traceability tools are used to<br />

establish traceability between requirements and tests cases<br />

of different scopes, which allows test coverage to be<br />

assessed (Figure 8). The impact of failed test cases can be<br />

assessed and addressed, as can the impact of requirements<br />

changes and gaps in requirements coverage. And artefacts<br />

such as traceability matrices can be automatically<br />

generated to present evidence of compliance to ISO<br />

26262:2011.<br />

Figure 8 - Performing requirement based testing. Test<br />

cases are linked to requirements and executed within the<br />

LDRA tool suite<br />
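The generated traceability matrix essentially tabulates requirement-to-test links and flags gaps. A minimal sketch of the idea follows; the identifiers and the hand-written table are invented, whereas a real tool derives the links from the project database.<br />

```python
# Minimal sketch of auto-generating a traceability matrix from
# requirement-to-test-case links (all data is hypothetical).

links = {
    "SYS-001": ["TC-101", "TC-102"],
    "SYS-002": ["TC-103"],
    "SYS-003": [],            # gap: no test case traces here
}

def traceability_matrix(links):
    """Return (requirement, linked tests) rows, flagging gaps."""
    rows = []
    for req in sorted(links):
        tests = ", ".join(links[req]) or "UNCOVERED"
        rows.append((req, tests))
    return rows

for req, tests in traceability_matrix(links):
    print(f"{req:8} -> {tests}")
```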

In practice, initial structural coverage is usually accrued as<br />
part of this holistic process from the execution of<br />
functional tests on instrumented code, leaving unexecuted<br />

portions of code which require further analysis. That<br />

ultimately results in the addition or modification of test<br />

cases, changes to requirements, and/or the removal of dead<br />

code. Typically, an iterative sequence of review, correct<br />

and analyse ensures that design specifications are satisfied.<br />

III.<br />

THE INFINITE DEVELOPMENT LIFECYCLE<br />

When such changes become necessary, revised code needs<br />

to be reanalysed statically, and all impacted unit and<br />

integration tests need to be re-run (regression tested).<br />

Although that can result in a project management nightmare<br />

at the time, in an isolated application the need to support<br />

such occurrences lasts little longer than the time the product<br />

is under development.<br />

But connectivity demands the ability to respond to<br />

vulnerabilities identified in the field. Each newly discovered<br />

vulnerability implies a changed or new requirement, and<br />

one to which an immediate response is needed – even<br />

though the system itself may not have been touched by<br />

development engineers for quite some time. In such<br />

circumstances, being able to isolate what is needed and<br />

automatically test only the functions implemented becomes<br />

something much more significant.<br />

Whenever a new vulnerability is discovered, there is a<br />

resulting change of requirement to cater for it, coupled with<br />

the additional pressure of knowing that a speedy response<br />

could be critically important if products are not to be<br />

compromised in the field.<br />

Automated bi-directional traceability links requirements<br />

from a host of different sources through to design, code<br />

and test. The impact of any requirements changes – or,<br />

indeed, of failed test cases - can be assessed by means of<br />

impact analysis, and addressed accordingly. And artefacts<br />

can be automatically re-generated to present evidence of<br />

continued compliance to the functional safety standard of<br />

choice.<br />
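The impact analysis described above amounts to walking the trace links downstream from a change. A deliberately small sketch, with an invented trace graph, shows the mechanics:<br />

```python
# Sketch of impact analysis over trace links: starting from a changed
# requirement, walk the links to find every dependent artefact.
# The graph below is illustrative only.

trace = {
    "HLR-1": ["LLR-1", "LLR-2"],
    "LLR-1": ["src/flap.c", "TC-11"],
    "LLR-2": ["src/flap.c", "TC-12"],
    "HLR-2": ["LLR-3"],
    "LLR-3": ["src/lamp.c", "TC-21"],
}

def impacted(artefact, graph):
    """Return all artefacts reachable downstream of a change."""
    seen, stack = set(), [artefact]
    while stack:
        node = stack.pop()
        for child in graph.get(node, []):
            if child not in seen:
                seen.add(child)
                stack.append(child)
    return seen

# A change to HLR-1 touches both low-level requirements, one source
# file and two test cases; HLR-2's subtree is unaffected.
print(sorted(impacted("HLR-1", trace)))
```

Only the tests in the impacted set need re-running, which is what makes an automated response practical.<br />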

IV.<br />

CONCLUSIONS<br />

Functional safety standards such as DO-178C, IEC 61508,<br />
ISO 26262 and IEC 62304 have made bi-directional<br />
traceability of requirements a familiar concept<br />
to anyone working in those industries: it ensures that the<br />
design reflects the requirements, that the software<br />
implementation reflects the design, and that the test<br />
processes confirm the correct implementation of that<br />
software.<br />

Anyone used to developing safety-critical software<br />
applications will also be familiar with how painful a<br />
change of requirements can be, with its resulting changes to<br />
design and code, and the consequent retesting.<br />

Although functional safety standards have significant<br />

contributions to make to both safety and security, there is no<br />

doubt that they bring considerable overhead with them. The<br />

application of automated tools throughout the development<br />

lifecycle can help considerably to minimize that overhead,<br />

whilst removing much of the potential for human error from<br />

the process.<br />

Never has that been more significant than now. Connectivity<br />

changes the notion of the development process ending when<br />

a product is launched, and whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it. Responding to those requirements<br />

places new emphasis on the need for an automated solution,<br />

both during the development lifecycle and beyond.<br />



COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral<br />

CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

CONTACT DETAILS<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Richardson<br />

Lead Field Applications Engineer<br />

E:mark.richardson@ldra.com<br />



Automating the maintenance of bi-directional<br />

requirements traceability<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

Although the ever improving techniques in safety- and<br />

mission-critical software development and test are proven to<br />

yield significant improvements in software quality, they come<br />

to naught if the resulting application fails to perform as<br />

expected by the stakeholders – not just functionally but also<br />

with adequate regard for safety.<br />

Small wonder then that depending on the criticality of the<br />

application, requirements traceability is obligatory for<br />

certifiable, safety-critical applications to ensure that all<br />

requirements are implemented, and that all development<br />

artefacts can be traced back to one or more requirements.<br />

When requirements are managed well, traceability can be<br />

established between each development phase. For example,<br />

the resulting bi-directional traceability demonstrates that all<br />

system level requirements have been completely addressed by<br />

high level requirements, and that all high level requirements<br />

can be traced to a valid system level requirement.<br />

Requirements traceability also encompasses the relationships<br />

between entities such as intermediate and final work products,<br />

changes in design documentation, and test plans.<br />

The principle of bi-directional traceability has been<br />

established in the avionics community since no later than 1992<br />

when the DO-178B document 1 was introduced (since<br />

succeeded by DO-178C 2 ), and the introduction of other<br />

functional safety standards such as ISO 26262 3 in the<br />

automotive industry, IEC 62304 4 in the medical device sector,<br />

and the more generic IEC 61508 5 has seen that principle<br />

embraced more widely.<br />

Although it is both a logical and laudable principle, last<br />

minute changes of requirements or code made to correct<br />

problems identified during test put such ideals into disarray.<br />

Despite good intentions, many projects fall into a pattern of<br />

disjointed software development in which requirements,<br />

design, implementation, and testing artefacts are produced<br />

from isolated phases. Such isolation results in tenuous links<br />

between requirements, the development stages, and/or the<br />

development teams.<br />

The answer to this conundrum lies in the “trace data” between<br />

development processes which sits at the heart of any project.<br />

Whether or not the links are physically recorded and managed,<br />

they still exist. For example, a developer creates a link simply<br />

by reading a design specification and using that to drive the<br />

implementation. The collective relationships between these<br />

processes and their associated data artefacts can be viewed as<br />

a Requirements Traceability Matrix, or RTM. When the RTM<br />

becomes the centre of the development process, it impacts on<br />

all stages of safety-critical application development from high-level<br />

requirements through to target-based testing.<br />

Beyond that, the nature of connectivity calls into question<br />

when the development process itself comes to an end. The<br />

advent of the connected car, the interactive medical device<br />

and Industrial IoT means that requirements can change at any<br />

time – not just during the traditional development lifecycle,<br />

but after it has been completed and even after a product’s<br />

production life is over. Any newly discovered vulnerability or<br />

actual compromise of a system implies an additional<br />

1 RTCA DO-178B, "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org<br />
2 RTCA DO-178C, "Software Considerations in Airborne Systems and Equipment Certification", http://www.rtca.org<br />
3 ISO 26262, "Road vehicles - Functional safety"<br />
4 IEC 62304, "Medical device software - Software life cycle processes", Consolidated Version, Edition 1.1, 2015-06<br />
5 IEC 61508-1:2010, "Functional safety of electrical/electronic/programmable electronic safety-related systems - Part 1: General requirements"<br />

requirement to counter it, bringing with it a new emphasis on<br />

traceability even into the product maintenance phase.<br />

II.<br />

REQUIREMENTS ARE AN ONGOING COMMITMENT<br />

How often is the requirements specification baselined and then<br />

never referred to again? Without constant reference, how can a<br />

development team be sure that it is delivering a system which<br />

meets those requirements? By constructing trace links between<br />

requirements and development components from the very<br />

beginning of a project, problems such as missing or non-required<br />

functionality will be discovered earlier and, thus, will<br />

be easier and less costly to remedy. Unfortunately there are<br />

many factors which make it difficult or tiresome to maintain<br />

reference to a project’s set of requirements.<br />

At the start of a contract, the stakeholder sets out their vision<br />

for what they want from the delivered application. The project<br />

team then works to represent that vision as a set of<br />

requirements from which development can begin. The<br />

requirements should act as a blueprint for development.<br />

However, all too often, the team’s efforts diverge from this<br />

blueprint resulting in an application which does not align with<br />

the requirements. At best the stakeholder is disappointed. At<br />

worst, the company opens itself up to litigation and costly<br />

remedial work.<br />

The key to preventing the emergence of this “requirements<br />

gap” is to place the requirements at the forefront of<br />

development. To achieve this successfully, the process should<br />

not be too intrusive and the aim should be for it to help all<br />

participants equally, avoiding bias towards any particular<br />

disciplines or development phases.<br />

As a basis for all validation and verification tasks, all high<br />

quality software must start with a definition of requirements.<br />

Each high level software requirement must map to a lower<br />

level requirement, design and implementation. The objective<br />

is to ensure that the complete system has been implemented as<br />

defined - a fundamental element of sound software<br />

engineering practice.<br />

Simply ensuring that high level requirements map to<br />

something tangible in the requirements decomposition tree,<br />

design and implementation is not enough. The complete set of<br />

system requirements comes from multiple sources, including<br />

high level requirements, low level requirements and derived<br />

requirements. As illustrated in Figure 1 below, there is<br />

seldom a 1:1 mapping from high level requirements to source<br />

code, so a traceability mechanism is required to map and<br />

record the dependency relationships of requirements<br />

throughout the requirements decomposition tree.<br />

Figure 1 - Example of “1:Many” mapping from high level<br />

requirement through to requirements decomposition tree<br />
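Because the forward links already define the decomposition, the reverse direction of each link can be derived rather than recorded twice. A brief sketch with invented identifiers:<br />

```python
# Sketch: with 1:many decomposition, bi-directional navigation can be
# derived from a single recorded forward mapping (identifiers invented).

forward = {
    "HLR-1": ["LLR-1", "LLR-2", "LLR-3"],   # one high-level, many low-level
    "HLR-2": ["LLR-3"],                     # LLR-3 satisfies two parents
}

def invert(links):
    """Build the child -> parents mapping from parent -> children."""
    back = {}
    for parent, kids in links.items():
        for k in kids:
            back.setdefault(k, []).append(parent)
    return back

print(invert(forward)["LLR-3"])
```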

To complicate matters further, each level of requirements<br />

might be captured using a different mechanism. For instance,<br />

a formal requirements capture tool might be used for the high<br />

level requirements while the low level requirements are<br />

captured in PDF and the derived requirements captured in a<br />

spreadsheet.<br />

III.<br />

THE WATERFALL PROCESS AND OTHER STORIES<br />

With the initial requirements specified, development can<br />

proceed in accordance with the specified process for the<br />

project. It is useful to consider what impact the requirements<br />

have on the chosen process, and vice versa.<br />

Back in the 80s and 90s the “Waterfall” process dominated<br />

software development. Waterfall development processes are<br />

generally broken down into four, distinct phases, as illustrated<br />

in Figure 2.<br />

Figure 2 - The real- life implementation of a project is rarely<br />

as simple as the “Waterfall” process suggests<br />

Each phase is performed in isolation with the output of one<br />

being the input for the next. The final output, in theory<br />

anyway, is a working system which passes all tests.<br />

The purpose of the analysis phase is to refine the stakeholder’s<br />

vision of the system and produce a list of requirements, with<br />

those for software being itemised in the Software<br />

218


Requirements Specification (SRS). However much a project<br />

manager may wish for the SRS to be error-free, it rarely is, and<br />

the change log begins increasing in size until a new version<br />

becomes inevitable.<br />

Contemporary software development processes and practices,<br />

such as the Iterative process, address many of the deficiencies<br />

found in the Waterfall process.<br />

Figure 3 - The "Iterative" process is just one example of how the "Waterfall"<br />
process has been refined to reflect the dynamic nature of projects and their requirements<br />
<br />
Requirements will change during the life of a project, whether due to stakeholders<br />
altering their vision or requested features proving to be unfeasible. Iterative<br />
processes (Figure 3) embrace this fact by splitting development into a number of<br />
phases (iterations) and considering only a subset of requirements during each<br />
iteration. Thus, the number of requirements subject to revision through feedback is<br />
significantly reduced; meanwhile development and refinement of those requirements<br />
not yet marked for implementation may proceed and benefit from any quality<br />
improvements applied to those requirements being implemented.<br />
<br />
Iterative processes retain the Waterfall phases, perhaps with altered names and<br />
additional disciplines added. With reference again to Figure 2 and Figure 3, the<br />
'Requirements' discipline is analogous to the 'Analysis' phase. However, the key<br />
difference is that, although most effort is invested during early iterations as we<br />
would expect, effort continues over the life of the project (albeit at a gradually<br />
reducing level).<br />
<br />
Iterations can be thought of as mini Waterfall projects. A subset of the envisioned<br />
system is selected for implementation and then taken through the phases of analysis,<br />
design, construction and test. At the end of the iteration, the subset is expanded to<br />
include additional features and a new mini Waterfall begins. This process ensures<br />
that the requirements are continually being revisited and refined, keeping them in<br />
focus and using them to drive development.<br />
<br />
IV.<br />
<br />
THE ART OF REQUIREMENTS<br />
<br />
Requirements need to be high quality. If they are too complex or cannot be easily<br />
understood, in later phases they will be difficult to follow and a lot of time will<br />
be wasted requesting modifications and refinements. Confidence in the requirements<br />
needs to be kept high, otherwise the willingness to work with them will diminish,<br />
risking divergence from the stakeholder's vision of the application. If the starting<br />
point of a project is of poor quality, then low-quality software is sure to follow.<br />
<br />
Given the overwhelming complexity of many projects, if all stakeholders are to share<br />
a common commitment to requirements then they must be understandable, unambiguous<br />
and precise. Such an environment will help alleviate scope creep and requirements<br />
churn, and will ensure that the delivered solution meets the stakeholder needs. It<br />
will also provide a mechanism to ensure adherence to any applicable software and<br />
industry standards.<br />
<br />
A. Textual Specifications<br />
<br />
Textual specifications remain a popular way to capture requirements, and although<br />
they can be highly effective there are some disadvantages. For example, the<br />
stakeholder may prefer layman's language whereas the contractor naturally leans<br />
towards technical jargon and their plans for the implementation. Furthermore, in<br />
conversational form, spoken language is inherently imprecise and prone to ambiguity.<br />
However, if a high degree of rigour is applied then such pitfalls can be overcome.<br />
One approach is to apply rules when writing requirements in much the same way as the<br />
MISRA standards are applied to C and C++ code; for example:<br />
<br />
• Use paragraph formatting to distinguish requirements from non-requirement text<br />
• List only one requirement per paragraph<br />
• Use the verb "shall"<br />
• Avoid "and" in a requirement<br />
o Consider refactoring as multiple requirements or specifying in more general terms<br />
• Avoid conditional language such as "unless" or "only if"<br />
o Such terms are likely to lead to ambiguous interpretation<br />
<br />
The use of such key words also helps if some members of the development team are<br />
less fluent in the chosen requirements language than others.<br />
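Rules like these lend themselves to automated checking, in the same spirit as running a MISRA checker over code. The sketch below is a deliberately naive requirement "linter" using simple substring matching; the rule names are invented.<br />

```python
# Hypothetical sketch of automating the requirement-writing rules above.
# Naive substring checks only; a real tool would parse properly.

RULES = [
    ("missing 'shall'",      lambda t: "shall" not in t.lower()),
    ("contains 'and'",       lambda t: " and " in t.lower()),
    ("conditional language", lambda t: any(w in t.lower()
                                           for w in ("unless", "only if"))),
]

def check_requirement(text):
    """Return the names of all rules the requirement text violates."""
    return [name for name, broken in RULES if broken(text)]

print(check_requirement("The controller shall close the valve."))
print(check_requirement("Close the valve and log the event unless idle"))
```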



B. Use Cases<br />

Use Cases 6 or User Stories offer another way to organise<br />

requirements, and to reduce ambiguous or imprecise<br />

specification. The example in figure 4 clearly shows what is<br />

expected to happen under a particular set of circumstances.<br />

The reduced dependence on natural language is particularly<br />

beneficial to international companies that do not share a<br />

common spoken language. Graphical representation of<br />

requirements switches the angle of analysis from a line-by-line,<br />

itemised list of desired features (perhaps spreading over<br />

many pages) to a user-focused view of how the system will<br />

interact with external elements and what value it will deliver.<br />

On the other hand, a disadvantage to this approach is that,<br />
whereas precise written language is sure to be universally understood<br />
by anyone fluent in it, not everyone involved in the project,<br />
particularly at its periphery, will have the inclination to learn<br />
the nuances of Use Case diagrams.<br />

Each Use Case or User Story comprises several scenarios.<br />

The first scenario as illustrated in Figure 4 is always the “basic<br />

path” or “sunny day scenario” in which the actor and system<br />

interact in a normal, error-free way.<br />

Figure 4 - This example of a "Sunny day" scenario from an "Allow Authorised Access"<br />
Use Case shows how a system is expected to behave when a valid key card is swiped<br />
<br />
As the list of scenarios is established via this end-to-end analysis, the<br />
stakeholder's vision is rigorously exercised, allowing ambiguities and problems to<br />
be ironed out. Each scenario is assigned a priority enabling the complete set to be<br />
ranked, allowing the project team to plan each iteration and select which subset of<br />
the system will be implemented.<br />
<br />
V. REQUIREMENTS MANAGEMENT AND TRACEABILITY<br />
<br />
However well the requirements are specified and their place in the development<br />
process established, a mechanism is required to ensure that they are reflected in<br />
the implementation of the project. Requirements traceability is widely accepted as a<br />
development best practice to ensure that all requirements are implemented and that<br />
all development artefacts can be traced back to one or more requirements. Like IEC<br />
61508, ISO 26262 and IEC 62304 amongst others, the DO-178C standard requires<br />
bi-directional traceability and has a constant emphasis on the need for the<br />
derivation of one development tier from the one above it. Paragraph 5.5 c typifies<br />
this when it states:<br />
<br />
"Trace Data, showing the bi-directional association between low-level requirements<br />
and Source Code, is developed. The purpose of this Trace Data is to:<br />
<br />
1. Enable verification that no Source Code implements an undocumented function.<br />
<br />
2. Enable verification of the complete implementation of the low-level<br />
requirements."<br />
<br />
The level of traceability required by standards such as this varies with the<br />
criticality of the application. For example, less critical avionics applications<br />
designated DO-178C Level D (or "DAL D") are known as "black box", meaning that there<br />
is no focus on how the software has been developed. That means there is no need to<br />
have any traceability to the source code or software architecture. It is only<br />
required that the System Software requirements are traced to the High-Level<br />
Requirements and then to the test cases, test procedures and test results.<br />
<br />
For the more demanding DO-178C levels B and C, the source code development process<br />
is considered significant and so evidence of bi-directional traceability is required<br />
from the High Level requirements to the Low Level Requirements and then to the<br />
source code.<br />
<br />
Ultimately, for level A projects, there is a need to trace beyond the source code<br />
down to the executable object code.<br />
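The two verifications that DO-178C expects Trace Data to enable reduce to set comparisons over the recorded links. A sketch over an invented requirement-to-function mapping:<br />

```python
# Sketch of the two checks enabled by Trace Data, over a toy mapping
# between low-level requirements and source functions (names invented).

llr_to_code = {
    "LLR-1": ["extend_flap"],
    "LLR-2": ["retract_flap"],
    "LLR-3": [],                      # implemented nowhere
}
functions_in_source = {"extend_flap", "retract_flap", "debug_dump"}

traced = {fn for fns in llr_to_code.values() for fn in fns}

# Check 1: no source code implements an undocumented function.
undocumented = functions_in_source - traced

# Check 2: every low-level requirement is completely implemented.
unimplemented = [r for r, fns in llr_to_code.items() if not fns]

print("undocumented:", sorted(undocumented))
print("unimplemented:", unimplemented)
```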

While bi-directional traceability is and always has been a<br />

laudable principle, last minute changes of requirements or<br />

code made to correct problems identified during test tend to<br />

put such ideals in disarray. Despite good intentions, many<br />

projects fall into a pattern of disjointed software development<br />

in which requirements, design, implementation, and testing<br />

artefacts are produced from isolated development phases.<br />

Such isolation results in tenuous links between the<br />

requirements stage and / or the development team.<br />

Processes like the waterfall and iterative examples show each<br />

phase flowing into the next, perhaps with feedback to earlier<br />

6 TechTarget Definition: use case, http://searchsoftwarequality.techtarget.com/definition/use-case<br />



phases. Traceability is assumed to be part of the relationships<br />

between phases; however, the mechanism by which trace links<br />

are recorded is seldom stated. The reality is that, while each<br />

individual phase may be conducted efficiently thanks to<br />

investment in up-to-date tool technology, these tools seldom<br />

contribute automatically to any traceability between the<br />

development tiers. As a result, the links between them become<br />

increasingly poorly maintained over the duration of projects.<br />

The Requirements Traceability Matrix (RTM) provides the<br />

solution to this problem and represents the logical extension of<br />

the required traceability between different phases. The links<br />

between phases can be ignored, or they can be acknowledged<br />

and properly managed. Either way, they are critical.<br />

Figure 5 - The RTM sits at the heart of the project, defining and describing the<br />
interaction between the design, code, test and verification stages of development.<br />
<br />
Figure 5 illustrates this alternative view of the development landscape, reflecting<br />
the importance that should be attached to the RTM. Due to this fundamental<br />
centrality, it is vital that project managers place the same priority on RTM<br />
construction and maintenance as they do on requirements management, version control,<br />
change management, modelling and testing. The RTM must be represented explicitly in<br />
any lifecycle model to emphasise its importance, as illustrated in Figure 6. With<br />
this elevated focus, it becomes the centre of the development process, impacting on<br />
all stages of design from high-level requirements through to target-based<br />
deployment.<br />
<br />
Figure 6 - The requirements traceability matrix (RTM) plays a central role in a<br />
development lifecycle model. Artefacts at all stages of development are linked<br />
directly to the requirements matrix, and changes within each phase automatically<br />
update the RTM.<br />
<br />
At the highest level, Requirements Management and Traceability tools can initially<br />
provide the ability to capture the requirements specified by standards such as the<br />
DO-178C standard. These requirements (or "objectives") can then be traced to Tier 1<br />
- the application-specific software and system requirements.<br />

These Tier 1 high-level requirements might consist of a<br />

definitive statement of the system to be developed (an<br />
aircraft flap control module, for instance) and the functional<br />

criteria it must meet (e.g., extending the flap to raise the lift<br />

coefficient). This tier may be subdivided depending on the<br />

scale and complexity of the system.<br />

Tier 2 describes the design of the system level defined by Tier<br />

1. With our flap example, the low-level requirements might<br />

discuss how the flap extension is varied, building on the need<br />

to do so established in Tier 1.<br />

Tier 3’s implementation refers to the source/assembly code<br />

developed in accordance with Tier 2. In our example, it is<br />

clear that the management of the flap extension is likely to<br />

involve several functions. Traceability of those functions back<br />

to Tier 2 requirements includes many-to-few relationships. It<br />

is very easy to overlook one or more of these relationships in a<br />

manually managed matrix.<br />

In Tier 4 host-based verification, formal verification begins.<br />

Using a test strategy that may be top-down, bottom-up or a<br />

combination of both, software stimulation techniques help<br />

create automated test harnesses and test case generators as<br />

necessary. Test cases should be repeatable at Tier 5 if<br />

required.<br />

At this stage, we confirm that the example software managing<br />

the flap position is functioning as intended within its<br />

development environment, even though there is no guarantee<br />

it will work when in its target environment. DO-178C<br />



acknowledges this and calls for the testing “to verify correct<br />

operation of the software in the target computer environment”.<br />

However, testing in the host environment first allows the<br />

target test (which is often more time consuming) to merely<br />

confirm that the tests remain sound in the target environment.<br />

In our example, we ensure in the host environment that<br />

function calls to the software associated with the flap control<br />

system return the values required of them in accordance with<br />

the requirements they are fulfilling. That information is then<br />

updated in the RTM.<br />

Our flap control system is now retested in the target<br />

environment, ensuring that the test results are consistent with<br />

those performed on the host. A further RTM layer shows that<br />

the tests have been confirmed.<br />
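Confirming host/target consistency reduces to comparing the two result sets test case by test case before the RTM records the tests as confirmed. A sketch with invented test identifiers and outcomes:<br />

```python
# Sketch of confirming target-environment results against the earlier
# host-environment run (test IDs and outcomes are illustrative).

host_results   = {"TC-11": "pass", "TC-12": "pass", "TC-13": "pass"}
target_results = {"TC-11": "pass", "TC-12": "fail", "TC-13": "pass"}

def inconsistent(host, target):
    """Test cases whose target outcome differs from the host outcome."""
    return sorted(t for t in host if target.get(t) != host[t])

print(inconsistent(host_results, target_results))
```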

VI.<br />

MAINTAINING THE REQUIREMENTS TRACEABILITY<br />

MATRIX<br />

A Requirements Traceability Matrix is a laudable aim<br />

irrespective of whether a standard insists on it. However,<br />

maintaining an RTM in a set of spreadsheets is a logistical<br />

nightmare, fraught with the risk of error and permanently<br />

lagging the actual project status.<br />

Constructing the RTM in a suitable tool not only maintains it<br />

automatically, but also opens up possibilities for filtering,<br />

quality checks, progress monitoring and metrics generation<br />

(Figure 7). The RTM is no longer a tedious, time-consuming<br />

task reluctantly carried out at the end of a project; instead it is<br />

a powerful utility which can contribute to its efficient running.<br />

The requirements become usable artefacts that are able to<br />

drive implementation and testing. Furthermore, many of the<br />

trace links may be captured simply by doing the day-to-day<br />

work of development, accelerating RTM construction and<br />

improving the quality of its contents.<br />

Modern requirements traceability solutions enable the<br />

extension of the requirements mapping down to the<br />

verification tasks associated with the source code. The<br />

screenshot below shows one such example of this. Using this<br />

type of requirements traceability tool, the 100% requirements<br />

coverage metric objective can be clearly measured, no matter<br />

how many layers of requirements, design and implementation<br />

decomposition are used. This makes monitoring system<br />

completion progress an extremely straightforward activity.<br />

Figure 7 - Traceability from high level requirements down to<br />

source code and verification tasks.<br />
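Measuring that coverage metric across arbitrary layers of decomposition can be sketched as a recursive roll-up: a parent requirement counts as covered only when every child does. The requirement tree and verification states below are invented.<br />

```python
# Sketch of rolling a coverage metric up through layers of requirements
# decomposition (tree and verification states are hypothetical).

children = {
    "SYS-1": ["HLR-1", "HLR-2"],
    "HLR-1": ["LLR-1"],
    "HLR-2": ["LLR-2", "LLR-3"],
}
verified_leaves = {"LLR-1": True, "LLR-2": True, "LLR-3": False}

def covered(req):
    kids = children.get(req)
    if kids is None:                      # leaf: covered if verified
        return verified_leaves.get(req, False)
    return all(covered(k) for k in kids)  # parent: all children covered

print(covered("HLR-1"))
print(covered("SYS-1"))   # one unverified leaf fails the whole tree
```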

VII.<br />

CONNECTIVITY AND THE INFINITE DEVELOPMENT<br />

LIFECYCLE<br />

During the development of a traditional, isolated system, that<br />

is clearly useful enough. But connectivity demands the ability<br />

to respond to vulnerabilities identified in the field. Each newly<br />

discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed and<br />

automatically test only the functions implemented becomes<br />

something much more significant.<br />

Connectivity changes the notion of the development process<br />

ending when a product is launched, or even when its<br />

production is ended. Whenever a new vulnerability is<br />

discovered in the field, there is a resulting change of<br />

requirement to cater for it, coupled with the additional<br />

pressure of knowing that in such circumstances, a speedy<br />

response to requirements change has the potential to both save<br />

lives and enhance reputations. Such an obligation shines a<br />

whole new light on automated requirements traceability.<br />

VIII.<br />

CONCLUSION<br />

The delivery of a Requirements Traceability Matrix (RTM) is<br />

often contractually imposed on suppliers. Even when not<br />

required, many development teams recognise that an RTM is an<br />
important 'best practice' for successful projects. However, the<br />

creation of a useful and error-free RTM can only happen when<br />

the requirements are of sufficient quality and the process is<br />

taken seriously. This paper has outlined several areas which<br />

have the capability to limit or undermine the RTM and has<br />

proposed a series of solutions:<br />

• Ensure that requirements embrace functional, safety and<br />

security related issues<br />

• Accept that requirements will change over the life of the<br />

project<br />

• Employ a development process which embraces and<br />

responds to change<br />

• Manage the quality of requirements<br />

• Let the requirements drive development<br />

• Build an RTM from the start of the project<br />



• Use the RTM to manage progress and improve project<br />

quality<br />

• Use the RTM to respond quickly and effectively to<br />

newly-discovered security vulnerabilities after product<br />

deployment<br />

Implementing these improvements undoubtedly takes effort<br />

but the end result will be a project that finishes on time, on<br />

budget, avoids any gap between the stakeholder’s vision of the<br />

application and what is ultimately delivered, and results in an<br />

effective support vehicle for deployed connected systems.<br />
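As a minimal illustration of the bookkeeping an RTM automates, the sketch below (plain Python; requirement and test identifiers are invented, and this is not LDRA's implementation) maps requirements to verification tasks and reports coverage gaps:<br />

```python
# Toy RTM: each requirement is linked to the tests that verify it.
requirements = {
    "REQ-1": ["TEST-1", "TEST-2"],   # fully covered
    "REQ-2": ["TEST-3"],             # covered, but a test fails
    "REQ-3": [],                     # no verification task linked yet
}
results = {"TEST-1": "pass", "TEST-2": "pass", "TEST-3": "fail"}

def rtm_status(requirements, results):
    """Classify each requirement from the linked test results."""
    status = {}
    for req, tests in requirements.items():
        if not tests:
            status[req] = "uncovered"
        elif all(results.get(t) == "pass" for t in tests):
            status[req] = "verified"
        else:
            status[req] = "failing"
    return status
```

Run against a changed or newly added requirement, such a report immediately shows which verification tasks still need attention - the property the conclusion above argues for.<br />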

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Specialist<br />

E:mark.pitchford@ldra.com<br />



Change based Requirements Management<br />

Bernd Röser (Author)<br />

agosense GmbH<br />

Kornwestheim, Germany<br />

bernd.roeser@agosense.com<br />

Ralf Klimpke (Author)<br />

agosense GmbH<br />

Kornwestheim, Germany<br />

ralf.klimpke@agosense.com<br />

Today, requirements management only rarely begins „from scratch“. Instead, existing product versions and variations are defined and developed at the same time.<br />

It is often at this point that the planned changes depart<br />

from the provisions contained in the requirements and<br />

specifications, making it difficult to ascertain which<br />

individual requirement resulted in which documented<br />

change. While it may be possible to show the differences<br />

between two milestones or baselines, identifying a direct<br />

correlation between specific provisions in a document and<br />

the actual changes implemented becomes very difficult, if<br />

not impossible. This article shows how a tool-based methodology makes this connection, completely integrating requirements management into the modern development process.<br />

New approaches to change-based requirements management, with clearly distributed responsibilities and an increased level of traceability, aim to improve communication and coordination between the different roles in requirements management and the development process over the long term.<br />

INTRODUCTION<br />

Software and software development are hardly imaginable today without a methodical approach - particularly given the<br />

importance of product security, quality and the need to predict<br />

activities and results. Application Lifecycle Management<br />

(ALM) tools and platforms help by supporting various<br />

practices and methods by visually representing the<br />

interdependencies between development artefacts and<br />

activities. But is that enough? Examining how development is<br />

organised more closely, it becomes clear that products today<br />

are rarely planned „from scratch“. In many cases, existing products are further developed, improved or used as the model<br />

for further product variation. Looking at development in this<br />

way, it becomes clear that the major proportion of development<br />

activities can be characterised as changes to existing material.<br />

Changes, and change management do not begin with<br />

implementation or in the production stage, but rather much<br />

earlier in the product development process - for example,<br />

during requirements management.<br />

Managed effectively, this process can foster the re-use and<br />

alteration of existing artefacts or documents and provide<br />

optimal support across all the ALM tools used as part of the<br />

development process. This approach has long been known in<br />

the field of version management and source code<br />

administration using terminology like checkpoint, baselines,<br />

variants, versions, change lists etc. But why can we not also<br />

use these terms in the practice of requirements management?<br />

The challenges and the issues are almost identical. How can<br />

baselines or variants be used for planning requirements? How<br />

can intended changes to the specifications for a new product<br />

iteration be extracted and applied to an existing requirements<br />

document? How can it be shown, quickly and easily, which<br />

task and which change request is based on which specific<br />

change to a specification?<br />

Until now, establishing a controlled and most importantly<br />

comprehensive change process within requirements<br />

management required a considerable amount of manual effort.<br />

This article describes new opportunities to efficiently control<br />

development processes for „change based requirements<br />

management“ from the beginning using agosense.fidelia from<br />

agosense GmbH.<br />

agosense.fidelia is an independent web-based system for<br />

requirements management that unites popular RM functions<br />

with a specialised support for requirements management,<br />

integrating the development process.<br />

CHANGE TRACKING AND RELEASE<br />

Integration of requirements and change management is<br />

mostly limited to a simple linking of requirements and change<br />

requests. These links are usually manually created, and have no<br />

binding or lasting character. But how to demonstrate that<br />

requirements without a change request cannot be changed or<br />

whether the changes actually carried out match the intended<br />

changes?<br />

www.embedded-world.eu<br />



The uncertainty ends with agosense.requirements. The<br />

methodological approach of this tool allows every change to a document/requirement to be listed and ensures that a<br />

document can only be edited in connection with an allocated<br />

task or change request.<br />

This approach is examined in detail in the following sections, which also show how to make the most of it as part of planning in the development process.<br />

LISTING CHANGES<br />

As mentioned at the beginning, in many regulated industries every individual step in a development process must be recorded in order to exactly replicate the development process<br />

at a later time. To make this possible, developments and<br />

changes are described, planned and distributed to the person or<br />

group responsible. These individual steps are later tested and<br />

released as part of the ongoing process.<br />

To make this process as efficient as possible, and most<br />

importantly, to ensure that the changes carried out truly<br />

represent the intended outcome, all granular changes to the<br />

document (Sheet) are recorded in agosense.requirements. The<br />

system creates a „Change Set“ for every task and change which<br />

automatically records granular alterations to the document -<br />

similar to the approach used in software version administration<br />

tools (see Fig. 1).<br />

The change requests or tasks linked to the Change Sets<br />

usually originate in existing Change Management Systems<br />

(e.g. IBM RTC, Atlassian Jira etc) which can be directly<br />

integrated into agosense.requirements, thanks to agosense<br />

interface technology. This integration enables the allocated<br />

planning artefacts to be selected by the user in<br />

agosense.requirements. It goes without saying that information<br />

- including that regarding changes, change sets, release, etc -<br />

can be transferred back to the change management system.<br />

Change sets present a distinct list of changes to a<br />

document that can be allocated to a task or a change request.<br />

They can then flow into the document according to a specific<br />

order, for example using a regulated release process. This<br />

ensures that the document version is an accurate representation<br />

of the planning process, and the changes can be traced back to<br />

the appropriate layer at any time.<br />

So how does this work in detail? In agosense.fidelia, the current released version of a document is labelled the „base“<br />

version. If this document (or an older version of the document -<br />

then in a branch) is edited, the user simply selects the allocated<br />

task and begins working. This automatically results in the<br />

generation of a „Tentative“ version of the document, which<br />

serves as the working copy for processing.<br />

From this point, all the changes in the document version are<br />

automatically listed in the change set. After the work is<br />

completed, depending on the process implemented, the user<br />

can choose to either present the change sets for review, or enter<br />

them directly to the document. This operation results in the<br />

changes being officially carried over into the base version<br />

(„Apply“ see Fig. 2). The status of all dependent tasks and<br />

change requests in the linked system is automatically updated.<br />

Advantages:<br />

• Specifications and documents contain exact<br />

changes, which have been previously planned and<br />

approved<br />

• The user is guided through this process by<br />

agosense.fidelia without adding any extra effort<br />

• Changes are automatically documented in the<br />

background.<br />
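The base/tentative mechanism described above can be sketched as follows; the class and method names are illustrative only and do not represent the agosense.fidelia API:<br />

```python
# Sketch: a document may only be edited against an allocated task, every
# granular change is recorded in that task's change set, and "apply"
# carries the change set over into the released base version.
class Document:
    def __init__(self, base):
        self.base = base          # released "base" version: {req_id: text}
        self.tentative = None     # working copy, exists only while editing
        self.change_sets = {}     # task_id -> list of (req_id, new_text)

    def edit(self, task_id, req_id, new_text):
        if task_id is None:
            raise ValueError("editing requires an allocated task/change request")
        if self.tentative is None:
            self.tentative = dict(self.base)   # create the tentative version
        self.tentative[req_id] = new_text
        self.change_sets.setdefault(task_id, []).append((req_id, new_text))

    def apply(self, task_id):
        # carry the reviewed change set over into the base version
        for req_id, text in self.change_sets[task_id]:
            self.base[req_id] = text
        self.tentative = None
```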

Fig. 2. Concept Base/Tentative View and Change Sets<br />

Fig. 1. Diagram of Changes and Classification<br />




Fig. 3 Change Sets and Reviews<br />

RELEASE PROCESS<br />

Depending on the level of maturity or formality of the<br />

defined operation processes, the specifications may be subject<br />

to a review process before being released for implementation.<br />

For our example, that means that every change request is<br />

reviewed according to the relevant change set.<br />

As shown in Fig. 3, the reviewer is shown detailed<br />

information regarding the specific change in an integrated diff<br />

view. Naturally, in this view comments can be created and<br />

corrections or follow-ups requested before the change set is<br />

released. After all the change sets for a specific release level<br />

have been accepted, a baseline for the document can be<br />

generated, representing the released version.<br />

This release process can also be controlled using the linked<br />

change management system. The reviewer can, for example, go<br />

straight to the change set view in agosense.fidelia by clicking<br />

on a hyperlink in the change request - depending on the user‘s<br />

operational preference.<br />

Advantages:<br />

• The user is guided through the whole process<br />

• Every step in the release process is documented<br />

and traceable.<br />
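The release gate described above reduces to a simple invariant: a baseline may only be generated once every change set for the release level has been accepted. A hypothetical sketch (function names invented):<br />

```python
# A baseline represents the released version; it may only be created when
# no change set for the release level is still open or under review.
def can_baseline(change_sets):
    """change_sets: mapping change-set id -> review status string."""
    return all(status == "accepted" for status in change_sets.values())

def create_baseline(version, change_sets):
    if not can_baseline(change_sets):
        pending = [c for c, s in change_sets.items() if s != "accepted"]
        raise RuntimeError(f"change sets awaiting acceptance: {pending}")
    return f"baseline of version {version}"
```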

BRANCHES AND VARIATIONS<br />

As previously described, branches are central functions<br />

which allow product variations to be defined, or to generate<br />

deviations from existing product definitions. This creates a<br />

need for these functions to exist as an integral part of<br />

requirements management.<br />

There are only very few tools on the market that offer true<br />

branching, and those that do often do so via add-ons with<br />

inefficient copy mechanisms that unnecessarily inflate data<br />

storage needs and slow down the application over time.<br />

All the functions described here are also used within the<br />

branches, and every document version (whether baseline,<br />

variant, tentative or base) can of course be compared with<br />

another in any combination using the diff-view.<br />
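The kind of pairwise comparison such a diff view performs can be approximated with Python's standard difflib; the requirement texts below are invented examples, and agosense.fidelia's actual diff works per requirement rather than per line:<br />

```python
import difflib

# Two versions of a (toy) requirements document, one line per requirement.
baseline = [
    "REQ-1 The actuator shall extend within 2 s.",
    "REQ-2 The system shall report sensor faults.",
]
tentative = [
    "REQ-1 The actuator shall extend within 1.5 s.",
    "REQ-2 The system shall report sensor faults.",
]

# Unified diff: changed requirements appear as -/+ pairs, unchanged ones
# only as context lines.
diff = list(difflib.unified_diff(baseline, tentative,
                                 fromfile="base", tofile="tentative",
                                 lineterm=""))
```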

Where special customer requests require more than<br />

straightforward branching, for example a description of<br />

function trees with corresponding dependencies that must be<br />

specifically managed in the creation of product variations,<br />

agosense.fidelia offers the ability to directly integrate variant<br />

management systems like „pure::variants“ from pure-systems<br />

GmbH.<br />

INTEGRATION<br />

As mentioned in the sections above, embedding<br />

requirements management into the whole development process<br />

is very important. The following section aims to show where<br />



the usual linkage points are, and how these can best be used to<br />

provide the traceability standards of today.<br />

To what extent is it necessary to expand the domain of requirements management towards planned change management? The following examples examine that question.<br />

Assuming that a requirements document has been released<br />

and the product developed according to those<br />

specifications: what is now required to best<br />

incorporate subsequently changed requirements? The requirements cannot simply be changed; they must also be<br />

managed as part of a planned change process:<br />

• Changes must first be evaluated according to a<br />

range of criteria (e.g. costs, risks,...)<br />

• Dependencies must be checked: test cases, use-case models, ... may also need changing<br />

• The specification itself must be altered and<br />

released<br />

• Changed requirements must be communicated and<br />

passed on for implementation.<br />
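The steps above amount to a small lifecycle for each change request. One way to picture it (state names invented for illustration) is as a table of allowed status transitions:<br />

```python
# Allowed transitions of a change request through the planned change
# process described above; anything else is rejected as out of order.
TRANSITIONS = {
    "submitted":       ["evaluated"],                   # assess cost, risk, ...
    "evaluated":       ["impact-analysed", "rejected"], # check dependent artefacts
    "impact-analysed": ["specified"],                   # alter and release the spec
    "specified":       ["communicated"],                # hand over for implementation
}

def advance(state, target):
    """Move a change request to `target`, enforcing the process order."""
    if target not in TRANSITIONS.get(state, []):
        raise ValueError(f"cannot go from {state!r} to {target!r}")
    return target
```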

Efficient resolution of all these tasks cannot be achieved<br />

without an extensive integration of requirements management<br />

into all other adjacent domains, and global change management<br />

in particular.<br />

DEVELOPMENT PROCESS INTEGRATION<br />

Requirements management has thus become an integral<br />

part of the development process and can no longer be<br />

considered in isolation. It is therefore important for most<br />

companies that the RM tools being used provide sufficient<br />

relevant open interfaces and are able to be perfectly integrated<br />

into the broader tool environment (e.g. with test management,<br />

modelling, change management, etc). However, only a very<br />

small number of providers have an integration strategy that<br />

goes beyond the scope of their current product portfolio. This<br />

often leads to significant costs and frustration on the part of the<br />

client, particularly when trying to administer integration<br />

(usually specifically developed for the client) and provide users<br />

with the appropriate level of support. Additionally, companies<br />

often forget or underestimate the fact that tool integration<br />

requires profound knowledge and expertise of all tools to be<br />

integrated.<br />

Ideally, change management tool providers should at least<br />

offer open interfaces for all potential integration technologies,<br />

with the best case scenario also featuring a clear strategy and<br />

the relevant expertise in this area.<br />

agosense has a clear strategy, many years of experience in<br />

tool integration and offers an integration platform that links<br />

almost all popular tools in the ALM industry as an optimal<br />

addition. agosense.symphony is the central technology for all<br />

integration within agosense.requirements, and it can also be<br />

used as a stand-alone product to integrate a heterogeneous tool<br />

chain in a way that is process oriented and specific to your<br />

needs.<br />

Fig. 4 Baselines and Variants in Document Selection Window<br />

CROSS DOMAIN TRACEABILITY<br />

What are the most important motivations for tool<br />

integration in general?<br />

• Presentation of information from other tools or<br />

databases<br />

• Enable users to work in their domain specific tools<br />

without constantly having to change between tools<br />

(due to compatibility, license or cost<br />

considerations)<br />

• Close the breaks in media and process that arise<br />

from the use of different tools for different tasks<br />

• Present dependent information<br />

• Use dependent information: see the effect of a<br />

planned change to a specification on other<br />

domains (e.g. tests, models, etc.).<br />

Fig. 5. agosense.symphony as the central integration platform for all development domains and data transfer<br />



These individual motivations could be summarised under<br />

the term „traceability“ - the ability to locate data from a broad<br />

range of sources in relation to each other and to make this<br />

information network visible. In particular with regard to<br />

products where safety is key, for example in the automobile<br />

industry, there is a statutory requirement to ensure a certain<br />

level of maturity in the product creation process. This should<br />

result in an improved product quality, but also allow for an<br />

error to be reconstructed and followed back to its source, the<br />

product requirement.<br />

Ultimately, this is only possible through integration, as the<br />

majority of companies have a very heterogeneous tool<br />

environment and data storage.<br />

agosense.fidelia provides the optimum support (see Fig. 6<br />

for a presentation of the „Split Screen“ in conjunction with test<br />

data) in that trace information on the data from other tools can<br />

be presented, generated or processed directly in a view.<br />

Traceability is therefore possible across different tools, and<br />

able to be presented using the dashboards in agosense.fidelia<br />

for reports in real time.<br />

SUMMARY<br />

For the first time, organisations which are characterized by<br />

tight, strictly organised planning and development processes<br />

now have the opportunity to ensure continuous traceability for<br />

all activities all the way up to requirements management.<br />

Fig. 6 “Split Screen” in conjunction with test data<br />



Certification Testing Process with Full Traceability<br />

Michael Wittner<br />

Razorcat Development GmbH<br />

Berlin, Germany<br />

www.razorcat.com<br />

Abstract—The certification of safety critical systems<br />

especially in avionic systems requires extensive testing of the<br />

complete system functionality. Time and cost for the certification<br />

can be reduced significantly by a proven tool-supported testing<br />

process. The necessary documentation will then be generated<br />

automatically based on the testing results for each system<br />

requirement. The testing process from requirements analysis to<br />

evaluation of test results for each requirement will be<br />

demonstrated with a real-life project example.<br />

Within each cycle of the testing process, tests will be<br />

defined, linked to requirements, executed and evaluated. The<br />

testing process needs to keep track of the results of all testing<br />

activities in order to provide reporting and traceability between<br />

test results and requirements. With the integrated testing<br />

environment ITE, users can import and manage all<br />

requirements, select the appropriate test means and plan testing<br />

campaigns.<br />

Keywords—Testing, standards, requirement, certification,<br />

reporting<br />

I. INTRODUCTION<br />

This paper describes a well-structured testing process<br />

which is based on a dedicated test specification language<br />

(CCDL) and a test management tool (ITE) for requirements<br />

linking and traceability reporting. By means of a successfully<br />

completed test campaign for an avionics system component,<br />

the method, the step by step process and the available tool<br />

support will be demonstrated.<br />

II. THE TEST PROCESS AT A GLANCE<br />

For the validation and verification of a system it is<br />

necessary to provide evidence that the requirements are both<br />

correctly implemented as system functions and that the system<br />

functions do exactly what they are expected to do. For this<br />

purpose, a suite of tests needs to be created that will be<br />

executed against the system under test (SUT). Testing will take<br />

place at different levels down from unit testing up to<br />

integration and system testing. Other test means like reviews<br />

may also be used to validate non-technical requirements.<br />

Fig. 1 provides an overview of the testing process and<br />

its entities. The assignment of testing tools for each<br />

requirement will be done using the verification and validation<br />

(VxV) matrix. According to this planning activity, the progress<br />

in defining and executing tests linked to requirements can<br />

easily be measured throughout the whole testing process. At<br />

the end of each testing cycle, certification-ready reports<br />

present the current testing status.<br />

Fig. 1. Overview of the testing process<br />

A. Preparation of requirements<br />

The base for all certification activities are well-defined<br />

requirements for the system under test. Requirements are<br />

usually structured within one or several documents and they<br />

can be further refined into sub-requirements linked to<br />

their main requirements. Requirements stemming from external<br />

tools need to be imported into ITE to be able to detect changes<br />

and handle linking to other test entities.<br />
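One common way to detect such changes on re-import (a sketch only, not the ITE import mechanism) is to compare content fingerprints of each requirement:<br />

```python
import hashlib

def fingerprint(text):
    """Stable fingerprint of a requirement's text."""
    return hashlib.sha256(text.encode("utf-8")).hexdigest()

def detect_changes(stored, imported):
    """stored: {req_id: fingerprint} from the last import;
    imported: {req_id: text} from the external tool.
    Returns (changed, new, removed) requirement ids."""
    changed = [r for r, text in imported.items()
               if r in stored and stored[r] != fingerprint(text)]
    new = [r for r in imported if r not in stored]
    removed = [r for r in stored if r not in imported]
    return changed, new, removed
```

Changed requirements would then mark their linked test entities as suspicious, as described later in this paper.<br />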

B. Selecting the test means<br />

The next step after requirement definition is the selection of<br />

testing tools. These test means will be used for testing on<br />

different levels like unit, integration and system testing. There<br />

may be different tools that can be used for verification of a<br />

single requirement and it is the responsibility of the test<br />

229


engineer to select the most appropriate and effective one for<br />

each requirement.<br />

Fig. 2. Verification and validation matrix<br />

The assignment of test means to requirements takes place<br />

within the VxV matrix. For each requirement, there is one row<br />

containing the test mean assignments. This step of the process<br />

just decides about the testing tools or methods being applied to<br />

validate each requirement. Fig. 2 shows a VxV matrix for<br />

several requirements being tested with different test means.<br />
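Conceptually, the VxV matrix is one row per requirement listing its assigned test means; a minimal sketch (requirement ids and tool names are invented):<br />

```python
# One row per requirement; the cell content is the set of test means
# (tools or methods) assigned to verify that requirement.
vxv = {
    "REQ-10": {"unit test", "system test rig"},
    "REQ-11": {"review"},
    "REQ-12": set(),            # not yet planned
}

def unplanned(vxv):
    """Requirements that still have no test mean assigned."""
    return sorted(r for r, means in vxv.items() if not means)
```

Measuring planning progress then becomes a query over the matrix rather than a manual audit.<br />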

C. Test definitions and test procedures<br />

When testing a specific requirement, it is important to<br />

first identify all relevant testing aspects to fully cover the<br />

functionality described within the requirement. A testing<br />

program for a complex system consists of a large number of test<br />

definitions that need to be linked to the requirements being<br />

tested.<br />

1) Defining the test specification<br />

Writing test definitions in purely textual form has the<br />

disadvantage that it gets harder to distinguish different tests for<br />

the same objective once a number of tests have already been<br />

defined. Especially when maintaining and extending existing<br />

tests due to new requirements, systematic approaches to test<br />

specification such as the classification tree method are highly<br />

recommended. Fig. 3 shows tests defined using the<br />

classification tree method at a high abstraction level. These test<br />

definitions just outline the necessary setup and checks to be<br />

implemented on the respective test mean (i.e. a system testing<br />

tool in this case).<br />
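The classification tree method partitions the test-relevant aspects into classifications with mutually exclusive classes; a concrete test definition combines one class per classification. A sketch using the classes of Fig. 3:<br />

```python
from itertools import product

# Classifications and classes as in the actuator example of Fig. 3.
classifications = {
    "redundancy": ["both systems", "system 1 only", "system 2 only"],
    "supply voltage": ["normal", "reduced"],
    "hydraulic pressure": ["normal", "reduced"],
    "moving direction": ["extend", "retract"],
    "operating load": ["none", "normal", "max"],
    "ambient temperature": ["normal", "low (-40 °C)"],
}

# The full cross product spans the input domain; in practice only selected
# combinations (e.g. "Normal extend", "Degraded voltage") become tests.
all_combinations = list(product(*classifications.values()))
```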

[Figure content: the tree combines the classifications Redundancy (both systems / system 1 only / system 2 only), Supply voltage (normal / reduced), Hydraulic pressure (normal / reduced), Moving direction (extend / retract), Op load (none / normal / max) and Ambient temperature (normal / low, -40 °C) into the test definitions 1: Normal no load, 2: Normal extend, 3: Normal retract, 4: Degraded voltage, 5: Degraded hydraulic, 6: Degraded low temperature.]<br />

Fig. 3. Classification tree for the test of an aircraft actuator component<br />

The exact sequence of test steps is not defined within each test definition. It is up to the test engineer to write a useful test procedure that copes with the given testing challenge. It can be useful to have a set of initial conditions defined for all possible test setups of the system. These initial conditions can be reused for all applicable test definitions.<br />

2) Writing test scripts<br />

The CCDL language, as an example of a system level testing language, provides all required real-time testing functionality while remaining readable and understandable for external auditors without further training. It is a test specification language consisting of only a few syntactical elements in a human-readable format that can be learned easily. It provides means for the definition of complex chronological as well as event-driven test procedures. Expected reactions of the system under test can be specified and evaluated in parallel to the test control flow. Requirements can be linked directly to test script commands using the syntax of the CCDL language.<br />

Fig. 4 shows an example test procedure for testing of a safety monitoring function of an aircraft actuator system. The example provides a short overview of the CCDL scripting possibilities: the SUT gets stimulated and the test control flow waits for the SUT to be running in normal operating mode. Now the test simulates a sensor fault and checks for the expected reaction of the SUT.<br />

Fig. 4. CCDL sample test procedure<br />

This simple example already shows the power of the CCDL language: the trigger condition defines the point in time where the SUT runs in normal operating mode. Based on this trigger, the fault situation is applied (within the “when” script expression) and in parallel the expected reaction of the SUT is checked (using the “within” script expression). The operator => checks if the given signal changes exactly once within the given interval to the value provided. Fig. 5 shows the graphical execution flow for a test run of the example test procedure. The red dot within the expected reaction check indicates that the failure warning signal did not change to its expected value of “1”.<br />

Fig. 5. Execution flow with timing and test results<br />

The SUT will be stimulated and checked exactly as defined<br />

within the test in a precisely definable time frame. The usage of<br />

trigger conditions in conjunction with time offsets allows<br />

specifying precise time intervals for real time checking of SUT<br />

behavior. The use of CCDL within a certification test campaign<br />



resulted in a highly increased productivity of the test team<br />

compared with the former test scripting based on the<br />

programming language Python.<br />
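The stimulate/trigger/check pattern described above can be emulated in plain Python as a discrete-time check; this illustrates the semantics only (it is not CCDL syntax), and the signal names and traces are invented:<br />

```python
# After a trigger condition fires, the named signal must change to the
# expected value exactly once within `window` ticks - the semantics of
# the => check described above, in a discrete-time emulation.
def check_reaction(trace, trigger, window, signal, expected):
    """trace: list of per-tick signal dicts; trigger: predicate on a tick."""
    start = next((i for i, s in enumerate(trace) if trigger(s)), None)
    if start is None:
        return False                      # SUT never reached the trigger state
    changes = 0
    prev = trace[start][signal]
    for s in trace[start + 1:start + 1 + window]:
        if s[signal] != prev and s[signal] == expected:
            changes += 1
        prev = s[signal]
    return changes == 1
```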

D. Evaluation of test results<br />

The next step after completion of tests is the evaluation of<br />

the test results with focus on each requirement. The initial<br />

linking of test definitions to requirements is the basis for this<br />

evaluation, but it may also be necessary to link additional<br />

requirements and their respective evaluation results to executed<br />

tests. Since the CCDL provides linking of requirements to<br />

actual results being checked during test execution, the<br />

evaluation process can be highly automated. Manually<br />

overriding the automatic results must be possible as well to<br />

include human expertise in test result evaluation into the testing<br />

process.<br />

Each test result needs to be acknowledged as “closed” to be<br />

included into the current test statistics. This feature allows<br />

immediate reporting about the current testing status without<br />

taking into account work in progress results.<br />
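This "closed"-only counting rule can be expressed compactly; a sketch with invented result data:<br />

```python
# Only results acknowledged as "closed" count toward the published
# statistics; everything else is reported as work in progress.
def statistics(results):
    """results: list of (verdict, acknowledged) pairs, verdict 'pass'/'fail'."""
    closed = [v for v, acked in results if acked]
    return {
        "closed": len(closed),
        "passed": closed.count("pass"),
        "failed": closed.count("fail"),
        "in progress": len(results) - len(closed),
    }
```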

E. Test reporting<br />

The final result of a certification process is the creation of<br />

reports about the achieved test results and the traceability<br />

between requirements and their respective test results. Because<br />

the overall testing process is usually an iterative process with<br />

several testing campaigns, it is essential that the reporting<br />

provides a quick overview as well as detailed insights into<br />

problems detected while testing. Fig. 6 shows an excerpt of an<br />

overview report. For each requirement, the planned test<br />

coverage defined within the VxV matrix together with the<br />

achieved execution results is shown. The number of tests are<br />

based on the number of test definitions and their respective<br />

number of executions (i.e. test runs). Filtering may be applied<br />

to show the results for specific test campaigns or part of the<br />

whole testing program.<br />

Fig. 6. Overview report for requirement based test results<br />

With the integrated testing environment ITE, users can<br />

generate customizable reports for traceability between<br />

requirements and executed tests which can directly be used for<br />

certifications and assessments.<br />

III. ANOTHER TESTING CYCLE<br />

Because testing of complex systems is normally not a one-shot operation, it is necessary to support effective handling<br />

of changes within the testing process. Changes to requirements<br />

may cause updates or extensions of tests, require additional<br />

tests or result in obsolete tests that can be removed. The<br />

existing relationships between all test process artefacts make it<br />

possible to highlight any suspicious elements (i.e. elements that<br />

may be affected by changes of their linked elements).<br />

Fig. 7. Relationships between test artefacts<br />

Fig. 7 shows the links between test artefacts that can be<br />

followed when exploring the impact of changes. Each<br />

dependent element will be marked as suspicious in order to be<br />

updated or acknowledged. Such a tagging of suspicious<br />

elements guides the test engineer throughout the necessary<br />

adaptions within a highly dynamic testing process.<br />

After resolving all suspicious artefacts, the test engineer can<br />

be sure to have taken into account all necessary changes and<br />

updates of tests that were to be done due to requirement<br />

changes.<br />
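The suspicious-marking described above is essentially a reachability search over the artefact link graph; a minimal sketch with invented artefact identifiers:<br />

```python
from collections import deque

# artefact -> artefacts that depend on it (requirement -> test definition
# -> test procedure -> test run, as in Fig. 7)
links = {
    "REQ-5": ["TESTDEF-9"],
    "TESTDEF-9": ["PROC-3"],
    "PROC-3": ["RUN-17"],
}

def mark_suspicious(changed, links):
    """Everything reachable from the changed artefact over dependency
    links is marked suspicious until updated or acknowledged."""
    suspicious, queue = set(), deque([changed])
    while queue:
        for dep in links.get(queue.popleft(), []):
            if dep not in suspicious:
                suspicious.add(dep)
                queue.append(dep)
    return suspicious
```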

IV. USAGE IN LARGER TESTING PROJECTS<br />

The testing process outlined above has been used<br />

successfully in various avionics certification testing programs<br />

for safety critical aircraft components. One of these projects<br />

had the following key indicators:<br />

• Time span over 3 years, up to 25 test engineers<br />

• Number of requirements: 1270, plus 3024 derived sub-requirements<br />

• Number of test cases: 3014<br />

• Number of test runs: 4118, thereof 1196 repeated execution cycles of test procedures<br />

The execution of test procedures could be conducted either<br />

on an expensive hardware test rig with all other relevant<br />

original components in place or on a simulator for the tested<br />

aircraft component only. Therefore it was important to select<br />

the most relevant tests and the best matching test tool within<br />

each test campaign.<br />

The number of testing cycles and retests is especially<br />
noteworthy, as it underlines the need for powerful<br />
versioning and change management to be able to follow any<br />
suspicious paths along the test artefact link chains.<br />

231


1000x in Three Years: How Embedded Vision is<br />

Transitioning from Exotic to Ubiquitous<br />

Jeff Bier<br />

Embedded Vision Alliance<br />

Walnut Creek, CA USA<br />

bier@embedded-vision.com<br />

Abstract—Just a few years ago, it was inconceivable that<br />

everyday devices would incorporate visual intelligence. Now it’s<br />

clear that visual intelligence will be ubiquitous soon. How soon?<br />

Faster than you might think, thanks to three key accelerating<br />

factors. In the next few years, we’ll see roughly a 10X<br />

improvement in cost-performance and energy efficiency at each<br />

of three layers: algorithms, software techniques, and processor<br />

architecture. Combined, this means that we can expect roughly a<br />

1000X improvement. So, tasks that today require hundreds of<br />

watts of power and hundreds of dollars’ worth of silicon will soon<br />

require less than a watt of power and less than a dollar’s worth of<br />

silicon. This will be world-changing, enabling even very<br />
cost-sensitive devices, like toys, to incorporate sophisticated visual<br />

perception. In this talk, I’ll explain how innovators across the<br />

industry are delivering this 1,000X improvement very rapidly.<br />

I’ll also highlight end-products that are showing us what’s<br />

possible in this new era, and important challenges that remain.<br />

Note: A paper is not being published for this presentation.<br />

The presentation slides are available upon request.<br />



Embedded Vision Solutions – State of the Art,<br />

Options and Applications<br />

Jan-Erik Schmitt<br />

Vision Components GmbH<br />

Ettlingen, Germany<br />

schmitt@vision-components.com<br />

Miriam Schreiber<br />

Vision Components GmbH<br />

Ettlingen, Germany<br />

miriam.schreiber@vision-components.com<br />

Abstract—This document gives a brief overview of<br />
embedded vision solutions: their definitions, options and limits, as<br />
well as typical applications.<br />

Keywords—embedded vision components<br />

I. INTRODUCTION<br />

With this paper we would like to contribute to an ongoing<br />
discussion regarding embedded components, and more specifically<br />
Embedded Vision Solutions. To begin with, we find it<br />
necessary to define what an Embedded Vision Solution is,<br />
especially since experience shows that certain ideas are<br />
associated with this term, but they are usually not very<br />
precise and cannot be assumed to be generally accepted. In a<br />
second step we give a quick overview of the technology's<br />
development to date and, last but not least, explain some<br />
typical application fields for embedded vision systems.<br />

II. DEFINITION: WHAT IS AN EMBEDDED VISION SOLUTION?<br />

Today, Embedded Vision and Embedded Vision Solutions<br />
are widely used buzzwords, but unfortunately there is no established<br />
definition that specifies exactly what they mean. Thus,<br />
any contribution to this topic needs to begin with a short<br />
definition outlining the basics.<br />
Still, this paper can only give a brief survey of the current<br />
use of these terms. All of them are additionally subject<br />
to constant change through professional use as well as, due to<br />
increasing awareness, through nonstandard use in everyday<br />
language.<br />

The term Embedded Vision Solutions derives from three<br />

different terms: Embedded System, Machine Vision System, and<br />

Vision Solution.<br />

A. Embedded System<br />

What is an Embedded System?<br />

Here too, several terms are in use. For better<br />
understanding, we treat the term embedded system as a<br />
short form of embedded computing system and thus as<br />
synonymous.<br />

“Embedded computing systems (ECSs) are dedicated<br />

systems which have computer hardware with embedded<br />

software as one of their most important components. Hence,<br />

ECSs are dedicated computer-based systems for an application<br />

or a product, which explains how they are different from the<br />

more general systems […]. As implementation technology<br />

continues to improve, the design of ECS becomes more<br />

challenging due to increasing system complexity, as well as<br />

relentless time-to-market pressure. Moreover, ECSs may be<br />

independent, part of a larger system, or a part of a<br />

heterogeneous system. They perform dedicated functions in a<br />

huge variety of applications, although these are not usually<br />

visible to the user.“[1]<br />

In general, we can summarize that embedded systems consist<br />
of two main components:<br />

· Hardware, which consists of the processor,<br />

program and data memory, interfaces, inputs,<br />

outputs, etc.<br />

· Firmware/Operating system<br />

B. Machine Vision System<br />

What is a Machine Vision System?<br />

A setup qualifies as a Machine Vision System when it<br />
consists of several components:<br />

· Lighting<br />

· Lens/Optics<br />

· Image sensor/Camera<br />

www.embedded-world.eu<br />

233


· Hardware/electronics (Processing Unit)<br />

· Software<br />

C. Vision Solution<br />

In the field of Machine Vision & Imaging, a Vision<br />

Solution is most often used as a system combining all<br />

necessary hardware and software components that are needed<br />

for the particular task. This can differ from application to<br />

application, since a wide range of components, both regarding<br />

hardware and software, can be used.<br />

Based on these clarifications, we can conclude that an<br />
Embedded Vision Solution is an embedded system providing<br />
specific hardware and software components for a particular<br />
vision inspection task that does not use an external processing<br />
device but instead processes all collected data onboard.<br />

III. EMBEDDED VISION TECHNOLOGY AND ITS DEVELOPMENT UNTIL TODAY<br />

The Apollo Guidance Computer, developed in the 1960s, is<br />

considered to be one of the first modern embedded systems [2].<br />

It had approx. 4 KB of RAM, could achieve 40,000 additions per<br />
second, and had a clock frequency of about 100 kHz.<br />

The first so-called Smart Camera, by then the most<br />

common term for an embedded vision system, for industrial<br />

use was brought to market in 1995. It was the VC11 from<br />

Vision Components GmbH, a DSP-based system with 32MHz<br />

clock frequency, 2MB DRAM and a 512KB Flash-EPROM.<br />

Today, this is considered as the beginning of embedded vision<br />

technology and soon other companies followed with similar<br />

products to join the quickly growing market. The first<br />
generations of Smart Cameras were homogeneous systems based<br />
on DSP only, as illustrated in Fig. 1.<br />

A few years later, in 2000, the rollout of the first Vision<br />
Sensor (another term not exactly defined) followed. It was also a<br />
homogeneous system based on DSP technology, clocked at<br />
75 MHz.<br />

Today, typical Embedded Vision Systems are based on<br />
heterogeneous architectures like a quad-core ARM at 1.2 GHz<br />
combined with FPGA or GPU modules. A typical<br />
heterogeneous system is, for example, the Zynq SoC from<br />
Xilinx [3], as shown in Fig. 2.<br />

In recent years, processor technology has advanced rapidly,<br />
driven by related technologies in the consumer and automotive<br />
markets: cell phones, tablet computers, autonomously driving<br />
vehicles and many more.<br />

IV. CORE QUESTION: WHY AND WHEN ARE EMBEDDED VISION SYSTEMS USED?<br />

There are many reasons for using embedded vision systems<br />
instead of conventional PC-based vision systems:<br />

· Like no other machine vision system, embedded<br />
vision systems are reduced to their basic<br />
components. Thus, they can easily be optimized for<br />
a good cost/performance ratio.<br />

· Embedded vision systems consist of fewer<br />
components than PC-based systems and are generally<br />
much smaller. Thus, they also consume less<br />
power than PC-based systems.<br />

· Embedded vision systems operate absolutely<br />

stand-alone.<br />

· Due to their minimal hardware design, they are<br />
extremely low-maintenance.<br />

Conclusion: assuming equal performance, all arguments<br />
favor using Embedded Vision Systems.<br />

Fig. 1. Block diagram of a DSP-based camera. © Vision Components GmbH<br />
Fig. 2. System topology of a dual-core ARM combined with an FPGA. ©<br />
Vision Components GmbH<br />



V. APPLICATION AREAS OF EMBEDDED VISION SYSTEMS<br />

Thanks to technological achievements, there are hardly any limits<br />
to the applications that can be realized with Embedded Vision<br />
Systems:<br />

· General quality control like e.g. glass inspection or<br />

electronic parts inspection.<br />

· 1D and 2D code reading as well as Optical<br />

Character Recognition/OCR.<br />

· Pick & place applications as used for assembling<br />

robots.<br />

· In logistics, stereo vision for 3D applications and<br />

general ID reading tasks, parcel sorting, and<br />

general warehouse automation.<br />

· 3D laser triangulation, e.g. for weld inspection.<br />

· Motion analysis in sports, for medical use or use<br />

in virtual realities.<br />

· 3D stereo vision cameras, e.g. used in sports or<br />

entertainment industries for ball tracking in Golf<br />

simulators, or for people counting tasks.<br />

· License plate reading/LPR for automated access<br />

systems.<br />

· Biometrics like fingerprint scanners.<br />

· Special tasks like surface inspection with standalone<br />

interferometer.<br />

· Robot guidance systems for autonomous assembly<br />

robots.<br />

and many more.<br />

[1] D. P. F. Möller, Guide to Computing Fundamentals in Cyber-Physical<br />
Systems, Computer Communications and Networks, Springer International<br />
Publishing Switzerland, 2016, DOI 10.1007/978-3-319-25178-3_2, p. 37f.<br />
[2] Wikipedia: https://en.wikipedia.org/wiki/Apollo_Guidance_Computer,<br />
retrieved 19.01.2018.<br />
[3] ZYNQ and XILINX are registered trademarks of Xilinx, Inc.<br />



Shifting Advanced Image Processing from<br />

Embedded Boards to Future Camera Modules<br />

A Paradigm-Change for Embedded Designers?<br />

Paul Maria Zalewski<br />

Product Management<br />

Allied Vision Technologies GmbH<br />

Stadtroda, Germany<br />

Abstract— Today’s embedded designers can choose from a<br />

broad multitude of possibilities when it comes to embedded vision.<br />

Most of them choose so called CMOS (Complementary Metal<br />

Oxide Semiconductor) camera modules, which are integrated in<br />

our smartphones, tablets and laptops. These modules are the<br />

preferred choice for designers since these modules deliver an<br />

acceptable image quality, are small, cheap and, most importantly,<br />
easy to integrate into an embedded system thanks to the<br />
standard MIPI CSI-2 (MIPI Camera Serial Interface 2) interface.<br />

Besides these modules, other cameras are available with LVDS<br />

(Low Voltage Differential Signaling) or parallel interfaces, which<br />
make the integration more challenging for designers.<br />

These camera options for Embedded Vision have something<br />

very important in common: poor image processing capability<br />
inside the camera. The embedded community has worked around<br />
these limitations by running important<br />
image processing tasks on the CPU (Central Processing<br />
Unit) or on dedicated ISPs (Image Signal Processors) on the<br />
embedded board. Overall, cameras have played a subordinate role in<br />

the context of the whole embedded system. Their major role was<br />

just to collect photons, convert them into electrons, create an<br />

acceptable image and transfer it to the host.<br />

In the past years, we have seen tremendous development of<br />
embedded boards and their capability to handle complex tasks for<br />
the embedded world. This includes more powerful CPUs, GPUs and<br />
dedicated processing units, e.g. their own ISPs, which offer embedded<br />
designers more flexibility in designing their embedded systems.<br />

Simultaneously, embedded designers are confronted with a<br />

major question when it comes to embedded vision: Where can I<br />

perform image processing to get the best and most efficient results<br />

for my application?<br />

This paper gives the reader an overview of the kinds of image<br />
processing that exist and of the options for running these<br />
algorithms in embedded systems for vision applications.<br />

Keywords— embedded, embedded vision, vision, camera, camera<br />

module, CMOS, Sony CMOS, ONSEMI CMOS, advanced image<br />

processing, MIPI CSI-2, Video 4 Linux 2, USB3<br />

I. IMAGE PROCESSING<br />

Image processing is a general term for methods that<br />
perform algorithms on an image. By applying algorithms to<br />
the image, embedded designers aim to enhance the image itself,<br />
extract helpful information or simply reduce the volume of<br />
the image data to enable faster processing afterwards on the<br />
embedded board.<br />
The term image processing can be further divided into<br />
different subcategories.<br />


· First on the list is pre-processing, the first step<br />
after an image has been captured, e.g. by a CMOS<br />
camera. Such an image is also called a RAW image,<br />
since no algorithms have touched it yet. This<br />
image may contain unwanted effects like defective<br />
pixels or an uncalibrated white balance. This is<br />
where pre-processing algorithms help embedded<br />
designers improve the image data by suppressing<br />
distortions or removing defects and abnormalities.<br />
Typical examples are algorithms that perform<br />
defective pixel correction or noise reduction.<br />

In this stage, the image is already characterized<br />

with a specific level of acceptable image quality.<br />

However, it can be further improved, and additional<br />

processing can be applied.<br />

· The next block of processing is advanced image<br />
processing, sometimes just called image<br />
processing. Within this step, higher-level<br />
operations or image enhancements are performed to<br />
facilitate additional processing. A good example is<br />



sharpening: a 3x3 filter matrix is applied to the<br />
image to increase its sharpness.<br />

Another example is a Look-Up Table (LUT), which<br />

is an array that replaces runtime computation with<br />

a simpler array indexing operation. This can reduce<br />

the computing time significantly. Typically, LUTs<br />

are used to enhance contrast, brightness or color<br />

reproduction.<br />

· The final step is post-processing. It comprises<br />
techniques that automate the identification of features<br />
in a scene to produce a decision. A commonly used<br />
algorithm in this step is face detection,<br />
as implemented in smartphones.<br />
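To make the stages above concrete, here is a small Python sketch of two of the advanced-processing operations just described: a 3x3 sharpening filter and a brightness/contrast look-up table on 8-bit grayscale values. The gain, offset and kernel weights are common illustrative choices, not values prescribed by any particular camera:<br />

```python
# LUT: precompute the output for all 256 possible 8-bit inputs once,
# then every pixel becomes a cheap array lookup instead of a multiply-add.
def apply_lut(pixels, gain=1.2, offset=10):
    lut = [min(255, max(0, int(v * gain + offset))) for v in range(256)]
    return [lut[p] for p in pixels]

# Sharpening: convolve with a classic 3x3 kernel (border pixels left as-is).
def sharpen3x3(img):
    k = [[0, -1, 0], [-1, 5, -1], [0, -1, 0]]
    h, w = len(img), len(img[0])
    out = [row[:] for row in img]
    for y in range(1, h - 1):
        for x in range(1, w - 1):
            acc = sum(k[j][i] * img[y + j - 1][x + i - 1]
                      for j in range(3) for i in range(3))
            out[y][x] = min(255, max(0, acc))   # clamp to 8-bit range
    return out

print(apply_lut([0, 100, 250]))   # → [10, 130, 255]
```

The LUT turns a per-pixel computation into a single table lookup, which is exactly why this operation maps well onto camera-side hardware.<br />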

II. SYSTEM ARCHITECTURE FOR IMAGE PROCESSING<br />

There are four key questions for engineers to answer when<br />
it comes to embedded vision applications.<br />


First, the selection of the necessary image<br />

processing tasks / algorithms for the specific<br />

application.<br />

Second, the selection of the right hardware platform<br />

including various options like main processor<br />

CPUs, or co-processors like GPUs, dedicated ISPs,<br />

video processors, DSPs (Digital Signal Processor),<br />

FPGAs (Field Programmable Gate Array) and so<br />

on.<br />

Third, the selection of the software platform;<br />
much of the image processing can, for example, be performed in<br />
software, with its own pros and cons.<br />

Fourth, the camera selection with preferred sensor<br />

resolution and especially image quality capabilities.<br />

Overall, a smart and careful consideration of all four<br />
questions and their options is key, especially in the embedded<br />
environment, where it is all about cost, power consumption and<br />
simply getting the maximum performance out of the selected<br />
(cost-sensitive) components.<br />

Today's preferred cameras for embedded vision are<br />
so-called CMOS camera modules. First introduced in mobile<br />
phones, they found their way into other applications besides<br />
mobile because of their attractive price, small size and low<br />
power consumption. They are still mainly driven by the mobile<br />
industry in terms of new resolutions and functionalities.<br />

Regarding image processing, most of them are equipped with a basic<br />
set of image processing algorithms. This includes, for<br />

example a set of automatic image control functions like auto<br />

exposure, white balance or black level calibration. Most<br />

importantly, they make use of simple algorithms to enhance the<br />

image quality by applying for example sharpness, lens<br />

correction, defect pixel correction or noise canceling. These are<br />

standard functionalities in state-of-the-art CMOS camera<br />

modules and designers configure them to get an acceptable<br />

image to the embedded board. The camera is connected via a<br />

MIPI CSI-2 connector and can be controlled by I2C to configure<br />

the registers of the camera.<br />
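As a rough illustration of this register-based control path (a hedged sketch, not any real sensor's register map): on Linux one would typically write the registers through /dev/i2c-* (e.g. via the smbus2 package) or through V4L2 controls; here a stand-in class simulates the device so the flow is visible. The device address and register numbers below are invented for the example:<br />

```python
# Simulated I2C-addressable camera module; a dict stands in for the
# device's register file so the example runs without hardware.
class FakeI2CCamera:
    def __init__(self, address=0x36):      # hypothetical 7-bit I2C address
        self.address = address
        self.registers = {}

    def write_reg(self, reg, value):
        self.registers[reg] = value & 0xFF  # registers hold one byte

    def read_reg(self, reg):
        return self.registers.get(reg, 0x00)

cam = FakeI2CCamera()
cam.write_reg(0x01, 0x40)   # hypothetical exposure register
cam.write_reg(0x02, 0x10)   # hypothetical analog gain register
print(cam.registers)  # → {1: 64, 2: 16}
```

The image data itself never travels over I2C; it flows over the MIPI CSI-2 lanes, while I2C carries only this kind of low-bandwidth configuration traffic.<br />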

Before the camera is fully functional, a CSI-2 driver must be<br />

written or reused for a specific embedded board and Linux<br />

distribution. These drivers are typically provided by the vendor<br />

of the camera module or are written by embedded solution<br />

providers themselves. Each camera module type needs its own<br />

driver. Therefore, once a decision for a camera is made, it will<br />

stay in the system for a longer time to avoid adjusting the driver<br />

environment. Before any other image processing<br />
question on the embedded board arises, designers need to<br />
decide whether to use an off-the-shelf driver for a specific<br />
supported embedded board or to program a driver of their own to<br />
have the flexibility of choosing a preferred embedded board.<br />

If the set of image processing algorithms provided by<br />
the CMOS camera module is sufficient for the application,<br />

embedded boards do not need to provide additional image<br />

processing functionalities. This is not the case in most<br />

applications. Therefore, todays embedded boards are equipped<br />

with many co-processors besides the main CPU.<br />

As stated earlier, co-processors like GPUs, dedicated<br />

ISPs, video processors, DSPs or FPGAs can be found on<br />

embedded boards. Each of them has advantages in specific<br />

areas of image processing. That means they can process specific<br />

algorithms very efficiently. Video processors for example have<br />

integrated hardware IP for image compression, which is the<br />

preferred choice over other co-processors. If the use case<br />

requires for example some 3D operations or geometric<br />

transformation, pure CPU based embedded boards would reach<br />

their limits very fast. In this case, a board with integrated GPU<br />

is recommended, because it can process these operations much<br />

faster due to its architecture. On the other hand, when it comes to<br />
pattern matching, the CPU is the preferred choice over the GPU<br />
since it can process these types of algorithms more efficiently.<br />

When it comes to more advanced algorithms and image<br />

processing, like special filters, pixel and signal processing,<br />

CPUs as well as GPUs reach their limits at an early stage. This<br />

is where FPGAs and DSPs are getting interesting for the<br />

embedded designers. Besides the fact that they can perform such<br />

tasks more efficiently, they can be re-programmed, which offers<br />

the designers even more flexibility. Some drawbacks are that the<br />

overall cost of the embedded design and its complexity are<br />

increased.<br />

Another option for embedded designers are dedicated ISPs<br />

on embedded boards. They contain a specific set of image<br />

processing algorithms and enjoy considerable popularity among<br />

designers. In most cases, it is no longer necessary to<br />
use all the image processing algorithms provided by the CMOS<br />
camera module. Instead, the RAW image out of the<br />

sensor is directly transmitted to the ISP via MIPI CSI-2 and<br />



processed on the embedded board. From the hardware point of<br />

view, this option has major advantages for the embedded vision<br />

application. Nevertheless, the embedded designers are<br />

confronted with two disadvantages with this option. First, to get<br />

the full control of the ISP, the software environment must be<br />

designed around the vendor specific software and therefore,<br />

lacks flexibility. Second, such an option has its price. State of<br />

the art system on modules like the NVIDIA Tegra TX2 with<br />

dedicated ISP start with a list price of $499, with a carrier board<br />

not included.<br />

III. ALTERNATIVE OPTION FOR IMAGE PROCESSING<br />

In the previous chapter, different options were described to<br />

perform image processing algorithms from the camera itself to<br />

the embedded board with its CPU and optional co-processors.<br />

An alternative option is to shift some of the image processing<br />
tasks back to the camera. The cornerstone of this alternative<br />

approach is a new kind of camera module, which is powered by<br />

an Application-Specific Integrated Circuit (ASIC) with an<br />

integrated ISP and image processing library. It performs a<br />

similar kind of image processing like state of the art CMOS<br />

camera modules, but extends it with processing usually<br />

performed in high-end embedded boards with dedicated ISPs or<br />

integrated FPGAs and DSPs. This includes Pre-processing, as<br />

well as advanced image processing functionality for example<br />

filters, pixel operations, signal processing or color space<br />
conversion.<br />

One major advantage of performing more image processing<br />
in the camera instead of on the embedded board is the reduction<br />
of processor and co-processor load. This frees up resources<br />
for other tasks. It also helps the embedded designer<br />
to accelerate the development phase and answers the question of<br />
where specific image processing tasks are best performed. An<br />

FPGA or DSP may still be required on the embedded board for<br />

a specific application, but due to the shifting of some image<br />

processing tasks to the camera, less logic cells on the FPGA or<br />

DSP part are needed. This reduces the overall system costs of<br />

the design.<br />

Furthermore, ultra-low-cost embedded boards can be<br />

considered as potential alternatives to mainstream and high-end<br />

equipped boards. So far, designers rely on additional processing<br />

power on these higher cost embedded boards to perform their<br />

required image processing tasks. Since some of the processing<br />

tasks can be shifted and operated on the new camera module,<br />

costs can be reduced by selecting a less performant embedded<br />

board.<br />

Another challenge designers are confronted with, which is<br />
not directly related to image processing but worth a mention, is<br />
the need for a camera driver.<br />

Each CMOS camera module needs its own camera driver.<br />

Every time an embedded designer wants to change the module<br />

for example to implement a higher resolution sensor, he or she<br />

needs to rewrite and configure the driver again. This is not the<br />

case with the new camera module. Once the driver is configured<br />
for a specific SoC (System-on-Chip) and embedded board,<br />
designers can easily replace the camera module with one of higher<br />
sensor resolution without touching the overall architecture of the<br />

camera driver. The driver itself is provided by the camera vendor<br />

for selected SoCs and embedded boards. Alternatively, the<br />

camera driver is planned to be open source to enable the<br />
highest flexibility possible for embedded designers, should a<br />
specific SoC or embedded board not be supported by the camera<br />
provider.<br />

IV. CONCLUSION<br />

As of today, embedded system designers are facing design<br />

challenges when it comes to image processing and embedded<br />

vision. Different options are available on the market with<br />

individual advantages and disadvantages, which need careful<br />

consideration during the design phase in terms of performance<br />

and cost. Looking at the camera, designers have commonly used<br />
CMOS camera modules to add vision to their embedded<br />

design. When it comes to image processing, they have two basic<br />

options regarding the camera. First, using the image processing<br />
functionality implemented in the camera and accepting that<br />
additional pre-processing and advanced image processing are<br />
performed on the embedded board if necessary. Second, skipping<br />
most of the image processing functionality in the camera and<br />
instead making use of more sophisticated image processing<br />

algorithms on the embedded board. Both scenarios fulfil the<br />

requirements of the embedded vision designer in terms of image<br />

processing. What they do not solve satisfactorily are the<br />
overall system costs and the fact that each CMOS<br />
camera module needs its individual camera driver every time the<br />
embedded vision system is upgraded.<br />

As an alternative option, the camera vendor Allied Vision<br />

developed a new kind of camera module for embedded vision<br />

designers. The 1 Product Line, with its embedded camera<br />
modules of the 130 C Family and 140 C Family, is<br />
designed not only to perform pre-processing and advanced<br />
image processing, but also to meet a specific price point, starting<br />
with a list price of 99€ for a single camera module. Both families<br />

are equipped with the most common used interface in the<br />

embedded system environment: MIPI CSI-2. In the first phase,<br />

camera drivers will be available for embedded boards powered<br />

by the NXP i.MX6 and NVIDIA Tegra TX1 and TX2 SoC.<br />

More driver support is planned for boards with the upcoming<br />

i.MX8 SoC. Embedded designers can easily upgrade their<br />

camera module within the same SoC architecture, without<br />

reprogramming the whole CSI-2 camera driver like in the past.<br />

Furthermore, it will be open source to give designers a<br />
maximum amount of flexibility.<br />

In summary, embedded designers will get an attractive<br />

camera module option for their next embedded vision system<br />

design as an alternative to the widely used CMOS camera<br />

modules driven by the mobile industry.<br />



Image Data Compression with a System-on-a-Chip<br />

Joerg Mohr<br />

Imaging Department<br />

Solectrix GmbH<br />

Nuremberg, Germany<br />

Abstract— The exponentially growing use of cameras and<br />

other image data acquisition devices requires efficient methods<br />

for image data compression. Not all image compression<br />

requirements can be fulfilled with common multimedia<br />

standards, some need special compression methods. If such a<br />

method shall be implemented in an embedded device, the<br />

realization in an FPGA (field-programmable gate array) or SoC<br />

(System-on-a-Chip) is a valid option. In this paper we briefly<br />

introduce reasons for and fundamentals of image compression<br />

methods. We present solutions for implementing compression<br />

techniques in an FPGA and give a brief overview of SoCs.<br />

Finally, we describe a specific realization of image data<br />

compression in an SoC.<br />

Keywords—image data compression; System-on-a-chip; FPGA<br />

I. INTRODUCTION<br />

A. Image Data Compression — Why?<br />

Semiconductor technology is getting ever more powerful<br />

and affordable. On the one hand, this helps lower costs for<br />

memory and data rates. For example, according to [1], costs for<br />

NAND flash memory are expected to decrease by a factor of<br />

10 within six years. So one could ask why image data<br />

compression methods are still needed in an environment with<br />

affordable data storage costs. On the other hand, this broad<br />

availability of powerful and affordable semiconductor<br />

technology allows for the creation of new and more products<br />

with high-resolution image sensors and powerful processors. A<br />

study expects that video (i.e. image content) will account for up to<br />
75% of mobile data traffic [2]. In order to handle this<br />
increasing amount of data, advanced techniques that reduce the<br />
data rate while keeping image quality within expectations are<br />
crucial.<br />

B. Special Image Compression Requirements Demand Special Solutions

Many image compression standards are targeted at common applications like still or video cameras, TV sets or mobile phones. These multimedia applications are widely used and, due to the large number of devices, they allow the creation of optimized software libraries for generic processors or the implementation of specific image data compression co-processors in silicon. Even an extra-low-cost Internet of Things (IoT) system like the widely used Raspberry Pi is equipped with a complex H.264 video codec, enabling low-delay video streaming [3].

Although these codecs for multimedia applications are by now available for 4K resolutions at up to 60 frames per second (fps), they often lack properties required by applications with special image compression requirements. Example applications that require properties beyond common multimedia codecs are high-end acquisition devices (professional cameras, film scanners), medical imaging systems (computed tomography scanners, ultrasound scanners), etc.

For these special applications, there are software libraries for generic processors, but silicon co-processors are not available due to the low quantities of the applications. As a result, if these methods are needed in compact embedded systems, the implementation of image compression methods in an FPGA (field-programmable gate array) or in a System-on-a-Chip (SoC) becomes attractive.

II. BASICS OF IMAGE DATA COMPRESSION

A. Generic Types of Data Compression

The goal of data compression is to reduce the amount of data. Basically, two types of compression can be distinguished:

- Mathematically lossless compression, which encodes the data in a more efficient manner but is completely restorable. An often distracting feature of lossless data compression is that the data rate depends on the entropy of the input signal and cannot be efficiently controlled.

- "Lossy" compression, which reduces the amount of data by controlled loss of content that is considered to be negligible. The original cannot be completely restored from the compressed data.
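The entropy dependence of lossless coding is easy to demonstrate with a general-purpose lossless codec (zlib here, standing in for any entropy coder; the two synthetic inputs are illustrative, not image data):

```python
import random
import zlib

random.seed(0)
repetitive = bytes([0, 1] * 50_000)                           # low-entropy signal
noisy = bytes(random.randrange(256) for _ in range(100_000))  # high-entropy signal

ratio_rep = len(repetitive) / len(zlib.compress(repetitive, 9))
ratio_noise = len(noisy) / len(zlib.compress(noisy, 9))

# The same lossless coder yields a huge ratio on the repetitive input and
# roughly 1:1 (no gain at all) on the noise-like input.
print(f"repetitive input: {ratio_rep:.1f}:1, noisy input: {ratio_noise:.2f}:1")
```

The achievable rate is a property of the input, which is exactly why lossless-only schemes cannot guarantee a constant bit rate.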

B. Methods of Image Data Compression

www.embedded-world.eu

Since with lossless methods only compression ratios of approx. 2:1 can be expected on natural images, most image data compression methods combine both types. To be more specific, many image compression methods combine the following techniques: (1) transformation, to reduce correlation of the input data, (2) quantization, to reduce information, and (3) encoding, to reduce entropy.

An example of an image data compression scheme that combines different techniques is JPEG (Joint Photographic Experts Group), named after the group that standardized the format [4]. It comprises:

1) Color space transformation: To transform the primary RGB (Red/Green/Blue) color components to luminance and chrominance values.

2) Subsampling of chrominance components: Because the human eye is more sensitive to luminance than to chrominance, the latter can be spatially subsampled without affecting the visual perception of the reconstructed image. This step is optional in the standard, but it is most commonly used as it improves the compression performance for most natural images.
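Steps 1) and 2) can be sketched in a few lines (a minimal NumPy sketch using the full-range BT.601 matrix that JPEG employs; the flat gray test patch is an arbitrary choice):

```python
import numpy as np

def rgb_to_ycbcr(rgb):
    """JPEG (full-range BT.601) color transform; rgb is an HxWx3 array in 0..255."""
    m = np.array([[ 0.299,     0.587,     0.114    ],
                  [-0.168736, -0.331264,  0.5      ],
                  [ 0.5,      -0.418688, -0.081312]])
    ycc = rgb @ m.T
    ycc[..., 1:] += 128.0      # center the chrominance channels around 128
    return ycc

def subsample_420(chroma):
    """4:2:0 subsampling: average each 2x2 block of a chrominance channel."""
    h, w = chroma.shape
    return chroma.reshape(h // 2, 2, w // 2, 2).mean(axis=(1, 3))

rgb = np.full((4, 4, 3), 200.0)     # flat gray patch: R = G = B = 200
ycc = rgb_to_ycbcr(rgb)
cb = subsample_420(ycc[..., 1])     # 4x4 chrominance -> 2x2 samples
print(ycc[0, 0], cb.shape)          # gray maps to Y = 200, Cb = Cr = 128
```

Subsampling both chrominance channels this way halves the amount of data before any coding has taken place.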

3) Discrete Cosine Transform (DCT): The color-converted and optionally subsampled input image is divided into many 8x8 blocks. During encoding each block is transformed into another 8x8 block of DCT coefficients. The mathematical definition of the forward DCT is:

F(u,v) = \frac{1}{4} C(u)\, C(v) \sum_{x=0}^{7} \sum_{y=0}^{7} f(x,y) \cos\frac{(2x+1)u\pi}{16} \cos\frac{(2y+1)v\pi}{16},  (1)

where C(0) = 1/\sqrt{2} and C(k) = 1 for k > 0.

The DCT has been chosen as it concentrates the energy of the transformed signal in the low-frequency range. Because the human eye is less sensitive to high-frequency content, this contribution can be reduced.

4) Quantization: This reduction of the high-frequency contribution can be achieved by quantization with a carefully selected quantization matrix: each coefficient in the 8x8 DCT block is divided by a corresponding quantization value and the result is rounded, so data is lost:

F_Q(u,v) = \mathrm{round}\!\left(\frac{F(u,v)}{Q(u,v)}\right)  (2)

5) Entropy Encoding: After quantization, the first coefficient is treated differently from the other 63 coefficients. The latter are often zero, and by applying a zigzag scan, an efficient zero-run-length coding can be applied. The results can be represented even more efficiently by coding them via a Huffman table. See [5] for further details.

For correct interpretation of the compressed data, the results of all these steps are ordered into a file and supplied with descriptive headers.
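Steps 3) to 5) can be condensed into a short numerical sketch (a NumPy implementation of Eq. (1) and Eq. (2) plus the zigzag scan; the gradient test block and the flat quantization matrix are illustrative choices, not values from the standard):

```python
import numpy as np

# 8x8 DCT-II basis matrix; row u holds (C(u)/2) * cos((2x+1) u pi / 16), so
# fdct2 below evaluates Eq. (1) as a pair of matrix products.
u = np.arange(8).reshape(-1, 1)
x = np.arange(8).reshape(1, -1)
C = 0.5 * np.cos((2 * x + 1) * u * np.pi / 16)
C[0, :] /= np.sqrt(2)

def fdct2(block):
    """Forward 2-D DCT of one 8x8 block, Eq. (1)."""
    return C @ block @ C.T

def quantize(coeffs, q):
    """Eq. (2): divide by the quantization matrix and round."""
    return np.rint(coeffs / q).astype(int)

# Zigzag order: walk the anti-diagonals, alternating direction, so that the
# high-frequency zeros end up in one long run at the end of the sequence.
zigzag = sorted(((r, c) for r in range(8) for c in range(8)),
                key=lambda rc: (rc[0] + rc[1],
                                rc[0] if (rc[0] + rc[1]) % 2 else -rc[0]))

block = 8.0 * np.add.outer(np.arange(8), np.arange(8))  # smooth gradient block
q = np.full((8, 8), 16.0)                               # flat quantization matrix
coeffs = quantize(fdct2(block - 128.0), q)              # JPEG-style level shift
scan = [coeffs[r, c] for r, c in zigzag]
print("DC:", scan[0], "nonzero AC:", sum(v != 0 for v in scan[1:]))
```

On this smooth block the energy concentrates in a handful of low-frequency coefficients, which is exactly what makes the subsequent zero-run-length and Huffman coding effective.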

This diversification shows that there is currently no universal image compression method. Instead, it is rather an art to select the best-suited compression method for the expected content, expected use, tolerable artifacts and other side parameters.

III. IMAGE DATA COMPRESSION IN THE FPGA

As mentioned in previous chapters, requirements may forbid pure software or hardwired co-processor realizations of image data compression. Modern FPGAs are equipped with enough logic resources to allow complex calculations. Nevertheless, some challenges exist in FPGA realizations that have to be dealt with:

- Floating-point operations in an FPGA are inefficient.
- FPGA-internal memory is expensive and scarce.
- External memory needs extra logic for signal control and arbitration. Access to it is often a bottleneck.
- Complex decisions, e.g., nested if-else decisions and loops, are difficult to implement.

On the other hand, FPGAs offer advantages to the implementer that a generic processor cannot provide, like massively parallel processing and deep pipelining architectures. Additionally, modern FPGAs are equipped with many dedicated DSP (Digital Signal Processing) resources that allow over 5,000 GMAC/s of peak performance [6].

Systems-on-a-Chip provide an additional advantage: as the FPGA and processor subsystems are connected via high-performance buses, they can be programmed to work on a closely connected basis. This allows for combining the advantages of both architectures.

In this paper we present some example challenges of implementing image compression methods in an FPGA or SoC and the solutions to overcome them:

1) DCT Implementation: Equation (1) gives the mathematical definition, but obviously a direct implementation of cosine functions and loops would be inefficient in an FPGA. Other authors have already developed efficient DCT implementations. The Arai-Agui-Nakajima (AAN) algorithm [7] is among the fastest known DCTs. As a solution to the DCT implementation challenge, the algorithm was analyzed and optimized for parallel operation into an efficient realization. The following flow chart gives an overview of the internal pipelining structure:

C. Other Image Compression Methods

Although JPEG is still widely used, other image compression methods have evolved: on the one hand to provide better compression results, on the other hand to better cope with specific input data types, or to provide different tolerable compression artifacts or side parameters that derive from the target application. Other image compression methods include, e.g., GIF, PNG, JPEG-LS, JPEG 2000, ProRes®, DNxHD®, AVC-Intra®, HEIF, etc.


The FPGA logic should be used for parallel pre-processing where possible, while complex decisions should be handled in the processor. In comparison to an RTL-described "soft core" within an FPGA implementation, a processor in an SoC offers much higher performance.

Fig. 1. Flow chart of DCT implementation

2) Quantization implementation: The definition of the quantization is mathematically simple, see Eq. (2), but difficult to implement in an FPGA. A division operation needs the combination of several logic and DSP operations, thus increasing the complexity and lowering the processing speed of the overall design. As a solution to the quantization implementation challenge, all possible divisor values were converted to their reciprocals in 18-bit precision. The pre-calculated values are stored in internal memory. As the divisor values range from 1 to 256, the memory consumption for these reciprocal values is quite manageable. By simple multiplications following a predefined order, this realization allows one quantization operation per clock cycle.
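The reciprocal trick can be checked numerically (a Python model of the fixed-point scheme; the 18-bit fraction width follows the text, while the rounding convention and table layout are assumptions of this sketch):

```python
# Replace division by a stored 18-bit reciprocal multiply, as in the FPGA quantizer.
FRAC = 18                                                     # fixed-point fraction bits
recip = {d: round((1 << FRAC) / d) for d in range(1, 257)}    # precomputed table

def quant_div(x, d):
    """Reference: round(x / d), round-half-up for non-negative x."""
    return (x + d // 2) // d

def quant_mul(x, d):
    """FPGA-style: one multiply, one add and one shift per coefficient."""
    return (x * recip[d] + (1 << (FRAC - 1))) >> FRAC

worst = max(abs(quant_mul(x, d) - quant_div(x, d))
            for x in range(0, 4096, 3) for d in range(1, 257))
print("worst-case error vs. true division:", worst)
```

For 12-bit coefficient magnitudes the 18-bit reciprocals reproduce the rounded division to within one LSB, which is why the table-plus-multiplier realization is acceptable in place of a true divider.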

3) Optimization strategies: While image compression standards mostly describe only the techniques, there is a wide area for implementers to optimize their parameters. For example, the selection of quantization matrices in JPEG to achieve either a constant quality or a constant bit rate (CBR) over several images is the responsibility of the programmer. Modern image compression methods offer a vast set of other options that should be carefully selected in order to achieve optimal compression performance. The following image shows an example of a complex mode decision algorithm in a video transcoding implementation:

Fig. 2. Complex mode decision algorithm. From [8]

Although complex decisions can be described in register transfer languages (RTL) like VHDL or Verilog, the effort for programming, maintaining and changing the code is far higher than in programming languages for a processor. As a solution to overcome this challenge, we propose a closely coupled system of FPGA logic and processor.

IV. SYSTEMS-ON-A-CHIP

A. Definition and Market Overview

Generally, a System-on-a-Chip (SoC) is an integrated circuit that consists of several components like analog or digital interfaces, processing or logic functions. A typical application is in the field of embedded systems.

Within this paper, the term SoC is used for silicon devices:

- with processor and FPGA subsystems on one die,
- both connected via internal buses,
- with a predefined set of processor interfaces like UART, SPI, memory interfaces, etc.

Other functionality can be programmed in the FPGA subsystem.

By now, all major FPGA manufacturers offer SoCs that combine their programmable FPGA fabric with a hard-coded processor based on the ARM architecture. The following table shows some of the SoCs on the market:

TABLE I. MARKET OVERVIEW SOCS

Manufacturer | Microsemi® | Intel® FPGA (Altera®) | Xilinx®
Name | SmartFusion®, SmartFusion®2 | Cyclone® V SoC, Arria® V SoC, Arria® 10 SoC, Stratix® 10 SoC | Zynq®-7000 SoC, Zynq® UltraScale+ MPSoC
CPU subsystem: Core | 1× ARM® Cortex®-M3 | 1..4× ARM® Cortex®-A9, A53 | 1..4× ARM® Cortex®-A9, A53 + 2× R5
CPU subsystem: Memory | SRAM, Flash | Cache, SRAM | Cache, SRAM
CPU subsystem: Interface | Serial, EMC, 10/100 Eth, ... | Serial, EMAC, USB, ... | Serial, EMS, GbE, USB, PCIe, SATA, ...
FPGA subsystem: Logic | 700...150K logic elements | 25K...5500K logic elements | 23K...1143K logic cells
FPGA subsystem: Memory | 36...4488 Kbits | 1.4...229 Mbits | 1.8...70 Mbits
FPGA subsystem: Interfaces | ADCs, DACs, SerDes 3G, PCIe Gen2, ... | SerDes 3...28G, PCIe Gen3, ... | ADCs, SerDes 3...32G, PCIe Gen4, ...

Although FPGA manufacturers provide various tools for parallel FPGA and processor design, example designs, and other support, the development of embedded systems with SoCs is still a complex task. To overcome this challenge and to allow a faster time to market, independent manufacturers have started the development of SoC modules and base boards. Examples are available at [9].



B. The SXoM MS2-K7 module

This specific image data compression implementation was planned for an SXoM MS2-K7 module by Solectrix GmbH, Germany. It is a module in the SMARC® format (Smart Mobility ARChitecture), which has been defined by the Standardization Group for Embedded Technologies e.V. [10]. It is primarily designed for the development of extremely compact low-power systems. Its edge finger connector provides 281 signal lines, with a mixture of typical energy-saving interfaces like SPI and I²C alongside classical computer interfaces such as USB, SATA and PCI Express.

Generally, SMARC® modules are based on ARM processors. They can, however, also be fitted with other SoC architectures. The SXoM MS2-K7 module by Solectrix GmbH is equipped with a Xilinx® Zynq® Z-7030/35/45 SoC.

Fig. 3. Block diagram of a Xilinx® Zynq®-7000 series SoC. From [11]

The SXoM MS2-K7 module provides two banks of DDR3 memory, USB 2.0 host/client functionality, Ethernet PHYs and non-volatile memory, as shown in the following diagram:

Fig. 4. Block diagram of the SXoM MS2-K7 module. From [12]

Using the SMARC® format offers the choice of various base boards, as well as the option of starting development with an ARM processor-only module first or in parallel.

V. APPLICATION EXAMPLE: IMAGE DATA COMPRESSION ON A SXOM MS2-K7 MODULE

A. Requirements

Within this paper we describe an example application of image data compression on an SoC. The exact image compression method cannot be disclosed, but the requirements for an encoder-only implementation were:

- Image resolution: 4096 x 2160 pixels (4K)
- Data format: RGB with 10 bit per color component
- Frame rate: up to 60 fps
- Target compression rate: 4:1 up to 20:1
- Quality: similar to the given software reference
- Other compression properties:
  o Optional chrominance subsampling
  o Block-based DCT
  o CBR with content-optimized quantization
  o Optimized for parallel decoding on a processor, i.e. a header with information about the compressed data organization is mandatory
- Architecture: Xilinx® Zynq®-7000 SoC
- Realization: as an IP core with AXI (Advanced eXtensible Interface) input/output

As the system will not only perform image compression but also other tasks, e.g., networking, user interaction, etc., an implementation method was preferred that allows parallel development and verification of these tasks.

B. Chosen Course of Action

Based on these requirements the following course of action was planned:

1) Evaluation of the compression format in software: In order to evaluate the achievable quality, a bit-exact variant of the reference software was created. This included replacing all floating-point operations with appropriate fixed-point representations and also implementing the quantization as mentioned above. Only the color conversion from RGB to luminance/chrominance, the subsampling, and a framebuffer in external memory were implemented in the FPGA. The software variant was compiled for the processor subsystem, with access to the framebuffered data. Although this implementation needs several seconds per image to compress and is far from the required frame rate, both the AXI interfaces and the image quality could be verified on a system prototype. By using an SXoM with preconfigured memories for the FPGA and processor subsystems, the implementation time could be shortened.



2) Realization of a simplified compression scheme in FPGA: Within this step, the DCT, the quantizer and the entropy encoder were implemented as a module in the FPGA in order to verify the implementation speed. This module was called the "compressor". The module was implemented in a parallel architecture, processing two pixels at the same time at a 280 MHz clock speed. No optimization strategies were included in this step, thus delivering compressed data without CBR properties and, due to the nature of the compression method, without a header. Nevertheless, together with a modified reference software, the reconstruction of the compressed image data was possible in order to compare the achieved quality.

3) Final step: The optimization strategy in the software reference requires evaluation of all DCT blocks with different quantization parameters before choosing a final set of parameters. Within this step this requirement was solved by n-fold implementation of the "compressor" module without data output, delivering a database of compression side information to an optimization algorithm. The latter is implemented on one of the two processor cores of the SXoM module. In order to achieve close coupling to the FPGA modules without software overhead, this core needs no operating system and runs as a so-called "bare-metal application". This different treatment of processor cores is called asymmetric multiprocessing (AMP). After performing an optimization algorithm in order to achieve content-optimized quantization and CBR, the application controls the final quantization process in a separate "compressor" module in the FPGA. Additionally, it creates the header with the information about the compressed data organization. The FPGA modules and the software were designed for close coupling without complex synchronization overhead, and they take advantage of the high-performance internal buses in the SoC.

C. Results

An image compression method that was originally targeted at generic processors was implemented on a Xilinx® Zynq® SoC. A compression performance for 4K images of up to 60 fps was achieved. All requirements regarding compression parameters, image quality, and implementation details were met. The consumed SoC resources for the encoder IP core are:

TABLE II. SOC RESOURCES

Encoder IP core | Slice LUTs | Slice Registers | Block RAM | DSP slices | Processor cores
Absolute numbers | 68,128 | 71,574 | 177 | 98 | 1
Relative usage of a 7035 device | 39.6 % | 20.8 % | 35.4 % | 10.8 % | 50 %

In parallel to the implementation of the image data compression system, other embedded system tasks could be developed.

The usage of an SXoM MS2-K7 module in a standardized format and with predefined memory resources accelerates the development process. Stepwise integration of the image compression components enables verification on the target platform.

REFERENCES

[1] D. Floyer, D. Vellante, B. Latamore and R. Finos, "The Emergence of a New Architecture for Long-term Data Retention," Wikibon.org, July 2014.
[2] R. Möller, P. Jonsson, S. Carson et al., "Ericsson Mobility Report," Ericsson AB, November 2017.
[3] U. Jennehag, S. Forsstrom and F. V. Fiordigigli, "Low Delay Video Streaming on the Internet of Things Using Raspberry Pi," MDPI (Multidisciplinary Digital Publishing Institute), September 2016.
[4] Joint Photographic Experts Group, "Information technology -- Digital compression and coding of continuous-tone still images: Requirements and guidelines" (ISO/IEC IS 10918-1 / ITU-T T.81), February 1994.
[5] W.-Y. Wei, "An Introduction to Image Compression," Graduate Institute of Communication Engineering, National Taiwan University, 2008.
[6] T. Hill, "Accelerating Design Productivity with 7 Series FPGAs and DSP Platforms," Xilinx Inc., February 2013.
[7] Y. Arai, T. Agui and M. Nakajima, "A Fast DCT-SQ Scheme for Images," Trans. IEICE, vol. E-71, no. 11, pp. 1095-1097, November 1988.
[8] K. Lee, G. Jeon and J. Jeong, "Fast mode decision algorithm in MPEG-2 to H.264/AVC transcoding including group of picture structure conversion," Opt. Eng. 48(5), 057003, doi:10.1117/1.3127198, May 2009.
[9] Solectrix GmbH, "SXoM Modules," https://www.solectrix.de/en/sxom-modules
[10] SGET Standardization Group for Embedded Technology e.V., "Smart Mobility ARChitecture," Version 2.0, June 2016.
[11] Xilinx Inc., "Zynq-7000 All Programmable SoC Product Advantages," https://www.xilinx.com/products/silicon-devices/soc/zynq-7000.html
[12] M. Schetter, "SXoM MS2-K7 System-on-Module compliant with SMARC 2.0 specification," Solectrix GmbH, December 2017.



CPU or FPGA for Image Processing: Choosing the Best Tool for the Job

Kevin Kleine
Vision Product Manager
National Instruments
Austin, Texas

Abstract—This paper is a comparison of the considerations and advantages involved in image processing on CPUs and FPGAs. It explores common architectures and core software implementation constraints.

Keywords—Image Processing; FPGA; CPU; Co-Processing; Preprocessing

I. INTRODUCTION

Machine vision has long been used in industrial automation systems to improve production quality and throughput by replacing manual inspection traditionally conducted by humans. We've all witnessed the mass proliferation of cameras in our daily lives in computers, mobile devices, and automobiles. However, the biggest advancement in machine vision has been processing power. With processor performance doubling every two years and a continued focus on parallel processing technologies like multicore CPUs, GPUs, and FPGAs, vision system designers can now apply highly sophisticated algorithms to visualize data and create more intelligent systems.

This increase in performance means designers can achieve higher data throughput to conduct faster image acquisition, use higher resolution sensors, and take full advantage of some of the latest cameras on the market that offer the highest dynamic ranges. An increase in performance helps designers not only acquire images faster but also process them faster. Preprocessing algorithms such as thresholding and filtering or processing algorithms such as pattern matching can execute much more quickly. This ultimately gives designers the ability to make decisions based on visual data faster than ever.

As more vision systems that include the latest generations of multicore CPUs and powerful FPGAs reach the market, vision system designers need to understand the benefits and trade-offs of using these processing elements. They need to know not only the right algorithms to use on the right target but also the best architectures to serve as the foundations of their designs.

II. INLINE VS. CO-PROCESSING ARCHITECTURE

A. Co-Processing

Before investigating which types of algorithms are best suited for each processing element, you should understand which types of architectures are best suited for each application. When developing a vision system based on the heterogeneous architecture of a CPU and an FPGA, you need to consider two main use cases: inline processing and co-processing. With FPGA co-processing, the FPGA and CPU work together to share the processing load. This architecture is most commonly used with GigE Vision and USB3 Vision cameras because their acquisition logic is best implemented using a CPU. The image is acquired on the CPU and sent to the FPGA via direct memory access (DMA). The FPGA can then perform operations such as filtering or morphology. The image can then be sent back to the CPU for more advanced operations such as optical character recognition (OCR) or pattern matching. In some cases, the entire algorithm can be implemented on the FPGA, sending only the results back to the CPU. This allows the CPU to devote more resources to other operations (e.g. motion control, network communication, image display) and can increase total system performance.

Fig. 1. In FPGA co-processing, images are acquired using the CPU and then sent to the FPGA via DMA so the FPGA can perform operations.

B. Inline Processing

In an inline FPGA processing architecture, you connect the camera interface to the I/O pins of the FPGA, enabling the pixels to stream directly from the camera to the FPGA. This architecture is commonly used with Camera Link and CoaXPress cameras because their acquisition logic is well suited for FPGA implementation. This architecture has two main benefits. First, just like with co-processing, you can use inline processing to move some of the work from the CPU to the FPGA by performing preprocessing functions on the FPGA. For example, you can use the FPGA for high-speed preprocessing functions such as filtering or thresholding before sending pixels to the CPU. This also reduces the amount of data that the CPU must process because it implements logic to only capture the pixels from regions of interest, which increases overall system throughput. Traditional Camera Link and CoaXPress frame grabbers use an FPGA to decode the pixel bus and transmit images to a CPU (typically via PCIe). Some of these frame grabbers offer fixed functionality such as onboard de-mosaicing or convolution with a set kernel size. A more advanced approach is to enable the FPGA onboard the frame grabber to implement custom user logic. In this architecture, systems can leverage the user's choice of prebuilt IP from specific software toolchains or enable users to design custom algorithms to run onboard the FPGA. The second benefit of this architecture is that it allows high-speed control operations to occur directly within the FPGA without using the CPU. FPGAs are ideal for control applications because they can run extremely fast, highly deterministic loop rates. An example of this is high-speed sorting, during which the FPGA sends pulses to an actuator that then ejects or sorts parts as they pass by.

Fig. 2. In the inline FPGA processing architecture, the camera interface is connected directly to the pins of the FPGA, passing the pixels directly from the camera.

III. CPU VS. FPGA VISION ALGORITHMS

With a basic understanding of the different ways to architect heterogeneous vision systems, you can look at the best algorithms to run on the FPGA. First, you should understand how CPUs and FPGAs operate. To illustrate this concept, consider a theoretical algorithm that performs four different operations on an image and examine how each of these operations runs when implemented on a CPU and an FPGA.

A. Hypothetical FPGA vs. CPU Processing Example

Many image processing algorithms are inherently parallel and hence suitable for FPGA implementations. These algorithms, which involve operations on pixels, lines, and regions of interest, do not need high-level image information such as patterns. You can perform these functions on small regions of bits as well as on multiple regions of an image simultaneously.

CPUs operate sequentially; the first operation must run on the entire image before the second one can start [for the sake of high-level discussion, this overlooks modern CPU optimization techniques such as pipelining and multithreading]. In this example, assume that each step in the algorithm takes 6 ms to run on the CPU. This results in a total processing time of 24 ms.

Now consider the same algorithm running on the FPGA. Since FPGAs are massively parallel in nature, each of the four operations in this algorithm can operate on different pixels in the image at the same time. In this example, the latency for the first operation to finish processing the initial pixels is 2 ms. From this point on, the operations run in parallel. This parallelism enables a significantly reduced processing time of 6 ms, which is substantially faster than the CPU implementation. Even in an FPGA co-processing architecture, factoring in the image transfer latency, the total processing time is still improved.
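The arithmetic above can be captured in a toy timing model (the figures come from the hypothetical example; the model deliberately ignores pipeline fill details beyond the 2 ms latency mentioned):

```python
STAGES = 4    # operations in the algorithm
PASS_MS = 6   # time for one operation to stream over the whole image

def cpu_total(stages=STAGES, pass_ms=PASS_MS):
    """Sequential CPU model: each operation finishes before the next starts."""
    return stages * pass_ms

def fpga_total(pass_ms=PASS_MS, fill_ms=2):
    """Pipelined FPGA model: all stages process different pixels concurrently,
    so the total time approaches a single pass over the image; the pipeline
    fill latency overlaps with that pass."""
    return max(pass_ms, fill_ms)

print(cpu_total(), "ms on the CPU vs.", fpga_total(), "ms on the FPGA")
```

The key point is that the CPU total grows with the number of stages, while the pipelined total does not.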

Fig. 3. Since FPGAs are massively parallel in nature, they can offer significant performance improvements over CPUs.

B. Practical FPGA vs. CPU Benchmarked Example

Now consider a real-world example in which an image is preprocessed for particle counting. First, you apply a convolution filter to sharpen the image. Next, you run the image through a threshold to produce a binary image. This reduces the amount of data in the image by converting from an 8-bit monochrome to a binary representation, enabling a more efficient morphology algorithm. The last step is to use morphology to apply the close function, which removes any holes in the binary particles. This algorithm executed on a CPU suffers the performance limitation discussed above. In practice, it takes 166.7 ms when using the NI Vision Development Module for LabVIEW and the cRIO-9068 CompactRIO Controller based on a Xilinx Zynq-7020 All Programmable SoC. However, the same algorithm run on the FPGA executes every step in parallel as each pixel completes the previous step. This results in the FPGA taking 8 ms to complete the processing. This 8 ms benchmark includes the DMA transfer time to send the image from the CPU to the FPGA. In some applications, you may need to send the processed image back to the CPU for use in other parts of the application. Factoring in time for that, the entire process takes only 8.5 ms. In total, the FPGA can execute this algorithm nearly 20 times faster than the CPU.

Fig. 4. Running this vision algorithm using an FPGA co-processing architecture yields 20 times more performance than a CPU-only implementation.
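The three-step pipeline (sharpen, threshold, close) can be sketched as follows (a plain NumPy sketch for illustration only; the kernel, the threshold of 128 and the 3x3 structuring element are arbitrary choices, not the benchmarked NI implementation):

```python
import numpy as np

def conv3(img, k):
    """3x3 convolution with zero padding (written for clarity, not speed)."""
    p = np.pad(img, 1)
    out = np.zeros_like(img)
    for dr in range(3):
        for dc in range(3):
            out += k[dr, dc] * p[dr:dr + img.shape[0], dc:dc + img.shape[1]]
    return out

def dilate(b):
    """Binary dilation with a 3x3 structuring element."""
    p = np.pad(b, 1)
    return np.any([p[dr:dr + b.shape[0], dc:dc + b.shape[1]]
                   for dr in range(3) for dc in range(3)], axis=0)

def erode(b):
    """Binary erosion with a 3x3 structuring element."""
    p = np.pad(b, 1)
    return np.all([p[dr:dr + b.shape[0], dc:dc + b.shape[1]]
                   for dr in range(3) for dc in range(3)], axis=0)

def preprocess(img):
    """Sharpen -> threshold -> close (dilate then erode), per the example above."""
    sharpen = np.array([[0, -1, 0], [-1, 5, -1], [0, -1, 0]])
    binary = conv3(img.astype(np.int32), sharpen) > 128
    return erode(dilate(binary))

img = np.zeros((16, 16), dtype=np.int32)
img[4:12, 4:12] = 200      # one bright particle ...
img[7:9, 7:9] = 0          # ... with a dark hole inside it
out = preprocess(img)
print("hole closed:", bool(out[7, 7]))
```

Every step here visits pixels in raster order with only a small neighborhood of context, which is precisely the access pattern that pipelines well on an FPGA.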

C. FPGA Algorithm Characteristics<br />

So why not run every algorithm on the FPGA? Though the<br />

FPGA has benefits for vision processing over CPUs, those<br />

benefits come with trade-offs. For example, consider the raw<br />

clock rates of a CPU versus an FPGA. FPGA clock rates are<br />

typically on the order of 40 MHz to 200 MHz. These rates are<br />

significantly lower than those of a CPU, which can run over 3<br />

245


GHz. Image processing algorithms can include functions that<br />

rely on the entire output of a previous step to be valid before the<br />

next step can begin. These algorithms cannot leverage the<br />

parallelism of an FPGA and thus are more efficient on a CPU.<br />

Additionally, algorithms that require random access to pixel<br />

data throughout the image can be challenging to implement on<br />

FPGAs. While many modern FPGAs include integrated memory<br />

such as Dynamic RAM (DRAM), this is typically in lower<br />

quantities than the RAM available to the CPU. Due to this memory<br />

limitation, it can also be difficult to run algorithms that require a<br />

template to be stored in accessible memory space (e.g. OCR or<br />

pattern matching). Since these algorithms typically cannot raster<br />

scan through an image, they suffer from the difficulties in<br />

parallelization discussed above.<br />

A high-level rule of thumb is that if an algorithm can operate by<br />

raster scanning an image, it is typically well suited for FPGA<br />

implementation. If it cannot, deeper consideration and<br />

potentially complex design are required.<br />

IV. OVERCOMING PROGRAMMING COMPLEXITY<br />

The advantages of an FPGA for image processing depend on<br />

each use case, including the specific algorithms applied, latency<br />

or jitter requirements, I/O synchronization, and power<br />

utilization. Often using an architecture featuring both an FPGA<br />

and a CPU presents the best of both worlds and provides a<br />

competitive advantage in terms of performance, cost, and<br />

reliability. Unfortunately, one of the biggest challenges to<br />

implementing an FPGA-based vision system is overcoming the<br />

programming complexity of FPGAs. Vision algorithm<br />

development is, by its very nature, an iterative process.<br />

Typically, several approaches need to be tried to determine<br />

which works best for a given application.<br />

To maximize productivity, you need to get immediate<br />

feedback and benchmarking information on your algorithms<br />

regardless of the processing platform you are using. Seeing<br />

algorithm results in real time is a huge time-saver when you are<br />

using an iterative, exploratory approach. What is the right<br />

threshold value? How big or small are the particles to reject with<br />

a binary morphology filter? Which image preprocessing<br />

algorithm and algorithm parameters can best clean up an image?<br />

These are all common questions when developing a vision<br />

algorithm, and having the ability to make changes and see the<br />

results quickly is key. However, the traditional approach to<br />

FPGA development can slow down innovation due to the<br />

compilation times required between each design change of the<br />

algorithm. One way to overcome this is to use an algorithm<br />

development tool that helps you develop for both CPUs and<br />

FPGAs from the same environment, while not getting bogged<br />

down by FPGA compilation times. The NI Vision Assistant is<br />

an algorithm engineering tool that simplifies vision system<br />

design by helping you develop algorithms for deployment on<br />

either the CPU or FPGA. You also can use the Vision Assistant<br />

to test the algorithm before compiling and running it on the<br />

target hardware, while easily accessing throughput and resource<br />

utilization information. Extensive development and testing<br />

efforts ensure that the results are identical between the CPU- and<br />
FPGA-executed versions of the algorithm.<br />

Fig. 5. Developing an algorithm in a configuration-based tool for FPGA targets<br />

with integrated benchmarking cuts down on the time spent waiting for code to<br />

compile and accelerates development.<br />

When considering whether a CPU or an FPGA is best for<br />

image processing, the answer is, “It depends.” You need to<br />

understand the goals of your application and use the processing<br />

element that is best suited to that design. However, regardless of<br />

your application, CPU- and FPGA-based architectures and their<br />

many inherent benefits are poised to take machine vision<br />

applications to the next level.<br />



Vision Applications Continuum from High-<br />

Performance and Desktop toward Embedded<br />

Computing made easy by an Efficient OpenCL™<br />

Runtime Environment<br />

Bogdan Ditu, Ciprian Arbone, Fred Peterson<br />

Automotive Compiler Group<br />

NXP Semiconductors<br />

Abstract— As the embedded world seems the perfect place for<br />

exploring very specific hardware acceleration technologies, one of<br />

the biggest challenges that comes along is the complexity of programming these architectures. Defining a programming model for each type of accelerator device is very resource-consuming, considering all the time investments required in tools development, software porting and deployment, not to mention the specific optimizations needed for exploiting all the accelerator features.<br />

Even though some may argue that it is not very suited for<br />

embedded environments, OpenCL might be the perfect solution<br />

for providing a unified programming model for these<br />

acceleration technologies. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

allow algorithm development on high-level targets (Desktop or<br />

even HPC environments) followed by direct deployment on<br />

embedded systems.<br />

In order to achieve this level of portability, the embedded<br />

systems should be powered by an Efficient OpenCL Runtime<br />

Environment (i.e. OpenCL system implementation) which would<br />

support all the embedded targets (including acceleration devices)<br />

and would make the continuum between targets as seamless as<br />

possible.<br />

This paper presents how such an OpenCL Runtime Environment needs to be designed and implemented and how it would help achieve the goal of application portability (with a main focus on vision applications) across the whole spectrum of computing architectures. Using OpenCL, vision<br />

algorithm development should focus only on algorithm details<br />

and should not consider any device architecture characteristics<br />

for the algorithm functionality, while the performance should be<br />

at decent levels for the out-of-the-box / portable application. All<br />

the architecture details should be handled by the OpenCL<br />

runtime environment support for that target in the limits<br />

required by the standard, as efficiently as possible.<br />

Exploration of the very specific device capabilities would be<br />

possible by using custom extensions made available by the<br />

Runtime Environment (also detailed in the paper). The<br />

continuum and portability in both directions could also be<br />

maintained by the Runtime Environment, which can ensure that<br />

the custom extensions are available / emulated on all supported<br />

targets.<br />

The story of using this Efficient OpenCL Runtime<br />

Environment is also backed up by various experiments with<br />
use cases of real-life, out-of-the-box OpenCL applications<br />

(developed using Desktop Environments) implementing Vision<br />

algorithms, which were easily deployed on different types of<br />

embedded multi-core systems (including the usage of some<br />

specific, custom extensions).<br />

Keywords—vision applications continuum; OpenCL<br />

I. INTRODUCTION<br />

Current and future embedded computing systems are becoming more and more complex: besides a main, general-purpose computing unit, they usually include one or more domain-specific computing units. These domain-specific computing units are used for offloading computations from the main cores and even accelerating certain operations (e.g.<br />

graphic accelerators, compression accelerators, cryptographic<br />

accelerators, packet processing accelerators, and many others).<br />

As the embedded world seems the perfect place for<br />

exploring very specific hardware acceleration technologies,<br />

one of the biggest challenges that comes along is the complexity of programming these architectures. Defining a programming model for each type of accelerator device is very resource-consuming, considering all the time investments required in tools development, software porting and deployment, not to mention the specific optimizations needed for exploiting all the accelerator features.<br />

Besides programming each of the accelerating cores, another<br />

major challenge is that of coordinating, synchronizing and<br />

www.embedded-world.eu<br />



partitioning the workload each of the cores needs to do. Even<br />

though each of the accelerators might have its own efficient<br />

programming model (regardless of its complexity and ease of<br />

enablement), it would cover at most the interaction between the<br />

main computing units and one specific accelerator (or class of<br />

accelerators). The specificity and efficiency of each particular<br />

programming model would probably make it unsuitable for<br />
cores other than the ones it was intended for. This is why<br />

coordinating multiple accelerators would usually involve the interaction of multiple programming models, which were most likely not designed with interoperability and collaboration as their strongest points.<br />

OpenCL is a standard that enables a parallel programming<br />

paradigm that can support homogeneous as well as<br />

heterogeneous multi-core and many-core systems. Besides<br />

providing complex means for multi-core parallel programming<br />

(including collaboration, coordination, synchronization and<br />

workload partitioning for all the cores in the system, either<br />

homogeneous or heterogeneous), OpenCL also provides a<br />

unified programming model for all entities involved in the<br />

system, as well as a high-level of application portability. In<br />

other words, OpenCL enables ease-of-use for multi-core<br />

architectures programming (with significant impact especially<br />

around the heterogeneous area) as well as assuring a<br />

standardized and portable approach for using multi-core<br />

capabilities.<br />

Even though some may argue that it is not very suited for<br />

embedded environments, OpenCL might be the perfect solution<br />

for providing a unified programming model for these<br />

acceleration technologies. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

provide a great usability advantage and should allow algorithm<br />

development on high-level targets (Desktop or even HPC<br />

environments) followed by direct deployment on embedded<br />

systems.<br />

In order to achieve this level of portability, the embedded<br />

systems should be powered by an Efficient OpenCL Runtime<br />

Environment (i.e. OpenCL system implementation) which<br />

would support all the embedded targets (including acceleration<br />

devices) and would make the continuum between targets as<br />

seamless as possible.<br />

This paper presents how such an OpenCL Runtime<br />

Environment needs to be designed and implemented and how it<br />

would help achieve the goal of application portability<br />
(with a main focus on vision applications) across the whole<br />
spectrum of computing architectures. Using OpenCL, vision<br />

algorithm development should focus only on algorithm details<br />

and should not consider any device architecture characteristics<br />

for the algorithm functionality, while the performance should<br />

be at decent levels for the out-of-the-box / portable application.<br />

All the architecture details should be handled by the OpenCL<br />

runtime environment support for that target in the limits<br />

required by the standard, as efficiently as possible.<br />

Exploration of the very specific device capabilities would<br />

also be possible by using custom extensions made available by<br />

the Runtime Environment. These extensions are detailed in the<br />

paper, including examples and performance evaluation of using<br />

such extensions. In this context, the continuum and portability<br />

in both directions (from high-level computing toward<br />

embedded computing, and the other way around) could also be<br />

maintained by the Runtime Environment, which can ensure<br />

that the custom extensions are available / emulated on all<br />

supported targets.<br />

The story of using this Efficient OpenCL Runtime<br />

Environment as the enabler of the application continuum is<br />

also backed up by various experiments with use cases of<br />
real-life, out-of-the-box OpenCL applications (developed using<br />

Desktop Environments) implementing Vision algorithms,<br />

which were easily deployed on different types of embedded<br />

multi-core systems (including the usage of some specific,<br />

custom extensions, as mentioned before).<br />

The rest of the paper is structured following the logical line<br />

presented in the introduction so far: the next section gives<br />
a brief overview of the OpenCL programming paradigm<br />

(section II). Once the reader gets an idea about how OpenCL<br />

can be used, we will detail the OpenCL usability story for<br />

embedded systems, with a main focus on how OpenCL<br />

provides applications continuum from desktop development<br />

toward embedded computing systems (section III). Since one<br />

of the key elements in supporting the usability story is having<br />

an efficient OpenCL runtime environment implementation, the<br />

next section will present our proposal for a portable OpenCL<br />

system suitable for multi-core embedded systems (section IV).<br />

Next, we are also proposing custom extensions for exploring<br />

and exploiting very specific device capabilities (section V).<br />

The next two sections are presenting the embedded systems the<br />

proposed OpenCL implementation was targeted for together<br />

with the experimentation and performance evaluation of using<br />

this system efficiently for the usability scenario described so<br />

far (which involves the applications continuum from high-level<br />

development toward embedded systems deployment) (section<br />

VI and VII). Finally, we draw some conclusions on the proposed<br />
solutions and usability scenarios, and outline future work that can<br />
be developed from the presented ideas (section VIII).<br />

II. OPENCL OVERVIEW<br />

As mentioned before, OpenCL is a standard that enables a<br />

parallel programming paradigm that could be used for any type<br />

of multi-core or many-core system (containing either<br />

homogeneous or, most important in the presented context,<br />

heterogeneous cores).<br />

All the details about the OpenCL standard and paradigm<br />

can be found in the OpenCL Specification (provided and<br />
maintained by the Khronos™ Group) [1]. In this overview, we<br />
only highlight the aspects of the standard that are<br />
of great interest in sustaining our usability scenario and can<br />
provide a better programming model for the heterogeneous<br />

multi-core embedded systems described before. Our previous<br />

work ([4], [5]), as well as other overviews on OpenCL ([2],<br />

[3]), help us in this direction.<br />

The main aspect of OpenCL that sustains our usability<br />

scenario (and the application continuum from high-level targets<br />

toward embedded computing) is the portability of the<br />

application. The same application is guaranteed by the standard<br />



to run on any system that supports the OpenCL paradigm.<br />

The portability is assured by developing the application for an<br />

abstract system, without any concern for the physical system<br />

that runs beneath. It is the OpenCL Runtime System (the<br />

OpenCL implementation) that is responsible for the<br />
mapping of the abstract system onto the target architecture.<br />

First of all, what are the means that the OpenCL<br />

programming paradigm provides for handling a multi-core<br />

system? The OpenCL programming paradigm consists of a<br />

standardized API (which provides access to and allows<br />

controlling the OpenCL runtime environment) and a<br />

programming language derived from the C/C++ language<br />

(which allows programming the accelerating cores, by focusing<br />

on the solving mechanisms of the parallel problem).<br />

To understand how these means can be used for handling a<br />

multi-core system, we also need to clarify the logical entities<br />

that are involved in an OpenCL system. These entities are<br />

directly derived from the OpenCL abstract system platform<br />

definition: an OpenCL system consists of one host and one or<br />

more compute devices. To continue the hierarchy provided by<br />

the OpenCL abstract platform, any compute device can consist<br />

of multiple compute units, which in turn can contain multiple<br />

processing elements.<br />

Once we become aware of the logical entities that are part<br />

of the OpenCL system, we will try to explain how the OpenCL<br />

paradigm (the means described above) can be applied for<br />

programming these entities and how they can work together as<br />

a whole for the complete programming model of the multi-core<br />

system. An OpenCL application will have to program both the<br />

host and the compute devices for controlling the complete<br />

system; from this perspective, the application will consist of:<br />

• one main application – running on the host of the<br />

OpenCL system – the host code can be programmed<br />

using C/C++ (but not limited to it) and it interacts with<br />

the OpenCL System through the mentioned<br />

standardized API. This application provides the<br />

context and the control board for solving one or more<br />

parallel problems by defining the scenario on how the<br />

problems will be solved (which entities can collaborate,<br />

work in parallel, synchronize, or partition and balance<br />

their workloads to solve pieces of parallel problem, or<br />

multiple parallel problems).<br />

• one or more OpenCL kernels – running on compute<br />

devices as defined and controlled by the main<br />

application scenario. Each OpenCL kernel is used for<br />

solving a parallel problem by defining the base<br />

algorithm for it. This kernel will be executed in as many<br />

instances as required by the problem iteration space.<br />

Instances of the same kernel can be run on compute<br />

devices as instructed by the application scenario, while<br />

the work partitioning and workload balance can be<br />

smartly handled in an automatic and transparent manner<br />

by the OpenCL Runtime System implementation. The<br />

kernel itself defines only the core mechanism for<br />

solving the parallel problem (one instance), while the<br />

OpenCL Runtime Engine is responsible for<br />
ensuring that the kernel is executed as many times as the<br />

iteration space requires.<br />

Now that we have provided some information about how<br />
OpenCL can be used as a programming paradigm for multi-core<br />
systems, we can take a step further and see what types of<br />
parallel problems OpenCL can be used to solve.<br />

Considering that an OpenCL application (powered by the<br />

OpenCL programming paradigm / model) can be used as<br />

control board for a multi-core system, there are many types of<br />

scenarios that can be imagined for defining such an application:<br />

• the most common type of parallel problem that can be<br />

solved is that of repeatedly applying the same<br />

mechanism to a large iteration space, using large<br />

amounts of data (similar with the single program –<br />

multiple data (SPMD) computing paradigm) (Fig. 1)<br />

o in this context, the problem will be solved by compute devices or subsets of them<br />
o as mentioned before, the solving mechanism is called a kernel; one instance of it is called a work-item and will be executed by the OpenCL abstract entity called a processing element<br />
o work-items are grouped together in work-groups and executed by compute units – this abstract level of work partitioning comes with programming mechanisms that allow some level of concurrency and synchronization<br />
o the complete iteration space of the parallel problem should be covered by the execution of all work-groups<br />

• such parallel problems can be used as the basic building<br />

blocks of an OpenCL application, while the controlling<br />

application can resolve one or more such problems,<br />

using one or more compute devices, partitioning and<br />

balancing the workloads as needed for efficiency of the<br />

problem in a certain configuration of the targeted<br />

system<br />

• the OpenCL standard also provides the means for<br />

solving more problems in parallel, compute devices<br />

being used either in collaboration or concurrently to<br />

cover all intended needs – one can define different sets<br />

of problems or tasks from them, execute them in any<br />

order, define dependencies between them, synchronize<br />

them as needed<br />

Fig. 1. The most common type of parallel problem solved by the OpenCL<br />
paradigm.<br />




• tasks or problems can be defined that are not even<br />
necessarily parallel problems – they can be defined as<br />

sequential tasks based on the capabilities of the compute<br />

devices involved in the system<br />

• on the other end, the system can be configured for a high<br />
degree of parallelism (even higher than the OpenCL<br />
standard defines), different compute units in the<br />

system being able to provide intrinsic parallel<br />

capabilities which can be smartly exploited by the<br />

OpenCL system, either automatically or with some<br />

custom extensions provided at the application level<br />

• also, the workload partitioning and load balancing can<br />

be smartly and dynamically handled by the OpenCL<br />

Runtime System, either implicitly or by exposing some<br />

custom control mechanisms to the main application<br />

As one can see, the OpenCL programming paradigm can be<br />

a very useful and meaningful way of programming a complex<br />

multi-core system involving several types of heterogeneous<br />

cores. Even though some may argue that it is not very suited<br />

for embedded environments, in our opinion OpenCL can be the<br />

perfect solution for providing a unified programming model for<br />

the described acceleration technologies and combining them in<br />

a single system. By definition, OpenCL provides a<br />

standardized and portable approach for using any multi-core<br />

capabilities. The portability characteristic is the one that should<br />

provide a great usability advantage and should allow algorithm<br />

development on high-level targets followed by direct<br />

deployment on embedded systems.<br />

Even if the usability scenario is the best-case scenario, in<br />

most of the cases it provides only the starting conditions for<br />

defining an efficient application that properly uses and<br />
exploits all of the capabilities available in the<br />

targeted multi-core system. Many of the features described<br />

before are useful only in the context of specializing an OpenCL<br />

application for the targeted multi-core system. These features<br />

were presented only to give the reader an informed<br />
idea of the powerful means that OpenCL can provide. One<br />

should be aware that any step further in the specialization<br />

direction gains efficiency, but loses from the portability (and<br />

application continuum) perspective.<br />

III. OPENCL USABILITY STORY FOR EMBEDDED SYSTEMS –<br />

APPLICATIONS CONTINUUM TOWARD EMBEDDED COMPUTING<br />

As described both in the Introduction (Section I) and the<br />

OpenCL Overview (Section II), we can agree that OpenCL can<br />

be a very good candidate for a unified programming model that<br />

can be used for complex multi-core embedded systems. The<br />

focus so far was on the multi-core programming capabilities<br />

the OpenCL paradigm can provide, what types of problems can<br />

it be used for and how suited it is for exploiting the<br />

performance provided by specialized acceleration technologies.<br />

As the purpose of this paper is that of defining a usability<br />

scenario that can allow the application continuum from high-level<br />
targets toward embedded computing systems, this section<br />

will focus on this aspect.<br />

To better understand how this scenario will work, we can<br />

use Fig. 2 to explain the most convenient approach to<br />
deploying multi-core applications on embedded systems,<br />
the advantages of using this approach, and how the<br />
drawbacks of this scenario can be overcome.<br />

Fig. 2. OpenCL Usability Scenario for Embedded Systems – Applications Continuum toward Embedded Computing<br />



The OpenCL usability scenario for embedded systems, the<br />

one that can assure the applications continuum from high-level<br />

development toward embedded computing systems, should<br />

follow these steps:<br />

• the application developer should be mainly focused on<br />

developing domain specific algorithms (vision, linear<br />

algebra, scientific, medical, etc.) using general aspects<br />

of the parallel programming paradigm (defining<br />

problems that could be solved using SPMD principles,<br />

described before)<br />

• there is no doubt that OpenCL can be used as the<br />

programming model / paradigm for developing the<br />

application the domain specific algorithms will be<br />

integrated in<br />

• for convenience, the application would be developed in<br />

a high-level development environment, with main focus<br />

on the algorithm and application correctness, the<br />

environment being suited for running, debugging and<br />

preliminary evaluating performance of the application<br />

• since the application is developed using the OpenCL<br />

programming model, there will be direct deployment of<br />

the application toward embedded computing systems<br />

that are supported by an OpenCL Runtime Environment<br />

(and have an OpenCL system implementation targeted<br />

for those multi-core systems)<br />

• both the application development and the embedded<br />

system deployment would not require any knowledge of<br />

the targeted embedded system<br />

• in this phase the developer should be able to run the<br />

OpenCL application (and assess its correctness) as well<br />

as conducting an initial performance evaluation<br />

• by providing an Efficient OpenCL Runtime<br />

Environment implementation, at this point the out-of-the-box<br />
performance should be at decent levels, with<br />
room for improvement<br />

• based on the performance requirements for the parallel<br />

application in the embedded context, some iterations of<br />

specialization of the OpenCL application can occur,<br />

mainly based on the specificities of the targeted multi-core<br />
system that can be exploited:<br />

o either by application specialization using OpenCL standard mechanisms, but taking into consideration the multi-core system resources (available compute units and their capabilities)<br />
o or by specializing the application using custom extensions specific to certain accelerating target architectures<br />

• the specialization and re-deployment iteration could happen<br />
at the embedded system level, including possible<br />

correctness re-assessment (and maybe embedded<br />

system debugging) OR could complete the continuum<br />

of the application by moving back to the high-level<br />

development. In this case, some extensions might be<br />

needed on the OpenCL system implementation for the<br />

high-level system used for application development<br />

Even though the scenario might look straightforward for<br />

some of the readers and maybe unrealistic for others, it has<br />

many advantages:<br />

• the focus of the algorithm developer on the algorithm<br />

functionality, not on the specific characteristics and<br />

capabilities of the targeted multi-core embedded system<br />

• better running and debugging conditions in the<br />

developing phase<br />

• using out-of-the-box OpenCL applications (or just<br />

algorithms) already available as a starting point for their<br />

embedded deployment and performance specialization<br />

(with minimum investment for the first phases of the<br />

embedded application)<br />

Especially for vision applications, this type of scenario is<br />

very well suited, since most vision algorithm developers come<br />
from high-level development environments (including<br />

desktop, high-performance or graphical units development).<br />

Another advantage in the vision applications context is<br />
the large existing code base of vision algorithms, including<br />

implementations using OpenCL or similar parallel<br />

programming paradigms.<br />

In order to achieve this scenario, there are a few aspects that<br />

one should keep in mind, aspects that would also be detailed in<br />

the following sections of the paper:<br />

• the need of an Efficient OpenCL Runtime Environment<br />

implementation available for the targeted embedded<br />

system (with a main focus on its efficiency, ease of<br />
porting toward new target systems, and ease of adding<br />
specific custom extensions)<br />

• the first direct deployment should offer decent<br />

performance (due to the efficient OpenCL system<br />

implementation), but it would be only the starting point<br />

for the real performance the system can achieve<br />

• application specialization can be done either by using<br />

OpenCL standard mechanisms or by using custom<br />

extensions specific to targeted systems<br />

• the more specialization an application gets, the more it loses<br />

from its portability (and reduces the application<br />

continuum) toward other OpenCL systems<br />

IV. PROPOSED EFFICIENT OPENCL RUNTIME<br />

ENVIRONMENT FOR MULTI-CORE EMBEDDED SYSTEMS<br />

As mentioned several times before, one of the key elements<br />

in the usability scenario (application continuum) is the OpenCL<br />

Runtime Environment. The implementation of the OpenCL<br />

system influences the usability scenario in at least a few aspects:<br />

• it makes it possible to apply the scenario in the first phase<br />

(since it makes the OpenCL available for the targeted<br />

multi-core embedded system)<br />




• it contributes to the decent out-of-the-box performance<br />

of the application deployed on the targeted system<br />

• it allows application specialization by adding custom<br />

extensions for specific capabilities of the targeted<br />

system<br />

To meet these expectations, the central point of our usability scenario is the OpenCL system implementation generated from the Efficient OpenCL Runtime Environment for Multi-Core Embedded Systems, based on our previous work described in [4].<br />

We will not go into too much detail here; the interested reader can consult the mentioned paper. We only reiterate those aspects of the design that are relevant to this discussion, together with the challenges the design had to meet.<br />

Later sections give details on the multi-core embedded systems this OpenCL Runtime Environment was ported to (Section VI), as well as on the experiments we conducted on the vision application continuum (Section VII).<br />

Our previous work focused on the design of an efficient OpenCL Runtime Environment for Multi-Core Embedded Systems, keeping in mind the portability of the runtime system across many types of hardware configurations, including different underlying operating systems and even bare-metal systems. The design addressed the following considerations:<br />
• the challenges of supporting OpenCL on multi-core embedded systems<br />
• OpenCL runtime architecture decisions that ease porting of the OpenCL implementation to new embedded systems<br />
• OpenCL runtime configuration considerations for the best mapping of the abstract OpenCL system onto the physical target system<br />
• the efficiency of the workload partitioning and balancing strategies, including dynamic adjustments based on runtime conditions of the overall system<br />
• easy integration of custom extensions without compromising the portability of the design and implementation<br />

V. PROPOSED OPENCL CUSTOM EXTENSIONS<br />

An important aspect of specializing an OpenCL application toward the targeted multi-core embedded system is the ability to provide custom OpenCL extensions for the specific capabilities the targeted systems may have.<br />
In this section, we detail some of the custom extensions that may be needed in different contexts. The list is not exhaustive and mostly covers the types of extensions we came across on some of our targeted systems:<br />

• OpenCL C language extensions that expose accelerator-specific instructions (intrinsics). This type of extension allows writing C-like code that generates very specific instructions (similar to writing assembly code, but within a high-level language enhanced by OpenCL). It is highly target-specific and requires re-implementation for each new target that uses this feature; most of the changes are confined to the OpenCL C compiler for the targeted system.<br />
• OpenCL extensions that allow calls to native accelerator functions. This type of extension allows calling accelerator-native code from the OpenCL host application. Note that such native code does not meet the characteristics of kernel code and thus cannot be invoked through standard OpenCL functionality. This extension is not specific to a particular target; it is specific to the OpenCL system implementation and should be easy to reuse for any target.<br />

• OpenCL C vector language extensions that enable the use of vector units within an OpenCL compute unit. This type of extension allows exploiting the vector capabilities of the compute unit without relying on target-specific vector configuration details and without adapting the algorithm / kernel to specific vector partitions. It should be straightforward to adapt from target to target, the only target-specific implementation details being in the vector code generation part of the OpenCL C compiler.<br />

• OpenCL extensions for cascading kernels on the device without transferring control back to the host. This type of extension is very useful for preserving data locality and reducing data transfers: when consecutive computations on the same data set each require moving control back and forth between host and device, caching in particular suffers. The extension is not specific to a particular target, and is especially handy on systems with limited device memory and costly access to OpenCL global memory.<br />

• OpenCL extensions that allow explicit partitioning of workloads between different devices, as well as explicit dynamic workload balancing. These extensions give the OpenCL application better control over the executed tasks and enable better workload balancing between devices and compute units based on dynamic runtime conditions. Some of these aspects can be handled implicitly without breaking any OpenCL standard rules, but in certain situations explicit control is required.<br />
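The dynamic balancing idea behind the last extension can be illustrated with a small self-scheduling sketch in plain C. This is not code from the described OpenCL implementation; the chunk size and the 2:1 device speed ratio are illustrative assumptions.<br />

```c
#include <stddef.h>

#define N_ITEMS 100
#define CHUNK     8

/* Shared chunk counter: each device claims the next unprocessed
 * chunk, so a faster device automatically takes a larger share
 * of the iteration space (dynamic balancing). */
static size_t next_item = 0;
static size_t processed[2];   /* items handled per device */
static int    done[N_ITEMS];  /* how often each item ran  */

/* Claim up to CHUNK items; returns the count, 0 when exhausted. */
static size_t grab_chunk(size_t *start) {
    if (next_item >= N_ITEMS) return 0;
    *start = next_item;
    size_t n = N_ITEMS - next_item;
    if (n > CHUNK) n = CHUNK;
    next_item += n;
    return n;
}

static void run_on_device(int dev, size_t start, size_t n) {
    for (size_t i = start; i < start + n; i++) done[i]++;
    processed[dev] += n;
}

/* Device 0 is modeled as twice as fast as device 1: per round it
 * claims two chunks while device 1 claims one. */
static void run_devices(void) {
    size_t start, n, claimed;
    do {
        claimed = 0;
        for (int r = 0; r < 2; r++)
            if ((n = grab_chunk(&start)) > 0) { run_on_device(0, start, n); claimed += n; }
        if ((n = grab_chunk(&start)) > 0)     { run_on_device(1, start, n); claimed += n; }
    } while (claimed > 0);
}
```

A real implementation would protect the shared counter with atomics or locks; the sketch only shows why self-scheduling balances load between devices of unequal speed.<br />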

Most of the mentioned extensions are in various stages of implementation. The OpenCL C vector language extension in particular has already been explored, implemented and published ([5]); some of its details are given below. The specialization enabled by this extension was also evaluated on one of our targeted systems, in the context of the out-of-the-box experience. The results of this experiment are also available in [5] and are briefly presented in a later section.<br />

The motivation behind the OpenCL C vector language extension is that some target systems offer vector capabilities within the device compute units. These compute units can run a work-group faster by executing several work-items in parallel in vector fashion. Although ideally the vector capabilities would be exploited automatically by the OpenCL C compiler (using auto-vectorization), there are situations where the application benefits from being written in a vector manner (the actual vector extensions are applied in the OpenCL kernels). To provide this functionality, we defined a language extension (which we called the OpenCL Vector Language Extension) that helps the user design and implement kernels in a vector manner, without worrying about partitioning data into particular vector sizes (as the native OpenCL vector types would require). Using this extension is roughly equivalent to the user telling the compiler which pieces of code would benefit from vector execution.<br />

The mechanism for extending the OpenCL language is not specific to a particular device architecture and can be used generically, regardless of the vector architecture and vector specifics of the device itself. It is the concern of the OpenCL system (including the OpenCL compiler) to handle the vector language extensions for the supported devices.<br />

This solution can be used for any device with vector execution, regardless of the vector unit size, since it does not require specifying a vector size. Applications and kernels using it remain portable (and keep the continuum) as long as the OpenCL system supports the extension for the targeted device. The custom OpenCL Vector Language Extension consists of generic vector types (similar to the existing OpenCL vector types, but more generic, without a fixed vector size). The OpenCL compiler detects the generic vector types and generates vector code for the device.<br />

There are a few aspects the application should take into account, all of which are detailed in the paper, including an example of transforming an existing application into a vector application using the language extension. We mention some of them here so the reader is aware of the impact this extension can have on the application itself and on its execution in the OpenCL system:<br />

• replacing scalar types with generic vector types in the OpenCL kernels wherever vector execution can be enabled<br />
• the OpenCL system handles (and automatically adjusts) the traversal of the iteration space so that vector execution is taken into account<br />
• vector memory accesses need to be performed explicitly using library functions<br />
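The second point above — the runtime adjusting the iteration-space traversal — can be sketched in plain C. This is an illustrative lowering only, with VW standing in for a vector width the real runtime would pick; it is not actual compiler output.<br />

```c
#include <stddef.h>

#define GLOBAL_SIZE 64
#define VW 4   /* illustrative vector width; GLOBAL_SIZE % VW == 0 assumed */

/* Reference: one work-item per element (scalar kernel body). */
static void kernel_scalar(const int *a, const int *b, int *out) {
    for (size_t gid = 0; gid < GLOBAL_SIZE; gid++)
        out[gid] = a[gid] * b[gid] + 1;
}

/* After the iteration-space adjustment: the runtime steps the
 * global index by VW, and each step executes VW work-items as
 * one vector operation (modeled here by the inner lane loop). */
static void kernel_vector(const int *a, const int *b, int *out) {
    for (size_t gid = 0; gid < GLOBAL_SIZE; gid += VW)
        for (size_t lane = 0; lane < VW; lane++)
            out[gid + lane] = a[gid + lane] * b[gid + lane] + 1;
}
```

On a real target, the inner lane loop would be a single vector instruction emitted by the OpenCL C compiler rather than a scalar loop, but the results are identical.<br />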

Regardless of the extensions an application uses for specialization, one should be aware that the portability of the application decreases with the number of target-specific aspects it takes into account. Some extensions are generic and rely little on target specifics; others are conditioned on the same extension being supported by the OpenCL implementations of the other targets in the spectrum; and some are so target-specific that they cannot be ported at all.<br />

VI. OPENCL SYSTEM IMPLEMENTATION AND THE<br />

TARGETED MULTI-CORE EMBEDDED SYSTEMS<br />

As the main enabler for the application continuum, we use an implementation of the Efficient OpenCL Runtime Environment for Multi-Core Embedded Systems mentioned before, starting from the design and implementation described in [4].<br />
The implementation of the design (as detailed in that paper) started as a proof of concept that the design is more than a theoretical approach, and it raised many questions and challenges. It began with multi-core systems that have large-scale applicability in various areas, such as automotive, infotainment, industrial, medical and communications. Both the types of systems and the types of applications were well suited to the OpenCL programming paradigm.<br />

Our initial implementations ([4]) targeted multi-core systems that were Linux enabled, mostly using the CPU-CPU OpenCL paradigm:<br />
• i.MX6Q family ([6]) (quad-core ARM® CPU) – used for automotive and infotainment<br />
• 32-core Power Architecture® CPU – used for communication applications<br />
• QorIQ® T4240 series ([7]) (24-virtual-core Power Architecture® CPU complemented by several communication and signal processing accelerators) – used for communication applications<br />

Since then, the OpenCL system implementation has passed the proof-of-concept phase and matured toward a product. The number of supported multi-core embedded systems has also increased, as has the diversity of the systems, the targeted applications and the goals:<br />
• both homogeneous and heterogeneous multi-core and many-core systems<br />
• Linux-based as well as bare-board systems<br />
• both the CPU-CPU and CPU-accelerator paradigms<br />
• multi-threading and multi-process execution, as well as multi-processor / multi-OS inter-connectivity<br />
• the goal of using OpenCL was both to evaluate the performance scalability of the system (including physical target, OpenCL system and application) and to demonstrate the ease of use and portability of the OpenCL application (the application continuum)<br />



(including real-life and real-time demonstrations of the<br />

system functionality)<br />

• the OpenCL-focused applications were in the areas of vision (image and video processing), networking and, not least, artificial intelligence (deep learning / neural networks)<br />

A brief list of systems the OpenCL implementation was targeted for, mostly in the embedded computing area, but not limited to it:<br />

• x86/64 – configurations up to 64 compute units (mostly<br />

used for prototyping and validation)<br />

• i.MX6Q family ([6]) – quad-core ARM® CPU<br />

• QorIQ® T4240 series ([7]) – 24 Power Architecture®<br />

virtual cores<br />

• Layerscape2 communication processor – LS2080A/LS2084A ([8]) – 8 ARM® Cortex® A57 / A72 cores<br />

• S32V234 automotive vision processor ([9]) – 4 ARM®<br />

Cortex® A53 cores<br />

• NXP BlueBox autonomous driving platform ([10]) –<br />

S32V234 + LS2084A (4 + 8 ARM® cores)<br />

• S32V234 automotive vision processor ([9]) using the<br />

APEX-2 vision accelerator<br />

We mention all of these achievements to give the reader an impression of the maturity of the proposed OpenCL Runtime Environment, of its portability (by design), and of the support it provides for the application continuum from high-level application development toward embedded computing systems.<br />

VII. EXPERIMENTATIONS AND PERFORMANCE EVALUATION<br />

OF VISION APPLICATIONS CONTINUUM EXPERIENCE<br />

As mentioned in the previous section, besides the increased number of multi-core embedded systems we targeted, the goal of this portfolio is experimentation and performance evaluation, more precisely:<br />
• performance evaluation of system scalability – including the hardware architecture, the OpenCL system and the application viewed as a whole<br />
• demonstration of the ease of use and portability of the OpenCL application (the application continuum) – including real-life and real-time demonstrations of the system functionality<br />
• demonstration of how the proposed custom extensions work in real life, and of the performance impact that application specialization can have (with minimum effort)<br />

A. System Scalability<br />

Most of the mentioned target systems were used to evaluate the performance scalability, especially the embedded systems with a larger number of cores (8+). The applications used covered all the mentioned areas:<br />

• vision applications (including image and video<br />

processing)<br />

• networking applications<br />

• neural networks<br />

We will not go into too much detail, as the purpose of this paper is a different one, but it is worth mentioning that the overall system performance scaled properly with the number of cores until it reached the scalability limit of the application. Any performance increase beyond this limit usually requires redesigning the application.<br />
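This scalability limit is Amdahl's law in action: if a fraction p of the application runs in parallel, the speedup on n cores is bounded by 1/(1 − p) no matter how many cores are added. A small sketch (the value of p used in the test is illustrative, not measured data from this paper):<br />

```c
/* Amdahl's law: speedup(n) = 1 / ((1 - p) + p / n).  As n grows,
 * the p/n term vanishes and the speedup plateaus at 1 / (1 - p),
 * which is the application's scalability limit. */
static double speedup(double p, int n) {
    return 1.0 / ((1.0 - p) + p / (double)n);
}
```

Redesigning the application, as mentioned above, corresponds to raising p and thereby lifting the plateau.<br />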

In the context of this paper, it is important to mention that the applications used were out-of-the-box OpenCL applications, with a special focus on vision applications.<br />

For the sake of the argument, we show the performance evaluation of an out-of-the-box OpenCL vision application (a complex combination of image filters from [12]) on a system using up to 64 cores (Fig. 3). The OpenCL application's performance was compared with a scalar implementation of the same application (no parallelism at all) and with an implementation using a programming model specific to that system. As the figure shows, the out-of-the-box OpenCL performance is consistently comparable with that of the specialized application implemented using the system-specific programming model, which demonstrates the efficiency of the OpenCL system for out-of-the-box, portable applications.<br />

B. Usability Story – Vision Application Continuum<br />

The next step is demonstrating the usability story or, as we call it, the vision application continuum, from high-level development environments (an OpenCL application developed on a desktop) toward an embedded computing system.<br />

For this demonstration, we recount the actual story of the real-life, real-time demo application we built for one of the targeted multi-core embedded systems mentioned above:<br />

• we started by porting the OpenCL Runtime Environment to a new multi-core embedded system, which at that point prototyped the combination of two existing systems (the S32V234 automotive vision processor [9] and the Layerscape2 communication processor [8]), pairing the automotive rigor of the S32V234 multi-core with the computational power of the Layerscape2 multi-core system<br />
• this system became an important success in the automotive area and is now called BlueBox (the NXP BlueBox autonomous driving platform [10])<br />



Fig. 3. Performance Evaluation of a Vision Application using up to 64 cores<br />

• we ported the OpenCL system to this platform in a multi-OS configuration (each multi-core system running its separate version of Linux), with the OpenCL host on one of the S32V234's ARM® cores and the 8 Layerscape2 ARM® cores as compute units<br />
• to demonstrate the performance and functionality of this system we needed a very resource-hungry application, to make sure the computational power brought by the Layerscape2 was seriously challenged<br />
• we found a very interesting application: an OpenCL implementation of the dense optical flow algorithm (based on the Lucas-Kanade method – [11], [12])<br />
• we took the out-of-the-box OpenCL application and adapted it to our input and graphical feed in a desktop environment, using a desktop-dedicated OpenCL implementation<br />
• the next step was to move the application to our OpenCL system targeted at the desktop environment (x86/64 Linux) – the transition went smoothly<br />
• to prove the vision application continuum, we moved forward and deployed the application directly to the BlueBox multi-core embedded system – as before, the deployment happened seamlessly<br />
• we obtained decent performance out of this port and were satisfied with it, mostly because the compute units involved are fairly general-purpose and not much specialization could be applied<br />
• we also assembled a real-time demonstration using the optical flow application with a live feed from a camera – the main purpose of the demo was to show the BlueBox's capabilities for Advanced Driver Assistance Systems usage of the vision application<br />

As a proof of concept, we evaluated the performance scalability by using different numbers of Layerscape2 ARM® cores as OpenCL compute units, as shown in Fig. 4.<br />

A similar real story unfolded when targeting the S32V234 multi-core embedded system with the APEX-2 vision accelerator. The main difference was the algorithm used: in this case an edge detection vision application, with the sole purpose of demonstrating the vision application continuum. We started the same way, working on an out-of-the-box OpenCL application with no specialization of the algorithm (running only on the scalar unit of the APEX-2 core).<br />
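For reference, the core of such an edge detection application is essentially a 3×3 Sobel convolution per pixel. A minimal plain-C sketch of that per-pixel operation (not the actual application code, which is OpenCL):<br />

```c
/* 3x3 Sobel operator at pixel (x, y); img is a w-wide row-major
 * image.  Valid only for interior pixels (1 <= x < w-1, same for y).
 * |gx| + |gy| (or the gradient magnitude) gives the edge strength. */
static void sobel_at(const int *img, int w, int x, int y,
                     int *gx, int *gy) {
    const int kx[3][3] = {{-1, 0, 1}, {-2, 0, 2}, {-1, 0, 1}};
    const int ky[3][3] = {{-1, -2, -1}, {0, 0, 0}, {1, 2, 1}};
    *gx = *gy = 0;
    for (int j = -1; j <= 1; j++)
        for (int i = -1; i <= 1; i++) {
            int p = img[(y + j) * w + (x + i)];
            *gx += kx[j + 1][i + 1] * p;
            *gy += ky[j + 1][i + 1] * p;
        }
}
```

Running this loop over every pixel on a single scalar unit is exactly the kind of workload that the vector specialization discussed next is meant to accelerate.<br />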

Fig. 4. OpenCL Optical Flow application scaling with different numbers of cores on BlueBox<br />



C. Vector Language Extension – Vision Application<br />

Specialization<br />

Although it was already mentioned, and the reader can check [5] for detailed information, for the sake of performance evaluation it is worth summarizing in this section the experiment we conducted on specializing the vision application using the custom OpenCL vector language extensions we proposed.<br />
The story of this experiment (together with the motivation for defining and implementing such an extension) goes like this:<br />

• we started with another out-of-the-box OpenCL vision application implementing a complex image filtering algorithm (derived from [12]) and deployed it directly to the S32V234 ARM® + APEX-2 multi-core embedded system<br />
• since the out-of-the-box application ran on the APEX-2 device using only its scalar unit, without even considering its internal vector capabilities (see [9] for details on the APEX vector capabilities), we came up with the idea of the vector language extension and implemented it in our OpenCL Runtime Environment<br />
• the next step was specializing the vision application toward vector execution of the kernel code<br />

The effect of the specialization can be clearly observed in Fig. 5 (extracted from our previous work [5]). The details of the configurations in which the performance evaluation was conducted can also be found in the original paper. The variation is in the number of vector units the APEX-2 core is configured to use, as well as in the memory areas considered for vector specialization.<br />
The experiments we conducted, and the performance evaluated in each case, support the claim that the usability story (based on the application continuum) can be achieved with decent results for the out-of-the-box application, proving the system's scalability, while specialization of the application brings a real performance boost.<br />

Fig. 5. Performance Evaluation of the vision application specialization using custom OpenCL vector language extensions<br />



VIII. CONCLUSION AND FUTURE WORK<br />

One of the goals of this paper was to provide a general solution, a viable alternative among the programming models for multi-core embedded computing systems. Once this solution was identified as the OpenCL parallel programming paradigm, we had to come up with a usability story that enables using the OpenCL system on embedded systems. The strongest argument for the OpenCL programming paradigm, together with an efficient OpenCL Runtime Environment, is the high degree of application portability it provides. The OpenCL usability scenario sustains the application continuum from high-level / high-performance computing toward embedded computing systems, covering the whole computing spectrum. The greatest advantage of the OpenCL programming paradigm and OpenCL applications is that algorithms can be developed in high-level development environments, with the main focus on algorithm functionality and no concern for the targeted system, followed by direct deployment on embedded systems.<br />

We brought many strong arguments for why OpenCL can be successfully used as a viable programming paradigm for multi-core embedded systems. We also defined a usability story that eases the deployment of applications on embedded systems (with minimum knowledge of the target architecture). The success of the usability story arguably lies in the performance of the out-of-the-box application deployed in the embedded environment. The goal of our scenario is decent performance in the out-of-the-box experience, providing a good starting point for the application in the embedded environment, followed by specialization of the OpenCL application. The specialization can be explored in two main directions: either by using standard OpenCL and adapting the application to better exploit the embedded system's capabilities, or by using custom OpenCL extensions very specific to the targeted system.<br />

The experiments and performance evaluations we conducted back up the usability story of the application continuum using the Efficient OpenCL Runtime Environment implementation: decent out-of-the-box performance, followed by specialization of the application toward the targeted embedded system for a further performance boost.<br />

As future work, we plan to make more OpenCL custom extensions available, especially those that are not target-specific, to provide more specialization opportunities for out-of-the-box OpenCL applications. A major focus that can be explored on the already targeted embedded systems is kernel cascading, as well as access to accelerator-specific instructions. Another interesting specialization area would be explicit workload partitioning and balancing.<br />

Another development direction would be exploring more vision applications with immediate impact in the ADAS area, and applying the complete usability scenario to more and more OpenCL applications, targeting new multi-core embedded systems.<br />

REFERENCES<br />

[1] Khronos Group. The open standard for parallel programming of<br />

heterogeneous systems, http://www.khronos.org/opencl/<br />

[2] J. Tompson and K. Schlachter, “An Introduction to the OpenCL<br />

Programming Model”, Khronos Group, 2008<br />

[3] B. Dipert, “OpenCL Eases Development of Computer Vision Software<br />

for Heterogeneous Processors”, Embedded Vision Alliance, 2015<br />

[4] B. Ditu, I. Romaniuc, C. Arbone, M. Oprea, D. Vasile, “Design and<br />

Portability of an Efficient OpenCL Runtime Environment for Multi-<br />

Core Embedded Systems”, Embedded World Conference, 2015<br />

[5] B. Ditu, F. Peterson, C. Arbone, "Experimentation of Vision Algorithm Performance using Custom OpenCL™ Vector Language Extensions for a Graphical Accelerator with Vector Architecture", 2017 IEEE 13th International Conference on Intelligent Computer Communication and Processing, 2017<br />

[6] NXP i.MX 6Quad Processors - Quad Core, High Performance,<br />

Advanced 3D Graphics, HD Video, Advanced Multimedia, ARM®<br />

Cortex®-A9 Core, https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/i.mx-applicationsprocessors/i.mx-6-processors/i.mx-6quad-processors-high-performance-<br />

3d-graphics-hd-video-arm-cortex-a9-core:i.MX6Q<br />

[7] NXP QorIQ® T4240 Multicore Communications Processors,<br />

https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/qoriq-platforms/t-series/qoriqt4240-t4160-t4080-multicore-communications-processors:T4240<br />

[8] NXP QorIQ® Layerscape 2084A and 2044A Multicore<br />

Communications Processors, https://www.nxp.com/products/processorsand-microcontrollers/arm-based-processors-and-mcus/qoriq-layerscapearm-processors/qoriq-layerscape-2084a-and-2044a-multicorecommunications-processors:LS2084A<br />

[9] NXP S32V234: Vision Processor for Front and Surround View Camera,<br />

Machine Learning and Sensor Fusion Applications,<br />

https://www.nxp.com/products/processors-and-microcontrollers/armbased-processors-and-mcus/s32-automotive-platform/vision-processorfor-front-and-surround-view-camera-machine-learning-and-sensorfusion-applications:S32V234<br />

[10] NXP BlueBox: Autonomous Driving Development Platform,<br />

https://www.nxp.com/products/processors-and-microcontrollers/armbased-processors-and-mcus/s32-automotive-platform/nxp-blueboxautonomous-driving-development-platform:BLBX<br />

[11] OpenCL Imaging on The GPU: Optical Flow,<br />

https://www.khronos.org/assets/uploads/developers/library/2011_GDC_<br />

OpenCL/NVIDIA-OpenCL-Optical-Flow_GDC-Mar11.pdf<br />

[12] Open Source Computer Vision, http://opencv.org/<br />



A modern approach to developing software in the<br />

growing embedded vision sector<br />

Christoph Wagner, Julian Beitzel<br />

Christoph Wagner<br />

Product Manager Embedded Vision<br />

MVTec Software GmbH<br />

Munich, Germany<br />

christoph.wagner@mvtec.com<br />

Julian Beitzel<br />

Application Engineer<br />

MVTec Software GmbH<br />

Munich, Germany<br />

julian.beitzel@mvtec.com<br />

I. INTRODUCTION<br />

The current trend of increasing performance in the embedded sector continues and is entering areas that were unthinkable five years ago. One driving force behind this is the rapid development of the "mobile sector", the resulting widespread success of smartphones, and the associated unit volumes. This area mainly comprises mobile phones, but also smart watches and tablets. Meanwhile, even mobile computing is seeing its first releases of Arm-based laptop computers that are no longer based on the well-tried x86 Intel architecture.<br />

The Arm processor architecture originated in 1983 at the British computer company Acorn and was first used in the predecessors of today's desktop PCs [1]. At the latest with the introduction of the 64-bit architecture (ARMv8 series) in 2013, and its use in iOS as well as Android devices, Arm started to strengthen its position in the embedded market. One major reason for this increasing market share is that the processors fully exploit their advantages over the x86 architecture in terms of power consumption and low power dissipation [2].<br />

Independently of the processor used, the computing power<br />

of modern mobile devices is up to sixty times higher than that<br />

of common desktop PCs of 2004. For an indication of how<br />

this power dynamic has shifted, see the comparison between<br />

the computing speed of the Pentium IV desktop processor and<br />

the current iPhone 8/X mobile phone in Figure 1.<br />

The fastest embedded platforms (e.g. iPhone 8 in Figure 1)<br />

have reached about a third of the computing power of a<br />

modern standard desktop CPU [3]. Especially in the last few<br />

years, the market has been experiencing a veritable boost in<br />

the integrated computing power of embedded devices.<br />

Figure 1: Computing performance for Single-Precision General Matrix Multiply (SGEMM) in GFLOPS. Image courtesy of MVTec Software GmbH, data by Geekbench.com [4]<br />
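SGEMM, the benchmark behind Figure 1, computes C = αAB + βC in single precision. A naive scalar reference in C, for illustration only — the benchmark itself runs tuned, vectorized implementations:<br />

```c
#include <stddef.h>

/* Naive single-precision GEMM: C = alpha * A * B + beta * C,
 * with A (m x k), B (k x n), C (m x n), all row-major.  The
 * benchmark measures how fast platforms execute this operation. */
static void sgemm(size_t m, size_t n, size_t k, float alpha,
                  const float *A, const float *B,
                  float beta, float *C) {
    for (size_t i = 0; i < m; i++)
        for (size_t j = 0; j < n; j++) {
            float acc = 0.0f;
            for (size_t p = 0; p < k; p++)
                acc += A[i * k + p] * B[p * n + j];
            C[i * n + j] = alpha * acc + beta * C[i * n + j];
        }
}
```

The 2·m·n·k floating-point operations of this triple loop, divided by the runtime, give the GFLOPS figures plotted in Figure 1.<br />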

Due to increased performance and the associated<br />

application possibilities, embedded vision has become a<br />

guiding theme of Industry 4.0. The aim of this initiative is to<br />

network devices with each other, and thus enable the<br />

development of even more flexible production systems [5].<br />

In the consumer market, for example, the networking of a<br />

refrigerator with an online shopping account makes it easier to<br />

order food. A similar example from the industrial sector would<br />

be embedded vision systems communicating and exchanging<br />

results with each other easily and flexibly via e.g. OPC UA.<br />

This means that, especially with the help of embedded<br />

vision systems, modern production plants can be made even<br />

more flexible and even small batch sizes can be produced fully<br />

automatically. Identification technologies such as code reading,<br />

matching, OCR, etc. play a decisive role in this respect. All<br />

this with a focus on being able to deliver as quickly as<br />



possible, without increasing inventories and tying up massive<br />

amounts of capital.<br />

Due to the many advantages of embedded vision over<br />

conventional PC-based machine vision systems, such as<br />

simplification, miniaturization, and the associated savings<br />

potential, embedded vision is increasingly being used to teach<br />

production processes how to “see”, making it the “eye of<br />

production”.<br />

II. MARKET OVERVIEW<br />

The embedded market is a growth market that still has<br />

massive potential left to tap within the next few years, not<br />

least due to technical innovations. This is especially<br />

true in the automotive segment with autonomous driving, in<br />

logistics with drones as a delivery service provider, or in the<br />

area of collaborative robotics where embedded technologies<br />

help to make cooperation between robots and humans possible<br />

and safe.<br />

Even today, there are already enough successful examples<br />

relying on embedded vision. Whether in the form of simple<br />

vision sensors, which are often equipped with "only" one<br />

function, but in return are highly optimized while keeping the<br />

smallest possible form factor, or in the form of a freely<br />

programmable smart camera that offers full flexibility, but is<br />

much more compact than a PC-based vision system. Another<br />

example is the powerful "single-board computer" (SBC),<br />

which offers flexibility comparable to PC systems, with cost<br />

reduction clearly in the foreground.<br />

They are already among us - almost every well-known<br />

sensor manufacturer has embedded systems in its portfolio.<br />

There are very good reasons for this, among them the market<br />

forecasts, the undeniable advantages and the predicted market<br />

potential (see Figure 2).<br />

III. SIGNIFICANCE OF PROFESSIONAL VISION SOFTWARE<br />

Due to the ever-increasing proportion of embedded vision<br />

systems and the associated process relevance, the demands on<br />

the reliability of these systems are also increasing. As failures<br />

or false decisions lead to production stops, defective work, and<br />

subsequently substantial costs, inline vision-based control<br />

gains significance.<br />

In order to fulfil this need, the software must be very<br />

robust, i.e. it must be able to recognize the characteristics<br />

reliably, even under difficult conditions such as<br />

contamination, vibrations or interfering light and<br />

simultaneously manage the limited resources of embedded<br />

devices as sparingly as possible. At the same time, in order to<br />

avoid burdening cycle times unnecessarily, the maximum<br />

evaluation speed should always be the goal.<br />

This requires efficient programming and optimization of<br />

the relevant algorithms, which exploit hardware and methods<br />

such as GPUs, instruction set extensions like Neon (Arm), or<br />

automatic parallelization across CPU cores.<br />
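As a toy illustration of that last point (a generic sketch, not tied to any vision library; `threshold_row` and `threshold_parallel` are names invented for this example), a per-row operation can be fanned out over a pool of workers:<br />

```python
# Toy sketch of parallelizing a per-row image operation across workers.
from concurrent.futures import ThreadPoolExecutor

def threshold_row(row, t=128):
    # Mark pixels brighter than t as foreground (255), others as 0.
    return [255 if px > t else 0 for px in row]

def threshold_parallel(image, t=128, workers=4):
    # map() preserves row order, so the result assembles correctly.
    # (Pure-Python work is GIL-bound; with NumPy/C kernels or process
    # pools the rows would genuinely run in parallel.)
    with ThreadPoolExecutor(max_workers=workers) as pool:
        return list(pool.map(lambda r: threshold_row(r, t), image))

print(threshold_parallel([[10, 200, 130], [255, 0, 129]]))
# [[0, 255, 255], [255, 0, 255]]
```

Vision libraries typically hide this distribution of work behind a single optimized operator call; the sketch only shows the structure.<br />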

IV. EFFICIENT DEVELOPMENT OF EMBEDDED VISION<br />

APPLICATIONS<br />

A. Software development IDEs vs. vision applications<br />

For the development of programs on embedded platforms,<br />

the use of a cross-compiler, as shown in Figure 3, is a<br />

common approach. This compensates for the usually limited<br />

resources of the target compared to a standard PC platform.<br />

Figure 3: Classic application development for embedded platforms<br />

(target) with a development platform (host).<br />

Figure 2: European embedded system market size, by application, 2012-2023<br />

(USD Billion) [6]<br />

The PC is used to develop the application within a suitable<br />

integrated development environment (IDE). Using a<br />

configured toolchain, an executable for the embedded<br />

platform (target) can be built on the PC (host). This program<br />

can now be transferred to the target in order to run.<br />

Image processing functionality is usually provided by<br />

third-party libraries for programming languages, like OpenCV for<br />

C++, C, Java, Python, or scikit-image for Python. The<br />

handling is no different from other libraries – they are<br />

included in the project and can be used in the source code. The<br />

advantage of this approach is that developers use a familiar<br />

workflow including the IDE and programming language for<br />

the creation of vision applications.<br />



Figure 4: Particle image in byte format and corresponding histogram<br />

Considering the kind of graphical data that should be<br />

processed in such an application, most IDEs are not well<br />

equipped for this purpose. Vision-based algorithms may become<br />

complex very quickly and frequently require prototyping. Usually,<br />

there is uncertainty about the setup, and the appearance of<br />

objects in the images affects the chosen approach.<br />

For example, the image shown in Figure 4 demonstrates a<br />

simple vision task. The particle image should be segmented into<br />

fore- and background. Despite the simplicity of the<br />

assignment, many different approaches are possible. Some<br />

exemplary approaches:<br />

1. Fixed, manually set intensity threshold separating the<br />

fore- and background. Advantages: Very fast, easy.<br />

Disadvantages: Does not deal well with<br />

inhomogeneous background, unable to handle global<br />

exposure changes.<br />

2. Dynamic global threshold based on automatic<br />

histogram analysis, e.g. check for a global minimum<br />

in the histogram function. Advantages: Fast, handles<br />

global exposure changes. Disadvantages: Unable to<br />

handle inhomogeneous background.<br />

3. Variational threshold [7] using a sliding-window approach<br />

to determine local intensity differences.<br />

Advantages: Deals with inhomogeneous background,<br />

handles global exposure changes. Disadvantages:<br />

Slow compared to the fixed thresholds.<br />
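The three approaches can be sketched in plain Python (illustrative stand-ins, not the paper's implementations: Otsu's method is used here as one concrete form of automatic histogram analysis, and the local variant follows Niblack's mean-plus-k-times-deviation idea [7]):<br />

```python
# Illustrative stand-ins for the three approaches, operating on a
# list-of-lists gray-value image (0..255); the names are ours.

def fixed_threshold(img, t):
    """1. Fixed, manually set threshold: very fast and easy, but blind to
    inhomogeneous background and global exposure changes."""
    return [[1 if px > t else 0 for px in row] for row in img]

def otsu_threshold(img):
    """2. Dynamic global threshold from histogram analysis; Otsu's method
    is one concrete choice (the text mentions searching for a global
    histogram minimum instead)."""
    hist, n = [0] * 256, 0
    for row in img:
        for px in row:
            hist[px] += 1
            n += 1
    total = sum(i * c for i, c in enumerate(hist))
    best_t, best_var, w0, sum0 = 0, -1.0, 0, 0.0
    for t in range(256):
        w0 += hist[t]
        sum0 += t * hist[t]
        if w0 == 0 or w0 == n:
            continue
        m0, m1 = sum0 / w0, (total - sum0) / (n - w0)
        between_var = w0 * (n - w0) * (m0 - m1) ** 2
        if between_var > best_var:
            best_var, best_t = between_var, t
    return best_t

def niblack_threshold(img, half=1, k=-0.2):
    """3. Local (Niblack-style [7]) threshold t = mean + k*stddev over a
    (2*half+1)^2 window: copes with inhomogeneous background and
    exposure changes, but is slower than the global variants."""
    h, w = len(img), len(img[0])
    out = []
    for y in range(h):
        row = []
        for x in range(w):
            vals = [img[yy][xx]
                    for yy in range(max(0, y - half), min(h, y + half + 1))
                    for xx in range(max(0, x - half), min(w, x + half + 1))]
            mean = sum(vals) / len(vals)
            std = (sum((v - mean) ** 2 for v in vals) / len(vals)) ** 0.5
            row.append(1 if img[y][x] > mean + k * std else 0)
        out.append(row)
    return out

print(fixed_threshold([[10, 200]], 128))  # [[0, 1]]
```

Production implementations would of course use optimized operators on array data rather than pure Python loops; the sketch only contrasts the three strategies.<br />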

Furthermore, the requirements of the image processing<br />

task may shift; if new parts should be inspected, the lighting<br />

conditions or other factors of the setup could change.<br />

Therefore, the software should be designed in a way that the<br />

acquired images and results can be easily inspected. Additionally, if<br />

the image processing part should be developed for multiple<br />

programming languages, a re-implementation is usually<br />

required.<br />

Lastly, debugging related to image processing on a<br />

platform for which the code is cross-compiled may be difficult<br />

using console-based debuggers like GDB.<br />

B. Speeding up prototyping with IDEs for image processing<br />

To avoid long development and maintenance cycles,<br />

IDEs with support for image processing can be used.<br />

Following the considerations of the previous section, these<br />

basic features should be supported:<br />

- Immediate and easy-to-use display of images and<br />

variables containing graphical content (lines, regions,<br />

geometrical primitives).<br />

- Interaction with the displayed content, such as<br />

zooming or moving.<br />

- Display of numerical arrays as plots.<br />

- Tools/assistants covering typical tasks, such as<br />

showing the histogram, performing interactive<br />

measurements in the image, or displaying gray values<br />

at a certain spot, would also be desirable.<br />

An IDE containing these features is shown in Figure 5 as<br />

an example.<br />

In order to speed up the development even more, it would<br />

be desirable to use a simple, script-based language which<br />

allows the user to reduce the lines of code needed for<br />

performing simple actions. For this reason, Matlab and Python<br />

are well known in the scientific community as being<br />

suitable for fast evaluations [8] [9] [10] [11]. To get the best<br />

performance out of the system, however, compiled languages are<br />

preferred or even required on an embedded platform.<br />

IDEs like MVTec Software GmbH’s HDevelop, which can<br />

be seen in Figure 5, enable the export of the developed<br />

script code to various programming languages. This enables a<br />

smooth transition from the prototyping stage to the production<br />

environment.<br />

On the other hand, changes in the code have to be made on<br />

the host system, exported, included in the framework,<br />

cross-compiled, and transferred again to the target platform in order<br />

to check the fix. This workflow leads to a longer development<br />

time. This issue is addressed in the following section.<br />

Figure 5: Exemplary vision IDE with the output image for debugging<br />

(left top), additional tools (center and right top), variable view (left<br />

bottom), programming window (right bottom). The screenshot shows<br />

HDevelop of MVTec Software GmbH.<br />



C. Easier development and maintenance with interpreted<br />

image processing functionality including remote<br />

debugging<br />

To address the issue of maintainability and debugging, a<br />

different approach is needed. Using classic approaches as<br />

described above, it is cumbersome to visualize graphical<br />

variables at runtime.<br />

As stated in the previous part, there are various<br />

requirements for the IDE in order to efficiently develop image<br />

processing algorithms. It would be ideal if this IDE could also<br />

be used for debugging on the platform.<br />

A concept for doing this is shown in Figure 6. The target<br />

platform starts the vision script using an interpreter within the<br />

application. In the code, a debugging server has to be<br />

configured and explicitly activated – it is worth mentioning<br />

that this is a potential security threat. A host may utilize the<br />

vision IDE to log on to the running application of the target<br />

platform, set breakpoints similar to a local application, and<br />

inspect intermediate results directly.<br />

This concept allows maintenance even for a remote target<br />

system which is available via Internet connection, e.g. at a<br />

customer site.<br />

Figure 6: Remote debugging using the vision IDE of the host to debug an<br />

interpreted vision script inside the application of the target platform.<br />

Since the vision code is interpreted, the update process is<br />

rather straightforward: The script can be fixed on the host<br />

platform before copying it back to the target platform. In this<br />

concept, there is no need for cross-compiling the complete<br />

application, a restart is sufficient as long as the signature of<br />

the vision functions does not change, see Figure 7.<br />
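The update concept can be illustrated with a generic sketch (hypothetical names; real vision-script interpreters such as IDE runtimes work analogously but differ in detail): the host application only ever calls a `process(image)` entry point, so replacing the script file is enough as long as that signature is stable:<br />

```python
# Generic sketch: the compiled host application stays untouched; only
# the interpreted vision script file is replaced to deploy a fix.
import os
import tempfile

def run_vision_script(path, image):
    # Load and interpret the current script, then call its entry point.
    # The process(image) signature is the stable contract.
    scope = {}
    with open(path) as f:
        exec(f.read(), scope)
    return scope["process"](image)

# Deploy an initial (say, buggy) script ...
script = tempfile.NamedTemporaryFile("w", suffix=".py", delete=False)
script.write("def process(image):\n    return [px * 2 for px in image]\n")
script.close()
print(run_vision_script(script.name, [1, 2]))  # [2, 4]

# ... then "fix" it by overwriting only the script file - no
# cross-compilation, no rebuild of the host application.
with open(script.name, "w") as f:
    f.write("def process(image):\n    return [px + 1 for px in image]\n")
print(run_vision_script(script.name, [1, 2]))  # [2, 3]
os.remove(script.name)
```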

V. THE ECOSYSTEM OF MODERN EMBEDDED VISION<br />

SOFTWARE<br />

Due to the current development of the market, the<br />

requirements and expectations for professional embedded<br />

vision software are growing. Subsequently, competitive<br />

pressure in this segment is increasing. As a result, it is<br />

becoming increasingly important for hardware vendors to be<br />

able to implement the requirements faster and more<br />

efficiently.<br />

It is no longer sufficient to use a high-performance and<br />

robust software solution, because this is only one part of the<br />

development process. It is becoming increasingly important to<br />

be able to access an extensive development package that<br />

includes additional services around the software. Another<br />

important element is a convenient development environment<br />

especially developed for use in the vision sector. This can save<br />

a lot of time when creating the "Workbench" and therefore it<br />

is crucial for efficient and fast development of vision<br />

applications.<br />

Another essential aspect is the legal situation. When a<br />

(embedded) vision application is created and used on a<br />

commercial basis, it is essential for the creator to be protected<br />

legally. This means that it must be ensured that there is no<br />

patent infringement in the use of the vision algorithm, as this<br />

can lead to serious damage claims.<br />

These are typical areas that are difficult to cover with open<br />

source software and thus speak for the use of professional<br />

commercial image processing software.<br />

VI. FUTURE OUTLOOK<br />

Extrapolating the aforementioned developments, a shift<br />

between embedded- and PC-based vision becomes apparent<br />

quickly. Embedded vision is unlikely to replace<br />

desktop-based applications, but it will continue to gain significance,<br />

market share, and usage in industrial scenarios. If the last<br />

decades’ developments in the consumer electronics market are<br />

any indication, the end of the line for the embedded computing<br />

market is not yet in sight.<br />

VII. REFERENCES<br />

[1] S. B. Furber, "ARM System Architecture", Addison-Wesley, 1996, p. 36.<br />

[2] M. Hachman, "ARM Cores Climb Into 3G Territory," 14 10 2002.<br />

[Online]. Available: http://www.extremetech.com/extreme/52180-arm-cores-climb-into-3g-territory.<br />

[Accessed 18 01 2018].<br />

[3] Geekbench, "Geekbench 4.1.3 Tryout for Mac OS X x86 (64-bit),"<br />

Geekbench, 2018. [Online]. Available:<br />

https://browser.geekbench.com/v4/cpu/4687004. [Accessed 13 1 2018].<br />

[4] Geekbench, "Geekbench 4 CPU Search," Geekbench, 2018. [Online].<br />

Available: https://browser.geekbench.com/v4/cpu/search. [Accessed 18 1<br />

2018].<br />

[5] L. Goasduff, "What Is Industrie 4.0 and What Should CIOs Do About<br />

It?," Gartner, 18 5 2015. [Online]. Available:<br />

https://www.gartner.com/newsroom/id/3054921. [Accessed 12 1 2017].<br />

Figure 7: Fixing an erroneous script on the target platform without<br />

recompiling.<br />



[6] Global Market Insights, "Embedded System Market Size by Application<br />

[...]," Global Market Insights, 2016. [Online]. Available:<br />

https://www.gminsights.com/industry-analysis/embedded-system-market.<br />

[Accessed 09 1 2018].<br />

[7] W. Niblack, in "An Introduction to Digital Image Processing",<br />

Englewood Cliffs, N.J., Prentice Hall, 1986, pp. 115-116.<br />

[8] K. J. Millman and M. Aivazis, "Python for Scientists and Engineers,"<br />

Computing in Science & Engineering, vol. 13, no. 2, pp. 9-12, 2011.<br />

[9] Z. Boyan, "Application of MatLab in Science and Engineering<br />

Calculation," 01 01 2001. [Online]. Available:<br />

http://en.cnki.com.cn/Article_en/CJFDTOTAL-DLXZ200101019.htm.<br />

[Accessed 12 02 2018].<br />

[10] F. Perez, B. E. Granger and J. D. Hunter, "Python: An Ecosystem for<br />

Scientific Computing," Computing in Science & Engineering, vol. 13,<br />

no. 2, pp. 13-21, 2011.<br />

[11] T. E. Oliphant, "Python for Scientific Computing," Computing in<br />

Science & Engineering, pp. 10-20, 2007.<br />



Strategies for facilitating reuse of code in embedded<br />

vision applications<br />

Frank Karstens<br />

Marketing Module Business<br />

Basler AG<br />

Ahrensburg, Germany<br />

frank.karstens@baslerweb.com<br />

Abstract – More and more companies active in the field of<br />

computerized machine vision (which is still mostly based on a<br />

classic PC setup) are recognizing the benefits of an embedded<br />

approach (lower power consumption, less space and, most<br />

notably, significant cost savings). However, transferring existing<br />

software code from a classic PC setup to an embedded target can<br />

pose a number of challenges: different operating systems (e.g.<br />

standard Windows vs. hardware-specific Linux), different<br />

processor architectures (e.g. x86 vs. ARM, MIPS, etc.), different<br />

camera interfaces (e.g. GigE vs. MIPI CSI-2), and so on.<br />

Well-defined standards, which are able to set frameworks for<br />

these differences, can help tremendously in the reuse of existing<br />

code. With GenICam, the industrial machine vision industry<br />

established such standards years ago, and it still maintains them<br />

to keep them up-to-date with new technologies. GenICam<br />

standardizes camera configuration and image data transmission<br />

and provides standardized APIs for software developers.<br />

GenICam reference implementations exist for various operating<br />

systems and processor architectures.<br />

In addition, there are camera-vendor-specific SDKs available<br />

which are based on GenICam technology and which add even<br />

more user convenience to the camera APIs. The broader the<br />

choice of the SDKs’ supported operating systems, processor<br />

architectures and camera interface technologies, the more<br />

flexibility is offered to the user to move from one technology to<br />

another and the easier it is to port existing code to the new target.<br />

The MIPI CSI-2 interface, which is of particular interest for<br />

embedded vision applications, has not been covered by the<br />

GenICam standard so far. Different camera vendors are already<br />

working on a GenICam-like abstraction layer on top of CSI-2,<br />

which in turn will make migration to this interface similar to<br />

other GenICam camera interfaces. It is likely that CSI-2 will play<br />

an important role in the world of GenICam in the near future<br />

too.<br />

I. INTRODUCTION<br />

According to the Aspencore Embedded Markets Study<br />

2017 [1], the vast majority of software developers (81%) reuse code<br />

created in-house. This is not surprising, as this code represents<br />

the core intellectual property of an innovative enterprise.<br />

Reusing existing code that has been proven to perform the task<br />

it was written for and is already well tested reduces<br />

technological risks, speeds up development, improves software<br />

quality, and overall reduces development costs.<br />

Many well-established techniques are available to help<br />

reuse code in new applications. Providers of machine vision<br />

software solutions, however, have to deal with the fact that as<br />

years and technological progress advance, the in-house<br />

developed code is not only evolving and progressing itself, but<br />

also needs to be adapted to hardware changes on the camera<br />

side. Over the last 15 years, the machine vision industry has<br />

gone through two industry-changing trends:<br />

- The move from analog to digital, where analog frame<br />

grabbers were replaced by digital hardware, ranging from<br />

costly and specialized high-end (Camera Link, CoaXPress, etc.)<br />

down to common consumer-level hardware such as FireWire,<br />

Gigabit Ethernet or USB2/USB3.<br />

- The trend from CCD to CMOS, which offered<br />

higher-resolution sensors with higher speed and better sensitivity.<br />

These trends enabled better camera product offerings for a<br />

more affordable price, and this in turn forced machine-vision<br />

solution providers to integrate these new camera products into<br />

their setup to stay competitive. The adaptation of the<br />

camera-side hardware changes always requires changes in the software<br />

stack. New proprietary drivers with proprietary APIs need to be<br />

integrated; new camera features or properties need to be<br />

matched by code modifications and so on. The industry has<br />

found an answer to these challenges: the GenICam standard<br />

(see below).<br />

However, the most recent trend–which is about to<br />

fundamentally change the machine vision industry–is<br />

“embedded vision”. The technological progress on both the<br />

sensor and processing sides allows in many cases the design of<br />

computer vision setups (previously requiring a high-end<br />

camera, cable and PC) now with embedded technologies like<br />

camera or sensor modules, embedded processing boards etc.<br />

An embedded approach offers lower power and space<br />

requirements and most notably, significantly lower costs<br />

compared to a classic PC-based setup. An important driver of<br />



this trend has been the mobile industry, with consumer<br />

products requiring vision for smartphones, tablets and so on.<br />

But it is not only the “classic” machine vision industry<br />

adopting embedded technologies. We also see vendors of<br />

typical mobile processors (like Qualcomm, Rockchip, Samsung<br />

etc.) discovering the industry as an interesting market segment.<br />

They have started to offer typical mobile processors (which are<br />

equipped with multiple CSI-2 interfaces) with long-term<br />

availability for industrial applications.<br />

II. THE GENICAM STANDARD<br />

To reduce, at least partly, the burden of repeatedly rewriting code,<br />

to speed up development cycles, and to control development<br />

costs, about 15 years ago, the machine vision industry<br />

(consisting of software, camera, cable and frame-grabber<br />

vendors) organized into the GenICam (Generic Interface for<br />

Cameras) standards group [2] and started to develop standards<br />

to offer unified basic functions for digital cameras:<br />

- Camera configuration - this function supports a range of<br />

camera features such as frame size, acquisition speed,<br />

pixel format, gain, image offset, etc.<br />

- Grabbing images - this function creates access channels<br />

between the camera and the user interface, and initiates<br />

the receiving of images.<br />

- Transmitting meta-information - this function enables<br />

cameras to send extra data on top of the image data.<br />

Typical examples could be histogram information, time<br />

stamps, area of interest in the frame, etc.<br />

- Delivering events - this function enables cameras to talk to<br />

the application through an event channel.<br />

These functions rely on three different modules provided by<br />

GenICam:<br />

GenAPI [3] specifies a methodology for generating a<br />

standardized camera API. This is achieved by an XML file<br />

provided by the camera device. The XML file expresses all<br />

characteristics, properties and features of the camera device in<br />

a standardized way (e.g. if a device provides a feature like<br />

“Blacklevel”, the XML describes this feature in all its properties:<br />

Name=Blacklevel, Type=IInteger, AccessMode=RW,<br />

Value=128, Min=10, Max=255, and so on). The XML file can be<br />

used to generate a static API in various programming<br />

languages or it can even be used to dynamically generate a<br />

generic API at runtime. This approach guarantees that an API<br />

which was generated out of the camera’s XML reflects all the<br />

most recent camera features and properties.<br />
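A much-simplified sketch of the idea (the XML element names below are illustrative, not the actual GenAPI schema, which is considerably richer): the device's XML self-description is parsed at runtime into feature objects the application can work with:<br />

```python
# Simplified sketch of generating feature objects from a device's XML
# self-description; this is NOT the real GenAPI schema.
import xml.etree.ElementTree as ET

CAMERA_XML = """
<Device>
  <Integer Name="BlackLevel">
    <AccessMode>RW</AccessMode>
    <Value>128</Value>
    <Min>10</Min>
    <Max>255</Max>
  </Integer>
</Device>
"""

def load_features(xml_text):
    # Build a feature dictionary from the camera's description, so the
    # API always reflects what the connected device actually provides.
    features = {}
    for node in ET.fromstring(xml_text):
        features[node.get("Name")] = {
            "type": node.tag,
            "access": node.findtext("AccessMode"),
            "value": int(node.findtext("Value")),
            "min": int(node.findtext("Min")),
            "max": int(node.findtext("Max")),
        }
    return features

feats = load_features(CAMERA_XML)
print(feats["BlackLevel"]["min"], feats["BlackLevel"]["max"])  # 10 255
```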

Standard Feature Naming Convention (SFNC) [4]<br />

specifies and standardizes the name and behavior of a given<br />

feature. If a camera vendor uses a standard feature name for a<br />

feature, then it must behave precisely as defined in the SFNC.<br />

GenTL [5] specifies a standardized transport layer<br />

interface for enumerating cameras, grabbing images from the<br />

camera, and moving them to the user application.<br />

Associated with GenICam are interface standards such as<br />

GigE Vision [6] or USB3 Vision [7], which specify protocols<br />

for reliable data transfer on established physical layers (e.g.<br />

GigE Vision on Gigabit Ethernet or USB3 Vision on USB).<br />

These standards are not actually part of GenICam; still, it is<br />

mandatory for such interface-standard-compliant devices to be<br />

GenICam-compliant too.<br />

III. SPECIFIC CHALLENGES FOR EMBEDDED VISION<br />

In a classic machine vision setup - consisting of camera,<br />

cable and (in most cases) a Windows PC - GenICam helped a<br />

lot to provide a stable interface which defines camera or<br />

interface specifics and provides plug-and-play functionality.<br />

Code, which was written for a specific GigE Vision camera<br />

supplied by vendor A, could be reused with only minor<br />

modifications for a USB3 Vision camera from vendor B.<br />

However, things are different in the world of embedded<br />

vision. The environment is filled with variables. The vision<br />

sensor is not necessarily a camera; it can be a camera module<br />

or even a naked CMOS sensor. In addition, the available<br />

processing platforms represent a wide variety: many different<br />

classical CPU architectures (x86, ARM, MIPS, PowerPC etc.)<br />

compete with FPGA, GPU or DSP-based approaches etc.<br />

Moving from one sensor or camera to another, or changing the<br />

processing system architecture quite likely requires the<br />

software developer to rewrite significant parts of the vision<br />

software.<br />

However, the degree of difficulty that may arise when<br />

migrating code from a non-embedded setup to an embedded<br />

one depends on the camera interface chosen.<br />

Embedded with GigE/USB is not actually different from a<br />

non-embedded approach. If the camera interfacing code for the<br />

non-embedded system was written for a GenICam-compliant<br />

API, and as long as the targeted processing platform provides<br />

these interfaces, the existing code can be reused without any<br />

modification when ported to an embedded target. Most camera<br />

vendors offer their GenICam-based camera API for Windows<br />

and Linux (or even macOS) operating systems for x86 or<br />

ARM-based processing architectures. Recompiling the code for<br />

a new target is usually all that is required. The plug-and-play<br />

behavior of USB3 Vision or GigE Vision works the same way<br />

as on a desktop PC.<br />

Embedded with proprietary camera interfaces is – as<br />

the term “proprietary” suggests – not standardized. There are<br />

plenty of proprietary embedded camera interfaces available,<br />

parallel or serial ones (e.g. LVDS). Quite often drivers for a<br />

specific embedded processing platform do not exist and need to<br />

be developed first. Some vendors of embedded processing<br />

systems (e.g. SOM vendors) offer camera modules as an<br />

accessory. In this case, they typically make sure that their BSP<br />

and SDK provides support for their camera modules too.<br />

Camera-interfacing code written for a non-embedded target<br />

must be rewritten extensively for such embedded devices. This<br />

can be an expensive task, which might involve creating kernel<br />

mode drivers, implementing memory management, task<br />

scheduling etc. If a new camera module needs to be designed<br />

in, it is likely that parts of this work need to be done again, in<br />



particular when the new camera module comes from a new<br />

vendor or uses another interface technology.<br />

Embedded with MIPI CSI-2: In 2003, vendors of mobile<br />

devices or components formed the MIPI (Mobile Industry<br />

Processor Interface) Alliance [8] as it became more and more<br />

obvious that standards for connecting peripherals (e.g. all kinds<br />

of sensors or displays) are required to speed up development<br />

and product release cycles. The CSI-2 specification [9]<br />

(Camera Serial Interface, 2nd generation) is today’s number<br />

one standard for connecting vision sensors or camera modules<br />

to mobile processors or SoCs respectively. This standard,<br />

however, can be regarded more as a hardware standard<br />

specification. The physical layer is described in the D-PHY or<br />

C-PHY specification; in addition, the CSI-2 standard specifies<br />

the packet-oriented protocol for transmitting image data “on<br />

the wire”, but it does not standardize driver architectures,<br />

software stacks or a camera API. In addition, camera<br />

configuration and camera features or feature register layouts<br />

are not standardized at all. This implies two important facts to<br />

be considered:<br />

- Each individual mobile SoC comes with its proprietary<br />

camera framework (if there is any at all). This is<br />

particularly true for Linux (on Android we see camera<br />

APIs which provide an abstraction from lower software<br />

and hardware layers, like Google’s Camera API 3 etc.)<br />

Moving an embedded application from one mobile SoC to<br />

another typically requires rewriting the related software<br />

layers and might even require modifying the existing<br />

software architecture.<br />

- With CSI-2, camera configuration is done based on the CCI<br />

(Camera Control Interface) specification, which is essentially an<br />

I²C subset. Each CSI-2-compliant sensor or camera<br />

module however has a different register layout and<br />

different sensor/camera features. Switching from one<br />

sensor/camera to another will require the software<br />

developer to adapt the code to the different sensor-specific<br />

features.<br />
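The problem and a possible mitigation can be sketched as follows (the register addresses and sensor names are invented for illustration, NOT taken from any datasheet): a thin mapping layer confines the sensor-specific register layout to one table, so application code refers only to feature names:<br />

```python
# Invented register maps illustrating why raw CCI access ties code to
# one sensor, and how a mapping layer restores a feature-style interface.
SENSOR_REGMAPS = {
    "sensor_a": {"exposure": 0x3500, "gain": 0x3509},
    "sensor_b": {"exposure": 0x0202, "gain": 0x0205},
}

class CciCamera:
    def __init__(self, sensor):
        self.regmap = SENSOR_REGMAPS[sensor]
        self.registers = {}  # stands in for real I2C register writes

    def set_feature(self, name, value):
        # Application code uses the feature name only; the mapping
        # layer resolves the sensor-specific register address.
        self.registers[self.regmap[name]] = value

cam = CciCamera("sensor_a")
cam.set_feature("exposure", 1000)
# Switching sensors changes only the table, not the application code:
cam_b = CciCamera("sensor_b")
cam_b.set_feature("exposure", 1000)
```

This is essentially the role a GenICam-like abstraction (or the MIPI CCS register map, for basic features) would play at a larger scale.<br />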

For vendors of consumer-grade mobile applications, these<br />

standard restrictions do not actually pose a big problem. The<br />

development efforts needed for interfacing an individual sensor<br />

with an individual mobile SoC normally pay off as part of<br />

regular practice: the primary goal of a consumer product sold<br />

in very high numbers is to keep production costs under control,<br />

which often already includes an individual, highly<br />

cost-optimized full-custom software design.<br />

For providers of embedded vision applications, which<br />

create more specialized, industrial solutions (which are<br />

typically not sold in such high numbers as smart phone apps),<br />

things are different: development costs must be kept under<br />

control and additional efforts for sensor or camera integration<br />

have a direct impact on the development budget and delay the<br />

time-to-market. Any approach that would offer any kind of<br />

generic API for any combination of CSI-2 camera and SoC<br />

would be highly welcome.<br />

This requirement is now also reflected by efforts of the<br />

MIPI alliance, which released in October 2017 the new MIPI<br />

CCS (Camera Control Set) specification [10]. The primary goal<br />

of CCS is to “Enable rapid integration of basic image sensor<br />

functionalities without device-specific drivers”. The CCS<br />

standard specifies a set of basic functions a CCS-compliant<br />

device must provide (along with a related register map). In<br />

theory, the integration efforts for a new camera device should<br />

be reduced to a minimum at least as long as only basic camera<br />

features are needed (“…such as resolution, frame rate and<br />

exposure time, as well as advanced features such as phase<br />

detection auto focus (PDAF), single frame HDR, or fast<br />

bracketing…”) [10]. The future will show whether the CCS<br />

specification is able to penetrate the market. Until now, many<br />

CSI-2 compliant devices have not even complied with the CCI<br />

specification. In addition, it is questionable whether the basic<br />

functions are sufficient for typical machine vision tasks, which<br />

usually require differentiated and real-time-like control over<br />

more complex features such as single frame capture, fast<br />

changing ROI, exotic pixel formats etc.<br />

IV. A GENICAM-LIKE INTERFACE ABSTRACTION FOR MIPI CSI-2?<br />

The key for reusing code is a stable API, which is able to<br />

hide the specifics of lower software and hardware layers. For<br />

industrial machine vision, GenICam already showed how such<br />

an approach can look. Given that GenICam is already the<br />

dominant camera-interfacing standard in the non-embedded<br />

world of computer vision, it is obvious that a GenICam-like<br />

interface for MIPI CSI-2 would be a perfect abstraction layer<br />

for migrating and reusing code that was already written against<br />

a GenICam-conforming API. Leading machine vision camera<br />

vendors like Allied Vision, Basler, FLIR Systems, and IDS have joined the MIPI Alliance; it can be expected that they (along with the GenICam organization) will start to work on a GenICam-like abstraction for MIPI CSI-2. It is, however,

unpredictable when the first results will be available. Until<br />

then, there remain at least two more proprietary approaches that individual camera vendors may offer to ease the reuse of existing code:

Putting CSI-2 under the hood of the existing camera<br />

SDK. Even though the GenICam organization specifies a<br />

reference implementation, almost all software developers do<br />

not use this reference implementation directly; rather, they use<br />

the proprietary camera SDK of an individual camera vendor.<br />

Such camera APIs (e.g. Basler's pylon, Allied Vision's Vimba, or FLIR's Spinnaker SDK) are essentially vendor-specific implementations of GenICam. They typically add vendor-specific functions and a higher level of convenience to GenICam, along with programming samples, documentation, drivers, etc.; together they form a complete, easy-to-use SDK. For a CSI-2 target, a camera vendor would still have to provide an individual driver stack for a given camera/SoC combination. However, the vendor can offer the common vendor-specific camera API (e.g. the Basler pylon API) on top of the CSI-2 driver stack.

From the developer's perspective, the MIPI camera in this case behaves like other cameras from the same vendor, so existing camera-interfacing code can easily be reused. From the camera vendor's perspective, this means a huge development effort, as the existing API must be adapted to each individual camera/SoC combination. Basler, for example, made

www.embedded-world.eu<br />

265


a first step in this direction and now offers a MIPI camera module for Qualcomm Snapdragon processors together with the pylon API.

Putting CSI-2 under the hood of GenTL. A less<br />

proprietary approach would be to offer GenTL as a camera<br />

interface. GenTL (Generic Transport Layer) – a GenICam<br />

substandard – consists of two parts:<br />

The GenTL Producer, which is usually provided by the<br />

hardware vendor of a GenTL-compliant device and which<br />

exposes a standardized API for all required camera<br />

functions, including enumeration, configuration and image<br />

acquisition.<br />

The GenTL Consumer, which is the interface a piece of software needs to implement in order to interface with (utilize) any GenTL Producer.

This means that any software providing a GenTL<br />

Consumer would be able to interface with any device that<br />

exposes a GenTL Producer, regardless of the actual hardware,<br />

interface technology, driver, etc. Again, a camera vendor might offer a GenTL Producer on top of the CSI-2 driver stack. This, too, would be a huge development effort, as the GenTL Producer must be adapted to each individual camera/SoC combination.
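The Producer/Consumer split can be modeled in a few lines of illustrative code. The class and method names are invented for this sketch and are not the real GenTL C API; the point is that Consumer-side code is written once against the Producer interface, so any transport that ships a Producer, a CSI-2 stack included, works without change.

```python
# Conceptual model of the GenTL Producer/Consumer split. Names are
# illustrative only; the real GenTL standard defines a C interface.

class Csi2Producer:
    """Hypothetical Producer a vendor might ship on top of a CSI-2 stack."""
    def enumerate_devices(self):
        return ["csi2-cam0"]
    def acquire_frame(self, device):
        return bytes(16)  # placeholder image payload

class GigEProducer:
    """Producer for a different transport exposing the same interface."""
    def enumerate_devices(self):
        return ["gige-cam0", "gige-cam1"]
    def acquire_frame(self, device):
        return bytes(16)

def consumer_app(producer):
    """Consumer-side code: transport-agnostic by construction."""
    frames = [producer.acquire_frame(d) for d in producer.enumerate_devices()]
    return len(frames)

print(consumer_app(Csi2Producer()))  # -> 1
print(consumer_app(GigEProducer()))  # -> 2
```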

GenTL offers a higher level of abstraction, as it does not<br />

bind the software developer to a vendor-specific SDK. The<br />

disadvantage of GenTL is its relatively high complexity, which<br />

comes with significant internal overhead, and so it might not be<br />

the best solution for an embedded platform with limited CPU<br />

power and memory.<br />

V. CONCLUSION<br />

MIPI CSI-2 will become the most important camera interface<br />

for embedded machine vision applications. The lack of a<br />

standardized API (like GenICam) makes it difficult to reuse<br />

code that was written for other GenICam-compliant camera<br />

hardware.<br />

Camera vendors are now starting to put MIPI CSI-2 driver and<br />

software stacks under the hood of their existing camera SDK,<br />

or are offering a GenTL Producer which abstracts CSI-2<br />

specifics. Unless the MIPI Alliance itself integrates GenICam into the CSI-2 specification, the vendor-specific approaches will remain proprietary solutions for specific camera/SoC combinations.

For the software developer this means: watch for those camera SDKs which offer the broadest support for both non-embedded and embedded processing platforms, operating systems and interface technologies, including MIPI CSI-2. Having one unified camera API enables the reuse of significant amounts of existing code, and offers the user more flexibility to move from one technology to another and to port existing code to a new target.

REFERENCES<br />

[1] https://m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />

[2] http://www.emva.org/standards-technology/genicam/<br />

[3] http://www.emva.org/wpcontent/uploads/GenICam_Standard_v2_1_1.pdf<br />

[4] http://www.emva.org/wp-content/uploads/GenICam_SFNC_2_3.pdf<br />

[5] http://www.emva.org/wp-content/uploads/GenICam_GenTL_1_5.pdf<br />

[6] https://www.visiononline.org/vision-standards-details.cfm?type=5<br />

[7] https://www.visiononline.org/vision-standards-details.cfm?type=11<br />

[8] https://mipi.org/about-us<br />

[9] https://mipi.org/specifications/csi-2<br />

[10] https://mipi.org/specifications/camera-command-set<br />



Accelerating the Development of Intelligent, Vision-<br />

Enabled Devices at the Edge<br />

Rapid Prototyping with an Embedded Vision Development Kit<br />

Dirk Seidel<br />

Lattice Semiconductor: Senior Marketing Manager, Industrial<br />

San Jose, CA, U.S.<br />

dirk.seidel@latticesemi.com<br />

Abstract— The future looks promising for embedded vision<br />

systems. Exciting new applications are coming to market. One<br />

key to their success will be designers’ ability to continually<br />

improve performance and utility. Today, mobile platforms have<br />

expanded and gone beyond smartphones and tablets. Often they<br />

are used in industrial display systems for M2M applications and<br />

Industry 4.0 implementations, Advanced Driver Assistance<br />

Systems (ADAS) and Infotainment applications for automotive<br />

markets, DSLR cameras, drones, virtual reality (VR) systems<br />

and medical equipment. What today’s embedded vision system<br />

designers are looking for is flexible connectivity to address<br />

evolving interface requirements, energy-efficient image signal<br />

processing, and hardware acceleration. This paper will review the<br />

tools available to embedded vision designers for rapid<br />

prototyping and describe embedded vision technology and how<br />

it’s being used.<br />

Keywords— Artificial Intelligence, AI, machine learning,<br />

Intelligence at the Edge, Edge Intelligence, Embedded Vision Kit,<br />

Embedded Vision Technology, CrossLink, ECP5, NanoVesta,<br />

FPGA, ASSP, HDMI<br />

I. INTRODUCTION<br />

Ten years ago, embedded vision technology was primarily<br />

used in relatively obscure, highly specialized applications.<br />

Today, designers are finding exciting new use cases for

embedded vision applications in a growing array of industrial,<br />

automotive, and consumer applications. Specifically, the<br />

emergence of advanced robotics and machine learning, as well<br />

as the migration to the Industry 4.0 manufacturing model,<br />

promise to create new applications for embedded vision.<br />

Driven by the rapid rise of mobile-influenced technologies, designers are faced with increasing their pace when designing

new products such as machine vision, Advanced Driver<br />

Assistance Systems (ADAS), drones, gaming systems,<br />

surveillance and security systems, virtual reality (VR) systems,<br />

medical equipment and AI solutions. All these applications<br />

benefit greatly from the accessibility and simplicity of embedded vision technology.

II. TECHNOLOGICAL CHANGE

What has changed? First and foremost, many of the key<br />

components and tools crucial to the rapid deployment of low<br />

cost embedded vision solutions have finally emerged. Now<br />

designers can choose from a wide range of lower cost<br />

processors and programmable logic capable of delivering<br />

higher performance in a compact footprint, all while<br />

consuming minimal power. At the same time, thanks to the<br />

rapidly growing mobile market, designers are benefiting from<br />

the proliferation of cameras and sensors. In the meantime,<br />

improvements in software and hardware tools are helping to<br />

simplify development and shorten time to market.<br />

The rapid rise in the number of sensors being integrated into the current generation of embedded designs, as well as the integration of low-cost cameras and displays, has opened the door to a wide range of exciting new intelligence and vision applications.

At the same time, this embedded vision revolution has<br />

forced designers to carefully re-evaluate their processing needs.<br />

New, data-rich video applications are driving designers to<br />

reconsider their decision to use a particular Applications<br />

Processor (AP), ASIC or ASSP. In some cases, however, large<br />

software investments in existing APs, ASICs or ASSPs and the<br />

high startup costs for new devices prohibit replacement. In this<br />

situation, designers are looking for co-processing solutions that<br />

can provide the added horsepower required for these new, data-rich applications without violating stringent system cost and power limits.

III. MOBILE INFLUENCE

While embedded vision solutions in one form or another<br />

have been around for many years, the growth rate of the<br />

technology has been limited by a number of factors. First and<br />

foremost, key elements of the technology have not been<br />

available at low cost. In particular, compute engines capable of<br />

processing HD digital video streams in real-time have not been<br />

widely available within the power and cost budget. Limitations in high-capacity solid-state storage and in advanced analytic algorithms have also presented challenges.




Three recent developments promise to radically change<br />

market conditions for embedded vision systems. First, the rapid<br />

development of the mobile market has given embedded vision<br />

designers a wide selection of processors that deliver relatively<br />

high performance at low power. Second, the recent success of<br />

the Mobile Industry Processor Interface (MIPI) specified by the<br />

MIPI Alliance, offers designers effective alternatives, using<br />

compliant hardware and software components to build<br />

innovative and cost-effective embedded vision solutions.<br />

Finally, the proliferation of low-cost sensors and cameras for mobile applications has helped embedded vision system designers drive adoption up and cost down.

IV. THE NEED FOR MORE PROCESSING POWER<br />

By definition, embedded vision systems include virtually<br />

any device or system that executes image signal processing<br />

algorithms or vision system control software. The key elements<br />

in an intelligent vision system typically include high<br />

performance compute engines capable of processing HD digital<br />

video streams in real-time, high capacity solid state storage,<br />

smart cameras or sensors, and advanced analytic algorithms.<br />

Processors in these systems can perform a wide range of<br />

functions from image acquisition, lens correction and image<br />

pre-processing, to segmentation, object analysis and AI.<br />

Designers of embedded vision systems employ a wide range of<br />

processor types including general purpose CPUs, Graphics<br />

Processing Units (GPUs), Digital Signal Processors (DSPs),<br />

Field Programmable Gate Arrays (FPGAs) and Application<br />

Specific Standard Products (ASSPs) designed specifically for<br />

vision applications. Each of these processor architectures offers<br />

distinct advantages and challenges. In many cases, designers<br />

combine multiple processor types in a heterogeneous<br />

computing environment. Other times, the processors are<br />

integrated into a single component. Moreover, some processors<br />

use dedicated hardware to maximize performance on vision<br />

algorithms. Programmable platforms such as FPGAs offer<br />

designers both a highly parallel architecture for compute-intensive applications and the ability to serve other purposes, such as expanding I/O resources.

V. IMAGE CAPTURE<br />

Designers of embedded vision systems can select from a<br />

wide variety of analog cameras and digital image sensors.<br />

Digital image sensors are usually CCD or CMOS sensor arrays<br />

that operate with visible light. Embedded vision systems can<br />

also be used to sense other types of energy, such as infrared, ultrasound, radar and LIDAR. Complex embedded vision systems require sensor fusion, in which data from several different sensors are "fused" to compute something that could not be determined by any one sensor alone.
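As a minimal illustration of such fusing, consider two noisy range estimates of the same object, say from a depth camera and a radar, combined by inverse-variance weighting; the fused estimate has lower variance than either sensor alone. The numbers are made up for illustration.

```python
# Inverse-variance fusion of two independent measurements of one quantity.

def fuse(est_a, var_a, est_b, var_b):
    w_a, w_b = 1.0 / var_a, 1.0 / var_b       # weight = inverse variance
    fused = (w_a * est_a + w_b * est_b) / (w_a + w_b)
    fused_var = 1.0 / (w_a + w_b)             # always < min(var_a, var_b)
    return fused, fused_var

# Camera says 10.4 m, radar says 10.0 m, both with variance 0.25 m^2:
est, var = fuse(10.4, 0.25, 10.0, 0.25)
print(round(est, 2), round(var, 3))  # -> 10.2 0.125
```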

Designers are moving increasingly to “smart cameras” that<br />

use the camera or sensor housing to serve as a chassis for all<br />

edge electronics in the vision system. Other systems transmit<br />

the sensor data to the cloud to reduce processing overhead on<br />

the system processor and, in the process, minimize system<br />

power, footprint and cost. However, this approach faces issues when low latency and critical decision-making based on the image sensor data are required.

VI. THE DESIGNER'S CHALLENGE

The widespread adoption of low cost, mobile-influenced<br />

MIPI peripherals has created new connectivity challenges.<br />

Designers want to take advantage of the economies of scale<br />

that the latest generation of MIPI cameras and displays offer.<br />

But they also want to preserve their existing investment in<br />

legacy devices. The main challenge the designers are faced<br />

with is creating customized prototypes quickly and costeffectively,<br />

while reusing their existing designs.<br />

What designers need is a highly flexible solution that offers the logic resources of a high-performance, "best-in-class" co-processor capable of the highly parallel computation required in vision and intelligence applications, while adding high levels of connectivity and support for a wide range of I/O standards and protocols. Moreover, this solution should offer a highly scalable architecture and support the use of mainstream, low-cost external DDR DRAM at high data rates. The device should be optimized for both low-power and low-cost operation, and offer designers the opportunity to use industry-leading, highly compact packages.

VII. USE CASES

The rapid evolution of embedded vision, driven by the availability of low-cost, mobile-influenced image sensors and displays, has led to exciting new commercial embedded vision applications. In some cases, however, large software investments in existing APs, ASICs or ASSPs and the high startup costs for new devices prohibit replacement. In this situation, designers are looking for co-processing solutions that can provide the added horsepower required for these new, data-rich applications without violating stringent system cost and power limits.

A. Machine Vision<br />

One of the most promising applications for embedded vision is in the industrial arena: machine vision systems. Machine vision technology is one of the most mature and highest-volume applications for embedded vision. As a result, it is widely used

in the manufacturing process and quality management<br />

applications. Typically, in these applications manufacturers use<br />

compact vision systems that combine one or more smart<br />

cameras with a processor module.<br />

Today, designers are finding a seemingly endless array of<br />

new applications for this technology. For example, a machine vision smart camera (Figure 1) is ideally suited to monitoring the production floor of a manufacturing facility. Designers can use an FPGA to serve as a sensor bridge, act as a complete camera Image Signal Processing (ISP) pipeline, and supply connectivity such as GigE Vision or USB3 Vision.

Figure 1: Machine Vision Smart Camera<br />

Another example is an FPGA-based video grabber (Figure 2), which aggregates data from multiple cameras and performs image pre-processing before sending it over a PCIe interface to a host processor.




Figure 2: Video Grabber<br />

B. Automotive<br />

Given the rapid rise in its use of electronics, the automotive market offers high growth potential for embedded vision

applications. The introduction of Advanced Driver Assistance<br />

Systems and infotainment features are expected to drive<br />

growth quickly. The embedded vision product most commonly<br />

used in these applications is the camera module. Vendors either develop analytics and algorithms in-house or embed third-party IP from external developers. One emerging automotive

application is a driver monitoring system which uses vision to<br />

track driver head and body movement to identify fatigue.<br />

Another one is a vision system that can monitor potential driver<br />

distractions, such as texting or eating, increasing vehicle<br />

operational safety.<br />

But vision systems in cars can do far more than monitor<br />

what happens inside the vehicle. Starting in 2018, regulations<br />

will require that new cars must feature back-up cameras to help<br />

drivers see behind the car. And new applications like lane<br />

departure warning systems combine video with lane detection<br />

algorithms to estimate the position of the car. In addition,<br />

demand is building for features that read warning signs,<br />

mitigate collisions, offer blind spot detection and automatically<br />

handle parking and park reverse assistance. All of these<br />

features promise to make driving safer and are required to<br />

make decisions right at the edge.<br />

Together, advances in vision and sensor systems for<br />

automobiles are laying the groundwork for the development of<br />

true autonomous driving capabilities. In 2018, for example,<br />

Cadillac will integrate a number of embedded vision<br />

subsystems into its CT6 sedan to deliver SuperCruise, one of<br />

the industry’s first hands-free driving technologies. This new<br />

technology will make driving safer by continuously analyzing<br />

both the driver and the road while a precision LIDAR database<br />

provides details of the road and advanced cameras, sensors and<br />

GPS react in real-time to dynamic roadway conditions.<br />

Overall, automakers are already anticipating that ADAS for modern vehicles will require forward-facing cameras for lane detection, pedestrian detection, traffic sign recognition and emergency braking. Side- and rear-facing cameras will be needed to support parking assistance, blind spot detection and cross-traffic alert functions.

One challenge auto manufacturers face is limited I/Os in<br />

existing electronic devices. Typically, processors today feature<br />

two camera interfaces. Yet many ADAS systems require as<br />

many as eight cameras to meet image quality requirements.<br />

Designers need a solution that gives them the co-processing<br />

resources to stitch together multiple video streams from<br />

multiple cameras or perform image processing functions such<br />

as white balance, fish-eye correction and defogging, on the<br />

camera inputs and pass the data to the Application Processor<br />

(AP) in a single stream. For example, many auto manufacturers<br />

offer as part of their ADAS system a bird’s-eye view capability<br />

that gives the driver a live video view from 20 feet above the<br />

car looking down. The ADAS system accomplishes this by<br />

stitching together data from four or more cameras with a wide<br />

Field-of-View (FoV).<br />

Historically designers have used a single processor to drive<br />

each display. Instead, designers can now use a single FPGA to<br />

replace multiple processors, aggregate all the camera data,<br />

stitch the images together, perform pre- and post-processing<br />

and send the image to the system processor. Figure 3 shows the<br />

simplified architecture of a Bird’s-eye-view 360 Automotive<br />

Camera system, which collects data from four cameras located<br />

around the car (front, back, and side). A single FPGA is used<br />

for various pre- and post-processing functions and stitches<br />

together the video data to provide a 360-degree view of the<br />

vehicle surroundings. In this case, a single FPGA replaces<br />

numerous processors.<br />

Figure 3: Bird's-eye-view 360 Automotive System
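The aggregation-and-stitch step can be sketched with toy data: four tiny single-value "frames" (front, right, left, back) composited into one surround view. A real bird's-eye system would also de-fisheye and perspective-warp each frame before blending; this sketch shows only the many-streams-into-one-image idea.

```python
# Toy surround-view composition: four camera frames tiled into one image.

def make_frame(value, h=2, w=2):
    """A tiny constant 'frame' standing in for one camera's video."""
    return [[value] * w for _ in range(h)]

def stitch_2x2(top_left, top_right, bottom_left, bottom_right):
    """Place four frames into the quadrants of one output image."""
    top = [a + b for a, b in zip(top_left, top_right)]
    bottom = [a + b for a, b in zip(bottom_left, bottom_right)]
    return top + bottom

front, back = make_frame(1), make_frame(2)
left, right = make_frame(3), make_frame(4)
surround = stitch_2x2(front, right, left, back)
for row in surround:
    print(row)
# -> [1, 1, 4, 4]
#    [1, 1, 4, 4]
#    [3, 3, 2, 2]
#    [3, 3, 2, 2]
```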

C. Consumer<br />

Drones, Augmented Reality/Virtual Reality (AR/VR) and<br />

other consumer applications offer tremendous opportunities for<br />

developers of embedded vision solutions. Today, drone designers are finding it cheaper to synchronize six or more cameras on a drone to create a panoramic view than to build a mechanical solution that takes two cameras and moves them 180 degrees. Similarly, AR/VR designers are converting a single video stream and splitting the content across dual displays. They make use of low-cost, mobile-influenced technology with two MIPI DSI displays, one for each eye, providing low-latency performance with minimal power consumption, enhanced depth perception and a more immersive user experience.




Figure 5: Customizable FPGA-based Prototyping Platform


Figure 4: FPGA-based Virtual Reality System<br />

VIII. HOW TO MASTER THE CHALLENGE

To overcome the designer's challenge while striving for fast product development cycles and the lowest-cost, lowest-power products with superior performance, it is highly recommended to take a modular approach. This method allows designers to customize their prototyping system based on existing, field-proven hardware and software and to reuse existing elements in their design.

Many hardware platforms, tailored to specific functions<br />

like sensor bridging, image processing or networking<br />

connectivity are available off-the-shelf from numerous<br />

vendors. Semiconductor manufacturers provide reference<br />

platforms or development boards incorporating their own<br />

products, while specialized design houses provide modular<br />

systems including semiconductor products from several<br />

vendors. Often these design houses also offer a commercial-grade ISP, which can simply be used with their hardware and quickly included in a prototype. Furthermore, many boards and

systems are available through electronic distributors or through<br />

online stores.<br />

One important factor when selecting the prototyping system<br />

of choice is the availability of the right connector and the<br />

capability to seamlessly connect multiple boards together.<br />

Numerous connector options exist, such as standard header pins, PMC mezzanine connectors, Milli-Grid connectors, or ERM5 rugged high-speed headers. The number of available connectors is nearly endless.

Header pins usually provide the most flexible option for wiring up several boards. However, the drawback is that it often becomes extremely difficult to connect several high-speed signals that require synchronization, as in video applications. Enormous amounts of time and money must be spent connecting development boards from different semiconductor vendors.

A smart modular solution for embedded vision prototyping is provided by Lattice Semiconductor with its Embedded Vision Development Kit. This kit is part of Lattice's Video Interface Platform (VIP), which allows for easy interchange of input and output interconnect boards through a simple snap-on concept. The kit incorporates a nanoVesta connector, allowing easy connection of a variety of different image sensors, which are available from third-party vendors like HelionVision.

This modular three-board set simplifies the implementation of highly flexible, cost-effective embedded vision solutions for mobile-influenced systems in industrial, automotive, and consumer markets. The development kit is built around a stackable three-board set that combines the CrossLink video bridge input board for sensor aggregation, an ECP5 FPGA processor board used for the ISP, and an HDMI output board, allowing easy connectivity to a standard HDMI display. The kit is complemented by an evaluation version of Helion's commercial-grade IONOS ISP. This ISP is sensor-independent and can easily be customized to any specific need. The IONOS ISP provides HD image signal processing for superior image quality and provides algorithms for pixel correction, white balance, debayering, color space conversion, gamma correction and more.

To help embedded vision developers rapidly build<br />

prototypes for a growing array of applications and shorten<br />

time-to-market, a modular system approach is highly<br />

beneficial. Lattice recognizes these benefits and plans to offer a<br />

variety of additional input and output boards in the near future.<br />

A newly released HDMI input board is available now and can be used as an alternative to the sensor bridge for camera aggregation.

The Embedded Vision Development Kit allows developers to take advantage of existing hardware building blocks and to customize their design functionality by easily mixing and matching new boards to meet the needs of industrial, automotive and consumer vision applications.



Selecting Cellular LPWAN Technology<br />

for the IoT<br />

Brent Nelson<br />

Sr. Product Manager – Long Range RF and Gateway Products<br />

Digi International<br />

Minnetonka, MN USA<br />

Abstract— For many Internet of Things (IoT) applications,<br />

high-throughput standards such as LTE-Advanced, with its<br />

throughput of 300Mbps, are overkill, since the amounts of<br />

data are relatively small. What’s more, devices and sensors<br />

are often deployed in far-flung, remote areas that often lack<br />

access to power, making a high-powered router unfeasible.<br />

To address this segment’s low-power, low-bandwidth<br />

requirements, the 3GPP, the cellular-standards body, is<br />

putting forth new “narrowband” standards. LTE Cat 1, LTE-<br />

M, and NB-IoT are designed to connect devices and sensors<br />

that dribble data and operate at very low power, allowing them<br />

to last multiple years on a battery.<br />

LTE Cat 1 networks are available in North America,<br />

Australia, and Japan, and are an excellent option for IoT<br />

devices that require cellular connectivity. With throughput<br />

speeds capped at 10 Mbps, this standard is significantly less complex and less power-hungry than Cat 3 or Cat 4 technologies, which support throughputs of 100 and 150 Mbps.

In the fall of 2017, carriers will activate their networks to<br />

support LTE-M in North America and NB-IoT in Europe.<br />

LTE-M has a maximum speed of about 1 Mbps and NB-IoT caps out at 144 Kbps, making them ideal for low-power, low-data-rate applications.

Keywords: Cat 1; LTE; NB-IoT; IoT; routers; 3G

I. INTRODUCTION<br />

Existing barriers to entry<br />

For several years, market research has suggested that the global IoT will drive exponential growth in the number of connected devices, with predictions going as high as 50 billion new connected devices. While the number of connected devices has grown, we have yet to see the exponential growth that was widely predicted.

There are many factors that have slowed the growth of the IoT. Perhaps the most challenging barrier is the cost and difficulty of getting data from edge devices like sensors and machines. No matter how interesting or potentially revolutionary a new technology is, every IoT solution must pass a Return on Investment (ROI) analysis before the investment will be made.

Historically, there have been three typical barriers to connecting edge devices when deploying an IoT solution that requires mass deployment of remote monitoring:

- Device cost
- Recurring cost
- Battery life
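These three barriers can be folded into a toy break-even calculation: a deployment pays for itself only when the monthly benefit outruns recurring connectivity cost plus amortized battery replacement. All figures below are made-up illustrations, not numbers from this paper.

```python
# Toy ROI check for a single remote-monitoring device.

def breakeven_months(device_cost, monthly_recurring, monthly_savings,
                     battery_cost=0.0, battery_life_months=None):
    """Months until cumulative net benefit covers the device cost."""
    monthly_battery = (battery_cost / battery_life_months
                       if battery_life_months else 0.0)
    net = monthly_savings - monthly_recurring - monthly_battery
    if net <= 0:
        return None  # the deployment never pays back
    return device_cost / net

# $50 device, $2/month airtime, $7/month savings, $6 battery lasting a year:
m = breakeven_months(50.0, 2.0, 7.0, battery_cost=6.0, battery_life_months=12)
print(round(m, 1))  # -> 11.1
```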

New wireless WAN technologies like Low-Power (LP) Cellular and LoRaWAN were developed specifically to address these barriers and have the potential to drive the exponential growth that has been predicted. This paper will discuss those cellular technologies in depth.

II. TECHNOLOGY OVERVIEW<br />

LTE‐CAT 1<br />

The first step on the roadmap of cellular for IoT devices was LTE Category (CAT) 1. LTE CAT 1 required only a software update to existing LTE networks, so it could be deployed quickly and now has pervasive coverage in the US and Canada, as well as across much of Europe.

LTE CAT 1 was designed to operate at 3G speeds, with a maximum downlink/uplink of 10 Mbps/5 Mbps. The main advantage of LTE CAT 1 was cost. LTE modules were priced at a premium versus 3G/2G devices, so IoT devices that used LTE were often priced out of the market when they did not require the bandwidth of LTE. This slowed the deployment of LTE even when the shutdown of 3G and 2G networks was on the near-term horizon.

LTE CAT 1 modules were priced at similar levels to 3G/2G<br />

models. This meant there was no incentive to continue to<br />

deploy 3G/2G modules since those networks are nearing EOL<br />



and there was no longer a cost advantage. LTE CAT 1, however, did not implement any power-saving features versus LTE CAT 3/4, so battery life continued to be a challenge for remotely deployed assets.

LTE Cat 1 networks are available in North America, Australia,<br />

and Japan.<br />

LTE-M and NB-IoT<br />

LTE-M and NB-IoT are the next step on the cellular IoT roadmap. Carriers have been deploying these networks throughout 2017, and that will accelerate in 2018.<br />

LTE-M and NB-IoT are market-disruptive technologies thanks to their low cost, increased link budget and low power. These are the two technologies that will lead to the inflection point in the deployment of IoT.<br />

These technologies are often referred to as narrowband cellular. The typical 20 MHz bandwidth of LTE is shrunk to 1.4 MHz for LTE-M and 200 kHz for NB-IoT. While narrowing the band would be considered a bad thing for a high-throughput application, for these technologies a narrower band leads to lower cost, better receive sensitivity and lower power.<br />

A quick summary of the technical specifications of the different standards is shown below.<br />

Technology | Release | Downlink speed    | Uplink speed      | Antennas | Duplex mode | Receive bandwidth | Transmit power | Modem complexity<br />
CAT4       | 8       | 150 Mbps          | 50 Mbps           | 2        | Full duplex | 20 MHz            | 23 dBm         | 100%<br />
CAT1       | 8       | 10 Mbps           | 5 Mbps            | 2 or 1   | Full duplex | 20 MHz            | 23 dBm         | 80%<br />
LTE-M      | 13      | 200 kbps - 1 Mbps | 200 kbps - 1 Mbps | 1        | Half duplex | 1.4 MHz           | 20 dBm         | 20%<br />
NB-IOT     | 13      | 200 kbps          | 144 kbps          | 1        | Half duplex | 200 kHz           | 23 dBm         | -<br />


IV. TRADEOFFS<br />

Since nothing in life is free, moving to either the NB-IoT or LTE-M standard involves tradeoffs. Understanding these is critical to picking the technology that will meet your application requirements.<br />

Mobile Initiated vs Mobile Terminated<br />

One of the first things engineers need to understand when looking at these LP-WAN technologies is that they are designed for mobile-initiated calls (meaning the end device initiates the data connection to the application). To support the ultra-low power modes, these modules go into a sleep state and cannot receive traffic from the network. The networks are not guaranteed to store network-initiated messages that cannot be received (as they would with SMS), so data may be lost on network-initiated calls. Note that keeping the device in a state where it can receive messages would defeat the purpose of the technology and make it only slightly better than standard LTE or 3G devices.<br />

LP-Cellular technologies like LTE-M and NB-IoT have the potential to grow the number of deployed IoT devices exponentially. They offer significant benefits in terms of device cost, recurring cost and battery life. While they are an ideal technology for many IoT devices, there are tradeoffs, and any engineer or company looking to deploy these technologies should study the tradeoffs fully to ensure they meet the needs of their application.<br />

Latency<br />

The second major tradeoff is latency. When an LP-WAN device goes into low power mode, latency increases significantly. In Discontinuous Reception mode (DRX or eDRX), the device only listens for incoming traffic every x seconds in DRX mode and every x seconds in eDRX mode.<br />

In Power Save Mode (PSM), mobile-terminated connections are not possible, so the latency is defined by how often the remote device wakes up and connects to the network; it could easily be measured in hours or days.<br />
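The impact of these power saving modes on worst-case downlink latency can be sketched with a short calculation. The default cycle lengths below are illustrative assumptions (the text leaves the exact intervals open), not values from any specific network.<br />

```python
# Worst-case downlink latency for an LP-WAN device in different power modes.
# The default cycle lengths below are illustrative assumptions only.

def worst_case_latency_s(mode: str,
                         edrx_cycle_s: float = 81.92,
                         psm_wake_interval_s: float = 24 * 3600.0) -> float:
    """Longest time a mobile-terminated message may wait before delivery."""
    if mode == "eDRX":
        # The device listens once per eDRX cycle, so a message can wait a full cycle.
        return edrx_cycle_s
    if mode == "PSM":
        # In PSM the device is unreachable until it wakes and contacts the network.
        return psm_wake_interval_s
    raise ValueError(f"unknown mode: {mode!r}")
```

With a daily wake-up interval, the PSM case yields a worst-case latency of a full day (86 400 s), matching the observation that PSM latency can be measured in hours or days.<br />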

Throughput<br />

The third tradeoff is throughput. This one is obvious; these networks were not designed for high-bandwidth applications like video. The typical throughput on an LTE-M network is 300 kbps, while the throughput of an NB-IoT network is typically<br />


Sigfox – connecting the world with one LPWAN<br />

a global low power wide area UNB non-slotted Aloha transmission IoT approach<br />

Alexander Lehmann, M.Eng.<br />

Principal Engineer<br />

Sigfox Germany<br />

Munich, Germany<br />

Alexander.Lehmann@sigfox.com<br />

Abstract – Efficient transmission of small data packets in a shared spectrum is challenging. This paper gives a brief summary of using sub-GHz general purpose transmitters to transmit low power wide area UNB frames without signaling, using Sigfox as an example.<br />

Keywords – Sigfox; UNB; IoT; Sub-Ghz; non-slotted; Aloha;<br />

spectrum frugality; low power; wide area<br />

I. INTRODUCTION<br />

Spectrum is a scarce resource, which makes it important to use it as efficiently as possible. Therefore, a UNB approach with high spectral efficiency is essential, even more so in a license-free and shared band. Sigfox, as a widely deployed network (35 countries and counting), shall be used as an example to discuss the current state-of-the-art Ultra Narrow Band network. An outlook will be given on possible future features, advanced processing methods and updates.<br />

II. RANDOM TIME AND FREQUENCY<br />

A. Design choices<br />

As a starting point, an Aloha-approach network was chosen. Each device can emit randomly (in compliance with the duty cycle, the maximum transmitted power and the frequency band, e.g. in the ETSI Zone 1) so that there is no signaling, for power saving and lifetime predictability reasons.<br />

B. Resulting Consequences<br />

This also brings some consequences for reliability: to compensate for the random transmissions in time and frequency, retransmissions and a cooperative reception approach were chosen. Each message is repeated 2 times on different frequencies with < 50 µs in between; the complete message therefore consists of 3 frames. All base stations in range pick up the signal and forward it to the backend. This redundancy is enough, and represents a tradeoff between keeping the spectrum utilization as minimal as possible and having the desired Quality of Service (QoS), as shown in Section III.<br />

III. QUALITY OF RECEPTION<br />

When messages arrive randomly, a simulation can give a first indication of the performance of this concept; the desired quality is that 99.99% of the received messages have to be decoded:<br />

Figure 1: Maximal BS Load for 99.99% QoS [Simulation by Sigfox]<br />

When the success rate of 99.99% is required, the maximal input load is 14% of the channel capacity; a Sigfox message is 100 Hz wide and the channel is 192 kHz:<br />

192 kHz / 100 Hz × 14% ≈ 269 (1)<br />

So 269 concurrent messages per second should be decodable. Practical tests have shown that this number holds up. This results in a capacity of more than 10 million frames per day. Should this ever become insufficient, there is always the possibility to decrease the sensitivity of the base stations and then add more in the desired area. The cooperative approach (all base stations pick up all the signals they receive and relay them to the Sigfox cloud) makes it easy to enhance the network in terms of capacity and coverage.<br />
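The capacity figures above can be reproduced with a short calculation:<br />

```python
# Base-station capacity estimate from the text: a 192 kHz channel,
# 100 Hz messages, and a 14% maximal load for 99.99% QoS.

def concurrent_messages(channel_hz: float = 192_000.0,
                        message_hz: float = 100.0,
                        max_load: float = 0.14) -> int:
    # Number of 100 Hz message slices the channel can carry at the given load.
    return round(channel_hz / message_hz * max_load)

def frames_per_day(messages_per_second: int) -> int:
    # Daily capacity if that concurrency is sustained every second.
    return messages_per_second * 86_400
```

This yields 269 concurrent messages per second and a daily capacity above 23 million frames, comfortably over the quoted 10 million frames per day.<br />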

1<br />

“ETSI EN 300 220”: Short Range Devices (SRD) operating<br />

in the frequency range 25 MHz to 1 000 MHz<br />



IV. MINIMISING THE MESSAGE OVERHEAD<br />

A. Uplink Frame Characteristics<br />

Besides the payload, a device has to have a unique ID, the message must be checked to see whether it was received and decoded correctly, the base stations must be able to tune in on a message, and so on. Thus, a message consists primarily of:<br />

+----------+---------------+-----------+---------+-----+-----+<br />
| Preamble | Sync & Header | Device Id | Payload | MAC | CRC |<br />
+----------+---------------+-----------+---------+-----+-----+<br />

• Preamble: 19 bits, always 0b1010101010101010101<br />
• Sync & Header: 17 bits and a 12 bit counter<br />
• Device Id: 32 bits<br />
• Payload: 0 – 96 bits<br />
• MAC (Authentication): 16 – 40 bits<br />
• CRC: 16 bits<br />

B. Downlink Frame Characteristics<br />

For downlink messages, the format is similar, but additionally a forward error correction was introduced. One of the reasons is that there is no repetition on the downlink. Below is the structure of a downlink message:<br />

+----------+------+-----+---------+-----+-----+<br />
| Preamble | Sync | ECC | Payload | MAC | CRC |<br />
+----------+------+-----+---------+-----+-----+<br />

• Preamble: 91 bits, always 0b1010101010101010101010101010101010101010101010101010101010101010101010101010101010101010101<br />
• Sync: 13 bits, always 0b1001000100111<br />
• ECC (Forward Error Correction): 32 bits<br />
• Payload: 0 – 64 bits<br />
• MAC (Authentication): 16 bits<br />
• CRC: 8 bits<br />

The downlink is always 600 bps in all zones and occupies 1600 Hz. Gaussian FSK was chosen as the modulation. As the base stations operate in listen-before-talk mode (as opposed to the devices), they must obey a different duty cycle and are also not limited to the e.g. 14 dBm for devices in the ETSI Zone; here 27 dBm can be used. For frequency stability, the drift of the downlink (base station sends, device receives) is lower, and the center frequency is chosen in correspondence to the received frequency of the preceding uplink message. The base station that sends the downlink is the one with the best reception quality of the corresponding uplink.<br />
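From the field sizes listed above, the minimum and maximum uplink frame lengths can be sketched. Treating "Sync & Header" as 17 + 12 bits is an interpretation of the bullet list, not something the text states explicitly.<br />

```python
# Uplink frame length bounds from the listed field sizes (in bits).
# "sync_header" as 17 + 12 bits is an assumption based on the bullet above.

UPLINK_FIELDS = {
    "preamble": 19,
    "sync_header": 17 + 12,   # sync/header plus the 12-bit counter
    "device_id": 32,
    "payload": (0, 96),       # variable-length
    "mac": (16, 40),          # variable-length authentication tag
    "crc": 16,
}

def frame_bits(fields):
    """Return (min_bits, max_bits) over all field size choices."""
    lo = hi = 0
    for size in fields.values():
        a, b = (size, size) if isinstance(size, int) else size
        lo += a
        hi += b
    return lo, hi
```

This gives a range of 112 to 232 bits, i.e. at the 100 bps uplink rate a single frame lasts roughly 1 to 2.3 seconds, consistent with three frames fitting into roughly 6 seconds as described later.<br />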

V. RADIO CHARACTERISTICS<br />

The occupied bandwidth in the uplink (device transmits, base stations receive) was already given above: it is 100 Hz in the ETSI Zone (RC1) and 600 Hz in other zones (FCC, RC2). Corresponding to the bandwidth, the data rate is 100 bps when occupying 100 Hz and 600 bps when using 600 Hz. The maximum allowed emission power is regulated by the corresponding regulatory organizations. To be able to manufacture cheap devices, the frequency drift is also not too critical, as long as it is not too significant; a regular crystal can achieve this. For resilience reasons, a differential PSK is used in binary mode:<br />

Figure 3: 2GFSK-Modulation [Illustration by Sigfox]<br />

VI. POWER CONSIDERATIONS<br />

A. Initial Power Planning<br />

Not only sending, but also receiving a message and the listening time beforehand put considerable strain on the battery. On average, sending can be considered 10,000 times more power hungry than idle / deep sleep for the ICs used. Listening and receiving use around half the power of transmitting (but this depends heavily on the transceiver used!).<br />

B. Lack of Signaling<br />

As there is no signaling at all in the network and the radiated power is always the same, a very good lifetime / power usage prediction can be created for different IoT customer scenarios. Depending on the transceiver, the implementation and the number of messages sent and received, a lifetime of over 5 years on a standard AA battery has been shown, with battery aging being the biggest concern.<br />
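A back-of-the-envelope lifetime estimate illustrates why such predictions are possible when there is no signaling. Every current and capacity value below is an illustrative assumption, not a figure from the text.<br />

```python
# Rough battery lifetime estimate for a device that only wakes to transmit.
# All electrical values are illustrative assumptions.

def lifetime_years(battery_mah: float = 2500.0,      # AA-class cell capacity
                   sleep_ua: float = 5.0,            # deep-sleep current
                   tx_ma: float = 50.0,              # transmit current
                   tx_seconds_per_msg: float = 6.0,  # ~3 uplink frames per message
                   msgs_per_day: int = 4) -> float:
    # Charge drawn per day while sleeping and while transmitting, in mAh.
    sleep_mah_per_day = sleep_ua / 1000.0 * 24.0
    tx_mah_per_day = tx_ma * tx_seconds_per_msg * msgs_per_day / 3600.0
    return battery_mah / (sleep_mah_per_day + tx_mah_per_day) / 365.0
```

With these assumptions the model predicts well over 5 years; real lifetimes depend heavily on the transceiver, the number of messages and battery aging, as noted above.<br />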

VII. DEVICE INITIATED COMMUNICATIONS<br />

Figure 4 gives a qualitative overview of a complete downlink cycle. After the three uplink frames are sent in roughly 6 seconds, depending on the payload, the device goes into sleep mode. 20 seconds after the end of the first frame, it wakes up and tunes in on the expected downlink frequency. When the downlink is received, within a maximum timeout of 25 s, the transceiver sends, 1.4 s after the end of the downlink frame, a control frame as an acknowledgement containing the temperature, voltage (VDD idle & VDD tx) and the RSSI. Downlink messages do not have to be sent, but when the bit for a downlink is set, the device will listen for 25 s anyway. An upload-only cycle, in comparison, ends after the first three peaks.<br />

Figure 2: DBPSK-Modulation [Illustration by Sigfox]<br />

Figure 4: Power consumption over time for a DL cycle<br />



A. Downlink Cycle Considerations<br />

As the transceivers would waste considerable power in an always-on listen mode, the resulting battery life is not manageable from an economic point of view; that is why bidirectional communication can only be established from the device.<br />

B. Battery Strain<br />

Over time, the battery will have problems supplying the peak<br />

power consumption in transmission or reception mode. The<br />

specified waiting time in between reduces the strain on the<br />

battery, as it can recover for a brief period of ~16 seconds.<br />

C. Security<br />

An additional advantage of device-initiated communication is hardened security: without a previous uplink, the device cannot receive a downlink, as it is simply not in listening mode. That hardens the device against malicious frames intended for misuse.<br />

VIII. OUTLOOK<br />

To further enhance the successful reception of messages, either brute-force methods or insights, possibly gained together with the customers, can be used. The points stated below only give a hint of possible advanced recovery methods to come; they are neither complete nor fully described.<br />

ACKNOWLEDGMENT<br />

The author wants to thank his technical colleagues in France for deep and insightful discussions over the last 18 months.<br />

REFERENCES<br />

[1] “ETSI EN 300 220”: Short Range Devices (SRD) operating<br />

in the frequency range 25 MHz to 1 000 MHz<br />

Alexander Lehmann received his B.Eng. in<br />

electrical engineering from the University of Applied<br />

Sciences in Munich and his M.Eng. from the<br />

Deggendorf Institute of Technology. Before joining<br />

Sigfox, he worked at Advanced Micro Devices. His<br />

main interests are in the fields of video compression<br />

and advanced communications and lecturing about<br />

the latter to customers.<br />

A. Forward ECC<br />

As already specified for the downlink, a forward error correction can also be considered for the uplink.<br />

B. Combined Reception and Recombination<br />

When a decoded message at the base station has a CRC mismatch, it is currently discarded. If more than one frame was received and/or the message was received at more than one base station, there are multiple copies of the frame; these can be compared and recombined until the CRC matches.<br />
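A minimal sketch of such recombination, assuming a simple bitwise majority vote across three received copies and a generic CRC-16 check. The polynomial used here is an assumption for illustration; the actual Sigfox CRC is not specified in the text.<br />

```python
# Recombining corrupted copies of a frame until the CRC matches.
# crc16 uses a generic CCITT-style polynomial; the real Sigfox CRC is an assumption.

def crc16(data: bytes, poly: int = 0x1021, init: int = 0xFFFF) -> int:
    crc = init
    for byte in data:
        crc ^= byte << 8
        for _ in range(8):
            crc = ((crc << 1) ^ poly) & 0xFFFF if crc & 0x8000 else (crc << 1) & 0xFFFF
    return crc

def majority_vote(copies):
    """Bitwise majority vote across equal-length corrupted copies of one frame."""
    result = bytearray(len(copies[0]))
    for i in range(len(copies[0])):
        for bit in range(8):
            ones = sum((c[i] >> bit) & 1 for c in copies)
            if ones * 2 > len(copies):
                result[i] |= 1 << bit
    return bytes(result)
```

If each copy is corrupted in different bit positions, the vote recovers the original frame, which can then be confirmed against the transmitted CRC.<br />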

C. Expecting (parts of) the Message / Payload<br />

When working together with the customer, insight can be gained into what the payload (or parts of it) should contain. The more payload bits are known, the fewer combinations must be tried to get a CRC match. The counter is also known at the Sigfox cloud. Static devices with transmission history are another example, as the device IDs are then known at the usual receiving base stations. In its current protocol implementation, the header can only take a significantly smaller number of values than the 17 bits could represent. All these known, or easily achieved, educated guesses can significantly reduce the number of tries needed for a CRC success.<br />

D. Deep learning<br />

To automate this process further, patterns in the messages can be recognized, and over a significant number of messages these patterns can then predict what the coming messages should look like. If there are deviations and the CRC fails, an approach leaner than brute force can be used.<br />



Why LTE Cat M1 and NB-IoT make perfect sense for<br />

digitization and smart predictive infrastructure<br />

Ludger Boeggering<br />

Market Development Manager<br />

u-blox AG<br />

Thalwil, Switzerland<br />

Ludger.Boeggering@u-blox.com<br />

Abstract— The newly implemented LPWA technologies<br />

Narrowband-IoT (NB-IoT or Cat NB1) and LTE Cat M1<br />

combine the advantage of using commonly available infrastructure with low cost, low power consumption, deep in-building penetration and high numbers of simultaneously operating units.<br />

NB-IoT provides its advantages in efficient battery operation<br />

and in-building penetration, opening the possibility for coverage<br />

underground and deep within factory buildings. By design, NB-<br />

IoT considers cost-optimized deployment essential due to the<br />

“clean slate” approach. NB-IoT does not need the intelligence to<br />

coexist with classic 4G traffic, hence there is a fair chance to<br />

reach customer cost expectations.<br />

Due to the fact that LTE Cat M1 technology is much closer to<br />

the standard LTE network, this technology allows applications<br />

which need more bandwidth and LTE-like latency. Nevertheless,<br />

LTE Cat M1 also provides extensive power saving modes and<br />

improved coverage.<br />

In combination with the security concept for the whole communication module, security by design addresses upcoming security requirements for the industrial IoT.<br />

A further aspect of IoT is the remote management of a large<br />

and growing category of connected devices; those with limited<br />

bandwidth and those viable only at very low production costs.<br />

For these types of devices a standard has been introduced by the Open Mobile Alliance (OMA), called Lightweight M2M (LwM2M).<br />

Keywords—LTE; Cat M1; Cat NB1; 4G; 3G; 2G; low-power<br />

wide-area; LPWA; IoT; Internet of Things; NarrowbandIoT; NB-<br />

IoT; eMTC; 3GPP; licensed spectrum; smart; electricity; utility;<br />

gas; water; heat; power; city; building; environmental; agriculture;<br />

security; connected health<br />

I. INTRODUCTION AND TECHNOLOGY BRIEF<br />

Different wireless LPWA technologies have been invented in the past few years with the target of being specifically optimized for the Internet of Things (IoT). It is expected that the IoT will allow for numerous services in different industrial and consumer application areas. Examples of applications are VIP/pet/bike tracking, assisted living/medical, security systems and detectors, agriculture tracking, water/gas/heat/electricity metering, vending, fleet management, waste-bin/tank monitoring, lighting, parking and traffic management, etc.<br />

In general, the invented technologies operate in either a licensed or an unlicensed frequency environment. There are current proprietary low-power wide-area (LPWA) technologies like Sigfox, LoRa or RPMA, but also the fast-approaching 3GPP-standardized cellular IoT technologies LTE Cat NB1 (or Narrowband-IoT, NB-IoT) and LTE Cat M1. According to Figure 1, each of the technologies has its specific features; hence technology selection should always be driven by the application requirements and by technology preference itself.<br />

u-blox has been involved in early stage trials and proof of<br />

concepts for NB-IoT, LwM2M and LTE Cat M1, covering a<br />

range of use cases.<br />

The presentation will give deep dive insights into these<br />

communication infrastructure technologies. Part of the<br />

presentation will cover each technology in detail by using<br />

dedicated use cases from the energy market and from predictive<br />

maintenance.<br />

Mobile network operator involvement will also be outlined, with an outlook into the challenges of handling the future millions of devices on their networks. Finally, this presentation will outline future 3GPP standards for the IoT.<br />

Fig. 1. Technology overview<br />

NB-IoT is a clean slate technology that can be implemented<br />

into radio and core networks of existing LTE or 2G cellular<br />

networks. The radio network supports work with simple, low<br />

cost devices. The transmission and higher layer protocols help<br />

devices consume less power with the aim of achieving a battery<br />



life of over ten years. Finally, extended coverage is provided<br />

for deep indoor penetration and rural areas.<br />

Fig. 2. NB-IoT overview<br />

LTE Cat M1 is a low‐power wide‐area (LPWA) air<br />

interface that lets you connect IoT and M2M devices with<br />

medium data rate requirements (375 kb/s upload and download<br />

speeds in half duplex mode). It enables longer battery<br />

lifecycles and greater in‐building range as compared to<br />

standard cellular technologies such as 2G, 3G or LTE Cat 1.<br />

Key features include:<br />

Support of voice functionality via VoLTE<br />

Full mobility and in‐vehicle hand‐over<br />

Low power consumption<br />

Extended in‐building range<br />

Fig. 3. LTE Cat M1 overview<br />

II. EXCEEDING EXPECTATIONS<br />

LTE Cat M1 is part of the same 3GPP Release 13 standard that also defined Narrowband-IoT (NB-IoT or LTE Cat NB1); both are LPWA technologies in the licensed spectrum. With uplink and downlink speeds of 375 kb/s in half duplex mode, Cat M1 specifically supports IoT applications with low to medium data rate needs. At these speeds, LTE Cat M1 can deliver remote firmware updates over-the-air (FOTA) within reasonable timeframes, making it well-suited for critical applications running on devices that may be deployed in the field for extended periods of time.<br />

Battery life of up to 10 years on a single charge in some use cases also contributes to lower maintenance costs for deployed devices, even in locations where end devices may not be connected directly to the power grid.<br />

As compared to NB-IoT, LTE Cat M1 is ideal for mobile use cases, because it handles hand-over between cell towers similarly to high speed LTE. For example, if a vehicle moves from point A to point B, crossing several different network cells, a Cat M1 device behaves like a cellular phone and never drops the connection. An NB-IoT device needs to re-establish a new connection after a new network cell is reached.<br />

Another benefit is the support of voice functionality via VoLTE (voice over LTE) for applications requiring a level of human interaction, such as certain health and security applications (e.g. stay-in-place solutions and alarm panels).<br />

III. LANDSCAPE OF APPLICATIONS<br />

The IoT targets the connectivity of things to autonomously exchange status information and sensor data with each other and with cloud-based information platforms. Different technologies are currently in use or will be used in the future, including wired and short range wireless technologies.<br />

There are some applications which are best operated over cellular IoT technologies, especially LPWA. Such applications include smart metering and battery powered sensors.<br />

Fig. 4. NB-IoT applications<br />

Out of these applications, a number of key technology requirements can be identified to support massive deployment:<br />

• Lowest power consumption<br />
• Low device cost<br />
• Low deployment cost<br />
• In-building penetration and extended coverage<br />
• High scalability and a massive number of devices<br />



Fig. 5. LTE Cat M1 applications<br />

A. Automotive & transportation<br />

LTE Cat M1 supports full hand-over between network cells from a moving vehicle and is therefore well-suited for mobile use cases with low to medium data rate needs, such as vehicle tracking, asset tracking, telematics, fleet management and usage-based insurance.<br />

B. Smart metering<br />

Cat M1 is also ideal for monitoring metering and utility applications via regular, small data transmissions. Network coverage is a key issue in smart metering rollouts. Since meters are commonly located inside buildings or basements, Cat M1's extended range leads to better coverage in hard-to-reach areas.<br />

NB-IoT is well suited for monitoring gas and water meters via regular, small data transmissions. Meters have a very strong tendency to turn up in difficult locations, such as in cellars, deep underground or in remote rural areas. NB-IoT has excellent coverage and penetration to address this issue.<br />

C. Smart buildings<br />

Cat M1 can easily provide basic building management functionality, such as HVAC, lighting and access control, with its enhanced indoor range. Since it also features voice functionality via VoLTE, it is well-suited for critical applications like security systems and alarm panels.<br />

NB-IoT connected sensors can send alerts about building maintenance issues and perform automated tasks, such as light and heat control. NB-IoT can also act as a backup for the building broadband connection. Some security solutions may even use LPWA networks to connect sensors directly to the monitoring system, as this configuration is more difficult for an intruder to disable, as well as easier to install and maintain.<br />

D. Smart cities<br />

Within smart cities, Cat M1 can meet a variety of needs and effectively control street lighting, determine waste management pickup schedules, identify free parking spaces, monitor environmental conditions, and survey the condition of roads in a matter of milliseconds. NB-IoT can likewise help local government control street lighting, determine when waste bins need emptying, identify free parking spaces, monitor environmental conditions, and survey the condition of roads.<br />

E. Consumers<br />

NB‐IoT will provide wearable devices with their own<br />

long‐range connectivity, which is particularly beneficial for<br />

people and animal tracking. Similarly, NB‐IoT can also be<br />

used for health monitoring of those suffering from chronic or<br />

age‐related conditions.<br />

F. Agricultural and environmental<br />

NB‐IoT connectivity will offer farmers tracking<br />

possibilities, so that a sensor containing a u‐blox NB‐IoT<br />

module can send an alert if an animal’s movement is out of the<br />

ordinary. Such sensors could be used to monitor the<br />

temperature and humidity of soil, and in general to keep track<br />

of attributes of land, pollution, noise, rain, etc.<br />

G. Connected health<br />

Due to its extended in‐building range, voice support and<br />

mobility, Cat M1 is also a well‐matched air interface choice<br />

for connected health applications, such as outpatient<br />

monitoring and stay‐in‐place solutions.<br />

IV. TECHNOLOGY HIGHLIGHTS<br />

NB-IoT optimizes coverage, device battery life / power consumption and costs, as well as capacity for a massive number of connected devices, and supports a scalable solution for low data rates. It can be deployed either in shared spectrum together with legacy LTE, or stand-alone, e.g. on a re-farmed 2G carrier, with a narrow bandwidth of about 180 kHz. NB-IoT can coexist with LTE and, in standalone operation, also coexists with 2G, 3G and 4G. Due to changes in synchronization signal design and simplified physical layer procedures, the complexity of NB-IoT devices is even lower than that of 2G devices.<br />

One of the drivers for inventing this new technology is the requirement of extended coverage. Smart meters are a simple example, because they are often installed in the basements of buildings and surrounded by concrete. Methods to optimize indoor coverage use a combination of techniques, such as power boosting of data and reference signals, repetition, retransmission, more tolerant modulation schemes and accepting lower signal strength levels.<br />

Together, these methods provide an increased link budget of up to 23 dB compared with 2G or 3G technology, at the trade-off of increased latency in extreme conditions due to repeated transmissions.<br />
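The coverage gain from repetition can be approximated as an ideal combining gain of 10·log10(N) dB for N repetitions. This is a simplification that ignores implementation losses, used here only to show the order of magnitude.<br />

```python
import math

def repetition_gain_db(n_repetitions: int) -> float:
    # Ideal combining gain from repeating a transmission n times.
    return 10.0 * math.log10(n_repetitions)
```

Under this approximation, roughly 200 repetitions are needed for a 23 dB gain, which also illustrates why latency grows so sharply under extreme coverage conditions.<br />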



nomadic and globally usable products allow more flexibility<br />

for OEMs and customers, reducing cost. Especially for mobile<br />

IoT applications, such as wearables and tracking devices, a<br />

technology providing global roaming increases the usability.<br />

Due to 3GPP standardization, NB-IoT equipped products can<br />

be used everywhere a NB-IoT network is available.<br />

Fig. 6. Optimized coverage<br />

Compared with mesh radio solutions and other technologies<br />

in unlicensed spectrum, NB-IoT and LTE Cat M1 are operating<br />

in a controlled/licensed frequency spectrum (see Figure 7).<br />

Operating in a licensed environment allows the management of<br />

interference and offering of quality of service.<br />

V. ARCHITECTURE<br />

NB-IoT technology is designed such that it can be used in<br />

areas beyond the radio coverage of current cellular standards<br />

(GSM, UTMS or LTE) and in applications which typically<br />

require low power consumption, especially when run from<br />

battery power for many years. A corollary of this is that the<br />

devices will generally send small amounts of data infrequently;<br />

a typical usage scenario might be 100 bytes sent twice per day.<br />

Even higher data volume and number of transmissions are<br />

possible, but with an impact on power consumption.<br />

By design NB-IoT is dedicated to initiate the connection<br />

and data transfer. The system operation is analogous to SMS in<br />

that it is a datagram-oriented, stored-and-forward system,<br />

rather than a GPRS-like IP pipe. This is because NB-IoT<br />

devices can spend most of their time asleep, making possible to<br />

operate over long time from a battery.<br />

The NB-IoT standard specifies three different<br />

communication options: IP data, Non-IP data and SMS, as<br />

shown in the Figure 8 and Figure 9. Among these, IP data<br />

transmission using UDP/IP is currently the most used method,<br />

fully supported worldwide by network operators. IP data<br />

communication UDP/IP allows the use of the existing core<br />

network and supports efficient power consumption to allow<br />

long battery life time.<br />

Use of TCP/IP in a battery operated application is not the<br />

best possible way to reach longest battery life time due to<br />

higher power consumption for retransmissions and<br />

acknowledgement messages.<br />

Fig. 7. Technology Comparison<br />

Another key advantage of NB-IoT is its support for two-way communication. Depending on the selected power saving modes, an NB-IoT equipped device can communicate bidirectionally, with lower latency for control purposes and with higher latency for pure data transfer. This bi-directional capability can be used for any kind of interaction with the connected device, e.g. data collection, remote management and control, as well as firmware updates over the air (FOTA). FOTA is an extremely important function for adapting devices to changing security requirements in the future.<br />

The cost of implementation (CAPEX) and operation (OPEX) of a network must also be taken into consideration. With NB-IoT, in most cases a simple software upgrade allows easy implementation and deployment of this technology in the field, without the need to set up a completely new network, including the search for the best positions for the base stations.<br />

GSM was unique among the global standards for mobile communications, as it operated in a majority of the world. NB-IoT has the chance to be as pragmatic and globally usable as 2G was in the past. Even most of the usable applications are<br />

Fig. 8. NB-IoT communication methods<br />

Fig. 9. Protocol structure of a NB-IoT network<br />



Using IP data, an NB-IoT module is able to send raw data through UDP sockets to a destination IP address. The data sent over the socket via AT commands is not wrapped in any additional layer by the module: the data provided is exactly the data that is sent.<br />
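This behavior can be sketched with plain sockets: whatever bytes the application hands to the UDP socket arrive unchanged at the receiver. The host, port and payload below are illustrative assumptions, not part of any NB-IoT specification.

```python
import socket

# Minimal sketch of the receiving (IoT platform) side: a UDP server that
# gets the raw bytes exactly as the module wrote them to its socket.

def open_udp_receiver(host="127.0.0.1"):
    """Bind a UDP socket on an ephemeral port and return it."""
    sock = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
    sock.bind((host, 0))
    return sock

server = open_udp_receiver()

# Stand-in for the NB-IoT module: no extra framing is added by the module,
# so what the application hands over is exactly what is transmitted.
module = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
module.sendto(b"temp=21.5;batt=98", server.getsockname())

payload, _ = server.recvfrom(2048)
print(payload)  # the datagram arrives unwrapped
```

The same symmetry holds in the real system: the payload handed to the module's socket AT command is the payload the platform stores.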

In a typical NB-IoT application, as shown in Figure 10, at<br />

the far left there is the end device that contains a u-blox NB-<br />

IoT module. The module communicates over the radio network<br />

with a cell tower that supports NB-IoT. The cellular network<br />

links the cell tower with an IoT platform that stores uplink<br />

messages from the module. The server, on the right,<br />

communicates with the IoT platform to retrieve uplink<br />

messages and to send downlink messages to the module.<br />

Fig. 10. Typical NB-IoT application<br />

The end device, the IoT platform and the cloud server application can use additional protocols on top of UDP/IP, but these are transparent to the NB-IoT module and have to be implemented outside the module. A possible IoT scenario is shown in Figure 11, where the CoAP protocol is implemented on both the end device (outside the module) and the IoT platform. In this example, the IoT cloud server uses the AMQP interface to send and receive data to/from the IoT platform.<br />

Fig. 11. End-to-end protocol architecture<br />

NB-IoT technology is not session oriented; depending on the application, UDP/IP is usually preferred over TCP/IP, and the latency between packets in uplink and downlink may vary from milliseconds to minutes. Latency depends on the coverage conditions: in poor coverage conditions the latency increases. UDP sockets do not create connections to the servers, since UDP is a connection-less datagram protocol, and messages sent by the end device (by the NB-IoT module) may never be received by the server. An NB-IoT application should take all these aspects into account. Message acknowledgement is required for a range of use cases, such as grid control and near real-time monitoring.<br />
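Since the transport gives no delivery guarantee, such acknowledgement has to live in the application layer. The sketch below shows one simple pattern: tag each datagram with a message ID and resend until a matching ACK arrives. The 2-byte ID, the `b"ACK"` reply format and the retry count are illustrative assumptions.

```python
import socket
import threading

def send_with_ack(sock, payload, msg_id, dest, retries=3, timeout=1.0):
    """Send a datagram and wait for a matching ACK; resend on timeout."""
    frame = msg_id.to_bytes(2, "big") + payload
    sock.settimeout(timeout)
    for _ in range(retries):
        sock.sendto(frame, dest)
        try:
            reply, _ = sock.recvfrom(64)
            if reply == b"ACK" + frame[:2]:
                return True
        except socket.timeout:
            continue  # uplink or ACK lost: try again
    return False

# Loopback demo: a stand-in server that acknowledges one message.
server = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
server.bind(("127.0.0.1", 0))

def ack_once():
    frame, addr = server.recvfrom(256)
    server.sendto(b"ACK" + frame[:2], addr)

t = threading.Thread(target=ack_once)
t.start()
client = socket.socket(socket.AF_INET, socket.SOCK_DGRAM)
delivered = send_with_ack(client, b"grid-status", 7, server.getsockname())
t.join()
print(delivered)  # True
```

On a real NB-IoT link the retry interval would be chosen with the latency figures above in mind, since a downlink ACK may take minutes in poor coverage.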

An NB-IoT module is designed to enter a power save mode, called deep-sleep mode, whenever the network activities allow it, in order to limit the end device power consumption. A module in deep-sleep mode can reach an extremely low current consumption of about 3 µA (at 3.6 V). After the module connects to a base station to send a message, it stays connected to the base station for a period of time after the last communication; this time is based on a network-defined timer called the Radio Resource Control (RRC) release timer. The device then returns to power saving mode as soon as possible (see Figure 12 for more details).<br />

An NB-IoT network presents several timers (the most relevant ones are shown in Figure 12), and all of these contribute to the overall energy balance of an IoT application:<br />
• The RRC release timer defines the time a module remains connected to the network after a Tracking Area Update (TAU) or after an uplink message is sent.<br />
• The timer T3324 is the time window used by the network to send paging events. The NB-IoT module enters power saving mode after T3324 has elapsed.<br />
• The timer T3412 defines the time interval between two consecutive Tracking Area Updates.<br />

Fig. 12. NB-IoT network activities and timers<br />
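The way these timers shape the energy budget can be sketched as a back-of-the-envelope calculation over one uplink cycle: transmit, stay connected until the RRC release timer expires, listen for paging until T3324 expires, then sleep. All current figures and timer values below are illustrative assumptions, not numbers from the NB-IoT specification.

```python
def avg_current_ma(tx_s, rrc_release_s, t3324_s, period_s,
                   i_tx=220.0, i_connected=40.0, i_paging=6.0, i_psm=0.003):
    """Average current in mA over one uplink period of period_s seconds."""
    sleep_s = period_s - tx_s - rrc_release_s - t3324_s
    charge_mas = (tx_s * i_tx + rrc_release_s * i_connected
                  + t3324_s * i_paging + sleep_s * i_psm)
    return charge_mas / period_s

# Two uplinks per day (one every 43200 s): the awake phases dominate the
# budget even though the module sleeps more than 99.8% of the time.
print(round(avg_current_ma(tx_s=5, rrc_release_s=20, t3324_s=60, period_s=43200), 3))  # 0.055
```

Shortening the RRC release timer or T3324 directly shrinks the two largest terms, which is why these network-side settings matter so much for battery life.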

An NB-IoT module is able to send an uplink message whenever requested by the application. In contrast, downlink messages (i.e. those coming from the IoT platform) can be received only while the RRC release timer is running, i.e. after the transmission of an uplink message. To handle this, when a message is sent from the end device to the network, the cloud server knows the module is active and connected, and should then send any pending downlink messages.<br />

An application that needs to send many messages throughout the day should keep the NB-IoT module powered on and set to idle mode, without using PSM or eDRX. This means that when the module becomes active again there is no need to re-attach or re-establish PDN connections; the module is always connected to a base station. This approach limits the signalling operations required to (re)register the module with the network and in turn reduces the application's power consumption.<br />

VI. CONCLUSION<br />

The newly implemented LPWA technologies Narrowband-IoT (NB-IoT or Cat NB1) and LTE Cat M1 combine the advantage of using commonly available infrastructure with low cost, low power consumption, deep in-building penetration and high numbers of simultaneously operating units.<br />



NB-IoT offers advantages in efficient battery operation and in-building penetration, opening the possibility of coverage underground and deep within factory buildings. By design, NB-IoT treats cost-optimized deployment as essential thanks to its “clean slate” approach: NB-IoT does not need the intelligence to coexist with classic 4G traffic, so there is a fair chance of meeting customer cost expectations.<br />

Because LTE Cat M1 technology is much closer to the standard LTE network, it allows applications that need more bandwidth and LTE-like latency. Nevertheless, LTE Cat M1 also provides extensive power saving modes and improved coverage.<br />

In combination with the security concept for the whole communication module, security by design addresses upcoming security requirements for the industrial IoT.<br />

A further aspect of IoT is the remote management of a large and growing category of connected devices: those with limited bandwidth and those viable only at very low production costs. For these types of devices, the Open Mobile Alliance (OMA) has introduced a standard called Lightweight M2M (LwM2M).<br />

u-blox has been involved in early stage trials and proof of<br />

concepts for NB-IoT, LwM2M and LTE Cat M1, covering a<br />

range of use cases.<br />



Enabling firmware updates over LPWANs<br />

Jan Jongboom<br />

Internet Services Group, Arm<br />

Amsterdam, the Netherlands<br />

jan.jongboom@arm.com<br />

Johan Stokking<br />

The Things Industries<br />

Amsterdam, the Netherlands<br />

johan@thethingsindustries.com<br />

Abstract—Firmware updates are essential for large-scale<br />

deployment of connected devices. Security patches protect<br />

customer and business data, and new functionality, optimization<br />

and specialization extend the lifetime of devices. This paper<br />

discusses firmware updates over the most challenging type of<br />

networks: low-power and long-range networks.<br />

Keywords—LPWAN, LoRaWAN, Sigfox, NB-IoT, firmware<br />

updates<br />

I. INTRODUCTION<br />

Most Internet of Things (IoT) devices require both long range and low power consumption, with a battery life that can last years. Traditional wireless network technologies, such as cellular and Wi-Fi, cannot accommodate these needs. To meet the requirements of these devices, new network technologies called “Low-Power Wide Area Networks” (LPWANs) have emerged in the past few years. Networks such as LoRaWAN, Sigfox and NB-IoT are deployed using low-cost radio chips with kilometers of range and low battery consumption.<br />

A downside of these networks is that the data rates are much lower than those of traditional radio networks. Data rates in LPWANs are measured in bits per second, rather than megabytes per second. Additionally, many of these networks operate in the unlicensed spectrum (ISM band), which requires devices to adhere to duty cycle limitations, allowing them to transmit only a fraction of the time while also suffering from interference. These characteristics make it difficult to support firmware updates over the air. As a consequence, devices deployed in the field cannot be updated easily: especially devices deployed in places that are almost impossible to reach, or where the cost of sending a technician is prohibitive for thousands of devices in a variety of places.<br />

Not being able to update the firmware on IoT devices easily is an extreme challenge when deploying at scale. First, software is never 100% secure; we saw many examples of this in 2017. Second, these devices may have to last up to ten years, so keeping them up to date with the latest standards and protocols is important. Lastly, the ability to add functionality or specialize devices throughout their lifetime, from manufacturing and distribution to transfer of ownership or change of purpose, is critical in many business cases.<br />

The key requirements for firmware updates are the abilities to:<br />
1. Send data to multiple devices at the same time (so-called multicast) in a manner that is efficient in terms of power consumption and channel utilization.<br />
2. Recover from lost packets.<br />
3. Verify the authenticity and integrity of the firmware while following standards end-to-end.<br />

This article discusses these challenges one by one and<br />

presents a solution.<br />

II. MULTICAST<br />

Unlike cellular or Wi-Fi, where a device maintains a connection with the network at all times, most LPWANs are uplink oriented: sending data (uplink) is more important than receiving data (downlink). It is only possible to send a downlink message at set times, which LPWANs refer to as RX windows. These RX windows open only shortly after a transmission, which is great for battery life because the device does not need to maintain a connection with the network and can stay in sleep mode as much as possible. LoRaWAN Class A, Sigfox and LTE-M in Power Save Mode (PSM) follow this model.<br />

For sending firmware images, however, this is terrible, because you need downlink-oriented transmission of many packets. With a payload size of 115 bytes taking 615 ms of air time (a typical transmission speed for LoRaWAN [1]), you need to exchange 891 messages to send a 100 KB firmware image. Because of the 1% duty cycle requirement in many markets (including Europe), this takes over 9 hours for a single device, assuming no packet loss. In addition, the gateways may cover hundreds or thousands of devices that are also subject to duty cycle limitations, which means it may take weeks to update a fleet of devices. Finally, every received packet first requires a transmission, which consumes a lot of energy (a typical LPWAN radio draws ~50 mA of current in TX and ~9 mA in RX) and uses a lot of the available spectrum.<br />
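The arithmetic in this paragraph can be checked with a short script; reading 100 KB as 102 400 bytes is an assumption on our part.

```python
import math

# A 100 KB image split into 115-byte fragments at 615 ms air time per
# packet, stretched out by a 1% duty cycle.
IMAGE_BYTES = 100 * 1024
FRAGMENT_BYTES = 115
AIRTIME_S = 0.615
DUTY_CYCLE = 0.01

packets = math.ceil(IMAGE_BYTES / FRAGMENT_BYTES)
airtime_total_s = packets * AIRTIME_S              # ~548 s of pure air time
wallclock_h = airtime_total_s / DUTY_CYCLE / 3600  # duty cycle stretches it out

print(packets)               # 891
print(round(wallclock_h, 1)) # well over 9 hours for a single device
```

The duty cycle, not the air time, dominates: roughly half an hour of radio time turns into the better part of a day per device.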

To enable more efficient firmware update capabilities, you<br />

need to implement two features:<br />




1. A way to send the firmware image without requiring the device to transmit first, optimizing the device's duty cycle and power consumption.<br />
2. Multicast support, for updating multiple devices at the same time, optimizing the gateway duty cycle.<br />

The first step is to get all the devices that you need to update to listen at the same time, on the same frequency, at the same data rate and in the same security session. If you load the same keys onto the devices, all devices can receive and decrypt the same packets as if they were one device. Once you are certain that the devices are listening, you can start broadcasting the firmware image without the devices needing to transmit first. This means that you need to schedule firmware updates, typically hours or days in advance, depending on the sleep behavior of the devices requiring the update.<br />

Because the network can send messages continuously, you can transmit the 891 packets (100 KByte) in about nine minutes (at 600 ms time on air per packet). Multicast firmware updates are only achievable for stationary devices, because the selected gateways and the data rate need to be fixed when you create the multicast session, which can be days before the session starts.<br />

III. NETWORK RELIABILITY<br />

This approach affects network reliability. The network still needs to adhere to the duty cycle limitation of the gateways that send the packets, which means a firmware update can render a gateway useless for relaying downlink messages for a while. One way of mitigating this is to have coverage from multiple gateways and to round-robin [2] between gateways during the update. Also, because most LPWAN gateways are only half-duplex, devices cannot use the frequency that the update uses. This is not such a problem on licensed spectrum (NB-IoT / LTE-M1), or in areas with wide unlicensed spectrum available (U.S.), but it is in Europe, where LPWANs are deployed in limited spectrum (LoRaWAN has only eight channels in the EU). A way of mitigating this is to implement frequency hopping, as Weightless-N does.<br />

Another way is for a technician to drive to the deployment<br />

site with a separate gateway and use it for the update. Although<br />

this seems unnatural, it has some benefits when deploying on a<br />

constrained site. Because the reception is more predictable, you<br />

can use a higher data rate when sending the update, which<br />

causes less congestion on the network. Additionally, the update<br />

does not affect normal gateways. This could be an option for<br />

hard-to-reach sensors. This method is also an option for<br />

LPWANs that have a smaller link budget for downlink<br />

messages than for uplink messages, such as Sigfox.<br />

IV. SECURITY FOR FIRMWARE UPDATE OVER MULTICAST<br />

When instructing multiple devices to join a temporary multicast session in which all the devices share the same session keys, there is a potential security risk if one of the devices is compromised: packet injection. This is possible because most LPWANs use a symmetric authentication mechanism; with the multicast session keys, an attacker can send packets as if they came from the server. While this is a serious issue when using multicast without additional security measures, such as when controlling lights simultaneously, a firmware update mechanism requires three additional measures to secure the update process.<br />

First, once the device receives the file, it calculates the<br />

checksum of the data it received. The device sends this<br />

checksum to the network using its private secure session. The<br />

server compares this checksum with the checksum of the data<br />

that it sent. This check fails if the data has been tampered with.<br />

The server responds whether the checksum is correct to each<br />

device individually, on its private secure sessions.<br />

Second, as part of the server's response to indicate the<br />

correctness of the checksum, the server sends the message<br />

integrity code (MIC), which guarantees data integrity to the<br />

device. No one who does not know the device’s private secure<br />

session keys can forge this MIC: only the device and the server<br />

can calculate the same MIC. So the server checks the device's<br />

checksum, and the device checks the server's MIC,<br />

communicating on the device's private secure session.<br />

Third, when an attacker injects random packets, the device may not be able to reconstruct the original image. To prevent devices from running out of power because they keep listening for error correction packets, as presented in the next section, the multicast session should have a lifetime: a fixed limit on the number of messages. When this limit is reached, the device switches back to its private secure session and power-efficient operating mode, and discards all data.<br />
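The first two measures can be sketched in a few lines. The checksum algorithm (SHA-256) and the MIC construction (HMAC-SHA256, truncated to 4 bytes) are illustrative assumptions; a real LPWAN stack defines its own MIC.

```python
import hashlib
import hmac

def device_checksum(firmware):
    """Checksum the device reports over its private secure session."""
    return hashlib.sha256(firmware).digest()

def server_mic(session_key, checksum):
    """MIC the server returns; only holders of the private session key
    can compute it, so the device can authenticate the response."""
    return hmac.new(session_key, checksum, hashlib.sha256).digest()[:4]

firmware = b"\x00" * 1024          # stand-in for the reassembled image
key = b"per-device-session-key"    # never shared with the multicast group

cs = device_checksum(firmware)
mic = server_mic(key, cs)

# Device side: recompute the MIC and compare in constant time.
print(hmac.compare_digest(mic, server_mic(key, cs)))  # True
```

The crucial point is that both checks run over each device's private session, so a compromised multicast key gains an attacker nothing here.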

V. SENDING LARGE BINARY PACKETS OVER A LOSSY NETWORK<br />

In the scheme proposed above, there is no communication between the device and the network while the multicast transmission is in progress; thus, it is not possible to determine which device received which fragments of the firmware update. This is intentional, to minimize spectrum usage, similar to how UDP works. On LPWAN networks there is often no guaranteed quality of service, and packet loss can occur when the device is moving. To deal with high packet loss, you should implement an error-correcting algorithm that does not require communication from the device to the network. One such algorithm is Low-Density Parity Check coding [3].<br />

In the first step, the network sends the firmware as is, fragmented into packets. Next, the network starts sending error correction packets, which the device XORs with what it has already received. Because the fragments have an increasing frame number, the device knows which fragments are missing and can use the correction packets to reconstruct them. The network keeps sending correction packets until all devices confirm that they have reconstructed all the fragments of the firmware update or, in case of extreme packet loss, until the update server has sent all correction packets. With a good error correction algorithm, you need up to five correction packets to correct for three missed fragments.<br />
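The XOR mechanism can be illustrated in its simplest form: one parity packet formed by XORing a group of fragments lets the device rebuild one missing fragment of that group, with no uplink needed. (Real LDPC schemes use many overlapping groups; this is a deliberately reduced sketch.)

```python
def xor_bytes(a, b):
    return bytes(x ^ y for x, y in zip(a, b))

def parity(fragments):
    """Parity packet covering a group of equal-length fragments."""
    p = bytes(len(fragments[0]))
    for f in fragments:
        p = xor_bytes(p, f)
    return p

fragments = [b"frag-0", b"frag-1", b"frag-2", b"frag-3"]
p = parity(fragments)

# Suppose fragment 2 was lost: XORing the parity packet with the
# surviving fragments yields exactly the missing one.
received = fragments[:2] + fragments[3:]
rebuilt = parity(received + [p])
print(rebuilt == b"frag-2")  # True
```

Because XOR is its own inverse, the device only needs to know which frame numbers are missing; everything else is local computation.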

After the device reconstructs the full firmware, the device<br />

switches back to its private secure session and operating mode.<br />

After successfully testing the device's checksum and the<br />

server's message integrity code as presented above, the device<br />

performs the firmware update.<br />



DELTA UPDATES<br />

To minimize the amount of data sent, it is advisable to implement delta updates, in which the network sends only the changed parts of the firmware image instead of the full image. This can reduce the data by 90%. For constrained devices, it is important to choose a linear patch format, which can be applied using little memory. In addition, it is important to verify the authenticity of the firmware after patching, to avoid bricking devices.<br />
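A naive block-level delta illustrates the idea: send only the blocks that changed, each tagged with its index, so the patch can be applied front to back with very little RAM. Production systems use dedicated diff formats; this sketch assumes images of equal length and a toy block size.

```python
BLOCK = 4  # tiny block size for the example

def make_delta(old, new):
    """List of (block index, new bytes) for every changed block."""
    delta = []
    for i in range(0, len(new), BLOCK):
        block = new[i:i + BLOCK]
        if old[i:i + BLOCK] != block:
            delta.append((i // BLOCK, block))
    return delta

def apply_delta(old, delta):
    """Linear, low-memory application of the delta to the old image."""
    image = bytearray(old)
    for idx, block in delta:
        image[idx * BLOCK:idx * BLOCK + len(block)] = block
    return bytes(image)

old = b"AAAABBBBCCCCDDDD"
new = b"AAAABxBBCCCCDDDD"
d = make_delta(old, new)
print(len(d))                      # only 1 of 4 blocks changed
print(apply_delta(old, d) == new)  # True
```

After patching, the device would still verify the manifest hash of the full reconstructed image, as the next section describes.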

VI. CRYPTOGRAPHIC VERIFICATION OF THE FIRMWARE<br />

The protocol devised above only handles raw data integrity of the firmware: it covers timing and message-level security and accounts for packet loss. However, a good firmware update process also requires additional security on top of the network layer, because hijacking the firmware update mechanism is a big attack vector.<br />

To protect against these attacks, you need to program extra<br />

properties into the end device:<br />

• A public key of the owner who is authorized to update<br />

firmware on the device.<br />

• A manufacturer universally unique identifier (UUID).<br />

• A device type UUID.<br />

The firmware update should contain a manifest consisting of the cryptographic hash of the update and the manufacturer and device type that the update applies to, all signed with the manufacturer's private key. Whenever the device receives the update, it can verify that a trusted authority signed it, and that it was meant for this device, because the device holds the manufacturer's public key.<br />

For the cryptographic verification, we suggest using at least single-curve ECDSA with SHA-256 [4], which can be implemented efficiently on constrained devices while still providing adequate security.<br />
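The manifest and the device-side checks can be sketched as follows. Field names and the UUID derivation are illustrative assumptions, and the ECDSA signature is left as a placeholder comment, since the Python standard library has no ECDSA primitive.

```python
import hashlib
import uuid

MANUFACTURER_UUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, "vendor.example"))
DEVICE_TYPE_UUID = str(uuid.uuid5(uuid.NAMESPACE_DNS, "sensor-v2.example"))

def make_manifest(firmware):
    return {
        "hash": hashlib.sha256(firmware).hexdigest(),
        "manufacturer": MANUFACTURER_UUID,
        "device_type": DEVICE_TYPE_UUID,
        # "signature": ECDSA/SHA-256 over the fields above, made with the
        # manufacturer's private key in a real system
    }

def device_accepts(manifest, firmware):
    """Checks the device performs before applying an update (after the
    manifest signature has been verified with the stored public key)."""
    return (manifest["manufacturer"] == MANUFACTURER_UUID
            and manifest["device_type"] == DEVICE_TYPE_UUID
            and manifest["hash"] == hashlib.sha256(firmware).hexdigest())

fw = b"new firmware image"
m = make_manifest(fw)
print(device_accepts(m, fw))         # True
print(device_accepts(m, fw + b"!"))  # False: tampered image is rejected
```

Binding the hash, manufacturer and device type together in one signed structure is what prevents a valid update for one product from being replayed onto another.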

VII. CONCLUSION<br />

Firmware updates are an essential requirement before devices that use LPWANs for connectivity hit the market in volume. By implementing the requirements in this paper, device manufacturers can ship products while assuring their customers of security updates, new functionality, optimizations and specialization throughout the device's lifetime.<br />

To demonstrate that multicast firmware updates are possible, Arm and The Things Industries have developed a reference implementation on top of LoRaWAN that implements the suggestions from this paper. The result is a secure, fast and efficient method of updating constrained devices (under 32K RAM required) in the field. The reference implementation can be ported to other LPWANs, too, and it is licensed under the permissive Apache 2.0 (device firmware) and MIT (network server) licenses. More information can be found at https://mbed.com/fota-lora.<br />

REFERENCES<br />
[1] https://www.thethingsnetwork.org/forum/t/spreadsheet-for-lora-airtime-calculation/1190<br />
[2] https://en.wikipedia.org/wiki/Round-robin_scheduling<br />
[3] https://en.wikipedia.org/wiki/Low-density_parity-check_code<br />
[4] https://en.wikipedia.org/wiki/Elliptic_Curve_Digital_Signature_Algorithm<br />




Real-Time Position Tracking and Finish<br />

Detection with LoRa<br />

Juan-Mario Gruber<br />

Institute of Embedded Systems (InES)<br />

Zurich University of Applied Sciences (ZHAW)<br />

8401Winterthur, Switzerland<br />

gruj@zhaw.ch<br />

Benjamin Brossi<br />

Institute of Embedded Systems (InES)<br />

Zurich University of Applied Sciences (ZHAW)<br />

8401Winterthur, Switzerland<br />

broi@zhaw.ch<br />

Abstract—For outdoor sport events, it is often necessary to track the position of participants or equipment within a defined area in real time. Global navigation satellite systems (GNSS) can determine the position with great accuracy. In addition, using the LoRa radio technology, the data can be transmitted over a distance of up to 4 km under optimal conditions.<br />

Keywords—GPS; LoRa; Real-Time Position Tracking; Low Energy; Energy Harvesting<br />

I. MOTIVATION<br />

Sports events like sailing regattas often require real-time<br />

position data of the participants. Since the participants are often<br />

spread over a wide area, the data must be able to be transmitted<br />

over a large distance.<br />

Fig. 1. Sailing regatta<br />

The Institute of Embedded Systems at Zurich University of Applied Sciences developed a system that transmits real-time position data from a global navigation satellite system (GNSS) live over a long distance to a base station. The base station displays the data on a map and can define a finish line. The aim is to detect the crossing of the finish line. The LoRa standard is implemented for data transmission. In addition, the system is to be optimized for low energy consumption, so that it can be operated for several hours from rechargeable batteries or energy harvesting. In summary, the following objectives can be defined:<br />
• Real-time GNSS data<br />
• Data transfer via LoRa<br />
• Low energy<br />
• Lowest possible cost<br />
• Base station with real-time visualization and finish detection<br />

II. CONCEPT<br />

The new position tracking system developed at the Institute of Embedded Systems at Zurich University of Applied Sciences measures the position of up to 255 independent objects in a large-scale area. The system consists of multiple position trackers, a central base station and PC software (the Position Tracker Manager).<br />

Fig. 2. System block diagram [1]<br />

The system communicates bidirectionally. This makes it<br />

possible to configure and control the trackers from the base<br />

station. The system can automatically perform position-based<br />

evaluations such as crossing the finish line. The data package is kept small, consisting only of object number, position data and time stamps, because only a limited amount of data per hour may be sent in the LoRa band.<br />

The system implements standard LoRa functionality and uses the 868 MHz band. The position trackers are built with the latest very-low-power components and are optimized for low power consumption, which means they are ready to be powered by energy harvesting.<br />

A. Tracker device<br />

The tracker devices determine time and position data from a GNSS and send them via LoRa. The GNSS module L86-M33 from Quectel is used to determine the time and position; GPS, GLONASS and QZSS can be used with this module. The LoRa module iM880B-L from IMST is used for the data transfer. This is a certified module for wireless communication via the LoRa radio standard in the 868 MHz frequency band. The module features an STM32L151 microcontroller with an ARM Cortex-M3 core and an SX1272 LoRa chip from Semtech.<br />

Fig. 3 shows the software state diagram for the tracker device. When the tracker device is switched on, the required peripheral modules of the microcontroller, the LoRa radio and the GNSS module are initialized first. After initialization, the software waits for a valid time from the GNSS module and then synchronizes the RTC to this time. This may take a few minutes, depending on the quality of the satellite signals. When the RTC has been successfully synchronized, the log and TDMA counters are started and the software changes to the idle state.<br />

Fig. 3. Tracker device state diagram<br />

The software changes to the log state on an event, in which the UART is set to receive the data of the GNSS module. If the data has been successfully received, it is buffered, and the data in the buffer is sent to the base station via LoRa. When transmission is complete, the software switches to receive mode, in which it is possible to receive configuration data from the base station. The reception time window is 30 milliseconds, according to the TDMA protocol. If configuration data is received, the new log interval is set and the log and TDMA counters are restarted.<br />

If a low battery event occurs, the log and TDMA counters are deactivated. An attempt is made to determine a last position within 30 seconds; if successful, this position is transmitted in Low Power Mode. All components are then switched to low power mode in order to consume as little power as possible. The tracker device must then be recharged and restarted with the switch.<br />

For the transmission, the smallest possible log interval and the frequency of data packets had to be determined. In order to increase the reliability of data transmission, the current and previous position data are transmitted during each transmission. It turned out that the best results are achieved with a 3-second log interval and a 9-second transmission interval.<br />
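This redundancy scheme can be sketched as follows: positions are logged every 3 s and transmitted every 9 s, and each packet repeats the fixes from the previous window, so one lost packet loses no data. The packet layout itself is an illustrative assumption, not the actual protocol.

```python
LOG_INTERVAL_S = 3
TX_INTERVAL_S = 9
FIXES_PER_WINDOW = TX_INTERVAL_S // LOG_INTERVAL_S  # 3 fixes per transmission

def build_packet(log, tx_index):
    """Fixes logged since the previous transmission plus the window before."""
    start = max(0, (tx_index - 1) * FIXES_PER_WINDOW)
    return log[start:(tx_index + 1) * FIXES_PER_WINDOW]

positions = list(range(12))                      # four windows of three fixes
packets = [build_packet(positions, i) for i in range(4)]

received = packets[:2] + packets[3:]             # packet 2 lost over the air
recovered = sorted({fix for pkt in received for fix in pkt})
print(recovered == positions)  # True: neighbors repeated the lost fixes
```

Doubling each fix roughly doubles the payload but tolerates the loss of any single packet, which matches the field observation that a 3 s / 9 s combination gives the best results.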

Fig. 4. Tracker device<br />

B. Base station<br />

An STM32L152 Nucleo Development Kit from STMicroelectronics is used for the base station. It has an ARM Cortex-M3, the same controller as on the LoRa module iM880B-L from IMST. The LoRa shield SX1272MB2DAS from Semtech is used for the radio connection. The shield is compatible with the Development Kit and is connected via SPI.<br />

When the base station is switched on, it is in the idle state. If data is received via the UART, an interrupt is triggered and the system checks whether the data is valid. If so, the new log interval is saved so that it can be configured via LoRa when data is received from the corresponding tracker device.<br />

The system communicates bidirectionally, so if data has been received, the software switches to transmit mode, in which the new log interval is sent to the tracker device if one has to be configured. If there is no new log interval, the software switches to the data processing mode, where the data received via LoRa is stored. In this state the times of the positions are reconstructed, assigned to the corresponding tracker device ID and stored. The system then uses the time to check whether the position has already been received; if not, the position is sent via the UART interface. The software then changes back to the idle state.<br />
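Because each tracker packet repeats earlier positions, the base station has to filter duplicates before forwarding over UART. A minimal sketch of such a filter, keyed on the (tracker ID, timestamp) pair, might look like this; the data layout is an illustrative assumption.

```python
class PositionStore:
    """Forward each (tracker, timestamp) position exactly once."""

    def __init__(self):
        self.seen = set()

    def accept(self, tracker_id, timestamp, position):
        """Return True (and remember the key) only for unseen positions."""
        key = (tracker_id, timestamp)
        if key in self.seen:
            return False          # already forwarded via UART
        self.seen.add(key)
        return True

store = PositionStore()
print(store.accept(1, 1000, (47.5, 8.7)))  # True: new position
print(store.accept(1, 1000, (47.5, 8.7)))  # False: repeated in next packet
print(store.accept(2, 1000, (47.5, 8.8)))  # True: different tracker
```

Keying on time rather than on the coordinates themselves means a stationary tracker reporting the same position twice is still logged twice, as the text requires.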



C. PC software (Position Tracker Manager)<br />
The Position Tracker Manager is connected to the base<br />
station via a virtual COM port over USB; the data is<br />
transferred to the computer via UART. The Position Tracker<br />
Manager evaluates the data and displays the tracker devices on<br />
an embedded map. In the software, two trackers can be defined<br />
as the start and end points of the finish line. The software<br />
automatically calculates the distance to the finish line and<br />
detects when it is crossed. The transferred raw data and the<br />
ranking list can be exported from the software into a CSV file<br />
for further use.<br />
Fig. 5. Position Tracker Manager<br />
III. ENERGY CONSUMPTION<br />
The power supply of the tracker device is ensured by a Li-<br />
Ion battery. Additionally, a charging circuit with an integrated<br />
buck converter and a 3.3 VDC output is used to charge this<br />
battery. The charging circuit is supplied with 5 VDC from a<br />
plug-in power supply unit.<br />
To determine the current consumption of the LoRa module<br />
iM880B-L, the following values of the current consumption<br />
are taken from the data sheet [2]:<br />
Current consumption iM880B-L<br />
State: Current<br />
Idle: 5 mA<br />
Transmit: 90 mA<br />
Receive: 11.22 mA<br />
Low Power Mode RTC ON: 1.85 uA<br />
Low Power Mode RTC OFF: 0.8 uA<br />
In the worst case, it is assumed that the module transmits<br />
for a maximum of 36 seconds per hour at 14 dBm. This<br />
corresponds to the maximum permitted values issued by the<br />
Federal Office of Communications (OFCOM). The time in<br />
the receiving state is shorter than that of the sending process<br />
and in this case is specified as 10 seconds per hour. For the<br />
remaining time the module is in Low Power Mode RTC ON.<br />
These values result in an average current of 0.933 mA.<br />
To calculate the total power consumption of the tracker<br />
device, the current consumptions of the individual components<br />
are added together. Based on the capacity of the battery used, a<br />
battery life of 16.6 h can be estimated [1].<br />
Current consumption tracker device<br />
Component: Current<br />
Quectel L86-M33: 26 mA [3]<br />
iM880B-L: 0.933 mA<br />
LEDs: 2 mA<br />
LTC4080: 1.9 mA [4]<br />
Total: 30.833 mA<br />
Estimated battery life: 500 mAh / 30.1 mA = 16.6 h<br />
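The duty-cycle arithmetic can be reproduced in a few lines. Values are taken from the paper's two tables; the receive figure is interpreted as 11.22 mA, which reproduces the stated 0.933 mA average:

```python
# Check of the duty-cycle arithmetic above (values from the iM880B-L and
# tracker-device tables; only the arithmetic itself is new here).
TX_MA, RX_MA, SLEEP_MA = 90.0, 11.22, 1.85e-3   # transmit, receive, LPM RTC ON
tx_s, rx_s, hour_s = 36.0, 10.0, 3600.0          # worst-case seconds per hour

avg_lora = (TX_MA * tx_s + RX_MA * rx_s
            + SLEEP_MA * (hour_s - tx_s - rx_s)) / hour_s
total = 26.0 + avg_lora + 2.0 + 1.9              # GNSS + LoRa + LEDs + LTC4080

print(round(avg_lora, 3))   # 0.933 mA, as stated in the paper
print(round(total, 3))      # 30.833 mA total draw
```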

IV. TEST RESULTS<br />

Field tests with sailing ships have shown that the device<br />

trackers work reliably. The positions are resolved with an<br />

accuracy of a few meters or less and the data transfer works in<br />

good conditions up to 3 km. The battery life of over 16 hours<br />

is sufficient for most sports and competitions. It should also be<br />

possible to operate the device trackers with an outdoor solar<br />

panel. The device trackers are ideal for use in open terrain. In<br />

built-up areas, both the accuracy of the position and the<br />

maximum transmission distance are reduced.<br />

V. CONCLUSION AND OUTLOOK<br />

It has been shown that it is possible to operate the device<br />

trackers with a very small energy budget. The accuracy of<br />

position and transmission distance is sufficient for most<br />

applications.<br />

In future, the base station will be replaced by a Raspberry<br />

Pi with a LoRa hat. The Raspberry Pi should be able to<br />

perform the data evaluation without the need for a PC. In<br />

addition, a web server and a JavaScript application should<br />

replace the Position Tracker Manager.<br />

REFERENCES<br />
[1] T. Eigenmann, R. Gubler, Echtzeit-Positionstracker mit LoRa (Bachelor<br />
Thesis, advisor J. Gruber), Zurich University of Applied Sciences, 2017<br />

[2] iM880B Datasheet, v1.3, IMST GmbH Wireless Solutions, 2016<br />

[3] L86 Hardware Design, Rev. V1.0, Quectel Wireless Solutions, 2014<br />

[4] LTC4080 Datasheet, Rev C, Linear Technology, 2015<br />



Battery-Free Wireless Sensors<br />

Enabling the growth of the Internet of Things with a unique sensor architecture<br />

Greg Rice<br />

Technical Marketing and Applications Manager<br />

ON Semiconductor<br />

Protection and Signals Division<br />

Phoenix, Arizona, USA<br />

greg.rice@onsemi.com<br />

Abstract—The Internet of Things (IoT) is a phrase used to<br />

describe a network of connected devices that send data back<br />

and forth. Sensors are used to create data that is used within the<br />

IoT, typically consisting of a sensing block, a power block, and a<br />

processing block. This paper presents a new sensor design that<br />

untethers the sensing block from the power and processing<br />

blocks, resulting in a battery-free, wireless sensor that<br />

complements traditional sensor designs.<br />

Keywords—IoT, Internet of Things, battery-free, wireless<br />

sensor, RFID, energy harvesting, temperature sensor, moisture<br />

sensor, passive sensor<br />

I. INTRODUCTION<br />

The Internet of Things (IoT) is a phrase that is widely used<br />

when referring to emerging and growing technologies.<br />

Although the IoT is a common term, the definition of what the<br />

IoT means will change depending on how a particular person<br />

interacts with the IoT. For the purposes of this paper, the IoT<br />

is defined as a network of connected electronic devices, where<br />

data is transferred back and forth through a standard<br />

communications interface. The communication between the<br />

devices that comprise the IoT commonly takes place across a<br />

wireless interface, and will often incorporate a connection to<br />

the cloud for data storage and processing.<br />

For many people, their first experience with the IoT occurred<br />

when personal pagers were used to send basic messages across<br />

a wireless network. When smartphones were widely<br />

introduced approximately 10 years ago, the IoT began to<br />

evolve into something that almost every household was<br />

exposed to. Today, many homes include connected devices<br />

such as security cameras and smart thermostats that enable<br />

homeowners to monitor and control their home from<br />

thousands of miles away. Ultimately, data is being sent from<br />

one device to another across the IoT. The data being sent can<br />

include basic communication to send messages from person to<br />

person, and the data can also include information about the<br />

physical condition of something such as temperature or<br />

moisture. In many cases, some type of sensor is used to<br />

convert a physical parameter such as temperature or moisture<br />

into electronic data that can be sent across the IoT.<br />

II. SIZE AND GROWTH PREDICTIONS FOR THE IOT<br />

The estimates for the size and projected growth of the IoT will<br />

vary depending on who is providing the estimate, the rate of<br />

adoption for new IoT technologies, and how quickly the IoT<br />

ecosystem can be built and expanded. According to IHS,<br />

there are currently approximately 15 billion connected devices<br />

in the world, and the number of connected devices is expected<br />

to expand to 31 billion in 2020 and 75 billion in 2025. Other<br />

estimates project up to 200 billion connected devices in the<br />

next 10 years. Regardless of the final number of connected<br />

devices in the coming years, the projected growth is expected to<br />

be exponential.<br />

What is expected to fuel the tremendous growth in the IoT?<br />

Historically, consumer technology such as smartphones,<br />

personal computers, and tablet computers have been the<br />

primary drivers in technology growth. With the IoT, growth is<br />

anticipated to come from non-traditional applications.<br />

Fig. 1. IoT Growth is expected to come from multiple industries<br />

www.embedded-world.eu<br />



Connected cars are expected to double as autonomous driving<br />

and other safety features are included in new vehicles. The<br />

number of connected devices within a typical home is projected<br />

to grow from approximately 9 connected devices per home in<br />

2017, to approximately 500 devices per home in 2025.<br />

Additional applications such as healthcare, smart cities, and<br />

digital farming will also contribute to the tremendous growth<br />

of the IoT in the coming years, driving the global data traffic<br />

from approximately 2 Exabytes per day in 2017 to over 120<br />

Exabytes per day. To support the increased demand for data,<br />

new technologies need to be developed which may replace or<br />

complement existing technology, particularly when it comes<br />

to electronic sensors.<br />

III. SENSOR ARCHITECTURES<br />

Within an IoT network, there are multiple sensors used at the<br />

edge of the network to convert a physical parameter such as<br />

temperature or pressure into electronic data. Traditionally,<br />

electronic sensors are used to perform this function within the<br />

IoT. A traditional electronic sensor can be viewed as three<br />

primary technology blocks.<br />

A. Traditional Electronic Sensor Architecture<br />

At the core of the sensor design is the sensing element. The<br />

sensing element is something that reacts to the physical<br />

environment around the sensor. For a temperature sensor, the<br />

sensing element is something that has a predictable change in<br />

response to the temperature of the element. For a gas sensor,<br />

the sensing element will change in the presence of gas, etc.<br />

A power supply is needed for electronic sensors, and is<br />

typically designed into each sensor to provide stable voltage<br />

and current to power the circuits that comprise the sensor.<br />

Power can be supplied either through a wired AC power<br />
connection or, in some cases, through DC power provided by<br />
batteries.<br />

In addition to power and sensing, a traditional electronic<br />

sensor also incorporates a block to perform data processing<br />

and connectivity. The data processing is used to control the<br />

power and sensing sections of the sensor, and also to provide<br />

communication of the electrical data that correlates to the<br />

physical parameter that is being monitored by the sensor.<br />

It is common for the sensing, power, and data processing<br />

elements to be included in every electronic sensor. This<br />

approach works well for many applications where sensors are<br />

used, but for certain applications the number of components<br />

required for each technology block results in limitations on the<br />

physical size of a sensor as well as the cost to scale when<br />

multiple sensor nodes are required for a system. In some<br />
applications, these constraints have limited the number of<br />
sensors deployed; more would be installed if another sensor<br />
technology were available.<br />

Fig. 2. Block diagram of a traditional electronic sensor<br />

B. Battery-Free Wireless Sensor Architecture<br />

As a complement to traditional electronic sensors, a sensor<br />

architecture has been developed that separates the sensing<br />

element from the power and data processing blocks required<br />

for a complete sensor system. This approach results in an<br />

ecosystem that incorporates shared power and data processing<br />

in a single system, along with wireless, battery free sensors<br />

that are designed in small form factors, with a low cost to<br />

scale when multiple sensing nodes are needed. The Smart<br />

Passive Sensors™ (SPS) are able to sense parameters<br />

including temperature, moisture, pressure, and proximity with<br />

additional functionality in development.<br />

Fig. 3. Smart Passive Sensor block diagram<br />

Using smart passive sensors, each sense node is designed to<br />

operate in conjunction with a sensor hub, which provides a<br />

wireless interface to the sensor using the standard RAIN UHF<br />

RFID protocol and also incorporates connectivity to the IoT.<br />

With this approach, the sensors are designed to be powered by<br />

harvesting RF energy supplied by the sensor hub, and<br />

to communicate sensor information to the hub wirelessly. One<br />

sensor hub can communicate with a large number of sensor<br />

nodes, provided that the sense nodes are within range of the<br />

RF antenna from the sensor hub. Read ranges of up to 10m<br />

have been achieved in ideal conditions, where typical<br />

applications support a range of 3-5m between the sensor and<br />

the RF antenna. Figure 4 shows the IoT architecture using<br />

SPS and a connected Sensor Hub.<br />



Fig. 4. IoT Architecture using SPS and Sensor Hub<br />
IV. PRACTICAL APPLICATIONS<br />
A. Industrial Predictive Maintenance<br />
Monitoring the condition of equipment within industrial<br />
factories is critical in order to manage equipment maintenance,<br />
reduce factory downtime, and avoid physical injuries to<br />
factory personnel. Busbars within power switchgear are used<br />
to transfer thousands of watts of power throughout a facility.<br />
The busbars connect the inputs and outputs of high-current<br />
circuit breakers that are capable of passing thousands of<br />
amperes of current through three phases. If a connection to<br />
a high-current busbar becomes corroded or loose, the<br />
increased resistance in the connection will result in a<br />
temperature rise on the busbar. If not corrected, a degraded<br />
high-power busbar connection can result in catastrophic failure<br />
due to arc flash. An arc flash can cause significant damage to<br />
factory equipment and buildings, and physical injury or death<br />
to humans. Due to the high current within power switchgear,<br />
wired sensors cannot be used to monitor busbar temperature,<br />
and battery-powered sensors are not desired due to the<br />
physical injury risk associated with replacing batteries within<br />
a sensor.<br />
Fig. 5. Smart Passive Sensors installed in power switchgear<br />
For this application, using smart passive sensor technology to<br />
monitor individual busbar temperatures within power<br />
switchgear is a good fit. The sensor hub can be used to<br />
aggregate data from multiple temperature sensors, and send<br />
the data through a MODBUS interface for integration into<br />
standard facility SCADA management software.<br />
B. Smart Healthcare<br />
As healthcare facilities become more connected, the need for<br />
advanced sensing capabilities grows. Smart passive sensors<br />
can be used to indicate whether a patient is in a hospital bed,<br />
automatically notifying nursing staff if the patient is<br />
unexpectedly out of the bed, which could suggest that the<br />
patient has fallen and requires immediate assistance.<br />
Fig. 6. Smart Passive Sensors used for occupant detection in hospital bed<br />
Passive sensors can also be used to monitor fluid levels within<br />
a hospital room, such as the amount of fluid in an IV or<br />
catheter bag. Moisture sensing can also be integrated into bed<br />
linens and hospital garments to automatically and<br />
unobtrusively detect incontinence events without the need for<br />
manual monitoring. All of these functions are performed<br />
effectively and in a cost-efficient manner, resulting in an<br />
improved overall experience for medical patients and staff.<br />
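As an illustration of the kind of logic a hub-side host application might run on aggregated readings, consider the busbar case from the predictive-maintenance application. The function name, sensor IDs, and the 15 °C threshold below are assumptions for illustration, not part of the SPS product:

```python
# Illustrative only: flag a possible degraded busbar connection when one
# phase runs hot relative to the others. Names and the 15 degC threshold
# are assumptions, not part of the SPS product.
def check_busbars(temps_c, delta_limit_c=15.0):
    """temps_c: {sensor_id: temperature in degC} read from the sensor hub."""
    baseline = min(temps_c.values())          # coolest phase as reference
    return {sid: t for sid, t in temps_c.items()
            if t - baseline > delta_limit_c}  # candidates for inspection

alerts = check_busbars({'L1': 41.0, 'L2': 62.5, 'L3': 43.0})
print(alerts)   # {'L2': 62.5}
```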

C. Digital Farming<br />

Smart Passive Sensor technology can also be used in farming<br />

and agriculture applications. For livestock management,<br />

passive wireless temperature sensors can be used to monitor<br />

the health of cattle and pigs. As the livestock industry moves<br />

to reduce the amount of preventative medication that is<br />

administered to animals, it is expected that there will be an<br />

increase in illness for livestock. If certain illnesses are not<br />

detected early, the disease can spread to multiple animals and<br />

can result in increased cost to manage the illness and death in<br />

some cases. In addition, temperature sensing can be used to<br />

monitor the temperature of breeding animals to better predict<br />

female ovulation in pigs and cows and improve breeding<br />

efficiency.<br />



Finally, mobile sensor hubs can be integrated into drones and<br />
flown over crop fields to monitor the soil moisture<br />
throughout a farm and optimize watering for different sections<br />
within a farm.<br />
Fig. 7. Smart Passive Sensor used to monitor livestock temperature<br />
V. CONCLUDING REMARKS<br />
As the Internet of Things continues to grow, new sensor<br />
technologies must also be developed to enable billions of new<br />
types of devices to connect to the IoT. A new sensor<br />
architecture has been developed that complements traditional<br />
electronic sensors. The new smart passive sensor technology<br />
untethers the sensing element from the power and<br />
communication blocks needed for traditional sensors, resulting<br />
in a shared power and communication function in a sensor hub<br />
and distributed wireless, passive sensing nodes. Practical<br />
applications for this technology include industrial predictive<br />
maintenance, smart healthcare, and digital farming, among<br />
others. Additional information regarding this sensor system<br />
and associated applications is available on request.<br />
ACKNOWLEDGMENT<br />
Thanks to RF Micron for developing the smart passive sensor<br />
technology and for their continued collaboration. Magnus-S<br />
technology was invented by and is owned by RF Micron, Inc.<br />

REFERENCES<br />

[1] Ali Abedi, “Battery Free Wireless Sensor Networks: Theory and<br />

Applications,” Proceedings of IEEE’s 2014 Int. Conf. on Computing,<br />

Networking and Communications<br />

[2] I. Zalbide, et al., “Battery-free wireless sensors for industrial<br />
applications based on UHF RFID technology,” Proceedings of the<br />
IEEE Sensors 2014 Conference, Spain, 2014<br />

[3] AND9213: “Reading Battery Free Wireless Sensors,” application note,<br />

ON Semiconductor<br />

[4] AND9211: “Battery Free Wireless Sensor Measurements,” application<br />

note, ON Semiconductor<br />

[5] http://www.rfmicron.com/<br />

292


Understanding Advanced Bluetooth Angle Estimation<br />

Techniques for Real-Time Locationing<br />

Sauli Lehtimäki<br />

Silicon Labs<br />

Espoo, Finland<br />

Abstract — Bluetooth Angle of Arrival (AoA) and Angle of<br />

Departure (AoD) are techniques used for real-time locationing.<br />

These techniques are relatively new concepts in Bluetooth. The<br />

basic idea behind these techniques is to measure the phase<br />

differences between received radio frequency signals and<br />

numerically compute AoA or AoD based on these differences. By<br />

using the angle readings, it is possible to build systems that track<br />

people, mobile devices and other assets, usually in indoor<br />

environments. These new techniques can enhance the utility and<br />

functionality of Bluetooth beaconing applications. Antenna arrays<br />

and angle-of-arrival algorithms play a significant role in properly<br />

functioning Real-Time Locationing Systems (RTLS).<br />

Keywords— Angle estimation; Bluetooth; Direction Finding;<br />

Indoor locationing; Real-Time Locationing; RTLS<br />

I. INTRODUCTION<br />

Locationing technologies have many useful applications,<br />

one example being GPS, which is widely used all over the world.<br />

Unfortunately, GPS does not work very well indoors, so there is<br />

a real need for better indoor positioning technologies. Our goal<br />

is to track the locations (or angles) of individual objects with an<br />

external tracking system or for a device to track its own location<br />

in an indoor environment. This kind of locationing system can<br />

be used to track assets in a warehouse or people in a shopping<br />

mall, or people can use locationing for their own wayfinding.<br />

Bluetooth Angle of Arrival and Angle of Departure are new<br />

technologies that establish a standardized framework for indoor<br />

locationing. In these technologies, the fundamental problem of<br />

locationing comes down to solving for the arrival and departure<br />

angles of radio frequency signals. In this paper, we explain the<br />

basics of these technologies and give some theory for estimating<br />

direction of arrival. Currently the Bluetooth AoA/AoD<br />

specifications are in a mature state but not yet public. Because<br />

of this, this paper will only cover the general concepts without<br />

going into the details of the specification. Finally, we will<br />
briefly compare these with two other locationing technologies.<br />

II. BLUETOOTH AOA AND AOD<br />

A. AoA<br />

Let's consider a receiver device with a multi-antenna linear<br />
array and a transmitter device with a single antenna.<br />

Also, assume that the radio wave travels as a planar wave front<br />

rather than spherically, which we can safely assume when<br />

looking from a distance. If the transmitter, which is sending a<br />

sine wave through the air, lies on the normal line perpendicular<br />

to the array line, every antenna (channel) of the array will see<br />

the incoming signal in the same phase. If the transmitter does not<br />

lie on the normal line, then the receiving antennas will see phase<br />

differences between the channels. This phase difference<br />

information can be used to calculate the angle of arrival. In<br />

practice, the receiver device will need to have multiple ADC<br />

channels or use an RF switch to be able to take samples from<br />

each individual channel. The samples are called “IQ-samples”<br />

since a sample pair of “In-phase” and “Quadrature-phase”<br />

readings is taken from the same input signal. These samples<br />

have a 90 degree phase difference in the sampling. When this<br />

pair is considered to be a complex value, each complex value<br />

contains both phase and amplitude information and can be an<br />

input for the arrival angle estimation algorithm.<br />
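Treating an I/Q pair as one complex sample can be illustrated in two lines; the numeric values below are made up:

```python
import cmath

# An I/Q pair interpreted as one complex sample: the magnitude is the
# signal amplitude, the argument is its phase (values here are made up).
i, q = 0.70, 0.70
z = complex(i, q)
amplitude = abs(z)
phase_rad = cmath.phase(z)          # phase usable by an AoA estimator
print(round(phase_rad, 3))          # 0.785 rad (45 degrees)
```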

Radio waves travel at the speed of light, approximately<br />
300,000 km/s. When using frequencies around 2.4 GHz, the<br />

corresponding wavelengths are about 0.125 m. The maximum<br />

distance between two adjacent antennas for most estimation<br />

algorithms is a half wavelength. Many algorithms require this;<br />

otherwise, we get effects similar to aliasing. There is no<br />

theoretical minimum distance limitation, but, in practice, the<br />

minimum size is limited by the mechanical dimensions of the<br />

array plus, for example, mutual coupling between the antenna<br />

elements.<br />

B. AoD<br />

For Angle of Departure, the fundamental idea of measuring<br />

phase differences is the same, but device roles are swapped. In<br />

AoD, the device being tracked uses only one antenna, and the<br />

transmitter devices use multiple antennas. The transmitter<br />

device sequentially switches the transmitting antenna, and the<br />

receiving side knows the antenna array configuration and<br />

switching sequence.<br />

When considering this from an application point of view, we<br />

can see a clear difference between these two techniques. In AoD,<br />

the receiving device is able to calculate its own position in space<br />

using angles from multiple beacons and their positions (by<br />

triangulation). In AoA, the receiving device tracks arrival angles<br />

for individual objects. Still, it is good to note that different<br />



combinations of these can be performed; so, these techniques do<br />

not limit what can be done at the application level. Both in<br />

Bluetooth AoA and AoD, the AoA/AoD related control data is<br />

sent over a traditional data channel. Typically, these techniques<br />

can achieve a couple of degrees angular accuracy and around 0.5<br />

m locationing accuracy, but these figures are highly dependent<br />

on the implementation of the locationing system.<br />

III. CHALLENGES<br />

One of the biggest and perhaps most obvious challenges in<br />

this subject is answering the question: “How are angle estimates<br />

calculated based on the sample data?” It is not enough that we<br />

are able to calculate angle estimates in an ideal environment; we<br />

must also be able to calculate them in environments with very<br />

heavy multi-path in which signals are highly correlated or<br />

coherent. By a coherent signal, we mean a signal that is a delayed<br />

and scaled version of some other signal. This can be the case<br />

when radio waves are reflected from walls, for example.<br />

Other challenges include signal polarization. In most cases,<br />

we cannot control the polarization of the mobile device, so the<br />

system has to take this into account. Also signal noise, clock<br />

jitter and signal propagation delays add their own variables to<br />

the problem. Depending on the system scale, the RAM and<br />

especially CPU requirements can be demanding for an<br />

embedded system. Many of the well performing angle<br />

estimation algorithms require a significant amount of processing<br />

power from the CPU.<br />

In the next section, we will cover some theory on antenna<br />

arrays and angle of arrival estimations. Angle of departure can<br />

be derived from the angle of arrival theory.<br />

IV. ANGLE OF ARRIVAL THEORY<br />

Angle estimation methods and antenna arrays are essential<br />

for the locationing system to work properly. The history of<br />

direction finding theory goes back over 100 years, to when the first<br />

attempts to solve this problem were made using directional<br />

antennas and, obviously, purely analog systems. In the years<br />

following, test methods moved to the digital world, but the basic<br />

principles are still quite the same. These direction-finding<br />

methods are already used in many applications, such as medical<br />

equipment, security and military devices. In this section, we will<br />

discuss the basics of some typical antenna arrays and estimation<br />

algorithms. By direction finding, we refer to the general problem<br />

of estimating arrival and departure angles.<br />

A. Antenna Arrays<br />

Antenna arrays for direction finding are usually divided into<br />

categories. The most common ones discussed here are Uniform<br />

Linear Array (ULA), Uniform Rectangular Array (URA) and<br />

Uniform Circular Array (UCA). The linear array is a<br />
one-dimensional array, meaning that all the antennas in the array lie<br />

on a single line, whereas the rectangular and circular arrays are<br />

two-dimensional arrays, meaning that the antennas are spread in<br />

two dimensions (on a plane). By using a one-dimensional<br />

antenna array, one can reliably measure only the azimuth angle,<br />

assuming the tracked device moves consistently on the same<br />

plane. Furthermore, with two-dimensional arrays, one can<br />

reliably measure both azimuth and elevation angles in the 3D<br />

half-space. If the array is extended to a full 3D array (antennas<br />

spread on all three Cartesian coordinates), then we will be able<br />

to measure the full 3D space.<br />

Designing an antenna array for direction finding is not a<br />

straightforward task. When antennas are placed in an array, they<br />

start affecting each other’s responses; this is called mutual<br />

coupling. We also have to keep in mind that, in most cases, we<br />

cannot control the polarization of the transmitting end. This<br />

creates an additional challenge for the designer. In IoT<br />

applications, the devices are often expected to be small and even<br />

work in very high frequency bands. Estimation algorithms often<br />

require certain properties of the array. For example, the<br />
estimation algorithm ESPRIT relies on the mathematical<br />
assumption that the array can be divided into two identical<br />
subarrays [3].<br />

B. Angle Estimation Algorithms<br />

Let's look at the mathematical/algorithmic problem of<br />

estimating the angle of arrival based on the input IQ-data. The<br />

problem definition itself is simple: “Estimate the arrival angle of<br />

an emitted (narrowband) signal arriving at the receiving array”.<br />

While the problem statement sounds very trivial, a robust<br />

solution (that works in real life) for this problem is not easy and<br />

can require much processing power from the hardware.<br />

Next, we will present two different approaches for solving<br />

this problem. The first one is a basic one and is called a classical<br />

beamformer. The second is a more advanced technique called<br />

Multiple Signal Classification (MUSIC). We will not go through<br />

proofs of any theorems or reasons why these methods work but<br />

rather give a high-level view of how the algorithms work.<br />

Deeper studies about these estimation algorithms can be found<br />

from [1] and [2].<br />

Classical Beamformer<br />

Let's begin with a mathematical model of a uniform linear<br />

array. We are given a data vector of IQ-samples for each<br />

antenna. Let this vector be called x. Now, there is a phase shift<br />

seen by each antenna (which can be 0) plus some noise, n, in the<br />

measurements, so x can be written as a function of time t:<br />

x(t) = a(θ)s(t) + n(t), (1)<br />

where s is the signal sent over the air, and a is the steering<br />

vector of the antenna array:<br />

a(θ) = [1, e^(−j2πd sin(θ)/λ), ..., e^(−j2π(m−1)d sin(θ)/λ)], (2)<br />

where d is the distance between adjacent antennas; λ is the<br />

wavelength of the signal; m is the number of elements in the<br />

antenna array, and θ stands for the angle of arrival.<br />

Steering vector (2) describes how signals on each antenna<br />

are phase shifted because of the varying distances to the<br />

transmitter. By using (1), we can calculate an approximation of<br />

the so-called sample covariance matrix, R_xx, by calculating<br />
R_xx ≈ (1/N) Σ_{t=1}^{N} x(t) x^H(t), (3)<br />

where H stands for the Hermitian transpose of a matrix.<br />

The sample covariance matrix (3) will be used as an input<br />

for the estimator algorithm as we will see.<br />

The idea of the classical beamformer is to maximize the<br />

output power as a function of the angle, similar to how a<br />

mechanical radar works. If we attempt to maximize the power,<br />

we end up with the next formula:<br />

P(θ) = (a^H(θ) R_xx a(θ)) / (a^H(θ) a(θ)). (4)<br />

Now, to find the arrival angle, we loop over candidate<br />
angles θ and find the maximum of the output power P. The<br />
angle θ that produces the maximum power corresponds to the<br />
angle of arrival. While this approach is quite simple, its accuracy<br />

is not generally very good. So, let's introduce another method,<br />

which is a bit better in terms of accuracy. See, for example, [4]<br />

for an algorithm accuracy comparison.<br />
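The classical beamformer of equations (1)-(4) can be sketched for a ULA as follows; the synthetic scenario (antenna count, source angle, noise level) is an assumption for illustration:

```python
import numpy as np

def steering(theta, m, d_over_lam):
    # Steering vector (2) for an m-element uniform linear array.
    return np.exp(-2j * np.pi * np.arange(m) * d_over_lam * np.sin(theta))

def beamformer_spectrum(X, d_over_lam, thetas):
    # X: m x N matrix of IQ snapshots. Sample covariance (3), power (4).
    R = X @ X.conj().T / X.shape[1]
    P = []
    for th in thetas:
        a = steering(th, X.shape[0], d_over_lam)
        P.append(np.real(a.conj() @ R @ a) / np.real(a.conj() @ a))
    return np.array(P)

# Synthetic check: one source at 20 degrees, 8 antennas, lambda/2 spacing.
rng = np.random.default_rng(0)
m, N, true_th = 8, 200, np.deg2rad(20.0)
s = np.exp(1j * 2 * np.pi * rng.random(N))            # unit-power signal
X = np.outer(steering(true_th, m, 0.5), s)
X += 0.05 * (rng.standard_normal((m, N)) + 1j * rng.standard_normal((m, N)))

thetas = np.deg2rad(np.linspace(-90, 90, 361))
est = np.rad2deg(thetas[np.argmax(beamformer_spectrum(X, 0.5, thetas))])
print(round(est, 1))   # close to 20.0
```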

MUSIC (Multiple Signal Classification)<br />

One type of estimation algorithm is the so-called subspace<br />

estimator, and one popular algorithm of that category is called<br />

MUSIC (Multiple Signal Classification). The idea of this<br />

algorithm is to perform an eigendecomposition of the<br />
covariance matrix R_xx:<br />
R_xx = V A V^(−1), (5)<br />

where A is a diagonal matrix containing the eigenvalues and V<br />
contains the corresponding eigenvectors of R_xx. Assume we<br />

are trying to estimate the angle of arrival for one transmitter with<br />

an n-antenna linear array. It can be shown that the eigenvectors<br />
of R_xx belong either to the so-called noise subspace or to the<br />
signal subspace. If the eigenvalues are sorted in ascending order,<br />
the eigenvectors corresponding to the n − 1 smallest eigenvalues<br />
span the noise subspace,<br />

which is orthogonal to the signal subspace. Based on the<br />

orthogonality information, we can calculate the pseudo spectrum<br />

P:<br />

P(θ) = 1 / (a^H(θ) V V^H a(θ)), (6)<br />
where V here contains the noise-subspace eigenvectors.<br />

As in a classical beamformer, we loop through the desired<br />

values of θ and find the maximum peak value of P, which<br />

corresponds to the angle of arrival (the argument θ) we wish to<br />

measure.<br />

In the ideal case, MUSIC has very good resolution and accuracy in a high-SNR environment. On the other hand, its performance degrades when the input signals are highly correlated, which is especially the case in indoor environments. Multipath effects distort the pseudo-spectrum, causing it to have maxima at the wrong locations. More information about the conventional beamformer and MUSIC estimators can be found in [3].<br />
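As a sketch, the pseudo-spectrum scan can be written in a few lines of NumPy. The uniform linear array, half-wavelength element spacing and the simulated source below are illustrative assumptions, not details taken from this paper:<br />

```python
import numpy as np

def music_spectrum(snapshots, n_ant, n_src=1, d=0.5):
    """MUSIC pseudo-spectrum for a uniform linear array.

    snapshots: (n_ant, n_samples) complex received samples.
    d: element spacing in wavelengths (0.5 = half wavelength).
    """
    angles = np.linspace(-90.0, 90.0, 361)
    # Sample covariance matrix R_xx
    R = snapshots @ snapshots.conj().T / snapshots.shape[1]
    # Eigendecomposition; eigh returns eigenvalues in ascending order
    eigvals, eigvecs = np.linalg.eigh(R)
    Vn = eigvecs[:, : n_ant - n_src]        # noise-subspace eigenvectors
    P = np.empty_like(angles)
    for i, theta in enumerate(np.deg2rad(angles)):
        a = np.exp(2j * np.pi * d * np.arange(n_ant) * np.sin(theta))
        P[i] = 1.0 / np.real(a.conj() @ Vn @ Vn.conj().T @ a)
    return angles, P

# Simulated single source at +20 degrees on an 8-element array
rng = np.random.default_rng(0)
n_ant = 8
a0 = np.exp(2j * np.pi * 0.5 * np.arange(n_ant) * np.sin(np.deg2rad(20.0)))
s = rng.standard_normal(500) + 1j * rng.standard_normal(500)
noise = 0.05 * (rng.standard_normal((n_ant, 500))
                + 1j * rng.standard_normal((n_ant, 500)))
angles, P = music_spectrum(np.outer(a0, s) + noise, n_ant)
print(angles[np.argmax(P)])   # peak of the pseudo-spectrum, near 20
```

Looping over the angle grid and picking the argument of the largest pseudo-spectrum value mirrors the scan described in the text.<br />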


Spatial Smoothing<br />

Spatial smoothing is a method for solving problems caused<br />

by multipathing (when coherent signals are present). It can be<br />

proven that the signal covariance matrix can be "decorrelated"<br />

by calculating an averaged covariance matrix using subarrays of<br />

the original covariance matrix. For a two-dimensional array, this<br />

can be written as follows:<br />
$$R = \frac{1}{M_s N_s} \sum_{m=1}^{M_s} \sum_{n=1}^{N_s} R_{mn}, \qquad (7)$$<br />

where $M_s$ and $N_s$ are the numbers of subarrays in the x- and y-directions, respectively, and $R_{mn}$ stands for the $(m, n)$:th subarray covariance matrix. An example proof of this formula and more information can be found in [2].<br />

The resulting covariance matrix can now be used as a<br />

"decorrelated" version of the covariance matrix and fed to the<br />

MUSIC algorithm to produce correct results. The downside of spatial smoothing is that it reduces the size of the covariance matrix, which in turn reduces the accuracy of the estimate.<br />
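A minimal sketch of the idea, shown for the one-dimensional (linear array) analogue of eq. (7); the array size, subarray size and angles are illustrative assumptions:<br />

```python
import numpy as np

def spatially_smooth(R, sub_size):
    """Average the covariance matrices of all overlapping subarrays.

    R: (n, n) full-array covariance matrix.
    sub_size: number of elements per subarray.
    Returns the (sub_size, sub_size) smoothed covariance matrix,
    the 1-D analogue of eq. (7).
    """
    n = R.shape[0]
    n_sub = n - sub_size + 1                 # number of subarrays
    R_bar = np.zeros((sub_size, sub_size), dtype=R.dtype)
    for m in range(n_sub):
        R_bar += R[m:m + sub_size, m:m + sub_size]
    return R_bar / n_sub

# Two coherent (fully correlated) sources: the full covariance matrix
# stays rank 1, but the smoothed matrix recovers rank 2, so MUSIC can
# again separate the two arrivals.
n = 8
a1 = np.exp(2j * np.pi * 0.5 * np.arange(n) * np.sin(np.deg2rad(10)))
a2 = np.exp(2j * np.pi * 0.5 * np.arange(n) * np.sin(np.deg2rad(-30)))
s = a1 + a2                                  # same waveform on both paths
R = np.outer(s, s.conj())
print(np.linalg.matrix_rank(R))                       # 1
print(np.linalg.matrix_rank(spatially_smooth(R, 5)))  # 2 after smoothing
```

The shrink from an 8x8 to a 5x5 covariance matrix in this example is exactly the accuracy trade-off mentioned above.<br />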

V. OTHER LOCATIONING TECHNOLOGIES<br />

In this section, we briefly present two other locationing<br />

technologies for comparison. These two methods use different<br />

kinds of algorithms/methods for locationing than those<br />

presented in this paper.<br />

RSSI<br />

With Received Signal Strength Indicator (RSSI), the basic idea is to measure the strength of the received signal to obtain an approximation of the distance between RX and TX. This<br />

information can be used to trilaterate the position of a receiver<br />

device based on multiple distance measurements from different<br />

transmitter points. This technology requires only one antenna<br />

per device but is not usually very accurate in an indoor<br />

environment.<br />
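A hedged sketch of the idea: a log-distance path-loss model (the 1 m reference RSSI and path-loss exponent below are assumed, illustrative constants) converts RSSI to a distance, and a linearised least-squares step trilaterates the position from three or more anchors:<br />

```python
import numpy as np

def rssi_to_distance(rssi, rssi_at_1m=-45.0, path_loss_exp=2.0):
    """Invert the log-distance path-loss model (illustrative constants)."""
    return 10 ** ((rssi_at_1m - rssi) / (10 * path_loss_exp))

def trilaterate(anchors, dists):
    """Least-squares position from >= 3 anchor positions and distances.

    Linearised by subtracting the first anchor's circle equation
    from the others, then solved with ordinary least squares.
    """
    anchors = np.asarray(anchors, float)
    dists = np.asarray(dists, float)
    x0, d0 = anchors[0], dists[0]
    A = 2 * (anchors[1:] - x0)
    b = (d0**2 - dists[1:]**2
         + np.sum(anchors[1:]**2, axis=1) - np.sum(x0**2))
    return np.linalg.lstsq(A, b, rcond=None)[0]

print(rssi_to_distance(-65.0))   # 10.0 m under the assumed model
pos = trilaterate([(0, 0), (10, 0), (0, 10)], [5.0, 65**0.5, 45**0.5])
print(pos)                       # ≈ [3. 4.]
```

In practice indoor fading makes the RSSI-to-distance step noisy, which is why the text notes the limited indoor accuracy of this approach.<br />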

ToA / TDoA<br />


With Time of Arrival / Time of Flight (ToA/ToF), we<br />

measure the travel time of a signal between RX and TX and use that to calculate the distance between the two devices. This distance is<br />

then used to trilaterate the position of the receiver. In ToA, all<br />

devices are time-synchronized. This technology also requires<br />

only one antenna per device, but, on the other hand, it requires<br />

very high clock accuracy to get reasonable positioning<br />

accuracies. There is also a variant of this technology called<br />

TDoA (Time Difference of Arrival), where only the receiver devices need to be time-synchronized,<br />

and the estimation algorithms use the time<br />

difference for calculating position estimates.<br />
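The arithmetic behind the clock-accuracy requirement is simple: distance is the speed of light times the measured flight time, so every nanosecond of timing error already corresponds to roughly 0.3 m of range error:<br />

```python
C = 299_792_458.0  # speed of light, m/s

def toa_distance(t_flight_s):
    """Distance from a measured one-way time of flight (ToA/ToF)."""
    return C * t_flight_s

def tdoa_range_difference(t1_s, t2_s):
    """Range difference to two receivers from their arrival-time
    difference, the quantity a TDoA estimator works with."""
    return C * (t1_s - t2_s)

# 1 ns of clock error shifts the distance estimate by ~0.3 m,
# which is why ToA needs very tight synchronisation.
print(toa_distance(1e-9))       # ≈ 0.3 m
print(toa_distance(33.3e-9))    # a ~10 m link
```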

VI. SUMMARY<br />

Bluetooth Angle of Arrival and Angle of Departure are emerging technologies that can be used to track assets as well as<br />



for indoor positioning and way-finding. These are phase-based<br />

direction finding systems that require an antenna array, RF<br />

switches (or a multi-channel ADC) and processing power to run<br />

the estimation algorithms. Designing a proper antenna array and a suitable angle estimation algorithm is essential for an RTLS system. Well-performing estimation algorithms are often computationally expensive. Other positioning technologies include (but are not limited to) RSSI-based and ToA-based methods, but only phase-based AoA/AoD currently has a standardized framework in Bluetooth.<br />

REFERENCES<br />

[1] H. Krim, M. Viberg, “Two Decades of Array Signal Processing”, IEEE<br />

Signal Processing Magazine, July 1996, pp. 67-94<br />

[2] Y.-M. Chen, “On Spatial Smoothing for Two-Dimensional Direction-of-<br />

Arrival Estimation of Coherent Signals”, IEEE Transactions on Signal<br />

Processing, Vol. 45, No. 7, July 1997<br />

[3] Z. Chen, G. Gokeda, Y. Yu, “Introduction to Direction-of-Arrival<br />

Estimation”, Artech House, 2010<br />

[4] N. A. Baig, M. B. Malik, “Comparison of Direction of Arrival (DOA) Estimation Techniques for Closely Spaced Targets”, International Journal of Future Computer and Communication, Vol. 2, No. 6, December 2013<br />



Bluetooth Mesh Networking<br />

Martin Woolley<br />

Bluetooth SIG<br />

UK<br />

Twitter: @bluetooth_mdw<br />

Abstract— Mesh is a new network topology option available<br />

for Bluetooth Low Energy (LE) adopted in the summer of 2017.<br />

It represents a major advance which positions Bluetooth to be the<br />

dominant low power wireless communications technology in a<br />

wide variety of new sectors and use cases, including Smart<br />

Buildings and Industrial IoT.<br />

Keywords—Bluetooth, mesh, IoT, smart buildings<br />

I. INTRODUCTION<br />

Bluetooth has been actively developed since its initial<br />

release in 2000, when it was originally intended to act as a<br />

cable replacement technology. It soon came to dominate<br />

wireless audio products and computer peripherals, such as<br />

wireless mice and keyboards.<br />

In 2010, Bluetooth LE provided the next, major step<br />

forward. Its impact has been substantial and widely felt, most<br />

notably in smartphones and tablets, as well as in Health and<br />

Fitness, Smart Home and Wearables categories.<br />

Wireless communications systems based around mesh<br />

network topologies have proved themselves to offer an<br />

effective approach to providing coverage of large areas,<br />

extending range and providing resilience. However, until now<br />

they have been based upon niche technologies, incompatible<br />

with most computer, smartphone and accessory devices owned<br />

by consumers or used in the enterprise.<br />

120 Bluetooth SIG member companies participated in the<br />

work required to bring mesh networking support to Bluetooth.<br />

This is significantly more than is typically the case, and is<br />

representative of the demand for a global, industry standard for<br />

a Bluetooth mesh networking capability.<br />

The addition of mesh networking support represents a<br />

change of a type, and of such magnitude that it warrants being<br />

described as a paradigm shift for Bluetooth technology.<br />

II. TAKING CONTROL<br />
A. Smart Buildings Get Truly Smart<br />
Imagine arriving at the office in your car, early one dark, winter morning. The security system lets you in and a parking bay is automatically allocated to you. The bay number over the parking space lights up so you can drive easily to it. The parking bay allocation system is updated to note that this space is now occupied.<br />
Entering the building, occupancy sensors note your arrival and identify you from the wearable technology about your person. You take the elevator to the 2nd floor and exit. You’re the first to arrive, as usual. As the lift doors open, the lights from the elevator to your office and the kitchen come on. Coffee is deemed of strategic significance in your company! Other areas are left in darkness to save power.<br />
You walk to your office and enter, closing the door behind you. The LED downlights and your desk lamp are already on and at exactly the level you prefer. You notice the temperature is a little warmer than the main office space, reflecting your personal preference. Proximity with your computer automatically logs you in.<br />
Your day started well, with the building responding to your needs and taking your preferences into account. It’s clear that systems are being used efficiently. What made this possible?<br />
Your company installed a Bluetooth mesh network some months ago, starting with a mesh lighting system. Later, the mesh was extended with occupancy sensors, environmental sensors, a wireless heating control system and a mesh-based car park management system. The company is saving money on electricity and heating, and work environments have become personalized, boosting personal productivity. Maintenance costs are going down, since adding items like additional light switches no longer requires expensive and disruptive physical wiring. Data is allowing the building management team to learn about the building, its services and how people act within it, and the team is using this data to make optimizations.<br />
Figure 1 - A Bluetooth mesh network could span the office and car park<br />
The Bluetooth mesh network has made it easier and cheaper to be in control of building services, to wirelessly interact with them and to automate their behaviors. You wonder how you ever lived without such advanced building technology in the past!<br />
III. BLUETOOTH MESH - THE BASICS<br />
A. Concepts and Terminology<br />
Understanding Bluetooth mesh networking requires the reader to learn a series of new technical terms and concepts not found in the world of Bluetooth LE. In this section, we’ll explore the most fundamental of these terms and concepts.<br />
B. Mesh vs Point-to-Point<br />
Most Bluetooth LE devices communicate with each other using a simple point-to-point network topology enabling one-to-one device communications. In the Bluetooth core specification, this is called a ‘piconet.’<br />
Imagine a smartphone that has established a point-to-point connection to a heart rate monitor over which it can transfer data. One nice aspect of Bluetooth is that it enables devices to set up multiple connections. That same smartphone can also establish a point-to-point connection with an activity tracker. In this case, the smartphone can communicate directly with each of the other devices, but the other devices cannot communicate directly with each other.<br />
In contrast, a mesh network has a many-to-many topology, with each device able to communicate with every other device in the mesh (we’ll examine that statement more closely later on, in the section entitled “Bluetooth mesh in action”). Communication is achieved using messages, and devices are able to relay messages to other devices so that the end-to-end communication range is extended far beyond the radio range of each individual node.<br />

Figure 2 - A many to many topology with message relaying<br />

C. Devices and Nodes<br />

Devices which are part of a mesh network are called nodes<br />

and those which are not are called “unprovisioned devices”.<br />

The process which transforms an unprovisioned device into<br />

a node is called “provisioning”. Consider purchasing a new<br />

Bluetooth light with mesh support, bringing it home and setting<br />

it up. To make it part of your mesh network, so that it can be<br />

controlled by your existing Bluetooth light switches and<br />

dimmers, you would need to provision it.<br />

Provisioning is a secure procedure which results in an<br />

unprovisioned device possessing a series of encryption keys<br />

and being known to the Provisioner device, typically a tablet or<br />

smartphone. One of these keys is called the network key or<br />

NetKey for short. You can read more about mesh security in<br />

the Security section, below.<br />

All nodes in a mesh network possess at least one NetKey<br />

and it is possession of this key which makes a device a member<br />

of the corresponding network and as such, a node. There are<br />

other requirements that must be satisfied before a node can<br />

become useful, but securely acquiring a NetKey through the<br />

provisioning process is a fundamental first step. We’ll review<br />

the provisioning process in more detail in a later section of this<br />

paper.<br />

D. Elements<br />

Some nodes have multiple constituent parts, each of which<br />

can be independently controlled. In Bluetooth mesh<br />

terminology, these parts are called elements. Figure 3 shows an<br />



LED lighting product which, if added to a Bluetooth mesh<br />

network, would form a single node with three elements, one for<br />

each of the individual LED lights.<br />

Figure 3 - Lighting node consisting of three elements<br />

E. Messages<br />

When a node needs to query the status of other nodes or<br />

needs to control other nodes in some way, it sends a message<br />

of a suitable type. If a node needs to report its status to other<br />

nodes, it sends a message. All communication in the mesh<br />

network is message-oriented and many message types are<br />

defined, each with its own unique opcode.<br />

Messages fall into one of two broad categories:<br />

Acknowledged messages require a response from nodes<br />

that receive them. The response serves two purposes: it<br />

confirms that the message it relates to was received, and it<br />

returns data relating to the message recipient to the message<br />

sender.<br />

The sender of an acknowledged message may resend the<br />

message if it does not receive the expected response(s) and<br />

therefore, acknowledged messages must be idempotent. This<br />

means that the effect of a given acknowledged message,<br />

arriving at a node multiple times, will be the same as if it had<br />

only been received once.<br />

Unacknowledged messages do not require a response.<br />

F. Addresses<br />

Messages must be sent from and to an address. Bluetooth<br />

mesh defines three types of address.<br />

A unicast address uniquely identifies a single element.<br />

Unicast addresses are assigned to devices during the<br />

provisioning process.<br />

A group address is a multicast address which represents one<br />

or more elements. Group addresses are either defined by the<br />

Bluetooth SIG and are known as SIG Fixed Group Addresses<br />

or are assigned dynamically. Four SIG Fixed Group Addresses<br />

have been defined. These are named All-proxies, All-friends,<br />

All-relays and All-nodes. The terms Proxy, Friend and Relay<br />

will be explained later in this paper.<br />

It is expected that dynamic group addresses will be<br />

established by the user via a configuration application and that<br />

they will reflect the physical configuration of a building, such<br />

as defining group addresses which correspond to each room in<br />

the building.<br />

A virtual address is an address which may be assigned to<br />

one or more elements, spanning one or more nodes. It takes the<br />

form of a 128-bit UUID value with which any element can be<br />

associated and is much like a label. Virtual addresses will<br />

likely be preconfigured at the point of manufacture and be used<br />

for scenarios such as allowing the easy addressing of all<br />

meeting room projectors made by a given manufacturer.<br />

G. Publish / Subscribe<br />

The act of sending a message is known as publishing. Nodes<br />

are configured to select messages sent to specific addresses for<br />

processing, and this is known as subscribing.<br />

Typically, messages are addressed to group or virtual<br />

addresses. Group and virtual address names will have readily<br />

understood meaning to the end user, making them easy and<br />

intuitive to use.<br />

In Figure 4, we can see that the node “Switch 1” is<br />

publishing to the group address Kitchen. Nodes Light 1, Light<br />

2 and Light 3 each subscribe to the Kitchen address and<br />

therefore receive and process messages published to this<br />

address. In other words, Light 1, Light 2 and Light 3 can be<br />

switched on or off using Switch 1.<br />

Switch 2 publishes to the group address Dining Room.<br />

Light 3 alone subscribes to this address and so is the only light<br />

controlled by Switch 2. Note that this example also illustrates<br />

the fact that nodes may subscribe to messages addressed to<br />

more than one distinct address. This is both powerful and<br />

flexible.<br />

Similarly, notice how both Switch 5 and Switch 6 publish<br />

to the same Garden address.<br />

The use of group and virtual addresses with the<br />

publish/subscribe communication model has an additional,<br />

substantial benefit in that removing, replacing or adding new<br />

nodes to the network does not require reconfiguration of other<br />

nodes. Consider what would be involved in installing an<br />

additional light in the dining room. The new device would be<br />

added to the network using the provisioning process and<br />

configured to subscribe to the Dining Room address. No other<br />

nodes would be affected by this change to the network. Switch<br />

2 would continue to publish messages to Dining Room as<br />

before but now both Light 3 and the new light would respond.<br />

Figure 4 - Publish / Subscribe<br />
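The switch-and-lights scenario above can be sketched as a toy publish/subscribe router. The class names and routing logic are purely illustrative, not part of any mesh stack:<br />

```python
from collections import defaultdict

class MeshNetwork:
    """Toy publish/subscribe routing over group addresses."""
    def __init__(self):
        self.subscribers = defaultdict(list)

    def subscribe(self, address, node):
        self.subscribers[address].append(node)

    def publish(self, address, message):
        # Every node subscribed to the address processes the message.
        for node in self.subscribers[address]:
            node.receive(message)

class Light:
    def __init__(self, name):
        self.name, self.on = name, False
    def receive(self, message):
        self.on = (message == "on")

net = MeshNetwork()
lights = [Light(f"Light {i}") for i in (1, 2, 3)]
for light in lights:
    net.subscribe("Kitchen", light)
net.subscribe("Dining Room", lights[2])   # Light 3 is in both groups

net.publish("Kitchen", "on")        # Switch 1: all three lights turn on
net.publish("Dining Room", "off")   # Switch 2: only Light 3 turns off
print([light.on for light in lights])   # [True, True, False]
```

Adding a new dining room light is just another `subscribe("Dining Room", ...)` call: no other node needs reconfiguring, which mirrors the benefit described above.<br />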



H. States and Properties<br />

Elements can be in various conditions and this is<br />

represented in Bluetooth Mesh by the concept of state values.<br />

A state is a value of a certain type, contained within an<br />

element (within a server model - see below). As well as values,<br />

states also have associated behaviors and may not be reused in<br />

other contexts.<br />

As an example, consider a simple light which may either be<br />

on or off. Bluetooth Mesh defines a state called Generic OnOff.<br />

The light would possess this state item and a value of On<br />

would correspond to and cause the light to be illuminated,<br />

whereas a Generic OnOff state value of Off would reflect and<br />

cause the light to be switched off.<br />

The significance of the term Generic will be discussed later.<br />

Properties are similar to states in that they contain values<br />

relating to an element. But they are significantly different from<br />

states in other ways.<br />

Readers who are familiar with Bluetooth LE will be aware<br />

of characteristics and recall that they are data types with no<br />

defined behaviors associated with them, making them reusable<br />

across different contexts. A property provides the context for<br />

interpreting a characteristic.<br />

To appreciate the significance and use of contexts as they<br />

relate to properties, consider for example, the characteristic<br />

Temperature 8, an 8-bit temperature state type which has a<br />

number of associated properties, including Present Indoor<br />

Ambient Temperature and Present Outdoor Ambient<br />

Temperature. These two properties allow a sensor to publish<br />

sensor readings in a way that allows a receiving client to<br />

determine the context the temperature value has, making better<br />

sense of its true meaning.<br />

Properties are organized into two categories: Manufacturer,<br />

which is a read-only category, and Admin, which allows read-write access.<br />

I. Messages, States and Properties<br />

Messages are the mechanism by which operations on the<br />

mesh are invoked. Formally, a given message type represents<br />

an operation on a state or collection of multiple state values.<br />

All messages are of three broad types, reflecting the types of<br />

operation which Bluetooth Mesh supports. The shorthand for<br />

the three types is GET, SET and STATUS.<br />

GET messages request the value of a given state from one<br />

or more nodes. A STATUS message is sent in response to a<br />

GET and contains the relevant state value.<br />

SET messages change the value of a given state. An<br />

acknowledged SET message will result in a STATUS message<br />

being returned in response to the SET message whereas an<br />

unacknowledged SET message requires no response.<br />

STATUS messages are sent in response to GET messages,<br />

acknowledged SET messages or independently of other<br />

messages, perhaps driven by a timer running on the element<br />

sending the message, for example.<br />

Specific states referenced by messages are inferred from the<br />

message opcode. Properties on the other hand, are referenced<br />

explicitly in generic property related messages using a 16-bit<br />

property ID.<br />
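The GET/SET/STATUS interaction can be sketched as a toy server model. The opcode constants below are illustrative placeholders (real opcodes are assigned in the Bluetooth Mesh Model specification), and note how the SET handler is idempotent, as the earlier discussion of acknowledged messages requires:<br />

```python
# Placeholder opcodes for a toy Generic OnOff-style server.
ONOFF_GET, ONOFF_SET, ONOFF_STATUS = 0x01, 0x02, 0x03

class OnOffServer:
    """Toy server model holding a single on/off state value."""
    def __init__(self):
        self.on_off = 0

    def handle(self, opcode, value=None, acknowledged=True):
        """Process a message; return the STATUS reply to send, or None."""
        if opcode == ONOFF_GET:
            # A STATUS message carries the requested state value.
            return (ONOFF_STATUS, self.on_off)
        if opcode == ONOFF_SET:
            # Idempotent: receiving the same SET twice has the same
            # effect as receiving it once.
            self.on_off = value
            if acknowledged:
                return (ONOFF_STATUS, self.on_off)
        return None   # unacknowledged SET: no response

server = OnOffServer()
print(server.handle(ONOFF_SET, 1))   # STATUS reply confirming the new state
print(server.handle(ONOFF_GET))      # same state echoed back
```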

J. State Transitions<br />

Changes from one state to another are called state<br />

transitions. Transitions may be instantaneous or execute over a<br />

period of time called the transition time. A state transition is<br />

likely to have an effect on the application layer behavior of a<br />

node.<br />

K. Bound States<br />

Relationships may exist between states whereby a change<br />

in one triggers a change in the other. Such a relationship is<br />

called a state binding. One state may be bound to multiple<br />

other states.<br />

For example, consider a light controlled by a dimmer<br />

switch. The light would possess the two states, Generic OnOff<br />

and Generic Level with each bound to the other. Reducing the<br />

brightness of the light until Generic Level has a value of zero<br />

(fully dimmed) results in Generic OnOff transitioning from On<br />

to Off.<br />
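The dimmer example can be sketched as follows. The 0-100 level range and the restore-to-full behaviour when switching On at level zero are illustrative choices, not behaviours taken from the specification:<br />

```python
class DimmableLight:
    """Toy element with Generic Level bound to Generic OnOff."""
    def __init__(self):
        self.level = 100   # brightness, 0..100 (illustrative range)
        self.on_off = 1    # 1 = On, 0 = Off

    def set_level(self, level):
        self.level = level
        # State binding: a change in Generic Level triggers a change
        # in the bound Generic OnOff state (fully dimmed => Off).
        self.on_off = 0 if level == 0 else 1

    def set_on_off(self, on_off):
        self.on_off = on_off
        if on_off and self.level == 0:
            self.level = 100   # illustrative: restore a usable brightness

light = DimmableLight()
light.set_level(0)
print(light.on_off)   # 0: dimming to zero switched the light Off
```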

L. Models<br />

Models pull the preceding concepts together and define<br />

some or all of the functionality of an element as it relates to the<br />

mesh network. Three categories of model are recognized.<br />

A server model defines a collection of states, state<br />

transitions, state bindings and messages which the element<br />

containing the model may send or receive. It also defines<br />

behaviors relating to messages, states and state transitions.<br />

A client model does not define any states. Instead, it defines<br />

the messages which it may send or receive in order to GET,<br />

SET or acquire the STATUS of states defined in the<br />

corresponding server model.<br />

Control models contain both a server model, allowing<br />

communication with other client models and a client model<br />

which allows communication with server models.<br />

Models may be created by extending other models. A<br />

model which is not extended is called a root model.<br />

Models are immutable, meaning that they may not be<br />

changed by adding or removing behaviors. The correct and<br />

only permissible approach to implementing new model<br />

requirements is to extend the existing model.<br />

M. Generics<br />

It is recognized that many different types of device, often<br />

have semantically equivalent states, as exemplified by the<br />

simple idea of ON vs OFF. Consider lights, fans and power<br />

sockets, all of which can be switched on or turned off.<br />

Consequently, the Bluetooth Mesh Model specification<br />

defines a series of reusable, generic states such as, for example,<br />

Generic OnOff and Generic Level.<br />

Similarly, a series of generic messages that operate on the<br />

generic states are defined. Examples include Generic OnOff<br />

Get and Generic Level Set.<br />



Generic states and generic messages are used in generalized<br />

models, both generic server models such as the Generic OnOff<br />

Server and Generic Client Models such as the Generic Level<br />

Client.<br />

Generics allow a wide range of device type to support<br />

Bluetooth Mesh without the need to create new models.<br />

Remember that models may be created by extending other<br />

models too. As such, generic models may form the basis for<br />

quickly creating models for new types of devices.<br />

Figure 5 - Generic Models<br />

N. Scenes<br />

A scene is a stored collection of states which may be<br />

recalled and made current by the receipt of a special type of<br />

message or at a specified time. Scenes are identified by a 16-bit<br />

Scene Number, which is unique within the mesh network.<br />

Scenes allow a series of nodes to be set to a given set of<br />

previously stored, complementary states in one coordinated<br />

action.<br />

Imagine that in the evening, you like the temperature in<br />

your main family room to be 20 degrees Celsius, the six LED<br />

downlights to be at a certain brightness level and the lamp in<br />

the corner of the room on the table set to a nice warm yellow<br />

hue. Having manually set the various nodes in this example<br />

scenario to these states, you can store them as a scene using a<br />

configuration application and recall the scene later on, either on<br />

demand by sending an appropriate, scene-related mesh<br />

message or automatically at a scheduled time.<br />
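The evening-scene example can be sketched as a store/recall of state snapshots keyed by a 16-bit Scene Number. The registry API and node state dictionaries are illustrative:<br />

```python
class SceneRegistry:
    """Toy scene store: snapshots of node states by 16-bit Scene Number."""
    def __init__(self, nodes):
        self.nodes = nodes      # name -> mutable state dict
        self.scenes = {}

    def store(self, scene_number):
        assert 1 <= scene_number <= 0xFFFF   # 16-bit, unique in the network
        # Deep-enough copy so later edits don't alter the stored scene.
        self.scenes[scene_number] = {n: dict(s) for n, s in self.nodes.items()}

    def recall(self, scene_number):
        # One coordinated action restores every stored state.
        for name, state in self.scenes[scene_number].items():
            self.nodes[name].update(state)

nodes = {"thermostat": {"target_c": 20},
         "downlights": {"level": 60},
         "table_lamp": {"hue": "warm yellow"}}
reg = SceneRegistry(nodes)
reg.store(0x0001)                      # capture the "evening" scene
nodes["downlights"]["level"] = 100     # daytime override
reg.recall(0x0001)                     # restore the whole scene at once
print(nodes["downlights"]["level"])    # 60
```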

O. Provisioning<br />

Provisioning is the process by which a device joins the<br />

mesh network and becomes a node. It involves several stages,<br />

results in various security keys being generated and is itself a<br />

secure process.<br />

Provisioning is accomplished using an application on a<br />

device such as a tablet. In this capacity, the device used to<br />

drive the provisioning process is referred to as the Provisioner.<br />

The provisioning process progresses through five steps and<br />

these are described next.<br />

Step 1. Beaconing<br />

In support of various Bluetooth mesh features, including but not limited to provisioning, new GAP AD types (ref: Bluetooth Core Specification Supplement) have been introduced. An unprovisioned device indicates its availability to be provisioned by using one of these new AD types in its advertising packets. The user might need to start a new device<br />

advertising in this way by, for example, pressing a combination<br />

of buttons or holding down a button for a certain length of<br />

time.<br />

Step 2. Invitation<br />

In this step, the Provisioner sends an invitation to the<br />

device to be provisioned, in the form of a Provisioning Invite<br />

PDU. The Beaconing device responds with information about<br />

itself in a Provisioning Capabilities PDU.<br />

Step 3. Exchanging Public Keys<br />

The Provisioner and the device to be provisioned exchange<br />

their public keys, which may be static or ephemeral, either<br />

directly or using an out-of-band (OOB) method.<br />

Step 4. Authentication<br />

During the authentication step, the device to be provisioned<br />

outputs a random, single or multi-digit number to the user in<br />

some form, using an action appropriate to its capabilities. For<br />

example, it might flash an LED several times. The user enters<br />

the digit(s) output by the new device into the Provisioner and a<br />

cryptographic exchange takes place between the two devices,<br />

involving the random number, to complete the authentication<br />

of each of the two devices to the other.<br />

Step 5. Distribution of the Provisioning Data<br />

After authentication has successfully completed, a session<br />

key is derived by each of the two devices from their private<br />

keys and the exchanged, peer public keys. The session key is<br />

then used to secure the subsequent distribution of the data<br />

required to complete the provisioning process, including a<br />

security key known as the network key (NetKey).<br />

After provisioning has completed, the provisioned device<br />

possesses the network’s NetKey, a mesh security parameter<br />

known as the IV Index and a Unicast Address, allocated by the<br />

Provisioner. It is now known as a node.<br />

P. Features<br />

All nodes can transmit and receive mesh messages but there<br />

are a number of optional features which a node may possess,<br />

giving it additional, special capabilities. There are four such<br />

optional features: the Relay, Proxy, Friend and Low Power<br />

features. A node may support zero or more of these optional<br />

features and any supported feature may, at a point in time, be<br />

enabled or disabled.<br />

Q. Relay Nodes<br />

Nodes which support the Relay feature, known as Relay<br />

nodes, are able to retransmit received messages. Relaying is the<br />

mechanism by which a message can traverse the entire mesh<br />

network, making multiple “hops” between devices by being<br />

relayed.<br />



Mesh network PDUs include a field called TTL (Time To<br />

Live). It takes an integer value and is used to limit the number<br />

of hops a message will make across the network. Setting TTL<br />

to 3, for example, will result in the message being relayed, a<br />

maximum number of three hops away from the originating<br />

node. Setting it to 0 will result in it not being relayed at all and<br />

only travelling a single hop. Armed with some basic<br />

knowledge of the topology and membership of the mesh, nodes<br />

can use the TTL field to make more efficient use of the mesh<br />

network.<br />
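The TTL behaviour described above can be sketched as follows. The dict-based PDU is a toy representation; per the description, a PDU with TTL 0 or 1 is not relayed further, and each relay hop decrements TTL:<br />

```python
def relay(pdu, node):
    """Forward a mesh PDU if the node's Relay feature allows it.

    pdu: toy dict with 'ttl' and 'payload'.
    Returns the PDU to retransmit, or None if it must not be relayed.
    """
    if not node.get("relay_enabled"):
        return None               # Relay feature unsupported or disabled
    if pdu["ttl"] <= 1:
        return None               # TTL 0 or 1: travels no further hops
    return dict(pdu, ttl=pdu["ttl"] - 1)   # decrement TTL per hop

node = {"relay_enabled": True}
hop = {"ttl": 3, "payload": b"toy message"}
hop = relay(hop, node); print(hop["ttl"])   # 2
hop = relay(hop, node); print(hop["ttl"])   # 1
print(relay(hop, node))                     # None: no further relaying
```

Starting with TTL 3 gives at most three hops from the originator, matching the example in the text.<br />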

R. Low Power Nodes and Friend Nodes<br />

Some types of node have a limited power source and need<br />

to conserve energy as much as possible. Furthermore, devices<br />

of this type may be predominantly concerned with sending<br />

messages but still have a need to occasionally receive<br />

messages.<br />

Consider a temperature sensor which is powered by a small<br />

coin cell battery. It sends a temperature reading once per<br />

minute whenever the temperature is above or below configured<br />

upper and lower thresholds. If the temperature stays within<br />

those thresholds it sends no messages. These behaviors are<br />

easily implemented with no particular issues relating to power<br />

consumption arising.<br />

However, the user is also able to send messages to the<br />

sensor, which change the temperature threshold state values.<br />

This is a relatively rare event, but the sensor must support it.<br />

The need to receive messages has implications for duty cycle<br />

and as such power consumption. A 100% duty cycle would<br />

ensure that the sensor did not miss any temperature threshold<br />

configuration messages, but would use a prohibitive amount of power.<br />

A low duty cycle would conserve energy, but risk the sensor<br />

missing configuration messages.<br />

The answer to this apparent conundrum is the Friend node<br />

and the concept of “friendship”.<br />

Nodes like the temperature sensor in the example may be<br />

designated Low Power nodes (LPNs) and a feature flag in the<br />

sensor’s configuration data will designate the node as such.<br />

LPNs work in tandem with another node, one which is not<br />

power-constrained (e.g. it has a permanent AC power source).<br />

This device is termed a Friend node. The Friend stores<br />

messages addressed to the LPN and delivers them to the LPN<br />

whenever the LPN polls the Friend node for “waiting<br />

messages”. The LPN may poll the Friend relatively<br />

infrequently so that it can balance its need to conserve power<br />

with the timeliness with which it needs to receive and process<br />

configuration messages. When it does poll, all messages stored<br />

by the Friend are forwarded to the LPN, one after another, with<br />

a flag known as MD (More Data) indicating to the LPN<br />

whether there are further messages to be sent from the Friend<br />

node.<br />

The relationship between the LPN and the Friend node is<br />

known as friendship. Friendship is key to allowing very power<br />

constrained nodes which need to receive messages, to function<br />

in a Bluetooth mesh network whilst continuing to operate in a<br />

power-efficient way.<br />
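The store-and-poll exchange can be sketched with a simple message queue. The API below is illustrative, not the actual Friend/LPN protocol PDUs:<br />

```python
from collections import deque

class FriendNode:
    """Toy Friend node: stores messages for an LPN until it polls."""
    def __init__(self):
        self.queue = deque()

    def store(self, message):
        # Messages addressed to the sleeping LPN are queued here.
        self.queue.append(message)

    def poll(self):
        """Return (message, more_data): MD flags further waiting messages."""
        if not self.queue:
            return None, False
        msg = self.queue.popleft()
        return msg, len(self.queue) > 0

friend = FriendNode()
friend.store("set upper threshold 28C")
friend.store("set lower threshold 16C")

# The LPN wakes, then polls until the MD flag says nothing more waits.
more = True
while more:
    msg, more = friend.poll()
    print(msg, more)
```

Because the LPN chooses how often to poll, it trades message latency against battery life, exactly the balance described above.<br />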

S. Proxy Nodes<br />

There are an enormous number of devices in the world that<br />

support Bluetooth LE, most smartphones and tablets being<br />

amongst them. In-market Bluetooth devices, at the time<br />

Bluetooth mesh was adopted, did not possess a Bluetooth mesh<br />

networking stack. They do possess a Bluetooth LE stack,<br />

however, and therefore have the ability to connect to other<br />

devices and interact with them using GATT, the Generic<br />

Attribute Profile.<br />

Proxy nodes expose a GATT interface, which Bluetooth LE<br />

devices may use to interact with a mesh network. A protocol<br />

called the Proxy Protocol is defined, intended to be used with a<br />

connection-oriented bearer such as GATT. GATT<br />

devices read and write Proxy Protocol PDUs from within<br />

GATT characteristics implemented by the Proxy node. The<br />

Proxy node transforms these PDUs to / from mesh PDUs.<br />

In summary, Proxy nodes allow Bluetooth LE devices that<br />

do not possess a Bluetooth mesh stack to interact with nodes in<br />

a mesh network.<br />

Figure 6 - Smartphone communicating via a mesh proxy node<br />

T. Node Configuration<br />

Each node supports a standard set of configuration states<br />

which are implemented within the standard Configuration<br />

Server Model and accessed using the Configuration Client<br />

Model. Configuration state data is concerned with the node’s<br />

capabilities and behavior within the mesh, independently of<br />

any specific application or device type behaviors.<br />

For example, the features supported by a node, whether it is<br />

a Proxy node, a Relay node and so on, are indicated by<br />

Configuration Server states. The addresses to which a node has<br />

subscribed are stored in the Subscription List. The network and<br />

subnet keys indicating the networks the node is a member of<br />

are listed in the configuration block, as are the application keys<br />

held by the node.<br />

A series of configuration messages allow the Configuration<br />

Client Model and Configuration Server Model to support GET,<br />



SET and STATUS operations on the Configuration Server<br />

Model states.<br />

IV. THE MESH SYSTEM ARCHITECTURE<br />

A. Overview<br />

In this section we’ll take a closer look at the Bluetooth<br />

mesh architecture, its layers and their respective<br />

responsibilities. We’ll also position the mesh architecture<br />

relative to the Bluetooth LE core architecture.<br />

Figure 7 shows the mesh architecture.<br />

Figure 7 - The Bluetooth mesh architecture<br />

At the bottom of the mesh architecture stack, we have a<br />

layer entitled Bluetooth LE. In fact, this is more than just a<br />

single layer of the mesh architecture; it’s the full Bluetooth LE<br />

stack, which provides the fundamental wireless<br />

communications capabilities leveraged by the mesh<br />

architecture that sits on top of it. It should be clear that the<br />

mesh system is dependent upon the availability of a Bluetooth<br />

LE stack.<br />

We’ll now review each layer of the mesh architecture,<br />

working our way up from the bottom layer.<br />

B. Bearer Layer<br />

Mesh messages require an underlying communications<br />

system for their transmission and receipt. The bearer layer<br />

defines how mesh PDUs will be handled by a given<br />

communications system. At this time, two bearers are defined<br />

and these are called the Advertising Bearer and the GATT<br />

Bearer.<br />

The Advertising Bearer leverages Bluetooth LE’s GAP<br />

advertising and scanning features to convey and receive mesh<br />

PDUs.<br />

The GATT Bearer allows a device which does not support<br />

the Advertising Bearer to communicate indirectly with nodes<br />

of a mesh network which do, using a protocol known as the<br />

Proxy Protocol. The Proxy Protocol is encapsulated within<br />

GATT operations involving specially defined GATT<br />

characteristics. A mesh Proxy node implements these GATT<br />

characteristics and supports the GATT bearer as well as the<br />

Advertising Bearer so that it can convert and relay messages<br />

between the two types of bearer.<br />

C. Network Layer<br />

The network layer defines the various message address<br />

types and a network message format which allows transport<br />

layer PDUs to be transported by the bearer layer.<br />

It can support multiple bearers, each of which may have<br />

multiple network interfaces, including the local interface which<br />

is used for communication between elements that are part of<br />

the same node.<br />

The network layer determines which network interface(s) to<br />

output messages over. An input filter is applied to messages<br />

arriving from the bearer layer, to determine whether or not they<br />

should be delivered to the network layer for further processing.<br />

Output messages are subject to an output filter to control<br />

whether or not they are dropped or delivered to the bearer<br />

layer.<br />

The Relay and Proxy features may be implemented by the<br />

Network Layer.<br />

D. Lower Transport Layer<br />

The lower transport layer takes PDUs from the upper<br />

Transport Layer and sends them to the lower transport layer on<br />

a peer device. Where required, it performs segmentation and<br />

reassembly of PDUs. For longer packets, which will not fit into<br />

a single Transport PDU, the lower transport layer will perform<br />

segmentation, splitting the PDU into multiple Transport PDUs.<br />

The receiving lower transport layer on the other device, will<br />

reassemble the segments into a single upper transport layer<br />

PDU and pass this up the stack.<br />
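The segmentation and reassembly just described can be illustrated with a short sketch. The segment size and tuple layout here are assumptions for the example, not the values or field encodings from the mesh specification; only the field names SegO and SegN are borrowed from the text's concept of numbered segments:

```python
# Illustrative sketch of lower transport segmentation and reassembly.
# Sizes and tuple layout are assumptions for the example, not the
# encodings defined by the mesh specification.

SEGMENT_SIZE = 12  # bytes of upper transport payload per segment (assumed)

def segment(pdu: bytes):
    """Split an upper transport PDU into numbered lower transport segments."""
    chunks = [pdu[i:i + SEGMENT_SIZE] for i in range(0, len(pdu), SEGMENT_SIZE)]
    total = len(chunks)
    # each entry: (segment offset SegO, last offset SegN, payload)
    return [(seg_o, total - 1, payload) for seg_o, payload in enumerate(chunks)]

def reassemble(segments):
    """Rebuild the upper transport PDU; segments may arrive out of order."""
    seg_n = segments[0][1]
    by_offset = {seg_o: payload for seg_o, _, payload in segments}
    assert set(by_offset) == set(range(seg_n + 1)), "missing segment"
    return b"".join(by_offset[i] for i in range(seg_n + 1))

original = bytes(range(30))
segs = segment(original)
assert len(segs) == 3
assert reassemble(list(reversed(segs))) == original  # order-independent
```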

E. Upper Transport Layer<br />

The upper transport layer is responsible for the encryption,<br />

decryption and authentication of application data passing to<br />

and from the access layer. It also has responsibility for<br />

transport control messages, which are internally generated and<br />

sent between the upper transport layers on different peer nodes.<br />

These include messages related to friendship and heartbeats.<br />

F. Access Layer<br />

The access layer is responsible for defining how<br />

applications can make use of the upper transport layer. This<br />

includes:<br />

- defining the format of application data.<br />

- defining and controlling the encryption and decryption<br />

process which is performed in the upper transport layer.<br />

- verifying that data received from the upper transport layer<br />

is for the right network and application, before forwarding the<br />

data up the stack.<br />

G. Foundation Model Layer<br />

The foundation model layer is responsible for the<br />

implementation of those models concerned with the<br />

configuration and management of a mesh network.<br />



H. Model Layer<br />

The Model Layer is concerned with the implementation of<br />

Models and as such, the implementation of behaviors,<br />

messages, states, state bindings and so on, as defined in one or<br />

more model specifications.<br />

V. SECURITY<br />

A. Mesh Security is Mandatory<br />

Bluetooth LE allows the profile designer to exploit a range<br />

of different security mechanisms, from the various approaches<br />

to pairing that are possible, to individual security requirements<br />

associated with individual characteristics. Security is, in fact,<br />

totally optional, and it’s permissible to have a device which is<br />

completely open with no security protections or constraints in<br />

place. The device designer or manufacturer is responsible for<br />

analyzing threats and determining the security requirements<br />

and solutions for their product.<br />

In contrast, in Bluetooth Mesh, security is mandatory. The<br />

network, individual applications and devices are all secure and<br />

this cannot be switched off or reduced in any way.<br />

Figure 8 - security is central to Bluetooth mesh networking<br />

B. Mesh Security Fundamentals<br />

The following fundamental security statements apply to all<br />

Bluetooth mesh networks:<br />

1. All mesh messages are encrypted and authenticated.<br />

2. Network security, application security and device<br />

security are addressed independently. See “Separation<br />

of Concerns” below.<br />

3. Security keys can be changed during the life of the<br />

mesh network via a Key Refresh procedure.<br />

4. Message obfuscation makes it difficult to track<br />

messages sent within the network, providing a privacy<br />

mechanism that protects nodes from being tracked.<br />

5. Mesh security protects the network against replay<br />

attacks.<br />

6. The process by which devices are added to the mesh<br />

network to become nodes is, itself, a secure process.<br />

7. Nodes can be removed from the network securely, in a<br />

way which prevents trashcan attacks.<br />

C. Separation of Concerns and Mesh Security Keys<br />

At the heart of Bluetooth Mesh security are three types of<br />

security key. Between them, these keys provide security to<br />

different aspects of the mesh and achieve a critical capability in<br />

mesh security, that of “separation of concerns”.<br />

To understand this and appreciate its significance, consider<br />

a mesh light which can act as a relay. In its capacity as a relay,<br />

it may find itself handling messages relating to the building’s<br />

Bluetooth mesh door and window security system. A light has<br />

no business being able to access and process the details of such<br />

messages, but does need to relay them to other nodes.<br />

To deal with this potential conflict of interest, the mesh<br />

uses different security keys for securing messages at the<br />

network layer from those used to secure data relating to<br />

specific applications such as lighting, physical security, heating<br />

and so on.<br />

All nodes in a mesh network possess a network key<br />

(NetKey). Indeed, it is possession of this shared key which<br />

makes a node a member of the network. A network encryption<br />

key and a privacy key are derived directly from the NetKey.<br />

Being in possession of the NetKey allows a node to decrypt<br />

and authenticate up to the Network Layer so that network<br />

functions such as relaying, can be carried out. It does not allow<br />

application data to be decrypted.<br />
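The derivation pattern described here can be sketched as follows. Note that this is illustrative only: the mesh specification derives the network encryption key and privacy key from the NetKey with its AES-CMAC-based k2 function, whereas this sketch substitutes HMAC-SHA256 purely to show the idea of distinct, deterministically derived subkeys:

```python
# Illustrative only: the mesh specification uses an AES-CMAC-based
# derivation (k2); HMAC-SHA256 stands in here to show the pattern of
# deriving distinct subkeys from one shared NetKey.

import hmac
import hashlib

def derive(net_key: bytes, label: bytes) -> bytes:
    """Derive a 128-bit subkey from the NetKey for a given role."""
    return hmac.new(net_key, label, hashlib.sha256).digest()[:16]

net_key = bytes(16)  # placeholder 128-bit NetKey
encryption_key = derive(net_key, b"encryption")
privacy_key = derive(net_key, b"privacy")

# Distinct roles yield distinct keys, yet every node holding the same
# NetKey deterministically derives the same pair.
assert encryption_key != privacy_key
assert derive(net_key, b"privacy") == privacy_key
```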

The network may be subdivided into subnets and each<br />

subnet has its own NetKey, which is possessed only by those<br />

nodes which are members of that subnet. This might be used,<br />

for example, to isolate specific, physical areas, such as each<br />

room in a hotel.<br />

Application data for a specific application can only be<br />

decrypted by nodes which possess the right application key<br />

(AppKey). Across the nodes in a mesh network, there may be<br />

many distinct AppKeys, but typically, each AppKey will only<br />

be possessed by a small subset of the nodes, namely those of a<br />

type which can participate in a given application. For example,<br />

lights and light switches would possess the lighting<br />

application’s AppKey but not the AppKey for the heating<br />

system, which would only be possessed by thermostats, valves<br />

on radiators and so on.<br />

AppKeys are used by the upper transport layer to decrypt<br />

and authenticate messages before passing them up to the access<br />

layer.<br />

AppKeys are associated with only one NetKey. This<br />

association is termed “key binding” and means that specific<br />

applications, as defined by possession of a given AppKey, can<br />

only work on one specific network, whereas a network can host<br />

multiple, independently secure applications.<br />

The final key type is the device key (DevKey). This is a<br />

special type of application key. Each node has a unique<br />

DevKey known to the Provisioner device and no other. The<br />



DevKey is used in the provisioning process to secure<br />

communication between the Provisioner and the node.<br />

D. Node Removal, Key Refresh and Trashcan Attacks<br />

As described above, nodes contain various mesh security<br />

keys. Should a node become faulty and need to be disposed of,<br />

or if the owner decides to sell the node to another owner, it’s<br />

important that the device and the keys it contains cannot be<br />

used to mount an attack on the network the node was taken<br />

from.<br />

A procedure for removing a node from a network is<br />

defined. The Provisioner application is used to add the node to<br />

a black list and then a process called the Key Refresh<br />

Procedure is initiated.<br />

The Key Refresh Procedure results in all nodes in the<br />

network, except those which are members of the blacklist,<br />

being issued with new network keys, application keys and<br />

all related, derived data. In other words, the entire set of<br />

security keys which form the basis for network and application<br />

security are replaced.<br />

As such, the node which was removed from the network,<br />

and which contains an old NetKey and an old set of AppKeys,<br />

is no longer a member of the network and poses no threat.<br />
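The essence of the procedure can be sketched in a few lines (an illustrative model only; the real Key Refresh Procedure also re-derives application keys and related data, and distributes keys over the secured mesh itself):

```python
# Sketch of the Key Refresh idea: new keys go to every node except
# those on the blacklist, so a removed node's old keys become useless.

import secrets

def key_refresh(nodes, blacklist):
    """Issue a fresh NetKey to all nodes not on the blacklist."""
    new_net_key = secrets.token_bytes(16)
    for node in nodes:
        if node["address"] not in blacklist:
            node["net_key"] = new_net_key
    return new_net_key

nodes = [{"address": a, "net_key": b"old_key_0123456!"} for a in (1, 2, 3)]
fresh = key_refresh(nodes, blacklist={2})

assert nodes[0]["net_key"] == fresh and nodes[2]["net_key"] == fresh
assert nodes[1]["net_key"] == b"old_key_0123456!"  # removed node keeps only stale keys
```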

E. Privacy<br />

A privacy key, derived from the NetKey is used to<br />

obfuscate network PDU header values, such as the source<br />

address. Obfuscation ensures that casual, passive<br />

eavesdropping cannot be used to track devices and the people<br />

that use them. It also makes attacks based upon traffic analysis<br />

difficult.<br />

The degree of privacy offered by this technique is fit for<br />

purpose.<br />

F. Replay Attacks<br />

In network security, a replay attack is a technique whereby<br />

an eavesdropper intercepts and captures one or more messages<br />

and simply retransmits them later, with the goal of tricking the<br />

recipient into carrying out something which the attacking<br />

device is not authorized to do. An example, commonly cited, is<br />

that of a car’s keyless entry system being compromised by an<br />

attacker who intercepts the authentication sequence between the<br />

car’s owner and the car, and later replays those messages to<br />

gain entry to the car and steal it.<br />

Bluetooth Mesh has protection against replay attacks. The<br />

basis for this protection is the use of two network PDU fields<br />

called the Sequence Number (SEQ) and the IV Index.<br />

Elements increment the SEQ value every time they publish a<br />

message. A node receiving a message from an element which<br />

contains a SEQ value less than or equal to that which was in<br />

the last valid message will discard it, since it is likely that it<br />

relates to a replay attack. IV Index is a separate field,<br />

considered alongside SEQ. IV Index values within messages<br />

from a given element must always be equal to or greater than<br />

the last valid message from that element.<br />
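The replay check just described reduces to a simple comparison of (IV Index, SEQ) pairs per source element, which can be sketched directly (source addresses and values below are arbitrary examples):

```python
# Sketch of the replay check described above: a message is accepted only
# if its (IV Index, SEQ) pair is strictly newer than the last valid pair
# seen from that source element.

def is_fresh(last_seen, src, iv_index, seq):
    """Return True and record the message if it is not a replay."""
    prev = last_seen.get(src)
    if prev is not None:
        prev_iv, prev_seq = prev
        if iv_index < prev_iv or (iv_index == prev_iv and seq <= prev_seq):
            return False  # replayed or stale: discard
    last_seen[src] = (iv_index, seq)
    return True

last_seen = {}
assert is_fresh(last_seen, src=0x0001, iv_index=0, seq=5)
assert not is_fresh(last_seen, src=0x0001, iv_index=0, seq=5)  # exact replay
assert not is_fresh(last_seen, src=0x0001, iv_index=0, seq=4)  # older SEQ
assert is_fresh(last_seen, src=0x0001, iv_index=1, seq=0)      # new IV Index
```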

VI. BLUETOOTH MESH IN ACTION<br />

A. Message Publication and Delivery<br />

A network which uses Wi-Fi is based around a central<br />

network node called a router, and all network traffic passes<br />

through it. If the router is unavailable, the whole network<br />

becomes unavailable.<br />

In contrast, Bluetooth Mesh uses a technique known as<br />

managed flooding to deliver messages. Messages, when<br />

published by a node, are broadcast rather than being routed<br />

directly to one or more specific nodes. All nodes receive all<br />

messages from nodes that are in direct radio range and, if<br />

configured to do so, will then relay received messages.<br />

Relaying involves broadcasting the received message again, so<br />

that other nodes, more distant from the originating node, might<br />

receive the message broadcast.<br />

B. Multipath Delivery<br />

An important consequence of Bluetooth’s use of managed<br />

flooding is that messages arrive at their destination via multiple<br />

paths through the network. This makes for a highly reliable<br />

network and it is the primary reason for having opted to use a<br />

flooding approach rather than routing in the design of<br />

Bluetooth mesh networking.<br />

C. Managed Flooding<br />

Bluetooth mesh networking leverages the strengths of the<br />

flooding approach and optimises its operation such that it is<br />

both reliable and efficient. The measures which optimise the<br />

way flooding works in Bluetooth mesh networking are behind<br />

the use of the term “managed flooding”. Those measures are as<br />

follows:<br />

i) Heartbeats<br />

Heartbeat messages are transmitted by nodes periodically.<br />

A heartbeat message indicates to other nodes in the network<br />

that the node sending the heartbeat is still active. In addition,<br />

heartbeat messages contain data which allows receiving nodes<br />

to determine how far away the sender is, in terms of the<br />

number of hops required to reach it. This knowledge can be<br />

exploited with the TTL field.<br />
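The hop-count estimate works because a heartbeat carries the TTL it was originally sent with, and each relay decrements the TTL in transit. A minimal sketch of the arithmetic:

```python
# Sketch of how a heartbeat lets a receiver estimate distance in hops:
# the heartbeat carries its initial TTL, each relay decrements TTL, and
# the receiver compares the two.

def hops_between(init_ttl: int, received_ttl: int) -> int:
    """Hops from sender to receiver, per the scheme described above."""
    return init_ttl - received_ttl + 1

# A heartbeat sent with TTL 7 that arrives with TTL 5 crossed two
# relays, i.e. three hops in total.
assert hops_between(7, 5) == 3
# A direct neighbour receives it with the TTL unchanged: one hop.
assert hops_between(7, 7) == 1
```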

ii) TTL<br />

TTL (Time To Live) is a field which all Bluetooth mesh<br />

PDUs include. It controls the maximum number of hops, over<br />

which a message is relayed. Setting the TTL allows nodes to<br />

exercise control over relaying and conserve energy, by<br />

ensuring messages are not relayed further than is required.<br />

Heartbeat messages allow nodes to determine what the<br />

optimum TTL value should be for each message published.<br />

iii) Message Cache<br />

A network message cache must be implemented by all<br />

nodes. The cache contains all recently seen messages and if a<br />

message is found to be in the cache, indicating the node has<br />

seen and processed it before, it is immediately discarded.<br />
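The cache check and the TTL decrement combine into a single relay decision, which can be sketched as follows (an illustrative model; real nodes cache a bounded set of recent message identifiers rather than an unbounded set):

```python
# Sketch combining two managed-flooding measures described above: the
# message cache (drop anything seen before) and the TTL decrement that
# bounds how far a message is relayed.

class RelayNode:
    def __init__(self):
        self.cache = set()  # identifiers of recently seen messages

    def handle(self, msg_id, ttl):
        """Return the TTL to relay with, or None to drop the message."""
        if msg_id in self.cache:
            return None       # already seen and processed: discard
        self.cache.add(msg_id)
        if ttl <= 1:
            return None       # TTL exhausted: process locally, do not relay
        return ttl - 1        # relay onward with decremented TTL

node = RelayNode()
assert node.handle("msg-1", ttl=3) == 2     # first sighting: relay with TTL 2
assert node.handle("msg-1", ttl=3) is None  # cached duplicate: dropped
assert node.handle("msg-2", ttl=1) is None  # TTL exhausted: not relayed
```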



iv) Friendship<br />

Probably the most significant optimisation mechanism in a<br />

Bluetooth mesh network is provided by the combination of<br />

Friend nodes and Low Power nodes. As described, Friend<br />

nodes provide a message store and forward service to<br />

associated Low Power nodes. This allows Low Power nodes to<br />

operate in a highly energy-efficient manner.<br />

D. Traversing the Stack<br />

A node, receiving a message, passes it up the stack from the<br />

underlying Bluetooth LE stack via the bearer layer to the<br />

network layer.<br />

The network layer applies various checks to decide whether<br />

or not to pass the message higher up the stack or to discard it.<br />

In addition, PDUs have a Network ID field, which provides<br />

a fast way to determine which NetKey the message was<br />

encrypted with. If the Network ID is not recognized by the network<br />

layer on the receiving node, this indicates it does not possess<br />

the corresponding NetKey, is not a member of that subnet and<br />

so the PDU is discarded. There’s also a network message<br />

integrity check (MIC) field. If the MIC check fails, using the<br />

NetKey corresponding to the PDU’s Network ID, then the<br />

message is discarded.<br />

Messages are received by all nodes in range of the node<br />

that sent the messages, but many will be quickly discarded<br />

when it becomes apparent they are not relevant to this node due<br />

to the network or subnet(s) it belongs to.<br />

The same principle is applied higher up the stack in the<br />

upper transport layer. Here though, the check is against the<br />

AppKey associated with the message, and identified by an<br />

application identifier (AID) field in the PDU. If the AID is<br />

unrecognized by this node, the PDU is discarded by the upper<br />

transport layer. If the transport message integrity check<br />

(TransMIC) fails, the message is discarded.<br />
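The sequence of checks in this section amounts to a short filter chain, sketched below. The key identifiers and MIC results are simplified placeholders; the real checks involve the cryptographic operations described earlier, not boolean flags:

```python
# Sketch of the checks described in this section: a receiving node
# discards a PDU as soon as a key lookup or integrity check fails.
# Identifiers and MIC results are simplified stand-ins.

def process_pdu(pdu, known_net_ids, known_app_ids):
    """Return where the PDU is accepted, or why it is dropped."""
    if pdu["network_id"] not in known_net_ids:
        return "dropped: unknown NetKey (not a member of this subnet)"
    if not pdu["net_mic_ok"]:       # stand-in for the network MIC check
        return "dropped: network MIC failure"
    if pdu["aid"] not in known_app_ids:
        return "dropped: unknown AppKey"
    if not pdu["trans_mic_ok"]:     # stand-in for the TransMIC check
        return "dropped: TransMIC failure"
    return "delivered to access layer"

pdu = {"network_id": 0x1A, "net_mic_ok": True, "aid": 0x05, "trans_mic_ok": True}
assert process_pdu(pdu, {0x1A}, {0x05}) == "delivered to access layer"
assert process_pdu(pdu, {0x2B}, {0x05}).startswith("dropped: unknown NetKey")
```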

VII. BLUETOOTH MESH - NEW FRONTIERS<br />

This paper should have provided the reader with an<br />

introduction to Bluetooth Mesh, its key capabilities, concepts<br />

and terminology. It’s Bluetooth but not as we know it. It’s a<br />

Bluetooth technology that supports a new way for devices to<br />

communicate using a new topology.<br />

Most of all, it’s Bluetooth that makes this most pervasive of<br />

low-power wireless technologies a perfect fit for a whole new<br />

collection of use cases and industry sectors.<br />

REFERENCES<br />

[1] Bluetooth SIG, Bluetooth Mesh Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[2] Bluetooth SIG, Bluetooth Mesh Model Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[3] Bluetooth SIG, Bluetooth 5 Core Specification<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />

[4] Bluetooth SIG, Bluetooth Core Specification Supplement<br />

See https://www.bluetooth.com/specifications/adopted-specifications<br />



Bluetooth Low Energy Solar Beacon as IoT Enabler<br />

Cecilia Höffler, Tobias Gemmeke<br />

Institute of Integrated Digital Systems and Circuit Design<br />

RWTH Aachen University<br />

Aachen, Germany<br />

hoeffler@ids.rwth-aachen.de<br />

This paper focuses on the accuracy issue of indoor navigation<br />

systems. Initially, key effects of RF signal transmission will be<br />

reviewed. On this basis, the received signal strength indicator<br />

(RSSI) based indoor navigation methods like triangulation and<br />

finger printing will be analyzed and their insufficiencies will be<br />

examined. This includes the RSSI deviation due to static and<br />

dynamic effects in the surroundings. A critical distance of RF<br />

sources for a high accuracy indoor positioning will be assessed.<br />

With this, the earlier mentioned inaccuracies can be<br />

significantly reduced. Finally, a hardware-based blueprint will<br />

be provided, which enables the deployment of a significant<br />

number of RF sources to realize the critical distance down to<br />

one RF source / m².<br />

Keywords— Bluetooth Smart; BLE; Bluetooth Low Energy;<br />

solar beacon sticker; Indoor Navigation; RSSI variances; heat map;<br />

triangulation<br />

I. INTRODUCTION<br />

Enabling the interaction between human beings and any kind of<br />

objects is a growing desire today. Starting with the<br />

‘Internet of Things’ (IoT), people now even refer to the<br />

‘Internet-of-Everything’ (IoE), where any physical object is<br />

linked with the digital world. One key enabler is<br />

linking a real-world scenario, such as the geometry of a room, with a<br />

digital navigation system.<br />

The focus of this paper will be on indoor navigation. The most<br />

common indoor navigation methods are based on triangulation<br />

or fingerprinting algorithms. Triangulation does not need a<br />

large database and is therefore very fast. But it suffers from a<br />

low accuracy compared to fingerprinting, due to the high<br />

dependence on the room geometry. To address the related<br />

short-coming there have been quite elaborate approaches for<br />

fingerprinting with a focus on static environments [1]. Since in<br />

a real-world set-up, dynamic effects (like the unpredictable<br />

movement of people) have to be considered, this kind of<br />

approach still lacks accuracy.<br />

Currently multiple RF technologies enable these indoor<br />

navigation solutions. The most widespread ones are based on<br />

Wi-Fi and BLE (Bluetooth Low Energy) [2]. These<br />

technologies need sensors and tags. Existing implementations<br />

of such active tags suffer from at least one of the following:<br />

form factor, range, cost of maintenance or cost of installation.<br />

These hurdles have prevented the many available<br />

technologies, such as active RFID, Bluetooth beacons and Wi-Fi<br />

sensors, from achieving a massive breakthrough in the IoT.<br />

To overcome these barriers, an ideal tag would be unobtrusive<br />

while enhancing the original purpose of the enabled object.<br />

Such a tag, being maintenance-free, can, for example, reduce<br />

labor cost by issuing usage-based preventive check-ups. In<br />

addition to this, it enables efficiency increases by providing<br />

detailed insights into the value generating processes. Based on<br />

our assessment with a fully integrated System-on-Chip (SoC)<br />

stripped to the bare minimum of necessary functionality, two<br />

differentiators could be achieved at the same time: a minimized<br />

Bill-of-Materials (BoM) and a self-sustained wireless IoT<br />

system purely powered by energy harvesting.<br />

The core IP for this SoC is presented in [3]. It allows the<br />

manufacturing of an ultra-low power, secured ‘transmit-only<br />

radio tag’ that combines subthreshold operation in the digital<br />

baseband and control processor, and an ultra-low power radio<br />

front-end. This SoC is intended to be powered by a single<br />

organic photovoltaic cell allowing the implementation of a<br />

flexible, bendable, sticker sized active Bluetooth beacon. This<br />

actually means that we are capable of combining the advantages<br />

of passive tags (zero maintenance and a small form factor) with<br />

the advantages of active tags (extended range and significant<br />

transmit power). By adopting the BLE standard this technology<br />

can easily be exploited with standard technologies as integrated<br />

in common smartphones and mobile devices to connect users<br />

with the capabilities of the digital world and the richness of their<br />

physical environment.<br />

II. INDOOR NAVIGATION METHODS<br />

Indoor navigation methods considered in this paper are<br />

triangulation and fingerprinting. Their theoretical approach will<br />

be briefly introduced. Furthermore, the limitations of these<br />

methods will be discussed based on simulation results and data<br />

taken from experimental measurements in real-life scenarios.<br />

The simulation is broken down into two steps. First, an ideal RF<br />

source will be simulated in a rectangular space in Matlab.<br />

Second, for a more realistic approach, a simulation is run in<br />

WinProp, a software suite for wave propagation and radio<br />

network planning. Finally, the results will be validated in an<br />

experimental set up, which is identical to the scenario assessed<br />

in the simulation in WinProp.<br />

In our scenario, the doors consist of glass with a thickness of<br />

10 cm and are 94 cm wide. The wall itself consists of brick and<br />

the ceiling is 2.45 m high. Two concrete columns are located in<br />

the middle of the room. The floorplan on the left of Figure 1<br />

shows the basic room without brick columns.<br />



The triangulation approach suffers various short-comings.<br />

Firstly, there is a high dynamic variance of the RSSI values of<br />

the individual RF sources as is highlighted in Figure 3 for<br />

different times and beacons of the identical type and transmit<br />

power setting.<br />

Figure 1 The lab set up (6 m wide and 9.5 m long) with brick walls and glass<br />

doors with a width of 0.94 m and wooden ceiling with a height of 2.45 m<br />

and 2 brick columns<br />

This simple scenario is used for the simulation in Matlab to<br />

show the reflection in an ideal indoor set up. The WinProp<br />

simulation and the experimental results are based on the<br />

described lab set up as shown the floorplan on the right of<br />

Figure 1. For the experimental validation five Minew i6 Sticker<br />

MiniBeacons were used as RF sources and placed at well-defined<br />

positions. The beacons advertised at 2.4 GHz with<br />

a transmit power P_t of 0 dBm. As receiver, we used a commercially available<br />

smartphone (Samsung A5) to capture all effects of the whole<br />

transmission channel. The receiver was moved in discrete steps<br />

away from the transmitters, lying on a line of sight<br />

orthogonal to the wall.<br />

III. RSSI-BASED LOCALIZATION<br />

A. Triangulation<br />

In free space, the power of the received signal P_r at a certain<br />

distance d to the signal source can be calculated based on the<br />

Friis formula [4] described in (1), with the transmission power<br />

P_t, the antenna directivities of the receiver and transmitter D_r<br />

and D_t, and the wavelength of the signal λ, as follows:<br />

P_r = P_t + D_r + D_t + 20 log10(λ / (4πd)) (1)<br />
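Inverting equation (1) gives the distance estimate that RSSI-based triangulation relies on, d = (λ / 4π) · 10^((P_t + D_r + D_t − P_r) / 20). A short sketch with a round-trip check (parameter names are ours; all powers and directivities in dB):

```python
# Inverting the Friis formula (1) to estimate distance from a received
# power (RSSI) value. Powers in dBm, directivities in dB, distance in m.

import math

def distance_from_rssi(p_r, p_t, d_r=0.0, d_t=0.0, wavelength=0.125):
    """Free-space distance implied by received power p_r, per equation (1)."""
    return (wavelength / (4 * math.pi)) * 10 ** ((p_t + d_r + d_t - p_r) / 20)

# Round trip: at d = 1 m, 0 dBm transmit power and isotropic antennas,
# equation (1) predicts P_r = 20*log10(0.125 / (4*pi)), about -40 dBm.
p_r = 0.0 + 20 * math.log10(0.125 / (4 * math.pi * 1.0))
assert abs(distance_from_rssi(p_r, p_t=0.0) - 1.0) < 1e-9
```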

Based on this equation the RSSI value can be used to calculate<br />

the exact distance to a certain RF source. The position of the<br />

receiver is determined as shown in Figure 2 with at least three<br />

received RF sources and their known transmit power level.<br />
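With three anchor positions and their estimated distances, the position fix reduces to intersecting circles; subtracting the circle equations pairwise yields a linear system for (x, y). A minimal sketch (anchor coordinates below are arbitrary example values loosely matching the lab room dimensions):

```python
# Sketch of the position fix from three RF sources: subtracting the
# circle equations (x-xi)^2 + (y-yi)^2 = di^2 pairwise linearizes the
# problem, leaving a 2x2 system for (x, y).

def trilaterate(anchors, distances):
    """Solve for (x, y) from three (xi, yi) anchors and distances di."""
    (x1, y1), (x2, y2), (x3, y3) = anchors
    d1, d2, d3 = distances
    a11, a12 = 2 * (x2 - x1), 2 * (y2 - y1)
    a21, a22 = 2 * (x3 - x1), 2 * (y3 - y1)
    b1 = d1**2 - d2**2 + x2**2 - x1**2 + y2**2 - y1**2
    b2 = d1**2 - d3**2 + x3**2 - x1**2 + y3**2 - y1**2
    det = a11 * a22 - a12 * a21
    return ((b1 * a22 - b2 * a12) / det, (a11 * b2 - a21 * b1) / det)

# Beacons at three corners of a 6 m x 9.5 m room, receiver at (2, 3):
anchors = [(0.0, 0.0), (6.0, 0.0), (0.0, 9.5)]
true_pos = (2.0, 3.0)
dists = [((x - true_pos[0])**2 + (y - true_pos[1])**2) ** 0.5 for x, y in anchors]
x, y = trilaterate(anchors, dists)
assert abs(x - 2.0) < 1e-6 and abs(y - 3.0) < 1e-6
```

In practice, the distances come from equation (1) and therefore carry the RSSI errors discussed next, which is why the recovered position is far less exact than in this ideal example.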

Figure 3 RSSI value measurement of different beacons on the same position<br />

More specifically, the RSSI values are shown for a distance of<br />

1 m with a P_t of each RF source being set to 0 dBm. Hence, the<br />

triangulation approach suffers from these hardware effects.<br />

Secondly, the results are based on the assumption of an ideal,<br />

i.e. free space, scenario. However, indoor environments are<br />

dominated by obstacles like walls or furniture. Such objects<br />

cause large scale effects like multi path propagation and<br />

shadowing [6]. These effects are visible in Figure 4, e.g., the<br />

shadowing can be seen in the green regions.<br />

Figure 2 Triangulation (with the yellow star as the object and the red blue<br />

dots as the RF sources with the black circles as their signal strength at a<br />

specific distance (Ref. [5]))<br />

Figure 4 Simulation in WinProp of 2.4 GHz RF source with ray tracing in<br />

line of sight<br />



Here, a so-called heat map is created with the fingerprinting<br />

algorithm. This heat map contains for each position in a given<br />

scenario the received signal strengths of the various transmitters<br />

(its fingerprint). Hence, the calculation of distance is no longer<br />

based on the Friis formula.<br />

B. Heat Map<br />

The test set up with its material parameters and geometries was<br />

evaluated in WinProp to simulate a heat map for the static case.<br />

This simulation takes large-scale effects into account. The multi<br />

path propagation can be seen in Figure 4. Here the ray tracing<br />

of a simulated RF source with 4 dBm P_t at 2.4 GHz is<br />

displayed. The signal strength is indicated by color coding over<br />

the whole area.<br />

The shadowing effects in Figure 4 are caused by the brick<br />

columns. These effects have to be taken into account when<br />

determining the receiver position based on the received signal<br />

strength. The shadowing effects are also visible by comparing<br />

the Channel Impulse Response (CIR) of Figure 5 and Figure 6.<br />

specific attenuation [7][8], these signals can be seen with a<br />

lower power envelope on the CIR graphs. All this contributes<br />

to the measured power at the position of the receiver. The<br />

received power in line of sight is significantly higher than the<br />

received power in the shadowed area. To neglect the shadowing<br />

effects, the line of sight will be used as a reference in the later<br />

experimental evaluation in section IV.<br />

IV. EXPERIMENTAL MEASUREMENT<br />

So far, the focus lay on the large-scale effects like multi path<br />

propagation and shadowing. But for an accurate position<br />

estimation also the small-scale effects have to be taken into<br />

account. They happen over the carrier wavelength λ, which is<br />

12.5 cm for a frequency of 2.4 GHz. These effects are due to<br />

the constructive and destructive interference of signals, which<br />

are caused by the earlier mentioned multi path propagation and<br />

therefore dependent on the individual room geometries [6].<br />

The experimental test set up is placed in the already introduced<br />

room in Figure 1. The comparison of the free space simulation<br />

(blue curve), the simulation of small scale effects due to the<br />

room geometries (black curve) and the experimental results of<br />

BLE beacons (blue dots) and omnidirectional antennas (red<br />

stars) are shown in Figure 7.<br />

Figure 5 Channel Impulse Response in line of sight with a distance of 5.5 m<br />

of 2.4 GHz RF Source<br />

Figure 7 signal power propagation with constructive and destructive<br />

interferences in line of sight<br />

Figure 6 Channel Impulse Response in shadow of brick column with<br />

distance of 5.5m to 2.4 GHz RF Source<br />

The absolute distance of the receiver to the RF source is in both<br />

cases the same. By comparing the CIR in both figures, it can be<br />

observed, that the line of sight signal arriving at 18 ns delay<br />

time is further attenuated from -53 dBm to -75 dBm. Since the<br />

signal of the RF source is distributed in a radial way, there are<br />

reflected rays from the walls, which contribute to the multi path<br />

propagation. These signals have a different time of arrival, due<br />

to the longer distances to the receiver and therefore propagation<br />

time. Since the reflection on the walls contain a material<br />

Here the wall is simulated with a reflection coefficient of 0.35<br />

according to [8]. The free space simulation was calibrated with<br />

experimental results measured in a low reflection room of two<br />

omnidirectional antennas with a signal of 2.44 GHz and 0 dBm<br />

P t.<br />

The comparison of the free space simulation to the simulation<br />

in the lab set up confirm the simulation results. The measured<br />

RSSI values of the BLE beacons are significantly lower than<br />

the idealized case. This is attributed to the lower gain of the<br />

antenna in the beacons and the smart phone. However, exact<br />

antenna directivities D r and D t are unknown. But this offset can<br />

be resolved by calibration of the positioning application. These<br />

interferences lead to a non-monotonic decrease in RSSI values<br />

that is identical RSSI values are obtained at different distances.<br />



This lack of uniqueness in the RSSI value worsens with increasing distance from the RF source (starting at 1.5 m), resulting in reduced accuracy for indoor positioning. A correlation of the RSSI value with the distance between receiver and transmitter thus becomes impossible, due to the ambiguity of the RSSI values for distances above 1.25 m.<br />

For a qualitative analysis of the RSSI values of the BLE beacons mentioned in Figure 7, their distribution is plotted in Figure 8 as a function of the distance between the phone and the beacon. The box plot depicts the median as a red line and, with the whiskers, the variability outside the upper and lower quartiles. Outliers are plotted as red stars.<br />

Figure 8 Decrease of the RSSI value of one beacon depending on the distance<br />

The spread of the RSSI values at a given distance to the RF source is shown in Figure 8. It becomes clear that the spread increases with the distance. This is visualized by a tunnel which widens dramatically at a distance of 1.25 m. The significance of a given RSSI value decreases with the distance between receiver and transmitter. This effect is striking at 1.25 m and extends to higher distances. Hence 1 m will be taken as the critical distance for reliable position determination. The mean RSSI values for distances under 1 m are in the expected range and lie on the fitted curve (blue circles) in Figure 7.<br />

A critical distance of 1 m to an RF source leads to the need for one RF source per m², so that the distance of the receiver to the nearest RF source never exceeds the critical distance. This implies a high density of RF sources. To enable this high density, the requirements for RF sources are examined in the next section.<br />

V. FEASIBILITY OF SUGGESTED SOLUTIONS<br />

Even though all measurements and simulations mentioned earlier were performed in a static environment (no humans or movable objects were present), the unreliability of the correlation between distance and signal strength is already prominent for distances above 1 m. Hence it will increase further in a dynamic environment. An algorithmic approach alone cannot solve this issue; neither a heat map nor triangulation is a sufficient solution. However, the results in sections II and IV indicate that a high density of RF sources can enable a highly accurate algorithmic approach. The prerequisite for such a high RF density is a low-cost hardware solution, so currently available technologies will be compared in the following.<br />

A. Requirements for RF Sources for Indoor Navigation Applications<br />

The 3 m accuracy of GPS for outdoor navigation falls short for indoor applications due to shielding by walls and ceilings. Therefore another technical solution needs to be applied. As mentioned earlier, the most common ones are based on either BLE or Wi-Fi as RF technology. BLE beacon technology comes in two flavors: battery powered or solar powered. For the Wi-Fi solution, additional access points are needed to support the router and achieve full coverage. In the following, all RF sources will be called beacons; their specification, such as Wi-Fi or BLE, will be added where needed.<br />

The comparison of these technologies takes the earlier proposition of one RF source per m² into account. The RF technologies are compared in Figure 9 regarding their total cost of ownership (TCO) and quality of service. The TCO comprises the purchase cost as well as deployment and maintenance cost and time. The quality of service focuses on latency and on the energy consumption of the phone: Wi-Fi consumes about twice as much smartphone energy as BLE [9], [10], which makes the BLE solution preferable for the user. The analysis in Figure 9 is based on the assumption of an average building of 500 m².<br />

Figure 9 Comparison of Wi-Fi, battery powered and autarkic solar powered BLE beacons for indoor navigation<br />



The maintenance time for a beacon includes the time for finding missed beacons, which are no longer advertising, and for exchanging the battery. This is assumed to take 12 min on average. The average lifetime of a battery is expected to be 6 years.<br />

This comparison makes the superiority of the autarkic BLE beacon over the Wi-Fi and battery-powered BLE beacons distinct. BLE beacons have a fifth of the power consumption of Wi-Fi beacons. The deployment cost of Wi-Fi is almost 10 times higher than that of BLE beacons, due to the additional electrical cabling. Battery-powered BLE beacons need a significantly higher amount of maintenance time, approximately 133 h per year for battery replacements, and therefore accumulate maintenance costs of over 2900 € per year. The operation costs for Wi-Fi (visualized as maintenance cost in Figure 9) consist only of the cost of the energy consumed by its access points.<br />

The total cost of ownership (TCO) consists of the deployment and maintenance costs, including material and labor costs according to German standards. The TCO for Wi-Fi is 27700 € and for battery-powered beacons 5535 €. The TCO of the autarkic solar-powered beacon is significantly lower at 2100 €.<br />

This comparison shows the clear superiority of the autarkic beacon, which leads to the question of how autarkic BLE beacons can be realized and which requirements have to be met to achieve this energy autonomy.<br />

B. Requirements for autarkic beacon<br />

The major cost factor in an autarkic BLE beacon is its power source. Although there are multiple energy harvesting options, neither piezoelectric nor temperature-based energy harvesting generates the 48 µW needed for the advertisement of BLE packets in an average use case scenario. The only viable source is therefore a solar cell [11]. The driving cost factor of the solar cell is its active area. It can only be reduced by either increasing the available light in the surrounding area or reducing the overall power consumption of the device.<br />

It can be assumed that the main current of a BLE beacon is consumed by the SoC [12]. Therefore the current consumption of a well-performing state-of-the-art beacon is examined and taken as the baseline for the required energy budget [13].<br />

The usage of beacons for indoor navigation is based on a minimum advertisement interval of 683 ms for an average walking speed of 1.462 m/s [14] and a beacon density of 1 beacon per m². The following calculation is therefore based on an advertisement interval of 683 ms and a duty cycle of 4.5 ms (Figure 10).<br />
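The 683 ms figure can be sanity-checked from the walking speed and the beacon spacing. This back-of-the-envelope sketch (variable names are illustrative) assumes a pedestrian must receive at least one advertisement per metre travelled, which lands at ~684 ms, close to the 683 ms used above.<br />

```python
# Back-of-the-envelope check (assumed model): with one beacon per m²,
# a pedestrian should receive at least one advertisement while
# travelling 1 m past a beacon.
walking_speed = 1.462   # m/s, average walking speed [14]
beacon_spacing = 1.0    # m, from the 1 beacon / m² density

max_interval = beacon_spacing / walking_speed  # maximum interval in s
print(round(max_interval * 1000))  # -> 684 (ms)
```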

The BLE beacon in [13] operates with a 3.3 V supply voltage. Since P_t was 0 dBm in section IV, the analyzed beacon in Figure 10 also advertises with 0 dBm. The influence of a beacon's P_t on the deviation of the RSSI values is not addressed in this paper and can be analyzed in further experiments. It can be assumed that a significantly lower P_t would allow further improvements in the power management.<br />

Figure 10 Oscilloscope measurement of the BLE beacon during advertisement (voltage drop over a 10 Ω load resistance)<br />

The current profile in Figure 10 consists of four segments. The first segment is the charge-up of the capacitor. It is worth mentioning that the size of this capacitor presents a trade-off between the supply noise for each advertisement and the time and power lost charging it. Here a maximum current of 7 mA is measured. The second segment is the wake-up of the transmitter, with an average current of 1.9 mA. The third shows the three advertisements on channels 37, 38 and 39; its average current is 3.6 mA. The fourth segment represents the shutdown, with an average current of 1.9 mA. With a sleep current of 1.28 µA (Figure 11), this best-in-class commercial SoC requires a total energy of 38.4 mJ per advertisement cycle of 687.5 ms.<br />
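The per-cycle energy bookkeeping implied by this profile can be sketched as follows. The individual segment durations are illustrative assumptions (the text gives only the 4.5 ms duty cycle and the 687.5 ms cycle length), so this shows the shape of the calculation rather than reproducing the stated 38.4 mJ figure.<br />

```python
# Energy bookkeeping for one advertisement cycle (segment durations
# are assumed for illustration; only their 4.5 ms sum is given above).
V = 3.3  # supply voltage in volts

segments = [  # (average current in A, assumed duration in s)
    (7.0e-3, 0.5e-3),   # 1: capacitor charge-up (7 mA peak)
    (1.9e-3, 1.0e-3),   # 2: transmitter wake-up
    (3.6e-3, 2.0e-3),   # 3: advertising on channels 37, 38, 39
    (1.9e-3, 1.0e-3),   # 4: shutdown
]
t_cycle = 687.5e-3                              # full cycle length in s
t_active = sum(t for _, t in segments)          # 4.5 ms duty cycle
e_active = V * sum(i * t for i, t in segments)  # active energy in J
e_sleep = V * 1.28e-6 * (t_cycle - t_active)    # sleep energy in J

print(e_active + e_sleep)  # energy per cycle in joules
```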

Figure 11 Oscilloscope measurement of the BLE beacon during sleep mode (voltage drop over a 10 kΩ load resistance)<br />



The lighting conditions in a dim hallway, with 200 lx, will be taken as the worst-case scenario. Since the light intensity is an external factor, the only variable besides the active area of the PV cell is the power consumption of the BLE beacon. According to [2], most of the current consumption takes place while advertising, so an approach for lower power consumption could be to increase the advertisement interval. Since the sleep current cannot be neglected, however, its share of the power consumption grows as the advertising interval is extended. A possible tradeoff could be an advertisement interval of 1 s, with which the average power consumption can be reduced to 34 µW. With the power available from various solar cells under these lighting conditions, the price of the solar cell per beacon is in the range of $1.76 to $2.5.<br />
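The interval-versus-power tradeoff can be cross-checked with a simple assumed model: average power equals the active energy per advertisement divided by the interval, plus the constant sleep power. Backing out the active energy from the 48 µW figure at the 687.5 ms interval reproduces roughly the 34 µW stated for a 1 s interval.<br />

```python
# Assumed model: P_avg = E_active / interval + P_sleep.
# The 48 µW at the 687.5 ms interval is used to back out E_active.
V, i_sleep = 3.3, 1.28e-6          # supply voltage (V), sleep current (A)
p_sleep = V * i_sleep              # ~4.2 µW constant sleep floor

p_avg_683 = 48e-6                  # W, stated average power at 687.5 ms
e_active = (p_avg_683 - p_sleep) * 0.6875  # J per advertisement event

p_avg_1s = e_active / 1.0 + p_sleep        # average power at 1 s interval
print(round(p_avg_1s * 1e6, 1))    # -> 34.3 (µW), matching the stated 34 µW
```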

C. Proprietary Radio SoC<br />

The energy consumption can be decreased with a proprietary SoC in multiple ways. Firstly, an immediate shutdown of the active part after advertising can be realized, which leads to a total energy reduction of 21.2 %. Secondly, the wake-up time can be reduced to 100 µs through faster locking of the clock oscillator. With this the total energy consumption is reduced to 24.6 mJ, a reduction of 35.9 %. Finally, a significant decrease in the third part of the current profile in Figure 10 leads to an overall reduction of 51.5 %. This is based on an ultra-low-power oscillator whose low current profile builds on the results of [3]. Its current consumption of 65 nA at an accuracy of 420 ppm is outstanding compared to the 120 nA of the oscillator in the best-in-class commercial SoC [12]. This is achieved with a crystal-free oscillator. In summary, the total energy consumption for one advertisement interval of 687.5 ms can be reduced to 18.6 mJ. In addition, the proprietary radio SoC reduces the external component count by a factor of three compared to the commercial SoC, since its oscillator does not need a dedicated RF crystal. This reduces the bill of materials significantly.<br />
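The quoted percentages can be cross-checked against the 38.4 mJ baseline of the commercial SoC; this is plain arithmetic verification, not new data.<br />

```python
# Cross-check the quoted savings against the 38.4 mJ commercial baseline.
baseline_mj = 38.4  # mJ per 687.5 ms advertisement cycle (commercial SoC)

checks = [
    (0.359, 24.6),  # faster oscillator lock-in: stated 24.6 mJ
    (0.515, 18.6),  # plus ultra-low-power oscillator: stated 18.6 mJ
]
for reduction, stated_mj in checks:
    computed = baseline_mj * (1 - reduction)
    assert abs(computed - stated_mj) < 0.1  # consistent to rounding
print("stated reductions are consistent with the 38.4 mJ baseline")
```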

VI. CONCLUSION<br />

This study details the reliability issues of state-of-the-art RSSI-based indoor positioning methods such as triangulation and heat-map fingerprinting. The main obstacle is the RSSI variation caused by large- and small-scale effects. The analysis of the experimental and simulated RSSI values suggests a maximal distance up to which the received signal strength can still be used for localization. Based on the quantitative results, we find a critical distance of 1 m for reliable indoor navigation with RF sources. The implementation is finally discussed considering the total cost of ownership, including the energy consumption, of various RF technologies. In consequence, the best-fitting beacon technology is based on the BLE standard using a dedicated SoC that enables autarkic PV-powered operation even under dim light conditions.<br />

VII. ACKNOWLEDGMENT<br />

The authors would like to thank Prof. Dr.-Ing. Heberling and Jörg Pamp (Institute of High Frequency Technology, RWTH Aachen) as well as Prof. Dr. Heinen and Markus Scholl (Chair of Integrated Analog Circuits and RF Systems, RWTH Aachen) for their support.<br />

VIII. REFERENCES<br />

[1] R. Faragher and R. Harle, “An Analysis of the<br />

Accuracy of Bluetooth Low Energy for Indoor Positioning<br />

Applications,” Proc. 27th Int. Tech. Meet. Satell. Div. Inst.<br />

Navig. (ION GNSS+ 2014), pp. 201–210, 2014.<br />

[2] Bluetooth Special Interest Group, “Specification of the<br />

Bluetooth System Covered Core Package Version 4.2,”<br />

Dec. 2014.<br />

[3] M. Scholl, Y. Zhang, R. Wunderlich, and S. Heinen,<br />

“A 80 nW, 32 kHz charge-pump based ultra low power<br />

oscillator with temperature compensation,” Eur. Solid-State<br />

Circuits Conf., vol. 2016–Octob, pp. 343–346, 2016.<br />

[4] H. T. Friis, “A Note on a Simple Transmission<br />

Formula,” Proc. IRE, vol. 34, no. 5, pp. 254–256, 1946.<br />

[5] J. Hightower, G. Borriello, and R. Want, “SpotON: An<br />

indoor 3D location sensing technology based on RF signal<br />

strength,” Uw Cse, no. March 2000, p. 16, 2000.<br />

[6] K. N. P. A. Kushki, “Indoor Positioning with Wireless<br />

Local Area Networks (WLAN),” Encycl. GIS, pp. 469–469,<br />

2008.<br />

[7] P. Ali-Rantala and M. Keskilammi, “Indoor<br />

Propagation of Bluetooth Waves, Effect of Distance on<br />

Bluetooth Data Transmission, and Simulation of Wave<br />

Propagation,” Citeseer, pp. 1–4.<br />

[8] T. Koppel, “Reflection and Transmission Properties of<br />

Common Construction Materials at 2.4 GHz Frequency,”<br />

Elsevier, vol. 113, pp. 158–165, 2017.<br />

[9] A. Lindemann, B. Schnor, J. Sohre, and P. Vogel,<br />

“Indoor positioning: A comparison of WiFi and Bluetooth Low<br />

Energy for region monitoring,” Heal. 2016 - 9th Int. Conf. Heal.<br />

Informatics, Proceedings; Part 9th Int. Jt. Conf. Biomed. Eng.<br />

Syst. Technol. BIOSTEC 2016, vol. 5, no. Biostec, pp. 314–<br />

321, 2016.<br />

[10] G. D. Putra, A. R. Pratama, A. Lazovik, and M. Aiello,<br />

“Comparison of energy consumption in Wi-Fi and bluetooth<br />

communication in a Smart Building,” 2017 IEEE 7th Annu.<br />

Comput. Commun. Work. Conf. CCWC 2017, no. 2014, 2017.<br />

[11] D. Niyato, E. Hossain, M. Rashid, and V. Bhargava,<br />

“Wireless sensor networks with energy harvesting<br />

technologies: a game-theoretic approach to optimal energy<br />

management,” IEEE Wirel. Commun., vol. 14, no. 4, pp. 90–<br />

96, 2007.<br />

[12] J. Bernegger and M. Meli, “Comparing the energy<br />

requirements of current Bluetooth Smart solutions,” no.<br />

February, pp. 1–23, 2014.<br />

[13] Atmel, ATBTLC1000 WLCSP SoC DATASHEET.<br />

2016, pp. 1–52.<br />

[14] R. W. Bohannon, “Comfortable and maximum<br />

walking speed of adults aged 20-79 years: Reference values and<br />

determinants,” Age Ageing, vol. 26, no. 1, pp. 15–19, 1997.<br />



Artificial Neural Networks Unleash New<br />

Possibilities for Edge Intelligence<br />

Hussein Osman<br />

Lattice Semiconductor: Product Marketing Manager<br />

San Jose, CA, U.S.<br />

hussein.osman@latticesemi.com<br />

Abstract— The rapidly growing area of artificial intelligence<br />

(AI), Neural Networks (NNs) and Machine Learning offers<br />

tremendous promise as developers attempt to bring higher levels<br />

of intelligence to their systems. Engineers have been using NNs<br />

as a paradigm to implement systems that can learn and infer<br />

based on learning. Computational requirements for such systems<br />

vary widely depending upon application. Traditionally designers<br />

using deep learning techniques and floating-point math in the<br />

data center have relied on high performance and power-hungry<br />

GPUs to meet demanding computational requirements. Designers<br />

extending AI to the edge don’t have the luxury of using power-hungry GPUs. Instead they must develop computationally efficient systems that not only meet accuracy targets, but also<br />

comply with the power, size and cost constraints of the consumer<br />

market.<br />

This paper will review on-device artificial intelligence which<br />

uses NN models to compare new incoming data against a stored model and infer results. On-device AI dramatically improves user privacy by processing data locally rather than sending it back to the cloud. In addition, the paper will evaluate how technologies<br />

such as Field Programmable Gate Arrays (FPGAs) can make<br />

edge computing possible and how they can be used to optimize<br />

parallel computing. It will also explore the intelligence these low<br />

power technologies bring to battery-powered applications. Using<br />

a design example, the article will also examine how building AI<br />

into an FPGA running an open-source RISC-V processor with accelerators can dramatically reduce power consumption while shortening response time and improving security.<br />

Keywords—Artificial Intelligence, AI, Artificial Neural<br />

Networks, ANN, neural networks, machine learning, Intelligence at<br />

the Edge, Edge Intelligence, RISC-V<br />

I. ON-DEVICE AI<br />

Over the last several years Neural Networks (NNs) have<br />

become an increasingly common paradigm for engineers<br />

utilizing machine learning (ML) techniques. In these<br />

applications engineers use NNs to implement systems that can<br />

continuously learn and infer, based on that learning.<br />

Traditionally these deep learning techniques are employed in<br />

the data center in systems built around large, high<br />

performance GPUs to meet highly demanding computational<br />

requirements. In these applications systems store data and run<br />

arithmetic functions in the cloud where the use of escalating<br />

levels of power is less of a design obstacle.<br />

Recently, however, demand has been building for new ways<br />

to extend these capabilities to edge applications. Ranging from<br />

smart TVs, and security systems to intelligent doorbells and<br />

self-driving vehicles, a rising number of new applications<br />

require a more immediate response than cloud-based systems<br />

can provide. On the edge the deep learning techniques that use<br />

floating-point math in the data center are impractical. Instead<br />

designers must develop more computationally-efficient<br />

solutions that not only meet stringent accuracy targets, but<br />

also comply with the power, size and cost constraints of the<br />

consumer market.<br />

While designers can use complex machine learning<br />

techniques during training in the data center, once a device<br />

moves to the edge, it must perform inferences using arithmetic<br />

that use as few bits as possible. Designers can simplify<br />

computation by switching from floating-point to fixed-point<br />

math or, ideally, basic integers. By altering training to<br />

compensate for the quantization of floating-point to fixed-point integers, they can develop solutions that train faster with high accuracy and push the performance of fixed-point/low-precision-integer NNs close to those using floating-point math. To build the simplest edge devices, however,<br />
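The float-to-fixed-point step can be illustrated with a minimal symmetric int8 quantizer; the scaling choice here is a simplifying assumption, not the training-time scheme the text alludes to.<br />

```python
import numpy as np

# Illustrative sketch of quantization: mapping floating-point weights to
# 8-bit signed fixed-point values with a simple symmetric scale.
def quantize_int8(weights):
    scale = np.max(np.abs(weights)) / 127.0   # map the largest weight to 127
    q = np.clip(np.round(weights / scale), -128, 127).astype(np.int8)
    return q, scale

w = np.array([0.5, -1.27, 0.003, 1.27])
q, s = quantize_int8(w)
print(q)      # int8 approximation of the weights
print(q * s)  # dequantized values, close to the originals
```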

training must produce NN models with 1-bit weights and<br />

activations. These models are called Binarized Neural<br />

Networks (BNNs).<br />

BNNs eliminate the use of multiplication and division by<br />

using 1-bit values instead of larger numbers during runtime.<br />

This allows the computation of convolutions using just<br />

addition. Since multipliers consume more space and power<br />

than other components in a digital system, replacing them with<br />

addition offers significant power and cost savings. But as<br />

demand builds for more intelligence on the edge, how must<br />

the use of BNNs change to meet these requirements?<br />

To address this need, designers require on-device AI solutions in which the heavy machine learning (training) is still performed in the cloud, but only infrequently, perhaps once a year. To deliver a quick response, these solutions store a template locally and compare newly collected data against that template to perform inferencing. On-device AI solutions not only reduce power, cost and product footprint; by eliminating the transfer of data back to the cloud, they also offer improved security and reliability.<br />

How is the template created? Take, as an example, a TV<br />

designed to automatically turn off and save power when it<br />

does not detect a person in the room. As long as the TV<br />

www.embedded-world.eu<br />



detects a face in the room, it stays on. When it doesn’t, it<br />

powers off. In this case a template for a facial detection<br />

solution would be created in the cloud by comparing 100,000<br />

images of faces from around the world. This data is then sent<br />

from the cloud to the edge application and stored in a Field<br />

Programmable Gate Array (FPGA) or Microcontroller (MCU).<br />

The model or template is sent to the AI device in the form of<br />

weights and activations. Typically, it is stored in the internal<br />

memory of the device. A sensor directly connected to the AI<br />

device collects raw images constantly. These raw images are<br />

continually compared to the template in the AI device using<br />

the computational resources on the device to perform<br />

inferencing.<br />

Applications of this type do not have to perform the<br />

complex calculations associated with a facial recognition<br />

function. But by simply performing a facial detection function<br />

and turning off the TV when no one is present, designers can<br />

add significant new capabilities to dramatically reduce power<br />

consumption. Similar applications might include security<br />

systems that can detect whether the movement in a house was<br />

a person, a pet, or a shadow, or a doorbell that automatically<br />

rings when a person approaches the front door.<br />

II. BINARIZED NEURAL NETWORKS<br />

A recent collaboration between Lattice Semiconductor and<br />

VectorBlox Computing, a developer of high performance,<br />

soft-core processors for embedded applications, illustrates the<br />

advantages Binarized Neural Networks offer. Binarized<br />

Neural Networks reduce memory requirements by eliminating<br />

the use of multiplications and divisions and allowing the<br />

computation of convolutions using just additions and<br />

subtractions.<br />

VectorBlox needed the hardware to run their machine<br />

learning algorithms to perform inferencing at the edge. But it<br />

also needed a solution that could deliver high performance at<br />

low power. To accomplish this task Lattice Semiconductor<br />

proposed the use of its iCE40 UltraPlus Field<br />

Programmable Gate Arrays (FPGAs). The iCE40 UltraPlus is<br />

a highly energy-efficient solution for repetitive number<br />

crunching. With 8 hardened DSP blocks, highly flexible I/Os<br />

and increased memory for buffering, it offers an attractive<br />

platform for building intelligent IoT edge products.<br />

In CNN-based machine learning the compute kernel is the<br />

convolution kernel where a 3x3 window of weights is<br />

multiplied with input data and then sum-reduced into a scalar<br />

result. Input values, weights and results typically use the<br />

floating-point system. Recent implementations like that<br />

described in “BinaryConnect: Training Deep Neural Networks<br />

with Binary Weights During Propagations” by M.<br />

Courbariaux and Y. Bengio and J.P. David eliminate<br />

multiplication by using binary weights to represent +1 or -1.<br />
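The add/subtract trick can be shown with a toy 3×3 convolution step. This sketch is purely illustrative (it is not the Lattice/VectorBlox implementation): with weights constrained to +1/-1, each product degenerates to an addition or a subtraction.<br />

```python
import numpy as np

# Toy binarized 3x3 convolution step: weights of +1/-1 turn each
# multiply-accumulate into a plain add or subtract.
def binary_conv3x3(window, weights):
    """window: 3x3 input activations; weights: 3x3 entries of +1/-1."""
    acc = 0
    for x, w in zip(window.ravel(), weights.ravel()):
        acc += x if w > 0 else -x   # no multiplication needed
    return acc

win = np.arange(9, dtype=np.int32).reshape(3, 3)
w = np.array([[1, -1, 1], [-1, 1, -1], [1, -1, 1]], dtype=np.int8)
print(binary_conv3x3(win, w))  # -> 4, same result as (win * w).sum()
```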

To improve performance engineers at Lattice and<br />

VectorBlox made three enhancements to the BinaryConnect<br />

approach. First, they shrunk the network structure in half by<br />

moving from<br />

(2×128C3) – MP2 – (2×256C3) – MP2 – (2×512C3) – MP2 – (2×1024FC) – 10SFC<br />

to<br />

(2×64C3) – MP2 – (2×128C3) – MP2 – (2×256C3) – MP2 – (2×256FC) – 10SFC<br />

where C3 is the 3x3 ReLU convolution layer, MP2 is a 2 x 2<br />

max-pooling layer and FC is a fully connected layer. At that<br />

point they optimized the network by using 8-bit signed fixed-point values for all input data. Accumulation used 32-bit signed data to prevent overflow, which was then saturated to 8 bit before the next layer.<br />

Fig. 1: The Binarized CNN structure presented a 10.8% error rate.<br />

Secondly, the designers implemented a hardware accelerator<br />

for the binarized neural network. Then they used the<br />

accelerator as an ALU in the ORCA soft RISC-V processor.<br />

They enhanced the ORCA processor with a custom set of<br />

lightweight vector extensions (LVE). By streaming the matrix<br />

data through the RISC-ALU, the LVE reduced or eliminated<br />

loop, memory access and address generation overhead and<br />

improved the efficiency of matrix operations. A CNN<br />

accelerator was added as a custom vector instruction (CVI)<br />

(see figure 2) to the LVE to further improve operation.<br />

Fig. 2: A Binarized custom vector instruction boosted performance.<br />

The third and final modification in the project was the<br />

addition of an augmented RISC-V processor in the iCE40<br />

UltraPlus FPGA. To perform inferencing at the edge designers<br />

needed a solution that offered a highly parallel architecture<br />

capable of performing a large number of similar arithmetic<br />

operations at low power. One of the reasons the team chose<br />

the iCE40 UltraPlus FPGAs was because they offer very<br />

flexible I/Os to connect to the image sensors and logic<br />

resources needed to down scale and manipulate the captured<br />

image data. The FPGAs also feature 8 hardened DSP blocks<br />

314


that the developers could dedicate to more complex algorithms<br />

as well as 1 Mbit of on-chip memory which could be used to<br />

buffer data longer in low power states. The LVE operates<br />

directly on 128 Kb of scratchpad RAM that has been triple<br />

overclocked to supply two reads and one write per CPU clock.<br />

Binary weights are stored in internal RAM, so the DMA<br />

engine can efficiently transfer those values into the scratchpad<br />

and steal cycles from the CPU if any LVE operations are<br />

running.<br />

The development team used Lattice’s iCE40 UltraPlus<br />

mobile development platform to prototype and test their<br />

design. Proof-of-concept demos helped engineers quickly<br />

develop drivers and interfaces. The platform featured a 1x<br />

MIPI DSI interface up to 108 Mbps, 4x microphone bridging<br />

and a variety of sensors. The FPGA can be programmed using<br />

an on-board SPI flash memory or the USB port.<br />

The team created a person detector by training a 10-category<br />

classifier with a modified CIFAR-10 dataset that replaced deer<br />

images with duplicated images from the people superclass in<br />

CIFAR-100. To maximize performance, the team reduced the<br />

network structure further and trained a new 1-category<br />

classifier using a proprietary database of 175K images<br />

including human facial images of various ages, ethnicities, as<br />

well as people wearing glasses and hats.<br />

III. COMPACT, LOW POWER SOLUTION<br />

Operating at 24 MHz this compact CPU was implemented<br />

in 4,895 of the iCE40 UltraPlus 5K’s 5,280 4-input LUTs. It<br />

also uses four of the FPGA’s eight 16x16 DSP blocks, 26 of<br />

30 4kb (0.5kB) BRAM, and all four 32 kB SPRAM. The<br />

proposed solution can support up to 8-layer deep NNs inside a<br />

single FPGA.<br />

The accelerator on the ORCA RISC-V improves runtime<br />

of convolution layers by 73X, while the LVE improves<br />

runtime of dense layers by 8X. Use of the iCE40 UltraPlus<br />

with the accelerators results in an overall increase in speed of<br />

approximately 71X.<br />

The 1-category classifier runs in 230 ms with 0.4% error<br />

and consumes 21.8 mW. A power-optimized version designed<br />

to run at one frame/second consumes just 4.4 mW. Error rates<br />

were attributed primarily to training, not to reduced precision. In<br />

effect, thanks to the impact of the accelerator implemented in<br />

the FPGA fabric, this FPGA-based solution offers the<br />

performance of a 1.7 GHz processor in the power envelope of<br />

a 24 MHz processor.<br />

IV. CONCLUSION<br />

With analysts at the Gartner Group predicting up to 80<br />

percent of all smartphones will feature on-device AI<br />

capabilities by 2022, demand is clearly building for more<br />

intelligence at the edge. The challenge for designers lies in<br />

finding the best technology to build highly resource-efficient<br />

solutions.<br />

REFERENCES<br />

[1] M. Courbariaux, Y. Bengio, and J.-P. David, “BinaryConnect: Training Deep Neural Networks with Binary Weights during Propagations,” in Advances in Neural Information Processing Systems 28 (NIPS 2015), C. Cortes, N. D. Lawrence, D. D. Lee, M. Sugiyama and R. Garnett, Eds., pp. 3123-3131, Curran Associates, Inc., 2015.

[2] G. Lemieux, J. Edwards, J. Vandergriendt, A. Severance, R. De Iaco, A. Raouf, H. Osman, T. Watzka, and S. Singh, “TinBiNN: Tiny Binarized Neural Network Overlay in about 5,000 4-LUTs and 5mW,” 3rd International Workshop on Overlay Architectures for FPGAs (OLAF), Feb 2017.

Fig. 3: The TinBiNN solution was implemented in an iCE40 UltraPlus FPGA.



olOne: Artificial Intelligence on Chip<br />

How Industry 4.0 would benefit from a new approach to AI miniaturization<br />

Marco Calabrese, Claudio Martines<br />

Holsys S.r.l.<br />

Taranto, Italy<br />

m.calabrese@holsys.com, c.martines@holsys.com<br />

Abstract—Embedded systems are used in a wide span of contexts, from industrial processes to consumer appliances. With the prospective growth of Industry 4.0, which promises to enlarge the spectrum of embedded applications to disparate settings such as real-time anomaly detection, predictive maintenance, self-diagnostics and so on, manufacturers will be compelled to embed more intelligence into their products. Today, machine learning technologies often require heavy Cloud infrastructures, the availability of datasets for training and testing, long time to market and advanced skills, all elements that, taken together, may hinder investment decisions. As an answer to the growing need for ready-to-deploy solutions, we present olOne, the first Artificial Intelligence development environment able to bring real-time interpretation of raw sensor data onto commonly used microcontrollers in a few steps and without any specific data science skills.

Keywords—Industry 4.0; real-time sensor data processing; cyber-physical systems; holons; granular computing; computing with words

I. INTRODUCTION<br />

Industry 4.0 can be defined as the embedding of advanced<br />

cyber-physical systems (CPS) [1] into digital and physical<br />

processes [2].<br />

The rapid development of smart sensing technologies, coupled with the increasing number of Internet of Things (IoT) connected devices, lets industrial machines and smart products generate a staggering amount of data on a daily basis. Even the small amounts of raw data produced at a constant rate by each single IoT device become Big Data when gathered over a long period of time and summed over the entire installed base of devices. As a result, CPS require Big Data analytics and Machine Learning (ML) features to transform these data into meaningful information, thus enabling high-value services for the end customers [3].

If raw data are analyzed only via a centralized intelligence through Cloud services, there are a number of potential points of failure to consider, such as connectivity, bandwidth, latency, security and infrastructure costs, to cite a few. For these reasons, there is growing interest in combining Cloud services with Edge Computing solutions [4]. This trend is producing an interesting race towards AI miniaturization, which is hindered, however, by technological barriers, since most ML techniques were not engineered to fit the stringent memory and computational requirements of embedded computing.

In our view, Edge Computing is a necessary but insufficient paradigm shift for bringing AI onto every chip. Our claim is that the traditional data-oriented machine learning approach, centred around data scientist activities, also has to be rethought in favour of a more human-oriented approach that starts from (and is incrementally enriched with) the know-how of the process expert or the process manager.

The rest of the paper is organized as follows: Section II describes the current ML process and the obstacles to its implementation in many real-world industrial contexts; Section III introduces our vision of a human-oriented ML approach; Section IV presents olOne, our Integrated Development Environment (IDE) for developing AI-based sensor data processing applications; Section V reports on real-world experiments and results obtained on a commercial prototyping board; Section VI concludes.

II. EFFECTS OF DATA-CENTRIC MACHINE LEARNING

Recent trends in ML are dominated by neural networks (NN)<br />

and deep-learning (DL) technologies. They can be defined as<br />

black-box multi-layer computational models that self-tune their<br />

internal weights to fit input training data.<br />

Undoubtedly, NN and DL have proven effective in disparate application settings such as, to cite a few, medical image processing [5], automatic text generation [6] and
handwriting recognition [7]. However, it is important to stress<br />

that, as a prerequisite to their implementation, NN and DL<br />

require the availability of data to run the training phase. We<br />

concentrate on this point in the next subsection.<br />

A. Barriers to effective implementation in real-world settings<br />

Traditional ML training cannot be done without data. This<br />

point is a crucial one in the Industry 4.0 scenario.<br />

While major technology companies have been gathering and processing data for many years, the picture changes dramatically for most manufacturing firms, which have until now focused on building their products rather than on the digital transformation and competencies required by the new vision.

316


According to a recent study [8], almost one third of EU<br />

enterprises recruiting or trying to recruit ICT specialists reported<br />

having difficulty filling those vacancies, and more than one in

two companies searching for an ICT specialist found a serious<br />

shortage of people with such skills.<br />

Simply put, several industries collect little or no data, lacking the internal skills to do so; this impedes them from implementing appropriate maintenance or customer-oriented services.

Even when data is available, it does not necessarily cover all the relevant conditions one would like to keep an eye on. Although it may seem ironic, if one wanted to train a pre-explosion pattern in a plant, the surest way would be to let the explosion happen, in order to collect sufficient data for analysis!

The next barrier is the lack of the technical skills needed to mine useful insights from raw data, e.g. by appropriately tuning NN parameters. Generally, this task is performed by data scientists, a professional profile in short supply on the labor market.

Finally, much of the available know-how in terms of human experience is excluded from this process. NN and DL are trained as black-box models, e.g. to find anomalous patterns. Once a pattern has been found, further work is required to understand why it occurred.

B. Race towards AI miniaturization<br />

Today, especially in Internet of Things (IoT) applications, it is common to have NN and DL models trained in the Cloud and accessed from the edge as web services through REST API calls or similar mechanisms [9]. This architecture works well in “relaxed” settings, where the processing frequency is in the range of seconds, minutes or more. Because of network constraints, however, the centralized approach to ML computation does not fit real-world applications requiring prompt responses with millisecond latency.

Indeed, computation performed directly at the edge/fog level, or even at the thing/endpoint level, has many additional benefits such as bandwidth reduction, privacy, security, resilience to single points of failure, and so on. Unfortunately, NN and DL are engineered for computationally intensive tasks that often exceed the capabilities of the microprocessors commonly used at the edge of the network. This is driving a race among chip manufacturers towards AI acceleration platforms which, in general, are based on special-purpose architectures such as GPUs or FPGAs [10].

Nevertheless, manufacturing special-purpose chips requires significant time and investment. By contrast, since the installed base of microcontrollers is dominated by low-cost general-purpose microprocessors, such as those of the ARM® Cortex®-M family, we believe there is great potential to re-use existing solutions, provided that a paradigm shift is undertaken in the overall ML process.

III. CHOOSING A DIFFERENT PERSPECTIVE

Several decades ago [11], Minsky, one of the forefathers of AI, conjectured that intelligence, viewed as a complex process, is the manifest macro-level appearance of a number of simpler micro-level phenomena taking place at a lower observation level. This collective intelligence grows towards increasing complexity as an emergent property [12].

In the literature, these principles are investigated in studies on Granular Computing (GrC) [13-15] and in Zadeh’s approach to Computing with Words (CWW) [16]. A computational model that grows out of GrC and CWW is that of holons [17], which represents the theoretical background of our approach.

A. Granular Computing, Computing with Words and Holons<br />

According to Pedrycz’s view [13], GrC, as opposed to numeric computing (which is data-oriented), is knowledge-oriented and accounts for a new, unified way of dealing with information processing. Since knowledge is basically made of information granules, information granulation operates on the granule scale, thus defining a sort of pyramid of information processing where lower levels account for ground data and higher levels for symbolic abstraction.

GrC provides a basic framework for CWW, which consists in expressing knowledge of observed phenomena in terms of linguistic propositions rather than numerical equations. At the core of the CWW methodology lies the concept of the granule, owing to the inherent fuzziness of linguistic expressions. In Zadeh’s view, a word w is considered a label of a granule [16]. From this perspective, the use of words becomes de facto a form of granulation.
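A minimal sketch of the word-as-granule idea (our own illustration, not Zadeh’s formalism and not olOne’s engine) models a linguistic term such as “steady” as a fuzzy set over a numeric feature of a signal window, rather than as a crisp threshold:

```c
#include <assert.h>

/* Hypothetical sketch: the word "steady" as the label of a granule,
 * modelled as a trapezoidal fuzzy membership over the peak-to-peak range
 * of a signal window (fully steady below lo, not steady at all above hi). */
static double steady_membership(double range, double lo, double hi)
{
    if (range <= lo) return 1.0;
    if (range >= hi) return 0.0;
    return (hi - range) / (hi - lo); /* linear transition between lo and hi */
}
```

The graded transition between lo and hi is what distinguishes a granule from a crisp numeric rule.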

Introduced by Koestler in the late 1960s [18], a holon was defined as an entity playing the roles of an autonomous whole and a dependent part at the same time, as happens with biological cells, which are autonomous wholes that contribute as parts to the benefit of the hosting organism. By analogy, the same scheme also holds for human words in a phrase. A word, taken alone, carries its own meaning, while in a phrase it contributes to the understanding of a different semantic picture. That is why holons can be viewed as a computational model for CWW applications.

The inherent recursivity of language is a powerful means for humans to embed and transfer pieces of knowledge at different granularity levels. For example, consider an energy-from-waste plant manager describing the good-working condition of the combustion chamber as the situation in which the temperature signals from the thermocouples behave in the same way. This is a very simple and expressive statement that suffices to describe, from the human standpoint, what a “normal condition” is and, consequently, by negation, what should be considered “anomalous”. It is the machine that should work to close the gap between the semantics behind the human utterance and its algorithmic implementation.

B. Holons in real-time sensor data interpretation<br />

Physical sensors can be viewed as the CPS perception<br />

system that senses the surrounding environment, generally<br />

through periodic sampling. Since measurements are often noisy or ambiguous, real-time information extraction is demanding. If data interpretation is delayed for too long, many important phenomena can be lost, which is unacceptable in several mission-critical scenarios.

317


Since the kind of AI we present here works on raw data<br />

streams coming from sensors, it is useful to restrict our domain<br />

only to real-time signal interpretation, rather than general-purpose data processing, and in particular to the very step that

transforms raw data streams into meaningful events for upper<br />

application layers.<br />

Early work on holonic approaches to sensor data interpretation was presented in [19-20]. This paper extends it in the light of a more general approach to CWW and human-machine interaction (HMI).

C. Focusing on human know-how first<br />

As an alternative to the data-centric approach described in the previous Section, we propose an iterative deployment workflow that centers on the available human know-how about the target process and incorporates, incrementally, new insights that may emerge after deployment.

The main phases of this workflow are somewhat inspired by the so-called Plan-Do-Check-Act cycle [21] and by software development practice [22]. They are the following:

1. Target events definition: the process manager or the domain expert identifies/updates the relevant conditions, in terms of events occurring on both physical and virtual signals, that they want the AI to find for them. This phase is the equivalent of the energy-from-waste plant problem description provided before.

2. Transcription: the sensor processing flow that fires the target events is transcribed in an intuitive way, e.g. in a visual language. A real-time CWW interpretation engine can then be embedded as a widget in the visual scheme to characterize signal dynamics on the fly, letting the machine do the job of transforming a high-level concept into executable code. A well-formed flow starts from the input signals, declares some processing, and ends with the target events. For example, if we want an accelerometer level to be “steady” over a 1 second time span, we would draw a flow like the one in Figure 2 ahead in the text. If the specific concept representing the intended description were missing from the CWW vocabulary, the language should provide a vocabulary-enriching mechanism, e.g. by automatically abstracting input examples into computable and reusable models, in a way similar to the training phase of a NN (the mechanisms used to perform this abstraction are outside the scope of the present article).

3. AI automated code building: once transcription is complete, the processing flow is automatically built into an executable to run on the target device, e.g. in the form of a compiled C library to insert in the main loop.

4. Embedding and test: the executable is then embedded into the target device for deployment. The running AI may supply new insights that restart the whole process, until all the desired conditions have been declared and verified.

IV. OLONE

“olOne” (from the Italian word for holon) is an IDE<br />

engineered to quickly build and deploy real-time sensor data<br />

processing AI applications without any data science background.

In compliance with the vision and workflow illustrated in the previous section, olOne enables the user to dictate domain knowledge to the CWW engine, which will be included as a special-purpose AI library in the C/C++ embedded software.

Once deployed onto the target device, the AI’s sole objective is to interpret raw sensor data streams at run time according to the event conditions defined in the design phase, checking whether such conditions are met. In this sense, the AI acts as a “semantic transducer”, transforming numeric data streams into meaningful booleans, either for higher application layers or for taking local actions.

A. Visual design<br />

To ease program design, a visual language specific to signal processing is used. This choice stems from the observation that most of the sensor data stream analytics platforms available on the market are either SQL-oriented (such as Microsoft Azure Stream Analytics or SQLstream) or programmed ad hoc (like Oracle Stream Analytics). Both approaches use programming languages that were not originally conceived for stream data processing. Instead, olOne provides a toolbox of draggable widgets that can be arranged in the editor panel to produce a sensor data processing flow. A well-formed flow is mainly composed of the following nodes:

1. Sensor: name of the data source.<br />

2. Time Window: time span of the raw-data buffer at<br />

which AI will observe phenomena, e.g. 1 second.<br />

3. Behavior: a selector over the type of dynamics the AI will focus on; examples are level (representing a generic state change from one numerical plateau to another) and trend (which corresponds to the classical notion of direction given an array of data points).

4. Concept: an abstract Yes/No condition the AI will check for, e.g. “high level” or “rising trend”. The technology behind olOne is also capable of comparing behaviors from different sources, e.g. “humidity level higher than temperature level”, and even of abstracting new concepts from raw data, e.g. “slight rise”, to enrich the base vocabulary.

5. Boolean operator (optional): a unary or n-ary logical connector used to combine or negate processing flows.

6. Event: name of the target event.<br />

The above steps are repeated until all processing flows have been defined. A well-formed design can then be automatically built into executable code by clicking the build button.
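To make the node types concrete, one possible in-memory encoding of a single flow is sketched below. The struct, enum and field names are purely illustrative assumptions on our part, not olOne’s actual build format:

```c
#include <assert.h>
#include <string.h>

/* Illustrative encoding of one Sensor -> Time Window -> Behavior ->
 * Concept -> Event processing flow; not olOne's actual representation. */
enum behavior { BEHAVIOR_LEVEL, BEHAVIOR_TREND };

struct flow {
    const char    *sensor;    /* data source name                     */
    double         window_s;  /* observation window in seconds        */
    enum behavior  behavior;  /* type of dynamics to watch            */
    const char    *concept;   /* Yes/No condition to check            */
    const char    *event;     /* target event fired when it holds     */
};

/* Example: fire "standing_still" when the accelerometer level is
 * "steady" over a 1 second window. */
static const struct flow still_flow = {
    "accelerometer", 1.0, BEHAVIOR_LEVEL, "steady", "standing_still"
};
```

The optional Boolean-operator node would combine several such flows into one event; it is omitted here for brevity.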

318


B. AI library integration in embedded programs<br />

Once the “olOne.h” library has been returned after the build request, the AI can be inserted in the main loop of an embedded program, as shown in Figure 1. The process is straightforward: at each iteration, data read from the sensors are sent to the library as float values, then the AI interpretation is called and Yes/No results are returned, for taking local actions or providing condition-monitoring insights to other application layers.

/* initialize AI library memory before the main loop */
olOneInit();

while (1) {
    /* Get instantaneous data from the expansion board using variable s */
    /* static X_CUBE_MEMS *s = X_CUBE_MEMS::Instance(); */
    s->hts221.GetTemperature((float *)&TEMP_Value);
    s->hts221.GetHumidity((float *)&HUM_Value);
    s->lsm6ds0.Acc_GetAxes((AxesRaw_TypeDef *)&ACC_Value);

    /* send raw data to AI library */
    addValue(0, ACC_Value.AXIS_X);
    addValue(1, ACC_Value.AXIS_Y);
    addValue(2, ACC_Value.AXIS_Z);
    addValue(3, TEMP_Value);
    addValue(4, HUM_Value);

    /* call AI library interpretation function */
    interpret();

    /* get Boolean results -> 0 False, 1 True */
    int *events = getEvents();

    /* do something and then repeat the process, e.g. at 10 Hz */
    wait_ms(100);
}

Fig. 1. Excerpt of sample C code for using the AI library produced by olOne.<br />

Reference hardware is the ST STM32F401 Nucleo-64 board with MEMS<br />

Inertial and Environmental expansion module.<br />

C. Additional features<br />

In addition to visual design and automatic code building, olOne offers a number of features engineered to speed up application development and testing. The main ones are the following:

• Optimized code: the produced executable is optimized to follow strictly the logic provided in the design phase, thus minimizing the memory and computational footprint at runtime. In case time-window buffers exceed the available RAM, memory can be optimized through a granularization process that compresses many data points into a single variable.

• Multiplatform: the same design can be built for different<br />

target environments, e.g. as a C library for embedded<br />

platforms or a completely stand-alone Java program for<br />

operating system-enabled devices.<br />

• Simulation: arrays of historical data can be analyzed off-line with a given design. Results are written as a CSV file in which raw-data input columns are paired with the corresponding Yes/No output conditions.

• Live dashboard: chart panels and LED-like widgets can be composed to display raw data and event conditions read in real time from the embedded device via the serial port.

• New concept creation: concepts can be added to the base vocabulary. The user provides an array of data points, either taken from live data or sketched from a simulated example, and asks the AI to abstract the single example into a category of similar conditions.
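The granularization mentioned in the first bullet can be pictured as follows (a sketch under our own assumptions, not olOne’s actual algorithm): a run of raw samples is compressed into one summary variable, so a long time window costs a few bytes instead of a full sample buffer.

```c
#include <assert.h>

/* Hypothetical granularization: compress n raw samples into one summary
 * "granule" so long time windows fit in a few bytes of RAM. */
struct granule { float min, max, mean; };

static struct granule granulate(const float *x, int n)
{
    struct granule g = { x[0], x[0], 0.0f };
    for (int i = 0; i < n; i++) {
        if (x[i] < g.min) g.min = x[i];
        if (x[i] > g.max) g.max = x[i];
        g.mean += x[i];
    }
    g.mean /= (float)n;
    return g;
}
```

Chaining granules of granules would yield the pyramid of information-processing levels described in Section III.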

V. EXPERIMENTS USING OLONE<br />

To assess the feasibility of the proposed approach, a couple of case studies were implemented on a selected group of commercially available embedded boards, all equipped with ARM® Cortex®-M family microprocessors, such as the NXP MKL25Z128VLK4 Microcontroller Development Board (16 KB, 48 MHz, Cortex-M0+).

We report here on the results obtained with the ST STM32F401 Nucleo-64 (96 KB, 84 MHz, Cortex-M4), since the availability of pluggable sensor modules made the testing process very quick.

Case studies were chosen with the objective of detecting<br />

events involving different types of signal dynamics and AI tasks.<br />

In particular, we focused on:<br />

• level-change: through the analysis of accelerometer data. Since acceleration is the second derivative of position, accelerometers are fast trackers of movement conditions.

• trend analysis: through the comparison of temperature and humidity sensor data drifts. Since these physical quantities are correlated but of a different nature, comparing them requires the AI to perform some kind of data fusion [23].

The ST MEMS Inertial and Environmental Nucleo Expansion board was used to acquire the raw data streams for both case studies.

A. Level-change in accelerometer data<br />

The ability to detect changes in sensor data streams, e.g. faults or time-variant environmental conditions, is a key functionality in self-adaptive CPS [24].

In our experiments, we targeted level changes in acceleration (the rate of change of acceleration is known in physics as jerk) in order to detect an orientation-independent [25] “standing still” condition (no jerk on the three axes) over a 1 second time span, with a sampling frequency of 10 Hz.

Although, at first glance, the classification task may appear trivial, it is complicated by the intrinsic noise of the MEMS accelerometer [26]. Furthermore, no a-priori information, such as



Fig. 2. Visual design for the use cases described in this paper. The processing flows proceed from the data sources (widgets representing sensors, on the left) towards the output boolean events (widgets on the right). Intermediate widgets define the time window, the type of dynamics the AI will look for, and the comparators or logic connectors that fuse different flows. In the trend analysis flow, the “slight rise” concept is inserted as a specification of the trend behavior.

variance or noise power, is given to the embedded AI apart from the description visually transcribed in olOne, as it appears in Figure 2.
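As a rough baseline for what the embedded AI must decide (a naive sketch of our own, not olOne’s CWW engine), a “standing still” test over a 1 second window at 10 Hz could compare each axis’s peak-to-peak excursion against a threshold set above the accelerometer noise floor:

```c
#include <assert.h>

#define WIN 10  /* 1 second window sampled at 10 Hz */

/* Naive baseline (not olOne's engine): "standing still" holds when the
 * peak-to-peak excursion on every axis stays below a noise-tolerant
 * threshold over the whole window. */
static int standing_still(const float a[WIN][3], float thresh)
{
    for (int axis = 0; axis < 3; axis++) {
        float lo = a[0][axis], hi = a[0][axis];
        for (int i = 1; i < WIN; i++) {
            if (a[i][axis] < lo) lo = a[i][axis];
            if (a[i][axis] > hi) hi = a[i][axis];
        }
        if (hi - lo > thresh) return 0; /* too much movement on this axis */
    }
    return 1;
}
```

Note that such a fixed threshold is exactly the a-priori noise information the olOne design does without.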

B. Temperature-Humidity trend comparison<br />

In this case study, the target was the comparison of slightly rising trends of temperature and humidity over a 5 second time window. This analysis, extended to wider time spans, can be useful, for example, to detect long-term drift, a well-known problem related to sensor aging [27].

Since the “slight rise” concept is not present by default in the base vocabulary, it had to be added as a new word. An array of temperature data points was used for this purpose. The data collection experiment consisted in bringing a hand close to the board: the heat from the hand slightly increased the temperature reading. Live data were then collected and stored to perform the data abstraction task by means of the olOne new concept creation feature described in the previous Section.

It is noteworthy that the new concept could also be used as a reference condition for humidity, a physical quantity different from the one used in the learning phase.
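One naive way to make trends of different physical quantities comparable (again a sketch under our own assumptions, not the abstraction olOne performs) is to min-max normalize each window so that the fitted slope becomes dimensionless:

```c
#include <assert.h>
#include <math.h>

/* Hypothetical data-fusion helper: min-max normalize a window to [0,1],
 * then fit a least-squares slope against the sample index, so trends of
 * different physical quantities become directly comparable. */
static float norm_slope(const float *x, int n)
{
    float lo = x[0], hi = x[0];
    for (int i = 1; i < n; i++) {
        if (x[i] < lo) lo = x[i];
        if (x[i] > hi) hi = x[i];
    }
    float span = (hi > lo) ? (hi - lo) : 1.0f; /* avoid division by zero */

    float sx = 0, sy = 0, sxx = 0, sxy = 0;
    for (int i = 0; i < n; i++) {
        float y = (x[i] - lo) / span;
        sx += (float)i; sy += y;
        sxx += (float)i * (float)i; sxy += (float)i * y;
    }
    return ((float)n * sxy - sx * sy) / ((float)n * sxx - sx * sx);
}
```

After normalization, a temperature slope and a humidity slope can be compared directly, which is the essence of the fusion task in this case study.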

C. Preliminary tests and final considerations<br />

Our tests showed that both use cases could be implemented successfully on the STM Nucleo board at 10 Hz, with a minimal memory footprint (approximately 250 bytes) and a CPU usage of less than 0.1%. We also tested the same design at 100 Hz, keeping CPU usage below 1%. Some screenshots taken from the olOne live dashboard during these tests are reported in Figure 3.

Fig. 3. Example screenshots taken from the live dashboard, showing the “still” condition observed on the 3-axis accelerometer sensor and the “slight rise” condition on the humidity and temperature signals. When the AI finds that real-time data (displayed in the chart) comply with the target event conditions, the boxes reporting the event labels are highlighted and display “FIRED”.



Our experiments support the thesis that CWW computational models like the one employed in olOne can be brought to commonly used embedded architectures, even to ARM® Cortex®-M0 microprocessors, provided that a different ML approach, centred around human know-how about the process, is followed.

VI. CONCLUSIONS

Interest in AI and its potential application to Industry 4.0 is surging. However, several barriers to effective implementation in real-world settings remain, such as the lack of adequate technical skills for data processing, the availability of datasets and the very time spent on the model training required by traditional ML approaches.

In the race towards AI miniaturization, which today is mainly focused on special-purpose architectures, our proposal should be considered an attempt to provide a different perspective, one that exploits existing general-purpose CPUs without requiring a data science background for AI design and deployment.

Rooted in the CWW paradigm, the approach employed in olOne represents, in the authors’ view, a viable alternative to mainstream NN and DL implementations in the Industry 4.0 scenario.

ACKNOWLEDGMENT<br />

The authors would like to personally thank Pio Quarticelli and Danilo Pau from the STMicroelectronics Agrate Brianza site for kindly providing all the hardware and technical support needed to set up the presented experiments.

REFERENCES<br />

[1] E. A. Lee, “Cyber physical systems: design challenges”, 11th IEEE

International Symposium on Object and Component-Oriented Real-Time<br />

Distributed Computing (ISORC), Orlando, FL, 2008, pp. 363-369.<br />

[2] R. Schmidt, M. Möhring, RC. Härting, C. Reichstein, P. Neumaier, P.<br />

Jozinović, “Industry 4.0 - potentials for creating smart products: empirical<br />

research results”. In: Abramowicz W. (eds) Business Information<br />

Systems. BIS 2015. Lecture Notes in Business Information Processing,<br />

vol 208. Springer, Cham<br />

[3] J. Lee, H.A. Kao, S. Yang, “Service innovation and smart analytics for Industry 4.0 and big data environment”, 6th CIRP Conference on Industrial Product-Service Systems, Vol. 16, 2014, pp. 3-8, Elsevier.

[4] S. Teerapittayanon, B. McDanel and H. T. Kung, "Distributed deep neural<br />

networks over the cloud, the edge and end devices," 2017 IEEE 37th<br />

International Conference on Distributed Computing Systems (ICDCS),<br />

Atlanta, GA, 2017, pp. 328-339.<br />

[5] D. Shen, G. Wu, H.-I. Suk. “Deep learning in medical image analysis”.<br />

Annual review of biomedical engineering. 2017; 19:221-248.<br />

[6] I. Sutskever, J. Martens, G. Hinton, “Generating Text with Recurrent<br />

Neural Networks”, Proceedings of the 28th International Conference on

Machine Learning (ICML-11), ACM , pp. 1017-1024, June 2011.<br />

[7] M. Liwicki, A. Graves, and H. Bunke. “Neural Networks for Handwriting<br />

Recognition”. Book chapter, Computational Intelligence Paradigms in<br />

Advanced Pattern Classification, pp. 5-24, Springer, 2012.<br />

[8] S. Compagnucci, G. Berni, G. Massaro, M. Masulli, “Thinking the future<br />

of the european industry. Digitalization, Industry 4.0 and the role of EU<br />

and national policies”, EU study from I-com (Institute for<br />

competitiveness), Bruxelles, 6 September 2017.<br />

[9] C.-W. Tsai, C.-F. Lai, H.-C. Chao, A. V. Vasilakos, “Big data analytics:<br />

a survey”, Journal of Big Data, 2:21. Springer International Publishing.<br />

December 2015.<br />

[10] D. Vainbrand and R. Ginosar, "Network-on-Chip Architectures for<br />

Neural Networks," 2010 Fourth ACM/IEEE International Symposium on<br />

Networks-on-Chip, Grenoble, 2010, pp. 135-144.<br />

[11] M. Minsky, The Society of Mind, Simon and Schuster, New York, 1986.

[12] M. Ulieru, R. Doursat, “Emergent engineering: a radical paradigm shift”,<br />

Int. J. Autonomous and Adaptive Communications Systems, Vol. 4, No.<br />

1, 2011, pp. 39-60.<br />

[13] W. Pedrycz, “Granular computing: an introduction”, Proc. of the Joint 9th<br />

IFSA World Congress and 20th NAFIPS International Conference,<br />

Vancouver, BC, 2001, pp. 1349-1354 vol.3.<br />

[14] L. A. Zadeh. “Toward a theory of fuzzy information granulation and its<br />

centrality in human reasoning and fuzzy logic”. Fuzzy Sets Syst. 90, 2,<br />

pp. 111-127, September 1997.<br />

[15] L. A. Zadeh, “Some reflections on soft computing, granular computing<br />

and their roles in the conception, design and utilization of<br />

information/intelligent systems”, Springer-Verlag, Soft Computing (2):<br />

23—25. 1998.<br />

[16] L. A. Zadeh, "Fuzzy logic = computing with words," in IEEE<br />

Transactions on Fuzzy Systems, vol. 4, no. 2, pp. 103-111, May 1996.<br />

[17] M. Calabrese, Hierarchical-Granularity Holonic Modelling. Doctoral<br />

Thesis, 2011. University of Milan, Italy.<br />

[18] A. Koestler, “Some general properties of self-regulating open hierarchic<br />

order (SOHO)”, In Koestler and Smythies, 1969, 210-216.<br />

[19] V. Di Lecce, M. Calabrese, C. Martines, “From sensors to applications: a<br />

proposal to fill the gap”, Sensors & Transducers Journal, Vol. 18, Special<br />

Issue, pp. 5-13, January 2013.<br />

[20] V. Di Lecce and M. Calabrese. “Smart sensors: a holonic perspective”. In<br />

Proceedings of the 7 th international conference on Intelligent Computing:<br />

bio-inspired computing and applications (ICIC'11), De-Shuang Huang,<br />

Yong Gan, Prashan Premaratne, and Kyungsook Han (Eds.). Springer-<br />

Verlag, Berlin, Heidelberg, 290-298. 2011.<br />

[21] N. R. Tague, The Quality Toolbox, Second Edition, ASQ Quality Press,<br />

2004, pages 390-392.<br />

[22] G. Suryanarayana, T. Sharma and G. Samarthyam, "Software Process<br />

versus Design Quality: Tug of War?," in IEEE Software, vol. 32, no. 4,<br />

pp. 7-11, July-Aug. 2015.<br />

[23] B. Khaleghi, A. Khamis, F.O. Karray, “Multisensor data fusion: A review<br />

of the state-of-the-art”, Information Fusion, Vol. 14, Issue 1, January<br />

2013, Pages 28-44, Elsevier.<br />

[24] C. Alippi, V. D'Alto, M. Falchetto, D. Pau and M. Roveri, "Detecting<br />

changes at the sensor level in cyber-physical systems: Methodology and<br />

technological implementation," 2017 International Joint Conference on<br />

Neural Networks (IJCNN), Anchorage, AK, 2017, pp. 1780-1786.<br />

[25] W. Hamäläinen, M. Järvinen, P. Martiskainen and J. Mononen, "Jerkbased<br />

feature extraction for robust activity recognition from acceleration<br />

data," 2011 11th International Conference on Intelligent Systems Design<br />

and Applications, Cordoba, 2011, pp. 831-836.<br />

[26] A. Albarbar and S.H. Teay, “MEMS Accelerometers: Testing and<br />

Practical Approach for Smart Sensing and Machinery Diagnostics”, pp.<br />

19-40 in D. Zhang, B. Wei (eds.), Advanced Mechatronics and MEMS<br />

Devices II, Microsystems and Nanosystems, Springer International<br />

Publishing Switzerland 2017.<br />

[27] T. Islam, H. Saha, “Study of long-term drift of a porous silicon humidity<br />

sensor and its compensation using ANN technique”, In Sensors and<br />

Actuators A: Physical, Vol. 133, Issue 2, 2007, Pages 472-479.<br />

321


A new scalable architecture to accelerate<br />

Deep Convolutional Neural Networks for low<br />

power IoT applications<br />

Giuseppe Desoli, Thomas Boesch, Surinder Pal-Singh, Nitin Chawla<br />

ST Central Labs and Technology R&D<br />

STMicroelectronics<br />

Cornaredo (MI), Italy; Geneva, Switzerland; Noida, India<br />

giuseppe.desoli@st.com, thomas.boesch@st.com, surinder-pal.singh@st.com, nitin.chawla@st.com<br />

Abstract— Deep Convolutional Neural Networks (DCNNs) or<br />

ConvNets allow achieving state of the art results in many<br />

applications involving recognition, identification and/or<br />

classification tasks; however, they come at a high cost in terms<br />

of processing power, hindering their adoption in embedded and<br />

IoT domains, due to the scarce availability of low-cost and<br />

energy-efficient solutions. Recently a push towards an ever-increasing<br />

deployment of DCNN-based inference tasks in<br />

embedded devices supporting the edge-computing paradigm has<br />

been observed, overcoming limitations of cloud-based computing<br />

for latency, bandwidth requirements, security, privacy,<br />

scalability, and availability. At the edge, severe performance<br />

requirements must coexist with tight constraints in terms of<br />

power and energy consumption. DCNN algorithms necessitate<br />

billions of multiply-accumulate operations per second for real-time<br />

workloads, as well as local storage of millions of bytes of<br />

pre-trained weights. To cope with these constraints, low-power<br />

IoT end-nodes must resort to specialized hardware blocks for<br />

specific compute-intensive data processing, while retaining<br />

sufficient software programmability to cope with diverse<br />

computational needs. The Orlando architecture is a<br />

reconfigurable, scalable and design time parametric DCNN<br />

Processing Engine powered by an energy efficient set of HW<br />

convolutional accelerators supporting kernel compression and an<br />

on-chip reconfigurable data transfer fabric to improve data reuse<br />

and reduce on-chip and off-chip memory traffic. The Orlando<br />

SoC prototype integrates custom designed DSPs, along with an<br />

instance of the reconfigurable dataflow custom HW accelerator<br />

fabric designed in FD-SOI 28 technology with low power features<br />

and adaptive circuitry to support a wide voltage range from 1.1V<br />

to 0.575V. The chip adopts a GALS clocking architecture to<br />

reduce the clock network dynamic power and skew sensitivity<br />

due to on-chip variation at lower voltages. We achieved a power<br />

consumption of 41mW on a typical DCNN algorithm (AlexNet)<br />

with a peak layer efficiency of 2.9 TOPS/W.<br />

Keywords—Deep Learning; Neural Networks; FD-SOI; ultra-low power SoC<br />

I. INTRODUCTION<br />

DCNN based algorithms are now widely applied to a large<br />

number of hard to solve problems in classification, detection,<br />

recognition, analysis and, more recently, even synthetic signals<br />

generation in computer vision, signal processing, speech and<br />

audio applications, robotic motion, navigation, financial data<br />

analysis, medical diagnostics, and more. Since the seminal<br />

work of Y. LeCun et al. [1] and the 2012<br />

ImageNet Large Scale Visual Recognition Challenge, won by a<br />

DCNN called AlexNet [2] that for the first time significantly<br />

outperformed classical computer vision approaches, many new<br />

kinds of neural network topologies and operators have entered<br />

the state of the art, all requiring a baseline computational<br />

pattern consisting of some form of tensor convolution along<br />

with a more diverse set of additional operators deployed in a<br />

sequence of processing steps or layers.<br />
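The scale of this baseline computational pattern is easy to appreciate with a back-of-the-envelope count (an illustrative sketch, not taken from the paper; the layer dimensions are those commonly quoted for AlexNet's first convolutional layer):<br />

```python
# Rough MAC-count estimate for one convolutional layer:
# every output element needs kh*kw*c_in multiply-accumulates.
def conv_macs(out_h, out_w, c_out, kh, kw, c_in):
    return out_h * out_w * c_out * kh * kw * c_in

# AlexNet's first layer: 11x11x3 kernels, 96 output maps of 55x55.
macs = conv_macs(55, 55, 96, 11, 11, 3)
print(macs)  # 105415200 MACs for a single layer of a single frame
```

At 2 OPS per MAC and real-time frame rates, a handful of such layers already reaches the billions of operations per second mentioned in the abstract.<br />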

It is only in recent years that commodity computing<br />

hardware such as GPUs delivered the performance required to<br />

address DCNN training and inference based applications. At<br />

the same time, it is increasingly difficult to improve over<br />

the state of the art in hardware performance by way of general-purpose<br />

designs, leading to the emergence of hardware DCNN<br />

accelerators. A survey of the existing proposals in this domain<br />

is beyond the scope of this paper; some of the early works<br />

include the DianNao accelerator family [5], using a SISD<br />

architecture to process operations in parallel on a single chip,<br />

while a few other examples can be found in [3, 4, 6].<br />

Hardware accelerator design efforts have proceeded in two<br />

directions: either toward more general-purpose accelerators to<br />

support training and inference with very high throughput and<br />

efficiency, for example in servers [11], or toward specialized<br />

units addressing layers or classes of DNNs with the goal of<br />

reducing execution time and/or energy. In order to make these<br />

technologies pervasive in mobile, IoT and<br />

wearable devices, hardware acceleration provides the ability to<br />

work in real time with very limited power consumption and<br />

limited amounts of embedded memory, overcoming the limitations<br />

of fully programmable solutions.<br />




We present a scalable modular architecture called Orlando<br />

providing state-of-the-art performance and energy efficiency to<br />

design HW-accelerated Neural Processing Units (NPUs) with<br />

the following features: (1) flexible streaming HW<br />

convolutional accelerators supporting variable-bit-length kernel<br />

decompression, (2) a reconfigurable dataflow switching fabric<br />

improving data reuse and reducing the need for on-chip and<br />

off-chip memory traffic, and (3) a power-efficient array of DSPs to<br />

increase flexibility and support real-world applications. In<br />

addition, the SoC prototype designed to validate the<br />

architecture includes an ARM-based host subsystem with<br />

peripherals, a range of high-speed I/Os interfacing for imaging<br />

and other types of sensors and a chip-to-chip high-speed link to<br />

pair multiple devices together.<br />

Fig. 1. [a] Orlando 1 FD-SOI 28nm SoC prototype high level system architecture, [b] DCNN HW accelerator subsystem shown on the left, [c] a single DSP<br />

cluster shown on the right<br />

II. SOC ARCHITECTURE<br />

A. High Level System<br />

The Orlando 1 test chip prototype [Fig. 1a] integrates an<br />

ARM Cortex-M4 microcontroller with 128KB of memory,<br />

assigned with control and sequencing tasks for I/O and HW<br />

configuration and synchronization. The chip supports a number<br />

of peripherals for external communication and interfacing and<br />

includes eight programmable clusters [Fig. 1c], each one<br />

composed of two ultra-low power proprietary DSPs along with<br />

interrupt controllers, timers, and dedicated tensor transfer<br />

DMA channels. A reconfigurable dataflow accelerator fabric<br />

[Fig. 1b] connects high-speed camera interfaces with image<br />

sensor processing (ISP) pipelines, croppers, color converters,<br />

feature detectors and descriptors (FAST, Census), video<br />

encoders (MJPEG, H.264), 8 channel digital microphone<br />

interface, streaming DMAs and 8 Convolutional Accelerators<br />

(CA). The chip includes 4 SRAM banks of 1 MB each with<br />

dedicated 64-bit bus ports; each bank is composed of 64 KB memory<br />

cuts with individual sleep-line control to activate them on demand<br />

and reduce total leakage when not needed. The system<br />

parameters are chosen to sustain the execution of<br />

all convolutional stages from internal on-chip memory for<br />

DCNN topologies of a complexity similar to AlexNet<br />

without pruning, or even larger ones if fewer bits are used for<br />

activations and/or weights, to achieve high power efficiency. It<br />

is possible to connect multiple chips together via a 4-lane<br />

chip-to-chip high-speed serial link running at up to 6 Gbit/s to<br />

support larger networks without sacrificing throughput, and/or<br />

to use the chip as a co-processor. State-of-the-art DCNN<br />

topologies (e.g. VGG, ResNet, Inception-v4) are<br />

deeper, with many layers, millions of parameters,<br />

and varying kernel sizes, resulting in large bandwidth, power,<br />

and area costs often not compatible with the constraints associated<br />

with embedded devices and applications. The cost in terms of<br />

energy per access varies by almost an order of magnitude from<br />

level to level, and large gaps also exist in throughput and<br />

access latency at different levels of on-chip and external<br />

memory [Fig. 2].<br />

Fig. 2. Relative cost of accessing different levels of memory going from<br />

local buffers attached to functional units to higher levels of on-chip and<br />

external memory (energy/power per word access: local SRAM 1x, on-chip SRAM 10x, external LPDDR 100x)<br />

As a result, a common way to achieve efficiency is to<br />

define a hierarchical memory system and efficiently reuse local<br />

data in the deeper levels of the hierarchy. Accelerating DCNN<br />

convolutional layers, which account for more than 90% of total<br />

operations, calls for the efficient balancing of the computational<br />

vs memory resources for both bandwidth and area to achieve<br />



maximum throughput without hitting their associated ceilings<br />

due to architectural limitations.<br />
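The relative access costs of Fig. 2 make the value of local reuse easy to quantify; the following sketch uses purely illustrative access counts (our own numbers, not measurements from the chip):<br />

```python
# Sketch: why reusing data at the local level of the hierarchy matters.
# Relative energy per word access (after Fig. 2): local SRAM 1x,
# on-chip SRAM 10x, external LPDDR 100x. All counts are illustrative.
COST = {"local": 1, "on_chip": 10, "lpddr": 100}

def traffic_energy(accesses):
    """accesses: dict mapping memory level -> number of word accesses."""
    return sum(COST[level] * n for level, n in accesses.items())

# Same 1M-word working set, streamed from LPDDR every time vs.
# fetched once and then reused from local buffers:
no_reuse = traffic_energy({"lpddr": 1_000_000})
with_reuse = traffic_energy({"lpddr": 10_000, "local": 1_000_000})
print(no_reuse / with_reuse)  # 50.0 - 50x less memory-access energy
```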

B. DSP Sub System<br />

Each 32-bit DSP provides specific instructions (Min, Max, Sqrt,<br />

Mac, Butterfly, Average, 2-4 SIMD ALU) to accelerate typical<br />

CNN operations other than convolutions [2]. A dual load with<br />

16b saturated MAC, advanced memory buffer addressing<br />

modes and zero latency loop control execute in a single cycle<br />

while an independent 2D DMA channel allows the overlap of<br />

data transfers. The DSPs are tasked with max or average<br />

pooling, nonlinear activation, cross-channel response<br />

normalization and classification representing a small fraction of<br />

the total CDNN computation but more amenable to future<br />

algorithmic evolutions. They can operate in parallel with CAs<br />

and data transfers, synchronizing by way of interrupts and<br />

mailboxes for concurrent execution. DSPs are activated<br />

incrementally when the throughput targets require it, leaving<br />

ample margins to support additional tasks associated with<br />

complex applications, such as object localization and<br />

classification, multisensor (e.g. audio and video) DCNN-<br />

based data fusion and recognition, scene classification, etc.<br />
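As a toy illustration of the kind of non-convolutional work offloaded to the DSPs, here is a plain-Python reference for ReLU activation followed by 1-D max pooling (our own sketch of the arithmetic, not the DSP firmware):<br />

```python
# Reference arithmetic for two DSP-side layer types: nonlinear
# activation (ReLU) and max pooling, shown on a 1-D row of values.
def relu(row):
    return [x if x > 0 else 0 for x in row]

def max_pool_1d(row, size=2, stride=2):
    return [max(row[i:i + size]) for i in range(0, len(row) - size + 1, stride)]

out = max_pool_1d(relu([-3, 1, 4, -1, 5, 9]))
print(out)  # [1, 4, 9]
```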

C. The Configurable Accelerator Framework (CAF)<br />

The Orlando Neural Processing Unit (NPU) engine<br />

includes a configurable accelerator framework (CAF) [Fig. 3]<br />

with a design-time selectable number of Functional Units (FU)<br />

such as DMAs, accelerators, or I/O interfaces to external<br />

devices. A centralized, fully connected, runtime configurable<br />

stream switch interconnects all FUs with unidirectional links<br />

transporting data streams to/from different kinds of data<br />

sources and sinks. A fully automated configuration process<br />

allows the designer to quickly generate synthesizable RTL<br />

code tailored to the actual system requirements. The<br />

configuration tool suite uses predesigned FU templates provided<br />

in a central library, takes care of any signal synchronization for<br />

FUs that run on different clock domains and configures the<br />

required stream links and bus interfaces to provide access to all<br />

configuration registers in the system.<br />

Fig. 3. Orlando NPU Configurable Acceleration Framework (CAF)<br />

At runtime, an arbitrary number of concurrent virtual<br />

processing chains, limited only by the available hardware resources,<br />

can be defined to meet the specific characteristics of a task<br />

graph. These virtual processing chains can be configured and<br />

fired within a few system clock cycles and may process<br />

multiple tasks in parallel. An automatic backpressure<br />

mechanism handles the data flow control in each virtual<br />

processing chain, preventing any data overflows.<br />

The functional units available in the Orlando 1 CAF instance include:<br />

DMA ENG: 16 units, input or output, data packing/unpacking, linked-list control<br />

SENSOR IF: 2 units incl. ISP (Bayer => RGB/YUV)<br />

DISPLAY IF: DVI monitor interface<br />

CA: 8 units for 2D convolution acceleration<br />

OTHER: 1 H264 encoder, 1 MJPEG encoder, 1 MJPEG decoder, 2 census transform, 2 image croppers, 1 FAST feature detector, 4 GP color converters<br />

The interconnect supports stream multicasting to allow reuse of a<br />

data stream at multiple data sinks reducing the overall data<br />

bandwidth from/to the system bus [Fig. 4]. The unidirectional<br />

stream links are able to transport different data formats such as<br />

raster-scan images, kernel coefficients, activation data and<br />

others. Start and end tags, along with other command and<br />

message packets are used for signaling and to trigger specific<br />

actions in all FUs participating in a virtual processing chain.<br />
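The behaviour of a virtual processing chain with backpressure can be modelled in a few lines of software; `StreamLink` and `run_chain` below are our own illustrative names for a toy model, not the RTL:<br />

```python
# Toy model of a virtual processing chain: functional units connected
# by bounded FIFOs; a full link exerts backpressure on its producer.
from collections import deque

class StreamLink:
    """Bounded FIFO between two functional units."""
    def __init__(self, depth=2):
        self.q = deque()
        self.depth = depth
    def full(self):
        return len(self.q) >= self.depth

def run_chain(items, stages, depth=2):
    # one link in front of each stage plus one output link after the last
    links = [StreamLink(depth) for _ in range(len(stages) + 1)]
    pending = deque(items)
    out = []
    while pending or any(link.q for link in links):
        if links[-1].q:                       # the sink always drains
            out.append(links[-1].q.popleft())
        # service stages sink-first so freed space propagates upstream
        for i in range(len(stages) - 1, -1, -1):
            if links[i].q and not links[i + 1].full():
                links[i + 1].q.append(stages[i](links[i].q.popleft()))
        if pending and not links[0].full():   # source stalls when full
            links[0].q.append(pending.popleft())
    return out

# e.g. two toy stages standing in for crop -> convolve
print(run_chain(range(5), [lambda x: x + 1, lambda x: x * 2]))
# [2, 4, 6, 8, 10]
```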

Functional Units can have an arbitrary number of input and<br />

output stream links as well as a set of configuration registers<br />

used to enable, reset and configure their functionality. A<br />

centralized interrupt controller enables the routing of interrupt<br />

signals from any accelerator, interface or DMA engine to the<br />

DSP cores. A clock and reset management unit provides an<br />

individual clock and reset control for each FU in the system.<br />

Specialized DMA engines transform data structures<br />

accessible on the system bus into data streams injected into<br />

virtual processing chains, whereas data streams received by the<br />

DMA engines are translated back to data structures to be<br />

written to any memory location on the system bus. Extensive<br />

data packing and unpacking features in the DMA engines allow<br />

the efficient use of variable data bit width and sophisticated<br />

control mechanisms using linked lists to support autonomous<br />

processing of tensors. Interrupt signals generated by the DMA<br />

engines signal the completion of a processing task to the DSP<br />

Cores and/or central control processor.<br />
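The packing and unpacking performed by the DMA engines can be sketched as follows (a minimal software model assuming simple little-endian packing of unsigned values into 64-bit bus words; the real engines additionally handle linked lists and other formats):<br />

```python
# Sketch of DMA-style data packing: squeeze a stream of narrow
# (e.g. 8-bit) values into 64-bit bus words and unpack them again.
def pack(values, bits, word_bits=64):
    per_word = word_bits // bits
    mask = (1 << bits) - 1
    words = []
    for i in range(0, len(values), per_word):
        w = 0
        for j, v in enumerate(values[i:i + per_word]):
            w |= (v & mask) << (j * bits)
        words.append(w)
    return words

def unpack(words, bits, count, word_bits=64):
    per_word, mask = word_bits // bits, (1 << bits) - 1
    out = [(w >> (j * bits)) & mask for w in words for j in range(per_word)]
    return out[:count]

vals = [1, 2, 3, 250, 7]
assert unpack(pack(vals, 8), 8, len(vals)) == vals  # lossless round trip
```

Packing eight 8-bit weights per 64-bit word is one way the engines keep bus bandwidth proportional to the actual data width rather than the native word size.<br />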

The CAF subsystem instance in the Orlando 1 SoC<br />

prototype includes four camera interfaces (two serial and two<br />

parallel) with integrated ISPs, a display interface and various<br />

accelerators for standard image processing tasks such as color<br />

conversion, image cropping, image (MJPEG) and video (H.264)<br />

encoding. Additional accelerator blocks are available for<br />

feature point identification and tagging, such as a FAST feature<br />

point detector and two census transform blocks that allow for<br />

generating compact and illumination invariant feature<br />

descriptors.<br />

Fig. 4. The Configurable Accelerator Framework allows different kinds of<br />

virtual link connections to be created between blocks, including sources and<br />

sinks of data (e.g. simple chains, chains with forks, joins with a single<br />

interface, joins with multiple interfaces, forks and hops)<br />

D. Chip Implementation<br />

The prototype chip is manufactured with<br />

STMicroelectronics 28nm FD-SOI technology. It is designed with<br />

mono-supply SRAMs based on low-power 0.12 µm²<br />

single p-well bit cells with reduced variability, in-situ tracking<br />



of bitcell current and programmable read time for best speed<br />

and lowest dynamic power. Memories also have in-situ<br />

tracking of word line delay and slope for robust low voltage<br />

read/write across a wide voltage range from 1.1V to 0.575V<br />

[Fig. 6]. Globally asynchronous and locally synchronous<br />

clocking architecture reduces the clock network dynamic<br />

power and skew sensitivity due to on-chip variation at lower<br />

voltages and eases the use of dynamic frequency scaling. Fine-grained<br />

power gating and multiple sleep modes for memories<br />

decrease the overall dynamic and leakage power consumption.<br />

Die size [Fig. 5] is 6.2x5.5 mm², each CA is 0.27 mm²<br />

including memory, and the chip reaches 1.175 GHz at 1.1 V with<br />

a theoretical peak CA performance of 676 GOPS. The<br />

chip is capable of sustaining a wide range of operating points<br />

and can run at 200 MHz with a 0.575 V supply at 25 °C with an<br />

average power consumption of 41 mW on AlexNet using eight<br />

pipelined CAs, achieving a peak efficiency of 2.9 TOPS/W.<br />

Orlando 1 SoC prototype summary (annotated die photo in Fig. 5: OTP, high-speed camera IF, PLL, chip-to-chip link, co-processor subsystem, DSP cores and local memories, global memory subsystem):<br />

Technology: FD-SOI 28nm<br />

Chip size: (X) 6239.2 um, (Y) 5598.2 um<br />

Package: FBGA 15x15x1.83<br />

Clock freq: 200MHz – 1.175GHz<br />

Supply voltages: 0.575V – 1.1V digital, 1.8V I/O<br />

Power: 41 mW<br />

On-chip RAM: 4x1 MB, 8x192 KB, 128 KB<br />

No. of DSPs: 16<br />

Peak DSP performance (*): 75 GOPS (dual 16b MAC loop)<br />

No. of CAs: 8<br />

Peak CAs performance (*): 676 GOPS<br />

(*) 1 MAC defined as 2 OPS (ADD + MUL)<br />

Sub-tensors can be processed entirely with the local buffer<br />

resources available in each accelerator. The configurable batch<br />

size and a variable number of parallel kernels enable optimal<br />

trade-offs for the available input and output bandwidth sharing<br />

across different units and the available computing logic<br />

resources. Keeping the entire batch of feature and kernel data<br />

locally and as close as possible to the MAC units enables the<br />

optimal use of the available power budget. Feature and kernel<br />

data batches can be processed sequentially with multiple<br />

accelerators in a virtual processing chain or iteratively with<br />

intermediate results being stored in on-chip memory and<br />

fetched in the subsequent batch processing round.<br />

Various kernel sizes (up to 12x12), sub-tensor batch sizes<br />

(up to 16), and parallel kernels (up to 4) can be handled by a<br />

single CA instance, but any kernel size can be accommodated<br />

with the accumulator input. The CA includes a line buffer to<br />

fetch up to 12 feature map data words in parallel with a single<br />

memory access. A register-based kernel buffer provides 36<br />

read ports, while 36 16-bit fixed point multiply-accumulate<br />

(MAC) units perform up to 36 MAC operations per clock<br />

cycle. The kernel buffer implements pre-buffering of kernel<br />

data that are required in a subsequent processing step.<br />

An adder tree accumulates MAC results for each kernel<br />

column [Fig. 7]. The overlapping, column-wise calculation of<br />

the MAC operations allows an optimal reuse of the feature<br />

maps data for multiple MACs thus reducing the power<br />

consumption associated with redundant memory accesses.<br />
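The column-wise accumulation can be illustrated numerically (a functional sketch of the arithmetic only, with no modelling of the buffers or timing; as is common in CNN practice, this computes cross-correlation):<br />

```python
# Illustrative column-wise 2D "convolution" (valid mode): each feature
# column is multiplied against a whole kernel column, and an adder
# tree then sums the per-column results.
def conv2d_colwise(feat, kernel):
    kh, kw = len(kernel), len(kernel[0])
    oh, ow = len(feat) - kh + 1, len(feat[0]) - kw + 1
    out = [[0] * ow for _ in range(oh)]
    for r in range(oh):
        for c in range(ow):
            # per-column MACs, then the "adder tree" across columns
            col_sums = [
                sum(feat[r + i][c + j] * kernel[i][j] for i in range(kh))
                for j in range(kw)
            ]
            out[r][c] = sum(col_sums)
    return out

feat = [[1, 2, 3], [4, 5, 6], [7, 8, 9]]
kernel = [[1, 0], [0, 1]]
print(conv2d_colwise(feat, kernel))  # [[6, 8], [12, 14]]
```

Because a fetched feature column contributes to every kernel column that overlaps it, the same feature words feed multiple MACs, which is the data reuse the text describes.<br />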

An optimal CA configuration for each DCNN<br />

layer is currently defined manually, while we are working on a tool to<br />

automatically generate it off-line starting from a DCNN<br />

description format such as Caffe or TensorFlow [10].<br />

Fig. 5. Orlando 1 prototype SoC built in FD-SOI 28nm technology<br />

Fig. 6. The Orlando 1 SoC prototype supports a wide DVFS range of operating<br />

conditions, from ultra-low Vdd for highest efficiency to high performance<br />

(approximately 2930 GOPS/W at 0.575 V / 200 MHz down to 801 GOPS/W at 1.1 V / 1.175 GHz)<br />

Fig. 7. Orlando NPU Convolutional Accelerator (feature line buffer, kernel buffer, 36 MAC units and adder tree). Key parameters:<br />

MAC UNITS: 36 x 16x16-bit MACs<br />

FEATURE LINE BUFFER: up to 12 lines with up to 512 pixels, or 3 lines with up to 2048 pixels<br />

KERNEL BUFFER: up to 484 kernel values, 36 read ports<br />

KERNEL SIZE: 1x1 to 12x12<br />

BATCH SIZE: up to 16<br />

PARALLEL KERNELS: up to 4<br />

FEATURE SIZE: up to 512 for kernels > 6x6, up to 1024 for kernels > 3x3, up to 2048 for others<br />

VARIOUS EXTENSIONS: kernel decompression 8 bit => 16 bit, kernel prebuffering, output stream merging, data shifting and rounding<br />

III. CNN HW ACCELERATION<br />

A. Convolutional Accelerators (CA)<br />

Convolutional accelerators can be grouped or chained<br />

together to handle varying sizes of feature maps and multiple<br />

kernels in parallel using the interconnection capabilities<br />

provided by the programmable stream switch, adapting to<br />

different neural network topologies as well as feature and<br />

kernel tensor geometries.<br />

B. Hyperparameters compression<br />

A large number of schemes have been proposed in the<br />

literature to compress CNN hyper-parameters with fewer bits,<br />

including uniform and trained quantization, pruning, weight<br />

sharing and even Huffman encoding, among other techniques [8].<br />

It is generally accepted that in many cases, at the price of only<br />

marginal decreases in output accuracy, the required precision can be<br />

lower than 16 bits and as low as eight or fewer bits [9]. In order<br />

to keep the hardware complexity limited, we have selected a<br />

relatively simple non-linear quantization<br />

scheme for which the quantization steps are defined offline with a<br />

k-means approach applied to all of the weights of each layer.<br />

This scheme is flexible enough to also accommodate linear<br />

quantization models with a min/max boundary representation<br />

such as the one adopted in TensorFlow. The Orlando<br />

Convolutional Accelerators can decompress the<br />

compressed weights at run time before storing them into the local kernel<br />

buffers, providing significant benefits in terms of total memory<br />

bandwidth reduction, while the nonlinear<br />

quantization scheme helps to minimize the loss of<br />

accuracy. Fig. 8 shows the quantizer functions for<br />

two different layers of an AlexNet compressed to eight bits per<br />

coefficient starting from their FP32 representation produced<br />

during training; as can be seen, the statistics vary<br />

significantly across layers, showing the benefits of allocating<br />

quantization steps non-uniformly and asymmetrically with<br />

respect to the center offset. The CA supports on-the-fly kernel<br />

decompression and rounding; the functionality is implemented<br />

with a lookup table populated before the processing of<br />

a tensor or sub-tensor starts.<br />
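A toy reconstruction of this flow in Python, using a crude 1-D k-means and a lookup table (our own illustration, not the authors' tool chain):<br />

```python
# Sketch of the per-layer quantization idea: 1-D k-means builds a small
# codebook (the lookup table), weights are stored as codebook indices,
# and a LUT restores approximate values at run time.
def kmeans_1d(values, k, iters=20):
    lo, hi = min(values), max(values)
    centers = [lo + (hi - lo) * i / (k - 1) for i in range(k)]  # crude init
    for _ in range(iters):
        buckets = [[] for _ in range(k)]
        for v in values:
            buckets[min(range(k), key=lambda i: abs(v - centers[i]))].append(v)
        centers = [sum(b) / len(b) if b else c for b, c in zip(buckets, centers)]
    return centers

weights = [0.11, 0.09, -0.52, -0.48, 0.10, -0.50]   # toy layer weights
lut = kmeans_1d(weights, k=2)                        # per-layer codebook
codes = [min(range(len(lut)), key=lambda i: abs(w - lut[i])) for w in weights]
decoded = [lut[c] for c in codes]                    # run-time LUT lookup
print(max(abs(w - d) for w, d in zip(weights, decoded)))  # small error
```

With the codebook learned per layer, only the short indices travel over the memory system, which is the bandwidth saving the text refers to.<br />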

Fig. 8. Kernel weights can be quantized non-linearly with 8 or fewer bits<br />

(e.g. with k-means); the Convolutional Accelerator supports decompression in HW;<br />

AlexNet top-1 classification error rate increase of 0.3% (quantizer functions<br />

shown for Layer 1 and Layer 3)<br />

On many CNN topologies, kernel weights can be quantized<br />

with an ensemble of vector codebooks for increased network<br />

compression and lower memory bandwidth without significant<br />

loss in performance. We have developed a scheme that can be<br />

applied to kernel tensors to take advantage of this; TABLE I.<br />

shows how many bits per coefficient per layer are achieved<br />

when using an ensemble of 1 to 64 vector codebooks per layer<br />

with vector lengths of 3 and 5 coefficients. The codebooks are<br />

learned with a modified version of k-means adapted to low<br />

values of k; vectors can be chosen by slicing the kernel tensors<br />

horizontally, vertically or depth-wise, grouping them into<br />

subsets, each assigned to a different codebook. The position of<br />

each vector in the original kernel determines which codebook<br />

is used to encode it in order to avoid transmitting additional<br />

bits for encoding its label. We have not observed great variations<br />

depending on the direction (x, y or z) from which the vectors<br />

are selected; however, great variability in the optimal<br />

VQ ensemble parameters is observed from one network<br />

topology to another, and even for the same network<br />

trained on a different set of classes.<br />
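One reading of the bits-per-coefficient figures in TABLE I that reproduces them exactly is to count both the per-vector indices and the codebook storage, assuming 8-bit codebook entries (this accounting is our reconstruction; the 8-bit assumption is ours):<br />

```python
import math

# Hypothetical reconstruction of the TABLE I accounting: total bits =
# per-vector codebook indices + storage of the codebooks themselves,
# divided by the number of coefficients in the layer.
def bits_per_coeff(n_params, n_codebooks, vec_len, entries, entry_bits=8):
    index_bits = (n_params // vec_len) * math.log2(entries)
    codebook_bits = n_codebooks * entries * vec_len * entry_bits
    return (index_bits + codebook_bits) / n_params

print(round(bits_per_coeff(432, 4, 3, 16), 2))       # 4.89 (layer C1)
print(round(bits_per_coeff(4608, 1, 3, 256), 2))     # 4.0  (layer C2)
print(round(bits_per_coeff(294912, 32, 3, 256), 2))  # 3.33 (layer C5)
```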


TABLE I. ACCURACY VS PARAMETER COMPRESSION WITH VQ FOR TINY YOLO [13]<br />

Vector Quantized Tiny Yolo<br />

Layer | No. codebooks | Codebook geometry | No. of parameters | Bits per coefficient<br />

C1 | 4 | 3x16 | 432 | 4.89<br />

C2 | 1 | 3x256 | 4608 | 4.00<br />

C3 | 4 | 3x256 | 18432 | 4.00<br />

C4 | 16 | 3x256 | 73728 | 4.00<br />

C5 | 32 | 3x256 | 294912 | 3.33<br />

C6 | 64 | 3x256 | 1179648 | 3.00<br />

C7 | 64 | 3x256 | 4718592 | 2.75<br />

C8 | 64 | 3x256 | 9437184 | 2.71<br />

C9 | 64 | 5x256 | 128000 | 6.72<br />

Total parameters: 15855536<br />

Network | Bits per coefficient | IOU% | Recall%<br />

FP32 | 32 | 65.17 | 81.53<br />

VQ Quantized | 2.86 | 63.28 | 79.76<br />

IV. CNN LOGICAL TO PHYSICAL MAPPING<br />

Efficient mapping of a CNN task graph to the underlying architectural computing and memory resources requires that the execution of convolutional layers is partitioned by slicing both kernel and input activation tensors. Each sub-tensor is assigned to a different convolutional accelerator, and the partial results can either be sent to memory or directly streamed into the input of another accelerator processing a different slice of the same kernel sub-tensor, for direct accumulation [Fig. 9].

The Orlando streaming architecture allows virtual channels between CAs to be created dynamically, chaining them together in a coprocessor pipeline or running them independently while broadcasting sub-tensor input data, without the need for separate memory accesses [Fig. 10]. The shape of the sub-tensors is constrained by a relatively large number of parameters, such as the available local storage for each CA (line buffers and kernel buffers), the total on-chip memory available for input and output activation maps, and the size of the kernels for a given layer compared to the maximum kernel size supported by the accelerators.

Multiple accumulation rounds are required if the iteration space exceeds any of those constraints. In addition to finding a legal schedule for sub-tensors that tessellates the whole global tensor iteration space, a mapping strategy should take into account a multi-objective cost function that includes not only performance (e.g. frames per second) but possibly also energy efficiency (frames/sec/W), while keeping external memory size and bandwidth requirements in consideration. While this is a relatively difficult scheduling problem to solve, there are effective approaches that automatically derive a nearly optimal solution, for example based on polyhedral models in the simplified iteration space of the convolutional tensor processing [7].
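As a toy illustration of the constraint side of this scheduling problem, the sketch below picks the deepest input-depth slice that fits within a single CA's line and kernel buffers, which minimizes the number of accumulation rounds along the depth axis. The names and buffer capacities are our own illustrative assumptions, not the actual Orlando parameters:

```python
from dataclasses import dataclass

@dataclass
class CAConfig:              # illustrative capacities, not actual Orlando values
    line_buffer_coeffs: int  # input feature-map coefficients one CA can hold
    kernel_buffer_coeffs: int

def feasible_depth_slices(in_depth, kh, kw, width, ca):
    """Depths of an input sub-tensor slice that fit a single CA's buffers."""
    for d in range(1, in_depth + 1):
        lines_fit = d * width * kh <= ca.line_buffer_coeffs
        kernels_fit = d * kh * kw <= ca.kernel_buffer_coeffs
        if lines_fit and kernels_fit:
            yield d

def pick_slice(in_depth, kh, kw, width, ca):
    """Largest feasible depth, i.e. fewest accumulation rounds over depth."""
    best = max(feasible_depth_slices(in_depth, kh, kw, width, ca), default=None)
    if best is None:
        raise ValueError("kernel exceeds CA-supported size; split further")
    rounds = -(-in_depth // best)   # ceiling division
    return best, rounds
```

For example, a 3x3 layer of depth 384 on a 13-wide feature map with the hypothetical buffers `CAConfig(4096, 512)` yields depth-56 slices and seven accumulation rounds; a real mapper would explore width and kernel tiling jointly, against the multi-objective cost described above.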

www.embedded-world.eu


[Figure: feature and kernel tensors sliced along depth into batches; parallel, chained, and combined parallel/chained batch execution across CAs 0..N, with DMA engines feeding feature-map and kernel data and collecting outputs K0/K1]
Fig. 9. Feature and kernel tensors are sliced into batches of variable depth, processed iteratively, and the results are accumulated.
Fig. 10. Chained and parallel sub-tensor execution on multiple CAs reduces bandwidth, power, and the number of DMA channels.

V. EXPERIMENTAL RESULTS

In the following section we provide experimental results for executing a few DCNN workloads. First, a typical AlexNet benchmark is described, for which maximum power efficiency is the target on the actual Orlando SoC prototype; then a VGG-16 workload is described (this one based on a simulated model), comparing different choices of design-time parameters for the Orlando NPU to illustrate performance for different possible configurations.

A. AlexNet on the Orlando 1 SoC prototype

When a compressed format with eight bits is adopted, as described in section III.B, AlexNet fits entirely within the internal on-chip memory, with the exception of the fully connected (FC) layers for the final classifier stages. The total amount of internal storage required is 2318 KB for parameters stored with 8 bits each and 1436 KB for feature maps with a precision of 16 bits, plus a total of ~10 MB of external RAM for FC layers compressed with a VQ scheme [Fig. 11b]. All five convolutional layers are directly mapped onto the Orlando NPU via a dynamic configuration of the configurable accelerator framework and associated CAs, while the rest of the layers are directly managed by optimized code running on the eight DSP clusters [Fig. 11a].

Performance is reported in TABLE II for each layer, in terms of processing latency, percentage utilization of the CAs' computing resources, and GOPs/sec/W for both 8- and 16-bit coefficient precision, with accumulator results always scaled back to 16 bits when storing the final convolution result. The chip operates at 200 MHz with a Vdd of 0.575 V at 25 degrees Celsius, and each convolutional layer is processed with four independent chains of two cascaded CAs.

[Figure: AlexNet layer pipeline (CONV 11x11, 5x5 and 3x3 layers with RELU/NORM/POOL, followed by FC layers) partitioned between the HW accelerators and the DSP clusters, with per-layer memory footprints; convolutions account for 85-90% of the ~832 M total operations, and the network has ~60 M parameters in total]
Fig. 11. [a] Top: AlexNet HW/SW partitioning; [b] bottom: memory footprint.

The input is a batch of a single image of size 227x227 pixels. The maximum efficiency is reached for layers 3, 4 and 5, and is equal to 2930 GOPS/sec/W, with a total average for the whole network of 2473 GOPS/sec/W for 8-bit coefficients and 2009 GOPS/sec/W for 16 bits. The average power consumption in this configuration is 41 mW and includes the power for the whole accelerator subsystem and the on-chip memories.
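These efficiency figures follow from a straightforward throughput-per-watt computation; for instance, the whole-network row of TABLE II:

```python
def gops_per_watt(mops, latency_ms, power_mw):
    """Efficiency from workload size, latency and power: MOPS/ms equals GOPS/s."""
    gops_per_sec = mops / latency_ms
    return gops_per_sec / (power_mw / 1000)

# Whole-network AlexNet figures from TABLE II (8(F)x8(W)->16 configuration):
# 1331.6 MOPS, 17.1 ms total latency, 41 mW average power.
print(round(gops_per_watt(1331.6, 17.1, 41)))  # 1899, matching the reported
# average of 1898 GOPS/sec/W within rounding
```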

TABLE II.

Layer   MOPS     Latency   Util.   16(F)x16(W)->16       16(F)x8(W)->16        8(F)x8(W)->16
                 [ms]              GOPs/W      Power     GOPs/W      Power     GOPs/W      Power
                                   max    avg  [mW]      max    avg  [mW]      max    avg  [mW]
1       210.8    2.5       80%     1228   988   86       1471  1183   72       1810  1456   58
2       447.8    6.5       86%     1475  1262   54       1767  1512   45       2175  1861   37
3       299.0    3.6       73%     1987  1445   58       2380  1731   48       2930  2131   39
4       224.2    2.7       73%     1987  1445   58       2380  1731   48       2930  2130   39
5       149.6    1.8       72%     1987  1434   58       2380  1717   48       2930  2114   39
Total   1331.6   17.1      77%     1677  1287   61       2009  1542   51       2473  1898   41

B. VGG-16 estimates on different configurations

In order to evaluate the flexibility and efficiency of the Orlando NPU template, we have estimated the performance and power consumption of VGG-16 [14] on a number of different configurations with a varying number of accelerators, assuming the availability of a high-speed external LPDDR3/4 memory interface. We assumed a total on-chip memory allocated to the DCNN workload of 512 KB (a reasonable assumption for a SoC in current silicon process technologies) and configured the CAs to support 144 and 36 MACs with 16-bit and 8-bit precision respectively (8-bit MACs have four times the throughput by way of a SIMD implementation). Results are shown in Fig. 12 in terms of frames-per-second throughput vs. power consumption, for a range of voltage/frequency operating points and for configurations of 1, 2, 4, 8 and 16 CAs respectively.



[Figure: VGG-16 frames per second vs. power (mW) for 1, 2, 4, 8 and 16 CA configurations, over Vdd/frequency operating points from 0.575 V/200 MHz to 1.1 V/1175 MHz, with throughput ranging up to 140 FPS]
Fig. 12. VGG-16 performance vs. power scaling for different Vdd ranges and CAs, with 8bpp MACs and 16 kernels in parallel.

[Figure: VGG-16 efficiency (frames/sec/W) vs. external LPDDR bandwidth (MB/sec) for 1, 2, 4, 8 and 16 CA configurations, over the same Vdd/frequency operating points]
Fig. 13. VGG-16 efficiency vs. external LPDDR bandwidth for different Vdd ranges and CAs, with 8bpp MACs and 16 kernels in parallel.

Fig. 13 shows the efficiency in terms of frames per second per Watt vs. the associated bandwidth requirements for the external LPDDR interface. A configuration with a single CA provides 1.55 FPS at 200 MHz and 0.575 V, with an efficiency of 130 FPS/W (equivalent to 1.95 TOPS/W) at 12 mW, while at the same frequency and power supply a 16-CA configuration delivers 25 FPS with an efficiency of 440 FPS/W at 56 mW. The two configurations require a throughput of 215 MB/sec and 2015 MB/sec respectively. In the high-end corner for frequency and supply (1.1 GHz, 1.1 V), the peak performance for the 16-CA configuration reaches 140 FPS, with a lower efficiency of 145 FPS/W and an associated external memory bandwidth requirement of ~11 GB/sec, still compatible with a 32-bit LPDDR4 interface.
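These corner figures follow from a direct FPS-per-watt computation, which also lets one back out the per-frame operation count implied by the TOPS/W number:

```python
def fps_per_watt(fps, power_mw):
    """Frames-per-second efficiency normalized to one Watt."""
    return fps / (power_mw / 1000)

# Single-CA and 16-CA corners at 0.575 V / 200 MHz, figures from the text:
print(round(fps_per_watt(1.55, 12)))   # 129 (reported as 130)
print(round(fps_per_watt(25, 56)))     # 446 (reported as 440)

# Converting FPS/W to TOPS/W requires the per-frame operation count;
# 1.95 TOPS/W at 130 FPS/W implies about 15 GOP per VGG-16 frame:
print(round(1.95e12 / 130 / 1e9))      # 15
```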

VI. CONCLUSIONS AND FUTURE WORK

We have described a flexible and scalable HW architecture to accelerate DCNN workloads for the design of scalable NPUs, together with its silicon validation, demonstrating its use in accelerating deep convolutional neural network operations, with a focus on convolutions, the key compute-intensive task therein. We have also addressed the problem of the large parameter space associated with these networks by incorporating a quantization scheme that is simple to implement in HW yet effective enough to compress the parameter space of embeddable networks like Tiny YOLO, which, although targeting resource-constrained devices, would otherwise still need off-chip external memory. In terms of future work, we are evolving the Orlando architecture to include HW acceleration of other non-convolutional operators such as pooling, activation functions (ranging from sigmoid, tanh and ReLU variants to custom-defined activations covering recent work like Kafnets [12]), batch normalization, and other miscellaneous operators. These new accelerators will leverage the streaming dataflow model of the Orlando to stitch together computation pipelines at runtime, allowing data to flow from one block to the other and reducing the need for subsequent compute units to access memory, thus providing an energy-efficient realization of the execution graph.

VII. REFERENCES

[1] Y. LeCun et al., "Gradient-based learning applied to document recognition," Proceedings of the IEEE, vol. 86, no. 11, pp. 2278-2324, 1998.
[2] A. Krizhevsky, I. Sutskever, G. Hinton, "ImageNet classification with deep convolutional neural networks," in NIPS, pp. 1-9, Lake Tahoe, NV, 2012.
[3] J. Sim et al., "A 1.42TOPS/W Deep Convolutional Neural Network Recognition Processor for Intelligent IoE Systems," ISSCC Dig. Tech. Papers, pp. 264-266, 2016.
[4] T. Chen et al., "A High-Throughput Neural Network Accelerator," IEEE Micro, vol. 35, no. 3, pp. 24-32, 2015.
[5] Y. Chen et al., "DaDianNao: A machine-learning supercomputer," IEEE/ACM Int. Symp. on Microarchitecture, pp. 609-622, 2014.
[6] Y. Chen et al., "Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks," ISSCC Dig. Tech. Papers, pp. 262-264, 2016.
[7] B. Pradelle, B. Meister, M. Baskaran, J. Springer, R. Lethin, "Polyhedral Optimization of TensorFlow Computation Graphs," 6th Workshop on Extreme-scale Programming Tools (ESPT-2017).
[8] S. Han, H. Mao, W. J. Dally, "Deep Compression: Compressing Deep Neural Networks with Pruning, Trained Quantization and Huffman Coding," http://arxiv.org/abs/1510.00149, 7 Jun 2017.
[9] A. Delmas Lascorz, S. Sharify, P. Judd, A. Moshovos, "Tartan: Accelerating Fully-Connected and Convolutional Layers in Deep Learning Networks by Exploiting Numerical Precision Variability," https://arxiv.org/abs/1707.09068, 27 Jul 2017.
[10] "How to Quantize Neural Networks with TensorFlow," https://www.tensorflow.org/performance/quantization.
[11] N. P. Jouppi et al., "In-Datacenter Performance Analysis of a Tensor Processing Unit," 44th International Symposium on Computer Architecture (ISCA), Toronto, Canada, 26 June 2017.
[12] S. Scardapane et al., "Kafnets: kernel-based non-parametric activation functions for neural networks," https://arxiv.org/pdf/1707.04035.pdf, 23 Nov 2017.
[13] J. Redmon et al., "YOLO: Real-Time Object Detection," https://pjreddie.com/darknet/yolo, accessed 10 Jan 2018.
[14] K. Simonyan, A. Zisserman, "Very Deep Convolutional Networks for Large-Scale Image Recognition," https://arxiv.org/abs/1409.1556, 10 Apr 2015.



Certification Aspects of a Connected Vehicle

Ritu Sethi
Intel Technology India Pvt. Ltd.
Bangalore, India
ritu.sethi@intel.com

Abstract—Vehicular communications is an evolving area of networking between vehicles and everything else (V2X): vehicle-to-vehicle (V2V), vehicle-to-pedestrian (V2P), vehicle-to-infrastructure (V2I) or vehicle-to-network (V2N). The IEEE-backed DSRC (dedicated short-range communications, based on the 802.11p Wi-Fi protocol) and the 3GPP-proposed LTE advancements in Cellular-V2X are two such emerging wireless technologies that will enable communication between the talking vehicles of tomorrow. While the LTE enhancements are still being standardized, DSRC-enabled products are already appearing in new cars, providing 360-degree situational awareness to enhance vehicle safety [7].

Vehicular communication puts forth a unique wireless communication scenario, with stringent requirements of fast network acquisition, ultra-low latency, very high reliability, priority for safety-critical messages and interoperability between technologies, while still conforming to security and privacy constraints. Even though wireless communication technologies individually provide specifications, and conformance to them would certify that particular component, there is still a need to certify the complete system holistically. This paper brings out these certification challenges.

Keywords—V2X; DSRC; Cellular; Safety; Certification

I. INTRODUCTION

Nearly two decades ago, the United States Department of Transportation's (USDOT) National Highway Traffic Safety Administration (NHTSA) analyzed accidental driving-related mishaps and took up an initiative to address them. It was concluded that 82% of all car crashes involved impaired drivers, and that up to 90% of car-accident-related deaths and 40% of crashes at intersections could be eliminated if vehicle-to-vehicle (V2V) communication could be enabled.

In 1999, the FCC allotted 75 MHz of spectrum in the 5.9 GHz band to be used by intelligent transportation systems (ITS) in the US. In 2008, ETSI allocated 30 MHz to be utilized for V2X-based studies in the European Union. The IEEE working group thus led the initiative of proposing an amendment to the 802.11a Wi-Fi protocol, specifically addressing automotive cases and their requirements.

Thus a new protocol, 802.11p, termed DSRC (dedicated short-range communications), evolved [1][3], specifically designed to permit the very fast data transmission that is critical in communications-based active safety applications in the automotive sector. Not to be left behind, the LTE standards evolved to take on the requirements for Cellular-V2X, bringing in many more commercial use cases. Depending on the use case, the requirements on the protocol can be readily translated into Key Performance Indicators (KPIs). For example, the safety-critical collision-avoidance use case needs an end-to-end latency of a few milliseconds, giving the driver sufficient reaction time; reliability of the order of 10^-x; support for node mobility of the order of a hundred km/h; positional accuracy of a few cm; and a communication range of the order of a few hundred meters.

With the integration of V2X into autonomous and assisted driving use cases, there is an interplay with all the other sensor-data assimilation as well. As a standalone system, there is a compelling need for the connected vehicle to be fully independent and capable of providing a functionally reliable and safe environment. As a player in the entire ecosystem, it has to contribute towards providing a low-interference and non-vulnerable operating condition.

This paper is structured as follows: Section II provides a brief on the wireless technologies used to achieve V2X, including the technical and infrastructural challenges; Section III presents the certification challenges, followed by a brief summary with closing remarks.

II. WIRELESS TECHNOLOGIES ENABLING V2X

Historically, an important wireless technology for V2X is DSRC, based on the IEEE 802.11p protocol. It defines the physical (PHY) and medium access control (MAC) layers. It has evolved from the Wi-Fi standard IEEE 802.11a and maintains the same frame structure and modulation as 802.11a. The software stack is standardized under the IEEE 1609 working group for WAVE (Wireless Access in Vehicular Environments). Many different layers have been developed for various networking and management functions for multi-channel operations, resource management and security, specifically for vehicular use cases.



At a high level, DSRC is based on the principle of each vehicle broadcasting its core state information in a Basic Safety Message (BSM), nominally 10 times per second. The BSM contains vehicle state information (location, speed, acceleration, and heading) and is sent out in all directions. The receivers, on the other hand, build a model of each neighbor's trajectory, assess the threat to the host vehicle, and warn the driver to take control if the threat becomes acute.
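The broadcast-and-model loop described above can be sketched as follows. The field set is a simplification of the BSM (real messages follow SAE J2735 and carry many more fields), and the constant-velocity projection with a fixed proximity radius is purely an illustrative threat heuristic, not a production algorithm:

```python
import math
from dataclasses import dataclass

@dataclass
class BasicSafetyMessage:
    """Simplified core-state fields of a DSRC BSM (nominally sent at 10 Hz)."""
    x_m: float        # position east, meters (real BSMs carry lat/long)
    y_m: float        # position north, meters
    speed_mps: float
    heading_rad: float

def predict(bsm: BasicSafetyMessage, t_s: float) -> tuple[float, float]:
    """Constant-velocity projection of a vehicle's trajectory."""
    return (bsm.x_m + bsm.speed_mps * t_s * math.sin(bsm.heading_rad),
            bsm.y_m + bsm.speed_mps * t_s * math.cos(bsm.heading_rad))

def acute_threat(host: BasicSafetyMessage, neighbor: BasicSafetyMessage,
                 horizon_s: float = 4.0, radius_m: float = 3.0) -> bool:
    """Warn if predicted positions come within `radius_m` inside the horizon."""
    for step in range(int(horizon_s * 10)):      # evaluate at the 10 Hz BSM rate
        hx, hy = predict(host, step / 10)
        nx, ny = predict(neighbor, step / 10)
        if math.hypot(hx - nx, hy - ny) < radius_m:
            return True
    return False
```

Two vehicles closing head-on at 15 m/s each from 60 m apart meet within two seconds, so this check would flag them; a parallel neighbor 100 m away never enters the radius.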

Not to be left behind in utilizing LTE cellular networks, 3GPP has enhanced the Release 14 specifications (and beyond) to standardize cellular access for vehicular communication. Since the cellular network provides higher capacity than local Wi-Fi networks and enjoys worldwide deployment, the existing broadcast mechanisms such as MBMS and eMBMS (evolved Multimedia Broadcast Multicast Service), SC-PTM (Single-Cell Point-to-Multipoint) and Sidelink (PC5) for device-to-device communication [2][5] can be utilized to meet the particularly demanding needs of V2X. With the help of the core network, prioritization of safety vs. non-safety messages can be easily achieved. The controlling nodes can vary transmission rate and range based on service conditions such as vehicle speed and density.

The proposed enhancements in LTE make the communication range sufficient to give the driver(s) ample response time (e.g. 4 seconds). The maximum supported relative velocity of the vehicles is 500 km/h, and the absolute velocity is 250 km/h for the vehicle-to-vehicle or vehicle-to-pedestrian use case. The standard also ensures that the subscriber identity cannot be tracked or identified by any other vehicle subscriber, single party or operator beyond the short time period required by the V2X application.
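The range requirement follows directly from the supported closing speed and the desired response time; at the 500 km/h maximum relative velocity, a 4-second warning implies roughly 556 m of range:

```python
def required_range_m(relative_speed_kmh: float, response_time_s: float) -> float:
    """Minimum communication range so that two closing vehicles still have
    `response_time_s` of warning before they would meet."""
    return relative_speed_kmh / 3.6 * response_time_s  # km/h -> m/s, then * s

print(round(required_range_m(500, 4)))  # 556 m at the maximum closing speed
```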

A. Technical Challenges

Since the underlying 802.11 protocol is based on Carrier Sense Multiple Access with Collision Avoidance (CSMA/CA), which relies heavily on carrier sensing with back-off timers upon sensing a collision, the intensity of channel contention among vehicles in a dense urban setting increases: the high transmission collision rate leads to large channel delays. Some optimizations proposed in [4] reference the 802.11e Quality of Service (QoS) enhancements to provide priority access to certain traffic.

Optionally, high-priority traffic may use a shorter back-off time before trying to sense the channel again for activity. This improves latency for safety-critical traffic but does not address the contention, as there are no guaranteed or reserved resources in 802.11p. Another possible way to address it is the dynamic formation of Vehicular Ad-hoc NETworks (VANETs), giving an improved chance of channel access to the newly formed network of vehicles in close proximity. But such VANETs come with a high maintenance cost, due to the high mobility and the dynamic movement of the individual nodes.
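The priority mechanism can be illustrated with a toy draw of EDCA-style contention parameters; the values below follow the 802.11e pattern of smaller windows for higher-priority access categories, but they are illustrative, not the normative 802.11p profile:

```python
import random

# Smaller contention windows and AIFS mean statistically earlier channel
# access. These are illustrative EDCA-style values, not the normative set.
ACCESS_CATEGORIES = {
    "safety_critical": {"aifs_slots": 2, "cw_min": 3,  "cw_max": 7},
    "background":      {"aifs_slots": 7, "cw_min": 15, "cw_max": 1023},
}

def backoff_slots(category: str, retries: int = 0) -> int:
    """Slots a station defers before transmitting: AIFS plus a random backoff.
    The window doubles per retry (binary exponential backoff), capped at cw_max."""
    ac = ACCESS_CATEGORIES[category]
    cw = min((ac["cw_min"] + 1) * (2 ** retries) - 1, ac["cw_max"])
    return ac["aifs_slots"] + random.randint(0, cw)
```

With these numbers, safety-critical traffic's worst first-attempt deferral (2 + 3 slots) is still shorter than background traffic's best case (7 slots), yet nothing reserves the medium: both categories continue to contend and can collide under dense load, which is exactly the limitation noted above.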

While the competing technology of LTE geared towards V2X is capable of addressing the limitations of the 802.11p PHY to some extent, it comes with its own set of constraints. The cellular network is designed around centralized control and is not well suited to safety applications with strict delay requirements, as it carries signaling overhead that implies high latency. Each vehicle is required to have a valid subscription and authorization, irrespective of whether it is served by E-UTRAN or not. Even though this approach has a huge push from the 5G Automotive Association (5GAA), it is still slowly finding its way, since the specifications were only standardized in January 2017 and are still lagging in wide industry adoption.

B. Infrastructural Challenges

V2X is an incumbent technology, and before it can be deployed it needs huge investment in the setup of some or all of the following infrastructural nodes:

- Equipment at the roadside, enclosures, mountings, and network backhaul;
- Controllers and systems at traffic lights and intersections to provide signal phase and timing accurately;
- Systems and services to provide detailed maps and geometries, including road signage;
- Positioning services for resolving vehicle locations to high accuracy and precision;
- Centralized data-collection servers and analysis of the data provided by the vehicles;
- Security credential management and processes for a trusted network.

III. CERTIFICATION CHALLENGES

Recently, a car manufacturer applied for design approval of a car interior having no steering wheel or brake pedals, citing no need for manual controls in an autonomous driving vehicle. With this utmost level of automation and no manual intervention, there is an increased focus on validation and verification not just of basic functionality, but also of the utmost secure and safe operation for both the passengers within the vehicle and others nearby. Since the connected vehicle is not only responsible for its own safety-related use cases but also contributes to those of the other vehicles in its neighborhood, it becomes critical for it to abide by norms that lead to a reliable, functionally dependable, robust, secure and safe operating condition.

While the above-mentioned wireless technologies provide a backbone to enable connectivity between vehicles and prove technical viability, V2X devices do not work self-sufficiently; they depend on in-feed sensor data from the vehicle, which is usually controlled by the vehicle OEM (Original Equipment Manufacturer). In such a case, the certification of each module can be achieved with ease, but end platform-level certification is an enormous task, especially because of the tight coupling needed to achieve the interworking.

With the given constraints, certification becomes a challenge, especially with multiple competing standards and a profusion of car manufacturers and vendors offering their own solutions. Below are a few challenges that need to be addressed:

A. Inter-operational

Interoperability is a critical piece of the puzzle. To achieve the full benefit of the system, every vehicle on the road should be able to communicate with every other vehicle, and that is possible only when every vehicle follows a common protocol. Today, the industry is divided between DSRC and Cellular-V2X, with several consortiums backing their respective choices. A few of the requirements are as follows:

- Seamless user experience across 3GPP and non-3GPP interworking: enable seamless handovers across technologies.
- Ensure interoperability across multi-modal communication systems (DSRC/Cellular).
- Ensure consistency of the messages being delivered from Cellular and DSRC systems.
- Ensure interoperability and consistency across vendors, deployments, OEMs and a wide array of device manufacturers.
- An agreed-upon and standardized adoption of intersection and traffic-control management by the Department of Transportation.

B. Functionally Reliable

The system should consistently behave as expected, delivering the required functionality in any reasonable scenario. The adopted technologies should be able to provide service in all scenarios, in and out of coverage, including non-subscribed users and network gaps. Additionally, the system should adhere to the following:

- Reliable and consistent message prioritization/transmission across cities, states and countries.
- Consistency in deploying DSRC and multi-modal communication systems at all levels.
- System redundancy planning to handle natural disasters.

C. Fault Management

The ability to manage faults and raise them appropriately is an essential safety requirement. The following could provide high-level guidelines:

- The system should be able to detect network outages/gaps and raise and broadcast alarms over the available multimodal paths.
- Re-prioritization/fallback options for the decision matrix should be provided.
- Ensure service continuity, possibly through P2P (peer-to-peer)/D2D (device-to-device) with mesh.

ISO 26262 [8] further specifies the different failure modes in the message communication context.

D. Regulatory

Regulatory conformance is usually driven by local bodies and has regional flavors. Primarily, the following need to be considered:

- Regulations for connected cars need to be driven at all levels, from local governments to the Department of Transportation.
- Uniformity in deploying and building smart/connected cities through standards.
- Certification of connected cars should go beyond radio certification and is expected to cover real-world traffic scenarios and multimodal communications.

E. Performance Guarantee

Performance is a critical part of connected vehicles. Since many of the safety-related use cases depend on deterministic latencies, the following aspects are important during conformance validation:

- Reliability and latency aspects should conform to expectations: the connected vehicle should comply with its Quality of Service guarantees.
- Protocols should be resilient enough to handle worst-case scenarios of dynamic network congestion and interference.
- A performance guarantee is essential even during natural disasters in a truly autonomous world.



F. Safety and Security

Security plays another important part in the safe deployment of such applications. Starting at the component level, security needs to be built in as an integral part of the system, ensuring that at no time do privacy and security get compromised.

- Autonomous vehicles are guided by signals from RSUs, cellular networks and other vehicles; all systems in the communication pipeline should be secure from intrusion or other compromising situations.
- All multimodal communications should ensure the integrity of the communication from source to destination.
- Issues related to security should be addressed through credential management, avoiding deliberate and accidental jamming as well as advanced hacking and spoofing.

G. Functional Safety

A functionally safe system is required to operate correctly on all kinds of inputs and to safely manage any errors or failures:

- Systems must function correctly in order to avoid hazardous situations.
- Faults must be detected and controlled.
- Fallbacks to other sensing modalities such as radar and LIDAR should be possible in case of connectivity-medium failure, moving the system into a safe state.

IV. SUMMARY<br />

The current landscape indicates that DSRC is highly likely to<br />
be deployed in the US, and ITS-G5 (based on 802.11p) is<br />
somewhat likely to be deployed in Europe. In the long run,<br />
however, 4G/5G may become the V2X technology of choice. For<br />
V2X applications, 5G will likely start with 4G (LTE V2V), and<br />
LTE V2V will remain the cellular V2X solution for several years<br />
to come. The new 5G radio will augment and complement it over<br />
time and will play a major role as the technology is studied and<br />
rapidly standardized over the coming years. It is hence essential<br />
to take some of the discussed challenges around certification of<br />
such connected vehicles into consideration.<br />



Radar sensors for autonomous driving<br />

From motion measurement to 3-D imaging<br />

Karthik Ramasubramanian, Jasbir Singh, Brian Ginsburg, Dan Wang, Anil Kumar, Chethan Kumar, Sreekiran Samala, Karthik<br />

Subburaj, Shankar Ram, Anjan Prasad, Sandeep Rao, Anil Mani, Snehaprabha Narnakaje<br />

Radar and Analytics Processors, Texas Instruments<br />

Dallas, TX, USA and Bangalore, India<br />

Abstract—In recent years, the automotive industry has been<br />

making rapid strides in various advanced driver assistance<br />

systems (ADAS), with the ultimate goal of enabling fully<br />

autonomous driving. Radar sensors play a key role in this vision,<br />

due to certain inherent benefits compared to other technologies.<br />

This paper provides an overview of the industry trends in this<br />

space and highlights the disruptive change brought about by<br />

the unprecedented level of silicon integration achieved by TI’s CMOS-based<br />
radar, leading to ‘radar-on-a-chip’ sensors. Looking<br />

forward to the future, the industry is moving towards deployment<br />

of advanced ‘imaging radars’ that use multiple cascaded radar<br />

devices to achieve high angular resolution. The paper describes a<br />

4-chip cascaded radar design and demonstrates the imaging<br />

capabilities achieved that will help enable the future of<br />

autonomous driving.<br />

Keywords—radar; autonomous driving; mmwave<br />

I. INTRODUCTION<br />

Every year, a significant number of injuries and deaths in the<br />

United States, as well as all over the world, occur due to<br />

vehicle accidents. As per NHTSA statistics, in the year 2015,<br />

there were 22,144 passenger vehicle occupants who died in<br />

motor vehicle traffic crashes and an estimated 2.18 million<br />

passenger vehicle occupants who were injured. Automotive<br />

radar technology at 77GHz has the ability to significantly reduce<br />

the occurrence of these accidents, especially those involving<br />

frontal collision or blind spots, and this technology has been<br />

deployed in premium vehicles over the past decade [1, 2, 3]. The<br />

applications for radar include Adaptive Cruise Control (ACC),<br />

Autonomous Emergency Braking (AEB), Blind Spot Detection<br />

(BSD), Lane Change Assist (LCA) and Cross Traffic Alert<br />

(CTA).<br />

Radar sensors exhibit certain inherent advantages compared<br />

to other technologies, due to their ability to measure range and<br />

radial velocity precisely, as well as their ability to operate well<br />

regardless of the ambient lighting conditions and in a wide<br />

variety of environmental conditions including fog, dust and<br />

smoke. In order to make automotive radar<br />
systems more widely available, and in order to extend the use of radar<br />

technology to additional safety functions including Parking<br />

Assist and 360-degree surround-view sensing, it is important to<br />

reduce the cost and size, and improve the ease of use, of 77 GHz radar<br />

technology. This would make it possible for multiple sensors to<br />

be mounted in various spots around the vehicle, providing more<br />

advanced safety and comfort functions in a cost-effective<br />

manner, and also enable radar-based safety features to become<br />

standard offerings even in mid-end and entry-level vehicles.<br />

Further, this would promote newer in-cabin and body/chassis<br />

applications that are now emerging, such as radar-based driver<br />

vital signs monitoring, occupant detection (child left behind in<br />

car), gesture recognition, door opener and ground clearance<br />

measurement.<br />

In this paper, we discuss the trend of silicon integration and<br />

highlight the industry’s first CMOS-based radar-on-a-single-chip<br />

solution from Texas Instruments (TI), which makes radar<br />

sensors compact, cost effective and easy to use. A family of<br />

devices [4], namely AWR1243, AWR1443 and AWR1642, has<br />

been launched addressing different applications. This launch<br />

signifies the introduction of CMOS-based highly integrated<br />

77GHz RF devices into the mainstream, with the objective of<br />

accelerating deployment of radar sensors and helping designers<br />

improve safety for drivers and passengers all over the world.<br />

The key advantage offered by CMOS is the ability to<br />

integrate the RF and analog circuits together with all of the<br />

digital processing functions into a single silicon die, thus<br />

reducing the form-factor significantly and making it easy to use.<br />

In Section II, we show the high-level block diagram of the<br />

AWR1642 device and explain the key features that make it an<br />

excellent solution for corner radar applications. We also show<br />

illustrative chirp configuration examples and field test results<br />

demonstrating its functionality.<br />

One of the key challenges for radar technology is the<br />

inherently poor angular resolution. In order to overcome this<br />

limitation, ‘imaging radars’ are being developed in the industry.<br />

In Section III, we discuss this emerging imaging radar<br />

application based on multi-chip cascading. In this context, we<br />

highlight the AWR1243 front-end device that supports<br />

cascading of multiple devices and discuss the complexities<br />

involved in developing a cascaded radar sensor solution. We<br />

showcase a 4-chip cascaded radar design that employs 12<br />

transmitters and 16 receivers to achieve very high angular<br />

resolution and demonstrate its functionality through field results.<br />

II. RADAR-ON-A-CHIP SENSOR<br />

Traditionally, corner radars have been based on 24GHz<br />

technology. However, there is a shift in the industry toward the<br />

use of the 77 GHz frequency band due to emerging regulatory<br />



requirements (upcoming sunset date for 24GHz UWB radar), as<br />

well as the smaller size, larger bandwidth availability and<br />

performance advantages [2, 5].<br />

Historically, radar implementations used discrete<br />

components (PAs, LNAs, VCO, ADCs), but more integrated<br />

solutions are now becoming available. A CMOS-based radar<br />

that integrates all RF and analog functionality, as well as digital<br />

signal processing (DSP) capability into a single chip represents<br />

the ultimate radar system-on-chip solution. Such a highly<br />

integrated device significantly simplifies radar sensor<br />

implementations, enables low power, a compact form factor for<br />

the sensor, and makes the solution cost-effective.<br />

The AWR1642 uses a complex baseband architecture and provides in-phase<br />
(I-channel) and quadrature (Q-channel) outputs. There<br />

are several advantages of complex baseband architecture as<br />

described in [8].<br />

The radio processor sub-system (a.k.a. BIST sub-system)<br />

includes the digital front-end, the ramp generator and an internal<br />

processor for control and configuration of the low-level<br />

RF/analog and ramp generator registers based on well-defined<br />

API messages from the master or DSP sub-system. This radio<br />

processor takes care of RF calibration needs and self-test/monitoring<br />

functions (BIST), which makes the device easy<br />

to use.<br />

The DSP sub-system includes a TI C674x DSP clocked at<br />

600MHz for radar signal processing, typically the processing of<br />

raw ADC data until object detection. This DSP is customer<br />

programmable, giving the user full flexibility to deploy<br />
proprietary algorithms.<br />

The master sub-system includes the Arm® automotive grade<br />

Cortex® R4F processor clocked at 200 MHz, which is customer<br />

programmable. This processor handles the communication<br />

interfaces and typically implements the higher layer algorithms<br />

such as object classification and tracking. This processor may<br />

also be used to run AUTOSAR. The master sub-system supports<br />

secure boot and includes cryptographic accelerators as well.<br />

Fig. 1. CMOS based Radar-on-a-chip sensor<br />

In this context, we highlight the AWR1642 device and its<br />

key features that enable the sensor to perform advanced ADAS<br />

functions.<br />

A. High level of integration and Ease of use<br />

The AWR1642 device offers an unprecedented level of<br />

integration and includes all the RF/Analog components (LNA,<br />

PA, Synthesizer, IF, ADC) for 2 transmitters and 4 receivers, as<br />

well as built-in customer-programmable DSP and MCU<br />

processor units for radar signal processing (Figure 1). In other<br />

words, a single device handles the signals all the way from<br />

77GHz high frequency RF, to the final CAN-FD output through<br />

which the list of detected and tracked objects is sent to the central<br />

ECU of the vehicle.<br />

Figure 2 shows the block diagram of the device. As seen in<br />

the figure, the device comprises four main sub-systems – the<br />

RF/analog sub-system, the radio processor sub-system, the DSP<br />

sub-system and the master sub-system. The RF/analog sub-system<br />

includes the RF and analog circuitry – namely, the<br />

synthesizer, PAs, LNAs, mixers, IF chains and ADCs. It<br />

supports fast (sawtooth) FMCW modulation scheme, which<br />

allows range and velocity of objects to be measured using an<br />

elegant 2D FFT processing procedure [6, 7]. The RF/analog<br />

sub-system also includes the crystal oscillator, temperature<br />

sensors, voltage monitors and a General Purpose ADC.<br />
[Fig. 2 shows the four sub-systems with their main blocks: the RF/analog sub-system (LNAs, PAs, 20 GHz synthesizer, IF chains, ADCs), the TI-programmed radio (BIST) processor sub-system (digital front-end with decimation filter chain, ramp generator), the customer-programmed master sub-system (Cortex-R4F at 200 MHz with CAN-FD, DCAN, SPI/I2C, QSPI and debug interfaces) and the customer-programmed DSP sub-system (C674x at 600 MHz with 768 kB of L3 radar data memory, of which up to 512 kB can be switched to the master R4F if required).]<br />

Fig. 2. Block diagram of AWR1642<br />
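The fast-FMCW 2D FFT procedure referenced above [6, 7] can be illustrated with a short simulation: a range FFT across each chirp followed by a Doppler FFT across chirps. The chirp parameters and target below are invented for illustration and do not correspond to an actual AWR1642 configuration.<br />

```python
import numpy as np

# Minimal sketch of fast-FMCW 2D FFT processing. All parameters are
# illustrative, not an actual device configuration.
c = 3.0e8
fc = 77e9                 # carrier frequency
B, Tc = 1.0e9, 50e-6      # sweep bandwidth and chirp duration (hypothetical)
S = B / Tc                # sweep slope
N, K = 256, 128           # samples per chirp, chirps per frame
fs = N / Tc               # ADC sampling rate

r_true, v_true = 15.0, 5.0            # simulated target: 15 m, 5 m/s
f_beat = 2 * S * r_true / c           # beat frequency from range
f_dopp = 2 * v_true * fc / c          # Doppler frequency from velocity

n = np.arange(N) / fs                 # fast time within a chirp
k = np.arange(K)[:, None] * Tc        # slow time across chirps
signal = np.exp(2j * np.pi * (f_beat * n[None, :] + f_dopp * k))

spectrum = np.fft.fft2(signal)                    # Doppler FFT + range FFT
kk, nn = np.unravel_index(np.abs(spectrum).argmax(), spectrum.shape)

r_est = nn * fs / N * c / (2 * S)                 # range bin  -> meters
v_est = kk / (K * Tc) * c / (2 * fc)              # Doppler bin -> m/s
print(r_est, v_est)                               # peak recovers ~15 m, ~5 m/s
```

The range and Doppler estimates land on the nearest FFT bins, so the recovered velocity is only accurate to within the Doppler bin width; a real pipeline would interpolate around the peak.<br />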


B. Wide RF bandwidth and Multi-mode capability<br />

The range resolution of a radar sensor depends on the RF<br />

bandwidth. If the RF sweep bandwidth used is B, then the<br />

theoretical range resolution ΔR is given by:<br />
ΔR = c / (2·B),<br />
where c denotes the speed of light.<br />
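This relation can be checked numerically; the sketch below (with the speed of light rounded to 3×10⁸ m/s) reproduces the 4 GHz and 200 MHz figures quoted in the text.<br />

```python
# Range resolution of an FMCW radar: delta_R = c / (2 * B).
# Illustrative check of the bandwidth figures discussed in the text.

def range_resolution(bandwidth_hz: float, c: float = 3.0e8) -> float:
    """Theoretical range resolution in meters for sweep bandwidth B."""
    return c / (2.0 * bandwidth_hz)

print(range_resolution(4.0e9))   # 4 GHz sweep (77-81 GHz band) -> 0.0375 m
print(range_resolution(200e6))   # 200 MHz sweep (24 GHz band)  -> 0.75 m
```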

One of the primary advantages of the 77GHz band is the<br />
availability of both the 76-77GHz and the 77-81 GHz bands for<br />

automotive radar applications. The AWR1642 device supports<br />

multi-mode capability, such that the same device can be used in<br />

76-77GHz far-range use-cases, as well as 77-81GHz near-range<br />



use-cases. Also, the device supports up to 4 GHz of RF sweep<br />

bandwidth and can therefore achieve a range resolution of<br />

3.75cm. This is 20 times better resolution than a 24 GHz<br />

narrowband radar sensor that uses 200MHz of sweep bandwidth<br />

(achieving a range resolution of 75 cm).<br />

The range resolution performance is important because it<br />

signifies the ability of the sensor to separate out multiple objects<br />

in range. A sensor with good range resolution can provide a<br />

‘dense point cloud’ of detected objects and has better ability to<br />

distinguish objects such as a person standing near a car. This<br />

improves environmental modeling and object classification,<br />

which are important for developing advanced driver assistance<br />

algorithms and enabling autonomous driving features.<br />

Also, higher range resolution helps the sensor achieve better<br />

minimum distance. For automotive applications like parking<br />

assist, a minimum distance of detection is very important; the<br />

use of 77-81 GHz radar provides a significant advantage in this<br />

aspect in comparison to technologies like ultrasound sensors.<br />

Since the accuracy is also proportional to the range resolution,<br />

the AWR1642 device can achieve high accuracy as well.<br />

The availability of a fully programmable DSP in the AWR1642<br />

device allows users to implement proprietary algorithms and<br />

build innovative solutions to address these difficult challenges.<br />

Specifically, the following are some critical areas where there is<br />

continued research and advancement of algorithms to improve<br />

the performance.<br />

• Interference mitigation algorithms<br />
• Improved detection algorithms<br />
• High-resolution angle estimation algorithms<br />
• Clustering and object classification algorithms<br />

For all of the above needs, the built-in DSP improves the<br />
performance of the sensor by offering high-performance, fully<br />
programmable signal processing capability.<br />

The aforementioned features enable the AWR1642 device<br />

to be used effectively as a radar-on-a-chip sensor, especially for<br />

various corner radar applications. The table below shows an<br />

illustrative example of a multi-mode use-case, where alternating<br />

frames are used with different chirp configurations to achieve<br />

80m and 20m maximum range respectively, with the former at<br />

normal resolution and the latter at high resolution.<br />

Fig. 4. Multi-mode use-case example<br />

Fig. 3. Illustration of high resolution with 77 GHz<br />
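The trade-off behind such alternating frames can be sketched with the standard FMCW relation: the maximum beat frequency the IF chain can digitize limits the maximum range for a given chirp slope. The slope and IF-bandwidth values below are invented for illustration, not taken from the Fig. 4 configurations.<br />

```python
# FMCW maximum range is limited by the IF (beat-frequency) bandwidth:
# f_beat = 2 * S * R / c  =>  R_max = f_if_max * c / (2 * S).
# Slope and IF-bandwidth values are hypothetical.
C = 3.0e8

def max_range_m(slope_hz_per_s: float, f_if_max_hz: float) -> float:
    """Maximum unambiguous range for a chirp slope and IF bandwidth."""
    return f_if_max_hz * C / (2.0 * slope_hz_per_s)

long_range = max_range_m(slope_hz_per_s=10e12, f_if_max_hz=5.33e6)   # gentle slope
short_range = max_range_m(slope_hz_per_s=40e12, f_if_max_hz=5.33e6)  # steep slope
print(round(long_range), round(short_range))  # roughly 80 m and 20 m
```

A steeper slope sweeps more RF bandwidth in the same chirp time, which is why the short-range frame can also offer finer range resolution.<br />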

Figure 5 shows an example field test result with 80m chirp<br />

configuration. The scenario has a car driving away from the<br />

radar and it can be detected in the 2D range-Doppler heatmap as<br />

shown in the figure. This field test was done with a short-range<br />

radar antenna with gain of 10dBi. Based on the choice of<br />

antenna design and chirp configuration, even higher maximum<br />

range can be achieved with the sensor.<br />

C. DSP advantage for advanced algorithms<br />

FMCW radar technology has evolved significantly in the<br />

past several years and continues to evolve. More use-cases are<br />
being added as radar plays a larger role in modern vehicles,<br />

both for driver comfort features and safety features. The<br />

emerging use-cases also make the radar performance<br />

requirements tighter, in terms of spatial resolution, velocity<br />

resolution, object detection and classification.<br />



Fig. 5. SRR field test example with AWR1642<br />

The AWR1642 radar-on-a-chip sensor can be used as a<br />

standalone radar feeding detected and tracked objects to the<br />

ECU via the vehicle CAN bus. The availability of one CAN-FD<br />

and one CAN interface on the AWR1642 allows each sensor to<br />

communicate with the ECU over the vehicle CAN bus, as well as<br />
with other sensors over a private CAN bus. The AWR1642 can<br />

also be used as a satellite radar mounted in the corners feeding<br />

detected objects to a central radar fusion box which combines<br />

the information from the multiple sensors to generate surround<br />

coverage of the vehicle. Thus, the AWR1642 forms an excellent<br />

solution for various corner radar applications.<br />

single master AWR1243 can feed the LO output to multiple<br />

slave devices in order to maintain phase coherence.<br />

The LO synchronization is performed at 20GHz, as shown<br />

in Figure 7, which reduces the board routing loss compared to<br />

synchronizing at 77GHz. The LO signal from the master chip is<br />

sent through one or both of its output buffers, and after<br />

symmetric routing on PCB, all chips, including the master,<br />

receive this LO signal through their input buffer [9].<br />

Fig. 7. Illustration of LO distribution for 2-chip cascading<br />

Figure 8 shows the high level block diagram of a 2-chip<br />

cascaded radar implementation. In addition to the 20GHz LO<br />

signal, a digital SYNC_OUT signal from the master is also fed<br />

to the slave(s) to ensure that the ADC sampling is synchronous<br />

across all the devices.<br />

Fig. 6. Corner radar system topologies<br />

In the next section, we will cover the emerging trend of ‘imaging<br />

radars’ using multi-chip cascading to achieve high angular<br />

resolution.<br />

III. IMAGING RADAR<br />

One of the key challenges of radar technology is the angular<br />

resolution. The angular resolution depends on the number of TX<br />

and RX channels in the radar sensor. Angular resolution is very<br />
important to the future of autonomous driving: when two or more<br />
objects are at the same range and velocity (for example, two static<br />
objects at the same range), they must be separated in the angle<br />
dimension, so good angular resolution is vital in order to clearly<br />
identify objects in dense target situations. This is particularly<br />

important in scenarios such as dense urban driving conditions,<br />

drive-over or drive-under situations with small objects or<br />

overhead signposts/bridges/tunnels, and for curb detection<br />

during parking assist.<br />

In order to achieve high angular resolution, it is possible to<br />

cascade multiple radar devices and operate them in a<br />

synchronized manner to effectively increase the number of<br />

antennas. The TI AWR1243 device is a high-performance radar<br />

front-end that includes 3 transmitters and 4 receivers and sends<br />

out the ADC data via CSI2 to an external DSP/MCU. One of<br />

the key features of AWR1243 is multi-chip cascading, where a<br />

Fig. 8. 2-chip cascaded radar system<br />

A single AWR1243 device can create a virtual antenna array<br />

of 3*4 = 12 antennas. On the other hand, with two AWR1243<br />

devices cascaded, it is possible to create a virtual antenna array of up to<br />

6*8 = 48 antennas. Extending this further, with four AWR1243<br />

devices cascaded, it is possible to create a virtual antenna array<br />

of up to 12*16 = 192 antennas. This allows the 4-chip cascaded<br />

radar to achieve 16 times better angular resolution than the<br />
single-chip radar, and such a cascaded implementation can be<br />
called an ‘imaging’ radar. The antennas can be distributed<br />

between azimuth and elevation dimensions and therefore the<br />

radar can provide good resolution in both these angular<br />

dimensions.<br />
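The virtual-array arithmetic above is easy to verify. The angular-resolution estimate in the sketch below assumes all virtual elements form one uniform linear array with half-wavelength spacing, which is a simplification: the actual cascaded design splits elements between azimuth and elevation.<br />

```python
import math

# MIMO virtual array size: Ntx transmitters x Nrx receivers yield
# Ntx * Nrx virtual antenna elements (3 TX and 4 RX per AWR1243).
def virtual_elements(n_chips: int, tx_per_chip: int = 3, rx_per_chip: int = 4) -> int:
    return (n_chips * tx_per_chip) * (n_chips * rx_per_chip)

# Rule-of-thumb angular resolution of an N-element uniform linear array
# with half-wavelength spacing: theta ~ 2/N radians. Indicative only.
def angular_resolution_deg(n_elements: int) -> float:
    return math.degrees(2.0 / n_elements)

for chips in (1, 2, 4):
    n = virtual_elements(chips)
    print(chips, n, round(angular_resolution_deg(n), 2))
```

The 4-chip case yields 192 virtual elements, 16 times the single-chip count, matching the ratio quoted in the text.<br />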

Another important consideration with cascaded radar sensors is<br />

the ability to perform TX beamforming and beam-steering. This<br />

allows the transmitters to coherently transmit the RF signal in<br />



order to form a narrow beam and achieve farther maximum<br />

range. Also, by using phase shifters on each of the transmitters,<br />

the beam can be steered in any direction of interest. This<br />

capability allows the radar to scan towards the left or to the right<br />

depending on the situation at hand. The AWR1243 device<br />

includes a linear phase shifter with 6 degree step size that can be<br />

configured to achieve TX beam steering as needed.<br />
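The effect of the 6-degree phase-shifter step can be illustrated with a small steering-vector sketch. The half-wavelength element spacing and the 9-element linear layout assumed here are illustrative, not the actual antenna geometry.<br />

```python
import math

# Sketch of TX beam steering with a quantized phase shifter. The 6-degree
# step size is from the text; spacing and layout are assumptions.
STEP_DEG = 6.0

def steering_phases_deg(n_tx, steer_angle_deg, d_over_lambda=0.5):
    """Ideal per-element phases for a linear TX array, plus the same
    phases quantized to the shifter grid."""
    ideal = [
        (-360.0 * d_over_lambda * n * math.sin(math.radians(steer_angle_deg))) % 360.0
        for n in range(n_tx)
    ]
    quantized = [(STEP_DEG * round(p / STEP_DEG)) % 360.0 for p in ideal]
    return ideal, quantized

ideal, quant = steering_phases_deg(n_tx=9, steer_angle_deg=20.0)
# Circular phase error introduced by quantization, bounded by half a step.
worst_err = max(min(abs(i - q), 360.0 - abs(i - q)) for i, q in zip(ideal, quant))
print(quant, round(worst_err, 2))
```

With a 6-degree grid the per-element phase error stays within 3 degrees, which is why a relatively coarse shifter still steers the beam accurately.<br />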

Figure 9 shows a 4-chip cascaded radar implementation<br />

using AWR1243. This implementation supports TX<br />

beamforming using 9 transmit antennas to achieve high range<br />

beyond 250m. Further, using the transmit and receive channels<br />

in a MIMO radar configuration, it is possible to achieve an azimuth<br />

angular resolution of ~1.5 degrees.<br />

Fig. 10. Field test result with cascaded radar.<br />

These results demonstrate the significant improvement that<br />

is achievable with multi-chip cascading and showcase the<br />

significance of imaging radars for the future of autonomous<br />

driving.<br />

Fig. 9. 4-chip cascaded imaging radar<br />

Figure 10 shows a sample field test result from the 4-chip<br />

cascaded radar implementation. It can be noted that the three<br />

pedestrians can be clearly separated with the cascaded radar,<br />

whereas with a single chip radar, the angular resolution is not<br />

good enough to achieve clear separation. Also, the cascaded<br />

radar produces a ‘3D point cloud’ that includes elevation<br />

measurement as well, in addition to azimuth. A complete video<br />

of these field results is available in [10].<br />

IV. SUMMARY<br />

The automotive radar industry is rapidly evolving, in terms<br />

of emerging new applications, and in pursuit of the ultimate<br />

vision of autonomous driving. We presented two major<br />

advancements that may have significant impact on the future of<br />

automotive radar – a radar-on-a-chip sensor that represents an<br />

unprecedented level of integration enabling small form-factor,<br />

low-power and ease of use, and an imaging radar<br />

implementation that demonstrates high angular resolution and<br />

dense 3D point cloud capability using four cascaded radar chips<br />

for future advanced radar applications. TI’s portfolio of radar<br />

devices, spanning single chip radar to high-performance<br />

cascadable front-end solutions, thus enables developers to build<br />

a variety of radar sensor implementations from motion<br />

measurement to 3D imaging.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank the team members of the TI<br />

radar team for their contributions to the development of the<br />

devices highlighted in this paper.<br />

REFERENCES<br />

[1] M. Schneider, ‘‘Automotive Radar Status and Trends’’, German<br />

Microwave Conference (GeMiC), pp. 144-147, Ulm, Germany, April<br />

2005.<br />



[2] Karl M. Strohm, Hans-Ludwig Bloecher, Robert Schneider, Josef<br />

Wenger, “Development of Future Short Range Radar Technology”,<br />

EURAD 2005.<br />

[3] Jurgen Hasch, “Driving towards 2020: Automotive Radar Technology<br />

Trends”, IEEE MTT-S International Conference on Microwaves for<br />

Intelligent Mobility, 2015.<br />

[4] http://www.ti.com/sensing-products/mmwave/awr/overview.html<br />

[5] Karthik Ramasubramanian, Kishore Ramaiah, Artem Aginskiy, “Moving<br />

from legacy 24 GHz radar to state-of-the-art 77 GHz radar”, Whitepaper<br />

available at http://www.ti.com/lit/wp/spry312/spry312.pdf<br />

[6] Donald E. Barrick, “FM/CW Radar Signals and Digital Processing”,<br />

NOAA Technical Report ERL 283-WPL 26, July 1973.<br />

[7] A. G. Stove, ‘‘Linear FMCW radar techniques’’, IEE Proceedings F,<br />

Radar and Signal Processing, vol. 139, pp. 343--350, October 1992.<br />

[8] Karthik Ramasubramanian, “Using a complex baseband architecture in<br />

FMCW radar systems”, Whitepaper available at<br />

http://www.ti.com/lit/wp/spyy007/spyy007.pdf<br />

[9] B. P. Ginsburg, et al., “A Multimode 76-to-81GHz Automotive Radar<br />

Transceiver with Autonomous Monitoring,” Accepted for publication,<br />

International Solid-State Circuits Conference (ISSCC), 2018.<br />

[10] Dan Wang, “Imaging radar using multiple single-chip FMCW<br />

transceivers”, YouTube video, 2018.<br />



Automotive Synthetic Aperture Radar<br />

Florian Fembacher<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: florian.fembacher@infineon.com<br />

Gabor Balazs<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: gabor.balazs@infineon.com<br />

Abstract—This work presents an analysis of automotive frequency<br />

modulated continuous wave synthetic aperture radar<br />

(FMCW SAR) using a 77 GHz radar, with the focus on computational<br />

and memory requirements. Such an automotive SAR can<br />

be used for imaging purposes reusing already existing automotive<br />

radar systems and might be especially useful as an information<br />

source for future autonomous driving. The presented analysis relies<br />

on a range-Doppler and wavenumber domain algorithm, which<br />

are two of the most commonly used techniques in SAR applications.<br />

Based on given constraints of an automotive embedded system,<br />

different processing frameworks for each discussed algorithm will<br />

be proposed. The results for all proposals are tested in a real<br />

application and presented in this work.<br />

Keywords—SAR, FMCW, embedded system<br />

I. INTRODUCTION<br />

Today synthetic aperture radar (SAR) is a well-established<br />
signal processing technique to create high-resolution images<br />
without the need for large antennas. It was originally<br />

invented by Carl Wiley in the early 1950s with the purpose of<br />

being used for surveillance systems. Since radar operates<br />
independently of lighting conditions and can penetrate<br />
clouds and fog, it can be used regardless of time of day and<br />
weather.<br />

So far SAR is still mainly used for aeronautical but not<br />

for automotive applications, although radar systems are already<br />

commonly used for driver assistance systems like blind spot detection,<br />

adaptive cruise control (ACC) or collision avoidance<br />

systems. While automotive radars achieve high resolution in<br />

range, due to small aperture sizes only poor angular resolution<br />

is achieved. Therefore their use is limited to applications which<br />

do not require high angular resolution.<br />

There is a high demand for automated park assist systems,<br />
which require high-resolution sensors in range and azimuth.<br />

A Bitkom survey conducted in 2017 [1] showed 69% of the<br />

interviewed drivers would be willing to hand over full control<br />

over the vehicle for automated parking. In 2016, 16% of new<br />
cars were already equipped with a park assist system and 64%<br />
with a park distance control [2].<br />

A high resolution imaging radar, which captures the road<br />

side, would be an essential tool for fully automated parking<br />

without the need for human interaction.<br />

A. Related Work<br />

Several approaches to implement an automotive SAR system<br />

can be found in literature.<br />

In [3] Wu and Zwick design a wavenumber domain SAR<br />

imaging system for parking lot scenarios and evaluate the<br />

influence of motion errors in their system using a simulation.<br />

As a result, they suggest compensating for motion only in azimuth<br />
by controlling the chirp repetition frequency (CRF).<br />

A further improved motion compensation algorithm is presented<br />

by Wu et al. in [4] with experimental results that were<br />

achieved in a real environment.<br />

In [5] the authors use different implementations of a range-Doppler<br />
domain algorithm for radar imaging and test their<br />

system with an automotive 77 GHz radar, which is moved on a<br />

rail for parking lot detection to demonstrate the effect of range<br />

migration and compare the results of different algorithms.<br />

Imaging results of measurements that were performed in a<br />

real automotive scenario using a 77 GHz radar system moved<br />

on a rail are shown in [6]. The authors compare a range-Doppler<br />

domain and line processing algorithm, which both can be used<br />

to compute high resolution radar images.<br />

In [7] a computationally efficient radar imaging algorithm<br />

for automotive applications is implemented, which uses some<br />

approximations in the signal model. The basic idea of the<br />

algorithm is to map multiple range-Doppler images onto a global<br />
Cartesian coordinate system.<br />

To test if series radar sensors are applicable for synthetic<br />

aperture radar processing, the authors in [8] use a time-domain<br />

backprojection algorithm. Processing in the Fourier domain<br />
is not possible for series sensors because of limitations on the<br />
possible radar configurations. Using a 24 GHz sensor, the authors are<br />
able to increase the azimuth resolution to a few centimeters.<br />

In the previously mentioned work it is shown that high-<br />
resolution radar images can be computed using different SAR<br />
processing algorithms. However, the main focus lies on image<br />
quality rather than on the implementation on an automotive embedded<br />
system.<br />

In the Fourier domain SAR algorithms can be computed<br />

more efficiently than in the time domain, which makes them preferable<br />

for applications in embedded systems. The range-Doppler<br />

[9] and ω-k [9] algorithms, as well as their variations, are among<br />
the most commonly used algorithms for SAR processing.<br />

Both algorithms are evaluated in this paper with respect to<br />

their imaging quality, memory requirements and performance<br />

on an embedded system to conclude if they are suitable for an<br />

automotive use case.<br />

www.embedded-world.eu<br />

340


Fig. 1. A parking lot detection scenario for a vehicle with a broadside corner<br />

radar.<br />

II. SIGNAL PROCESSING<br />

A. System Model<br />

Fig. 1 illustrates a parking lot scenario in which a broadside<br />

radar with a beamwidth of θ is mounted on a vehicle that moves<br />

with a constant velocity v. For simplicity of the following<br />

analysis, it will be assumed that the vehicle is moving on a<br />

straight line. This is usually not true for a real scenario where<br />

motion compensation has to be considered.<br />

The radar transmits frequency modulated continuous wave<br />

(FMCW) signals at a given pulse repetition frequency (PRF)<br />

and receives the echoed signals from targets. Multiple echoes<br />

over time are collected and stored in a memory buffer. One<br />

target will be measured multiple times during the movement of<br />

the vehicle and appear at different slant ranges in the recorded<br />

two dimensional signal. In the azimuth dimension the Doppler<br />

frequency shift is sampled, which is caused by the varying slant<br />

range of a target. Therefore the PRF depends directly on the<br />

vehicle's velocity and the maximum range of interest. The range<br />

profile of one target can be modeled by a hyperbolic equation<br />

r(t_a) = √(r_0² + v² t_a²)   (1)<br />

where r_0 is the range at closest approach and t_a refers to the<br />

slow time in azimuth.<br />
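The hyperbolic range history of Eq. (1) can be sketched numerically; the range at closest approach and the velocity below are illustrative values, not taken from the paper's setup.<br />

```python
import numpy as np

# Hyperbolic range history of Eq. (1): r(t_a) = sqrt(r_0^2 + v^2 * t_a^2).
# r_0 and v are illustrative values, not the paper's parameters.
r_0 = 5.0                          # range at closest approach [m]
v = 1.0                            # platform velocity [m/s]
t_a = np.linspace(-2.0, 2.0, 5)    # slow time in azimuth [s]

r = np.sqrt(r_0**2 + v**2 * t_a**2)
print(r)   # minimum r_0 at t_a = 0, symmetric hyperbola
```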

It is an essential step in SAR signal processing to correct this<br />

so-called range cell migration (RCM). The way the RCM<br />

correction (RCMC) is realized is the key difference<br />

between range-Doppler and wavenumber domain algorithms.<br />

B. Signal Model<br />

Automotive FMCW radar systems are operated in the 24<br />

GHz, 77 GHz or 79 GHz frequency bands. The<br />

possible resolution in range<br />

δr = c / (2B)   (2)<br />

is limited by the radar’s bandwidth B.<br />
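For the 1.0 GHz bandwidth used later in the measurement setup (Table II), Eq. (2) gives a range resolution of 0.15 m:<br />

```python
# Range resolution of Eq. (2) for the 1.0 GHz RF bandwidth of Table II.
c = 299_792_458.0   # speed of light [m/s]
B = 1.0e9           # RF bandwidth [Hz]

delta_r = c / (2 * B)
print(f"{delta_r:.3f} m")   # ~0.15 m range resolution
```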

Fig. 2. Block diagram showing the processing steps of the range-Doppler<br />

algorithm in a) and the ω-k algorithm in b). [a): IF signal → range compression → azimuth FFT → RCMC → azimuth compression → azimuth IFFT → radar image; b): IF signal → range compression → azimuth FFT → reference function multiply → Stolt mapping → azimuth IFFT → radar image.]<br />

The FMCW radar transmits a chirp signal, which can be<br />

mathematically described by<br />

s_t(t) = exp(j2π(f_c t + ½αt²))   (3)<br />

where f c is the radar’s carrier frequency, t the fast time<br />

variable within the PRI, and α the frequency sweep rate<br />

B/PRI.<br />

The received signal is mixed with the transmitted signal,<br />

which results in the intermediate frequency<br />

s_if(t) = exp(j2π(f_c τ + αtτ − ½ατ²))   (4)<br />

with the round trip delay time τ.<br />

The range r of a detected target is directly proportional to<br />

the beat frequency<br />

f_b = ατ = (2α/c) r.   (5)<br />

This signal model will be used in the following signal<br />

processing.<br />
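The beat-frequency relation of Eqs. (4) and (5) can be sketched with a single simulated target. The ramp time and samples per ramp follow Table II; the sampling rate and target range are illustrative assumptions.<br />

```python
import numpy as np

# Beat-frequency range estimation following Eqs. (4)-(5): the IF signal of
# one target oscillates at f_b = 2*alpha*r/c. Range recovery works by
# locating the FFT peak (range compression).
c = 3e8
B = 1.0e9                 # sweep bandwidth [Hz]
T = 51.2e-6               # ramp-up time [s] (Table II)
alpha = B / T             # frequency sweep rate
n = 256                   # samples per ramp (Table II)
fs = n / T                # IF sampling rate [Hz] (illustrative choice)
r_true = 15.0             # target range [m] (illustrative)

t = np.arange(n) / fs
tau = 2 * r_true / c                               # round-trip delay
s_if = np.exp(1j * 2 * np.pi * alpha * tau * t)    # beat term of Eq. (4)

# Range compression: FFT peak -> beat frequency -> range via Eq. (5)
spec = np.abs(np.fft.fft(s_if))
f_b = np.argmax(spec[: n // 2]) * fs / n
r_est = f_b * c / (2 * alpha)
print(r_est)
```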



C. Range Doppler Algorithm<br />

The processing of SAR signals in the Fourier domain consists<br />

of three major tasks. First, the SAR signal has to be compressed<br />

in range, which results in a hyperbolic range profile for each<br />

observed target. To focus the range profiles the RCM has to<br />

be corrected. Afterwards the SAR signal can be compressed in<br />

azimuth.<br />

Fig. 2a depicts the individual steps of the range-Doppler algorithm.<br />

The range compression can be achieved by simply<br />

applying a fast Fourier transform (FFT) in range. Due to the<br />

properties of the IF signal (cf. Eq. 4) the range for each target<br />

is directly proportional to the occurring frequencies in the IF<br />

signal. In principle the RCMC can be done in the frequency or<br />

time domain. Correcting the range migration in the frequency<br />

domain is more efficient since trajectories of targets at the same<br />

slant range measured at different azimuth times will collapse in<br />

the frequency domain and can therefore be corrected together<br />

in one step.<br />

To correct the RCM, the signal has to be shifted for every<br />

Doppler frequency f_D by<br />

r_rcm = r_0 (1/√(1 − (λf_D/(2v))²) − 1).   (6)<br />

Finally the range profiles can be compressed in azimuth<br />

by applying the matched filter<br />

h_a(f_a, r) = exp(−jπ (v²/(λr)) f_a²)   (7)<br />

where f_a describes the frequencies in azimuth.<br />

After applying an inverse Fourier transform (IFFT) in the<br />

azimuth dimension the focused radar image is received as an<br />

output.<br />
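The processing chain of Fig. 2a can be sketched as follows. This is a schematic implementation with illustrative parameters (λ roughly matching 77 GHz, v = 10 m/s): the RCMC uses a simple nearest-neighbor shift per Doppler bin rather than the paper's interpolation kernels, and the azimuth compression uses the standard FM-rate form of the matched filter.<br />

```python
import numpy as np

# Schematic range-Doppler algorithm pipeline (Fig. 2a). Nearest-neighbor
# RCMC and illustrative parameters; not the paper's implementation.
def rda(raw_if, lam, v, dr, prf):
    n_rg, n_az = raw_if.shape
    # 1) Range compression: FFT over fast time (Eq. 4 makes range ~ frequency)
    s = np.fft.fft(raw_if, axis=0)
    # 2) Azimuth FFT into the range-Doppler domain
    s = np.fft.fft(s, axis=1)
    f_d = np.fft.fftfreq(n_az, d=1.0 / prf)   # Doppler frequencies [Hz]
    r0 = np.arange(n_rg) * dr                 # slant range per range bin [m]
    # 3) RCMC: shift each Doppler bin by r_rcm of Eq. (6)
    out = np.empty_like(s)
    for j, fd in enumerate(f_d):
        r_rcm = r0 * (1.0 / np.sqrt(1.0 - (lam * fd / (2 * v)) ** 2) - 1.0)
        shift = np.round(r_rcm / dr).astype(int)
        out[:, j] = s[np.clip(np.arange(n_rg) + shift, 0, n_rg - 1), j]
    # 4) Azimuth compression with the azimuth FM rate K_a = 2v^2/(lam*r)
    #    (standard matched-filter form, cf. Eq. 7)
    ka = 2.0 * v ** 2 / (lam * np.maximum(r0[:, None], dr))
    out *= np.exp(-1j * np.pi * f_d[None, :] ** 2 / ka)
    # 5) Azimuth IFFT yields the focused image
    return np.fft.ifft(out, axis=1)

img = rda(np.random.randn(64, 128), lam=0.0039, v=10.0, dr=0.15, prf=1600.0)
print(img.shape)
```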

D. Wavenumber Algorithm<br />

In this part the ω-k algorithm (cf. Fig. 2b), which operates<br />

in the wavenumber domain, is briefly explained. The data is<br />

completely processed in the two dimensional frequency domain<br />

where the range dependence of the range-azimuth coupling<br />

can be corrected. It is especially superior to the range-Doppler<br />

algorithm in case of wide azimuth apertures that are typical for<br />

automotive applications.<br />

The SAR data is transformed into the two-dimensional<br />

wavenumber domain by applying an FFT in azimuth and range.<br />

In the frequency domain the data is already compressed in range<br />

and therefore only a correction of the RCM and a compression<br />

in azimuth is necessary.<br />

Partial focusing is achieved by multiplying the data with a<br />

two dimensional reference function<br />

h_a2D(ω, ω_D) = exp(j(√(4ω²/c² − ω_D²/v²) r_ref − 2πf_a τ)).   (8)<br />

If the RCM is small the two dimensional function can be<br />

approximated by a one dimensional function that corresponds to<br />

TABLE I<br />

COMPARISON OF PSLR AND SNR OF THE ALGORITHMS’ PTR<br />

Algorithm PSLR [dB] SNR [dB]<br />

Range-Doppler 9.9 35<br />

Range-Doppler 4-tap 9.7 35<br />

Range-Doppler no RCMC 10.6 73<br />

ω-k 1D 11.0 73<br />

ω-k 2D 10.8 85<br />

the compression function used in the range-Doppler algorithm<br />

(cf. Eq. 7) at a reference range r ref .<br />

To focus the data at the other ranges a Stolt interpolation<br />

as described in [10] is applied. After an IFFT in azimuth<br />

dimension a focused image is received.<br />
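The Stolt interpolation step can be sketched as a per-column resampling of the 2D spectrum under the substitution k_y = √(4k² − k_x²); the wavenumber grids and input data below are illustrative, and a simple linear interpolation stands in for the interpolator of [10].<br />

```python
import numpy as np

# Schematic Stolt mapping of the ω-k algorithm (Fig. 2b): each Doppler
# column is resampled from the warped grid k_y = sqrt(4k^2 - kx^2) onto
# the uniform output grid 2k. Grids and data are illustrative.
def stolt_map(spec, k, k_x):
    out = np.empty_like(spec)
    for j, kx in enumerate(k_x):
        k_y = np.sqrt(4.0 * k ** 2 - kx ** 2)   # warped range wavenumber
        col = spec[:, j]
        # np.interp works on real data, so interpolate real and imaginary
        # parts separately
        out[:, j] = (np.interp(2.0 * k, k_y, col.real)
                     + 1j * np.interp(2.0 * k, k_y, col.imag))
    return out

k = np.linspace(1600.0, 1700.0, 64)    # range wavenumbers [rad/m], ~77 GHz
k_x = np.linspace(-40.0, 40.0, 32)     # azimuth (Doppler) wavenumbers [rad/m]
spec = np.random.randn(64, 32) + 1j * np.random.randn(64, 32)
out = stolt_map(spec, k, k_x)
print(out.shape)
```

For the zero-Doppler column (k_x = 0) the mapping is the identity, since k_y = 2k there; the correction grows with |k_x|, which is why wide azimuth apertures need the full Stolt step.<br />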

III. SIMULATION RESULTS<br />

The two SAR processing algorithms are evaluated by comparing<br />

their point target response. For this evaluation a MATLAB<br />

simulation is used in which an image of a reference scene<br />

of size 10 m in range and 10 m in azimuth is computed. The<br />

point target was simulated in the middle of the scene. The radar<br />

velocity was set at 1 m/s and the PRF at 1.6 kHz, which results<br />

in 400 range samples and 16000 azimuth samples.<br />
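The quoted azimuth dimension is consistent with the simulation parameters: a 10 m azimuth scene traversed at 1 m/s takes 10 s, which at a PRF of 1.6 kHz gives 16000 azimuth samples.<br />

```python
# Consistency check of the simulated azimuth dimension quoted above.
scene_az = 10.0      # azimuth extent of the scene [m]
v = 1.0              # radar velocity [m/s]
prf = 1600.0         # pulse repetition frequency [Hz]

n_azimuth = int(scene_az / v * prf)
print(n_azimuth)     # 16000
```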

The quality of the computed images is evaluated using the<br />

peak side lobe ratio (PSLR) and signal to noise ratio (SNR).<br />

The PSLR is simply based on the ratio between the intensity<br />

of the main lobe level I_mainlobe and the intensity of the largest<br />

side lobe level I_sidelobe:<br />

PSLR = 10 log₁₀(I_sidelobe / I_mainlobe).   (9)<br />

This measure gives information about how well a SAR signal<br />

processing algorithm is able to identify weak targets.<br />

The quality of the signal is additionally evaluated using the SNR,<br />

which is the ratio of the power of the signal and the noise:<br />

SNR = 10 log₁₀(P_signal / P_noise).   (10)<br />
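Eq. (9) can be illustrated on a synthetic point target response; an unweighted sinc profile is assumed here for illustration, whose first side lobe sits at the well-known −13.26 dB. Note that with this sign convention the PSLR comes out negative whenever the side lobe is weaker than the main lobe.<br />

```python
import numpy as np

# PSLR of Eq. (9) on a synthetic point target response: an unweighted
# sinc intensity profile (illustrative, not the paper's simulated PTR).
x = np.linspace(-8.0, 8.0, 4001)
ptr = np.abs(np.sinc(x)) ** 2            # intensity profile

# Separate the main lobe (between the first nulls at x = ±1) from side lobes
main = np.abs(x) < 1.0
pslr = 10 * np.log10(ptr[~main].max() / ptr[main].max())
print(round(pslr, 2))   # ~ -13.26 dB for an unweighted sinc
```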

The results of the simulation are shown in Table I. In<br />

Fig. 3 an overview of the resulting point target responses is<br />

given. The point target responses are shown in their range<br />

and azimuth profile. The range-Doppler algorithm is capable<br />

of correctly focusing the point target using an approximated<br />

interpolation kernel. Using a full interpolation kernel does not<br />

improve the result though. For performance reasons a 4-tap<br />

kernel should be sufficient. Without the RCMC the point target<br />

is not correctly focused. The ω-k algorithm performs best<br />

with the two-dimensional reference function. Because of the<br />

large curvature of the range profile, the compression with a<br />

one-dimensional reference function results in a less focused<br />

response.<br />

A. Computational Performance<br />

To evaluate the computational requirements the SAR imaging<br />

algorithms were implemented on a NVIDIA Jetson TK1 board<br />

(NVIDIA 4-Plus-1 2.32 GHz Quad-Core ARM Cortex-A15,<br />



[Point target response plots for each algorithm: amplitude over range sample and azimuth sample.]<br />

Fig. 3. Simulation results for the point target response of each implemented SAR algorithm. Subfig. a) shows the result for the range-Doppler algorithm (RDA)<br />

without RCMC. Compared to the RDA implementation with RCMC (Subfig. b) full kernel size and Subfig. c) 4-tap kernel size) the point target response is less<br />

focused in range. As visible in Subfig. d), the ω-k algorithm with a one-dimensional reference function shows poor focusing in range since the one-dimensional<br />

approximation is in general only valid for small RCM. Therefore the ω-k algorithm with a two-dimensional reference function in Subfig. e) gives the best<br />

response.<br />

2 GB RAM). Automotive embedded systems usually only<br />

offer low computational performance. Therefore it is important<br />

to find a good trade-off between the imaging quality and<br />

computational requirements of a radar imaging algorithm. To<br />

compare the previously discussed algorithms regarding their<br />

computational performance the clock cycles for the generation<br />

of a simulated SAR signal were measured. For the simulation<br />

of the SAR signal a reference scene with a range of 10 m and<br />

azimuth size of 20 m was generated. The PRF was set at 1.6<br />

kHz and the azimuth velocity at 2 m/s.<br />

In Fig. 4 the results for the range-Doppler with a full size<br />

interpolation kernel, a 4-tap interpolation kernel and without<br />

RCMC are shown, as well as the ω-k algorithm using a one-dimensional<br />

and two-dimensional reference function for compression.<br />

Using a full size interpolation for the RCMC in case<br />

of the range-Doppler algorithm results in a total of 8216 million<br />

clock cycles, which corresponds to more than 3.5 s runtime on<br />

the microcontroller. The runtime can be improved by a large<br />

amount using an approximated interpolation kernel. In this case<br />

the algorithm runs almost three times faster. Leaving out the<br />

RCM correction gives an even better runtime performance<br />

at the cost of a less focused radar image. Depending on<br />

the application RCMC might not be necessary and therefore<br />

the range-Doppler algorithm without the RCMC might be a<br />

good option. If focusing is important, using a 4-tap interpolation<br />

kernel should deliver sufficient results. The performance of the<br />

ω-k algorithm with a one-dimensional and two-dimensional<br />

reference function is much better than the performance of the<br />

range-Doppler algorithm with RCMC. Considering the results<br />

presented in section III in which the ω-k algorithm showed<br />

much better results it seems to be the preferable option.<br />

B. Memory<br />

In general, embedded systems offer only limited<br />

memory. Depending on the sampling frequency in range and<br />



[Bar chart: CPU cycles [Mio] per algorithm, broken down into 2D FFT, CFAR, RCMC and total, for RDA, RDA 4-tap, RDA w/o RCMC, ω-k 1D and ω-k 2D.]<br />

Fig. 4. Computational load for the range-Doppler algorithm (RDA) with different interpolation kernel sizes and the ω-k algorithm with a 1D and 2D reference<br />

function.<br />

azimuth several MB of memory might be needed. For efficient<br />

processing, a layout with three memory buffers can be used.<br />

The first memory buffer is used to store the IF signal<br />

of one radar measurement. The IF signal can immediately<br />

be transformed in the Fourier domain applying a FFT. To<br />

compress the signal, all negative frequencies as well as the<br />

positive frequencies above the maximum considered frequency<br />

(depending on range) should be discarded. An intermediate<br />

buffer is filled up with the results of the range buffer, until<br />

enough measurements are stored for azimuth processing. In the<br />

azimuth buffer all azimuth computations are performed and the<br />

results are stored back in the intermediate buffer. After azimuth<br />

processing the resulting radar image can be read out of the<br />

intermediate buffer by third-party applications.<br />
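The three-buffer layout described above can be sketched as follows; the buffer sizes and the azimuth operation are illustrative assumptions, not the paper's dimensions.<br />

```python
import numpy as np

# Sketch of the three-buffer layout: a range buffer holds one IF
# measurement, an intermediate buffer accumulates range-compressed lines,
# and the azimuth processing fills an azimuth buffer. Sizes illustrative.
N_SAMPLES = 256          # samples per ramp
N_RANGE = 64             # positive-frequency bins kept after range FFT
N_AZIMUTH = 1024         # measurements needed for azimuth processing

range_buf = np.zeros(N_SAMPLES, dtype=np.float32)
inter_buf = np.zeros((N_RANGE, N_AZIMUTH), dtype=np.complex64)

def on_measurement(if_signal, col):
    """Range-compress one IF measurement and store it."""
    range_buf[:] = if_signal                 # range buffer: raw IF samples
    spectrum = np.fft.fft(range_buf)
    # Discard negative frequencies and bins beyond the maximum range
    inter_buf[:, col] = spectrum[:N_RANGE]

for col in range(N_AZIMUTH):
    on_measurement(np.random.randn(N_SAMPLES).astype(np.float32), col)

# Once the intermediate buffer is full, azimuth processing runs row-wise;
# the result would be written back for readout by other applications.
azimuth_buf = np.fft.fft(inter_buf, axis=1)   # e.g. azimuth FFT
print(inter_buf.shape, azimuth_buf.shape)
```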

IV. MEASUREMENT RESULTS<br />

In this section results from real measurements, which were<br />

taken on an outside car park, are shown. For the measurements<br />

a radar with the configuration shown in Table II was used. The<br />

radar was mounted at 90 ◦ on a movable trolley, which was<br />

pushed by hand at an approximate velocity of 2 m/s. On the<br />

radar, 8 receive antennas are used simultaneously to receive the<br />

echoed radar signal. The 8 channels were summed up before<br />

processing the signal to improve the SNR. An overview of the<br />

parking lot is given in Fig. 5.<br />

The resulting images are shown in Fig. 6. The results match<br />

the expectations from section III. The best result was achieved<br />

using the ω-k algorithm with a two-dimensional reference<br />

function. The radar images computed by the range-Doppler<br />

algorithm are better focused applying the RCMC.<br />

TABLE II<br />

PARAMETERS OF THE TEST SETUP<br />

Parameter                        Value<br />

RF carrier frequency f_c         76.0 GHz<br />

RF bandwidth                     1.0 GHz<br />

Ramp-up time t_up                51.2 µs<br />

Ramp-down time t_down            10.0 µs<br />

Samples per ramp                 256<br />

PRF                              1.5 kHz<br />

Window function at range-FFT     Hann<br />

−3 dB azimuth beam width         120° (±60°)<br />

Fig. 5. Overview of the recorded parking lot scene<br />

V. CONCLUSION<br />

A comparison of the range-Doppler and ω-k algorithm<br />

was given in this work. Both algorithms were analyzed for<br />



an automotive parking lot detection use case. To see how<br />

accurate both algorithms are, the point target response for<br />

different implementations was evaluated. As expected from the<br />

properties of each algorithm, the ω-k algorithm computes better<br />

focused radar images compared to the range-Doppler algorithm<br />

in case of a large radar beamwidth. Since the focus was on the<br />

specific requirements of the implementation on an automotive<br />

embedded system, different implementations of both algorithm<br />

were presented. The performance of each implementation was<br />

evaluated on an NVIDIA Jetson TK1 microcontroller. Additionally<br />

a suggestion for a memory buffer layout was given to allow<br />

efficient computation on an embedded system. The functional<br />

correctness of the presented implementations was tested in a<br />

real environment using a 77 GHz radar and the computed<br />

radar images were presented.<br />

An important aspect in automotive radar imaging that was<br />

not included in this work is handling motion errors. Implementations<br />

of motion compensation algorithms and their memory<br />

and computational requirements have to be considered for a<br />

complete automotive SAR application.<br />

REFERENCES<br />

[1] Deutsche Automobil Treuhand (DAT), “DAT Report 2016,” 2016.<br />

[2] “Mehrheit der Autofahrer würde dem Autopiloten das<br />

Steuer übergeben,” Bitkom, Feb 2017. [Online]. Available:<br />

https://www.bitkom.org/Presse/Presseinformation/Mehrheit-der-<br />

Autofahrer-wuerde-dem-Autopiloten-das-Steuer-uebergeben.html<br />

[3] H. Wu and T. Zwick, “Automotive SAR for parking lot detection,” in<br />

Microwave Conference, 2009 German. IEEE, 2009, pp. 1–8.<br />

[4] H. Wu, L. Zwirello, X. Li, L. Reichardt, and T. Zwick, “Motion<br />

compensation with one-axis gyroscope and two-axis accelerometer for<br />

automotive SAR,” in Microwave Conference (GeMIC), 2011 German.<br />

IEEE, 2011, pp. 1–4.<br />

[5] J. Mure-Dubois, F. Vincent, and D. Bonacci, “Sonar and radar SAR<br />

processing for parking lot detection,” in Radar Symposium (IRS), 2011<br />

Proceedings International. IEEE, 2011, pp. 471–476.<br />

[6] H. Iqbal, M. B. Sajjad, M. Mueller, and C. Waldschmidt, “SAR imaging<br />

in an automotive scenario,” in Microwave Symposium (MMS), 2015 IEEE<br />

15th Mediterranean. IEEE, 2015, pp. 1–4.<br />

[7] R. Feger, A. Haderer, and A. Stelzer, “Experimental verification of a<br />

77-GHz synthetic aperture radar system for automotive applications,”<br />

in Microwaves for Intelligent Mobility (ICMIM), 2017 IEEE MTT-S<br />

International Conference on. IEEE, 2017, pp. 111–114.<br />

[8] F. Harrer, F. Pfeiffer, A. Löffler, T. Gisder, and E. Biebl, “Automotive<br />

synthetic aperture radar system based on 24 GHz series sensors,” in<br />

Advanced Microsystems for Automotive Applications 2017. Springer,<br />

2018, pp. 23–36.<br />

[9] I. G. Cumming and F. H. Wong, Digital Processing of Synthetic Aperture<br />

Radar Data. Artech House, 2005.<br />

[10] B.-C. Wang, Digital signal processing techniques and applications in<br />

radar image processing. John Wiley & Sons, 2008, vol. 91.<br />




Fig. 6. Measurement results for different SAR algorithm implementations. Subfig. a) shows the result for the range-Doppler algorithm (RDA) with a full<br />

interpolation kernel, Subfig. b) with a 4-tap kernel and Subfig. c) without RCMC. In Subfig. d) and Subfig. e) the results for the ω-k algorithm are shown<br />

with a one-dimensional and two-dimensional reference function, respectively. The image computed with the RDA without RCMC is clearly less focused than<br />

the ones with RCMC. There is no visible difference between the full size and the 4-tap interpolation kernel though. Compared to the RDA the ω-k algorithm<br />

produces much better focused images.<br />



Visual Modeling of Self-Adaptive Systems<br />

Saivignesh Sridhar Eswari<br />

Software Designer at Nobleo<br />

Eindhoven, Netherlands<br />

s.e.saivignesh@gmail.com<br />

Juha-Pekka Tolvanen<br />

MetaCase<br />

Jyväskylä, Finland<br />

jpt@metacase.com<br />

Emil Vassev<br />

SEDEV Consult Ltd<br />

Sofia, Bulgaria<br />

emil@vassev.com<br />

Abstract—When developing autonomous systems, designers<br />

employ different kinds of knowledge to specify systems. We<br />

present a visual modeling approach created for specifying self-adaptive<br />

systems. The approach uses a model-based approach to<br />

specify the system context and ontology, addressing both<br />

structural and behavioral parts. The resulting models are used<br />

for code generation targeting knowledge reasoning frameworks and<br />

tools. The presented approach supports collaboration and<br />

communication within the safety design team, improves<br />

productivity of the team and reduces the cost of software<br />

certification.<br />

Keywords—autonomous systems; self-adaptive systems; visual<br />

modeling; model-based development; domain-specific languages;<br />

KnowLang; MetaEdit+<br />

I. INTRODUCTION<br />

Autonomous vehicles have to be safe and reliable. While<br />

certification programs and safety standards such as ISO 26262<br />

provide guidance, safety design and development of safe and<br />

reliable functionality is time-consuming and costly. Moreover,<br />

the integration and promotion of autonomy in vehicles is an<br />

extremely challenging task, and although autonomous cars are<br />

already seen on our streets, the first severe accidents prove<br />

that they are not yet as safe as we had hoped.<br />

We present a visual modeling solution developed to meet<br />

recommendations for using model-based approaches in safety<br />

design and the trend in the automotive industry toward using code<br />

generation from visual models. The presented approach is<br />

based on the KnowLang framework [1] developed for<br />

Knowledge Representation and Reasoning (KR&R) in self-adaptive<br />

systems. The modeling approach consists of a set of<br />

integrated visual models, each providing a particular view of<br />

the system, such as its overall context, structures and their<br />

relationships, along with specific behavior. The visual<br />

modeling approach makes the system's description a relatively<br />

easy task where models support communication and gathering<br />

feedback within the team. The visual modeling approach also<br />

relies on capabilities to manage complexity, such as partitioning<br />

structure from behavior by providing different user-adapted<br />

views of the system (e.g. overall and detailed) and by<br />

providing the possibility to have different views of the system<br />

(e.g. viewing only inheritance among structural elements).<br />

Another key part is tooling that enables collaborative model-based<br />

development: several engineers can edit the same<br />

specifications simultaneously with continuous integration.<br />

From the model-based specifications the implemented<br />

generator produces code for the KnowLang framework for<br />

further analysis and execution. This automates the routine and<br />

makes the KnowLang framework more accessible to<br />

engineers so that they can focus on safety design. The models<br />

can also be used for reporting, providing different views for<br />

different stakeholders and for documenting the system. By<br />

using visual modeling along with automatic code generation,<br />

we practically reduce both development time and effort,<br />

decrease certification costs and improve development<br />

productivity.<br />

In this paper, we present the KnowLang framework along with<br />

the developed visual modeling approach and code generator.<br />

This work was done to meet the needs of an automotive company.<br />

We describe the process of creating the tooling and show<br />

practical cases and examples of using the modeling approach<br />

when developing various self-adaptive systems.<br />

II. SELF-ADAPTATION AND KNOWLEDGE REPRESENTATION<br />

Autonomous systems, such as automatic lawn mowers,<br />

smart home equipment, driverless train systems, or<br />

autonomous cars, perform their tasks without human<br />

intervention.<br />

A. KnowLang<br />

KnowLang [1,2,3,4] is a framework for KR&R that aims<br />

at efficient and comprehensive knowledge structuring and<br />

awareness [5] based on logical and statistical reasoning.<br />

Knowledge specified with KnowLang takes the form of a<br />

Knowledge Base (KB) that outlines a Knowledge<br />

Representation (KR) context. A key feature of KnowLang is a<br />

formal language with a multi-tier knowledge specification<br />

model (see Fig. 1) allowing integration of ontologies together<br />

with rules and Bayesian networks [6].<br />

The language aims at efficient and comprehensive<br />

knowledge structuring and awareness. It helps us tackle [2]: 1)<br />

explicit representation of domain concepts and relationships;<br />



2) explicit representation of particular and general factual<br />

knowledge, in terms of predicates, names, connectives,<br />

quantifiers and identity; and 3) uncertain knowledge in which<br />

additive probabilities are used to represent degrees of belief.<br />

Other remarkable features are related to knowledge cleaning<br />

(allowing for efficient reasoning) and knowledge<br />

representation for autonomic behavior.<br />

Fig. 1. KnowLang Specification Model.<br />

By applying KnowLang's multi-tier specification model<br />

(see Fig. 1) we build a Knowledge Base (KB) structured in<br />

three main tiers [1, 2]: 1) Knowledge Corpuses; 2) KB<br />

Operators; and 3) Inference Primitives. The tier of Knowledge<br />

Corpuses is used to specify KR structures. The tier of KB<br />

Operators provides access to Knowledge Corpuses via special<br />

classes of “ask” and “tell” Operators where “ask” Operators<br />

are dedicated to knowledge querying and retrieval and “tell”<br />

Operators allow for knowledge update. When we specify<br />

knowledge with KnowLang, we build a KB with a variety of<br />

knowledge structures such as ontologies, facts, rules and<br />

constraints where we need to specify the ontologies first in<br />

order to provide the vocabulary for the other knowledge<br />

structures.<br />
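The "ask"/"tell" operator classes can be illustrated with a minimal mock knowledge base; this is not the actual KnowLang API, only a sketch of the access pattern described above, with hypothetical fact triples.<br />

```python
# Illustrative sketch (NOT the KnowLang API) of tiered KB access:
# "tell" operators update the knowledge base, "ask" operators query it.
class KnowledgeBase:
    def __init__(self):
        self.facts = set()

    def tell(self, fact):
        """KB operator of class 'tell': knowledge update."""
        self.facts.add(fact)

    def ask(self, fact):
        """KB operator of class 'ask': knowledge querying and retrieval."""
        return fact in self.facts

kb = KnowledgeBase()
kb.tell(("Vehicle", "hasState", "Parking"))       # hypothetical fact
print(kb.ask(("Vehicle", "hasState", "Parking"))) # True
```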

A KnowLang ontology is specified over concept trees,<br />

object trees, relations and predicates. Each concept is<br />

specified with special properties and functionality and is<br />

hierarchically linked to other concepts through “parents” and<br />

“children” relationships. For reasoning purposes every concept<br />

specified with KnowLang has an intrinsic “state” attribute that<br />

may be associated with a set of possible state values the<br />

concept instances may be in. The concept instances are<br />

considered as objects and are structured in object trees - a<br />

conceptualization of how objects existing in the world of<br />

interest are related to each other. The relationships in an object<br />

tree are based on the principle that objects have properties,<br />

where the value of a property is another object, which in turn<br />

also has properties. Moreover, concepts and objects might be<br />

connected via relations. Relations are binary and may have<br />

probability-distribution attribute (e.g., over time, over<br />

situations, over concepts' properties, etc.). Probability<br />

distribution is provided to support probabilistic reasoning and<br />

by specifying relations with probability distributions we<br />

actually specify Bayesian networks connecting the concepts<br />

and objects of an ontology.<br />
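The ontology structures described above (concepts with parent/child links and state values, plus binary relations carrying probabilities) can be sketched as plain data structures; the class and attribute names here are assumptions for illustration, not KnowLang syntax.<br />

```python
from dataclasses import dataclass, field

# Sketch of KnowLang-style ontology elements: concepts form a tree via
# parent/child links and carry possible state values; relations are binary
# and may carry a probability, which connects them to Bayesian reasoning.
@dataclass
class Concept:
    name: str
    states: list = field(default_factory=list)     # possible state values
    children: list = field(default_factory=list)   # child concepts

@dataclass
class Relation:
    source: str
    target: str
    name: str
    probability: float = 1.0   # probability-distributed relation

vehicle = Concept("Vehicle", states=["Driving", "Parking"])
av = Concept("AutonomousVehicle", states=["Driving", "Parking"])
vehicle.children.append(av)    # "parents"/"children" hierarchy link

rel = Relation("AutonomousVehicle", "ParkingLot", "detects", probability=0.9)
print(vehicle.children[0].name, rel.probability)
```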

B. Knowledge Representation<br />

When developing autonomous systems, designers employ<br />

different kinds of knowledge to derive models of specific<br />

domains of interest. There’s no standard classification system<br />

- the problem domain determines what kinds of knowledge<br />

designers might consider and what models they might derive<br />

from that knowledge [7]. Designers can use different elements<br />

to represent different kinds of knowledge. Knowledge<br />

representation (KR) elements could be primitives such as<br />

rules, frames, semantic networks and concept maps,<br />

ontologies and logic expressions [7]. These primitives might<br />

be combined into more complex knowledge elements.<br />

Whatever elements they use, designers must structure the<br />

knowledge so that the system can effectively process it and<br />

humans can easily perceive the results.<br />

In the dynamically changing automotive industry,<br />

designers need to achieve optimized designs and successful<br />

validation earlier in the automotive engineering process. Many<br />

adopt advanced automation technologies based on modeldriven<br />

development to meet this challenge. Note that various<br />

model-based approaches provide automotive engineering<br />

software for design, simulation, verification, and<br />

manufacturing, allowing one to create a digital model that<br />

drives the entire product development process. Advanced<br />

analysts and designers can use analysis and simulation<br />

solutions for kinematics, dynamics, structural, thermal, flow,<br />

motion, multi-physics, and optimization in a single<br />

environment. Seamless sharing of model data between design<br />

and analysis delivers results quickly, to impact critical design<br />

decisions. The use of visual models was also a requirement<br />

from the company for a tool for specifying self-adaptive<br />

behavior. The use of visual models is also backed by empirical<br />

research in particular when investigating studies on quality of<br />

the specification, effectiveness and efficiency. As an example,<br />

Jakšić et al. [8] performed a statistical analysis for comparing<br />

the quality, efficiency and productivity between textual<br />

representation and graphical model-based representation. They<br />

focused on feature trees applied in product lines which<br />

resemble perhaps the closest the concept trees of KnowLang.<br />

The result of the empirical study was that graphically created<br />

specification was more complete and of better quality than the<br />

textually specified ones. Also graphical modeling took less<br />

time than creating the same feature model with textual<br />

specification.<br />

C. Visual Modeling Tools<br />

While it is possible to create tools from scratch, we<br />

applied the Language Workbench approach, which provides most of the<br />

needed functionality automatically: only support for the<br />

KnowLang language, its model-based visualization, checking<br />

correctness of the specifications, code generation and<br />

integration with other tools was added. MetaEdit+ [10] was<br />

applied as the tooling as it satisfied the requirements of the<br />

automotive company. These included support for collaborative<br />



modeling, version control, integration with relevant tools<br />

applied in automotive (e.g. Simulink, HiP-HOPS), updating<br />

both models and metamodels, as well as availability of<br />

supporting services. Naturally tool support was expected for<br />

visual modeling and implementing the generators for<br />

integration with KnowLang and other targets.<br />

MetaEdit+ provides tools for developing modeling support iteratively and without programming. Language definition and language use happen in the same environment, allowing the language definition to be tested immediately and updated based on experience from using the language. The language definition follows this process:<br />

1) Defining the language concepts used to create the models.<br />

2) Setting rules for these concepts, which prevents the creation of illegal or unwanted specifications.<br />

3) Defining the visual notation used when editing and reading the models. The notation can also show information not directly part of the specification itself, such as incompleteness or model references.<br />

4) Implementing the generators that produce the required artifacts, such as code, simulation data, and tests.<br />
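As a rough illustration of how these four parts relate, the sketch below renders them as plain Python data plus a toy generator. The structures and names are hypothetical stand-ins for definitions that MetaEdit+ creates interactively, without programming:<br />

```python
# Hypothetical, simplified stand-ins for the four language-definition steps.

# 1) Language concepts used to create the models.
concepts = {"Concept": ["name", "properties", "functions", "states"]}

# 2) Rules preventing illegal or unwanted specifications.
rules = [("name must be unique within", "Concept", "Concept tree")]

# 3) Visual notation used when editing and reading the models.
notation = {"Concept": {"shape": "rectangle", "color": "green"}}

# 4) A generator producing a required artifact (here: code-like text).
def generate(model):
    return "\n".join(f"CONCEPT {name}" for name in model)

print(generate(["Passenger", "Journey"]))
```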

At any step of this process the language definition can be applied and tried out. This is also possible with the multi-user version of MetaEdit+: language engineers can define the language while others simultaneously use it for modeling. The feedback loop between language definition and language use helps to reduce errors, minimize the risk of creating unwanted language features, and improve user acceptance.<br />

III. MODEL-BASED DEVELOPMENT FOR KNOWLANG<br />

The visual modeling support for KnowLang was implemented by one person. The implementation was done incrementally during spring 2017, within a period of three calendar months, and the results were reviewed by three people.<br />

A. Defining and Formalizing Concepts<br />

The language definition started by identifying the different visual views for KnowLang specifications: concept trees expressing the domain ontology, predicates expressing complex system states, contexts specifying the environment or situation in which the concepts apply, and behavior expressed with Boolean expressions.<br />

Since the concepts of KnowLang were already defined (see Section II.A), the metamodeling process largely meant mapping the KnowLang concepts to visual modeling concepts, such as objects, their relationships, roles, and properties. Fig. 2 shows the definition of concept trees and their modeling elements. These include the various concepts used as modeling objects, their inheritance and probability-based relations shown as relationships, and roles defining how objects participate in the relationships.<br />

Fig. 2. Definition of Concept tree in KnowLang<br />

The elements modeled for representing concept trees are Metaconcept, Generic Concept, Explicit Concepts, and Relations. Each of the modeling elements shown in Fig. 2 is defined in further detail. Fig. 3 shows one such definition: the Concept with its properties. The description at the bottom of the window is used in the help system available to the modeler using KnowLang.<br />

Fig. 3. Definition of Concept of KnowLang<br />

Once defined, each part of the language specification was tried out by specifying reference systems. The other language concepts of concept trees were defined similarly, as were the other views of KnowLang, such as behavior and complex states.<br />

Since we were defining a visual language, the resulting language definition differs from the grammar definitions used for textual languages. In visual models a particular element, such as the ‘Passenger’ Concept, is entered only once and referenced elsewhere in the specification, including from other diagrams. A change in one diagram is thus reflected everywhere, without the find-and-replace or refactoring tools needed when working with plain text. Similarly, an automated trace, such as where a particular ‘Passenger’ concept is used, is directly available; this supports traceability and the production of documentation reports. A visual language also provides views and separation of concerns for knowledge representation, as well as the possibility to view and filter the specification at different levels of detail or for different audiences. For example, one reader might be interested only in plain concept inheritance, whereas another wants to see the connections. This notation part is discussed in Section III.C.<br />

B. Defining Rules<br />

Each modeling element, such as the Concept Trees or Concepts illustrated above, may have rules and constraints. For instance, concept names may have to be unique within a concept tree, or inheritance between concepts may or may not allow multiple inheritance. If such rules are defined in the metamodel, they can be checked at modeling time, preventing the creation of illegal or unwanted specifications. As it is cheaper to prevent errors than to correct them later, we also added various model-checking rules to the metamodel. For the metamodel definition MetaEdit+ provides ready-made rule templates, as applied for the KnowLang definition in Figs. 4 and 5.<br />

Fig. 4. Binding rule for directed relationship<br />

Fig. 4 illustrates the definition of a directed relationship between a set of objects. A two-directed relationship must always have at least two Directed roles when connecting any of the listed objects. Fig. 5 shows the definition of a uniqueness constraint: each concept must have a unique name within a concept tree. Similarly, rules for mandatory naming, legal connections, the number of connections, etc. were added to the metamodel.<br />

Fig. 5. Uniqueness rule for naming<br />

Rules for all other parts of KnowLang were added similarly. As we divided the visual presentation into different views, the metamodel was finalized by interlinking the views. In most cases such linking appeared automatically, as in MetaEdit+ the model elements can be reused and linked between views. For other cases, like organizing the model hierarchically, the metamodel definition was extended with linking rules. Fig. 6 illustrates some of these rules, such as that a Concept may have a State chart, and an Action Concept may have Pre-conditions and Post-conditions defined in their own views.<br />

Fig. 6. Explicit rules for linking modeling elements with views<br />
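A modeling-time check of the kind these rule templates express can be sketched as follows; the data model is hypothetical, since MetaEdit+ applies such rules directly from the metamodel rather than through user code:<br />

```python
# Sketch of a modeling-time rule (hypothetical data model): each concept
# name must be unique within its concept tree, as in the rule of Fig. 5.

def check_unique_names(concept_tree):
    """Return the list of duplicated concept names in one concept tree."""
    seen, duplicates = set(), []
    for name in concept_tree:
        if name in seen:
            duplicates.append(name)
        seen.add(name)
    return duplicates

# A duplicate would be reported while modeling, before any generation step.
print(check_unique_names(["Passenger", "Journey", "Passenger"]))
```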

C. Defining Notation<br />

Visual modeling requires a concrete syntax. We defined the syntax following the KnowLang presentation material and extended it with visual properties to convey summary information, model-linking data, and error annotations. We applied various visual variables for the notation, such as shapes, colors, and fonts, guided by [9], to improve readability, understanding, and working with the specification models. Fig. 7 illustrates the definition of the notation for a Concept element. The notation is defined with the Symbol Editor of MetaEdit+; alternatively, existing visualizations could be imported and applied.<br />

Initially the notation provided just the basics: a green rectangle showing the unique name of the concept. To manage the different views, the definition was extended with a visual cue in the upper right corner indicating whether a concept has an associated subgraph. This visualizes the rule of the language defined in Fig. 6.<br />

Fig. 7. Symbol definition for Concept<br />

Since the elements have a richer structure than just a name, an alternative representation was added: a modeler may want to see further details of an element visually. For this purpose two possible representations were provided, both shown in Fig. 8. The symbol on the left shows the minimal view; the symbol on the right shows the characteristics considered important to visualize. Note that part of the data, like properties and functions, is taken directly from the metamodel of Concept (see Fig. 3), whereas the states are retrieved from the State chart linked to the concept (see the metamodel for this part in Fig. 6).<br />

Fig. 8. Two possible visualizations for the Concept, as chosen at modeling time<br />

The notation was defined similarly for the other views and their modeling elements, and not just for the main modeling objects but also for their relationships and roles. A more complete illustration of the visualization aspects is given in the example section.<br />

D. Implementing KnowLang generators<br />

After having defined the notation, we simultaneously created models representing the knowledge for KnowLang. While these models were used to test the language, they also served as the basis for the generators. Once the models existed, we implemented generators that produce the knowledge in KnowLang and call it for compilation.<br />

The generator was implemented with the Generator Editor of MetaEdit+. Fig. 9 shows the main structure of the generator in the upper left corner. A generator called Code starts from the concept tree and, for each concept, produces information on its inheritance relationships with other concepts, as well as its defined properties, functions, and states. Each of these parts has its own subgenerator, and these subgenerators match the concepts expressed in the metamodel. At the bottom of the screen, one subgenerator is shown that handles the generation of states; it in turn calls other generators producing the behavioral logic in the Boolean expression given for the concept.<br />

Fig. 9. Definition of the generator (example)<br />
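The generator structure described above can be sketched as a tree walk with one subgenerator per part. The Python below is illustrative only: the data structures are hypothetical, and the emitted text is merely KnowLang-like, not the exact KnowLang grammar:<br />

```python
# Illustrative sketch of a tree-walking generator (hypothetical data
# structures; the output is only KnowLang-like, not the exact grammar).

def gen_states(states):
    # Subgenerator for states, analogous to the one shown in Fig. 9.
    return [f"  STATE {s}" for s in states]

def gen_concept(concept):
    # Emit the concept itself, then call a subgenerator per part.
    lines = [f"CONCEPT {concept['name']}"]
    lines += [f"  PARENT {p}" for p in concept.get("parents", [])]
    lines += [f"  PROP {p}" for p in concept.get("properties", [])]
    lines += gen_states(concept.get("states", []))
    return lines

def gen_tree(tree):
    # The top-level generator starts from the concept tree.
    return "\n".join(line for c in tree for line in gen_concept(c))

print(gen_tree([{"name": "Journey", "states": ["FlatTire"]}]))
```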

The other parts of the Generator Editor provide access to the metamodel (top right) and to the generator commands (top middle). This allows writing the generator within the context of the given metamodel, i.e., the KnowLang concepts.<br />

While the main part of the generator navigates the visual model to produce the code, it also integrates with the KnowLang reasoner by calling it at the end with the generated output. In this way the developer can easily move from the visual model to seeing the results executed in KnowLang. In addition to generating the KnowLang code, the same generator system was used to provide model-checking rules not included in the metamodel. These included model guidance (e.g., flagging a partial Boolean expression), reporting (e.g., document generation), queries on models, and the production of metrics.<br />



E. On the implementation process<br />

The implementation was done by one person during spring 2017, within a period of three months. Half of the effort went into defining the metamodel (Sections III.A-C) and the other half into implementing the generator (Section III.D). During the implementation phase, three people provided feedback on the work. The implementation was tested and verified by using the created modeling language to specify various kinds of systems and by comparing against the reference test cases.<br />

IV. EXAMPLE<br />

The modeling solution is applied to specify safety functionality in different application areas, such as autonomous cars, unmanned space explorers, and surveillance drones. We use car safety as an example next and, for brevity, show only parts of the key models. The aim of the car-safety project is to compute a set of alternative routes to the current destination, to ensure that the vehicle always runs on sufficient battery, and to drive safely around crosswalks.<br />

The modeling process starts with defining the ontology of the system with concept trees. For each concept, its properties, functions, and states are defined. Fig. 10 shows the ontology with concept trees in the MetaEdit+ modeling tool, and Fig. 11 shows a portion of this model in detail.<br />

Fig. 10. Concept tree of car safety<br />

The notation uses different colors and shapes for the model elements to help the reader identify the different KnowLang concepts. Fig. 11 shows details of the concept tree dealing with the Software Phenomenon on Journey and Route, as well as related knowledge on errors, situations, and policies. If there is a need to exemplify the ontology, corresponding instances of these concepts can be defined as object trees. Object trees were specified in the metamodel as part of the KnowLang support.<br />

Fig. 11. Concept tree for software phenomenon (partial)<br />

Behavior is described with states using Boolean expressions. The Boolean expression for AvoidCollision is shown in Fig. 12. It differentiates the InLowTraffic and InHighTraffic conditions: in high traffic, NeedFix and FlatTireAtCrosswalk are not allowed. All these states refer to Boolean expressions defined for other concepts: the Traffic conditions on the Route concept, NeedFix on the Brakefailure concept, and FlatTire on the Journey concept. These concepts were defined in the concept tree (Fig. 11).<br />

Fig. 12. Behavior expression for avoiding collision<br />

Predicates in KnowLang are considered complex system states because their evaluation depends on the evaluation of the involved concept states. Fig. 13 illustrates such a predicate, dealing with three concepts on collision avoidance.<br />
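One possible reading of the AvoidCollision behavior is sketched below; the exact operator structure of the expression in Fig. 12 is an assumption based on the description in the text (high traffic permits neither NeedFix nor FlatTireAtCrosswalk):<br />

```python
# Illustrative evaluation of the AvoidCollision behavior (the operator
# structure is an assumed reading, not the exact expression of Fig. 12).

def avoid_collision(in_low_traffic, in_high_traffic,
                    need_fix, flat_tire_at_crosswalk):
    return in_low_traffic or (
        in_high_traffic and not need_fix and not flat_tire_at_crosswalk
    )

# High traffic with a flat tire at a crosswalk violates the behavior.
print(avoid_collision(False, True, False, True))   # False
print(avoid_collision(False, True, False, False))  # True
```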



Fig. 13. Predicate (complex state) for avoiding collision<br />

Created models can be transformed at any time into input for the analyzer, reasoner, or simulation applied, provided that a generator is available. Since we applied KnowLang, the generator produces KnowLang code.<br />

Fig. 14 shows a portion of the KnowLang code generated from the visual models. The highlighted part relates to the concept Journey (Fig. 11) and its states, such as the one dealing with a flat tire, used to define the behavior in Fig. 12.<br />

KnowLang and the generated code run on top of the developed system. When the system needs to make decisions, it consults KnowLang, which provides the self-adaptation.<br />

Fig. 14. Generated code in KnowLang<br />

V. CONCLUSIONS<br />

We presented a visual modeling language for developing self-adaptive systems. The approach provides several benefits for development teams:<br />

● It supports communication and collaboration within a team: different users may take different views of the specifications and can edit the same specification simultaneously.<br />

● The model can be used to generate the code, improving productivity and removing the need to learn a particular syntax and debug coding errors.<br />

● The modeling language guides developers by partitioning the system specification into different concerns, such as concepts, dependencies, and behavior.<br />

● It reduces the cost of software certification.<br />

● It reduces the time to market of a product.<br />

We also presented the actual language-creation process, covering the metamodel with its rules, the visual notation, and the code generator. The language implementation work was done in a period of three calendar months by one person. Because the investment is modest, it pays off quickly, as all the other developers can then model with the language and run the generators creating the code. As both the modeling language and the generators are freely accessible, the presented approach also gives the company full control for making extensions in the future.<br />

ACKNOWLEDGEMENT<br />

We would like to thank Dr. ir. Ion Baroson and Prof. dr. Mark van den Brand at Eindhoven University of Technology for their collaboration and for their continuous support and guidance throughout this project. We would also like to thank Baesis Automotive for initiating this project and supporting us.<br />

REFERENCES<br />

[1] Vassev, E., Hinchey, M., Knowledge Representation for Adaptive and Self-aware Systems. In Software Engineering for Collective Autonomic Systems, Volume 8998 of LNCS. Springer, 2015.<br />

[2] Vassev, E., Hinchey, M., Knowledge Representation for Adaptive and Self-Aware Systems. In Software Engineering for Collective Autonomic Systems: Results of the ASCENS Project, Lecture Notes in Computer Science, vol. 8998. Springer Verlag, 2015.<br />

[3] Vassev, E., Hinchey, M., KnowLang: Knowledge Representation for Self-Adaptive Systems. In IEEE Computer 48 (2), 81–84, 2015.<br />

[4] KnowLang Framework for Knowledge Representation and Reasoning for Self-Adaptive Systems. http://www.knowlang.engineeringautonomy.com (accessed Jan 2018).<br />

[5] Vassev, E., Hinchey, M., Awareness in Software-Intensive Systems. In IEEE Computer 45 (12), 84–87, 2012.<br />

[6] Neapolitan, R., Learning Bayesian Networks. Prentice Hall, 2013.<br />

[7] Vassev, E., Hinchey, M., Knowledge Representation and Reasoning for Intelligent Software Systems. In IEEE Computer 44 (8), 96–99, 2011.<br />

[8] Jakšić, A., France, R., Collet, P., Ghosh, S., Evaluating the Usability of a Visual Feature Modeling Notation. In International Conference on Software Language Engineering. Springer, 2014.<br />

[9] Moody, D., The “Physics” of Notations: Toward a Scientific Basis for Constructing Visual Notations in Software Engineering. IEEE Transactions on Software Engineering, Volume 35, Issue 6, 2009.<br />

[10] MetaEdit+, http://www.metacase.com (accessed Jan 2018).<br />



IoT-Security and Product Piracy: Smart Key<br />

Management versus Secure Hardware<br />

Christian Zenger 1,2 and Mario Pietersz 2<br />

1 Ruhr-Universität Bochum<br />

Horst Görtz Institut für IT-Sicherheit<br />

Bochum, Germany<br />

christian.zenger@rub.de<br />

2 PHYSEC GmbH<br />

Universitätsstr. 142<br />

44799 Bochum, Germany<br />

mario.pietersz@physec.de<br />

The today’s fear to lose against competitors, manufacturers of<br />

physical products are urgently searching for solution to<br />

“smartify” their products, to establish new digital business<br />

models, and to offer new services. To them, digitalization means<br />

mainly the establishment of (Internet-) connectivity between their<br />

products and some digital service platform. However, many<br />

business models build on top of digitalization might lose its<br />

competitive advantage for the manufacturer if the data are not<br />

secured (available, authentic, confidential, and integer). We<br />

present a detailed overview what is arguably the most difficult<br />

part in the majority of security systems, namely device<br />

authentication and key establishment. We help answering a<br />

major question of decision makers: Which key establishment<br />

method and which (security) hardware solution reduces product<br />

piracy risk as well as cyber security risks sufficiently, is capable<br />

to start today with small charges and end up with a flexible longterm<br />

capable serial production, as well as provides a good costbenefit<br />

ratio for new IoT products? In the present paper we focus<br />

on details to find an individual answer, while potential lock-in<br />

effects of suppliers and platform providers are out of scope.<br />

Keywords—IoT-security; product piracy; key management;<br />

supply chain; ad-hoc provisioning<br />

I. INTRODUCTION<br />

We are in the midst of a deep technological transition. The result is described as the “Internet of Things” (IoT) and will affect all areas of the business world. The IoT introduces the paradigm of an ecosystem of ubiquitous embedded systems that communicate with each other throughout everyday life. This is essentially what Eric Schmidt (then Executive Chairman of Google) described in 2015 at the World Economic Forum in Davos in the following words: “[The Internet will] be part of your presence all the time. Imagine you walk into a room, and the room is dynamic. And with your permission and all of that, you are interacting with the things going on in the room.”<br />

Where once classic computers and servers communicated with each other over the Internet, today products such as coffee machines or industrial sensors communicate with their associated virtualized digital service platforms. The range of IoT devices spans inexpensive consumer electronics, highly specialized industrial products (I4.0), and medical devices. All of these systems have very different requirements and properties; for instance, the protection of personal data, the protection of production secrets, and the high availability of industrial equipment can be differentiated.<br />

Gartner [1] predicts that 20 billion devices will be connected to the Internet by 2020. This creates new product features for customers and manufacturers (e.g., remote control or plagiarism controls), as well as revolutionary new business models (e.g., pay-per-use). In addition, the analysis of collected data promises cost and revenue optimization (such as predictive maintenance). As part of this digitalization, new devices will be equipped with intelligent sensors and standardized connectivity solutions.<br />

In this time of digitalization and of fear of losing out to competitors, manufacturers of classical physical products are urgently searching for solutions to “smartify” and digitalize their products, to establish new digital business models, and to offer new services. To them, digitalization mainly means establishing (Internet) connectivity between their products and some digital service platform, enabling data sharing and artificial intelligence. However, many business models built on top of digitalization may lose their competitive advantage for the manufacturer if the data are not secured (available, authentic, confidential, and integrity-protected). Simultaneously, for the consumer and society at large, it is important that the technology is privacy-preserving.<br />



Furthermore, it is now not only honest and law-abiding manufacturers who are digitizing, but unfortunately also criminals. Depending on the criminal's “business model”, e.g., building and leasing botnets, blackmailing, or sabotage, attackers have several opportunities to compromise IoT ecosystems. Manufacturers counter those threats with encryption, and thus with cryptographic primitives that require secret cryptographic material. However, stealing cryptographic keys is almost always the simplest, most impactful, and most “commercially” scalable attack. These keys are important because the root of trust of modern security solutions is based on cryptographic keys. Thus, the question of the origin and management of these keys arises.<br />

In the following, we discuss the different approaches to key management and show that the traditional ones are not necessarily suitable for the IoT. Afterwards, procedures are described that focus on the user and on ease of use. We present a detailed overview of what is arguably the most difficult part of the majority of security systems, namely device authentication and key establishment. Today's key establishment solutions for securing the IoT ecosystem can mainly be divided into three categories:<br />

● Master secrets (e.g., hard-coded factory-default keys, easy-to-guess passwords).<br />

● Device-individual credentials integrated within the production (e.g., client certificates, symmetric tokens, etc.).<br />

● Ad-hoc, user- and device-individual key establishment (e.g., using the resurrecting duckling principle).<br />

Each approach has its advantages (e.g., cheap production, solid security, or flexible production) as well as its disadvantages (e.g., a serious undermining of the system in the case of a hack, new complexities and expenses within the supply chain, or manual provisioning), and works with standard MCUs, secure MCUs (e.g., with read-out protection), or even secure hardware. A common example of a secure element is the Trusted Platform Module (TPM). TPMs usually contain a co-processor for energy-efficient computation of cryptographic primitives as well as protected storage for keys.<br />

A major question of decision makers is: Which key establishment method and which (security) hardware solution sufficiently reduces both product-piracy and cyber-security risks, makes it possible to start today at low cost and end up with flexible, long-term-capable serial production, and provides a good cost-benefit ratio for new IoT products? In the present paper we focus on the details needed to find an individual answer, while potential lock-in effects of suppliers and platform providers are out of scope.<br />

II. TECHNICAL BACKGROUND<br />

Before a more detailed description of the classical approaches to key management is given, an overview of the protection goals in information security is provided. Based on this, the basic idea of key and identity management is explained.<br />

A. Security goals in information security<br />

Expressing the security requirements of information systems as so-called protection goals is common today. A brief description of the security goals confidentiality, integrity, availability, and authenticity follows. These are the most important and widely accepted protection goals; the list is by no means exhaustive, as much more sophisticated security goals are defined in the literature.<br />

● Confidentiality means that only legitimate parties can read a message.<br />

● Data integrity ensures that messages have not been altered.<br />

● Availability of systems implies that they have to be available on a long-term basis with an assured level of quality.<br />

● Authenticity ensures the authorship of messages.<br />

The above protection goals can be achieved with cryptographic primitives. On the basis of these primitives, it is possible to develop protocols which, in a reasonable combination, enable key and identity management. More detailed information on this can be found in Understanding Cryptography by Paar et al. [2]. A well-known example from Microsoft Windows-based Active Directory environments is the Kerberos protocol, which uses, for instance, the cryptographic primitive of symmetric encryption to achieve the protection goal of authenticity.<br />
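As a small illustration of a symmetric primitive serving a protection goal, the sketch below uses an HMAC from Python's standard library (a different symmetric primitive than the encryption Kerberos uses) to provide data integrity and authenticity with a shared key:<br />

```python
import hmac
import hashlib

# A shared secret known only to the legitimate parties (illustrative value).
KEY = b"shared-secret-key"

def tag(message: bytes) -> bytes:
    """Compute an HMAC tag that proves integrity and authenticity."""
    return hmac.new(KEY, message, hashlib.sha256).digest()

def verify(message: bytes, received_tag: bytes) -> bool:
    # Constant-time comparison avoids timing side channels.
    return hmac.compare_digest(tag(message), received_tag)

msg = b"usage-counter=42"
t = tag(msg)
print(verify(msg, t))                  # True: untouched message accepted
print(verify(b"usage-counter=43", t))  # False: altered message detected
```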

B. Key and identity management<br />

As the world gets more and more digitized, consumer products, industrial systems, or, more generally speaking, “things” have been extended to communicate with their corresponding digital service platforms or directly with people. This is driven by new business models in which a service provider depends on the data generated by the product, e.g., for billing purposes. Transferring this data directly via the Internet saves manual meter-reading costs and thus scales to a large number of products in the market. There are many other applications, e.g., remote maintenance access for industrial systems, in which protection of the transmitted data, and thus compliance with the protection goals, is necessary. From the perspective of the manufacturer and operator, the following questions therefore arise:<br />

● “How do we ensure that the data has not been manipulated?”<br />

● “Can we protect the information from third-party access?”<br />

● “How can we be confident that a product is exactly what it claims to be?”<br />

With appropriate key and identity management, these questions can be answered. Such a management tool can establish the necessary relationships of trust between the devices via cryptographic protocols. These are, of course, based on cryptographic key material, which can be used in protocols to achieve the required protection goals.<br />
protocols to achieve the required protection goals.<br />

355


III. CLASSIC SOLUTIONS<br />

Based on the scenario of a generic product manufacturer, the classic approaches to key management are considered first. In this chapter, particular attention is paid to the degree of security of a solution as well as to its impact on the manufacturing process.<br />

A. No security<br />

The starting point of the digitalization of a product is adding a simple connectivity solution that allows the product to communicate with the manufacturer's digital service platform via the Internet. In doing so, the product may send, among other things, usage values or status information about the individual components of the product to the service platform. This enables the manufacturer to implement new business models, for instance pay-per-use or predictive maintenance.<br />

At this point in the development process, the concept does not yet include any security requirements for achieving protection goals at all. This implies that the product communicates unsecured with the digital service platform via untrusted network segments. The lack of security opens attack vectors such as manipulated billing data and collected usage profiles, and the products can be captured and used for large-scale attacks on other systems (DDoS botnets).<br />

A relationship of trust between the product and the digital service platform does not yet exist. At the same time the manufacturing process is obviously untouched, which implies that no security also means no additional costs in production.<br />

B. Pre-shared keys<br />

After the manufacturer has created a proof of concept, he may experience a demand for state-of-the-art security features and want to apply security concepts to the product. In real scenarios, this step may be motivated by customer requirements, by regulation, or by the manufacturer's own business model.<br />

The manufacturer fulfils the protection goal of confidentiality by encrypting with the secure and standardized encryption system AES (Advanced Encryption Standard). However, every symmetric encryption primitive requires a cryptographic key, just as every door of a house requires a physical key. The required key is intended to be provisioned in the production process, i.e., stored on the product while flashing the firmware. The manufacturer can choose between two distinct options:<br />

With a master key, all products, or a large fraction of them, share the same cryptographic key material. Anyone who knows the key can encrypt and decrypt the messages. Should the key be compromised, e.g., extracted from the firmware, the security concept collapses entirely: a successful attack scales from one product to the entire product line. Variants in which the key is derived from, e.g., the serial number or MAC address fall into the same category. Although these keys may look individual per device, they are not safe from a cryptographic point of view, because the information involved becomes public as soon as the mechanism is revealed, e.g., through reverse engineering.<br />

Another option is individualized pre-shared keys. In this case, each product is given its own secret key during the production process. The advantage is that successful attacks on individual products can no longer scale to the entire product line. However, this method has serious drawbacks. The complexity of key distribution increases quadratically with the number of products: for example, with 1,000 products in the field, about 500,000 symmetric keys are required for individual communication links between all parties. Another obvious disadvantage is the inevitable linkage between the production systems and the key management system, and the fact that the devices need an (Internet) connection to the key management system when they are commissioned.<br />
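The quadratic growth is simply the number of distinct pairs, n(n-1)/2. A short sketch, using the product count from the text:<br />

```python
def pairwise_keys(parties: int) -> int:
    """Symmetric keys needed so that every pair of parties shares an
    individual key: the number of distinct pairs, n * (n - 1) / 2."""
    return parties * (parties - 1) // 2

# 1,000 products in the field -> 499,500 keys, the "about 500,000"
# quoted above; doubling the fleet roughly quadruples the key count.
print(pairwise_keys(1000))   # 499500
print(pairwise_keys(2000))   # 1999000
```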

It should be noted that in both cases at least two parties use the same symmetric key to encrypt the messages among themselves. This implies that it cannot be distinguished which of the two parties created a message. Thus, the protection goals of non-repudiation and message authentication cannot be achieved in this way. When using pre-distributed keys, a trade-off between the level of security and the integration effort must be considered. The use of master keys is trivial but insecure; the high administrative burden of individualized symmetric keys often does not justify the level of security that can be achieved.<br />
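The missing non-repudiation can be demonstrated in a few lines, using HMAC as a stand-in for the symmetric authentication primitive (a sketch; key and message are invented):<br />

```python
import hashlib
import hmac

shared_key = b"pre-shared-factory-secret"   # known to product AND platform
message = b"usage_counter=1234"

# Both holders of the key compute exactly the same authentication tag ...
tag_from_product = hmac.new(shared_key, message, hashlib.sha256).digest()
tag_from_platform = hmac.new(shared_key, message, hashlib.sha256).digest()
assert tag_from_product == tag_from_platform

# ... so a third party shown a valid tag cannot tell which of the two
# key holders authored the message: authentication only holds between
# the partners, and non-repudiation is impossible with symmetric keys.
```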

C. Public key infrastructures<br />

The manufacturer has learned from the mistakes of other manufacturers and rejects the idea of pre-distributed symmetric keys. He now turns to a so-called Public Key Infrastructure (PKI) and the use of certificates. With this approach, each product gets individual key material and the ability to cryptographically sign and verify messages. These signatures are similar to a letter seal or a signature on a document. As a result, the product can identify itself with its individual key material. In other words, the key material represents and assures the identity of the product in the form of a signed certificate. In this scenario, a centralized, trustworthy authority, the certification authority, is responsible for issuing certificates, which can be used for authenticated communication with third parties. But how are the certificates distributed individually to the products? The following three scenarios with different trust relationships can be distinguished:<br />

In the first scenario, the certificates, together with the cryptographic material, are pre-generated and transferred to the products during the production process. In addition to the public part of the key material, the private and therefore secret part must also be transferred. As with the symmetric individual keys, the transmission of the key material must be carefully protected. Important questions therefore arise, such as: Where is the production located? Which parties does the manufacturer have to trust not to copy the key material and use it for unwanted actions, e.g., counterfeit products?<br />

The second scenario is based on generating the key material during the first start of the product in the production process; it is subsequently checked and signed by a central entity. In this case, the secret part of the key material never leaves the device. However, a central instance must be reachable at production time. What happens if it fails or is blocked by DDoS attacks? How much does it cost to stop the production lines for hours or days?<br />

In the third scenario, the cryptographic material is stored in a separate, secure chip, e.g., a Trusted Platform Module (TPM), delivered by a trusted partner and embedded in the product. More details on TPMs are provided in Chapter V. The manufacturer has to trust the manufacturer of the modules on several levels. Who guarantees that the module manufacturer does not keep backup copies of the key material? What happens if the module manufacturer changes its price structure? Which PKI solutions are now tightly coupled? To which service platform is the PKI bound?<br />

Although the procedures differ significantly, they have one thing in common: in all three methods, an individual intervention in production is required for each device, each with different means and effects. This intervention in the production process is illustrated in Fig. 1. Obviously, the coupled trusted third parties are very critical elements in the manufacturing process, and the security level of their infrastructure and of the technical and organizational processes needs to be state of the art as well. The use of individual keys is considered state of the art and reasonably secure. However, the manufacturer buys this security level with new complexities, necessary relationships of trust with other parties, and higher costs. The complexity of key establishment, caused by integrating the process into production or the supply chain, prevents flexible modularization of the individual processes.<br />
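The signing and verification that such certificates rely on can be illustrated with textbook RSA (a deliberately tiny, insecure sketch using the classic worked-example parameters; real deployments use 2048-bit keys or elliptic curves):<br />

```python
import hashlib

# Textbook-RSA key pair from the classic worked example (p=61, q=53).
n, e = 3233, 17     # public: modulus and verification exponent
d = 2753            # private: signing exponent, stays on the device

def sign(message: bytes) -> int:
    """Only the holder of the private exponent d can produce this."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(h, d, n)

def verify(message: bytes, signature: int) -> bool:
    """Anyone who knows the public pair (n, e) can check it."""
    h = int.from_bytes(hashlib.sha256(message).digest(), "big") % n
    return pow(signature, e, n) == h

msg = b"device-id 4711: status ok"
sig = sign(msg)
assert verify(msg, sig)
# A certificate is, in essence, such a signature by the certification
# authority over the device's public key and identity.
```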

Fig. 1. Integration of Public Key Infrastructures into the manufacturing<br />

process. The introduced dependencies and complexity can be challenging.<br />

D. Relying on standards<br />

The IoT stands for a world-wide network of interconnected, uniquely addressable objects based on standard communication protocols. The idea is thus interconnectivity based on standards. Unfortunately, all established communication standards (e.g., Wi-Fi [3], BLE [4], ZigBee [5], LoRaWAN [6]) are either insecure by design due to their cryptographic primitives or do not provide a (user-friendly) possibility to change key material.<br />

Therefore, an additional security layer has to be implemented, ideally with an over-the-air update mechanism. This in turn complicates the choice of a proper communication technology, because the claimed performance figures (energy consumption, computational effort, and time constraints) are no longer maintained.<br />

E. Interim conclusion<br />

In this chapter, classical approaches to key and identity management were presented and described. An adequate level of security implies a cost-intensive and risky intervention in production. A trade-off between costs and benefits can lead to further insecurity on the market. The step to individual keys is a major challenge for manufacturers today, as the expected costs and complexities are difficult to estimate due to a lack of information and experience. In practice, one often sees the alibi solution of delegating the problem to the user or technician who puts the machines into operation. Today's solutions of shifting the complexity towards the consumer, who is then required to introduce or change credentials, often require access to device-individual user interfaces, which is already a cumbersome process. On top of that, humans are notoriously bad at choosing good passwords, which makes relying on them a second erroneous assumption.<br />

Shifting the key establishment to the end user is a good idea for many use cases, especially those in which the user interacts with the device at least once. However, if the goal is to smartify and simplify his or her life, this will be successful if and only if the process is negligible with respect to the consumer's expense and IT expertise.<br />

IV. NOVEL APPROACHES<br />

The previous chapter provided an overview of the currently widely used approaches to key management. This chapter looks at new and future approaches. The first subchapter deals with the concept of ad-hoc authentication; two new methods based on this model are then described in the following subchapters.<br />

A. Ad-hoc authentication<br />

From the point of view of the manufacturer, the cryptographic individualization of products on the production line is a cost factor that should be avoided. One way to achieve this is to establish the actual relationship of trust at a later point in time, for example during commissioning or installation. This procedure is known, e.g., from the use of Bluetooth devices, which must be paired at least once with other devices. Security measures based on the use of PINs, however, are not sufficient. Such solutions arise when, in addition to cost, usability also plays an important role: security measures are accepted only if they do not hinder users in their actions.<br />



With regard to the application, the actual process of individualization can thus be done later; for example, the technician who performs the maintenance might do so in parallel. The realization of this approach is outlined in the following two subchapters on the basis of the generic use case.<br />

B. Decentralized key management<br />

After the manufacturer has decided on a PKI solution, his business models are secured and flourishing. For further development, the question arises whether leaner solutions with less administration effort exist. Such solutions should also provide long-term security by countering future threats such as quantum computers.<br />

A modern and resource-saving cryptographic approach called Physical Layer Security (PLS) utilizes physical properties instead of mathematical complexity. For this, the methods rely on information, e.g., sound, light effects, or electromagnetic signals, originating from the local environment of the IoT product [7]. Of particular interest is the use of electromagnetic signals, because they are already supported by any radio-enabled IoT product. The physical properties of the transmission channels that are exploited here are:<br />

The channel is symmetric, i.e., both the sender and the receiver observe identical channel characteristics.<br />

Channel observations of third parties are statistically independent of those of the sender and receiver.<br />

The channel is not trivially computable and cannot be simulated, i.e., the channel offers a reasonable amount of entropy.<br />

For reasons of simplicity, only the product authentication method is elaborated here. The product exchanges keys with a digital service platform via a standardized cryptographic protocol. In order for the service platform to be sure that it is communicating with the right product and not with an attacker, the product must be authenticated. For this, a coupling device, which may for example be a smartphone, is placed in the physical proximity of the product and thus verifies the key. This class of protocols is also referred to as context-based security, out-of-band authentication, or distance-bounding protocols [7, 8, 9].<br />
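The reciprocity property these protocols exploit can be sketched in a few lines. The RSSI traces below are invented illustration data; real protocols add information reconciliation and privacy amplification on top of the raw bits:<br />

```python
import hashlib
import statistics

# Simulated received-signal-strength (RSSI) readings in dBm.
# Alice's and Bob's readings are highly correlated (channel reciprocity);
# Eve, at a different position, observes an independent channel.
alice = [-42, -55, -61, -48, -70, -44, -66, -50]
bob   = [-41, -56, -60, -49, -71, -45, -65, -51]
eve   = [-58, -43, -47, -69, -52, -62, -46, -64]

def quantize(samples):
    """One bit per sample: above/below the median of the trace."""
    m = statistics.median(samples)
    return "".join("1" if s > m else "0" for s in samples)

key_alice, key_bob, key_eve = map(quantize, (alice, bob, eve))
assert key_alice == key_bob     # reciprocity -> identical bit strings
assert key_alice != key_eve     # independent channel -> different bits

# In practice the agreed bits are post-processed, e.g., hashed,
# before being used as a session key:
session_key = hashlib.sha256(key_alice.encode()).hexdigest()
```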

C. Key management middleware<br />

A key management middleware (KMM) is an abstraction layer that connects products with digital service platforms independently of each other. Those products can be connected to users of the digital service platforms in order to realize new business models. From the point of view of a product, a dynamic change of the digital service platform can even be realized over its lifetime, for example by using the KMM for this purpose: an over-the-air (OTA) update procedure simply updates the configuration of the product.<br />

The introduction of a KMM allows the manufacturer to decouple production from product and key management. During commissioning, the individual cryptographic material is generated; it does not leave the machine and can be authenticated by a technician or user. This process of authentication is usually called pairing and can be done, for example, via the technician's smartphone. If the technicians have a trusted relationship with the manufacturer, they can unlock individual features of the products, depending on the customer or service contract.<br />

The scenario can be developed further so that different users of different groups can also perform individual pairings with the product. As a result, comfort features can be automatically activated and used. Individual keys that can be inserted (and removed) by the user can also form the basis of protocols for the protection of digital privacy in the near future.<br />

With a KMM, the manufacturer has a solution that allows him to keep production lean, flexible, and independent and to move key management into the commissioning phase. This also excludes scenarios in which the confidential key material needs to be held by untrusted producers or suppliers within the supply chain. With the same level of security as PKI-based solutions, manufacturers can now adapt their architecture to new needs and situations; for example, the manufacturer can change the digital service platform or opt for other hardware configurations. The process has been developed further to enable secure communication even over uninvolved, unmodified, and untrusted networks. Key management that affects production is eliminated; instead, one obtains a set of keys that are created and authenticated during the start-up phase. Obviously, there is no need to place individual secret material into the product during manufacturing: individual key material is brought into the products, and authenticated, dynamically during commissioning.<br />
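A minimal sketch of this decoupling (hypothetical names; the fingerprint check stands in for the pairing step performed, e.g., on the technician's smartphone):<br />

```python
import hashlib
import secrets

class Product:
    """Leaves the factory without any individual secret material."""
    def __init__(self):
        self.secret_key = None          # nothing injected in production

    def commission(self) -> str:
        # Key material is generated on-device at first start-up and
        # never leaves the machine; only a short fingerprint is exposed
        # so that a technician or user can authenticate (pair) it.
        self.secret_key = secrets.token_bytes(32)
        return hashlib.sha256(self.secret_key).hexdigest()[:8]

device = Product()
assert device.secret_key is None        # production line stayed untouched
fingerprint = device.commission()
print("confirm on coupling device:", fingerprint)
```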

Fig. 3. Exemplary architecture of a key management middleware. Business applications are built on platforms. Products and users utilize the swapped-out key management capabilities of KMMs.<br />

Fig. 2. After the product is completed, ad-hoc commissioning can be executed to set up the product.<br />



V. SOFTWARE, HARDWARE AND HYBRIDS<br />

Security solutions in general are resource-intensive compared to the application or communication logic on embedded systems. Cryptographic algorithms in particular consume considerable power and storage. Choosing a security solution for IoT devices is always a compromise between achieving an acceptable security level, performance, flexibility, cost, and power consumption. There are three main categories of security approaches for IoT and embedded devices in general: hardware-based, software-based, and hybrid. Fig. 4 depicts a comparison of the approaches in terms of cost and performance.<br />

Fig. 4. Performance and cost versus hardware-based, software-based and<br />

hybrid solutions [10].<br />

The first category is the software-only security approach, which relies on programming the embedded general-purpose processor (GPP) to accomplish security tasks. A software-only approach achieves the goals in terms of cost and flexibility, but power consumption remains comparably high, there are no improvements in terms of silicon area, and in some cases this approach can exhaust the processing abilities of the GPP. Some examples of software-only approaches are [11]:<br />

Copyright notice and watermarking<br />

Proof-Carrying Code<br />

Custom OS<br />

The second category is the hardware-only approach. Here, ASIC (application-specific integrated circuit) technology is utilized to realize the required cryptographic algorithm in hardware. This approach gives designers the ability to accurately control the energy consumption, computational effort, and timing, but the downside is that it is not suitable when flexibility and cost effectiveness are required. Some examples of hardware-only approaches are [11]:<br />

Read-out protection<br />

Tamper-resistant packaging (for only a certain circuit or<br />

for covering the entire device)<br />

Secure coprocessors (also called secure elements,<br />

trusted platform modules/TPMs)<br />

The third category is the hybrid approach, which utilizes both hardware and software technologies to achieve a balance between processing efficiency, the required security level, and flexibility, while at the same time complying with design constraints. This approach requires collaboration between hardware designers, software designers, and security experts when designing and manufacturing the device.<br />

VI. DESIGN IMPLICATIONS<br />

Attacks on end devices focus mainly on hardware components, where the attacker needs to be physically close to the device [12]. Feasible attacks are:<br />

(1) Node tampering (replacing components to gain access to or alter sensitive information, e.g., cryptographic keys)<br />

(2) Cloning end devices (because of the relatively simple hardware architecture of many devices, cloning is easy)<br />

(3) Denial-of-service attacks (e.g., battery-draining attacks)<br />

The first attack is critical to IT security (esp. confidentiality and integrity) and privacy. The second attack is important with respect to product piracy. Countermeasures to both attacks are intrinsically motivated. However, tamper protection is mainly requested due to the consumer's need for digital privacy, while anti-cloning properties may have a significant impact on the (future) market share of a vendor and are therefore requested by the vendor's business plan and strategy.<br />

The question is how to solve problems (1) and (2) properly, using key management and software/hardware solutions, at minimal cost. From our perspective, the most important design criterion is the choice of a certain level of non-scalability of potential attacks.<br />

First of all, device-individual keys are important to prevent an attack from scaling from one device to many or all other devices. This also includes a secure key establishment and management process with no (or an extremely well protected) single point of failure. Second, the consumer must be able to change the end-to-end (E2E) key material without being an IT expert.<br />

In this context, we believe that shifting the key establishment to an upstream process, e.g., using suppliers or OEMs, is a good solution for vendors with a clear 1:1 relationship between a single user and a digital service platform. Here, TPMs are often a vehicle to securely merge the key material with the device. For devices that might be used by different people and/or connected to different user-specific service platforms, a downstream key establishment process is better suited, e.g., using the resurrecting duckling principle.<br />



The understanding of (as well as the resulting trust in) an ad-hoc trust establishment process, with respect to problem (1) and the related digital privacy, might be larger compared to pre-shared keys.<br />

TPMs are ostensibly argued to be secure against physical attacks, e.g., side channels, fault injection, and key extraction. Therefore, they can be used as a countermeasure against product piracy. They are also more energy-efficient than software solutions. Moreover, the complicated, time-intensive, and security-sensitive authenticated key establishment process, e.g., through certificate signing requests, can be handled by a certified supplier. However, finding security flaws in TPMs and cracking them is more than a professional hobby of hackers and competitors: new security flaws in TPMs are published continuously, cf. [13], and due to their static behavior, (compromised) key material of TPMs in the field is unfortunately hard to change.<br />

Table I and Table II summarize the results of the paper at hand to provide a condensed overview for product managers and decision makers. For the evaluation we introduce three metrics:<br />

Product piracy:<br />
o No product piracy prevention/detection (3)<br />
o Product piracy detection through online capabilities (2)<br />
o Product piracy prevention through hardware security (1)<br />

End-device security (non-scalability of attacks):<br />
o Low (D)<br />
o Medium (C)<br />
o High (B)<br />
o Highest (A)<br />

Costs/complexity of production, logistics and key management:<br />
o High (γ)<br />
o Medium (β)<br />
o Low (α)<br />

TABLE I. PRODUCTION LINE KEY ESTABLISHMENT<br />

                      Master/Group/   Device individual keys           Device and user individual keys<br />
                      Network keys    Symm.   Asymm.                   Symm.   Asymm.<br />
                                              privKey    privKey               privKey    privKey<br />
                      No E2E                  by Server  by Device             by Server  by Device<br />
Software              3Dα             2Cβ     2Cβ        2Cγ           2Bβ     2Bβ        2Bγ<br />
Hardware/Hybrid       3Dα             1Bβ     1Bβ        1Bγ           1Aβ     1Aβ        1Aγ<br />
Non-repudiation       -               -       -          ok            -       -          ok<br />
Offline key server    -               ok      ok         -             ok      ok         -<br />
(easy to protect)<br />

TABLE II. APPROACHES WITH AD-HOC KEY ESTABLISHMENT<br />

                      Master/Group/   Device-individual keys           Device and user individual keys<br />
                      Network keys    Symm.   Asymm.                   Symm.   Asymm.<br />
                                              privKey    privKey               privKey    privKey<br />
                      No E2E                  by Server  by Device             by Server  by Device<br />
Software              3Dα             2Bβ     2Bα        2Bα           2Bβ     2Bα        2Bα<br />
Hardware/Hybrid       3Dα             1Aβ     1Aα        1Aα           1Aβ     1Aα        1Aα<br />
Non-repudiation       -               -       -          ok            -       -          ok<br />
Offline key server    -               -       ok         ok            -       ok         ok<br />
(easy to protect)<br />



VII. DISCUSSION<br />

Finally, the strengths and weaknesses of the individual approaches are briefly discussed and the authors' opinion is shared. It is irresponsible to launch a product with network connectivity that is not sufficiently protected against the heterogeneous attack vectors. Unencrypted protocols allow every conceivable attack on the manufacturer's product and infrastructure. If one chooses protective measures based on a pre-shared common secret, the protective measures can be bypassed with reasonable effort by suitable analyses, e.g., reverse engineering a single product. What is crucial for the attacker is how well his attack scales in breadth, for instance to an entire product batch.<br />

Security based on individual certificates is currently state of the art on the Internet. With proper implementation and compliance with organizational measures, this approach can be considered secure. Due to its complexity, however, this approach is rarely implemented consistently and correctly. Security analyses carried out during various engagements of the authors quickly revealed problems in the implementation of PKI systems, both in the technical and the organizational implementation. Even with strong cryptography and a secure implementation, the security concept falls apart if the secret keys on which the PKI is based have been handed over to insecure infrastructures.<br />

As described above, there are also solutions available that shift key management from the production phase to the commissioning phase. This enables dynamic key management at the user level when the devices are put into operation. Such methods can relieve the manufacturer, who no longer has to resort to weak key management in favor of costs. Furthermore, the consumer's digital privacy needs can be served in a trustworthy way.<br />

Both pure software solutions and pure hardware solutions have different advantages and drawbacks. Hybrids that are developed in tight cooperation between hardware and software engineers could be a promising approach for future security architectures.<br />

The opinion that only TPMs can achieve a decent level of security is relatively popular these days. These modules are known to resist a wide range of attacks, as they are tamper-resistant and the key material normally does not leave the module. However, there is a trade-off between costs and the desired level of security. As serious security flaws in TPMs also surface on a regular basis, manufacturers remain skeptical.<br />

Insecure and thus misused IoT products pose a serious threat to the entire Internet through DDoS attacks. Furthermore, proprietary (in)security concepts or costly lock-in represent a huge investment risk, even for the most lucrative digital business models. Therefore, the authors see a lot of catching up to do in the area of easy integration of IoT security. This is based in particular on the fact that the security measures of the Internet are not per se suitable for applications in the Internet of Things.<br />

VIII. REFERENCES<br />

[1] Rob van der Meulen, “Gartner says 8.4 billion connected ‘Things’ will<br />

be in use in 2017, Up 31 Percent From 2016”, Gartner 2017,<br />

http://www.gartner.com/newsroom/id/3598917<br />

[2] Christof Paar, and Jan Pelzl, “Understanding cryptography”, Springer<br />

Monograph Series, 2009<br />

[3] Mathy Vanhoef, and Frank Piessens, "Key reinstallation attacks: forcing nonce reuse in WPA2", Proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, CCS 2017, Dallas, TX, USA, October 30 - November 03, 2017<br />

[4] Mike Ryan, "Bluetooth: with low energy comes low security", 7th<br />

USENIX Workshop on Offensive Technologies, WOOT '13,<br />

Washington, D.C., USA, August 13, 2013<br />

[5] Tobias Zillner, “ZigBee exploited the good, the bad and the ugly”,<br />

Blackhat 2015<br />

[6] Gildas Avoine, and Loic Ferreira, ”Rescuing LoRaWAN 1.0”, eprint,<br />

July 2017<br />

[7] Christian T. Zenger, Jan Zimmer, Mario Pietersz, Jan-Felix Posielek,<br />

and Christof Paar, “Exploiting the physical environment for securing the<br />

Internet of Things”, ACM NSPW 2015<br />

[8] Markus Miettinen, N. Asokan, Thien Duc Nguyen, Ahmad-Reza<br />

Sadeghi, and Majid Sobhani., “Context-based zero-interaction pairing<br />

and key evolution for advanced personal devices”, ACM CCS 2014<br />

[9] Christian T. Zenger, Mario Pietersz, Jan Zimmer, Jan-Felix Posielek,<br />

Thorben Lenze, and Christof Paar, “Authenticated key establishment for<br />

low-resource devices exploiting correlated random channels”, Science<br />

Direct, Computer Networks Journal 2016<br />

[10] Sachin Babar, Antonietta Stango, Neeli Prasad, Jaydip Sen, and Ramjee Prasad, "Proposed embedded security framework for internet of things (IoT)", Wireless Communication, Vehicular Technology, Information Theory and Aerospace & Electronic Systems Technology (Wireless VITAE), 2011 2nd International Conference on, IEEE, 2011, pp. 1-5<br />

[11] Joseph Zambreno, Alok Choudhary, Rahul Simha, Bhagi Narahari, and Nasir Memon, "SAFE-OPS: a compiler/architecture approach to embedded software security", ACM Trans. Embedded Computing 4 (2005), no. 1, pp. 189-210<br />

[12] Shivangi Vashi, Jyotsnamayee Ram, Janit Modi, Saurav Verma, and Chetana Prakash, "Internet of Things (IoT): A vision, architectural elements, and security issues", I-SMAC (IoT in Social, Mobile, Analytics and Cloud) (I-SMAC), 2017 International Conference on, IEEE, 2017, pp. 492–496<br />

[13] Matus Nemec, Marek Sys, Petr Svenda, Dusan Klinec, and Vashek<br />

Matyas, ”The Return of Coppersmith’s Attack: Practical Factorization of<br />

Widely Used RSA Moduli”, Proceedings of the 2017 ACM SIGSAC<br />

Conference on Computer and Communications Security, CCS 2017,<br />

Dallas, TX, USA, October 30 - November 03, 2017<br />



The IoT Requires Upgradable Security<br />

Lars Lydersen<br />

Senior Director of Product Security<br />

Silicon Labs<br />

lars.lydersen@silabs.com<br />

Abstract— Many of the things we use on a daily basis are becoming smart and connected. The Internet of Things, or IoT, will improve our lives by helping us reach our fitness goals, reducing resource consumption, increasing productivity, and tracking and securing our assets. Many embedded developers realize the potential benefits of the IoT and are actively developing various applications, from connected home devices to wearables to home security systems. However, along with these benefits come risks. No one wants to design an application that is prone to hacking or data theft. Undesirable events like high-profile hacks can lead to serious brand damage and loss of customer trust and, in the worst case, slow down or permanently reduce the adoption of the IoT.<br />

Keywords— Internet of Things; Security; Hacking; Software<br />

updates.<br />

I. INTRODUCTION<br />

The Internet of Things (IoT) allows us to optimize and improve most aspects of modern life at an unprecedented scale, as billions of IoT devices unleash billions of dollars in economic value [1].<br />

In the race for time-to-market, proper security is inconvenient because it adds cost: development cost, component cost, and complexity. At the same time, in many industries it is not crucial to have adequate security; rather, not having the worst security is the key to not being hacked. The issue is that bad press and major security and privacy incidents might temporarily or permanently slow down the adoption of the IoT for improving our lives. Many are already skeptical about connecting the simple devices we rely on every day, and security researchers are calling the IoT a catastrophe waiting to happen [2]. In fact, quite recently there have been a number of highly publicized hacks that are gaining wide attention [3, 4], so one could argue that the catastrophe is already on its way.<br />

II. THE HACKING OF QUANTUM CRYPTOGRAPHY<br />

The situation resembles that of Quantum Cryptography. Quantum Cryptography [5] (often referred to as Quantum Key Distribution) is a beautiful technology that, unlike other key distribution schemes, promises unconditional security based on the laws of physics. In comparison, most key distribution schemes rely on assumptions about the computational complexity of factoring large numbers or of the discrete logarithm problem. Although the technique was invented in 1984, it took until around the year 2000 before commercial quantum cryptography systems were launched to market. Because they rely on single photons, quantum cryptography systems are complicated to build, and again time-to-market was of the essence. In 2010, the first security loophole that completely broke the security of these systems was published [6]. Quantum Cryptography is theoretically impossible to break, but in reality there were side-channels that were not considered during the design of the systems. Interestingly, no loopholes were discovered until a dedicated team was established to break into such systems. Up until then, the entire industry had been focused on making the quantum cryptography systems robust and getting them to market.<br />

Several things can be learned from the Quantum Cryptography analogy. Notably, it was widely believed that Quantum Cryptography systems were unconditionally secure, until a novel attack proved otherwise. In other words, the systems were secure only against attackers who were not aware of, or were not going to utilize, the blinding attack. This shows that there are always assumptions about the adversary (who are you secure against?), even in cases where one tries to condense the number of assumptions to a bare minimum.<br />

Another interesting lesson from the hacking of quantum cryptography is the importance of upgradable security. When the blinding attacks were discovered, the manufacturers of the systems were given a grace period to patch the vulnerabilities. It turned out that it was possible to close the vulnerabilities via software updates. This article will not discuss the distribution of those software updates (would this require a quantum-secure bootloader?), but the important point is that the security needed to be upgraded over the lifetime of the system.<br />

III. WHAT ATTACKER ARE YOU PROTECTING AGAINST?<br />

Security is not binary: secure or insecure. The question one should ask is: secure against what? The reality is that there are different levels of security, and a device can only be considered secure in the context of an attacker, namely when the level of security is higher than the capabilities of the attacker.<br />

www.embedded-world.eu<br />



Figure 1: Security upgrades are necessary to evolve with the capability level of the attacker. A high initial level of security and hardware primitives (such as extra memory) maximize the likelihood that security issues can be patched in the future.<br />

Moreover, the capabilities of the attacker are typically non-static, and therefore the security level will change over time. The improved capabilities of the attacker can come about in several different ways, from the discovery and/or publication of issues and vulnerabilities to the broader availability of equipment and tools. We have already discussed how this happened in the example of Quantum Cryptography, but let's also review a few examples of how this has happened in classical security.<br />

In 1977, the Data Encryption Standard (DES) algorithm was established as a standard symmetric cipher. DES uses a 56-bit key size, so through increases in available computational power, the standard became vulnerable to brute-force attacks. In 1998, it was shown that the algorithm could be broken via brute force in 56 hours. With DES clearly broken, triple DES (3DES), which basically runs DES three times with different keys, was established as a standard secure symmetric cipher. Regarding the security level of DES, there has been speculation that governments could already break the cipher in 1977, so DES could never resist nation-state attacks. However, since the early 2000s, DES has not even protected against hobbyists with personal computers, due to the widespread availability of computational power.<br />
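The scale of the key-size gap can be illustrated with a back-of-the-envelope calculation; the search rate below is an illustrative assumption, not a historical benchmark figure:

```python
# Rough brute-force cost estimates for DES vs. AES key sizes.
# The keys-per-second rate is an illustrative assumption only.

def years_to_search(key_bits: int, keys_per_second: float) -> float:
    """Expected time in years to find a key (half the keyspace on average)."""
    keyspace = 2 ** key_bits
    seconds = (keyspace / 2) / keys_per_second
    return seconds / (365.25 * 24 * 3600)

RATE = 1e10  # assumed 10 billion keys/s

# DES: 56-bit key falls in well under a year at this rate.
des_years = years_to_search(56, RATE)

# AES-128: the 72 extra key bits multiply the work by 2**72,
# pushing exhaustive search far beyond any feasible effort.
aes_years = years_to_search(128, RATE)

print(f"DES-56:  {des_years:.3f} years")
print(f"AES-128: {aes_years:.2e} years")
```

Each additional key bit doubles the expected search time, which is why the jump from 56 to 128 bits moves brute force from "a machine room" to "physically impossible".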

In 2001, the Advanced Encryption Standard (AES) replaced DES. But even AES does not guarantee absolute security. Even when the algorithm itself cannot easily be broken, the implementation can be hacked, as was the case with Quantum Cryptography. Differential power analysis (DPA) attacks work by measuring the power consumption or the electromagnetic radiation of the circuit performing the cryptography. The side-channel data is then used to obtain the cryptographic keys. Specifically, DPA involves capturing a large number of power consumption traces, followed by statistical analysis to reveal the key. DPA was introduced in 1998, and since then companies such as Cryptography Research Inc. (now Rambus) have sold tools to perform DPA attacks, although at a price that made the tools inaccessible to hobbyists and most researchers. Today, the hardware tools to perform advanced DPA attacks can be purchased for less than $300, and advanced post-processing algorithms are available online free of charge. Thus, the ability to conduct DPA attacks has migrated from nation-states and wealthy adversaries to nearly any hacker.<br />
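The statistical idea behind DPA can be sketched with a toy model. The Hamming-weight leakage, the noise level, and the trace count below are simplifying assumptions; a real attack targets an intermediate such as an AES S-box output and typically needs far more traces:

```python
import random

def hamming_weight(x: int) -> int:
    return bin(x).count("1")

# Toy leakage model: the device "leaks" the Hamming weight of
# (plaintext_byte XOR key_byte), plus Gaussian measurement noise.
SECRET_KEY = 0x3C
random.seed(42)  # deterministic demo

plaintexts = [random.randrange(256) for _ in range(2000)]
traces = [hamming_weight(p ^ SECRET_KEY) + random.gauss(0, 1.0)
          for p in plaintexts]

def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length lists."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

# For every possible key byte, correlate the predicted leakage with the
# measured traces; the correct guess yields the highest correlation.
best_guess = max(
    range(256),
    key=lambda k: pearson([hamming_weight(p ^ k) for p in plaintexts], traces))
print(f"recovered key byte: {best_guess:#04x}")
```

The attacker never opens the chip: the key is recovered purely from how power consumption co-varies with key-dependent data, which is why limiting an adversary's access to traces matters.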

Now let's discuss these historic lessons in the context of the longevity of an IoT device. The typical lifetime of an IoT device depends on the application, but in industrial applications 20 years is common, and that figure will be used for this discussion. A device that launched in 1998, for example, was once only vulnerable to nation-state attacks; today it must be able to withstand DPA attacks by hobbyists with $300 worth of tools, some spare time and lots of coffee. Predicting the future capabilities of a class of adversaries is very difficult if not impossible, especially over a 20-year timespan. What will the adversary look like in 2040? One might even speculate whether it will be human.<br />

The only reasonable way to counter future attack scenarios is for the security of the device to evolve with the increased capabilities of the adversary, as shown in Figure 1. This requires IoT security that is software upgradable. There is of course functionality that requires hardware primitives, which cannot be retrofitted via software updates. However, it is incredible what can be solved in software when the alternative is a truck roll. And it is clear that it is impossible to predict and account for all future attacks.<br />

IV. CONSEQUENCES FOR IOT PRODUCTS<br />

First, the product needs to be able to receive software updates securely. Let's discuss two aspects of secure software updates: a technical point of view, namely the requirements for the device and software, and a process point of view, specifically the authorization and control of releasing software updates.<br />

From a technical perspective, secure updates involve authenticating, integrity checking and potentially encrypting the software for the device. The software handling such secure updates is the bootloader, typically referred to as a secure bootloader. The secure bootloader itself, along with its corresponding cryptographic keys, constitutes the root of trust in the system and needs to have the highest level of security. This involves placing the bootloader and keys in immutable memory, such as one-time-programmable memory or read-only memory. At that point, any vulnerability in this code is equivalent to an issue in hardware and cannot be fixed in the field.<br />

The authentication and integrity check should be implemented using asymmetric cryptography, with only public keys in the device. This way, it is not necessary to protect the signature-checking key in the devices. Since protecting keys in deployed devices is (or at least should be) harder than protecting keys in the control of the device owner, it is also acceptable to use the same bootloader keys for many devices. Finally, since the device contains and uses only a public key, the signature check is secure against DPA attacks.<br />
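The bootloader-side check can be sketched with textbook RSA. The tiny key values are for illustration only; a real secure boot implementation uses at least 2048-bit RSA or ECDSA with proper signature padding, and the private exponent never leaves the vendor:

```python
import hashlib

# Toy textbook-RSA sketch of secure-boot signature checking: the device
# stores only the PUBLIC key (n, e); the private exponent d stays with
# the vendor's signing infrastructure.

p, q = 61, 53                         # vendor-side secrets (toy primes)
n = p * q                             # public modulus (in the device)
e = 17                                # public exponent (in the device)
d = pow(e, -1, (p - 1) * (q - 1))     # private exponent (vendor only)

def digest(firmware: bytes) -> int:
    # Reduce the hash mod n only because the toy modulus is tiny.
    return int.from_bytes(hashlib.sha256(firmware).digest(), "big") % n

def vendor_sign(firmware: bytes) -> int:
    """Performed once, at the vendor, with the private key."""
    return pow(digest(firmware), d, n)

def device_verify(firmware: bytes, signature: int) -> bool:
    """Performed by the secure bootloader, using only the public key."""
    return pow(signature, e, n) == digest(firmware)

image = b"firmware v1.2.3"
sig = vendor_sign(image)
print("valid image accepted:", device_verify(image, sig))
print("forged signature rejected:", not device_verify(image, (sig + 1) % n))
```

Because verification uses no secret, extracting everything from the device gains the attacker nothing: forging an update still requires the vendor's private key.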

Encrypting the software running on the IoT device has two benefits. First, it protects what vendors consider to be intellectual property (IP) from both competitors and counterfeiters. Second, encryption makes it more difficult for adversaries to analyze the software for vulnerabilities. Encrypting the new software for secure boot does, however, involve secret keys in the device, and protecting secret keys inside a device in the field is becoming increasingly hard. At the same time, newer devices have increased resistance to DPA attacks. Furthermore, a common countermeasure against DPA attacks is limiting the number of cryptographic operations that can take place, making it infeasible for the attacker to gather sufficient data to leak the key. Even though protecting the key is difficult and motivated adversaries will likely extract it, encryption does make the attack harder. Therefore, secure boot should always involve encryption of the software.<br />
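The operation-limiting countermeasure can be sketched as a simple guard around key use. The counter value and class name are illustrative assumptions; real devices enforce the limit in hardware, typically with a monotonic counter in non-volatile memory:

```python
class KeyUseLimiter:
    """Refuses to use a secret key after a fixed number of operations,
    starving a DPA attacker of the traces needed to recover the key."""

    def __init__(self, max_operations: int = 1000):
        # A real device would persist this budget in non-volatile memory
        # so a power cycle cannot reset it.
        self.remaining = max_operations

    def with_key(self, operation):
        """Run one key-using operation, or refuse if the budget is spent."""
        if self.remaining <= 0:
            raise PermissionError("key use budget exhausted")
        self.remaining -= 1
        return operation()

limiter = KeyUseLimiter(max_operations=3)
for _ in range(3):
    limiter.with_key(lambda: "decrypted block")   # allowed
try:
    limiter.with_key(lambda: "decrypted block")   # fourth use refused
except PermissionError as exc:
    print("blocked:", exc)
```

Since a firmware update only needs to be decrypted a handful of times over a device's life, even a small budget leaves a DPA attacker far short of the traces a statistical attack requires.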

Another consequence of secure updates is the likely future need for more memory in the IoT device. This is a complicated trade-off for several reasons. First, software tends to expand to fill the memory available in the device, so a larger memory requires discipline from the software team to leave room for future updates. The other complication is the value of free memory in the future versus the device's initial cost. More memory tends to increase the cost of the device, and this cost must be justified from both the device maker's and the consumer's point of view. Unfortunately, the fierce competition for market share makes many device makers myopic, and they are incentivized to prioritize cost over future security.<br />

Finally, it is important to have a plan for distributing the security updates. For most devices, these updates use the device's existing Internet connection. But in some cases, this requires adding or using physical interfaces such as USB drives (using sneakernet). It is also important to consider that the devices might be behind firewalls or, in some cases, disconnected from the Internet.<br />

Once secure updates are possible from a technical point of view, the question becomes who has the authority to sign and issue software updates. Increasingly commonly for IoT devices, the software is fully owned and managed by the device maker. This means that the device maker should have proven processes in place to internally protect the signing keys and, in particular, to control who can issue updates. This might or might not be combined with authorization from the customer or end user. In fact, given the increase in device maker responsibilities, it might sometimes even be necessary to have a mechanism for forcing updates, leaving the user no ability to opt out. For some devices, the end user must actively download an update and apply it, or at least initiate the update process. In other instances, the update is fully automatic. From a practical point of view, it is important that the scheme is fairly accommodating to different delivery mechanisms, especially if the device maker does not have direct contact with the device but has to rely on 3rd-party gateways or connectivity.<br />

V. SUMMARY<br />

The longevity of deployed IoT devices, combined with the proliferation of adversaries' tools and knowledge, makes it infeasible to create devices that will remain sufficiently secure at any security level for their lifetime. Therefore, for IoT devices to remain secure throughout their practical lifetime, it is necessary to ensure that the security of these devices is upgradable via software updates. But since an update mechanism is also an attack point, it is necessary to deploy secure bootloaders in all programmable devices in the IoT product, and to properly secure the bootloader keys. A secure bootloader is functionality that IoT vendors should expect to get from the IC manufacturers. Furthermore, IoT vendors need to plan up front for the delivery mechanisms and processes to issue updates. Luckily, secure bootloaders are readily available, and the relevant devices are already Internet connected, so enabling secure updates is a minor effort. So there is no excuse not to do it.<br />

REFERENCES<br />

[1] McKinsey & Company, The Internet of Things: Mapping the Value Beyond the Hype, June 2015.<br />
[2] B. Schneier, http://www.wired.com/2014/01/theres-no-good-way-to-patch-the-internet-of-things-and-thats-a-huge-problem/, visited 14 January 2018.<br />
[3] B. Krebs, Hacked Cameras, DVRs Powered Today's Massive Internet Outage, https://krebsonsecurity.com/2016/10/hacked-cameras-dvrs-powered-todays-massive-internet-outage/, October 2016, visited 14 January 2018.<br />
[4] E. Ronen, C. O'Flynn, A. Shamir and A. Weingarten, IoT Goes Nuclear: Creating a ZigBee Chain Reaction, http://iotworm.eyalro.net/iotworm.pdf, visited 14 January 2018.<br />
[5] N. Gisin, G. Ribordy, W. Tittel and H. Zbinden, "Quantum cryptography", Rev. Mod. Phys. 74, 145-195 (2002).<br />
[6] L. Lydersen, C. Wiechers, C. Wittmann, D. Elser, J. Skaar and V. Makarov, "Hacking commercial quantum cryptography systems by tailored bright illumination", Nat. Photonics 4, 686-689 (2010).<br />



Safety and Security from the Inside – a SoC’s<br />

Perspective<br />

Antonio J. Salazar Escobar<br />

Solutions Group, Synopsys<br />

Porto, Portugal<br />

Ralph Grundler<br />

Solutions Group, Synopsys<br />

Mountain View, California, USA<br />

Abstract—We can all agree that today's electronics need to address safety and security considerations; however, these concepts are continuously evolving and are redefined based on perspective. Keeping unwanted eyes away from your children's monitors or protecting your smart home's integrity are real concerns for today's consumers, and with a growing trend toward autonomous technology, from wearables and IoT to the cloud, including artificial intelligence, these are concerns that need to be addressed. That said, what are the issues? What do we need to secure, and how? Although a number of protocols and infrastructures exist to "secure" data communication, what is the role of the endpoint, or better yet, the SoC? From the SoC perspective, this can translate into multiple things, such as encryption, key management, securing data, or even protecting a debug port. Of course, the specifics vary depending on the SoC's purpose. The intellectual property (IP) blocks that comprise today's SoCs are increasingly complex due to an ever-growing number of features and functional expectations; an understanding of the interoperability of the IP is paramount to ascertain where and how to add the necessary elements to keep your SoC secure. This paper discusses the considerations for safety and security from the inside of the SoC, going over the role of the IP, subsystems and overall design.<br />

Keywords—Security IP, Safety, HBM, SoC<br />

I. INTRODUCTION<br />

Today's electronics are following trends where autonomy, near-ubiquity and constant connectivity are common expectations, and with them come expectations of reliability and data guarding. This has motivated the need to understand and build solutions that account for security threats and safety concerns. Nonetheless, security and safety considerations are fluid concepts, continuously evolving to address the emerging pressures, threats and conditions of an ever-changing market and its usage scenarios.<br />
scenarios.<br />

From a process perspective, security/safety considerations<br />

need to be viewed as methodological requirements, ingrained in<br />

all aspects of the product’s lifecycle. The application space and<br />

usage scenarios can have significant impact on the associated<br />

costs (time, resources and procedural), affected by industry<br />

requirements, compliance standards and added market value.<br />

For instance:<br />

• IoT: covers numerous industries, from manufacturing and transportation to healthcare, entertainment, and education. Safety-critical (healthcare) and high-impact (supply chain, facility, fleet management) applications, combined with the connected nature of IoT devices, drive the strengthening of security and reliability requirements as key factors [1]. Security/safety profiles can vary based on the considered usage scenario for the same device. For instance, a remotely accessed camera for monitoring children requires different considerations than one for parking lot surveillance.<br />

• Automotive: although automotive electronic systems such as infotainment and motor control have to meet stringent automotive standards, AEC Q100 and IATF 16949 (formerly ISO/TS 16949), advanced driver assistance systems (ADAS) are accelerating the adoption of the functional safety standard ISO 26262. In addition, efforts such as the E-safety vehicle intrusion protected applications (EVITA) project address security considerations related to remote and physical attacks on the integrity of the system and on inter/intra-vehicular communication, which can cause damage to the system, vehicle and structures, as well as injury to the passengers and/or passers-by.<br />

• Media/Entertainment: content protection has gained significant traction in the past decades, driven by the digital revolution and content sharing. Content protection schemes require a number of collaborative components at all stages of data sharing, advancing methods and frameworks such as digital rights management (DRM), High-bandwidth Digital Content Protection (HDCP), digital transmission content protection (DTCP) and the MovieLabs digital distribution framework (MDDF).<br />

Neither security nor safety is just a feature to be added at a late stage of product development. They need to be an integral part of the design from the start, and at all levels.<br />

II. CONCEPTS OF SAFETY AND SECURITY<br />

The level at which a design needs to consider safety/security requirements is not always evident. For example, researchers were able to demonstrate ways that hackers can put people in harm's way by remotely accessing a Tesla Model S car's control system and gaining access to the car's controller area network (CAN) bus [2]. The vulnerability allowed the researchers to access and control key systems (the braking system, engine and sunroof, among others). Tesla followed up by providing a software update and enforcing a code-signing policy for any new firmware installed on the CAN bus [3]. This approach was able to address the foreseeable issues; however, can a solely software-based strengthening of the attack path resolve the underlying issues? Hardware elements, such as a trusted platform module (TPM) or even a hardware security module (HSM), could be required instead.<br />

Development procedures need to account for all stakeholders, following clear and traceable requirements supported by strong documentation and verification results. Designers need to consider opposing forces during planning. In a "top-down" view, designers must consider both SoC design elements and the safety documentation required by certain specifications/standards. On the other hand, they must take a "bottom-up" approach to visualize how coverage closure can be achieved along with the required documentation, reports, verification specifications, etc. (Fig. 1).<br />

Figure 1: SoC development “top down” and “bottom up”<br />

methodologies for ADAS systems<br />

Addressing safety/security concerns could entail methodological and process-level inclusion that accounts not only for internal processes, but also for 3rd-party methodologies. For instance, when selecting IP or IP subsystems, designers need to ensure that the provider has the necessary processes and infrastructure. While at the design level this might imply monitors, redundancy, or observability, the IP provider must have a "Safety & Quality Culture" to detect, manage, and address possible hazards, associated quality management systems that quantify and qualify errors, and standards of communication that minimize misinterpretations.<br />

Understanding the use case, methodologies and functional dependencies becomes paramount, in particular due to the increasing complexity of module interoperability within a design. In other words, understanding how to construct and use one's defenses to shield against attacks/malfunctions depends on understanding how the different elements that compose your system interact and their associated dependencies. Within an SoC, this translates to understanding the different IP and IP subsystems present in your design, as well as the dedicated and shared resources.<br />

For instance, observability is a functional requirement of error recovery to meet a design's safety goals. By adding functionality such as debug, performance monitors, and watchdog timers in the IP subsystem, the SoC or the system software can decide on the best mediating action when a fault occurs that requires recovery to a safe state. Often this can enhance the possibility of graceful recovery, where the rest of the system is not affected. Other opportunities to leverage observability are measuring or fine-tuning performance to make sure the needs of the complete system are met, either in different systems or in different operating modes. Regardless of the foreseeable benefits, there is inherent risk in any monitor/control path that needs to be weighed.<br />

Foreseeing potential weaknesses can be challenging; however, understanding a design's inter-dependencies can help determine where to focus efforts and when additional logic might be required, either for added robustness (such as built-in self-testing or self-repair components) or for strengthening security (such as gateways or tamper-resistance constructs).<br />

Cost becomes a decision factor for many design choices, related to time, effort, area, and so on. In general, the added value to the end product needs to justify the associated cost; that said, cost can be mitigated from a product life-cycle perspective by introducing good practices. For instance, documentation can prove a valuable resource for cost mitigation. From safety documentation to test plans, checks for linting errors, Clock-Domain Crossing (CDC) and Reset Domain Crossing (RDC), thorough documentation helps ensure that the implementation is clear and more seamless. When running required checks on the final SoC, the SoC engineer has these documents as a reference should any questions arise.<br />

III. SOC ARCHITECTURE CONSIDERATIONS<br />

Today's SoCs are composed of an increasing number of processor, memory, interface, general logic and peripheral blocks, connected through progressively complex interconnect structures. Designing and implementing each individual block and interconnect arrangement has become prohibitively costly, and reliance on pre-designed blocks is commonplace.<br />



additional resources to serve as gateways [6]. Another path is to utilize cJTAG (IEEE 1149.7) in lieu of JTAG, to add a star topology and individual device addressing, which would facilitate overlaying security/safety considerations per block. A simpler strategy might be to use an auxiliary port, such as a UART, to establish a formal authentication procedure that manages access to the JTAG TAP (see Fig. 3-c). In the end, the approach is dependent on the overall needs of the design and even the reusability of the resources for multiple purposes. One could argue that a key consideration is to have flexible and scalable solutions, thus future-proofing for unforeseen scenarios.<br />
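An authentication procedure over an auxiliary port, as discussed above, can be sketched as a shared-secret challenge-response gate. The key handling, message format and class names are illustrative assumptions; a production design would use per-device keys, and often asymmetric challenge-response, enforced in hardware:

```python
import hashlib
import hmac
import os

DEVICE_KEY = b"per-device-debug-secret"  # provisioned at manufacturing

class DebugPortGate:
    """Keeps the JTAG TAP disabled until the debug tool proves knowledge
    of the debug key via an HMAC challenge-response over, e.g., a UART."""

    def __init__(self, key: bytes):
        self._key = key
        self._challenge = b""
        self.tap_enabled = False

    def issue_challenge(self) -> bytes:
        # Fresh random nonce per attempt prevents replaying old responses.
        self._challenge = os.urandom(16)
        return self._challenge

    def submit_response(self, response: bytes) -> bool:
        expected = hmac.new(self._key, self._challenge,
                            hashlib.sha256).digest()
        # Constant-time compare avoids leaking the expected value.
        self.tap_enabled = hmac.compare_digest(expected, response)
        return self.tap_enabled

# Debug tool side (legitimate tool knows the key):
gate = DebugPortGate(DEVICE_KEY)
nonce = gate.issue_challenge()
response = hmac.new(DEVICE_KEY, nonce, hashlib.sha256).digest()
print("TAP enabled:", gate.submit_response(response))
```

The TAP stays in its default-disabled state, so an attacker with physical access to the pins but without the key gains no scan-chain visibility.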

Figure 2: Generic SoC Architecture<br />

Therefore, SoC designers benefit from IP providers that actively collaborate with their customers to provide more than just an IP block. Experts in the IP, with a system-level understanding, can engage with designers at all technical levels to ensure their SoC requirements are met, and help assure that integration, testing, and integrity checks go smoothly all the way to tape-out. By engaging at the subsystem level, all parties can consider clock domain crossing, reset domains, power management, and testability concerns. Moreover, by leveraging IP and interoperability knowledge, a more systemic methodology can be applied.<br />

Of particular concern can be testing and verification considerations, key aspects of assessing requirement traceability and implementation. Embedded instruments and design-for-testability/debug structures have become commonplace within today's SoCs. Hardware/software co-design strategies such as prototyping and emulation can have a significant impact by accelerating bring-up and providing engineers added dedicated time to understand interoperability and IP functionality, as well as to evaluate the impact on the overall design requirements, thus strengthening the security/safety profile.<br />

Validation and debug structures can conflict with security/safety requirements [4] and ultimately affect the SoC architecture. For instance, an IEEE 1149.1 (commonly known as JTAG) port is frequently found in today's systems (see Fig. 3-a). This port has been time-tested and proven to be a useful resource for debugging, programming and data retrieval, among other uses. However, it represents a side-channel risk, providing physical access to system resources.<br />

Implementing an authentication protocol over the JTAG TAP can require the development of SoC-specific communication software and IP wrapper re-design [5], which can prove time- and resource-consuming, and in the end might not properly address the concerns. An alternative could be to structure the boundary scan path to include IEEE 1687 (also referred to as IJTAG) elements (see Fig. 3-b), which can facilitate securing access to specific areas of the SoC by leveraging Segment Insertion Bit (SIB) and Test Data Register (TDR) blocks with<br />
Figure 3: (a) Generic IEEE 1149.1 implementation. (b) IEEE 1687<br />

architecture for embedded instrument connectivity. (c) IEEE 1149.1<br />

security augmented implementation<br />

IV. THE HARDWARE ROOT OF TRUST<br />

Adding ad-hoc security/safety mechanisms to each individual block could prove costly and run against the overall design requirements with regard to performance, area and timing considerations. SoCs can benefit from centralized approaches that capitalize on reusable structures and policies, such as a trusted execution environment (TEE). A properly structured TEE provides a framework in which the system can confidently run its privileged software and functions. Obviously, the architectural approach can have a significant impact on the methodologies to be implemented. For instance, an SoC can be augmented with an external trusted platform module (TPM), while using an integrated hardware secure module (HSM) [7] or a TEE implemented as a security subsystem with its dedicated secure CPU [8] can prove more cost-efficient by saving space on the PCB and avoiding signal-path exposure outside the silicon.<br />

An approach to a centralized, flexible and scalable SoC security framework comes through the implementation of a Hardware Root of Trust, i.e., a hardware-protected TEE and device-unique keys capable of implementing an array of security functions, creating the basis (a root) for trust in the SoC. In general, a Hardware Root of Trust would include a number of modules that support different operations in a runtime-secure, tamper-resistant manner. To better understand the necessary components, it is important to consider the expected functionality and all operation phases. Table I summarizes a number of common security tasks and the associated functions.<br />

Table I. Security goals and associated security functions<br />
Operation Phase | Goal | Security Functions<br />
Power Off | Non-volatile storage protection | encryption/decryption<br />
Power Off | Tamper protection | authentication<br />
Power Off | Device data binding | device-unique key storage<br />
Power Up | Boot image protection | code/data validation, signature check<br />
Power Up | Device identity check | authentication / identification<br />
Runtime | Malicious instruction monitoring | continuous transaction monitoring<br />
Runtime | Access point protection | authentication / permission control<br />
Runtime | Secure communication | integrity and confidentiality<br />

HSMs exist that address the different security functionality<br />

presented in Table I, through the use of a secure CPU and<br />

hardware-based resources within a safe zone or security<br />

perimeter. The secure CPU provides an inherently trusted<br />

location for software components, which support the security<br />

functions, to run. Support blocks would include a secure<br />

memory to serve as a safe space for runtime data, a True Random<br />

Number Generator (TRNG) for producing a high level of<br />

entropy, and even a dedicated clock/counter for reliable time<br />

measurements. Additional blocks, such as hardware<br />
cryptographic accelerators with side-channel attack<br />
countermeasures (error detection, power and timing<br />
randomization, and so on), could further improve performance<br />
(see Fig. 4). The management of<br />

security features is thus facilitated by a programmable security<br />

framework.<br />

Figure 4: Example HSM Architecture from Synopsys<br />
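The defining property of the security perimeter described above can be captured in a short sketch: key material is generated and used inside the module, and only results such as signature tags cross the boundary. The class below is a toy model, not any vendor's API; HMAC-SHA256 stands in for the module's hardware cryptographic accelerators, and all names are invented for illustration.

```python
import hashlib
import hmac
import secrets

class HardwareSecureModule:
    """Toy model of an HSM security perimeter: the device-unique key is
    created inside the module and is used, but never read out."""

    def __init__(self):
        self._device_key = secrets.token_bytes(32)  # stays inside the perimeter

    def sign(self, message: bytes) -> bytes:
        # Key material is used inside the module; only the tag crosses out.
        return hmac.new(self._device_key, message, hashlib.sha256).digest()

    def verify(self, message: bytes, tag: bytes) -> bool:
        return hmac.compare_digest(self.sign(message), tag)

hsm = HardwareSecureModule()
tag = hsm.sign(b"boot image v1")
assert hsm.verify(b"boot image v1", tag)
assert not hsm.verify(b"tampered image", tag)
```

The point of the structure is that callers can obtain and check tags, but no method exposes the key itself, which is what the hardware perimeter enforces in silicon.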

V. SECURITY STANDARDS<br />

HDCP (High-bandwidth Digital Content Protection) is most<br />
commonly used with HDMI (High-Definition Multimedia<br />
Interface) for video content protection, but can also be used for<br />
other key-exchange-protected data encryption, such as audio or<br />
digital files, and appears in DisplayPort and USB applications.<br />
HDCP, like other encryption engines, leverages a TRNG, which<br />
is used not only for key generation for the HDCP cipher but also<br />
to add noise entropy to the system and make power-monitoring<br />
attacks more difficult. Typically, the HDCP block is located close to the<br />

HDMI block but the TRNG is usually located centrally in the<br />

system unless needed as part of a subsystem, as shown in Fig. 5.<br />

Both blocks will require additional memories which are not<br />

shown.<br />

Figure 5: Synopsys HDMI RX Subsystem Showing HDCP and TRNG<br />
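A TRNG in such a system is normally paired with continuous health tests on its raw noise source. As an illustration (not a description of any particular vendor's TRNG), the sketch below implements a repetition count test in the style of NIST SP 800-90B, which flags a source that becomes stuck at one value; the cutoff of 34 is an assumed parameter that in practice depends on the source's assessed entropy.

```python
import secrets

def repetition_count_test(samples, cutoff=34):
    """Flag the raw noise source if any value repeats 'cutoff' times in a
    row; a healthy entropy source should essentially never do this."""
    run_value, run_length = None, 0
    for s in samples:
        if s == run_value:
            run_length += 1
            if run_length >= cutoff:
                return False  # source considered failed
        else:
            run_value, run_length = s, 1
    return True

# A byte stream from the OS CSPRNG should pass easily.
assert repetition_count_test(secrets.token_bytes(4096))
# A stuck-at source fails.
assert not repetition_count_test(b"\x00" * 64)
```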

Communications and networking can be secured at many<br />
levels. MACsec (Media Access Control Security, IEEE<br />

Standard 802.1AE) can be used to secure Ethernet on the link<br />

layer. IPsec (Internet Protocol Security, IETF Standard RFCs)<br />

can be used to secure networks end to end. Both of them can<br />

benefit from a Hardware Root-of-Trust that protects the control<br />

plane software for authentication and key negotiation by<br />

providing secure key generation, and public key cryptography<br />

operations inside the security module. The session keys can also<br />

be injected to the data plane cryptography engines directly from<br />

the Hardware Secure Module, without exposing the keys in the<br />

host system. For higher bandwidth applications, a security HW<br />

accelerator leveraging an encryption hardware engine like AES-<br />

GCM (Advanced Encryption Standard-Galois Counter Mode)<br />

and a TRNG is required. For encrypting data in point-to-point<br />

applications, typically a MACsec hardware accelerator is used<br />

in-line with an Ethernet MAC and the physical layer as shown<br />

in Fig. 6.<br />



Figure 6: MACsec hardware accelerator used in-line with Ethernet<br />

MAC and Physical Layer<br />
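The text notes that bandwidth support can be expanded by speeding up the AES engine clock or adding AES engines in parallel, from 5 Gbps up to terabit rates. That sizing reduces to simple arithmetic; the 16-byte-per-cycle datapath and 800 MHz clock below are invented illustrative numbers, not figures for any real accelerator.

```python
import math

def engines_required(target_gbps, bytes_per_cycle=16, clock_mhz=800):
    """How many parallel AES datapaths are needed for a target line rate.
    Assumes each engine processes one 128-bit block (16 bytes) per cycle;
    both parameters are illustrative assumptions."""
    per_engine_gbps = bytes_per_cycle * 8 * clock_mhz / 1000.0
    return math.ceil(target_gbps / per_engine_gbps), per_engine_gbps

# At these assumed numbers one engine sustains 102.4 Gbps, so a 5 Gbps
# MACsec port needs a single engine, while a 400 Gbps link needs four.
assert engines_required(5)[0] == 1
assert engines_required(400)[0] == 4
```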

In some applications, because of legacy hardware, gate count<br />
for multiple ports, or other system reasons, the MACsec engine is<br />
placed as a look-aside accelerator as shown in Fig. 7.<br />

Figure 7: Ethernet Subsystem showing MACsec engine as look-aside<br />
accelerator<br />

Either implementation is highly flexible, as the supported<br />
bandwidth can be expanded either by speeding up the AES engine<br />
clock or by adding additional AES engines in parallel, supporting<br />
5 Gbps to terabit applications.<br />

IPsec can be implemented as a look-aside accelerator, but<br />
typically it is a hardware accelerator coupled to a processor. IPsec<br />
provides more cipher and security options, such as encrypting just<br />
the data payload for transmission via routers, or encrypting the<br />
entire IP packet and creating a new header. The latter is known as<br />
ESP (Encapsulating Security Payload) tunnel mode and would be<br />
used in VPN (Virtual Private Network) applications.<br />

VI. CONCLUSION<br />

The growing complexity, numbers, and interconnection of<br />
embedded systems require a revision and prioritization of<br />
security and safety considerations, continuously evolving to<br />
meet an ever-growing set of requirements driven by consumer<br />
and market expectations. Not only are connected devices<br />
becoming omnipresent and ever connected, they are becoming<br />
ingrained within everyday life, forcing a reevaluation of the<br />
value associated with protecting personal information and even<br />
safeguarding our physical selves. Addressing the wide and<br />
constantly evolving array of threats requires investment in<br />
understanding these attacks and delivering solutions at all levels<br />
of the design and all stages of the product lifecycle.<br />

Security and safety considerations need to address<br />
foreseeable threats; however, the usage scenarios of the end<br />
products are at times a moving target, requiring some solutions<br />
to be adaptable and scalable so as to be future-proof.<br />
Leveraging third-party IP know-how and subsystem understanding<br />
can save time and lower design risks, while early software<br />
development can strengthen the final results.<br />

REFERENCES<br />

[1] Columbus, L. (2017, December 10). 2017 Roundup of Internet of Things<br />

Forecasts. Retrieved from www.forbes.com:<br />

https://www.forbes.com/sites/louiscolumbus/2017/12/10/2017-roundup-of-internet-of-things-forecasts/#4db5a3411480<br />

[2] Keen Security Lab of Tencent (2017, July 27). New Car Hacking<br />

Research: 2017, Remote Attack Tesla Motors. Retrieved from<br />

keenlab.tencent.com: https://keenlab.tencent.com/en/2017/07/27/New-<br />

Car-Hacking-Research-2017-Remote-Attack-Tesla-Motors-Again/<br />

[3] A. Greenberg, (2016, September 27). Tesla Responds to Chinese Hack with<br />
a Major Security Upgrade. Retrieved from Wired:<br />
https://www.wired.com/2016/09/tesla-responds-chinese-hack-major-security-upgrade/<br />

[4] R. Sandip, J. Yang, A. Basak and S. Bhunia, “Correctness and Security at<br />

Odds. Post-silicon Validation of Modern SoC Design,” DAC 2015, San<br />

Francisco, CA, USA, http://dx.doi.org/10.1145/2744769.2754896<br />

[5] G.M. Chiu and J. Li, “A secure test wrapper design against internal and<br />

boundary scan attacks for embedded cores,” IEEE Trans. on Very Large<br />

Scale Integration Systems, vol. 20, No. 1, pp. 126-134, January 2012.<br />

[6] S. K. K, N. Satheesh, A. Mahapatra, S. Sahoo and K. K. Mahapatra,<br />

"Securing IEEE 1687 Standard On-chip Instrumentation Access Using<br />

PUF," 2016 IEEE Int. Symp. on Nanoelectronic and Information Systems<br />

(iNIS), Gwalior, 2016, pp. 56-61. doi: 10.1109/iNIS.2016.024<br />

[7] A. Elias, (2017, October 30). Understanding Hardware Roots of Trust.<br />

DesignWare Technical Bulletin. Retrieved from www.synopsys.com:<br />

https://www.synopsys.com/designware-ip/technical-bulletin/understanding-hardware-roots-of-trust-2017q4.html<br />

[8] R. Collins, (2017, October 30). Securing High-Value Targets with a<br />

Secure IP Subsystem. DesignWare Technical Bulletin. Retrieved from<br />

www.synopsys.com: https://www.synopsys.com/designware-ip/technical-bulletin/securing-high-value-targets-2017q4.html?elq_mid=9452&elq_cid=32291<br />



Cyber Security for Automobiles<br />

BlackBerry’s 7-Pillar Recommendation<br />

Sandeep Chennakeshu<br />

BlackBerry Technology Solutions<br />

Ottawa, Canada<br />

Abstract—Auto cyber security is on national agendas because<br />

automobiles are increasingly connected to the Internet and other<br />

systems and bad actors can commandeer a vehicle and render it<br />

dangerous, amongst other undesirable outcomes. The problem is<br />

complex, and the point solutions that exist today are fragmented,<br />
leaving a very porous and “hackable” system. BlackBerry<br />

provides a 7-Pillar recommendation to harden automobile<br />

electronics from attack. The solution is intended to make it<br />

significantly harder for an attacker to create mischief. This paper<br />

describes the 7-pillars and how BlackBerry can help.<br />

I. THE PROBLEM<br />

Cyber security for automobiles is on the national agenda of<br />

several countries. Why? There are four industry trends that<br />

make modern cars vulnerable to cyber attacks and potential<br />

failures:<br />

• Automobiles are increasingly accessible by wireless<br />

and physical means to the outside world and bad<br />

actors.<br />

• Software will control all critical driving functions, and<br />
if bad actors can access and modify or corrupt the<br />
software, it can lead to accidents and potential fatalities.<br />
The larger the amount of software in an automobile, the<br />
larger the attack surface.<br />

• Autonomous automobiles will be driverless. By design<br />

these automobiles will talk to each other and<br />

infrastructure by wireless means. This further<br />
exacerbates the vulnerability problem insofar as it increases the<br />
number of access points through which an automobile<br />

may be breached. When this happens, the concomitant<br />

effects could be viral, as one car can infect another and<br />

so on.<br />

• Autonomous automobiles will deploy artificial<br />

intelligence, deep neural networks, and learning<br />

algorithms. These automobiles will learn from context.<br />

This means that software that was installed as being<br />

safety and security certified at production will morph<br />

with time, and there need to be new ways to ensure<br />

that the automobile is still safe and secure over its<br />

lifetime.<br />

This threat is amplified by the following characteristics of<br />

the automobile:<br />

• The electronics in a car (hardware + software) is built<br />

from components supplied by tens of vendors in<br />

multiple tiers who have no common cyber security<br />

standards to adhere to as they build their components.<br />

This makes the supply chain for the car complex and<br />

porous with respect to cyber security. Every vendor and<br />

every component is a point of vulnerability.<br />

• The electronics in a car is a complex network of<br />

distributed computers called electronic control units<br />

(ECUs). An ECU is a piece of hardware and software<br />

that controls an important function in the automobile<br />

such as braking, steering, power train, digital<br />

instrument cluster, infotainment and more banal<br />

functions such as window control and air conditioning.<br />

These ECUs are networked by buses<br />

(physical wires or optical fibre), which carry<br />

messages using some defined protocol. This<br />

interconnected network allows ECUs to talk to each<br />

other. Safety critical and non-safety critical ECUs<br />

interact through this network. Some of these ECUs<br />

can be accessed by wireless means or physical access<br />

(e.g. USB drive). Access means potential infection.<br />

Hence, it is paramount to isolate safety-critical and<br />

non-safety critical ECUs.<br />

• A car lives for 7 to 15 years. Over this period of time<br />

its software must be updated. This time period brings<br />

risk, as hackers become more sophisticated over time<br />

and users of cars may download software that may<br />

contain malware.<br />

Current practices and standards are inadequate. For<br />

example, functional safety standards like ISO 26262 (ASIL-A<br />

to ASIL-D), information sharing like Auto-ISAC, software<br />

coding guidelines like MISRA and the NHTSA 5-Star overall<br />

safety scores (which are more concerned with collision) add value<br />

but do not solve the cyber security and safety problem<br />

described. These are point solutions, not holistic solutions.<br />
There is a need for a much more holistic cyber security solution<br />

for automobiles.<br />

II. EXPERIENCE DRIVES INNOVATION<br />

BlackBerry has a long history of cyber security with deep<br />

involvement in multiple facets of a holistic cyber security<br />

solution. As such BlackBerry understands the issues that need<br />



to be solved and has innovated to solve the same. It is therefore<br />

no surprise that BlackBerry:<br />

• Is regarded as the gold standard in government,<br />

regulated industry and enterprise mobile security.<br />

• Has been a leading supplier of reliable and safe<br />

software to the automobile industry for decades.<br />

• Supplies managed PKI (certificates) services, crypto<br />

tool kits and asset management (key injection) to<br />

major companies.<br />

• Operates a global over the air (OTA) secure software<br />

update service that has updated over 100 million<br />

devices in over 100 countries, with updates every<br />

week for over a decade.<br />

• Has built a safety-aware culture amongst our<br />

automobile software developers through training,<br />

work methods and practices to secure safety<br />

certification and extends this training to its customers.<br />

• Developed and deployed world-class vulnerability<br />

assessment and penetration testing methods and tools.<br />

• Maintains an active and alert security incident<br />

response team that monitors common vulnerabilities<br />

and exposures and reacts to address the same in<br />

products with industry leading response times.<br />

• Has built a FedRAMP certified emergency<br />

notification service that can be used to provide alerts<br />

when issues occur with bulletins on precautions to be<br />

taken by those impacted before a solution is delivered.<br />

• Is building a Rapid Incident Response Network to<br />

share information between enterprises to learn and act<br />

more quickly.<br />

BlackBerry’s experience and the products that it brings to<br />

bear on cyber security are extensive and valuable to the auto<br />

industry. BlackBerry’s DNA is security. We use our deep<br />

experience, vast repertoire of tools, practices and knowledge to<br />

innovate and stay ahead. It is via this accumulated knowledge<br />

and insight that we have developed the 7-pillar<br />

recommendation that is described below.<br />

III. THE 7-PILLAR RECOMMENDATION<br />

Safety and security are inseparable. Our approach to the<br />

problem is to look at the whole system and try to get as close<br />
as possible to a system where there is an absence of unreasonable<br />
risk.<br />

The 7-pillars recommended by BlackBerry are outlined<br />

briefly below. These pillars are described for automobiles but<br />

can be extended to other devices and markets.<br />

A. The 7-Pillar Recommendation<br />

1) Secure the supply chain:<br />

a) Root of trust: Ensure that every chip and electronic<br />

control unit (ECU) in the automobile can be properly<br />

authenticated and is loaded with trusted software,<br />
irrespective of vendor tier or country of manufacture. This<br />

involves injecting every silicon chip with a private key during<br />

its manufacturing stage to serve as the root of trust in<br />

establishing a “chain of trust” method to verify every<br />

subsequent load of software. This mechanism verifies all<br />

software loaded.<br />

b) Code Scanning: Use sophisticated binary static code<br />

scanning tools during software development to provide an<br />

assessment which includes: open source code content, the<br />

exposure of this open source code to common vulnerabilities<br />

and indicators of secure agile software craftsmanship. These<br />

data can be used to improve the software to reduce its security<br />

risk prior to production builds.<br />

c) Approved for Delivery: Ensure that all vendors and<br />

vendor sites are certified via a vulnerability assessment and<br />

are required to maintain a certificate of “approved for<br />

delivery”. This evaluation needs to be performed on a<br />

continuous basis.<br />
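The chain-of-trust boot flow described in 1a), where each stage verifies the next before handing over control, anchored in a value provisioned at manufacture, can be sketched in a few lines. The sketch is illustrative only: it anchors trust in a stored SHA-256 hash, whereas a production root of trust verifies asymmetric signatures derived from the injected key, and all names here are invented.

```python
import hashlib

def digest(image: bytes) -> bytes:
    return hashlib.sha256(image).digest()

def boot_chain(root_of_trust: bytes, stages):
    """stages: list of (image, expected_hash_of_next) tuples.
    Each stage is verified against the expectation carried by the
    previous stage before it is 'executed'."""
    expected = root_of_trust
    executed = []
    for image, next_hash in stages:
        if digest(image) != expected:
            raise RuntimeError("verification failed; halting boot")
        executed.append(image)
        expected = next_hash
    return executed

app = (b"application", None)
loader = (b"bootloader", digest(app[0]))
rot = digest(loader[0])  # anchored in fuses / ROM at manufacture

assert boot_chain(rot, [loader, app]) == [b"bootloader", b"application"]
```

Replacing any stage breaks the chain: a modified bootloader no longer matches the anchored value, and a modified application no longer matches the hash the bootloader carries.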

2) Use Trusted Components:<br />

a) Proven Components with Defense in Depth: Use a<br />

recommended set of components (hardware and software) that<br />

have proper security and safety features and have been<br />

verified to be hardened against security attacks. Create a<br />

security architecture that is layered and deep. For example:<br />

Hardware (System on Chips - SOCs) must be secure in<br />

architecture and have access ports protected (e.g. debug ports,<br />

secure memory etc.). SOCs should store a secret key, as<br />

described above, and act as the root of trust for secure boot<br />

verifying software that is loaded. The operating system must<br />

be safety certified and must have multi-level security features<br />

such as access control policies, encrypted file systems,<br />

rootless execution, path space control, thread level anomaly<br />

detection etc. Applications should also be protected as<br />

described below.<br />

b) Application Management: All applications that are<br />

downloaded should be certified and signed by proper<br />

authorities. A signed manifest file sets permissions for<br />
the resources in the system that the application will and will not<br />
be allowed to access. The applications must always run<br />
in a sandbox and be managed over their lifecycle.<br />
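A minimal model of this manifest-based permission control might look like the following. The HMAC stands in for the proper authority's digital signature, and the manifest fields and key are invented for illustration; the essential behavior is default deny plus rejection of any tampered manifest.

```python
import hashlib
import hmac
import json

SIGNING_KEY = b"authority-key"   # stands in for the signing authority's key

def sign_manifest(manifest: dict) -> bytes:
    blob = json.dumps(manifest, sort_keys=True).encode()
    return hmac.new(SIGNING_KEY, blob, hashlib.sha256).digest()

def request_access(manifest: dict, signature: bytes, resource: str) -> bool:
    """Grant access only if the manifest signature checks out and the
    resource is on the manifest's allow-list (default deny)."""
    if not hmac.compare_digest(sign_manifest(manifest), signature):
        return False   # tampered or unsigned manifest
    return resource in manifest.get("allowed", [])

m = {"app": "nav", "allowed": ["gps", "display"]}
sig = sign_manifest(m)
assert request_access(m, sig, "gps")
assert not request_access(m, sig, "can_bus")   # not on the allow-list
m["allowed"].append("can_bus")                 # tampering breaks the signature
assert not request_access(m, sig, "can_bus")
```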

3) Isolation:<br />

a) ECU isolation: Use an electronic architecture for the<br />

automobile that isolates safety critical and non-safety critical<br />

ECUs and can also “run-safe” when anomalies are detected.<br />

b) Trusted Messaging: Ensure that all communication<br />

between the automobile and the external world and between<br />

modules (ECUs) in the car is authentic and trusted.<br />

4) In Field Health Check:<br />

a) Analytics and Diagnostics: Ensure that all ECU<br />
software has integrated analytics and diagnostics software that<br />

can capture events and logs and report the same to a cloud<br />

based tool for further analysis and preventative actions.<br />

b) Security Posture: Ensure that a defined set of metrics<br />

can be scanned regularly when the vehicle is in the field,<br />

either on an event driven (e.g. when an application is<br />

downloaded) or periodic basis to assess the security posture of<br />



the software and take actions to address issues via over the air<br />

software updates or via vehicle service centers.<br />

5) Rapid Incident Response Network:<br />

a) Crisis Connect Network: Create an enterprise<br />

network to share common vulnerabilities and exposures<br />

(CVE) among subscribing enterprises such that expert teams<br />

can learn from each other and provide bulletins and fixes<br />

against such threats.<br />

b) Early Alerts: Typically, when a CVE is discovered<br />

there is a time lag between discovery of the issue and the fix.<br />

This time lag is a “risk period” and it is necessary to alert<br />

stakeholders on what to do with advisories until a fix can be<br />

deployed.<br />

6) Life Cycle Management System:<br />

When an issue is detected, using Pillar 4, proactively reflash<br />

a vehicle with secure over the air (OTA) software<br />

updates to mitigate the issue.<br />

7) Safety/Security Culture:<br />

Ensure that every organization involved in supplying auto<br />

electronics is trained in safety/security with best practices to<br />

inculcate this culture within the organization. This training<br />

includes a design and development culture as well as IT<br />

system security.<br />

IV. HOW DOES BLACKBERRY ADHERE TO THE 7-PILLAR<br />
RECOMMENDATION?<br />

This section shares what BlackBerry provides by way of<br />

solutions and services to the 7-pillar recommendation.<br />

A. BlackBerry’s Solutions and Services<br />

1) Secure the supply chain:<br />

a) Root of trust: BlackBerry’s Certicom unit provides<br />

Asset Management equipment that can be used to inject keys<br />

into chips at silicon foundries or test houses. This system has<br />

been proven in over 450 million smart phone chips deployed.<br />

Furthermore, BlackBerry Certicom’s managed-PKI service<br />

issues certificates that can be included as part of each ECU<br />

while they are being manufactured. These certificates have<br />

been deployed in over 100 million Zigbee devices and 10<br />

million cars.<br />

b) Code Scanning: BlackBerry is developing a novel<br />

binary code scanning and static analysis tool that can provide<br />

a list of open source software files included in a build, as well<br />

as the files that are impacted by vulnerabilities and can list a<br />

wide variety of metrics/cautions that tell a developer what to<br />

improve to reduce the security debt of the code (secure agile<br />

software craftsmanship). This is a cloud based tool and hence<br />

BlackBerry can continuously upgrade the tool with new<br />

“execution engines” (engines that add new capabilities to do<br />

deeper scans) to enhance its capability and even add custom<br />

features for the auto industry.<br />

c) Approved for Delivery: BlackBerry Cyber Security<br />

services can conduct “bug bashes” and “penetration testing”<br />

on products and IT infrastructure to assess if the enterprise can<br />

be certified as secure and approved for delivery.<br />
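A common way for key-injection equipment to give every chip a unique key, as in 1a) above, is to diversify a master key by device identity. The following sketch shows the general idea with an HMAC-based derivation; it is not a description of Certicom's actual scheme, and the key sizes and label are assumptions.

```python
import hashlib
import hmac

def derive_device_key(master_key: bytes, device_id: bytes) -> bytes:
    """Per-device key diversification: the injection equipment holds the
    master key and derives a unique key per chip from its identity, so
    compromising one device does not expose the rest of the fleet."""
    return hmac.new(master_key, b"device-key|" + device_id,
                    hashlib.sha256).digest()

master = b"\x01" * 32            # held only by the asset-management system
k1 = derive_device_key(master, b"chip-0001")
k2 = derive_device_key(master, b"chip-0002")
assert k1 != k2 and len(k1) == 32   # every chip gets its own 256-bit key
```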

2) Use Trusted Components:<br />

a) Proven Components and Defense in Depth:<br />

BlackBerry QNX runs in 60 million cars and offers safety<br />

certified secure software from an operating system and<br />

hypervisor to a host of platforms and components that are<br />

designed with defense in depth security. Further, BlackBerry<br />

can lend its expertise to hardware providers to assess security<br />

risks with their chip and module designs. BlackBerry<br />

Certicom also offers hardened security crypto toolkits and<br />

means to inject hardware with secret keys.<br />

b) Applications: All applications that are downloaded<br />

should be certified and signed by proper authorities. The<br />

signature of the application and a signed manifest file set<br />
permissions for the resources in the system that the application<br />
will get access to. BlackBerry has fundamental patents in this<br />

area and can ensure that applications are signed properly.<br />

Further, when built on the QNX operating system, applications<br />

will be managed with the right access permissions, path space<br />

restrictions and sandboxing to ensure the system is safer.<br />

3) Isolation:<br />

a) ECU isolation: BlackBerry recommends that all<br />

ECUs that are safety critical be run on a network that is<br />

physically isolated from ECUs that have external physical<br />

access or are not safety critical. Any non-safety-critical ECU's<br />
access to a safety-critical ECU should only be mediated by a<br />
security gateway, which enforces strict policies. This gateway<br />

could have a firewall with a single outbound port, similar to<br />

BlackBerry enterprise servers. All traffic will be authenticated<br />

and encrypted with rolling keys. Domain controllers that<br />

manage multiple virtual functions (e.g. braking, steering,<br />

powertrain) can be isolated by a safety certified hypervisor<br />

such as provided by QNX. Any one system can fail without<br />

“crashing” the other virtual systems or functions. This<br />

hypervisor-based isolation can also be used for safety certified<br />

and non-safety certified functions that share a single domain<br />

controller.<br />

b) Trusted Messaging: Messaging between ECUs and<br />

the outside world needs to be trusted. All external<br />

communication can be managed by the security gateway as<br />

described above for safety and non-safety critical ECUs.<br />

Messaging between ECUs should be authenticated and<br />
encrypted. As described in Pillar 1, each ECU has a unique<br />
private key and birth certificate, which can be authenticated by<br />
the security gateway. The gateway can subsequently issue<br />
keys to the ECU, which can be used to sign the messages it sends<br />
to other ECUs, so that receiving ECUs know each message is<br />
signed and comes from an authentic source. Chips can be<br />
designed to render such protocols very fast. BlackBerry<br />

Certicom has developed such a protocol.<br />
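The gateway flow described above, authenticating an ECU against its provisioned key and then issuing a session key used to sign inter-ECU messages, can be sketched as follows. This is a simplified model, not Certicom's protocol: HMAC over a random challenge stands in for certificate-based authentication, and all names are invented.

```python
import hashlib
import hmac
import secrets

class SecurityGateway:
    """Authenticate an ECU by its provisioned device key, then issue a
    session key used to sign inter-ECU messages."""

    def __init__(self, provisioned):
        self._provisioned = provisioned          # ecu_id -> device key
        self._session_keys = {}

    def authenticate(self, ecu_id, challenge, response):
        key = self._provisioned.get(ecu_id)
        if key is None:
            return None                          # unknown ECU
        expected = hmac.new(key, challenge, hashlib.sha256).digest()
        if not hmac.compare_digest(expected, response):
            return None                          # failed challenge
        sk = secrets.token_bytes(16)
        self._session_keys[ecu_id] = sk
        return sk

def sign_msg(session_key, payload):
    return payload, hmac.new(session_key, payload, hashlib.sha256).digest()

def verify_msg(session_key, payload, tag):
    return hmac.compare_digest(
        hmac.new(session_key, payload, hashlib.sha256).digest(), tag)

dev_key = secrets.token_bytes(16)
gw = SecurityGateway({"brake-ecu": dev_key})
challenge = secrets.token_bytes(16)
resp = hmac.new(dev_key, challenge, hashlib.sha256).digest()
sk = gw.authenticate("brake-ecu", challenge, resp)
payload, tag = sign_msg(sk, b"brake pressure: 42")
assert verify_msg(sk, payload, tag)
assert not verify_msg(sk, b"brake pressure: 99", tag)
assert gw.authenticate("rogue-ecu", challenge, resp) is None
```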

4) In Field Health Checks:<br />

a) Analytics and Diagnostics: BlackBerry is developing<br />

analytics and diagnostic clients that can be embedded in<br />

ECUs, which can monitor events and log crashes and<br />

anomalies. These data are sent to the cloud, where they can be<br />
analyzed for valuable information and acted upon.<br />

b) Security Posture: BlackBerry is developing a cloud-based<br />
tool that can access ECUs in the automobile and scan<br />
key metrics either on a periodic basis or on an event-driven<br />
(e.g. when an application is downloaded) basis. This allows<br />
the automaker to scan the automobile in pseudo real time<br />
and take action when there is a security or safety<br />
risk.<br />
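In outline, such a posture check compares the scanned metrics against a policy and reports violations. The metric names and thresholds below are invented for illustration, not actual BlackBerry metrics.

```python
def posture_scan(metrics, policy):
    """Compare scanned metrics against a policy and return the names of
    the violated checks; an empty list means a healthy posture."""
    findings = []
    for name, (op, threshold) in policy.items():
        value = metrics.get(name)
        ok = (value == threshold) if op == "eq" else (value <= threshold)
        if not ok:
            findings.append(name)
    return findings

policy = {"debug_port_locked": ("eq", True),
          "days_since_update": ("max", 90),
          "unsigned_apps": ("max", 0)}
vehicle = {"debug_port_locked": True,
           "days_since_update": 120,
           "unsigned_apps": 0}
assert posture_scan(vehicle, policy) == ["days_since_update"]
```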

5) Rapid Incident Response Network:<br />

a) Crisis Connect: BlackBerry is creating an enterprise<br />

network to share common vulnerabilities and exposures<br />

(CVE) among subscribing enterprises. This allows a network<br />

of skilled resources to share and act faster than if they were<br />

fragmented.<br />

b) Early Alerts: Typically, when a CVE is discovered<br />

there is a time lag to the fix. This time lag is a “risk period”.<br />

BlackBerry is developing a scheme to use its AtHoc<br />

emergency notification service to alert customers on<br />

precautions that can be taken during a risk period until a fix is<br />

deployed.<br />

6) Life Cycle Management System:<br />

BlackBerry has deployed a global, secure over the air<br />

(OTA) software update service. This service is unique in<br />

regard to its scalability, deployment options and security. The<br />

service was derived from its smartphone software update<br />

service, which served over 100 million devices in over 100<br />

countries with outstanding reliability. This service is now<br />

being deployed for automobiles, with the management console<br />

for administering complex deployments.<br />
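On the vehicle side, the update client's essential job is to accept only packages that verify against the OEM's key and move the version forward, blocking rollback to a vulnerable build. The sketch below illustrates this structure with HMAC standing in for the service's real signature scheme; all names and fields are assumptions.

```python
import hashlib
import hmac

OEM_KEY = b"oem-release-key"   # placeholder; a real service signs asymmetrically

def package_update(image: bytes, version: int) -> dict:
    meta = version.to_bytes(4, "big") + hashlib.sha256(image).digest()
    return {"image": image, "version": version,
            "sig": hmac.new(OEM_KEY, meta, hashlib.sha256).digest()}

def install(update: dict, current_version: int) -> int:
    """Accept the update only if the signature verifies and the version
    moves forward (anti-rollback)."""
    meta = (update["version"].to_bytes(4, "big")
            + hashlib.sha256(update["image"]).digest())
    expected = hmac.new(OEM_KEY, meta, hashlib.sha256).digest()
    if not hmac.compare_digest(expected, update["sig"]):
        raise ValueError("bad signature")
    if update["version"] <= current_version:
        raise ValueError("rollback rejected")
    return update["version"]   # flash and advance the version counter

upd = package_update(b"ecu-fw-2.1", version=21)
assert install(upd, current_version=20) == 21
```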

7) Safety/Security Culture:<br />

BlackBerry has developed training to inculcate a safety<br />

and security awareness culture in its organizations working on<br />

safety and security software. This training includes education,<br />

processes, methods, tools and behaviours that are best<br />

practices and can be shared with a wider audience.<br />

While not every aspect of this 7-Pillar defense is deployed<br />

commercially, the overall framework is sufficient to build a set<br />

of standard requirements and criteria to achieve enhanced<br />

safety and security in automobiles.<br />

V. POLICY AND RECOMMENDATIONS<br />

Policy for automobiles is set by government bodies such as<br />
NHTSA (National Highway Traffic Safety Administration)<br />

and DOT (Department of Transportation). Typically,<br />

automakers do not support a common set of policies and their<br />

argument is that it stifles innovation and can raise costs.<br />

However, there have been some policies that have been<br />
successful: mandating seat belts (passive restraint systems) and<br />
airbags (supplemental restraint systems), and the NHTSA 5-Star<br />
scoring system for cars set in 1998 (mainly for front-impact<br />
collision), which was later extended to the entire car in 2011.<br />

Likewise, we feel that NHTSA and DOT can mandate a<br />

minimum set of requirements such as the 7-pillars with certain<br />

criteria to be met to achieve a certain score. A 5-Star scoring<br />

system can be used to initially educate consumers and later to<br />

make their score a differentiator for their automobiles.<br />

However, implementations should not be mandated. This<br />

should be left to the automakers to differentiate their offerings.<br />

The scoring would be set based on how many of the<br />

recommended requirements are followed and how many<br />

objective criteria are met with tests. These requirements can<br />

also secure involvement with insurance companies to create the<br />

basis for insurance rates.<br />

Another area for policy and standardization is vehicle to<br />

vehicle and vehicle to infrastructure communication,<br />

collectively called V2X. This communication protocol,<br />

frequency bands, message structures, latency, security and<br />

misbehaviour management must be standardized. Here again<br />

we recommend that the standard focus on what is required<br />

rather than implementation, which should be up to the<br />

automakers and their eco system. The perfect example is 3GPP<br />

standards set by ETSI. In fact, they could set the V2X standard.<br />

They understand wireless and interoperability and can hence be<br />

efficient in creating such a standard.<br />

Privacy and security of data is another important topic for<br />

policy makers and regulators. For starters, automakers have<br />

expressed concerns regarding their ability to trust the data from<br />

another automobile (especially from a different automaker) or<br />

from the infrastructure (e.g. traffic light) the automobile is<br />

communicating with. In this regard standardization, as<br />

suggested above, will help. An equally important concern is<br />

how does one protect the rich data that an autonomous car<br />

collects regarding a consumer’s preferences and behaviours<br />

such as drive routes, favourite places to visit, travel times,<br />

applications downloaded and even transactions handled via the<br />

automobile.<br />

Autonomous cars pose several challenges to regulators,<br />

automakers and insurance companies. Regulators need to<br />

ensure that there is a national framework and individual states<br />

do not set up fragmented rules. Will automakers make their<br />

own policies for actions that their driverless cars will take<br />

when confronted with a particular situation where machine<br />

learning and judgement can cause different outcomes for 2<br />

different car models or brands? Will such policies and rules be<br />

regulated?<br />

Insurance companies and underwriters will need to work<br />

with lawmakers and automakers to make the liability borne by<br />

an autonomous car proportional to the revenue of each<br />

component, and hence their contributing vendors, and not let<br />

the purchasing departments of automakers make this decision.<br />

These choices and resulting policy or regulations are unclear at<br />

this time.<br />

Intellectual property presents another challenge. There is a<br />

lot of innovation in autonomous cars. Will innovation ever<br />
come to fruition, or will it be mired in inter partes reviews, as<br />
today, and IP wars? Will the auto industry be like the cellphone<br />

industry? Will regulators set rules on the maximum stack of<br />

royalties that can be charged per car with appropriate<br />

allotments to patent holders, using certain rules, or will it be<br />

market driven?<br />

There are many unknowns. However, we need to make a<br />

start. We recommend beginning by defining the key<br />

requirements and criteria that make the automobile safer and<br />

more secure. Towards this end, we suggest starting with the 7-<br />

Pillars Recommendation by BlackBerry.<br />



ACKNOWLEDGMENT<br />

This white paper contains thoughts and ideas from several<br />

contributors from different parts of BlackBerry. Among those<br />

are Adam Boulton, Chris Hobbs, Chris Travers, Christine<br />

Gadsby, Grant Courville, Jim Alfred, John Wall, Justin Moon,<br />

Scott Linke and members of their teams. As such, this is as<br />

much their contribution as the author's.<br />



Secure Boot Essentials<br />

Prevent Edge node attacks by securing your firmware<br />

Donnie Garcia<br />

NXP Semiconductor, IoT and Security Solutions<br />

Austin, Texas, United States of America<br />

Donnie.Garcia@nxp.com<br />

Abstract— The reality of a world filled with smart and aware<br />

devices is a world of attack possibilities against the<br />

technology our society relies upon. Just consider the scenario<br />

where an IoT edge node is attacked by replacing firmware to<br />

allow access to a trusted network. In today’s Internet of Things<br />

(IoT) world of connected devices, phishing scams perpetrated by<br />

re-purposing edge nodes are a real threat. Therefore, a plan for the<br />

development, manufacturing and deployment of IoT edge node<br />

devices must be made. The complexities of life cycle management<br />

create a demanding environment where the end developers must<br />

make use of a range of hardware security features, software<br />

components and partnerships to achieve their security goals and<br />

prevent malicious firmware from being installed onto IoT edge<br />

node devices.<br />

Essential to sustaining end-to-end security is a secure and<br />

trusted boot, which can be achieved with the right MCU<br />

hardware capabilities and ARM® mbed TLS. This paper will<br />

introduce a life cycle management model and detail the steps for<br />

how to achieve a secure boot with a lightweight implementation<br />

leveraging NXP® ARM Cortex®-M based microcontrollers with<br />

mbed TLS cryptography support.<br />

Keywords—Security, IoT Edge Node, Phishing, Secure Boot,<br />

Cryptography, Lifecycle Management<br />

I. INTRODUCTION<br />

Secure designs begin with a security model consisting of<br />

policies, an understanding of the threat landscape and the<br />

methods used to enforce physical and logical security. To protect<br />

firmware execution within today’s threat landscape, there must<br />

be a policy to only allow execution of authenticated firmware, a<br />

secure boot. The methods used to enforce this policy must rely<br />

on MCU security technology to create a protected boot flow.<br />

The boot firmware can contain public key cryptography to<br />

authenticate application code. In addition to these components<br />

that are integrated in the end device, there are tools and processes<br />

that must be leveraged in the manufacturing environment. These<br />

include using manufacturing hardware for code signing and host<br />

programs for provisioning. This paper will provide an overview<br />

of the essential components of implementing a secure boot from<br />

the concept and planning phases all the way through<br />

deployment. To aid the developer, a real-world implementation<br />

using actual hardware and tools will be explored.<br />

II. SECURE BOOT SYSTEM ARCHITECTURE<br />

A. Components of a secure boot<br />

The design of a secure boot to achieve authentication of<br />

application firmware requires the integration of numerous<br />

components. Fig. 1 represents the system level view of the<br />

components and how they interact with one another.<br />

Fig. 1: Secure boot architecture diagram<br />

At the base of Fig. 1 there is the hardware providing physical<br />

and logical security. This is where microcontroller capabilities<br />

are necessary to protect data, perform cryptography and<br />

monitor access to memories and peripherals. Secondly, sitting<br />

above the hardware must be unchangeable boot code. This<br />

code must always run when the device is powered. This boot<br />

code contains low level drivers to set up relevant security<br />

peripherals, a cryptography stack for performing authentication<br />

and or confidentiality of data and in many cases a way to load<br />

application code (a bootloader).<br />

With the unchangeable boot code present on the hardware,<br />

application code that is present or loaded on the edge device is<br />

authenticated upon every boot. Application code can be<br />

changed but the cryptographic authentication applied to the<br />

code by the boot code ensures that the changes are only and<br />

always provided by a trusted entity. Application code can<br />

make use of all or a portion of the microcontroller resources as<br />

determined by the boot code. This is because upon boot, the<br />

www.embedded-world.eu<br />



boot code is always executed first, ensuring proper memory<br />

resource management and protection.<br />
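As a sketch of this control flow, the boot-time decision might look like the following C fragment. This is a toy illustration only: the image header layout is invented, and the FNV-1a checksum merely stands in for the public-key signature verification described later; nothing here is the paper's actual implementation.<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Illustrative image header the immutable boot code might find at a
 * fixed flash offset; the real layout is product-specific. */
typedef struct {
    uint32_t length;   /* application size in bytes              */
    uint32_t entry;    /* application entry point                */
    uint32_t tag;      /* expected integrity tag (a stand-in for
                          an ECDSA signature over a digest)      */
} image_header_t;

/* Toy FNV-1a checksum standing in for SHA digest + ECDSA verify. */
uint32_t toy_digest(const uint8_t *data, uint32_t len)
{
    uint32_t h = 2166136261u;
    for (uint32_t i = 0; i < len; i++) {
        h = (h ^ data[i]) * 16777619u;
    }
    return h;
}

/* Runs on every reset, before any application code: the application
 * is allowed to execute only if its tag checks out. */
int boot_authenticate(const image_header_t *hdr, const uint8_t *app)
{
    return toy_digest(app, hdr->length) == hdr->tag;
}
```

On real silicon, a successful check would be followed by a jump to the application entry point, and a failed check by a recovery or halt path.<br />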

Represented on the left of Fig. 1 are tools used in the<br />

manufacturing and deployment of the device. The<br />

microcontroller must be programmed, so tools for key<br />

management, creating firmware files and connecting and<br />

downloading firmware into the device are needed to implement<br />

the secure boot design. With these components considered, the<br />

goal of authenticating application firmware upon every boot is<br />

achievable.<br />

III. SECURE BOOT ESSENTIALS<br />

A. Essential pre-design: Security Model<br />

When designing a secure system, it is important to apply a<br />

security model. A security model is built from policies, the<br />

threat landscape and methods as shown in Fig. 2. This model<br />

provides a framework for understanding and designing to the<br />

security goals of the device. The methods, or how the security<br />

policies are enforced to achieve product goals, are made<br />

possible by the security technology that is integrated into the<br />

embedded controllers.<br />

Fig. 2: Security Model<br />

As an example, for the case of protecting firmware with a<br />

secure boot, a security model would be represented by what is<br />

shown in Fig. 3. As shown in the figure, there is a policy that<br />

only authenticated firmware should ever be allowed to be<br />

executed. The threat landscape typical for an IoT edge node is<br />

attackers will have physical access to the device and so its<br />

communication and debug ports could be exploited. Lastly, the<br />

methods that make use of microcontroller security technology<br />

supporting trust, cryptography and anti-tamper will be<br />

employed to enforce the security policy to the levels demanded<br />

by the threat landscape.<br />

Fig. 3: Example Security Model for Secure Boot<br />

With a security model in place, tradeoffs on the level of security<br />

versus cost and performance can be made during development.<br />

B. Essential hardware features<br />

At the hardware level, there are several functions the<br />

microcontroller must support. These are controlling the boot<br />

flow of the device, protecting memory resources and making<br />

firmware immutable. The following sections will detail how<br />

this is achieved for a specific MCU device, the NXP Kinetis<br />

K28 150MHz device.<br />

1) Control of boot flow<br />

Kinetis MCUs are architected to boot up from internal<br />

memory. This protects against the threat of hijacking an<br />

embedded application by changing an external memory<br />

device. Some Kinetis devices such as the K28 150MHz MCU<br />

have an internal ROM. For this secure boot implementation,<br />

the internal ROM is bypassed so that the trusted secure boot<br />

code can be customized using internal flash. This is done by<br />

setting non-volatile control register bits [BOOTSRC_SEL] as<br />

highlighted in Fig. 4 from reference manual section 7.3.4, Boot<br />

Sequence. Once configured this way, the RESET module state<br />

machine of the K28_150MHz device will ensure that internal<br />

flash will be fetched and the secure boot code will always run.<br />

Fig. 4: Boot Source Select bit<br />



2) NVM Protection<br />

As detailed in section 33.3.3.6 of the K28_150MHz reference<br />

manual, “The FPROT registers define which program flash<br />

regions are protected from program and erase operations.<br />

Protected flash regions cannot have their content changed;<br />

that is, these regions cannot be programmed and cannot be<br />

erased…”<br />

The protected region size is chip specific: regions are<br />

defined as the program flash size divided by 32. In the case of a<br />

2MB flash like the K28_150 device, these are 64KB blocks.<br />

This is substantial space for this secure boot implementation,<br />

but for smaller flash size devices, multiple blocks could be<br />

configured. As shown in Fig. 5, the FPROT3[PROT0] control<br />

bit must be set and the unchangeable boot code placed at<br />

memory map location 0x0000_0000 to protect the secure boot<br />

code.<br />

Fig. 5: Using Flash Block Protection<br />
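Concretely, the pre-configured values could be emitted from the boot project as a 16-byte array that the linker pins to address 0x400. The sketch below is illustrative, not the paper's code: byte offsets follow the reference manual's flash configuration field layout, and the FSEC/FOPT values shown are typical development (unsecured) settings. Note that the FPROT bits are active-low, so a 0 bit marks a region as protected.<br />

```c
#include <stdint.h>

/* Sketch of the 16-byte flash configuration field (0x400-0x40F).
 * Offsets per the reference manual:
 *   0x400-0x407  backdoor comparison key
 *   0x408  FPROT3   0x409  FPROT2   0x40A  FPROT1   0x40B  FPROT0
 *   0x40C  FSEC     0x40D  FOPT     0x40E  FEPROT   0x40F  FDPROT
 * FPROT bits are active-low: clearing FPROT3[PROT0] protects the
 * lowest program flash region (64KB on a 2MB part), which holds the
 * unchangeable boot code placed at 0x0000_0000. */
const uint8_t flash_config[16] = {
    0xFF, 0xFF, 0xFF, 0xFF,   /* backdoor key (unused here)         */
    0xFF, 0xFF, 0xFF, 0xFF,
    0xFE,                     /* FPROT3: PROT0=0 -> region 0 locked */
    0xFF, 0xFF, 0xFF,         /* FPROT2..FPROT0: unprotected        */
    0xFE,                     /* FSEC: unsecured, for development   */
    0xFF,                     /* FOPT                               */
    0xFF, 0xFF                /* FEPROT, FDPROT                     */
};
```

In a production build, FSEC would instead carry the secured value chosen per the chip security settings below, and a real project would place this array with a linker section attribute rather than rely on default placement.<br />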

3) Chip security settings<br />

Once development of the secure boot code is completed,<br />

the chip security setting can be set to disable access from<br />

JTAG/SWD port and restrict data accesses to internal<br />

memory. See reference manual section 9.2 Flash Security. The<br />

only allowable flash command once the security is enabled is<br />

the mass erase operation. This ensures that the data residing<br />

inside the chip cannot be read, only destroyed. Furthermore,<br />

the mass erase operation can also be disabled if the MEEN bit<br />

in the FSEC register is set to %01. See reference manual<br />

section 33.3.3.3 Flash Security Register (FTFE_FSEC).<br />

a) Configuration fields<br />

The control registers for controlling boot flow, setting flash<br />

block protect and chip security settings are all part of a block<br />

of non-volatile registers as detailed in section 33.3.1 Flash<br />

configuration field description. As detailed in Fig. 6, these<br />

registers are physically located in the memory map starting at<br />

address 0x0_400. These registers are also mirrored into<br />

peripheral registers to represent the settings that have been<br />

pre-configured. For the case of flash block protection<br />

(FPROT), the settings can be changed during run time to<br />

increase areas of protection, but never decrease protection.<br />

This allows the secure boot code to dynamically protect<br />

regions of flash by increasing areas of protection if desired.<br />

Fig. 6: Flash Configuration Field<br />
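The increase-only rule for FPROT can be modelled in a few lines. The function below is a behavioural sketch, not device code: since a 0 bit means protected, an update may clear bits (add protection) but is refused if it would set a cleared bit back to 1 (remove protection).<br />

```c
#include <stdint.h>

/* Behavioural model of run-time FPROT updates: protection may only
 * grow. A 0 bit = protected region; a write may clear bits, but any
 * write that would re-set a cleared bit (unprotect) is rejected. */
int fprot_update(uint8_t *fprot, uint8_t requested)
{
    if (requested & ~*fprot) {  /* tries to turn some 0 bit into 1 */
        return -1;              /* refused: would reduce protection */
    }
    *fprot = requested;         /* only clears bits: protection grows */
    return 0;
}
```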

C. Essential Software and Tools<br />

1) ARM mbed TLS<br />

To satisfy the cryptography needed for the secure boot<br />

implementation, the solution uses the MCUXpresso Software<br />

Development Kit (SDK) configured with ARM mbed TLS<br />

support. The MCUXpresso SDK software abstracts the<br />

interface to the available hardware peripherals with a package<br />

consisting of peripheral drivers, middleware, board specific<br />

configurations and application code. Within the package there<br />

are many demo applications. For ARM mbed TLS, there are<br />

two demo applications that can be leveraged to gain a working<br />

knowledge of the software library. These are the test and<br />

benchmark applications.<br />

When ARM mbed TLS support is ported onto Kinetis<br />

devices, the software is configured to make use of available<br />

microcontroller hardware resources. In the case of the<br />

K28_150MHz MCU, this is using the MMCAU cryptographic<br />

accelerator block that assists with AES, DES and hash<br />

operations.<br />

Formerly PolarSSL, the ARM mbed TLS library is<br />

perfectly aligned to the needs of the secure boot development.<br />

The library is well documented and supported with numerous<br />

discussion forum posts and code examples. The library is<br />

available as open source under the Apache 2.0 license, which<br />

allows the code to be used in closed source projects. In<br />

addition, the library was created to be modular, with the<br />

constraints of embedded systems in mind, allowing<br />

developers to fine-tune their use of the library for the needs of<br />

specific applications.<br />

As a representation of the alignment to our needs for secure<br />

boot, Fig. 7 details the main use cases for the library. As<br />

shown on the left, the library has modules related to key<br />

exchange. The specific capabilities provided by the public key<br />

module are represented on the right. Here we see the functions<br />

which we have introduced in the system architecture diagram<br />

(Fig. 1) for generating a public key pair, signing a message,<br />

and verifying signatures. The hardware abstraction provided<br />

by these functions greatly eases the burden on the end<br />

developer for completing the necessary cryptographic<br />

operations.<br />



Fig. 7: ARM mbed TLS Design<br />

The ARM mbed TLS source files which are critical for an<br />

ECDSA implementation of the secure boot are: ec_curve.h,<br />

eccurve_config.h, ecdsa.h, ecdsa.c and ec_curve.c. Importing<br />

these files allows the end developer to make use of the ecdsa<br />

context structure defining the key information and the<br />

supporting APIs related to the ecdsa operations. Specifically,<br />

these APIs include ecdsa_genkey for public key generation. In<br />

addition, for transferring curve information<br />

ec_use_known_curve_param API is used. Depending on the<br />

lifecycle stage of the device, the ecdsa_sign and ecdsa_verify<br />

APIs are used. The curve selection is made in the<br />

eccurve_config.h file. Here you can see the options for a<br />

scalable security level based on the curves supported by mbed<br />

TLS. There is support for ECDSA curves ranging from<br />

SECP192 to SECP521.<br />

2) Bootloader and Provisioning tools<br />

a) Bootloader<br />

Providing the bootloader functions is the NXP Kinetis<br />

Bootloader product known as KBOOT. As shown in Fig. 8,<br />

KBOOT embedded software consists of peripheral interfaces, a<br />

command and data processor and memory interfaces. KBOOT<br />

is provided as full source code and can be modified for end<br />

use. The KBOOT reference manual<br />

details the command API that is supported by the command<br />

and data processor block. In addition to the base commands<br />

for downloading firmware, the command API includes the<br />

ability to direct the device to execute firmware. This<br />

functionality is used in the factory setting to execute specific<br />

functions and extract signature and key data.<br />

Depending on the end device, KBOOT supports<br />

provisioning for all available memory interfaces. For example,<br />

on the K28_150MHz MCU, in addition to RAM and Flash,<br />

KBOOT can manage the placement of data into external serial<br />

NOR flash via the QuadSPI interface.<br />

b) Provisioning Tools<br />

In addition to the KBOOT software which runs on the device,<br />

KBOOT also includes other tool packages that run on<br />

Linux®, Mac® or Windows® host machines. These are<br />

shown below in Fig. 9.<br />

elftosb: processes binaries, elf and SREC files into secure binaries (special formats that work with KBOOT); capable of encrypting files and generating keys.<br />

blhost: command line program that interfaces to a Kinetis MCU running KBOOT; supports every KBOOT command.<br />

Kinetis Flash Tool: graphical user interface to a Kinetis MCU running KBOOT; easier to use than blhost, but not as powerful.<br />

Figure 9: KBOOT Tools<br />

For the processing of binaries, elf files and srecords there<br />

is a tool named elftosb. The elftosb tool takes commands from<br />

BD files. BD, short for boot descriptor file, is an input<br />

command file used by elftosb to create secure binary files (sb<br />

file). The sb file contains commands and firmware data that is<br />

sent to the device that is running the KBOOT bootloader. The<br />

blhost tool is used to process the sb files and interface<br />

to the devices running KBOOT. Also worth mentioning are<br />

the Kinetis Flash Tool and the Kinetis MCU host application,<br />

but these are not used in this implementation.<br />

Both elftosb and blhost are provided as source code and<br />

can be built for different operating systems. Fig. 10 shows a<br />

typical workflow for using the KBOOT tools.<br />

Kinetis MCU Host: a Kinetis K66 application that performs host functionality to a Kinetis MCU running KBOOT.<br />

Fig. 8: KBOOT Block Diagram<br />

There are processor defines for configuring which<br />

peripheral interfaces should be enabled. This serves a dual<br />

purpose as it allows for a way to optimize for code size and<br />

addresses security because it disables interfaces to the<br />

bootloader functions from unsupported peripheral interfaces.<br />

An example of how to use these defines is shown in the<br />

KBOOT reference manual section 11.6 Modifying a<br />

Peripheral Configuration Macro.<br />

Fig. 10: Typical KBOOT Tools Workflow<br />



Moving from left to right, first the elftosb tool is used. Based<br />

on commands passed by a BD file, the elftosb tool takes input<br />

firmware files and creates the secure binary. With a secure<br />

binary, at a different time and place, a host PC running blhost<br />

tool can be used to provision a Kinetis microcontroller like the<br />

K28_150MHz device that is running KBOOT.<br />

IV. LIFECYCLE VIEW<br />

The secure boot design which was detailed in the previous<br />

sections is a critical component to maintaining the lifecycle<br />


In the development phase, the product owner develops a<br />

factory security tool and security tool firmware. This tool is<br />

used to generate public key/private key pairs, sign application<br />

firmware and interface securely to a cloud service provider.<br />

The product owner also develops the root of trust firmware<br />

such as the secure bootloader. This firmware performs secure<br />

boot and secure boot loading. This stage is where sensitive<br />

data such as product IDs and service IDs are generated. These<br />

secrets can be passed to the cloud service provider in the<br />

development phase.<br />

For the case of a controlled manufacturing site that is in a<br />


Fig. 11: Lifecycle view for a secure IoT edge node<br />

of the device. As shown in Fig. 11, the IoT edge node device<br />

flows through several phases. These are shown on the left of<br />

the diagram as Development, Manufacturing and Deployment.<br />

Within these stages of the lifecycle the product could be in<br />

Secure Environments or Less-Trust Environments as shown at<br />

the top of the diagram. For example, in the development stage,<br />

application code could be developed by external developers<br />

which would be in a Less-Trust Environment. Alternatively, if<br />

the firmware development is handled by trusted internal<br />

developers then this would be in the more Secure<br />

Environment.<br />

secure environment, the factory security tool is used only to<br />

sign application firmware. Then standard tools can be used to<br />

place the root of trust firmware and signed application<br />

firmware. Microcontroller security mechanisms are used to<br />

protect the root of trust firmware. For the scenario where a<br />

less-trusted manufacturing site is used, then the factory<br />

security tool could be deployed there. The factory security tool<br />

can interface to the cloud service provider securely to get the<br />

root of trust firmware. The root of trust firmware must be<br />

securely placed on to the end device. Once the secure<br />

bootloader is on the end device, then the device will only<br />

accept and execute signed application code.<br />



To implement such a lifecycle requires preset agreements<br />

with multiple parties, such as application code developers,<br />

external manufacturing sites, cloud service providers and<br />

component manufacturers. There are policies and audits which<br />

need to be in place. The complexities of lifecycle management<br />

create a demanding environment where the end developer<br />

must make use of all available hardware, software and<br />

partnerships to achieve their security goals and prevent<br />

malicious firmware from being installed onto IoT edge node<br />

devices.<br />

Throughout the lifecycle, there are important policies that<br />

govern how the device should be handled. These are detailed<br />

below as the Security Policies, Firmware Loading Policies,<br />

Assembly Policies and User Policies. Some examples are<br />

shown in the following Table 1.<br />

addition to bootloader functions are to generate a PUB/PRIV<br />

key pair and to generate the signature for application code<br />

using the private key.<br />

2) Secure Boot Firmware<br />

This bootloader application is for use in a deployed device.<br />

The main security functions in addition to bootloader functions<br />

are to check the signature of application code using the public<br />

key, and only allow execution of the application code if the<br />

signature is authentic.<br />

The firmware for the Factory SecTool and Secure Boot is<br />

completely independent of application code development.<br />

Application code development can occur on a different target<br />

device, by different developers. As shown in Fig. 12 below,<br />

TABLE I. POLICIES FOR LIFECYCLE MANAGEMENT<br />

Software security policies: ensure that the application code maintains the security of the end device. Examples: no prompts for sensitive data such as Enter PIN or password; a list of words that the end device should not say.<br />

Firmware loading policies: ensure that the proper steps are taken and controls are in place to protect the programming of the end device. Examples: password control for firmware source binaries; upon receiving the microcontroller, the device should be completely erased to ensure that it is in a known state (no unwanted firmware).<br />

Assembly policies: ensure that only approved components are used. Example: all components should be inspected for proper markings during assembly.<br />

User policies: provide guidelines for the end user to maintain the security of the device. Examples: visual inspection of the device for tampering; the device should be physically protected behind locked doors.<br />

A. Lifecycle with target SoC and tools<br />

The following section relates the NXP Kinetis K28<br />

150MHz device secure boot implementation and KBOOT<br />

tools to the lifecycle view introduced in Fig. 11.<br />

B. Development Stage<br />

During the product development stage, there are two<br />

separate firmware developments which are done in the secure<br />

environment (please refer to Fig. 11). Both developments are<br />

based on the software described in the previous sections,<br />

KBOOT and ARM mbed TLS.<br />

The two developments are:<br />

1) Factory Security Tool Firmware<br />

This bootloader application is for use in a secure<br />

manufacturing environment. The main security functions in<br />

memory mapping on the left, this development can follow a<br />

traditional development flow for microcontrollers with<br />

firmware located at the start of the NVM space. During the<br />

manufacturing stage, the resulting firmware files can be<br />

relocated as shown on the memory mapping on the right to<br />

work with the secure boot firmware, which includes KBOOT<br />

and mbed TLS cryptography.<br />



Fig. 12: Memory Map for App. Development<br />

C. Manufacturing Stage<br />

After the application code has been audited against security<br />

policy guidelines as shown in Fig. 11, the following steps can<br />

be taken to complete the manufacturing of end devices that<br />

use a secure boot. Steps 1 and 2 are represented at the top of<br />

Fig. 13, and you’ll find steps 3 and 4 at the bottom.<br />

1) Application SREC is combined with Factory BD file to<br />

create the Factory Secure Binary (Factory.SB)<br />

2) Using HW with the Factory Security Tool firmware, the<br />

Factory.sb is downloaded and blhost commands are used to<br />

extract binaries for signature and public keys.<br />

3) Application SREC is combined with signature binary to<br />

make the Production secure binary (Production.sb)<br />

4) Production secure binary is used to program final hardware<br />
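As an illustrative sketch of these steps, the tool invocations might look as follows. The file names, serial port and BD-file grammar here are assumptions based on the KBOOT tool documentation, not taken from the paper:<br />

```
# Steps 1/3: wrap the application SREC (plus, for production, the
# signature binary) into a secure binary; a BD file drives elftosb:
#
#   sources { appImage = "application.srec"; }
#   section (0) { erase all; load appImage; reset; }
#
elftosb -V -c production.bd -o production.sb

# Steps 2/4: send the secure binary to a device running KBOOT:
blhost -p COM3 -- receive-sb-file production.sb
```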

Once a public key/private key pair is generated in steps 1 and<br />

2, the programming of the production image can occur on all<br />

devices that will be protected by the same private key.<br />

Variations of this implementation can be made to address<br />

multiple key pairs and roll back protections. For example,<br />

multiple public key/private key pairs can be generated and<br />

stored onto the device during the manufacturing stage and then<br />

selected based on version settings.<br />

V. CONCLUSION<br />

In today’s connected world, the protection of firmware is<br />

an essential component to delivering solutions that safeguard<br />

device manufacturers and their customers. Essential to<br />

sustaining end-to-end security is a secure and trusted boot,<br />

which can be achieved with the right MCU hardware<br />

capabilities and ARM mbed TLS. Though a secure boot is<br />

achievable, as demonstrated in the previous sections, the end<br />

design is closely linked to the target platform. The developer<br />

must have detailed knowledge about the hardware and tools.<br />

As the drive towards lower power and higher performance<br />

efficiency for IoT edge nodes continues, there exists an<br />

opportunity for standardization and abstraction to ensure<br />

adoption of secure boot for more end designs.<br />

REFERENCES<br />

[1] http://www.nxp.com/docs/en/reference-manual/KBTLDR200RM.pdf<br />

[2] http://www.nxp.com/docs/en/reference-manual/K28P210M150SF5RM.pdf<br />

[3] https://tls.mbed.org/high-level-design<br />

[4] https://tls.mbed.org/module-level-design-public-key<br />

Fig. 13: Manufacturing with KBOOT Tools<br />



Security Filters for IoT Domain Isolation<br />

Dr. Dominique Bolignano<br />

Prove & Run<br />

Paris, France<br />

dominique.bolignano@provenrun.com<br />

Abstract — Network segregation is key to the security of the<br />

Internet of Things but also to the security of more traditional<br />

critical infrastructures or SCADA systems that need to be more<br />

and more connected and allow for remote operations. We believe<br />

traditional firewalls or data diodes are not sufficient considering<br />

the new issues at stake and that a new generation of filters is<br />

needed to replace or complement existing protections in these<br />

fields.<br />

Keywords— Internet of Things; firewalls; filters; data diodes;<br />

security; formal methods; embedded devices; connected car.<br />

1 INTRODUCTION<br />

Modern IoT (i.e. Internet of Things) security architectures<br />

generally make use of partitions to define security domains and<br />

try to impose strict information-flow policies on the messages<br />

that transit from one domain to another. Typically, this is<br />

achieved by forcing all messages to transit through dedicated<br />

filters. The correct implementation of such filters is essential<br />

for the whole security of the system as the only path available<br />

to hackers to perform remote attacks, when the architecture is<br />

well designed, is to send triggering messages through these<br />

filters. Gateways in new automotive architectures are<br />

a representative example of devices that implement filters. They<br />

are typically used to control the information flows between<br />

various security domains, such as the powertrain domain, the<br />

infotainment domain, the comfort domain, etc.<br />

The proposed approach is meant to be applied to filters but<br />

only in situations where it is possible to explicitly identify and<br />

characterize commands and responses that are allowed to go<br />

through a given filter. As we will see this is always the case (or<br />

should always be the case) to meet the new security<br />

requirements arising when connecting critical systems (e.g.<br />

Cyber Physical Systems), or connecting SCADA systems (e.g.<br />

Operational Technology Systems connected to the IT<br />

infrastructure), in embedded automotive, aeronautic, or railway<br />

equipment, and more generally the IoT. For the IoT, this is<br />

mainly due to the fact that the large volume of connected<br />

devices creates huge opportunities and extremely good<br />

business models for hackers.<br />
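In practice, such a filter reduces to an explicit whitelist: every command that may cross the domain boundary is enumerated and characterized, and anything not characterized in advance is dropped. The fragment below is a purely illustrative sketch; the message identifiers, length bounds and rule table are invented for the example.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* A message is forwarded only if its command ID is explicitly
 * whitelisted AND its payload length is within the range that
 * command allows -- unknown commands never pass (default deny). */
typedef struct {
    uint16_t id;       /* command identifier            */
    uint8_t  min_len;  /* allowed payload length range  */
    uint8_t  max_len;
} rule_t;

const rule_t rules[] = {
    { 0x0101, 1, 2 },  /* e.g. set cabin temperature    */
    { 0x0204, 4, 4 },  /* e.g. read odometer response   */
};

int filter_allows(uint16_t id, size_t len)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (rules[i].id == id)
            return len >= rules[i].min_len && len <= rules[i].max_len;
    }
    return 0;          /* default deny */
}
```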

In this paper we will first explain why there is a new<br />

challenge. We will then explain how this new challenge can be<br />

addressed in general, and then show how the security of the<br />

more demanding filters can be achieved.<br />

2 THE NEW CHALLENGE WITH REMOTE ATTACKS<br />

In this section we will show that the new challenge is<br />

mainly due to the existence of new business models for<br />

hackers. In the past, reaching an acceptable level of security<br />

mainly boiled down to implementing a few basic ingredients:<br />

cryptographic algorithms and protocols (such as digital<br />

signatures and encrypted communications), secure elements,<br />

etc. However the advent of the IoT and the need to connect<br />

remotely to SCADA and critical systems are changing the<br />

security paradigm. There is now a real business model for<br />

hackers and organized crime syndicates in performing remote<br />

attacks. By investing a few million euros, they are now almost certain to identify potential large-scale remote attacks on current connected architectures, with a very high return on investment. In the IoT<br />

industry hackers can for example send a few devices to<br />

"reverse-engineering consultants" located in countries where<br />

this can be done legally or without too much risk. With the<br />

proper reconstructed documentation, they can then ask<br />

"creative" hacking consultants to prepare an attack. With such<br />

a budget at hand it is almost always possible to identify<br />

dramatic large-scale attacks, at least by exploiting bugs and<br />

errors that always exist in the OS and protocol stacks that are<br />

included in the Trusted Computing Base (TCB) of a device.<br />

Such errors can usually be found in the software architecture,<br />

or in the design, implementation or configuration of a device.<br />

www.embedded-world.eu<br />



The business model is usually obvious: such attacks typically make it possible at least to block the normal operation of the targeted infrastructure, causing damage far beyond the investment. In many cases such attacks<br />

could even create more dramatic situations that might lead to<br />

loss of life. An attack similar to the well-publicized Jeep attack<br />

would correspond roughly to an investment of less than half a million dollars (an estimate based on the authors' detailed description of the identification phase of the attack) and, if performed on a massive scale by criminal organizations, could have led to the death of a very large number of people. These new business models (which, in the case of the IoT, exploit the combination of high volume and potential physical impact) bring unprecedented requirements on resistance to logical attacks; this is clearly a disruption in security needs.<br />

Security risks for high-volume transactions (such as payment systems) were (and are) mitigated by proper risk management. Such risk management techniques are far less effective (and in some cases not applicable) when it comes to<br />

IoT systems, as actions cannot be delayed or canceled as<br />

financial transactions can be. It is for example not practically<br />

possible to detect and block in real time an attack that would<br />

make all cars of a certain model turn right at a given time.<br />

In the next section, we elaborate on the fact that it is always possible to exploit the weaknesses of the OSs or protocol stacks that are part of the TCB.<br />

2.1 The Challenge of Securing OSs, Kernels and Protocol<br />

Stacks<br />

Various public databases (such as [2]) provide statistics on<br />

public bugs or vulnerabilities on all kinds of software. These<br />

databases clearly show that current OSs and kernels suffer<br />

from a great number of errors and weaknesses, no matter who<br />

writes them, and no matter how long they have been in the<br />

field. For example, new errors are still reported in the<br />

thousands every year on “well-known” systems such as Linux.<br />

This situation is basically due to the inherent complexity of<br />

such OSs and kernels, which rely more and more on complex<br />

and sophisticated hardware. OSs and kernels are by nature<br />

concurrent and hugely complex because of the need to support<br />

various kinds of peripherals (interrupt handling becomes<br />

more and more difficult), the performance objectives (e.g.<br />

complexity of cache management), the resource consumption<br />

issues (e.g. need for a sophisticated power management), etc.<br />

This complexity increases with time, increases with new IoT<br />

architectures and increases when it comes to real<br />

microprocessors (as opposed to microcontrollers).<br />

Even Trusted Execution Environments (TEEs), i.e. small<br />

security OSs that were introduced to very significantly reduce<br />

the size of the TCB, are regularly attacked ([11], [12], [16]).<br />

The real challenge (and the only known solution) is to produce OSs, kernels and software stacks for the TCB and to demonstrate that they are as close as possible to “zero-bug”, i.e. free from errors (in their design and implementation) that could potentially be exploited for logical attacks.<br />

Traditional software engineering techniques such as<br />

exhaustive testing or code inspections are clearly not sufficient<br />

anymore to bring the level of assurance that is needed to secure<br />

complex open systems. This is due to the fact that there are too<br />

many different situations to consider for a kernel designer or<br />

tester and no real methods to review the quality of such kernel<br />

code in a systematic way, besides the use of proof techniques.<br />

Instead we believe the only valid response to such<br />

complexity is a special class of formal methods, which are<br />

known as deductive techniques or proof techniques. Even other<br />

formal methods such as static analysis or model checking are<br />

not fully addressing the problem at hand. More details are<br />

presented in [1].<br />

2.2 Two Representative Attacks<br />

Many attacks on IT systems are reported every day. Here<br />

we use two very different ones as a matter of illustration. The<br />

first one is the so-called 2015 attack on the Ukrainian power<br />

grid [14]. It is quite representative of problems coming from<br />

the complexity of the general architecture of large-scale IT<br />

systems. Such attacks indeed exploit weaknesses in the general<br />

architecture or in its configuration.<br />

The second is the so-called Heartbleed attack which is one<br />

of the many attacks and vulnerabilities that were found on<br />

SSL/TLS over time [12]. The latter attack is very representative<br />

of attacks that exploit the complexity of the software itself.<br />

Such bugs are very similar to the bugs that can be found in<br />

error-prone software components such as OS kernels or<br />

communication stacks.<br />

Errors are not only found in software. They can also happen<br />

at the hardware level and lead to logical and remote attacks<br />

such as the recently announced Meltdown [7] and Spectre [8]<br />

attacks. Other cache attacks had been demonstrated in the past<br />

([9], [10]) and new ones will probably be found in the future.<br />

We believe that hardware designs should also eventually be formally proven, at least for their TCB parts (MMU, ARM<br />

TrustZone mechanism, etc.). This will not prevent non-logical<br />

attacks such as the Rowhammer attack presented in [4], but it<br />

would prevent at least a large majority of logical attacks.<br />

However, errors in hardware that can be exploited for large<br />

scale remote attacks are very rare (one or two are found every<br />

year as of now) and they can usually be addressed by proper<br />

software countermeasures. Prove & Run has developed<br />

ProvenCore, a formally proven OS kernel that relies on only a few simple hardware mechanisms and can be used to implement a very secure firmware update mechanism, so that the risks from such hardware attacks are not only minimized but, when they materialize, can be easily fixed through a very robust over-the-air firmware update.<br />

3 ADDRESSING THE NEW CHALLENGE<br />

The proposed approach to design an extremely secure filter<br />

builds on the approach we presented in [1]. We recall here<br />

briefly this approach before presenting new ideas that can be<br />

used to develop this filter. Some of these ideas are patent<br />

pending.<br />



First it is important to use state-of-the-art security<br />

methodologies such as the one proposed by the Common<br />

Criteria framework. In particular we assume that for each<br />

architecture and use case a proper risk analysis and threat<br />

model are made available, and that a proper security target has<br />

been defined and is used to guide the security architect, the<br />

developers, the testers and the security evaluator. It is worth<br />

noticing that such documents can be reused from one<br />

evaluation to another so as to further reduce costs.<br />

We also recommend as described in [1] to explicitly<br />

describe a clear “security rationale” that fully explains the<br />

hypotheses, conditions and reasons why the security<br />

architecture meets the desired security level. The security<br />

rationale should not only describe the countermeasures used to address each threat but also provide a justification as detailed and convincing as an informal mathematical proof.<br />

The last step of the approach is to define an architecture<br />

that is based on a TCB that contains only formally proven<br />

kernels and protocol stacks. So in the end the security rationale<br />

for the most complex parts of the TCB must rely on formally<br />

proven software (and a tool is necessary to check that the proof itself is free of errors), whereas the other, simpler parts of<br />

the security rationale are presented as an informal proof which<br />

can be easily audited by experts. Now, instead of formally verifying large OSs and kernels such as Linux or Android, where new features and drivers are added on an ongoing basis to address new requirements, we propose to use a separate formally proven secure OS kernel (in our case ProvenCore) to handle the peripherals that need to be secured and to run secure applications, in a way that allows us to:<br />

• Retain the normal OS (for example Linux, Android or<br />

any other proprietary OS or RTOS) and thus benefit<br />

from all its features,<br />

• Push the normal OS outside of the TCB, so that any<br />

error in the normal OS cannot be used to compromise<br />

the TCB,<br />

• Use a proven OS to perform security functions.<br />

Our formally proven kernel, ProvenCore, was designed in a<br />

way that makes it generic enough to be used as COTS<br />

(Commercial Off-the-Shelf) in virtually any IoT architecture.<br />

We describe here how this can be done on ARM<br />

architectures that account for the vast majority of the IoT<br />

market, but the same approach can be transposed to other CPU<br />

architectures.<br />

On ARM architectures and in particular on the Cortex-A<br />

and Cortex-M families of ARM microprocessors and<br />

microcontrollers, a security mechanism called TrustZone<br />

provides a low-cost alternative to adding a dedicated security<br />

core or co-processor, by splitting the existing processor into<br />

two virtual processors backed by hardware-based access<br />

control mechanisms. This lets the processor switch between<br />

two states, i.e. two worlds, typically the “Normal World” on<br />

one side and the “Secure World” on the other side. Therefore<br />

TrustZone can be used as an extremely small, security-oriented asymmetric hypervisor that allows:<br />

• The so-called Normal World to run on its own,<br />

potentially oblivious of the existence of the Secure<br />

World and,<br />

• The Secure World to have extra privileges such as the<br />

ability to have some part of the memory, as well as<br />

some hardware peripherals, exclusively visible and<br />

accessible to itself.<br />

In the proposed architecture the proven secure OS kernel,<br />

i.e. ProvenCore in our case, runs in the Secure World, and the<br />

rich but error-prone OS (Linux, Android, etc.) runs in the<br />

Normal World.<br />
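The Normal/Secure World split described above can be illustrated with a toy model (illustration only; in a real TrustZone system this partitioning is enforced by the bus and memory-system hardware, not by software checks like these):

```python
class TrustZoneMemory:
    """Toy model of TrustZone-style partitioning: each page is tagged
    secure or non-secure; the Secure World can access everything,
    while the Normal World only sees non-secure pages."""

    def __init__(self):
        self.pages = {}  # page number -> (secure?, value)

    def write(self, world, page, value, secure=False):
        if world == "normal" and secure:
            raise PermissionError("Normal World cannot create secure pages")
        self.pages[page] = (secure, value)

    def read(self, world, page):
        secure, value = self.pages[page]
        if secure and world != "secure":
            # the hardware access control makes secure pages invisible
            raise PermissionError("secure page invisible to Normal World")
        return value
```

In this model the Normal World can run "oblivious" of the Secure World: it simply never observes the pages reserved for it.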

4 PROPOSED APPROACH AND SOLUTION<br />

Here the key assumption (or in other words the requirement<br />

that is to be met for the proposed solution to be applicable) is<br />

that the list of commands and arguments that we want to allow<br />

in each direction can be made explicit and fully characterized.<br />

In other words the security architect or administrator must be<br />

able to express a precise filtering security policy on the<br />

commands and arguments that must go across the filter from<br />

one security domain to the other. This may be difficult to do<br />

within a standard information system: when security is not<br />

considered a high priority the administrator is often not in a<br />

position to fully characterize all the commands and arguments<br />

in use nor even to identify all information flows. However,<br />

defining such a filtering security policy is a must as soon as a<br />

high level of security is needed e.g. for connected SCADA and<br />

critical systems. If a filtering security policy goes beyond a few<br />

trivial commands taking no arguments, then the<br />

implementation of this policy as a filter must be formally<br />

proven. In the next section we will explore how formally<br />

proven filters can address the challenge of critical IoT systems.<br />
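As a concrete illustration of such an explicit characterization, a filtering security policy can be captured as a whitelist that enumerates every allowed command together with the exact domain of each of its arguments (the command and argument names below are invented for this sketch):

```python
# Hypothetical explicit filtering security policy: every allowed
# command is listed with the exact domain of each of its arguments.
ALLOWED_COMMANDS = {
    "set":  {"mode": {"on", "off"}},
    "read": {"sensor": {"temp", "pressure"}},
}

def is_allowed(command: str, args: dict) -> bool:
    """Accept a command only if it is whitelisted and every argument
    (no more, no fewer) takes a value from its declared domain."""
    spec = ALLOWED_COMMANDS.get(command)
    if spec is None or set(args) != set(spec):
        return False
    return all(value in spec[name] for name, value in args.items())
```

Anything not explicitly enumerated, including a whitelisted command with a missing or extra argument, is rejected by construction.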

4.1.1 Connected Critical Systems and SCADAs<br />

In the case of critical or SCADA systems it is usually<br />

necessary to accept incoming commands sent through a VPN<br />

by authorized remote agents either to perform routine<br />

maintenance and configuration or to exert manual control, at<br />

least in the case of an emergency situation where some remote<br />

administrators or decision makers need to take action quickly.<br />

In this case it is quite easy to identify and characterize the list<br />

of allowed incoming commands and outgoing responses 1 . The<br />

filtering security policy may be stateless or state-based. For<br />

example, an authorized user might be required to authenticate<br />

before issuing a command that modifies the configuration<br />

of the system. In this case the corresponding filtering security<br />

policy will obviously be state-based (i.e. identification and<br />

authentication are required before accepting a given<br />

command).<br />
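Such a state-based policy can be sketched as a small state machine (the states and command names here are hypothetical):

```python
# Minimal state-based filtering policy: configuration commands are
# accepted only after a successful authentication event.
class StatefulFilter:
    def __init__(self):
        self.authenticated = False

    def accept(self, command: str) -> bool:
        if command == "authenticate":   # credential checking is assumed
            self.authenticated = True   # to happen upstream (e.g. VPN)
            return True
        if command == "set_config":
            return self.authenticated   # state-dependent decision
        if command == "read_status":
            return True                 # always-allowed query
        return False                    # everything else is dropped
```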

In the case of the 2015 attacks on the Ukrainian power grid<br />

[14] it appears that only a weak security policy was enforced,<br />

i.e. users with only a low-level credential could still send any<br />

commands and receive any response from critical systems. In<br />

1 The control of outgoing responses is less sensitive but still makes attacks<br />

more difficult and is also useful in case confidentiality is at stake.<br />



their comprehensive report, Booz Allen Hamilton recommends, among other measures, (1) installing a stateful firewall or data diode, and (2) using a stronger authentication mechanism (such as two-factor authentication) for some of the accesses. Using a<br />

stateful applicative firewall would make it possible to enforce a proper security policy, but the security level of existing firewalls 2 is not sufficient to cope with potential attacks (considering the return on investment that could be obtained by organized crime). A data diode is simpler and<br />

therefore can be brought to the right level of security (for<br />

example some data diodes have obtained an EAL7 Common<br />

Criteria certification) but can only make sure that the flow of<br />

information goes in a single direction: it cannot selectively<br />

block some commands and allow others. In addition, such<br />

systems usually require bidirectional communications, so data<br />

diodes are not adequate for this purpose. The filter we propose<br />

in this paper brings the benefits of both, i.e. the resistance of a<br />

data diode with the selectivity and programmability of an<br />

applicative firewall.<br />

In the case of the Ukrainian critical infrastructure we would<br />

have proposed to clearly identify the list of remote commands<br />

that were acceptable for each authorized (and authenticated)<br />

user. This list could have been used as the base of a filtering<br />

security policy.<br />

4.1.2 Embedded Devices and the IoT<br />

In the case of embedded automotive, aeronautic, or railway<br />

connected equipment, or more generally any equipment part of<br />

the IoT, such filters will for example be placed in the gateways<br />

that exist for most of these systems, but may also be placed<br />

elsewhere (e.g. within the Telematic Control Unit of a car).<br />

In the automotive industry, this approach could be used to<br />

filter incoming V2X alerts coming from the car gateway.<br />

Today these alerts are delivered to the driver only through the<br />

dashboard, but in the very near future these alerts might be<br />

forwarded directly to the brake-control system, forcing the car<br />

to slow down. Filtering security policies similar to the ones<br />

described in the previous section may for example apply to<br />

data exchanged between the OEM and the car, and/or<br />

commands between various domains inside the car (such as<br />

chassis, engine or infotainment domains).<br />

Because of the new business models available to enterprising hackers, high-level security policies need to be expressed and enforced by the gateways. It is not easy (at best error-prone, and in some cases impossible with the right level of precision) to express such policies on the low-level objects (such as IP packets) that firewalls normally use.<br />

The administrator in charge of configuring such firewalls, or the security architect defining the gateway, has to use low-level concepts such as ports, whereas they would like to implement a high-level security policy in which they could precisely specify and restrict the types of high-level commands or responses that get in or out.<br />

2 See the list of existing certified firewalls<br />

https://www.commoncriteriaportal.org/pps/<br />

As we will see in the following section the resistance of<br />

such implementations is not high enough to cope with the<br />

remote attacks at stake. Thus, even if the firewalls are properly<br />

configured, hackers will still have many ways to attack such<br />

entry points. They will typically bypass information-flow<br />

policies by exploiting bugs and errors commonly found in<br />

protocol stacks and OSs used to implement such firewalls. In<br />

fact, the security level reached by the most secure firewalls is<br />

usually very limited. In addition, the most secure ones have an<br />

expensive bill of material, which does not fit well with<br />

embedded systems requirements.<br />

4.1.3 Limitations of Traditional Firewalls<br />

The firewall is indeed the right concept for controlling and building the segregation of an architecture, but it has two significant drawbacks: (1) the configuration of a firewall is usually expressed in terms of low-level protocol concepts such as ports, IP addresses, etc., and making sure that such a configuration implements the correct high-level security policy is difficult and, at best, very error-prone; (2) most importantly, the TCB of a firewall includes at least its OS as well as its protocol stacks, both of which are very error-prone. In practice, the complexity of the attack surface prevents this architecture from meeting the<br />

highest level of security, which is a must for the use-cases at<br />

hand. The first drawback can be avoided by using applicative firewalls, so that security policies can be expressed with higher-level concepts very close to the objects of the security policy at hand; configuring the firewall to implement the right security policy then becomes simple and not error-prone.<br />

The second drawback is much more difficult to cope with<br />

and in fact we believe, as we will try to show in the next<br />

sections, that the TCB which necessarily includes at least one<br />

OS is very error prone (i.e. the TCB is complex and not<br />

formally proven as it should be).<br />

The attack surface of a traditional firewall is indeed<br />

unnecessarily large. In order to better understand this, let us<br />

consider an extremely simple (and unrealistic) security policy<br />

which is meant to impose that only the text command “set” can<br />

be sent remotely and that this command has a single mandatory<br />

parameter whose values can be only “on” or “off”. Let us<br />

consider here that these commands are sent using TCP/IP on an<br />

Ethernet network and let us consider in a first step, for the sake<br />

of simplicity, that we are not using a VPN or more generally<br />

that messages are not signed or encrypted.<br />

Even if this security policy is only to accept two possible<br />

commands: “set on” and “set off”, the degrees of freedom for<br />

the attacker are huge, and hence so is the attack surface. First, at<br />

the lexical level, the attacker could insert spaces in the text<br />

command (or other allowed delimiters such as tabs) in an<br />

attempt to exploit, for example, implementation bugs they have<br />

found in the lexical analyzer. They could in the same way<br />

exploit bugs in the syntactic analyzer (typically after reverse<br />

engineering it). The chances that they find problems that lead<br />

to real attacks there are limited because lexical and syntactic<br />

analysis is a well-understood software engineering problem<br />

with lots of available scientific know-how and tools. However<br />

such weaknesses may still exist anyway (inadequate grammar<br />

type, buffer overflow due to improper memory configuration,<br />



etc.). What is important in this case is that such degrees of<br />

freedom will typically exist within each layer of the protocol<br />

stack (e.g. application layer, host-to-host transport layer,<br />

internet layer, network interface layer), which enlarges the<br />

attack surface, raising the likelihood of finding an exploitable<br />

bug. Wireless communication links are more exposed to these<br />

issues compared to wired ones because radio technologies (i.e.<br />

GSM, WiFi, Bluetooth, ZigBee, etc.) are usually complex and<br />

very error-prone. In addition, in an OS such as Linux, protocol stacks are part of the kernel, which makes attacks even<br />

simpler. In any case attackers will have an extremely large<br />

surface of attack (i.e. many degrees of freedom) to try to<br />

exploit bugs in the various protocol layers or in the OS itself.<br />
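By contrast, a filter that leaves no lexical degrees of freedom can be sketched as an exact-match check (a deliberately minimal sketch of the “set on”/“set off” policy above):

```python
from typing import Optional

# A filter with zero lexical degrees of freedom: only the exact byte
# strings "set on" and "set off" pass. Added spaces, tabs, case changes
# or trailing newlines are rejected outright, so none of them can be
# used to probe the implementation on the receiving side.
ACCEPTED = {b"set on", b"set off"}

def filter_command(raw: bytes) -> Optional[bytes]:
    """Return the command unchanged if it matches exactly, else None."""
    return raw if raw in ACCEPTED else None
```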

4.2 Proposed Architecture<br />

Instead of filtering low-level packets we propose to filter<br />

high-level commands and arguments directly, using a so-called<br />

“protocol break.” We propose to implement this filter as a formally proven (or at least highly secure) application (stateful or stateless, depending on the requirements of the task) that operates only on high-level commands and arguments, running<br />

on a formally proven and secure OS. This OS will have to<br />

guarantee a number of security properties (such as separation,<br />

integrity, etc.) and, in addition, will have to enforce<br />

configurable information-flow policies between its<br />

components. This information-flow policy will make sure that<br />

communication flows coming from the outside (e.g. incoming<br />

commands) go through the filtering application which is the<br />

one applying the filtering security policy.<br />

Fig. 1.<br />

Following is an example of such an architecture in which<br />

we use ProvenCore to guarantee the security properties<br />

required to host the filtering application such as isolation,<br />

confidentiality and integrity [1]. As presented in Figure 1<br />

ProvenCore also enforces a (programmable) information-flow<br />

policy between the various security applications and between<br />

the hardware peripherals and the corresponding drivers and<br />

other security applications. This policy ensures that there is no<br />

possibility for an incoming command or outgoing response to<br />

somehow bypass the filtering application. It is materialized by<br />

the black arrows that represent the only authorized<br />

communication channels.<br />

In this architecture the twin protocol stacks used to support<br />

the protocol break execute as distinct processes on the same<br />

instance of ProvenCore. Since ProvenCore guarantees the<br />

integrity and separation of the processes it executes, even a<br />

severe problem within the hardware drivers or in the protocols<br />

stack themselves will not lead to any security problem besides<br />

a lack of availability 3 .<br />
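The protocol break itself can be modeled conceptually as follows (queues stand in for the two isolated protocol-stack processes; all names are illustrative, and ProvenCore's actual inter-process communication mechanisms are not shown):

```python
from queue import Queue

# Conceptual model of the protocol break: the outer protocol stack
# delivers raw commands into one queue, the filtering application
# re-emits only validated commands into a second queue read by the
# inner protocol stack. Nothing else crosses the boundary.
def protocol_break(inbox: Queue, outbox: Queue, policy) -> None:
    while not inbox.empty():
        raw = inbox.get()
        cmd = policy(raw)      # parse + check against the filtering policy
        if cmd is not None:
            outbox.put(cmd)    # forward the canonical, validated form only
```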

In the example above the filtering application implements<br />

two filtering security policies: one on incoming commands,<br />

one on outgoing responses. More than one filtering application can be used with more complex topologies, in which incoming (resp. outgoing) messages are routed to different<br />

filters according to their nature, but the overall principles<br />

remain unmodified.<br />

Such an architecture allows us to design a filter that can be<br />

formally proven or more generally brought to the highest level<br />

of certification. We have summarized our architecture in Fig. 2.<br />

Fig. 2.<br />

The TCB is composed of (1) a formally proven kernel, here<br />

ProvenCore which is the very first formally proven kernel on<br />

the market with the proper security features to support this<br />

filtering architecture, and (2) a formally proven filtering<br />

application, which is by itself a very simple application, even though it includes not only the filtering per se but also the lexical and syntactic analysis of commands and data. This architecture thus allows us<br />

to obtain a filter (or an applicative firewall) whose TCB is<br />

entirely formally proven to satisfy the given filtering policy<br />

expressed in a simple, high-level formal language.<br />

In other words with traditional firewalls we had to cope<br />

with a very error prone TCB with a large attack surface, not<br />

surprisingly inadequate to meet the highest level of security.<br />

With this new kind of filter we are relying on a bulletproof, formally proven TCB, which in addition can be proved to implement exactly the intended filtering function. Not surprisingly, such a formally proven filter can be brought to the very highest levels of security.<br />

But there is more to it. Even with a bullet proof filter there<br />

is still the problem that we might be forced to authorize<br />

potentially damaging commands (i.e. it is very likely that we<br />

have to accept as part of the filtering security policy some<br />

commands that are dangerous but necessary). So the remaining problem is not tampering with the filter (or the security policy) but the fact that some valid commands<br />

3 The lack of availability that would result from a successful attack on the<br />

protocol stacks can be mitigated by adding complementary security<br />

applications running in parallel to detect such attacks (such as a specialized<br />

IDS, i.e. Intrusion Detection System) and providing a security application in<br />

charge of reloading a new update over the air (or even inspecting and repairing the<br />

other software components). This is not featured here as it is out of scope of<br />

the current paper.<br />



may be used to attack the receiving side. Our artificially simple “set on”/“set off” filtering security policy illustrates plainly that attackers have almost no degree of freedom left to perform an<br />

attack on the receiving side. The only commands that can be<br />

sent are “set on” and “set off” as planned and the filtering<br />

application will leave absolutely no degree of freedom in the<br />

way any of them can be expressed. The situation would be<br />

exactly the same for more complex and realistic filtering<br />

security policies: the only degree of freedom left is indeed the<br />

one allowed by the filtering policy itself. But the commands<br />

that are defined as being acceptable by the filtering security<br />

policy could be dangerous by themselves. For example, most<br />

embedded devices will need a "firmware_update" command to<br />

manage the firmware update process for the whole platform.<br />

For this reason, it is usually also important to make sure that<br />

incoming commands have not been tampered with and have<br />

been issued by authorized and trusted persons. In other words it<br />

is necessary to add proper authentication, and also guarantee<br />

the integrity and potentially the confidentiality of the<br />

commands. Guaranteeing these security properties is typically<br />

the role of a proper VPN. Here we propose to integrate a VPN<br />

application that can be brought to the same level of security as<br />

the filtering application(s). This will give the simplified<br />

architecture presented in Figure 3.<br />
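The combination of authentication and filtering can be sketched as a MAC check preceding the policy check (the key and message format are invented for illustration; a real deployment would rely on a full VPN protocol with key management and replay protection):

```python
import hashlib
import hmac
from typing import Callable, Optional

# Illustrative shared secret only; real systems use a VPN with proper
# key management, not a hard-coded key.
KEY = b"demo-shared-secret"

def authenticate_then_filter(
    tag: bytes, raw: bytes, policy: Callable[[bytes], Optional[bytes]]
) -> Optional[bytes]:
    """Drop the message unless its MAC verifies, then apply the policy."""
    expected = hmac.new(KEY, raw, hashlib.sha256).digest()
    if not hmac.compare_digest(tag, expected):
        return None  # unauthenticated traffic never reaches the policy
    return policy(raw)
```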

Fig. 4.<br />

Now the same benefits can be achieved for any kind of<br />

(stateful or stateless) filtering security policy. Another<br />

significant advantage is that this can be achieved without any<br />

impact on the bill of materials and therefore at very little cost.<br />

Therefore such filters are not only much more secure than<br />

existing ones, but this architecture is applicable to cost-sensitive devices sold in large volumes. The only costly<br />

investment was the design, implementation and formal proof of<br />

the security of ProvenCore, an investment which has been done<br />

once and for all and can benefit the huge volumes of compatible devices from various market segments. Depending on the situation, these filters can be used to replace existing filters or to complement them (e.g. placed in sequence with another firewall or an IPS).<br />

Fig. 3.<br />

Using a proper VPN thus further reduces the attack surface,<br />

and shows the benefit that can be obtained by the use of these<br />

new generation of filters. Our artificially simple filtering<br />

security policy makes it easy to see that an attacker would have<br />

only one degree of freedom left: the possibility of (either)<br />

slowing down (or theoretically accelerating although this<br />

would be much harder) the reception of ingoing commands.<br />

Attackers would have no other degree of freedom and thus the<br />

attack surface for performing any attack would be almost nil.<br />

Here the fact that the TCB is formally proven and can be brought<br />

to the highest levels of security is key. It allows the filtering<br />

application itself to be brought to the highest level of security<br />

and we believe that such a possibility is a real breakthrough in<br />

the firewalling/filtering world.<br />

4.3 A Practical Implementation<br />

In practice, the architecture presented above can be easily<br />

implemented on an ARM processor using the architecture<br />

presented in Figure 4.<br />

5 CONCLUSION<br />

In this paper we have shown why it is very difficult (or<br />

even impossible) to bring traditional firewalls and filters to the<br />

required level of security. We have proposed an approach that<br />

allows us to build new filters based on protocol breaks where<br />

the software TCB is made very simple and is just composed of<br />

a formally proven kernel, namely ProvenCore here (which is<br />

currently seeking a Common Criteria EAL7 certification), and<br />

a few security applications that can also be easily formally<br />

proven. The other parts of the software stack that normally compose a firewall, such as the drivers, the protocol stack, and the regular OS, are here kept outside of the TCB. This is why<br />

such filters can be brought to levels of security that only simple<br />

physical data diodes could previously meet.<br />

REFERENCES<br />

[1] D. Bolignano, “Proven Security for the Internet of Things,” in<br />

proceedings of the Embedded World Conference 2016, February 2016.<br />

[2] National Vulnerability Database. NIST. [Online]. Available:<br />

https://web.nvd.nist.gov/view/vuln/search [Accessed 15 Jan. 2016].<br />

[3] C. Miller and C. Valasek. "A survey of remote automotive attack<br />

surfaces". [Online] Available:<br />

http://illmatics.com/remote%20attack%20surfaces.pdf [Accessed 15 Jan.<br />

2016].<br />

[4] M. Seaborn and T. Dullien, "Project Zero: Exploiting the DRAM rowhammer bug to gain kernel privileges," 2015. [Online]. Available: http://googleprojectzero.blogspot.fr/2015/03/exploiting-dram-rowhammer-bug-to-gain.html.<br />

[Accessed 15 Jan. 2016].<br />

387


[5] "ADAC deckt Sicherheitslücke auf - BMW-Fahrzeuge mit<br />

'ConnectedDrive' können über Mobilfunk illegal von außen geöffnet<br />

werden,” 2015. [Online]. Available:<br />

https://www.adac.de/infotestrat/adac-im-einsatz/motorwelt/bmwluecke.aspx?ComponentId=227555&SourcePageId=6729.<br />

[Accessed 15<br />

Jan. 2016].<br />

[6] C. Miller and C. Valasek, "Remote Exploitation of an Unaltered<br />

Passenger Vehicle". IOActive, Seattle, WA, Tech. Rep., 2015. [Online].<br />

Available:<br />

http://www.ioactive.com/pdfs/IOActive_Remote_Car_Hacking.pdf.<br />

[Accessed 15 Jan. 2016].<br />

[7] M. Lipp, M. Schwarz, D. Gruss, T. Prescher, W. Haas, S. Mangard, P.<br />

Kocher, D. Genkin, Y. Yarom and M. Hamburg, "Meltdown," [Online].<br />

Available: https://arxiv.org/abs/1801.01207 [Accessed 11 Jan. 2018].<br />

[8] P. Kocher, D. Genkin, D. Gruss, W. Haas, M. Hamburg, M. Lipp, S.<br />

Mangard, T. Prescher, M. Schwarz and Y. Yarom, "Spectre Attacks:<br />

Exploiting Speculative Execution," [Online]. Available:<br />

https://arxiv.org/abs/1801.01203 [Accessed 11 Jan. 2018].<br />

[9] D.A. Osvik, A. Shamir and E. Tromer, "Cache Attacks and<br />

Countermeasures: The Case of AES," in Pointcheval D. (eds) Topics in<br />

Cryptology – CT-RSA 2006. CT-RSA 2006. Lecture Notes in Computer<br />

Science, vol 3860. Springer, Berlin, Heidelberg.<br />

[10] M. Lipp, D. Gruss, R. Spreitzer, C. Maurice, S. Mangard:<br />

"ARMageddon: Cache Attacks on Mobile Devices," in 25th USENIX<br />

Security Symposium (USENIX Security 16). Austin, TX : USENIX<br />

Association, August 2016.<br />

[11] C. Cohen, “AMD-PSP: fTPM Remote Code Execution via crafted EK<br />

certificate,” [Online]. Available:<br />

http://seclists.org/fulldisclosure/2018/Jan/12 [Accessed 11 Jan. 2018].<br />

[12] G. Beniamini, "Trust Issues: Exploiting TrustZone TEEs," [Online]. Available: https://googleprojectzero.blogspot.com/2017/07/trust-issues-exploiting-trustzone-tees.html<br />

[13] TLS/SSL Explained – Examples of a TLS Vulnerability and Attack,<br />

Final Part. [Online]. Available:<br />

https://www.acunetix.com/blog/articles/tls-vulnerabilities-attacks-finalpart/<br />

[Accessed 11 Jan. 2018].<br />

[14] When The Lights Went Out - A Comprehensive Review Of The 2015<br />

Attacks On Ukrainian Critical Infrastructure. [Online]. Available:<br />

https://www.boozallen.com/content/dam/boozallen/documents/2016/09/<br />

ukraine-report-when-the-lights-went-out.pdf [Accessed 11 Jan. 2018].<br />

[15] Internet Security Threat Report, Volume 22, April 2017, [Online].<br />

Available:<br />

https://www.symantec.com/content/dam/symantec/docs/reports/istr-22-<br />

2017-en.pdf [Accessed 11 Jan. 2018].<br />

[16] G. Beniamini, "Extracting Qualcomm's KeyMaster Keys - Breaking Android Full Disk Encryption," [Online]. Available: http://bits-please.blogspot.fr/2016/06/extracting-qualcomms-keymaster-keys.html<br />

[Accessed 15 Jan. 2018].<br />

www.embedded-world.eu<br />

388


From Matlab To FPGA in<br />

Manageable Steps, a True Story in<br />

Double Precision<br />

Mike Looijmans<br />

System Expert<br />

Topic Embedded Products B.V.<br />

Best, The Netherlands<br />

mike.looijmans@topic.nl<br />

Abstract—The growing computational power of our<br />

machines seems to only increase our hunger for even more<br />

teraflops. But at the same time we strive for low power<br />

consumption and flexibility. Our desktop CPUs have seen relatively little improvement over the past decade, unlike the<br />

highly capable hybrid systems that combine CPU, GPU and<br />

FPGA architectures, like Xilinx' Zynq MPSoC. FPGA based<br />

cloud computing provides computational power by the hour to<br />

those who need it.<br />

While FPGA devices offer a unique balance of flexibility and<br />

efficiency, programming these devices has usually been restricted<br />

to that handful of specialists who have the necessary knowledge<br />

and skills. This has been the major limiting factor in the broad<br />

adoption of these systems. And hybrid CPU/FPGA systems only<br />

appear to increase the amount of skill required, by requiring the<br />

engineer to also cope with the complexity of coupling the<br />

subsystems together.<br />

In this presentation I will show the complete flow from a Matlab model to its implementation in a hybrid CPU/FPGA system, a Xilinx Zynq. All that is required is a general<br />

understanding of what an FPGA is, and how it can be used to<br />

implement mathematical algorithms. No VHDL or Verilog<br />

experience required.<br />

The Matlab function in question is the discrete wavelet<br />

transform, often used in signal compression and pattern<br />

recognition. The algorithm implementation uses double-precision<br />

floating point math, usually frowned upon by FPGA engineers,<br />

but we will see later that this poses no problem for the hardware.<br />

We ported the implementation to plain C++ (or C) code and wrote test code to verify it. This test code is used throughout the project to detect regressions. We ran the code on the target CPU platform to get a baseline benchmark. The next step was to optimize the code for a more efficient implementation, and to test and benchmark it again on a desktop PC. Then we added an FPGA card to this desktop machine, to make it mimic the target.<br />

We used Dyplo to produce a bitstream for that FPGA card in a<br />

few clicks, which also adds high speed data transfer capability<br />

between the desktop CPU and FPGA. We'll pass the algorithm's<br />

C++ code on to Dyplo for implementation on the FPGA, and<br />

using the existing test code we can verify its performance. In a<br />

few iterations, the algorithm runs at the required speed, and<br />

within the resource limits. We can use the exact same software<br />

and tools to generate the final software and FPGA firmware for<br />

the Zynq target.<br />

In a few days work, we produced an implementation for the<br />

discrete wavelet transform on a hybrid CPU/FPGA platform that<br />

outperforms the CPU-only implementation in both speed and<br />

power efficiency. During the design we used test-driven software development, and at a sustainable pace arrived at the set goals.<br />

Keywords—acceleration; Dyplo; FPGA; wavelet<br />

I. INTRODUCTION<br />

Before diving into the technical challenge here, let's first<br />

describe the project.<br />

Delirium (acute brain failure) affects over 3 million<br />

hospitalized patients in Europe every year. It is a potentially<br />

fatal medical emergency that regularly leads to long-term<br />

cognitive impairment (dementia), longer hospital admission<br />

and higher healthcare costs. Its effect increases as the episode<br />

lasts longer, so timely detection is essential. To date, delirium is detected too seldom and too late, using subjective and ineffective methods.<br />

The project's mission is: "Safe and accurate delirium<br />

monitoring in routine hospital care". The DeltaScan Monitor is<br />

to be a brain activity analyzer that performs an algorithmic<br />

recognition of delirium, a combination of hardware and<br />

software and hence, an embedded device.<br />

Our customer, Prolira, has composed and verified a clinical<br />

mathematical model of the detection algorithm, and converted<br />

the algorithm into C++ code. From the C++ implementation,<br />

we have already learned that a modern desktop PC is able to<br />

run the algorithm within the given performance limits. Our<br />

challenge now is to implement this algorithm on a portable,<br />

battery-powered platform.<br />


389


II. PLATFORM<br />

From analysis of the mathematical model we derive that the<br />

core operations are discrete wavelet transforms (DWT) [1],<br />

both forward and inverse (iDWT). The discrete wavelet<br />

transform is based on convolution [2] operations, which in turn<br />

are multiply-accumulate (MAC) operations. These are very<br />

suitable for implementation on a broad range of accelerators,<br />

like SIMD, GPU, DSP and FPGA. Since a desktop PC is<br />

capable of meeting the performance requirements, we can be<br />

assured that it will also be possible to map the algorithm on a<br />

number of embedded platforms.<br />

Apart from running the algorithm, the embedded device<br />

must also acquire data in realtime from the frontend probe<br />

using an analog-to-digital converter (ADC), pre-process the<br />

data, provide appropriate clock and sample signals for the<br />

ADC, interpret the results and display them on a screen,<br />

provide a graphical user interface (GUI) for controlling the<br />

device, and monitor the battery status and the integrity of<br />

various other components.<br />

These combined requirements lead to the selection of a<br />

Xilinx Zynq 7000 series device, in particular the Topic Miami<br />

7030 system-on-module (SoM), as the central processing part.<br />

This is a combined dual-core ARM CPU and FPGA fabric,<br />

tightly coupled, in a single chip. We run Linux on the system,<br />

which gives us driver support for all peripherals on the board<br />

as well as the foundation for a GUI, and allows for a<br />

convenient hardware abstraction and programming<br />

environment. The task of ADC data acquisition is offloaded to<br />

the FPGA, so we can avoid using a real-time OS.<br />

III. IMPLEMENTATION<br />

The first approach is to implement all processing on the<br />

CPUs, using the C++ code as is. The data is being processed in<br />

sets of 4096 samples in double-precision floating point format.<br />

Each set must be processed in 2 milliseconds, to meet the<br />

performance deadlines. The C++ implementation does not<br />

come close to that, but we already anticipated that. What we do<br />

want to determine in this phase is where the performance<br />

bottlenecks are. Not surprisingly, this turns out to be the<br />

discrete wavelet transform function, which takes about 3<br />

milliseconds on this CPU, and each set requires 5 DWT<br />

operations. If we can offload the DWT operations to the FPGA,<br />

the system will meet the performance requirements.<br />

IV. DEVELOPMENT SETUP<br />

We'll first prototype the system, so we can work in a convenient development environment. Instead of an embedded board, we'll use a generic desktop PC with a PCIe<br />

card holding a comparable FPGA (a Kintex 160 series, similar<br />

fabric but with more logic cells). This sounds like introducing<br />

yet another issue, but this is completely mitigated by the<br />

operating system. Both systems run Linux, so whether the<br />

FPGA is connected through PCIe or directly integrated in the<br />

chip is completely transparent to our software thanks to<br />

hardware abstraction.<br />

What remains is the job of creating a "bitstream" for the FPGA to load, so we can communicate with it. This is quite an easy task: start the Dyplo DDE, create a new project, select the FPGA board from the list, and it's basically done. What's left for us is<br />

to select how many parallel DMA transfers we want, and what<br />

parts of the fabric we wish to use for our algorithms. I had already done this for another project, so I simply re-used everything from that. It's very convenient to create a basic project with<br />

lots of bandwidth and free space that can be used for<br />

prototyping. The one I use divides the FPGA into 8 reprogrammable<br />

regions and provides 4 DMA channels, though<br />

for this project we'd only need one of each.<br />

Once that first static bitstream is loaded onto the board, we<br />

can plug it into the PC. Since Dyplo can reprogram the<br />

"algorithm" parts of the FPGA any time through the PCIe bus,<br />

we won't need to do this again.<br />

To send and receive data through the DMA channels, all we<br />

need to do in software is open a device file and read or write it,<br />

depending on the direction of the data flow. Before we can do<br />

that, we call a function that programs the algorithm into a part<br />

of the FPGA, and another to set up the data path between the<br />

DMA and algorithm nodes. All that is left for us to do now is<br />

actually implement the algorithm, writing (unit)test code to<br />

check the results and to measure performance.<br />
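As an illustration of this read/write pattern, here is a minimal C++ sketch; an ordinary file stands in for the Dyplo DMA device node (the real node paths and the Dyplo configuration and routing calls are product-specific and omitted):<br />

```cpp
#include <cstdio>
#include <vector>

// In the real setup "path" would be a Dyplo DMA device node (an assumption:
// actual node names are product-specific); here an ordinary file stands in
// and simply echoes the data back instead of an FPGA-transformed result.
std::vector<double> loopback_through_file(const char* path,
                                          const std::vector<double>& samples) {
    // "Send": write the sample block to the device file.
    FILE* f = std::fopen(path, "wb");
    std::fwrite(samples.data(), sizeof(double), samples.size(), f);
    std::fclose(f);

    // "Receive": read the result block back from the device file.
    std::vector<double> result(samples.size());
    f = std::fopen(path, "rb");
    std::fread(result.data(), sizeof(double), result.size(), f);
    std::fclose(f);
    return result;
}
```

The point of the abstraction is that the same open/read/write calls work whether the node is backed by PCIe DMA or by the Zynq's internal interconnect.<br />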

V. ALGORITHM CONVERSION<br />

Now that we have unit tests in place, we can start on the conversion of the C code into a hardware description language (HDL). To aid in the conversion, we'll be using Dyplo.<br />

Though the FPGA excels at processing data at high speed, things like dynamic memory allocation cannot be practically implemented in logic. So the very first step for us is to refactor the code, changing for example std::vector manipulations into simple arrays, and moving all memory management outside the algorithm's main body. This turns out to be the major part of the work. As an added bonus, these changes also make the C++ algorithm run considerably faster on the CPU.<br />
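The kind of refactoring meant here can be sketched as follows; the function and the fixed set size are illustrative, not the project's actual code:<br />

```cpp
#include <vector>
#include <cstddef>

// HLS-unfriendly: dynamic allocation inside the algorithm body.
std::vector<double> scale_dynamic(const std::vector<double>& in, double k) {
    std::vector<double> out;              // heap allocation: not synthesizable
    out.reserve(in.size());
    for (double v : in) out.push_back(v * k);
    return out;
}

// HLS-friendly: fixed-size arrays, caller owns all memory.
const std::size_t SET_SIZE = 4096;        // one data set, as in the paper
void scale_static(const double in[SET_SIZE], double out[SET_SIZE], double k) {
    for (std::size_t i = 0; i < SET_SIZE; ++i)
        out[i] = in[i] * k;               // plain loop over a static bound
}
```

The static version gives the synthesizer known bounds and no heap, which is also why it tends to run faster on the CPU.<br />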

Fig. 1. Processing pipeline<br />

So the system we want to build now must transmit a data<br />

set of 4096 samples from CPU memory to the FPGA, run the<br />

algorithm in FPGA logic and return the result, also 4096<br />

values, back to the CPU for further processing. We'll use Dyplo<br />

for the communication between CPU and FPGA, and to assist<br />

in the implementation of the FPGA logic.<br />

Fig. 2. Wavelet filterbank implementation (source: Wikipedia [1])<br />

The wavelet transform is implemented as a "filterbank" [1]. This repeatedly applies a low-pass filter g[n] and a matching high-pass filter h[n] to the data set, thus converting the sample data into the wavelet domain. With 4096<br />

390


samples per set, there are 11 stages. With a filter kernel size of 8, the first stage requires 4096x8 MAC operations, and the remaining stages require an equal amount in total, so the conversion needs about 65536 MAC operations per set.<br />
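One such filterbank stage can be sketched as follows; the 2-tap Haar kernels used in the test are only for brevity (the project's kernels were 8 taps long), and the inner loop is exactly the multiply-accumulate work counted above:<br />

```cpp
#include <vector>
#include <cstddef>

// One analysis stage of a wavelet filterbank: convolve the input with a
// low-pass kernel g and a high-pass kernel h, then keep every second output
// (downsampling by 2). The "approx" half feeds the next stage; the "detail"
// half holds this stage's wavelet coefficients.
struct StageOut { std::vector<double> approx, detail; };

StageOut dwt_stage(const std::vector<double>& x,
                   const std::vector<double>& g,
                   const std::vector<double>& h) {
    StageOut out;
    for (std::size_t n = 0; n + g.size() <= x.size(); n += 2) {
        double lo = 0.0, hi = 0.0;
        for (std::size_t k = 0; k < g.size(); ++k) {  // multiply-accumulate
            lo += x[n + k] * g[k];
            hi += x[n + k] * h[k];
        }
        out.approx.push_back(lo);
        out.detail.push_back(hi);
    }
    return out;
}
```

With an 8-tap kernel and 4096 inputs, this stage performs 4096 x 8 MACs, matching the count given above.<br />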

In the Dyplo development environment, we create a new C/C++ "task" and point it to the algorithm's C++ implementation file. We select the function to instantiate in logic; Dyplo then creates the interface logic around it and creates a Vivado HLS project that will do the main conversion. This step completes without errors, so we can create a bitstream and run a<br />

functional test on the FPGA in the PC. This verifies that the<br />

HDL implementation is actually performing correctly in real<br />

hardware. This first attempt processes data at about 1.8MB/s,<br />

corresponding to 17 milliseconds per set. Some tuning and<br />

optimizing is required still to meet the target of 2 milliseconds.<br />

We open the Vivado HLS project and add the C test code to it. This allows us to simulate the end result without actually implementing it, and gives quick feedback on changes in terms of the number of clock ticks required to process one set of data. The major steps that contributed:<br />

- The first step of the algorithm processes 4096 samples and produces 2048 results; the next 9 loops produce the remaining 2048 results. Splitting this into two parts allows the hardware to implement them in parallel, which doubles the speed.<br />
- Adding the 8 filter results was done in a loop. Rewriting this as a binary tree gives another 2x speedup.<br />
- The convolution filter coefficients were accessed in reversed order. Reversing the coefficient array instead lets the code use them in natural order.<br />
- Inlining the high-pass and low-pass parts into a single function allows them to run in parallel and again doubles the speed.<br />
- Inserting the PIPELINE directive into the convolution code instructs HLS to re-use this block more efficiently and, combined with a few minor changes, made for a 5 times faster implementation.<br />
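The binary-tree rewrite of the 8-tap summation can be illustrated as follows (tap values and function names are ours, not the project's):<br />

```cpp
// Sequential accumulation: each addition depends on the previous result,
// so hardware must chain 7 adders one after another.
double sum_loop(const double t[8]) {
    double acc = 0.0;
    for (int i = 0; i < 8; ++i) acc += t[i];
    return acc;
}

// Balanced binary tree: the additions at each level are independent, so a
// synthesizer can schedule them in parallel (3 adder levels instead of 7).
double sum_tree(const double t[8]) {
    double a = (t[0] + t[1]) + (t[2] + t[3]);
    double b = (t[4] + t[5]) + (t[6] + t[7]);
    return a + b;
}
```

Both produce the same result; only the dependency structure, and hence the achievable parallelism, differs.<br />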

The end result was an HDL implementation that used 363<br />

microseconds per set of 4096 samples, well below our set goal<br />

of 2 milliseconds. Since each set requires 65536 MAC<br />

operations, the implementation is performing about 180 million<br />

double precision floating point multiply-accumulate operations<br />

per second.<br />

Since the embedded board uses a similar FPGA, all that needed to be done was to run the place-and-route for the final<br />

embedded board, cross-compile the test benches, and copy<br />

these to the board for a final test to confirm the results.<br />

VI. FINAL RESULTS<br />

The measured speed of the algorithm is 363 microseconds<br />

per set of 4096 samples. This includes the overhead of sending<br />

the data to FPGA and back.<br />

The algorithm takes less than 15% of the resources, and can<br />

be instantiated multiple times to further increase throughput. In<br />

this case, it has been chosen to place 2 instances in the system,<br />

which is convenient since there are also 2 CPU cores to deliver<br />

and process the data.<br />

The FPGA consumes 200mW while actively running two instances of the algorithm. The CPU consumes 400mW during 2 milliseconds to perform the same calculation, so the total power saving is substantial: for 2 sets, the FPGA would use 200mW during 363 microseconds, while the CPU would use 400mW for 2 milliseconds, so the CPU solution uses about 11 times more energy for the same amount of data.<br />

Total time spent on this part of the project was about 4 days<br />

on the C++ conversion to make it suitable for FPGA synthesis,<br />

and another day to optimize performance.<br />

VII. FINDINGS<br />

What we learned during this project:<br />

A remarkably fast implementation cycle was achieved by a software engineer with no prior FPGA experience. Tools have apparently evolved to a point where knowledge of a hardware description language is no longer a requirement. Not only can the algorithm be described in C code; the test benches are written in C code as well.<br />

The best C to HDL optimizations were accomplished by<br />

code manipulation. Changing the code to reflect what we want<br />

the hardware to do was far more efficient than attempting to<br />

guide the HDL process using only compiler directives.<br />

Maintaining double-precision floating point does not pose a<br />

problem for hardware synthesis. This leads to a much shorter<br />

time to market because a conversion to fixed point usually<br />

requires a much more thorough understanding of the domain.<br />

The bandwidth of the CPU-FPGA communication can<br />

become a dominating parameter. Describing this in detail is<br />

material for a separate technical paper.<br />

An agile approach using a test driven design methodology<br />

assured progress and traceability.<br />

REFERENCES<br />

[1] https://en.wikipedia.org/wiki/Discrete_wavelet_transform<br />

[2] https://en.wikipedia.org/wiki/Convolution<br />


391


Partitioning of computationally intensive tasks<br />

between FPGA and CPUs<br />

Tobias Welti, MSc (Author)<br />

Institute of Embedded Systems<br />

Zurich University of Applied Sciences<br />

Winterthur, Switzerland<br />

tobias.welti@zhaw.ch<br />

Matthias Rosenthal, PhD (Author)<br />

Institute of Embedded Systems<br />

Zurich University of Applied Sciences<br />

Winterthur, Switzerland<br />

matthias.rosenthal@zhaw.ch<br />

Abstract—With the recent development of faster and more<br />

complex Multiprocessor System-on-Chips (MPSoCs), a large<br />

number of different resources have become available on a single<br />

chip. For example, Xilinx's Zynq UltraScale+ is a powerful<br />

MPSoC with four ARM Cortex-A53 CPUs, two Cortex-R5 real-time cores, an FPGA fabric and a Mali-400 GPU. Optimal<br />

process partitioning between CPUs, real-time cores, GPU and<br />

FPGA is therefore a challenge.<br />

For many scientific applications with high sampling rates and<br />

real-time signal analysis, an FFT needs to be calculated and<br />

analyzed directly in the measuring device. The goal of<br />

partitioning such an FFT in an MPSoC is to make best use of the<br />

available resources, to minimize latency and to optimize<br />

performance. The paper compares different partitioning designs<br />

and discusses their advantages and disadvantages. Measurement<br />

results with up to 250 MSamples per second are shown.<br />

Keywords—FPGA; UltraScale+ MPSoC; partitioning; ARM<br />

NEON; SIMD; asymmetric multi-processing; high performance<br />

FFT; low latency processing<br />

I. INTRODUCTION<br />

The transition from field-programmable gate arrays<br />

(FPGAs) to System-on-Chips (SoCs) in 2011 was the<br />

unavoidable development when FPGAs needed to execute ever<br />

more complex software programs. The soft-core processors<br />

available for inclusion in the programmable logic were either<br />

not powerful enough or took up too many logic resources. The<br />

combination of hardware processors with the FPGA, interconnected through a high-performance bus, showed the potential of this architecture.<br />

With the recent development of faster and more complex<br />

Multiprocessor System-on-Chips (MPSoCs), many different<br />

resources are available on one chip. For example, Xilinx's<br />

Zynq UltraScale+ MPSoC combines up to four ARM Cortex-<br />

A53 application processor cores, two ARM Cortex-R5 realtime<br />

cores, an ARM Mali-400 GPU as well as an FPGA fabric<br />

with programmable logic, on-chip memory, hardware<br />

multipliers (DSP slices) and many high-throughput I/Os. The<br />

challenge for the system architect has now become finding the optimal execution environment for each of the design's processes: the<br />

partitioning. The goal is to make best use of the available<br />

resources, minimizing latency and optimizing performance.<br />

In this paper, we use the Fast Fourier Transform as a<br />

computationally expensive algorithm that can be accelerated<br />

through several means:<br />

- multiprocessing on several cores of the same type (Symmetric Multiprocessing)<br />
- vector processing using a special instruction set for Single Instruction, Multiple Data (SIMD), available on most current processors<br />
- using additional, different processors than the ones the main software is run on (Asymmetric Multiprocessing)<br />
- generating accelerator functions that run in the FPGA fabric and using them as external functions<br />
- running the whole algorithm in the FPGA fabric, controlled by a CPU core<br />
- running the algorithm standalone in the FPGA<br />

For each method, we present the communication paths and<br />

software architecture, along with performance data.<br />

The FFT is a well-studied algorithm, and many papers have been published on methods for its efficient execution on specific multiprocessor architectures [1], [2], [3], [4], [5]. It is not<br />

the goal of this paper to improve on these methods, but to<br />

provide an overview and an understanding of the possibilities<br />

available in today's devices.<br />

The paper is organized as follows:<br />

Section II introduces the FFT algorithm and how it can be<br />

calculated on multiple processing devices. In Section III, we<br />

discuss the partitioning methods based on software, executing<br />

on processor cores. The FPGA-based methods are explored in<br />

Section IV. Section V elaborates on ways of collaboration<br />

between the FPGA and processors. Finally, in Section VI we<br />

sum up the advantages of the presented methods.<br />


392


II. FFT PARTITIONING<br />

The discrete Fourier Transform (DFT) is used to transform<br />

a sequence of samples from the time domain into the frequency<br />

domain to analyze the frequency components of the sampled<br />

signal. Spectral analysis, measuring and controlling, signal<br />

processing and quantum computing are but a few applications<br />

of the DFT. The DFT has a very high computational cost of O(N²). The Fast Fourier Transform (FFT) improves efficiency<br />

of the transform by reducing the number of redundant<br />

calculations. This is achieved by splitting the sequence into<br />

smaller parts and performing the Fourier Transform on these as<br />

shown in Fig. 1. In doing so, the computational cost can be<br />

reduced to O(N log N). Note that the splitting includes a<br />

reordering of the input values, effectively selecting every other<br />

input value for each subset.<br />
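The even/odd splitting described above corresponds to the textbook recursive radix-2 formulation, sketched here in C++ (an illustration only, not one of the optimized library implementations discussed later):<br />

```cpp
#include <complex>
#include <vector>
#include <cmath>
#include <cstddef>

using cd = std::complex<double>;

// Recursive radix-2 decimation-in-time FFT. N must be a power of two.
// Splitting into even/odd halves at every level reduces the cost from the
// DFT's O(N^2) to O(N log N).
std::vector<cd> fft(const std::vector<cd>& x) {
    const std::size_t n = x.size();
    if (n == 1) return x;
    std::vector<cd> even(n / 2), odd(n / 2);
    for (std::size_t i = 0; i < n / 2; ++i) {   // the reordering step
        even[i] = x[2 * i];
        odd[i]  = x[2 * i + 1];
    }
    const std::vector<cd> e = fft(even), o = fft(odd);
    const double pi = std::acos(-1.0);
    std::vector<cd> out(n);
    for (std::size_t k = 0; k < n / 2; ++k) {   // combine ("butterfly")
        const cd w = std::polar(1.0, -2.0 * pi * (double)k / (double)n) * o[k];
        out[k]         = e[k] + w;
        out[k + n / 2] = e[k] - w;
    }
    return out;
}
```

The combine loop is the step that needs data from both halves, which is exactly the all-to-all communication discussed below when the halves live on different cores.<br />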

Fig. 1. Principle of the FFT algorithm.<br />

Since the FFT algorithm is a divide-and-conquer approach,<br />

it is well suited for parallel processing on multiple processors.<br />

Each core can perform the smaller FFT on its part of the data,<br />

independent of the remaining data. However, as shown in Fig.<br />

1, there will be at least one step requiring data from other cores<br />

when combining the smaller FFTs into the complete spectrum.<br />

This all-to-all communication is a critical step because it<br />

requires synchronization of the cores. The optimization of this<br />

step has been the subject of several publications, e.g. [2] and<br />

[4].<br />

The technique of calculating smaller FFTs and combining them into larger spectra makes it possible to process FFTs larger than the processor's memory efficiently, provided that the currently unused data is stored in an efficient way. Possible implementations are published in [2], [4] and [6].<br />

III. MULTICORE PROCESSING FOR FFT<br />

A. Available resources<br />

The Xilinx Zynq UltraScale+ MPSoC portfolio offers<br />

multiple ranges of SoCs with varying numbers of processor<br />

cores and FPGA fabric resources. In this paper, we use the<br />

XCZU9EG device, an SoC with the following resources:<br />

- four ARM Cortex-A53 application processing cores, running at 1100 MHz and featuring the NEON instruction set<br />
- two ARM Cortex-R5 real-time processing cores with tightly-coupled memory (TCM) for low-latency access, running at 500 MHz<br />
- an ARM Mali-400 GPU<br />
- an FPGA fabric with 600,000 system logic cells, 32 Mbit of FPGA memory and 2520 hardware multipliers (DSP slices)<br />

Fig. 2 shows the block diagram of the available resources.<br />

The ARM Cortex-A53 core is a mid-range application<br />

processing core that balances power usage vs. performance. It<br />

is equipped with the ARMv8 instruction set, including<br />

NEONv2 SIMD instructions for vectorized execution on<br />

multiple data (up to 128 bit wide). Four A53 cores make up the<br />

Application Processing Unit (APU), in the bottom right of Fig.<br />

2. The ARM Cortex-R5 core is a real-time processor with a<br />

focus on fast reaction to events. Its 128 kB of TCM allow very<br />

fast memory accesses, but in turn limit the amount of data that<br />

can be worked on. Two R5 cores form the Real-time<br />

Processing Unit (RPU) in the top right of Fig. 2. The Level 3<br />

interconnect enables fast data transfers between the APU, the<br />

RPU and the FPGA fabric with on-chip memory, DSP slices<br />

and programmable logic.<br />

B. Executing the FFT in Software<br />

When executing an FFT in software, you have the choice of<br />

several FFT libraries, many of them capable of exploiting both<br />

multiprocessing and vector processing.<br />

ARM Ne10 [7] provides highly optimized NEON routines written in assembly, and its FFT algorithm makes use of these. However, it does not support multiprocessing.<br />

kissFFT [8] is a very lightweight library with the goal of being<br />

easy to use and moderately efficient while supporting<br />

multiprocessing. However, it makes use of NEON instructions<br />

only to execute four separate FFTs in parallel instead of<br />

accelerating one FFT transform.<br />

The fastest and most versatile FFT library that was tested in<br />

our work is FFTW3 [9], exploiting both multi-processing and<br />

NEON instructions and including a mechanism to optimize the<br />

algorithm for the available hardware. This mechanism will test<br />

many possible FFT optimization algorithms, measuring<br />

performance and selecting the fastest one as described in [10].<br />

This is done in order to make the best possible use of first and<br />

second level caches, memory access speeds and other hardware<br />

characteristics. For our implementations, we used FFTW3 on<br />

the A53 and Ne10 on the R5.<br />

Fig. 2. Block diagram of MPSoC.<br />

393


C. Implementations<br />

The software-only implementations were run on the Xilinx<br />

PetaLinux operating system, using the FFTW3 library to<br />

calculate double precision floating-point complex FFTs. Using<br />

double precision limits the speed improvement for the NEON<br />

instruction set to a factor of two, because the NEON registers<br />

are 128 bit wide and can therefore accommodate only two<br />

double precision floating-point values.<br />
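The factor-of-two limit can be pictured without intrinsics: a 128-bit NEON register holds exactly two doubles, so a vectorized loop advances two lanes per iteration, as in this plain-C++ sketch (function names are ours):<br />

```cpp
#include <cstddef>

// Scalar loop: one double multiply-add per iteration.
void axpy_scalar(const double* x, double* y, double a, std::size_t n) {
    for (std::size_t i = 0; i < n; ++i)
        y[i] += a * x[i];
}

// Two-wide loop mirroring what a 128-bit SIMD unit does with doubles:
// two independent lanes per iteration, hence at best a 2x speedup.
void axpy_pairs(const double* x, double* y, double a, std::size_t n) {
    std::size_t i = 0;
    for (; i + 2 <= n; i += 2) {   // "vector" body: two lanes at once
        y[i]     += a * x[i];
        y[i + 1] += a * x[i + 1];
    }
    for (; i < n; ++i)             // scalar tail for odd n
        y[i] += a * x[i];
}
```

With single-precision floats, four values fit into the same register, which is why the theoretical speedup would be four instead of two.<br />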

The following five scenarios were tested:<br />

a. Single-core A53<br />

b. Single-core A53 with NEON instructions<br />

c. Symmetric Multi-Processing (SMP) with four A53<br />

d. Symmetric Multi-Processing (SMP) with four A53<br />

with NEON instructions<br />

e. Asymmetric Multi-Processing (AMP) with the R5 as<br />

coprocessor<br />

Scenarios a-d require no special software stack except the<br />

pthread library for SMP.<br />
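For scenarios c and d, the SMP work split can be sketched with std::thread (pthreads underneath on Linux); the doubling loop is a stand-in for the per-slice FFT work, an assumption made for brevity:<br />

```cpp
#include <thread>
#include <vector>
#include <cstddef>

// Split one buffer across worker threads, as in the SMP scenarios: each
// thread processes a disjoint slice, so no locking is needed while the
// threads run. The last thread picks up any remainder.
void process_smp(std::vector<double>& data, unsigned nthreads) {
    std::vector<std::thread> pool;
    const std::size_t chunk = data.size() / nthreads;
    for (unsigned t = 0; t < nthreads; ++t) {
        const std::size_t lo = t * chunk;
        const std::size_t hi = (t + 1 == nthreads) ? data.size() : lo + chunk;
        pool.emplace_back([&data, lo, hi] {
            for (std::size_t i = lo; i < hi; ++i)
                data[i] *= 2.0;    // stand-in for the per-slice FFT
        });
    }
    for (std::thread& th : pool) th.join();
}
```

The expensive step the sketch omits is the combine phase, where the slices must exchange data, which is exactly the synchronization point discussed in Section II.<br />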

Scenario e requires additional frameworks and drivers for<br />

communication between the Master A53 core and the remote<br />

R5 core to enable AMP. Fig. 3 shows the software architecture.<br />

Two operating systems are required: Linux on the APU master<br />

CPU and FreeRTOS on the RPU slave CPU. First, the APU<br />

boots Linux and uses the OpenAMP framework to load the<br />

RPU firmware into the TCM via a DMA transfer. The RPU is<br />

then booted out of the TCM. The remoteproc driver handles the<br />

life cycle management, allocates the required resources and<br />

creates a virtual I/O (virtIO) device for each remote processor.<br />

RPMsg is the Remote Processor Messaging API, which provides inter-processor communication between processes running on<br />

independent cores.<br />

The flow of OpenAMP booting and software execution is<br />

as follows (as in [11]):<br />

1. The remote processor is configured as a virtual IO<br />

device, shared memory for message passing is<br />

reserved.<br />

2. The Master loads the firmware for the remote<br />

processor into its memory, then boots the remote<br />

processor.<br />

3. After booting, the remote processor creates the virtIO<br />

and RPMsg channels and informs the master.<br />

4. The Master invokes the callback channel and<br />

acknowledges the remote processor and application.<br />

5. The remote processor invokes the RPMsg channel.<br />

6. The RPMsg channel is established, both Master and<br />

Slave can initiate communication via RPMsg calls.<br />

7. During operation, communication buffers in reserved<br />

shared DDR memory are used to pass messages<br />

between the Master and the Slave. Usually, these<br />

buffers are small. To load larger amounts of data, such<br />

as the FFT input and output data, the data is written to<br />

or read from on-chip memory (OCM) of the R5, and<br />

the pointers are passed via message buffer.<br />
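A minimal sketch of the message layout described in step 7. The struct, field names and addresses are illustrative and not part of the OpenAMP API: the point is that the small shared message buffer carries only a command plus pointers and sizes referring to OCM, never the FFT samples themselves.<br />

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical request passed through the small RPMsg buffer.  The bulk
 * data stays in on-chip memory; only its location travels in the message. */
typedef struct {
    uint32_t cmd;       /* e.g. a hypothetical CMD_RUN_FFT               */
    uint32_t n_points;  /* FFT size, e.g. 4096                           */
    uint64_t src_addr;  /* physical address of the input samples in OCM  */
    uint64_t dst_addr;  /* physical address reserved for the result      */
} fft_request_t;

/* Pack the request into the (small) shared message buffer ...           */
static void pack_request(uint8_t *buf, const fft_request_t *req) {
    memcpy(buf, req, sizeof *req);
}

/* ... and unpack it again on the remote (R5) side.                      */
static void unpack_request(const uint8_t *buf, fft_request_t *req) {
    memcpy(req, buf, sizeof *req);
}
```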

Shutdown proceeds in the reverse sequence of the booting<br />

and initialization process.<br />

D. Performance<br />

To provide an overview of the performance, FFTs of three<br />

sizes (4’096, 16’384 and 65’536 data points) were calculated<br />

using the implementation scenarios a-d. The Cortex-R5 can<br />

only perform 4’096 point FFTs due to its limited amount of<br />

OCM.<br />

Fig. 3. Software stack for Asymmetric Multi-Processing (as in [11])<br />



Table I shows the achieved calculation times in<br />

microseconds, comprised of the times for loading the input<br />

data, executing the FFT and storing the result in memory. We<br />

also show the feasible sampling rates that would allow the<br />

CPUs to keep up a seamless processing of the input data.<br />

TABLE I. SOFTWARE FFT PERFORMANCE<br />

Scenario            time (µs) / max. rate (MSa/s)<br />
                    4’096          16’384         65’536<br />
a  A53              320 / 12       1600 / 10      8670 / 7<br />
b  A53 NEON         290 / 14       1460 / 11      7760 / 8<br />
c  4x A53           120 / 33        511 / 32      2770 / 23<br />
d  4x A53 NEON      114 / 35        434 / 37      2290 / 28<br />
e  R5 (a)          1455 / 3          -- / --        -- / --<br />
(a) R5 time includes OpenAMP communication overhead (approx. 100 µs)<br />
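The "max. rate" column follows directly from the measured times: for seamless processing, the CPU must finish one N-point FFT before the next N samples arrive, so the feasible rate is N divided by the calculation time. A one-line helper makes this explicit (values below are taken from Table I):<br />

```c
#include <assert.h>
#include <math.h>

/* Feasible streaming sample rate for seamless processing: the N-point
 * FFT must complete within the time it takes to acquire N new samples,
 * so rate_max = N / t_fft.  Samples per microsecond equals MSa/s. */
static double max_rate_msa(double n_points, double t_us) {
    return n_points / t_us;
}
```

For scenario a this gives 4096 / 320 µs ≈ 12.8 MSa/s, which the table rounds down to 12 MSa/s.<br />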

It is evident from the data that the FFT scales well for<br />

multiprocessing. When using four A53 cores, a speed-up of up to a factor of 3.3 is observed. Scaling is better for large FFTs because there are more calculation steps that do not require all-to-all communication.<br />

More detailed tests have been run with one to four cores,<br />

but for the sake of brevity, the data is not shown here. In<br />

summary, the speedup corresponds almost directly to the<br />

number of cores as long as there remains at least one core for<br />

execution of the processes of the operating system. When all<br />

cores are used for multi-processing, the speedup is capped. A<br />

reasonable explanation is that the FFT is competing with the<br />

other processes, resulting in many context switches.<br />
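As a back-of-envelope check, assuming Amdahl's law holds (a simplification that ignores the OS contention discussed above), the observed 3.3x speedup on four cores implies that roughly 93% of the FFT runtime is parallelizable:<br />

```c
#include <assert.h>

/* Amdahl's law gives S(n) = 1 / ((1 - p) + p/n) for parallel fraction p
 * on n cores.  Solving for p from an observed speedup S:
 *     p = (1 - 1/S) / (1 - 1/n)
 * This is a rough model only, not a measurement from the paper. */
static double parallel_fraction(double speedup, double n_cores) {
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n_cores);
}
```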

Using the NEON instruction set, a speedup of roughly 10%<br />

is observed for single-processing. For multi-processing,<br />

enabling NEON yields a speed gain of 5-20%. This is nowhere near the factor of two that could be expected, given that two values can be processed at the same time. One likely reason is that the NEON unit has its own execution pipeline and registers: if the algorithm cannot be expressed entirely in NEON instructions, data must be transferred between NEON and standard registers.<br />

The R5 is clearly not designed for computationally heavy<br />

tasks; its purpose is to react to events in real time. It<br />

has to be noted that the communication overhead of the<br />

OpenAMP framework contributes approximately 100 µs to the<br />

execution time, but even if this overhead could be avoided, the<br />

R5 would be no competition for the A53s.<br />

IV. ACCELERATORS IN FPGA<br />

Traditionally, bringing an algorithm to programmable logic<br />

means writing HDL code or using an existing IP core. Today,<br />

there are tools to generate HDL code from software code.<br />

Xilinx provides the SDSoC (Software Defined System on<br />

Chip) toolchain that generates logic blocks from your C-code<br />

along with the required data transfer logic. To allow your<br />

software to interface with this computation block, SDSoC compiles a software library with the required functions for configuring the FFT core, loading and storing data, as well as the necessary interrupt service functions.<br />

We have compared the performance of the following<br />

scenarios:<br />

f. SDSoC-Accelerator controlled by A53<br />

g. FFT IP-core controlled by A53<br />

h. FFT IP-core working standalone<br />

Scenario f: An SDSoC accelerator core can only be<br />

implemented for a fixed FFT size. Therefore, three accelerators<br />

were implemented in the programmable logic, clocked at<br />

300 MHz. The big advantage of performing the FFT in<br />

programmable logic is that the processor core can perform<br />

other tasks in the meantime. The processor will still be required<br />

for loading and storing the data. Fig. 4 shows the block<br />

diagram of this setup.<br />

Scenario g: The Xilinx FFT IP-core can be configured for<br />

different FFT sizes at runtime. One instance of the FFT core is<br />

therefore sufficient for our tests. The processor will load the<br />

input data into on-chip memory in the FPGA fabric, then start<br />

the FFT core and finally transfer the processed data back to<br />

DDR memory. These transfers can be done by DMA, leaving<br />

the CPU free for other tasks. This setup is shown in Fig. 5.<br />

Scenario h: If the input data is acquired in the FPGA fabric,<br />

there is no sense in transferring the data to DDR memory first,<br />

then loading it to FPGA on-chip memory for the FFT. Instead,<br />

the FFT core is configured for constant, standalone operation<br />

on the input data stream and a DMA stream is set up for<br />

transfer of the output data to a range of reserved DDR memory,<br />

where the APU can retrieve the processed data for analysis<br />

(See Fig. 6).<br />

Fig. 4. SDSoC accelerators in FPGA.<br />

Fig. 5. FFT IP block, controlled by A53<br />



efficiency reasons, this is best done in on-chip memory<br />

(BRAM).<br />

Fig. 6. FFT IP block, self-controlled<br />

Table II shows the achieved execution times for FPGA-accelerated FFT on the Zynq UltraScale+ MPSoC. Scenarios f and g have similar performance, showing that the SDSoC-generated HDL code is efficient and compares well to the manually optimized HDL code of the FFT IP.<br />

TABLE II. FPGA FFT PERFORMANCE<br />

Scenario            time (µs) / max. rate (MSa/s)<br />
                    4’096          16’384         65’536<br />
f  SDSoC            108 / 38       410 / 38       1720 / 38<br />
g  IP-A53           101 / 41       400 / 41       1680 / 39<br />
h  IP-standalone     51 / 250      202 / 250       807 / 250<br />

Fig. 7. Partitioning FFT, parallel processing in FPGA<br />

The more efficient data path in scenario h, which omits the transfer of the input data from DDR to on-chip memory, easily explains the performance difference between scenarios g and h. In fact, the limiting factor for the sampling rate is the<br />

DMA transfer rate from FPGA fabric into DDR memory.<br />

V. COMBINING FPGA AND PROCESSING SYSTEM<br />

The results in Sections III and IV show clearly that the<br />

FPGA easily outperforms the pure software implementations.<br />

Nevertheless, there are limits to the size of FFT that can be<br />

executed in the FPGA. The FFT IP core can process up to<br />

65’536 points for one FFT. The SDSoC toolchain would allow creating accelerator functions for larger FFT sizes, but the<br />

available FPGA resources (DSP slices and on-chip memory)<br />

would be exhausted quickly.<br />

We have explored ways for the FPGA and processing<br />

system to collaborate in processing the FFT. The goal was a<br />

65’536 point FFT, using fewer resources in the FPGA while<br />

still maintaining good performance.<br />

As shown in Fig. 1, the FFT algorithm is divided into clearly<br />

defined steps that can be processed in separate units. Our idea<br />

was to process the first steps of the FFT in the FPGA grid, then<br />

transfer the data into processor memory and do the remaining<br />

steps in software, as shown in Fig. 7. The FFT would be split<br />

into four 16’384 point FFTs in the FPGA. These smaller FFTs<br />

can be processed either in parallel or in series.<br />

Parallel processing requires four FFT cores with four times<br />

the resource usage. For serial processing (Fig. 8), only one FFT<br />

core is implemented, but the data of the three remaining FFTs<br />

must be stored until the core is ready for processing. For<br />

Fig. 8. Partitioning FFT, serial processing in FPGA<br />

We found that the amount of BRAM resources used is<br />

similar for both the parallel and the serial approach. Table III<br />

shows the number of BRAM blocks and DSP slices used.<br />

Values in parentheses show the percentage of all available<br />

resources. Because the amount of data to be stored or<br />

processed is the same as for a 65’536 point FFT, our approach<br />

even uses roughly the same amount of BRAM as the full<br />

65’536 point FFT core. With BRAM being the most limited<br />

FPGA resource for this application, there is no gain from<br />

partitioning the FFT between FPGA and processing system.<br />

Furthermore, the FFT calculation needs to be finished in the<br />

processing system, adding more latency and resource usage to<br />

the bill.<br />



TABLE III. RESOURCE REQUIREMENTS OF PARTITIONED FFT<br />

Scenario       Configuration       BRAM        DSP<br />
Parallel FFT   4x 16k FFT          232 (27%)   180 (7%)<br />
Serial FFT     1x 16k FFT & BRAM   244 (25%)    45 (2%)<br />
Full FFT       1x 64k FFT          238 (27%)    54 (2%)<br />

VI. DISCUSSION<br />

For the FFT, we have shown that the FPGA fabric is able to<br />

perform several times faster than the complete processing<br />

system of the Zynq UltraScale+ MPSoC. This power can be<br />

harvested in several ways, be it as stand-alone FFT processor<br />

or as an external accelerator function.<br />

Depending on the amount of processing to be done apart<br />

from the FFT, doing the whole transform in the processing<br />

system can also be an option, leaving more room in your<br />

FPGA.<br />

The decision where to execute an algorithm depends on<br />

many factors, such as:<br />

• Where does your data originate? Try to keep it local, reducing the amount of data transfer.<br />

• What are the required data rates? Can the amount of data be transferred over the L3 interconnect without interfering with the remaining processes?<br />

• How well can your algorithm be split up and processed in parallel? The more an algorithm can be parallelized, the better the FPGA will perform in comparison to the processing system.<br />

• How many FPGA resources can you spare for your algorithm?<br />

In the end, it falls to the system architect to choose where and how the data is to be processed. A deep<br />

understanding of the algorithm and both processing system and<br />

FPGA hardware is required.<br />

REFERENCES<br />

[1] P. N. Swarztrauber, "Multiprocessor FFTs," Parallel Computing, Vol. 5,<br />

Issues 1–2, pp. 197-210, 1987.<br />

[2] J. Sánchez-Curto, P. Chamorro-Posada, "On a faster parallel<br />

implementation of the split-step Fourier method," Parallel Computing,<br />

Vol. 34, Issue 9, pp. 539-549, 2008.<br />

[3] T. H. Cormen, D. M. Nicol, "Performing out-of-core FFTs on parallel<br />

disk systems," Parallel Computing, Vol. 24, Issue 1, pp. 5-20, 1998.<br />

[4] E. Chu, A. George, "FFT algorithms and their adaptation to parallel<br />

processing," Linear Algebra and its Applications, Vol. 284, Issues 1–3,<br />

pp. 95-124, 1998.<br />

[5] S. Xue, J. Wang, Y. Li and Q. Peng, "Parallel FFT implementation<br />

based on multi-core DSPs," 2011 International Conference on<br />

Computational Problem-Solving (ICCP), Chengdu, pp. 426-430, 2011.<br />

[6] R. Lyons, "Computing large DFTs using small FFTs", [Online]:<br />

https://www.dsprelated.com/showarticle/63.php<br />

[7] ARM Ne10 Project [Online]: https://projectne10.github.io/Ne10/<br />

[8] kissFFT [Online]: https://sourceforge.net/projects/kissfft/<br />

[9] FFTW3 [Online]: http://www.fftw.org/<br />

[10] M. Frigo, S. G. Johnson, "The Design and Implementation of FFTW3,"<br />

Proc. IEEE, vol. 93, no. 2, pp. 216-231, 2005.<br />

[11] Xilinx User Guide UG1211, "Zynq UltraScale+ MPSoC Software<br />

Acceleration Targeted Reference Design", [Online]:<br />

https://www.xilinx.com/support/documentation/boards_and_kits/zcu102<br />

/2017_2/ug1211-zcu102-swaccel-trd.pdf. Xilinx, Inc. 2017<br />




Analyzing the Generation and Optimization of an<br />

FPGA Accelerator using High Level Synthesis<br />

Sebastian Kaltenstadler<br />

Ulm University<br />

Ulm, Germany<br />

sebastian.kaltenstadler@missinglinkelectronics.com<br />

Stefan Wiehler<br />

Missing Link Electronics<br />

Neu-Ulm, Germany<br />

stefan.wiehler@missinglinkelectronics.com<br />

Ulrich Langenbach<br />

Beuth University of Applied Sciences<br />

Berlin, Germany<br />

ulrich.langenbach@beuth-hochschule.de<br />

Abstract—Multi-Processor System-on-Chip FPGAs can utilize<br />

programmable logic for compute-intensive functions, using so-called Accelerators, implementing a heterogeneous computing<br />

architecture. Thereby, Embedded Systems can benefit from the<br />

computing power of programmable logic while still maintaining<br />

the software flexibility of a CPU. As a design option to<br />

the well-established RTL design process, Accelerators can be<br />

designed using High-Level Synthesis. The abstraction level for<br />

the functionality description can be raised to algorithm level<br />

by a tool generating HDL code from a high-level language like<br />

C/C++. The Xilinx tool Vivado HLS allows the user to guide the<br />

generated RTL implementation by inserting compiler pragmas<br />

into the C/C++ source code. This paper analyzes the possibilities<br />

to improve the performance of an FPGA accelerator generated<br />

with Vivado HLS and integrated into a Vivado block design. It<br />

investigates how much the pragmas affect the performance and resource cost, and shows problems the tool has with coding style.<br />

I. INTRODUCTION<br />

Heterogeneous computer architectures are an increasingly popular way for modern computing systems to further increase computing power. There are multiple ways to compensate<br />

for the stagnation of single-core CPU performance, ranging from instruction set extensions, multi-core processors and<br />

GPUs to coprocessors on FPGAs. Cryptographic and hashing<br />

functions for example can be accelerated on an FPGA. The<br />

advantages of an implementation on an FPGA are almost<br />

ASIC-like computing performance, quick adaption to new<br />

protocols and standards as well as low energy consumption<br />

[7]. To develop such a coprocessor, the logic of the algorithm<br />

has to be described with VHDL/Verilog on Register Transfer<br />

Level. This is complicated because of the high level of detail<br />

and the high susceptibility to errors due to the low level of<br />

abstraction on Register Transfer Level. To lower development<br />

time, one can raise the abstraction level to Algorithm level<br />

through High-Level Synthesis (HLS). Instead of a complicated<br />

description of the accelerator in VHDL/Verilog, High-Level<br />

Synthesis uses standard C/C++ code to describe the logic.<br />

The HLS-Tool Vivado-HLS offers compiler pragmas to further<br />

define the hardware architecture of the C/C++ code. With those<br />

pragmas, the developer can create different implementations<br />

of the same algorithm without touching the functionality by<br />

just inserting or deleting one line in the source code. With<br />

those capabilities, Vivado-HLS can be used for design space<br />

exploration. In this paper, an FPGA coprocessor is used to<br />

accelerate AES encryption and decryption calls from the Linux<br />

Kernel Crypto API.<br />

II. DEFINITIONS AND ABBREVIATIONS<br />

A. Definitions<br />

This paragraph specifies how common FPGA build flow<br />

terms are used in this paper.<br />

Synthesis is the whole process of High-Level Synthesis. It<br />

basically summarizes all design steps from Vivado HLS.<br />

Implementation summarizes all steps from Vivado.<br />

Vivado Synthesis is the Synthesis step inside of the Implementation.<br />

Bitstream is the output of the implementation. It is used to<br />

program the FPGA.<br />

B. Abbreviations<br />

This paragraph gives a short summary of all abbreviations<br />

used in this paper.<br />

AES stands for Advanced Encryption Standard. See section<br />

III-A for an explanation.<br />

BRAM stands for Block RAM. BRAM is one of the resources<br />

on an FPGA. BRAMs are arranged in slices of 36 KBit.<br />

FF stands for Flip Flop. FFs are one of the resources on an<br />

FPGA.<br />

HLS stands for High-Level Synthesis. See section III-C for a<br />

brief explanation.<br />

II stands for Initiation Interval. See section III-B for an<br />

explanation.<br />

LUT stands for Lookup table. LUTs are one of the resources<br />

on an FPGA. They build the logic gates inside of the<br />

FPGA.<br />

RTL stands for Register-Transfer Level.<br />



III. BASICS<br />

This chapter gives a short summary of the most important<br />

basics of the paper. It explains how AES works, what the<br />

design steps are within Vivado HLS and how optimization<br />

with HLS works.<br />

A. AES<br />

The Advanced Encryption Standard (AES) [4] is an encryption<br />

algorithm designed by Joan Daemen and Vincent Rijmen, standardized in 2001, and one of the most important encryption algorithms<br />

today. It is a symmetric block cipher, which means it uses the same key for encryption and decryption. A block cipher encrypts and decrypts blocks of data of a constant size; in this case, the block size is 128 bits or 16 bytes. These<br />

16 Bytes are arranged in a 2-dimensional 4x4-array called<br />

state. The algorithm consists of four base operations repeated<br />

in multiple rounds. These operations are AddRoundKey, SubBytes, MixColumns and ShiftRows. AddRoundKey is a simple XOR of the state with the round key. SubBytes replaces all bytes of the state according to a substitution<br />

box, called S-Box. In this paper, the SubBytes operation was<br />

implemented using arrays with precomputed values as Lookup<br />

tables (this does not mean the LUT hardware resource on the<br />

FPGA, but the basic concept of Lookup tables in software).<br />

ShiftRows cyclically shifts each row of the state according to its row number. MixColumns mixes the 4 bytes of each column so that every input byte affects every output byte. Like SubBytes, MixColumns is implemented using precomputed arrays as lookup tables. To encrypt more than 16 bytes, an operation mode is<br />

required. In this work Cipher Block Chaining (CBC) is used.<br />

In this mode, the ciphertext of a block depends on the plaintext<br />

and the ciphertext of the previous block. This data dependency<br />

cannot be resolved; thus the encryption cannot be pipelined.<br />

In decryption, however, this dependency does not exist, which<br />

enables pipelining.<br />
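The CBC data dependencies become visible in a few lines of C. The single-byte "block cipher" below (an XOR with the key) is a deliberately trivial stand-in for AES, used only to expose the loop structure:<br />

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Toy 1-byte "block cipher" -- a stand-in for AES, not real crypto. */
static uint8_t enc(uint8_t p, uint8_t key) { return (uint8_t)(p ^ key); }
static uint8_t dec(uint8_t c, uint8_t key) { return (uint8_t)(c ^ key); }

/* CBC encryption: c[i] = E(p[i] XOR c[i-1]).  Each iteration needs the
 * PREVIOUS ciphertext, so the loop cannot be pipelined. */
static void cbc_encrypt(const uint8_t *p, uint8_t *c, size_t n,
                        uint8_t key, uint8_t iv) {
    uint8_t prev = iv;
    for (size_t i = 0; i < n; i++) {
        c[i] = enc((uint8_t)(p[i] ^ prev), key);
        prev = c[i];
    }
}

/* CBC decryption: p[i] = D(c[i]) XOR c[i-1].  All inputs (the ciphertext
 * blocks) are known up front, so the iterations are independent and can
 * be pipelined. */
static void cbc_decrypt(const uint8_t *c, uint8_t *p, size_t n,
                        uint8_t key, uint8_t iv) {
    for (size_t i = 0; i < n; i++)
        p[i] = (uint8_t)(dec(c[i], key) ^ (i == 0 ? iv : c[i - 1]));
}
```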

B. Performance of digital circuits<br />

To evaluate the results of High-Level Synthesis, we need<br />

to measure the performance of an FPGA accelerator which<br />

is determined by its throughput. The throughput is influenced<br />

by many different characteristics of the accelerator, which are<br />

listed below:<br />

Clock period is the time period of one clock cycle. All<br />

registers in the design are connected to the same clock<br />

to synchronize read and write operations.<br />

Blocksize is the amount of data that can be read and computed<br />

at once. The unit is bit or byte. In this paper, the blocksize<br />

is 128 bit or 16 Byte.<br />

Latency is the number of clock cycles after reading data until<br />

the result is available at the output registers.<br />

Initiation Interval is the number of clock cycles after reading<br />

a block, until the circuit can read new data.<br />

With these quantities the throughput can be computed as<br />

shown in eq. (1).<br />

BW = BS_total · f / (L_init + L_single + II · (n − 1))    (1)<br />

Fig. 1. Design flow of Vivado HLS [2].<br />

with the throughput BW, the total amount of data BS_total, the clock frequency f, the latency of the initialization process L_init, the latency for a single block of data L_single, the initiation interval II and the total number of data blocks n, which is BS_total / 16 bytes. Without pipelining, the initiation interval<br />

and the latency for a single block are the same, so the terms<br />

are interchangeable. Since the initialization latency is constant at around 1000 clock cycles, it does not influence the throughput for large amounts of data. This simplifies the formula to eq. (2).<br />

BW = BS_total · f / (II · n)    (2)<br />

This formula assumes that there is no input data stream stall<br />

when processing a stream at the input of the accelerator and<br />

the data is read from the output immediately, so it does not<br />

get slowed down by back pressure. The result of this formula<br />

is used as a metric for the performance of the generated<br />

accelerators.<br />
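Eq. (2) can be worked through with illustrative numbers (not measured values from this paper): with a 16-byte blocksize, a 200 MHz clock and an initiation interval of 1, the ceiling is 16 B × 200 MHz = 3.2 GB/s, and an initiation interval of 10 divides that by ten.<br />

```c
#include <assert.h>
#include <math.h>

/* Throughput per eq. (2): BW = BS_total * f / (II * n).  Since
 * BS_total = n * blocksize, the n cancels and this reduces to
 * blocksize * f / II. */
static double throughput_bytes_per_s(double blocksize_bytes,
                                     double f_hz, double ii) {
    return blocksize_bytes * f_hz / ii;
}
```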

C. High-Level Synthesis with Vivado HLS<br />

High-Level Synthesis is the Synthesis of a hardware description<br />

on Register-Transfer-Level (RTL) from a description<br />

on algorithm level. In this paper, we generate an AES accelerator<br />

from C/C++ source code with Vivado HLS. For a<br />

more detailed introduction to HLS see [6]. Figure 1 shows<br />

the design flow of Vivado-HLS. The source code can be<br />

any C/C++ implementation, as long as it makes no<br />

use of dynamic memory allocation. It is recommended to<br />

use a generic implementation instead of an optimized one<br />

for a special compiler or processor. Since an FPGA works<br />

differently than a normal CPU, optimizations for a CPU are<br />

not suited for an FPGA and might even worsen the results. The<br />

correctness of the code at the algorithm level can be checked<br />

with the C-Simulation using a testbench. Now, the interface of<br />

the accelerator has to be declared with the Interface pragma.<br />

399


It describes how the interface has to be generated from the<br />

parameters of the top level function and which type of bus<br />

has to be used. In our case, we used an AXI-Stream interface<br />

for the data input and output with an additional AXI-Lite<br />

interface for control signals like starting the encryption or<br />

changing the encryption key. Once the code is synthesizeable,<br />

it can be optimized using additional pragmas. A list of all used<br />

optimization pragmas and a short description is given below:<br />

Loop Tripcount lets the user specify a minimum and maximum<br />

number of iterations for a loop. This does not<br />

influence the synthesis, but helps to get a precise latency<br />

estimation.<br />

Array Partition partitions an array in multiple smaller arrays.<br />

The default behavior is to only generate one input<br />

and output port for every array. By partitioning it into<br />

smaller arrays with an input and an output port for each<br />

sub-array, the manipulation of single cells of the array<br />

can be parallelized.<br />

Loop Unroll generates multiple instances of the code body of<br />

a loop. If there are no data dependencies between loop<br />

iterations, they can be parallelized by creating multiple<br />

instances of the body.<br />

Pipeline creates a pipelined architecture for a specified function.<br />

This increases the throughput by reducing the initiation<br />

interval.<br />

Inline eliminates the hierarchy level of sub-functions and<br />

dissolves their logic into the logic of the caller function.<br />

The default behavior of Vivado HLS generates one<br />

module for every function in the source code with a submodule<br />

for every sub-function. The logic optimization<br />

only works inside a module on one hierarchy level. By<br />

eliminating those borders with inlining, it is easier for the<br />

optimization to simplify and shorten the RTL description.<br />
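As a sketch of how these pragmas appear in source, an AddRoundKey-style step with the directives above (the function and array names are illustrative, not the code used in this paper). `#pragma HLS ...` is the real Vivado HLS directive syntax; an ordinary C compiler ignores unknown pragmas, so the same source still compiles and runs in software:<br />

```c
#include <assert.h>
#include <stdint.h>

/* AddRoundKey-style step annotated with Vivado HLS pragmas.  In HLS,
 * ARRAY_PARTITION gives every state byte its own register port, UNROLL
 * instantiates 16 parallel XOR operations, and PIPELINE II=1 lets a new
 * block enter every clock cycle.  A plain C compiler ignores the pragmas
 * and executes the loop sequentially with identical results. */
static void add_round_key(uint8_t state[16], const uint8_t round_key[16]) {
#pragma HLS ARRAY_PARTITION variable=state complete dim=1
#pragma HLS PIPELINE II=1
    for (int i = 0; i < 16; i++) {
#pragma HLS UNROLL
        state[i] ^= round_key[i];
    }
}
```

Inserting or deleting one such pragma line changes the generated architecture without touching the functionality, which is what makes this style of design space exploration cheap.<br />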

Before starting the C-Synthesis, we need to specify a target frequency at which the accelerator should operate.<br />

The C-Synthesis generates the VHDL/Verilog code from the<br />

C/C++ source code. The RTL code can be verified using<br />

the C/RTL-Co-Simulation. Now the code gets packaged and<br />

exported with the IP-Packager and can later be included into<br />

a Vivado block design. More details on Vivado HLS can be<br />

found in [2].<br />

IV. TEST SETUP<br />

Figure 2 shows the complete design flow with Vivado HLS<br />

and Vivado. A detailed view of the Synthesis is depicted<br />

in fig. 1. Both tools, Vivado and Vivado HLS, were used<br />

in version 2016.2. Newer versions could not be used because<br />

of incompatibilities with the hardware driver seen in<br />

fig. 4. Starting point is a C/C++ implementation of the AES-<br />

Algorithm. The one used for this paper can be found at [8].<br />

After running through all the steps explained in section III-C<br />

Vivado HLS returns estimations for the resource demand and<br />

the performance of the accelerator, including the initiation<br />

interval and latency. The user can go through the whole hierarchy of the design and see these estimations for every single sub-module.<br />

Fig. 2. Design flow using Vivado HLS and Vivado [3].<br />

The generated IP Core is part of the block design displayed<br />

in fig. 3 at the position highlighted in red. The clock is set to<br />

the estimated clock given by Vivado-HLS. After implementation,<br />

Vivado returns the actual resource costs and a timing<br />

analysis, including signal paths that fail the timing constraints.<br />

Performance tests are conducted on a Xilinx ZC706 board,<br />

featuring a Zynq 7045 SoC. Through a custom device driver,<br />

explained in [1], the Linux OS on the processing system<br />

(PS) of the Zynq SoC on the board is able to accelerate<br />

AES calls of the Linux Kernel Crypto API with the FPGA<br />



Fig. 3. Vivado block design [1].<br />

Fig. 4. Test setup with connection to the Crypto API [1].<br />

accelerator. Figure 4 shows the whole test setup. This does<br />

not always work; there are two different types of failure<br />

that we observe. The first one is the creation of incorrect<br />

logic. When the driver is loaded, the Crypto-API verifies the<br />

correctness with a testbench. If the generated logic is incorrect,<br />

this returns an error message stating that the ciphertext is<br />

wrong. With the other type of failure, the measurement halts at initialization. Since both the driver and the functionality stay the same for all tests, this points to<br />

a failure in the Synthesis. In the results section we do not distinguish between the two types of failure and only state whether the<br />

design passed the test or not.<br />

The accelerator is generated with different sets of optimization<br />

pragmas in five tests. Each test contains 9 designs with<br />

the same set of pragmas, but a different target frequency. It<br />

ranges from 100 MHz to 260 MHz in steps of 20 MHz. For<br />

target frequencies higher than 260 MHz, the design always<br />

fails to meet the timing constraints, so these implementations/frequencies<br />

are not considered in the evaluation. The<br />

different optimizations are as follows:<br />

Test 1 contains no pragmas for optimizing the architecture.<br />

The Interface pragma is inserted, because it is necessary<br />

for the tool to synthesize the IP core. Also the Loop<br />

Tripcount pragma is inserted to generate more precise<br />

estimation results for performance. With this pragma, the<br />

user is able to define a maximum and minimum amount<br />

of loop iterations for the latency estimation.<br />

Test 2 contains the Array Partition pragma. By default, for<br />

every array in the source code, the Synthesis generates<br />

one BRAM with only one read and write port. This<br />

decreases the performance, because the single array elements<br />

can only be read or modified in a sequential<br />

manner. By partitioning the array into multiple smaller<br />

memory blocks with a read and write port for each block,<br />

access to different array elements can be parallelized.<br />

Test 3 extends Test 2 with added Loop Unrolling pragma. It<br />

creates multiple instances of the loop body to calculate<br />

the results in parallel as long as there are no data<br />

dependencies in between the loop iterations.<br />

Test 4 extends Test 3 with added Pipeline pragma. The<br />

Pipeline pragma allows the user to define an Initiation<br />

Interval for the pipeline. We used Initiation Intervals of 10, 3 and 1 clock cycles to test the influence of different<br />

intervals on the performance and the resource cost.<br />

Test 5 extends Test 4 with an Initiation Interval of 1 clock<br />
cycle, with the Inline pragma added. The Inline pragma shifts<br />
the inlined function to a higher hierarchy level and<br />
eliminates hierarchical borders. This does not optimize<br />
the architecture directly, but the following optimization<br />
step in the Synthesis has much more freedom to combine<br />
operations and to remove unnecessary registers<br />
between operations, which reduces latency and resource<br />
cost.<br />
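As an illustration of how these pragma sets appear in HLS C++ source, the sketch below combines the pragmas of Tests 1 to 5 around a placeholder computation. The function and port names are hypothetical and the loop body is not the evaluated AES implementation; only xtime is real AES arithmetic (the GF(2^8) doubling from [4]). An ordinary C++ compiler ignores the HLS pragmas, so the code also runs as plain software.

```cpp
#include <cstdint>

// GF(2^8) doubling (xtime) from AES [4]; the INLINE pragma of Test 5 would
// dissolve this function boundary so the scheduler can merge it into the
// caller and drop the registers at the former interface.
static uint8_t xtime(uint8_t x) {
#pragma HLS INLINE
    return (uint8_t)((x << 1) ^ ((x & 0x80u) ? 0x1B : 0x00));
}

// Placeholder round-style step over the 16-byte state (NOT the real AES
// round). ARRAY_PARTITION (Test 2) splits the state into registers instead
// of a single BRAM; UNROLL (Test 3) replicates the loop body so all 16
// byte operations can be scheduled in parallel.
void round_step(uint8_t state[16], const uint8_t round_key[16]) {
#pragma HLS ARRAY_PARTITION variable=state complete dim=1
    for (int i = 0; i < 16; ++i) {
#pragma HLS UNROLL
        state[i] = (uint8_t)(xtime(state[i]) ^ round_key[i]);
    }
}

// Block loop: INTERFACE (Test 1) selects the bus protocol per port,
// LOOP_TRIPCOUNT (Test 1) bounds the latency estimate only, and PIPELINE
// with an explicit II (Test 4) lets a new block enter every II clock cycles.
void process_blocks(uint8_t *blocks, const uint8_t round_key[16], int n) {
#pragma HLS INTERFACE axis port=blocks
#pragma HLS INTERFACE s_axilite port=n
    for (int b = 0; b < n; ++b) {
#pragma HLS LOOP_TRIPCOUNT min=1 max=64
#pragma HLS PIPELINE II=1
        round_step(&blocks[16 * b], round_key);
    }
}
```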

Since the goal is to test the capabilities of the tool and<br />
compare the influence of different optimizations, the optimizations<br />
focus on the decryption of AES. Encryption and<br />
decryption consist of the same operations in a different order,<br />
but with the operation mode CBC, pipelining is only possible<br />
for the decryption. Section V therefore contains only the<br />
results of the decryption, since the rest (encryption and key<br />
expansion) is not included in the optimization process.<br />

V. RESULTS<br />

The highest RTL hierarchy level of the decryption always<br />

looks the same, no matter which optimizations were applied.<br />

The RTL description is displayed in Fig. 5. While the two<br />

blocks CBC-XOR and AES-Decryption are directly influenced<br />

by the C/C++ source code, the FSM (finite state machine) is<br />

automatically generated.<br />

The tool always returns a specific estimation for resource costs<br />

and performance. In the following paragraphs, the ranges given<br />

for some values are the minimum and maximum values<br />

for the 9 different target frequencies.<br />

A. Test 1: no pragmas<br />

The code without any optimization pragmas generates an<br />

implementation with low resource costs.<br />

Fig. 5. Synthesized Logic AES-CBC-Decryption<br />

The estimated demand of LUTs ranges from 3654 to 3660, the amount of FFs<br />

ranges from 592 to 1155 and 9 BRAM slices are required.<br />

While the estimation of the BRAM is accurate, the actually<br />

required amount of LUTs and FFs is lower than the estimation.<br />

The required amount of LUTs ranges from 593 to 638, the<br />

required amount of FFs from 573 to 689. The minimal latency<br />

ranges from 1011 to 1683, the maximal latency from 1371 to<br />

2235 clock cycles, the initiation interval is identical. This leads<br />

to an estimated throughput of up to 3.06 MB/s according to<br />

eq. (2) for a target frequency of 260 MHz. The maximum for<br />

the target frequency seems to be around 180 MHz, since all<br />

higher frequencies fail to meet the timing constraints after the<br />

implementation. The actual measurement of the throughput<br />

failed for all designs in Test 1.<br />

B. Test 2: array partitioning<br />

Due to the added array partitioning, the estimated and<br />

the actual performance and resource requirements rise. The<br />

BRAM estimation and the actually required amount rise to<br />

40 BRAM slices. The estimated number of LUTs ranges from<br />

4918 to 5044, the FFs from 905 to 1946. The minimal latency<br />

ranges from 416 to 895 clock cycles, the maximum from<br />

562 to 1203. The highest estimated throughput is 6.55 MB/s<br />

for a target frequency of 240 MHz. The actual resource<br />

requirements are again lower than the estimation. The LUT<br />

demand ranges from 2363 to 2679, the FF demand from 880<br />

to 1212. All designs passed the functionality test, the highest<br />

throughput was 5.16 MB/s for a target frequency of 240 MHz.<br />

C. Test 3: loop unrolling<br />

The loop unrolling further increases the performance and<br />

the resource costs. The estimated number of LUTs drops to a<br />

range from 4602 to 4883. This is because the loop calculations<br />

now happen in parallel. Before, there was only one instance,<br />

which was reused for every loop iteration. That required<br />

a control logic which disappears for parallel computation.<br />

The amount of registers rises, since every register has to be<br />

duplicated for every parallel path. This leads to an increase of<br />

the estimated FFs to a range of 1590 to 2529. The required<br />

BRAM stays the same, since the methods that require BRAM<br />

do not contain loops. The minimal latency ranges from 50<br />

to 131, the maximum from 70 to 182. The actual amount of<br />

LUTs ranges from 1774 to 2338, the FFs range from 1592 to<br />

2012. The BRAM demand and estimation are identical. Apart<br />

from a target frequency of 180 MHz, all designs passed the<br />

measurement test with the highest throughput of 26.15 MB/s<br />

for a target frequency of 120 MHz.<br />

D. Test 4: pipelining<br />

1) Initiation Interval = 10 clock cycles: Pipelining only<br />

has a small impact on the minimal latency, but the maximal<br />

latency is now equal to the minimal latency. This is necessary<br />
because pipelining requires a constant latency. For an initiation<br />

interval of 10 clock cycles it ranges from 44 to 131 clock<br />

cycles, the initiation interval is the specified 10 clock cycles.<br />

The LUTs estimation ranges from 7519 to 9693, the estimated<br />

FFs range from 1982 to 4247. The BRAM estimation rises to<br />

80 slices. Since pipelining requires additional instances of all<br />

base operations, the resource consumption increases. 189 MHz<br />

seems to be the maximum for the estimated frequency, since<br />

no design achieves a higher frequency. The highest estimated<br />

throughput is 301 MB/s. After implementation, the required<br />

amount of LUTs is between 2617 and 4028. The required FFs<br />

range from 1736 to 3820. Apart from a target frequency of<br />

180 MHz, all designs pass the measurement test. The highest<br />

measured throughput is 27.24 MB/s for a target frequency of<br />

100 MHz. This is significantly lower than the estimation of<br />

301 MB/s. The reason for this is explained in section VI, since<br />

it influences all pipelined designs.<br />

2) Initiation Interval = 3 clock cycles: The resource costs<br />

increase again, since the synthesis now creates 5 instances of<br />

all base operations. The latency ranges from 44 to 88 clock<br />

cycles. The LUTs estimation ranges from 17499 to 22541, the<br />

FFs from 2381 to 7097. The BRAM estimation rises to 200<br />

slices. The maximum for the estimated frequency seems to<br />

be 189 MHz again, the maximal estimated throughput is 1006<br />

MB/s. The implementation only needs between 4800 and 6635<br />

LUTs and between 2271 and 3915 FFs. Apart from a target<br />

frequency of 180 MHz all designs pass the measurement test.<br />

The highest throughput was achieved by the design with a<br />

target frequency of 140 MHz with 31.28 MB/s, which is again<br />

significantly lower than the estimations.<br />

3) Initiation Interval = 1 clock cycle: There are now 14<br />

instances of every base operation. This leads to a LUT estimation<br />

between 44790 and 45047 and a FF estimation between<br />

3077 and 15603. The BRAM estimation and utilization rise<br />

to 528 slices. The latency drops to a range from 42 to 84 clock<br />

cycles, the highest estimated throughput is 3019 MB/s for<br />

target clocks between 180 and 260 MHz. The actually required<br />

amount of LUTs ranges from 4653 to 5366, the FFs from 1085<br />

to 6857. Target frequencies from 120 to 180 MHz fail, the rest<br />

passes the measurement test. The highest throughput is 30.36<br />

MB/s.<br />

E. Test 5: Inlining<br />

Inlining the base operations decreases the latency to a range<br />

from 28 to 68 clock cycles. The estimated amount of LUTs<br />
ranges from 13660 to 15327 and the estimated FFs range<br />

from 3431 to 13239. The estimated BRAM usage stays at<br />

528 slices. The maximal estimated throughput is 93 MB/s for<br />

a target frequency of 140 MHz when taking into account that<br />

pipelining does not work. The actual LUT usage ranges from<br />

5713 to 8163, the FFs from 3288 to 9918. No design passed<br />

the measurement test.<br />

VI. ANALYSIS<br />

All pipelined designs have an increased resource utilization<br />

in comparison to the designs without pipelining due to<br />

additional instances. The analysis with the Integrated Logic<br />

Analyzer shows that there is always valid data at the input<br />

and the output always waits for the accelerator to read data.<br />

So the problem does not originate from the environment, but<br />

from the accelerator itself.<br />

As an example, we investigate the design of Test 4 with an<br />

initiation interval of 10 clock cycles and a target frequency<br />

of 100 MHz. This design should achieve up to 181.81 MB/s,<br />

but the measurement only shows a throughput of 27.24<br />

MB/s. The synthesis shows that it created an additional<br />

instance of every base operation and they still exist after the<br />

implementation. Pipelining with an initiation interval of 10<br />
clock cycles means that a single data block may occupy one<br />
instance for at most 10 clock cycles.<br />

AES consists of many rounds, as explained in section III-A,<br />

which consist of the four base operations AddRoundKey,<br />

SubBytes, MixColumns and ShiftRows. Depending on the<br />

key size there are between 10 and 14 complete rounds per<br />

data block. The scheduling diagram in Vivado HLS shows<br />

that SubBytes and MixColumns are occupied for 2 clock<br />

cycles in each round. This sums up to between 20 and 28<br />
busy cycles each for SubBytes and MixColumns per data block.<br />
Divided between the two generated instances of each base<br />
operation, this still results in 10 to 14 clock cycles against an<br />
initiation interval of 10 clock cycles. So at least 3 instances<br />
of every base operation are needed to actually enable<br />
pipelining with an initiation interval of 10 clock cycles.<br />
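This instance arithmetic reduces to one ceiling division. The sketch below is a back-of-the-envelope model of the reasoning above, not tool output; the numbers follow the worst case of 14 rounds at 2 busy cycles per round (28 busy cycles per data block).

```cpp
// Best achievable initiation interval with `instances` copies of an
// operation that is busy `busy_cycles` clock cycles per data block:
// ceil(busy_cycles / instances), computed with integer ceiling division.
int achieved_ii(int busy_cycles, int instances) {
    return (busy_cycles + instances - 1) / instances;
}
```

For 28 busy cycles, the two generated instances give an achievable initiation interval of 14, missing the target of 10, while three instances give exactly 10.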

For the same reason, the design with an initiation interval<br />

of 3 clock cycles would need at least 14 instances of every<br />

base operation. For an initiation interval of 1 clock cycle<br />

and a target frequency of 100 MHz the High-Level Synthesis<br />

generates 14 instances of all operations plus an additional<br />

AddRoundKey instance for the first round. Because the<br />
operations SubBytes and MixColumns are always occupied<br />

for 2 clock cycles at a time, it is still not possible to achieve<br />

the requested initiation interval.<br />

To check whether this problem is specific to the tool version,<br />
we repeated the synthesis step of Test 4 with tool version<br />
2017.2, which was the newest available version at the<br />
time. The synthesis results, however, stayed the same: the<br />
resource estimations were identical, as was the generated<br />
architecture. Even with this version, there were always too few<br />
instances of the base operations to enable pipelining.<br />

The goal of this paper was to evaluate the High-Level<br />

Synthesis with Vivado HLS. The generation of an FPGA<br />

accelerator with High-Level Synthesis is faster and easier<br />

compared to writing HDL code. It is possible to generate a<br />

completely different architecture for the accelerator within<br />

60 to 90 minutes including synthesis and implementation.<br />

Optimization pragmas like array partitioning and loop<br />

unrolling work just like they are supposed to. This enables<br />

the user to generate accelerators that are faster than most<br />
software solutions for suitable problems such as encryption or hash<br />
algorithms.<br />

Pipelining, however, does not provide the throughput it should.<br />
Inlining can even change the logic and thus break the<br />
design if applied in the wrong place. The tool seems to<br />
depend heavily on the correct coding style. During the whole<br />
optimization process, only one error message occurred,<br />
stating that the placed optimization pragma does not work at<br />
its position. This was when trying to pipeline the encryption<br />
despite the operation mode making it impossible. In every<br />
other case, the tool reported a correct synthesis; the problems<br />
were only observed when actually loading the design onto an<br />
FPGA and measuring the actual throughput. This behavior<br />
is not reliable and not usable for real-life applications. The<br />
problems with pipelining and inlining keep the user from<br />
creating high-performance, high-throughput designs.<br />

Another, less important problem is that the resource<br />
estimations, especially for the LUTs, are far higher than<br />
the actual usage after implementation. A possible reason<br />
for this could be an incorrect state machine which creates<br />
unnecessary states and logic that later get optimized away<br />
by the logic optimization during implementation.<br />

VII. OUTLOOK<br />

There are multiple ways of progressing with this work. In<br />
this paper, we changed the target frequency for the High-Level<br />
Synthesis. Keeping the same design and increasing<br />
the frequency at the implementation step by step could show<br />
how accurate the frequency estimation is and give another<br />
possibility to increase the throughput. One could also compare<br />
the results of different implementations to find the best coding<br />
style for High-Level Synthesis. One could also try out single<br />
optimization pragmas to see the individual effect of the<br />
pragmas, but the results of this step highly depend on the<br />
algorithm and its implementation, so there would not be a<br />
general knowledge gain from these tests.<br />

VIII. CONCLUSION<br />

In this work we showed how to generate an AES FPGA<br />

accelerator with High-Level Synthesis. It turned out to be<br />

faster and easier compared to standard RTL development<br />

with VHDL/Verilog. Especially for developers who are not<br />
experienced in RTL development, it is a way to still profit<br />
from the compute power of an FPGA. It is also useful for<br />
design space exploration, since it is possible to generate<br />
a completely different architecture within minutes by just<br />
inserting or removing a compiler pragma.<br />



However, the tool is not yet ready to be used for real-life<br />
applications. A main flaw is its dependence on coding style.<br />
It is probably necessary to write code optimized for High-Level<br />
Synthesis, comparable to code optimized for special<br />
processors. Yet this would take away the biggest advantage<br />
of being easy to use. In the current state it is not possible for<br />
the user to generate reliable high-performance, high-throughput<br />
designs.<br />

Another problem is the inaccuracy of the resource estimations,<br />
apart from the BRAM estimations. The estimations were<br />
always too high; the real LUT usage is only a fraction of<br />
the estimation. This would make it possible to implement a<br />
design even though the tool estimates a usage of more than<br />
100%. In the current state, the tool has to improve its<br />
reliability before it can be integrated into a professional real-life<br />
workflow.<br />

REFERENCES<br />

[1] S. Wiehler, CPU-Offloading von Transformationsfunktionen aus dem Linux-Kernel, 2016.<br />
[2] Xilinx, Vivado Design Suite User Guide: High-Level Synthesis (UG902), v2016.2, 2016.<br />
[3] Xilinx, https://www.xilinx.com/content/dam/xilinx/imgs/applications/isolationdesign-flow/idf-flowchart.jpg.thumb.319.319.png, accessed 16.1.2018.<br />
[4] National Institute of Standards and Technology, FIPS PUB 197: Advanced Encryption Standard (AES), 26.11.2001.<br />
[5] Xilinx, Vivado Design Suite: AXI Reference Guide (UG1037), 15.6.2017.<br />
[6] P. Coussy, D. D. Gajski, M. Meredith and A. Takach, An Introduction to High-Level Synthesis, 2009.<br />
[7] A. DeHon, Fundamental Underpinnings of Reconfigurable Computing Architectures, 3.3.2015.<br />
[8] kokke, https://github.com/kokke/tiny-AES128-C, 2017.<br />



Test automation for reengineered modules using test<br />

description language and FPGA<br />

T. Krawutschke, G. Hartung, N. Kopshoff<br />

Faculty of Information, Media and Electrical Engineering<br />

TH Köln<br />

Köln, Germany<br />

M. Schulze, G.B. Faluwoye, C. Hoffmann<br />

OTL Elektrotechnik und Audio<br />

Bonn, Germany<br />

Abstract— Reverse engineering is a very important technique<br />

in systems where the overall system lifetime is much longer than<br />

the lifetime of its electronic and digital components. Obsolete<br />

components (e.g. processors from the 1970s, digital logic<br />

components) are replaced with FPGAs. The systematic test of<br />
these reverse engineered devices is the subject of a project<br />
carried out by scientists of TH Köln and engineers of the<br />
company OTL, which specializes in reverse engineering.<br />

Keywords—Reverse Engineering, Test Automation,<br />

Obsolescence<br />

I. INTRODUCTION<br />

The process of reverse engineering includes the<br />

identification of system parts and their interrelationships with<br />

the aim to develop a representation of the system in another<br />

form or at a higher level of abstraction [1].<br />

A reverse engineered electronic device (e.g. a board that is<br />
part of a system assembled from different devices in a rack)<br />
that is operated in railway vehicles needs approval before it<br />
can be used for transportation. Several engineering standards<br />
have to be considered. EN 50126 is one of them and<br />
defines the lifecycle of an electronic device or system. The<br />
extended V-Model considers a phase of “change,<br />
retrofit/upgrade” (see Fig. 1). Reverse engineering is part of this<br />

phase, when the requirement documents and engineering plans<br />

of the original devices are missing.<br />

Fig. 1. Extended V-Model (DIN EN 50126)<br />

Several standards related to<br />
the development of safety critical systems demand a complete<br />

chain of documentation of the development process. The<br />

combination of a Test Description Language (TDL) and a Test<br />

Automaton improves test coverage and documentation<br />
compared to manual testing.<br />

FPGAs are used in two application areas: first, as the target<br />
technology for a replacement device to replace obsolete logic<br />
components and microcontrollers; second, as a measurement<br />
tool and pattern generator for the Test Automaton. The general<br />
process of reverse engineering and the usage of an FPGA as<br />
target technology are shown in the next chapter. The third<br />
chapter introduces the TDL and the Test Automaton, using<br />
FPGAs for the data processing of measurements and stimulus<br />
generation. The fourth chapter describes the collaboration of<br />
the software interpreting the TDL and the hardware of the Test<br />
Automaton. The last chapter gives an example of a reverse<br />
engineered device that is part of a railway slide protection<br />
system.<br />

II. GENERAL REVERSE ENGINEERING PROCESS WITH FPGA<br />

The aim of the reverse engineering process with FPGA as<br />

target technology is the development of a model that is<br />

functionally equivalent to the original device. The reverse<br />

engineering process of a digital electronic device without<br />

support of TDL and a Test Automaton consists of five steps (as<br />

illustrated in Fig. 2):<br />

The first step involves the analysis of the behavior of the<br />
original circuit and the digital components, which requires a test<br />
setup with a pattern generator and a logic analyzer. By<br />
stimulating the original device and measuring its reactions<br />
(measured values of outgoing signals), a set of test cases is<br />
produced which is used for verification of the next steps.<br />

In the second step, the analysis of the circuit and the logic<br />
components of the original device produces a simulatable<br />
VHDL model of the device. This model may not be fully<br />
synthesizable, which means that the describing VHDL code (or<br />

parts of it) cannot be transferred to an FPGA. The simulatable<br />
model should represent the whole electronic device including<br />
parts that are external to the FPGA. Analog parts that<br />
interface to the digital circuit may be emulated to achieve a<br />
simulation model with the same pin-out as the original<br />
device. For this initial step it is sufficient to have a purely<br />
simulative model. This VHDL model becomes the Device<br />
Under Test (DUT).<br />

Fig. 2. General design flow reverse engineering<br />

A Testbench that stimulates the DUT using the test cases<br />
from step 1 is created in the third step. The hierarchical<br />
concept of VHDL allows the arrangement of the derived model<br />
as a subordinate of the Testbench. The Testbench connects to the<br />
inputs and outputs of the DUT. In general it contains a list of<br />
stimuli that are passed to the inputs of the DUT, and it evaluates<br />
the responses of the model that are observed at its outputs.<br />
Moreover, the Testbench contains assertions about the<br />
expected behavior. The evaluation of the assertions is reported<br />
as a simple pass/fail expression (see Fig. 3). If differences show<br />
up, they may result either from a misunderstanding of the<br />
original device, which leads to a change in the VHDL model, or<br />
from an uncritical derivation of timings, which leads to a change<br />
in the assertions of the Testbench. The result of this step is a<br />
validated VHDL model of the original device and a set of<br />
refined test cases that can be used to test the reverse<br />
engineered device.<br />

Fig. 3. VHDL represents the reverse engineered original device<br />

The design and partitioning of the reverse engineered<br />
device is part of the fourth step. Until this step the HDL that<br />
describes the FPGA is independent of the FPGA vendor. With<br />
the definitive selection of the hardware, vendor specific<br />
constraints are added. These cover pin assignment, logic<br />
levels, clock settings etc. Should the selected FPGA component<br />
become obsolete later in the lifecycle, a replacement with<br />
another FPGA is possible. The VHDL description stays<br />
unchanged; the constraints are adapted to the new FPGA<br />
component.<br />

The vendor tool synthesizes, maps and fits the VHDL and<br />
constraint files to the FPGA. A post-fitting simulation reflects<br />
the expected timing behavior of the programmed FPGA. The<br />
test cases are now used to validate the synthesized design<br />
against the desired behavior before a prototype is built.<br />
Finally, a validation of the reverse engineered device is done<br />
with the same setup and test cases that were developed for the<br />
validation in the first step (see Fig. 4).<br />

III. REVERSE ENGINEERING WITH TDL AND A TEST AUTOMATON<br />

The process of reverse engineering outlined in the last<br />
chapter is a proven method. It takes extensive effort in<br />
every step. Moreover, there are two major difficulties within it:<br />

• The process is error prone due to the difficulties in<br />

describing test cases correctly. VHDL pattern files are<br />

not really a good documentation of a test case.<br />

Fig. 4. The prototype of the reverse engineered device is tested<br />


Fig. 5. Testbench and test definition are derived from a single TDL document<br />

• The tests of the reverse engineered device as well as<br />
of the original device are done by hand. Besides the<br />
effort of doing several tests manually, it raises the<br />
serious question of how these tests are documented in a<br />
manner which is acceptable for the required test chain<br />
(see section I).<br />

To overcome these difficulties we defined a Test Description<br />
Language (TDL) and a Test Automaton which allow<br />
digital devices to be tested automatically.<br />

The Test Description Language (TDL) is defined in<br />
accordance with ETSI Standard ES 203 119-1 [2]. A TDL file<br />
contains one or more test case definitions. Different test<br />
descriptions can be implemented in a single TDL file, e.g. to<br />
define tests that deliberately fail in order to provoke an intended failure<br />
report. A test description itself describes the stimuli and the<br />
expected responses of a test sequence. The TDL syntax is<br />
implemented as a Domain Specific Language (DSL [5]) which<br />
is processed by a parser and code generators as follows: the<br />
Xtext [3] framework is used in conjunction with the abstract<br />
syntax tree defined in accordance with the ETSI standard to<br />

// begin declarative part, define participating instances<br />

TDLan Specification achskarte{<br />

...<br />

Test Configuration ak_cf{<br />

instantiate AKT as DUT of type hardware having {<br />

...<br />

instantiate TB_a as Tester of type TB having{<br />

...<br />

// Test Automaton Hardware I/Os<br />

SignalAdapter Configuration de0_nano_output {<br />

attach geber_def 0 downto 0 to position 4 downto 4;<br />

attach fl 1 downto 0 to position 2 downto 1;<br />

logiclevel TTL;<br />

...<br />

// connect instances<br />

connect gate stim_out to gate eight_bits_in;<br />

// begin test description<br />

Test Description test_geberdef {<br />

use Test configuration: ak_cf{<br />

...<br />

TB_a sends bit value of b0 to gate f1; gate f1 waits for<br />

(877 microseconds);<br />

...<br />

AKT sends bit value of b0 to gate geber_def;<br />

...<br />

Fig. 6. Example TDL Code<br />

implement the parser for the TDL DSL. The code generator<br />
realized on top of the parsed structure creates the VHDL<br />
Testbenches and test definition files (see Fig. 5).<br />
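As a hypothetical sketch of what such a generator backend might emit (not the project's actual Xtend templates; all signal names are illustrative), one parsed TDL "sends ... waits" pair could be turned into a VHDL testbench fragment driving the signal plus a matching assertion:

```cpp
#include <sstream>
#include <string>

// Emits a VHDL testbench fragment for one stimulus: drive `signal` to
// `value`, wait `wait_us` microseconds, then assert the expected value.
std::string emit_stimulus(const std::string& signal, char value, int wait_us) {
    std::ostringstream vhdl;
    vhdl << "    " << signal << " <= '" << value << "';\n"
         << "    wait for " << wait_us << " us;\n"
         << "    assert " << signal << " = '" << value << "'\n"
         << "        report \"" << signal << " mismatch\" severity error;\n";
    return vhdl.str();
}
```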

The Test Automaton is a real hardware device. It consists<br />
of different Signal Processing Modules (SPM) for measuring or<br />
stimulus generation under the control of a PC, on which the files<br />
describing the tests and the measurements are archived. Each<br />
SPM contains an FPGA/memory combination for data<br />
processing and an interface to transfer measurement or stimulus data<br />
from/to the control PC (see Fig. 9). The SPMs are synchronized<br />
by a common clock and trigger distribution unit. The control<br />
PC is not only used as the data storage but also as the central<br />
documentation storage, since every test action is logged.<br />

To integrate the test scenario into the test description<br />
documents, we introduced a declarative part syntax in the TDL<br />
(see Fig. 6). This allows describing the test members and their<br />
setup. The interfaces of the DUT are declared and the required<br />
SPMs are configured and instantiated. The connection between<br />
Test Automaton and DUT is defined, so that the tester knows<br />
how to connect the DUT to the SPMs and, moreover, the VHDL<br />
Testbench has input and output ports matching the VHDL<br />
model of the original device. The last two lines of the TDL<br />
example code in Fig. 6 show examples for the test<br />
description, containing one stimulus and one assertion part of<br />
the TDL. With these elements, the TDL description is not only<br />
capable of defining the complete test scenario for a reverse<br />
engineered device but is also a key element in the provable and<br />
documented process of testing it.<br />

IV. TEST PROCESS WITH TDL AND THE TEST AUTOMATON<br />

The process of reverse engineering stays similar to the<br />
one described before (see Fig. 7): the TDL file is created during<br />
the analysis part of the reverse engineering process and used<br />
for creating VHDL test benches for simulations and for providing<br />
data for the measurements with the real hardware (reverse<br />
engineered and original device).<br />

With our Xtext/Xtend based generators, several files (the VHDL<br />
Testbench, stimuli files used by the pattern generators, and a<br />
converter to ‘translate’ measured data into VHDL form) are<br />
generated that support the different tests (shown in Fig. 9).<br />

While step 2, the analysis and creation of a first model of<br />

the reverse engineered device, is unchanged, in step 3 we use<br />

the generated Testbench to simulate the VHDL model that was<br />

derived from the original device. The generated assertions are<br />

evaluated by the VHDL simulator which collects results in the<br />

simulation report. The same is done in step 4: the post-fitting<br />

VHDL model is tested using the generated Testbench.<br />

In step 5 the measurement is prepared by generating several<br />
files with the test pattern generator from the TDL stimulus data<br />
for the different SPMs. The Xtext generators have already mapped<br />
the stimuli to the output ports (defined in the declarative part of<br />
the TDL) and converted them to the exchange format for the<br />
SPMs. Before the testing starts, the stimulus data sets are<br />
uploaded to the dedicated output SPMs of the Test Automaton.<br />
A global trigger signal starts the pattern generation in parallel<br />
on all SPMs and activates the input SPMs to capture the<br />

measurement data during the test process. After the completion<br />
of the test sequence the binary measurement data is<br />
downloaded from the input SPMs to the control PC. Since the<br />
measured data is distributed over several files for each SPM,<br />
the data is merged into a single file and converted to the Value<br />
Change Dump (VCD, see [4]) format based on the TDL test<br />
definition.<br />

The VCD file containing the measurements is used for the<br />
validation of the measured data. The assertions are already<br />
present in the common Testbench used to simulate the<br />
behavioral VHDL model in the third step. To include the VCD<br />
data in the simulation, an interfacing VHDL file is needed. As<br />
already mentioned, the Xtext generator prepared this interfacing<br />
VHDL file using the input and output port naming defined in<br />
the TDL file in the first step. A simulation run with the<br />
simulator will now check the results of the measurements and<br />
generate a report stating whether the measured device<br />
behaves as expected.<br />

The use of the VCD data format has an additional benefit:<br />
the measurement can easily be displayed in a waveform<br />
viewer for inspection and the waveforms can be used for<br />
documentation of the test results.<br />

Fig. 7. Reverse engineering flow with TDL<br />
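A minimal sketch of the VCD text format mentioned above (see [4] for the full specification): a header declaring the signals, followed by timestamped value changes. The signal name below is borrowed from the TDL example and is illustrative; a real SPM merge would emit one $var entry per captured DUT pin.

```cpp
#include <sstream>
#include <string>

// Builds a minimal VCD file as a string: header, one wire declaration,
// and two timestamped value changes.
std::string make_vcd() {
    std::ostringstream vcd;
    vcd << "$timescale 1 ns $end\n"
        << "$scope module dut $end\n"
        << "$var wire 1 ! geber_def $end\n"   // '!' is this signal's id code
        << "$upscope $end\n"
        << "$enddefinitions $end\n"
        << "#0\n"        // t = 0 ns: signal low
        << "0!\n"
        << "#877000\n"   // t = 877 us (cf. the wait in the TDL example)
        << "1!\n";
    return vcd.str();
}
```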

V. EXAMPLE OF A REVERSE ENGINEERED DEVICE<br />

The slide protection system of a city railway consists of<br />
several devices, some of them containing 8-bit<br />
microcontrollers introduced in the 1970s. Here, the focus is set<br />
on one of the devices that measures and evaluates the speed of<br />
an individual axle of the railcar. The speed is measured by a<br />
pulse encoder and the resulting frequency is transformed and<br />
digitized in the device. If the value of deceleration is too high,<br />
the device assumes imminent sliding of the train wheels and<br />
influences the pneumatic brake system by triggering valves to<br />
reduce the brake force.<br />

Since some of the original components are obsolete, the company OTL decided to reverse engineer the device using an FPGA, which replaces nearly all of the original digital components, including the microprocessor, which is mapped onto a freely available emulation. Neither the original documentation nor the software source was available, so the reverse engineering process was difficult. At first, several test cases<br />

were developed to stimulate the device with different speed<br />

profiles and the resulting valve activation sequences were<br />

noted. Additionally, the embedded software was examined to<br />

find the stimulation sequences of the different programmed<br />

reactions on speed changes. Test cases for the self-test feature<br />

and the failure detection were developed.<br />

The reverse engineered device contains an FPGA for the<br />

digital logic including the microcontroller emulation running<br />

the original software. The input electronics interfacing to the pulse encoder and the output electronics controlling the valves are almost unchanged. A synthesizable VHDL model of the FPGA contents was developed and embedded in a (not synthesizable) VHDL model of the whole device. This allows the simulation of the complete device.<br />

Fig. 8. Original device (left) and prototype<br />

An adapter board was developed to interface the original<br />

and the reverse engineered device to the Test Automaton to<br />

shift voltage levels and emulate the valves and their possible<br />

failures like short circuit or cable break. The detection of these<br />

failures is a function of the device that is tested.<br />

Individual tests were carried out with a prototype of the Test Automaton using the original and the reverse engineered devices as DUT, as well as in simulation with testbenches derived from TDL.<br />

VI. SUMMARY AND OUTLOOK<br />

We showed a test concept that integrates all test definitions<br />

in a single TDL file. Testbenches, stimuli and the evaluation of<br />

the measured data are derived from this single source to<br />

maintain consistency during the whole reverse engineering<br />

process. The simulation used to develop the prototype is tested<br />

with the same test data as the original device and the prototype<br />

of the reverse engineered device. Each simulation and<br />

measurement is documented by a test report. The usage of a<br />

DSL and code generators lets us automatically build the different test representations from a single TDL file.<br />

A useful extension for the data management and<br />

organization would be the addition of a database with a data model specialized for testing, as realized in ASAM ODS [6].<br />

The proof of test coverage and automatic test generation would<br />

be desirable features but would require additional research in the<br />

field of testing. To ease the process of approving devices with<br />

embedded software, one can think of replacing simple<br />

programs with state machines and a finite state space to<br />

achieve test coverage of 100%.<br />

ACKNOWLEDGMENT<br />

We would like to thank the Federal Ministry for Economic Affairs<br />

and Energy of Germany for supporting this project and Gudrun<br />

Neumann from TÜV Saar for guidance in the process of<br />

approving safety related electronic devices.<br />

REFERENCES<br />

[1] Elliot J. Chikofsky. Reverse Engineering and Design Recovery: A<br />

Taxonomy. IEEE Software, 1990.<br />

[2] ETSI ES 203 119 v1.1.1. Methods for Testing and Specification (MTS);<br />

The Test Description Language (TDL); Specification of the Abstract<br />

Syntax and Associated Semantics. European Telecommunications<br />

Standards Institute (ETSI).<br />

[3] Xtext. http://www.eclipse.org/Xtext; Eclipse.<br />

[4] VCD, IEEE Computer Society: 1364-2001 IEEE Standard Verilog<br />

Hardware Description Language.<br />

[5] DSL, Martin Fowler. Domain Specific Languages. 1st edition. Addison<br />

Wesley, 2010. ISBN: 0321712943, 9780321712943.<br />

[6] ASAM ODS, Association for Standardization of Automation and<br />

Measuring Systems, Open Data Services, www.asam.net<br />

Fig. 9. Software and hardware architecture of the Test Automaton<br />



Hardware Deceleration<br />

The Challenges of Speeding up Software<br />

Kris Chaplin – Embedded Technology Specialist<br />

Intel Programmable Solutions Group<br />

Holmers Farm Way<br />

High Wycombe, Buckinghamshire, UK<br />

Kris.chaplin@intel.com<br />

Abstract— Developing a custom ASIC, or designing for a SoC<br />

FPGA, gives us the potential to create very specific accelerators<br />

to speed up software bottlenecks. However, this is not without its<br />

challenges. How do you account for cached data and translation<br />

from virtual to physical addresses when moving data payloads<br />

from user space into the hardware? Moving data from the SoC<br />

FPGA to the accelerator and back potentially has a significant<br />

software overhead before accelerators can be started. This paper<br />

will discuss techniques and mechanisms to allow hardware<br />

accelerators to accelerate, rather than slow down, a system (even<br />

accounting for the potential overhead required).<br />

Keywords—FPGA; acceleration; OpenCL; HLS; VHDL;<br />

Verilog<br />

I. INTRODUCTION<br />

FPGA devices can often play a vital role in accelerating<br />

functions that cannot be performed quickly or efficiently<br />

enough in software. In many cases, the bespoke nature of<br />

accelerators can make custom designs faster and more power<br />

efficient than their software counterparts. There is, however, no single acceleration technique that will work for all<br />

scenarios, and a broad range of acceleration methodologies can<br />

be used.<br />

This paper addresses some techniques that can be used to<br />

determine an accelerator strategy, and highlights some of the<br />

pitfalls that can be encountered should a sub-optimal<br />

acceleration strategy be used.<br />

II. DEFINING AN ACCELERATION STRATEGY<br />

It is vital that the system architect understand the data flow<br />

throughout their system to make informed choices about where<br />

acceleration would make sense versus where it would not.<br />

Also, there are different techniques that can be used to<br />

implement acceleration, each with their own unique benefits<br />

and drawbacks. This can quickly cause confusion and<br />

indecision. In some cases, offload to hardware can cause a<br />

reduction in performance, so it is in no way a ‘one size fits all’<br />

solution.<br />

When architecting a system for which performance is<br />

critical, one approach is to take an existing optimized algorithm<br />

or system and choose a faster processor to improve<br />

performance further. This has in the past been valid, with<br />

Moore’s law allowing for a steady increase in performance<br />

over time. However, this exponential increase in performance<br />

is not limitless [1] and cannot alone solve all increased<br />

performance needs. When looking for breakthrough<br />

performance gains, it can sometimes not be enough to just ‘go<br />

faster’. In some cases, no faster solution exists, or is too<br />

expensive. It is then vital that alternative acceleration solutions<br />

are explored.<br />

Software profiling techniques, with tools such as the GNU<br />

Profiler (GPROF) [2] or Arm® DS-5 [3], are key to<br />

understanding which parts of a given algorithm are taking the<br />

most time to execute, and which architectural features of a<br />

processor are being used at any moment in time. By<br />

understanding these performance bottlenecks, it is possible to<br />

identify candidates for software optimization or acceleration.<br />

Once target functions or workloads are identified, the<br />

architect can then decide on which hardware acceleration<br />

technique would be the most appropriate and efficient.<br />

Hardware acceleration implementation techniques can be<br />

broadly divided into several categories:<br />

A. Bump in the wire / pre-processing / post-processing<br />

If an input/output interface is presented to the FPGA/ASIC<br />

rather than directly into the processor, then there is the<br />

possibility of performing work on the external data stream<br />

without direct involvement of the processor. This is known as<br />

pre- or post-processing.<br />

Fig. 1. Pre-processing data prior to CPU input<br />

www.embedded-world.eu<br />



For example, in a video system, functions such as deinterlacing,<br />

scaling, color conversion and format conversion<br />

can be performed on the incoming video stream before a Direct<br />

Memory Access (DMA) controller copies the data into a<br />

framebuffer. In this way, those memory-intensive functions<br />

can be completely removed from the CPU workload.<br />

Another example would be streaming Ethernet traffic. If<br />

this data was input via FPGA pins, then firewall or routing<br />

specifications could be implemented in FPGA hardware before<br />

authorized packets are received by the processor.<br />

1) Potential for performance gain<br />

For a data stream where it is feasible to pre-process data such that it reduces the CPU workload, the performance gains should be relatively predictable. In addition, as the FPGA has<br />

a real-time, deterministic architecture, the data stream can<br />

benefit from these characteristics, with real-time tasks<br />

performed with clock-cycle accuracy.<br />

2) Potential for degradation in performance<br />

Pre- or post-processing of data will add to the latency<br />

between the processor and the I/O. In some instances, the<br />

latency of the transaction is critically important to the overall<br />

system performance, and needs to be minimized.<br />

B. Tightly coupled instruction set extension<br />

Some soft processor cores have a facility to allow for the extension of the instruction set into the fabric of an FPGA. This can be via a bespoke custom instruction interface [4] or<br />

via a dedicated FIFO channel from the processor core [5].<br />

These interfaces allow for a custom instruction set to be<br />

created, and in some cases direct access to the register file of<br />

the processor.<br />

This form of acceleration is appropriate to very fine-grained<br />

acceleration, where simple register inputs and outputs are used,<br />

such as a binary operation, rotation or hash.<br />

1) Potential for performance gain<br />

If the accelerated function would be used frequently by<br />

critical code in the system, then an acceleration can be<br />

achieved. Tight code loops can be accelerated compared to a<br />

multi-instruction approach due to the close proximity of the<br />

accelerator to the CPU (in or close to the instruction pipeline). Ideally, all data should be internal to the processor; any external data fetches may slow down the custom instruction to the point where software can be just as fast. These interfaces also tend to run at the same clock speed as the processor, and as such the instruction needs to be designed so as not to become the critical timing path of the design.<br />

2) Potential for degradation in performance<br />

In architectures where the custom instruction can cause a CPU stall, it is important to ensure that the efficiency of the CPU pipeline is maintained. If the custom instruction has a dependency on an external data source, it could stall for a long time waiting for a new data value.<br />

Fig. 2. Custom Instruction Accelerator connected to soft CPU<br />

C. Tightly-coupled memory-mapped accelerator<br />

Processor systems implemented within FPGA architectures have the advantage of locally exposing processor system buses to the FPGA fabric. This allows for tight integration of a memory-mapped interface for custom acceleration logic. In the case of soft processors, the latency of such interfaces can be in the order of a few clock cycles. Connecting to faster hard processor cores can have a latency of some tens of CPU clock cycles, due to clock domain crossing and CPU interconnect latencies.<br />

Memory-mapped accelerators have the advantage of being potentially asynchronous to the CPU core. In contrast to custom instruction implementations, the coupling with the processor is looser, and as such the accelerator can run in a different clock domain and be far more complex.<br />

Fig. 3. Memory Mapped Accelerator via FPGA Bridge in SoC device<br />

1) Potential for performance gain<br />

If data can be streamed to the accelerator, with the results being read later, then its pipeline can be filled and performance is maximized. Further performance gains can be achieved if the accelerator acts as a bus master or is filled by a DMA engine, as this further offloads the CPU from driving the bus transactions. In this scenario the CPU can work in parallel on other tasks while waiting for the acceleration transaction to complete. The accelerator can have local memory for data storage (parameters/intermediate values) and can also stream data results directly back to memory without direct CPU involvement.<br />

D. External, bus-connected accelerator card<br />

When the FPGA and processor are on physically separate boards, chips or die, an interface needs to be established between the components. Industry-standard memory-mapped interfaces such as PCI Express® can be used to couple the FPGA accelerator to the host processor system. With vendor-specific acceleration hardware, it is also possible to use bespoke interfaces to connect at lower latency and provide cache-coherent interfaces.<br />

1) Potential for performance gain<br />

Where data can be constantly streamed over the interface to<br />

the accelerator card, and resultant calculations returned, the<br />

latency of the link only serves as an initial latency, and once<br />

the pipeline is filled, full-bandwidth use can be made of the<br />

accelerator.<br />

2) Potential for degradation in performance<br />

Compared to on-chip interfaces, board-to-board standards<br />

such as PCI Express generally have higher latency. This is due<br />

to the serial nature of the protocol as well as the transaction<br />

overhead. As such, it is even more important to be able to<br />

understand and compensate for this latency in the design of the<br />

data flow between CPU and accelerator. Additionally, as a<br />

shared resource, external system buses can have lower<br />

performance when the interconnect is heavily loaded from<br />

other masters.<br />

E. Cache-coherent accelerator with on-chip processor<br />

interface<br />

In any processor system that has more than one CPU core,<br />

there is an architectural consideration that needs to be<br />

addressed – cache coherency. With more than one processor<br />

under control of the same operating system, working on<br />

common workloads, it is likely at times that data that had been<br />

modified on one CPU core will need to be worked on by another. For performance reasons, it is highly likely that this<br />

data is cacheable, such that the data can be stored local to the<br />

CPU, and so both level 1 caches (local to the processor) and<br />

level 2 caches (common between multiple processors) are<br />

enabled.<br />

A hardware mechanism is needed to maintain cache<br />

coherency across these cores. That is, if data is cacheable on<br />

both processors, and changes are made on both processors<br />

during the lifetime of the data, the changes need to be<br />

automatically communicated to each cache to maintain correct<br />

data integrity. This is the purpose of cache snooping<br />

mechanisms, under the control of a cache coherency unit<br />

(CCU).<br />

When an accelerator makes use of system memory to<br />

transfer data to the host CPU, it can make sense to extend the<br />

cache snooping mechanisms into the accelerator. In this way,<br />

data can be cache-enabled on the processor and does not need<br />

to be flushed to main memory before being re-fetched by the<br />

accelerator to do work.<br />

For architectures with high latency to external memory,<br />

certain hardware workloads could be impossible to accelerate<br />

without cache coherency due to the delays involved in the<br />

flushing of data prior to acceleration.<br />

Depending on the architecture, some FPGAs with on-die<br />

processors allow for the accelerator to participate in cache coherency with the processor. By using such interfaces, the user can mitigate some of the performance limitations associated with flushing data to memory. An Accelerator Coherency Port (ACP) allows an arbitrary master to participate in the Snoop Control Unit's (SCU) view of cacheable memory.<br />

1) Potential for performance gain<br />

Participation in cache coherence can have major<br />

performance benefits in some instances compared to memory-mapped systems. In a system that is to offload part of a<br />

process to hardware, time would usually be taken flushing<br />

cached data to a memory device, to hand over payloads to the<br />

accelerator. By enabling the accelerator to directly access the<br />

L2 cache, and participate in snooping of L1 CPU local caches,<br />

this flushing operation need not happen.<br />

2) Potential for degradation in performance<br />

Participation with the caches needs to be managed to prevent<br />

‘cache thrashing’. If large payloads are being moved through<br />

the ACP interface, and the data is not already present in L1 or<br />

L2 cache, then the L2 Cache unit would fetch new cache lines<br />

to serve the request. If the ACP-connected accelerator is<br />

reading megabytes of data through this interface, then the<br />

cache will soon be filled, and then re-filled with data to service<br />

the requests. On return to normal CPU operation, previously<br />

cached instructions and/or data will be lost and time will be<br />

taken to refresh the caches with commonly-used data. This can<br />

be mitigated to a certain extent using cache way locking based<br />

on master, however this would then reduce the overall cache<br />

size available to each master in the system.<br />

III. VIRTUALIZATION AND ITS EFFECTS ON HARDWARE ACCELERATION<br />

Virtualization of operating systems is increasingly common<br />

– especially in multi-user environments, and in systems that<br />

require more than one operating system to function.<br />

Fig. 4. Accelerator card connected to Host CPU Via PCI Express link<br />

Fig. 5. Cache-coherent accelerator via ACP<br />



At a very general level, one of the ideas behind<br />

virtualization is to allow for multiple operating systems to run<br />

on a given hardware architecture under control of a hypervisor.<br />

The hypervisor sets up the system and the facilities available to each guest operating system, and routes interrupts, service calls and exceptions appropriately. Depending on the type of<br />

hypervisor used, the guest OS may not need to have any<br />

specific awareness of being run in a virtual environment.<br />

With the mechanisms that allow for virtualization to<br />

function, there is a need for an additional level of address<br />

decoding. A guest operating system such as Linux makes use<br />

of virtual addressing, which through the memory management<br />

unit (MMU) is converted to a physical address. However, with<br />

the addition of a hypervisor, this physical address is in fact<br />

defined as an intermediate physical address (IPA). The<br />

hypervisor controls an additional level of address decode that<br />

allows this intermediate physical address to be further mapped<br />

into a final, actual physical bus address.<br />

The reason for this additional stage of decoding complexity<br />

is to allow for multiple operating systems to map system<br />

memory and peripherals to the same address (intermediate<br />

physical address), but in reality, not conflict in physical system<br />

memory or peripherals available to the system.<br />

This additional memory decode however is not without its<br />

challenges to hardware acceleration solutions – especially those<br />

that rely on accessing system memory to share data. Whilst the<br />

operating system may communicate an intermediate physical<br />

address to initiate a DMA transaction from system memory, a<br />

further decode is required to translate the IPA to a physical<br />

address. In software, this is an additional overhead that would<br />

need to be implemented, and this can therefore increase the<br />

delay for the data to initially become available to the hardware<br />

accelerator. In processor architectures that support<br />

virtualization, it may be possible to use a System Memory<br />

Management Unit (SMMU). One of the functions of a SMMU<br />

is to provide hardware functions to automate the lookup and<br />

decode of intermediate physical addresses into physical<br />

addresses, therefore offloading the processor and caching common table lookups locally in the hardware. The<br />

SMMU can potentially be used both by dedicated hardware<br />

blocks, such as DMA and Ethernet cores, as well as custom<br />

accelerator logic, such as an FPGA.<br />

Fig. 6. Virtual to physical address translation under hypervisor control<br />

IV. LANGUAGES THAT CAN BE USED FOR ACCELERATION<br />

The user has a choice in the input languages that can be<br />

used to develop FPGA custom accelerators.<br />

A. Hardware Description languages<br />

VHDL and Verilog are examples of Hardware Description<br />

Languages (HDLs) that can be synthesized into hardware at a<br />

low level. These are considered hardware-focused languages<br />

and have been used for decades to describe and implement<br />

ASIC and FPGA systems at a hardware description level. It<br />

can be argued that HDL languages give the developer the<br />

greatest control over the implementation of the resultant FPGA<br />

hardware, however as a low-level, descriptive language, the<br />

downside is the complexity and training needed to truly<br />

achieve this. In addition, some functional changes that appear<br />

trivial at a higher level of abstraction can cause heavily<br />

optimized HDL code to be dramatically different – especially if<br />

primitives are instantiated in the code.<br />

B. High Level Languages<br />

1) OpenCL<br />

OpenCL [6] is a programming framework based on the C<br />

language. A software engineer can use OpenCL to describe parallelism, writing kernels that describe their functionality with C and the OpenCL APIs. By using the features of OpenCL, the<br />

developer can describe ‘kernels’ that would provide the<br />

acceleration function. OpenCL enables developers to target<br />

different accelerator targets, such as GPU, CPU, DSP and<br />

FPGA with the same source code, however in reality<br />

optimizations would need to be made for each class of<br />

accelerator to get the most out of its architecture.<br />

In general terms, the OpenCL framework defines a host<br />

CPU and infrastructure for accelerator kernels to be<br />

implemented within the FPGA. As such, the OpenCL<br />

development environment tends to assume that the entire<br />

FPGA is available to the host as a resource. It is possible to<br />

create other HDL in an FPGA as part of the OpenCL Board<br />

Support Package, however this is an advanced use case, and<br />

requires knowledge of HDL languages and FPGA design<br />

methodology.<br />

2) C/C++<br />

C and C++ can be used to describe the functionality of an<br />

accelerator, and these accelerators can be compiled into a<br />

memory-mapped IP block for inclusion in an FPGA design.<br />

This allows direct use of the C language without OpenCL-specific extensions. The system designer will take the resultant<br />

output IP block implemented from the C/C++ code and<br />

integrate it into the FPGA processor design using system<br />

integration tools and/or HDL.<br />

3) Other languages such as MATLAB/Simulink<br />

Vendors such as MathWorks® [7] have an ecosystem and<br />

environment around bespoke design techniques such as<br />

MATLAB®, and the block-level design tool Simulink®.<br />

Tools exist to integrate designs developed in these tools into IP<br />

blocks that can be implemented in FPGA fabric and memory-mapped to the processor core as an accelerator.<br />



V. SUMMARY<br />

The performance improvement of a processor system with<br />

accelerators is greatly influenced by architecture choices. The<br />

architect needs to consider latency and data flow dependencies<br />

to make sound implementation choices. In that way pipeline<br />

bubbles that can negatively affect performance are minimized.<br />

REFERENCES<br />

[1] Williams, RS “The End of Moore’s Law – What’s next”, Computing in<br />

Science and Engineering Issue 2 Mar-Apr 2017<br />

[2] S. Graham, P. Kessler, M. McKusick “gprof: a Call Graph Execution<br />

Profiler”, Proceedings of the 1982 SIGPLAN Symposium on Compiler Construction, pp. 120-126<br />

[3] R. Maiden “DPD Profiling and Optimization with Altera SoCs”; WP-<br />

01248-2.0 May 2016<br />

[4] Intel “Nios II Custom Instruction user Guide”; UG-N2CSTNST,<br />

December 2017<br />

[5] H. Rosinger "Connecting Customized IP to the MicroBlaze Soft<br />

Processor Using the Fast Simplex Link (FSL) Channel"; XAPP529<br />

(v1.3) May 12 2004<br />

[6] https://www.khronos.org/opencl/<br />

[7] https://www.mathworks.com/<br />



ARM Cortex-M and RTOSs are Meant for Each Other<br />

Jean J. Labrosse<br />

Micriµm Software, part of the Silicon Labs Portfolio<br />

Weston, FL, USA<br />

Jean.Labrosse@Micrium.com<br />

Abstract— A great majority of today's embedded systems are designed around 32-bit CPUs,<br />

which are integrated into microcontroller units (MCUs) that also include complex<br />

peripherals, such as Ethernet, USB host, device, SDIO, LCD controllers and<br />

more. Integrating these peripherals demands the use of an RTOS kernel.<br />

Introduced in 2004, the ARM Cortex-M architecture is currently the most popular 32-bit<br />

architecture on the market, adopted by most if not all major MCU manufacturers. The<br />

Cortex-M was designed from the outset to be RTOS kernel friendly: dedicated RTOS tick<br />

timer, context switch handler, interrupt service routines written in C, tail-chaining, easy<br />

critical section management as well as other useful features. Once an RTOS kernel is ported<br />

to the Cortex-M using a given toolchain, the exact same port (i.e., CPU adaptation code) can<br />

be used with any Cortex-M implementation. Not only does Cortex-M excel at integer CPU<br />

operations, many Cortex-M MCU implementations are also complemented with a floating-point unit (FPU), DSP extensions, memory protection unit (MPU) and a highly versatile<br />

debug access port.<br />

Keywords: RTOS; Embedded System; Interrupts; Kernel; Debugging; IoT; Micrium<br />



I. INTRODUCTION<br />

A real-time operating system (aka real-time kernel or RTOS) provides many benefits when used<br />

with today’s CPUs and MCUs. A real-time kernel is software that manages the time of a CPU<br />

(Central Processing Unit) or MPU (Micro Processing Unit) as efficiently as possible. Most kernels<br />

are written in C and require a small portion of code written in assembly language in order to adapt<br />

the kernel to different CPU architectures.<br />

When you design an application (your code) with an RTOS kernel, you simply split the work<br />

into tasks, each responsible for a portion of the job. A task (also called a thread) is a simple<br />

program that thinks it has the Central Processing Unit (CPU) completely to itself. On a single CPU,<br />

only one task can execute at any given time. Your application code also needs to assign a priority to<br />

each task based on the task's importance, as well as a stack (RAM) for each task. In fact, adding low-priority tasks will generally not affect the responsiveness of a system to higher-priority tasks.<br />

A task is also typically implemented as an infinite loop. The kernel is responsible for the<br />

management of tasks. This is called multitasking.<br />

Multitasking is the process of scheduling and switching the CPU between several sequential tasks.<br />

Multitasking provides the illusion of having multiple CPUs and maximizes the use of the CPU, as<br />

shown in Figure 1. Multitasking also helps in the creation of modular applications. With a real-time kernel, application programs are easier to design and maintain.<br />

Fig 1. RTOS decides which task the CPU will execute based on events.<br />

Most commercial RTOSs are preemptive, which means that the kernel always runs the most<br />

important task that is ready-to-run. Preemptive kernels are also event driven, which means that<br />

tasks are designed to wait for events to occur in order to execute. For example, a task can wait for<br />

a packet to be received on an Ethernet controller; another task can wait for a timer to expire, and<br />

yet another task can wait for a character to be received on a UART. When the event occurs, the<br />

task executes and performs its function, provided it is the highest-priority ready task. If the event that the<br />

task is waiting for does not occur, the kernel runs other tasks. Waiting tasks consume zero CPU<br />

time. Signaling and waiting for events is accomplished through kernel API calls. Kernels allow<br />

you to avoid polling loops, which would be a poor use of the CPU’s time. Below is an example of<br />

how a typical task is implemented:<br />



void MyTask (void)<br />
{<br />
    while (1) {                     // Tasks are infinite loops.<br />
        Wait for an event to occur; // Task consumes no CPU time while waiting!<br />
        Perform task operation;<br />
    }<br />
}                                   // A task doesn’t return<br />

A kernel provides many useful services to a programmer, such as multitasking, interrupt<br />

management, inter-task communication and signaling, resource management, time management,<br />

memory partition management and more.<br />

An RTOS can be used in simple applications where there are only a handful of tasks, but it is a<br />

must-have tool in applications that require complex and time-consuming communication stacks,<br />

such as TCP/IP, USB (host and/or device), CAN, Bluetooth, Zigbee and more. An RTOS is also<br />

highly recommended whenever an application needs a file system to store and retrieve data as well<br />

as when a product is equipped with some sort of graphical display (black and white, grayscale or<br />

color). Finally, an RTOS provides an application with valuable services that make designing a<br />

system easier.<br />



II. THE ARM CORTEX-M<br />

In 2004, ARM introduced a new family of CPU cores called Cortex-M (M stands for<br />

Microcontroller) based on a RISC (Reduced Instruction Set Computer) architecture. The first<br />

Cortex-M was called the Cortex-M3, and the family has evolved to include a number of derivative<br />

cores: Cortex-M0/M0+, Cortex-M4, high performance Cortex-M7 and the recently introduced<br />

Cortex-M23 and M33 with TrustZone-M.<br />

The programmer’s model (see Figure 2) of the Cortex-M processor family is highly consistent.<br />

For example, R0 to R15, PSR, CONTROL and PRIMASK are available to all Cortex-M<br />

processors. Two special registers, FAULTMASK and BASEPRI, are available only on the Cortex-<br />

M3, Cortex-M4, Cortex-M7 and Cortex-M33, and the floating-point register bank and FPSCR<br />

(Floating-Point Status and Control Register) are available on the Cortex-M4, Cortex-M7 and Cortex-<br />
M33 as part of the optional floating-point unit. Some Cortex-M implementations are also equipped with<br />

a Memory Protection Unit (MPU).<br />

Fig 2. Cortex-M programmer’s model.<br />



The Cortex-M was designed from the outset to be RTOS kernel friendly such that once an RTOS<br />

kernel is ported to the Cortex-M using a given toolchain, the same port (i.e., CPU adaptation code)<br />

can be used with any Cortex-M implementation. This is especially true for Cortex-M3, -M4, -M7 and<br />

-M33.<br />

Dedicated Timer for RTOS Tick<br />

The Cortex-M includes a 24-bit timer, called SysTick, intended for RTOS suppliers to use as the system<br />

heartbeat, which is used to handle time delays and timeouts. The timer also has a preassigned<br />

interrupt vector (#15). This means that the exact same timer initialization code can be used across<br />

all Cortex-M implementations, irrespective of the MCU supplier.<br />

Dedicated Context Switch Handler<br />

The context switching code for most RTOS kernels is implemented through an exception handler,<br />

and the Cortex-M has a dedicated exception handler (#14) for exactly that purpose. This handler is<br />

called PendSV. This means that the exact same context switching code can be used across all<br />
Cortex-M implementations, irrespective of the MCU supplier.<br />

System Service Calls<br />

The CPU allows two modes of operation: Privileged and Non-Privileged. Privileged mode allows<br />

privileges that are typically meant for an operating system, such as disabling/enabling interrupts,<br />

accessing debug features, altering the configuration of an MPU, etc. Non-Privileged mode is<br />

typically meant to be used by application code, which accesses system services through a dedicated<br />

exception handler called the SVC Handler. Again, this mechanism is the same across different<br />

Cortex-M implementations, making the code portable.<br />

ISRs Written in C<br />

The Cortex-M also allows you to write ISRs (Interrupt Service Routines) directly in C as shown<br />

below. This avoids having to learn assembly language, making the code easier to read and support.<br />

All the programmer needs to do is populate the vector table with a pointer to the ISR code.<br />

void MyISR (void)<br />
{<br />
&nbsp;&nbsp;&nbsp;&nbsp;Process interrupting device;<br />
}<br />

Dedicated Stack for ISRs<br />

Upon accepting an exception or an interrupt, the Cortex-M pushes onto the interrupted task’s stack<br />

the contents of eight CPU registers (R0-R3, R12, LR, PC and PSR), and, if the Cortex-M has an<br />

FPU, 17 FPU registers (S0-S15 and FPSCR). The Cortex-M then switches to a dedicated stack to<br />
process the exception or interrupt. This feature removes the requirement to allocate extra RAM<br />
in each task’s stack to accommodate interrupt handling, including nested interrupts.<br />

The NVIC<br />

The Nested Vectored Interrupt Controller (NVIC) supports up to 240 interrupts, each with up to<br />

256 levels of priority. Although the number of interrupts is fairly consistent across the Cortex-M<br />

family, it is always good to check the silicon manufacturer’s data sheet.<br />

Stack Limit Registers (M33 Only)<br />

The recently announced Cortex-M33 contains stack limit registers, which are designed to prevent<br />

and detect stack overflows, one of the most common problems encountered in RTOS-based<br />

applications. There are two stack limit registers (one for the MSP and one for the PSP).<br />
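A minimal host-side model illustrates the idea; on the real Cortex-M33 the comparison against PSPLIM/MSPLIM is performed in hardware and a violation raises a fault before memory is corrupted (the variable and function names here are illustrative):<br />

```c
#include <assert.h>

/* Model of the M33 process-stack limit: a push that would cross
 * PSPLIM faults instead of silently corrupting adjacent memory. */
static unsigned long psp, psplim;
static int stack_fault;

static void push_words(unsigned n)
{
    unsigned long new_sp = psp - 4UL * n;
    if (new_sp < psplim) {         /* would overflow: hardware raises a fault */
        stack_fault = 1;
        return;
    }
    psp = new_sp;
}
```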



CLZ Instruction<br />

The Cortex-M contains a special instruction called Count Leading Zeros (CLZ). Although<br />

originally intended to be used to normalize floating-point numbers, the CLZ instruction can be used<br />

by the RTOS kernel’s scheduler to determine the priority of the highest priority task that is ready<br />

to run. This greatly accelerates the scheduling process of the kernel.<br />
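The technique can be sketched in C: if each ready priority is a bit in a word, with priority 0 held in the most significant bit, counting leading zeros yields the highest ready priority in a single instruction. On GCC/Clang, `__builtin_clz` compiles to CLZ on the Cortex-M; the bitmap layout below is illustrative, not a particular kernel's:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative ready bitmap: bit 31 represents priority 0 (highest),
 * bit 0 represents priority 31 (lowest). The bitmap is assumed
 * non-empty, since the idle task is always ready. */
static uint32_t os_rdy_grp;

static unsigned os_highest_ready(void)
{
    /* Leading-zero count equals the highest ready priority number. */
    return (unsigned)__builtin_clz(os_rdy_grp);
}
```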

Load and Store Exclusive Instructions<br />

Special CPU instructions allow easy implementation of semaphores as well as mutual exclusion<br />

semaphores, which are common in most modern-day RTOSs.<br />
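LDREX/STREX implement an optimistic read-modify-write: the store-exclusive fails if another context touched the location, forcing a re-read and retry. A host-side analogue using C11 atomics (a sketch of the pattern, not any particular RTOS's implementation) shows the same retry loop for a counting semaphore:<br />

```c
#include <assert.h>
#include <stdatomic.h>

static atomic_int sem_count;       /* number of available tokens */

/* Non-blocking acquire: the compare-exchange loop mirrors the
 * LDREX/STREX pattern, where a failed store-exclusive re-reads
 * the value and retries. */
static int sem_try_acquire(void)
{
    int cur = atomic_load(&sem_count);
    while (cur > 0) {
        if (atomic_compare_exchange_weak(&sem_count, &cur, cur - 1))
            return 1;              /* token taken atomically */
    }
    return 0;                      /* none available */
}
```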

Easy Critical Section Management<br />

Most RTOS kernels need to disable interrupts when entering a critical section and enable interrupts<br />

upon leaving the critical section. However, it is important to preserve the state of the interrupt<br />

disable mask prior to entering the critical section so that the same state can be restored upon leaving<br />

the critical section. The Cortex-M allows us to easily implement this in assembly language as<br />

follows:<br />

CPU_SR_Save_DI:<br />
&nbsp;&nbsp;&nbsp;&nbsp;MRS&nbsp;&nbsp;&nbsp;R0, PRIMASK&nbsp;&nbsp;&nbsp;; Return the current interrupt mask in R0<br />
&nbsp;&nbsp;&nbsp;&nbsp;CPSID&nbsp;I&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;&nbsp;; Disable interrupts<br />
&nbsp;&nbsp;&nbsp;&nbsp;BX&nbsp;&nbsp;&nbsp;&nbsp;LR<br />
CPU_SR_Restore_EI:<br />
&nbsp;&nbsp;&nbsp;&nbsp;MSR&nbsp;&nbsp;&nbsp;PRIMASK, R0&nbsp;&nbsp;&nbsp;; Restore the previously saved interrupt mask<br />
&nbsp;&nbsp;&nbsp;&nbsp;BX&nbsp;&nbsp;&nbsp;&nbsp;LR<br />

Entering a critical section is done by calling CPU_SR_Save_DI(), which returns the CPU’s<br />
current interrupt-disable state. Leaving the critical section is handled by calling<br />
CPU_SR_Restore_EI() and passing it the previously saved state.<br />
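The save/restore pair matters when critical sections nest. A host-side model (illustrative names; on the target these map to the MRS/CPSID/MSR sequence shown earlier) makes the point:<br />

```c
#include <assert.h>

/* Model of PRIMASK: 0 = interrupts enabled, 1 = disabled. */
static unsigned primask;

static unsigned cpu_sr_save_di(void)        /* save state, then disable */
{
    unsigned sr = primask;
    primask = 1;
    return sr;
}

static void cpu_sr_restore_ei(unsigned sr)  /* restore the saved state */
{
    primask = sr;
}
```

Restoring the inner section's saved state leaves interrupts disabled, so the outer critical section remains protected until its own restore runs.<br />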

Kernel Aware (KA) and Non-Kernel Aware (nKA) Interrupts<br />

The above method disables all interrupts, which might not be desirable under certain circumstances.<br />

Disabling all interrupts affects the responsiveness of your application to highly time-sensitive<br />

events. It is possible to allocate specific time-sensitive interrupt service routines (ISRs) outside the<br />

reach of the RTOS. These are called non-kernel-aware (nKA) ISRs, and, as the name implies, they<br />

simply bypass the RTOS kernel. nKA ISRs are ISRs that have a higher priority than kernel-aware<br />

(KA) ISRs.<br />

Figure 3 shows the priority levels of ISRs and tasks for a typical Cortex-M CPU. If the RTOS needs<br />

to protect a critical section, it will set the Cortex-M CPU’s BASEPRI register to 0x40 and thus<br />

disable KA ISRs (priority values of 0x40 and above; on the Cortex-M, a lower value means a higher<br />
priority). Since the values 0x00 and 0x20 denote higher priorities, those ISRs are still allowed to<br />
interrupt the CPU, even if the RTOS is in the middle of a critical section.<br />



Fig 3. Cortex-M interrupt priority levels.<br />

Figure 4 shows that nKA ISRs are significantly more responsive than KA ISRs. Of course, nKA<br />

ISRs are not allowed to invoke any of the kernel services. However, it’s possible to have an nKA<br />

ISR trigger a KA ISR by using the interrupt vector of an unused peripheral and manually triggering<br />

the interrupt by writing to the NVIC->ISPR[n] associated with the peripheral.<br />
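The ISPR write breaks down as follows: each ISPR register holds 32 pending bits, so the register index and bit mask are derived from the IRQ number (this is what the CMSIS helper NVIC_SetPendingIRQ computes internally):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Compute which NVIC->ISPR[n] register and which bit pend a given IRQ.
 * Writing the mask to that register triggers the (unused) peripheral's
 * interrupt from software, as described above. */
static void ispr_locate(unsigned irqn, unsigned *reg, uint32_t *mask)
{
    *reg  = irqn >> 5;              /* irqn / 32 */
    *mask = 1UL << (irqn & 0x1F);   /* irqn % 32 */
}
```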

Fig 4. Responsiveness of nKA vs KA ISRs<br />



Low Power Mode<br />

The Cortex-M contains a special instruction called Wait For Interrupt (WFI) that allows the<br />

processor to enter a low power state. The kernel can call this instruction whenever there are no<br />

tasks that are ready to run. In other words, this instruction would be called by the kernel’s Idle<br />

Task. The amount of energy savings is highly MCU manufacturer-specific. As its name implies,<br />

the low power state exits when an interrupt occurs.<br />

Optional FPU with Lazy Stacking<br />

Although not RTOS specific, the optional FPU allows applications that require floating-point<br />

computations to be greatly accelerated, which can also help reduce power consumption. The<br />

floating-point unit adds registers, which increases the overhead during a context switch. However,<br />

the Cortex-M logic is smart enough to only save the FPU registers onto the stack if the task actually<br />

made use of the FPU.<br />

Optional MPU<br />

Some Cortex-M implementations are equipped with a Memory Protection Unit (MPU), which can<br />

easily be programmed to protect against stack overflows and prevent code from executing out of<br />

RAM. Stack overflows occur when the programmer doesn’t allocate sufficient stack space for a<br />

task. As the stack overflows, memory locations used for other purposes are corrupted, which can<br />

cause unusual problems that might go undetected until the product is actually deployed. RTOS<br />

kernels might have a mechanism to check for stack overflows, but, often, the detection occurs too<br />

late. The MPU prevents RAM corruption by immediately detecting stack overflows. The newest<br />

Cortex-M cores based on the v8M architecture contain an improved MPU allowing regions to be<br />

as small as 32 bytes, aligned on 32-byte boundaries.<br />

Tail Chaining<br />

Although not a direct feature needed by an RTOS kernel, tail chaining reduces the amount of time<br />

it takes to handle back-to-back interrupts of the same or lower priority. This feature significantly<br />

reduces interrupt latency, which is always desirable in real-time applications.<br />

The CoreSight Debugger<br />

Most Cortex-M cores have a special debug port that contains a free-running 32-bit cycle counter that can be<br />

used for CPU and execution time measurements. This is not a feature that is actually needed by an<br />

RTOS kernel but is quite useful for obtaining performance data on an application.<br />

The CoreSight debugger also offers on-the-fly reads and writes allowing PC-based applications like<br />

Micrium’s µC/Probe to monitor or change (at run-time) RTOS as well as application variables.<br />

µC/Probe is a data visualization tool that allows developers to display or change (at run-time) the<br />

current values of variables without requiring the application to be instrumented. µC/Probe has built-in<br />
µC/OS-III® (also available for µC/OS-II® and µC/OS-5™) kernel awareness, which means that<br />

it can display the current state of kernel objects, such as tasks, semaphores, mutexes, message<br />

queues, etc. A quick glance at the task screen in µC/Probe will show whether the system is behaving<br />

as expected. µC/Probe can display or change the value of any variables in an application as long as<br />

those are declared global or static. This allows developers to run “what-if” scenarios with PID-loop<br />

gains, change scaling offsets, etc.<br />

The CoreSight debugger also allows tools like Segger’s SystemView (Figure 5) or Percepio’s<br />

Tracealyzer to stream and store onto a PC the task execution profile of an application. This type of<br />

tool is invaluable when determining whether or not an RTOS-based application can meet its timing<br />

requirements.<br />



Fig 5. Segger’s SystemView for µC/OS-III<br />

III. SUMMARY<br />

The Cortex-M was truly designed from the outset to be RTOS kernel friendly.<br />

Special instructions help with scheduling and with exclusive access to shared resources, and it is<br />
easy to disable/enable interrupts for critical sections, to let the CPU enter a low-power mode when<br />
running the idle task, and so on.<br />

The interrupt handling mechanism is especially well suited for supporting real-time applications<br />

through its responsive NVIC, tail-chaining feature, support of non-Kernel Aware and Kernel Aware<br />

ISRs and more.<br />

Industrial applications can especially benefit from the floating-point capability of the optional FPU<br />

module, the protection offered by the MPU and, with the Cortex-M33, prevention and detection of<br />

stack overflows through the stack limit registers.<br />






Automating Power Management in MCU Operating<br />

Systems<br />

Nick Lethaby<br />

Connected Microcontrollers<br />

Texas Instruments<br />

Goleta, USA<br />

nlethaby@ti.com<br />

Abstract—With Internet of Things (IoT) applications fueling<br />

an increase in battery-powered connected sensors and actuators,<br />

power management has become a critical technology for MCU<br />

developers. Advanced power management features implemented<br />

in silicon are of limited use unless complemented by a software<br />

layer that enables such features to be easily leveraged. The<br />

importance of ease-of-use is accentuated in the IoT market,<br />

where many developers lack embedded expertise. This paper<br />

presents an RTOS-based power management framework that<br />

automates power management in wireless MCU applications<br />

without developers having to implement specific power<br />

management code or have their applications decide when to enter<br />

specific low power states. We overview the underlying component<br />

implementations required to achieve this, including power-aware<br />

drivers that enable the OS to understand when specific peripherals<br />

may be turned off, and efficient tracking of future events, such as<br />

periodic functions and timeouts, by the RTOS. We next discuss a<br />

power policy program that decides when to transition to a lower<br />

power state and which state to transition to. We conclude by<br />

looking at power consumption benchmark numbers based on<br />

an ARM Cortex-M wireless microcontroller.<br />

Keywords—power management; real-time operating system;<br />

MCU;<br />

I. INTRODUCTION<br />

The emergence of the Internet of Things (IoT) promises to<br />

greatly increase the deployment of low-cost sensors or<br />

actuators, such as intelligent lighting, industrial data loggers,<br />

asset tracking tags, and smoke detectors, which will need to<br />

communicate to the internet. These sensors and actuators<br />

(henceforth referred to as ‘IoT nodes’) will often need to run<br />

for months or years on coin cell or AA batteries. As a result,<br />

energy efficiency will be a critical concern for developers.<br />

Users of laptops, mobile phones, and tablets are<br />

accustomed to having the operating system control power saving<br />

activities such as dimming displays or hibernation of the<br />

system after periods of no usage. However, these devices are<br />

based on sophisticated operating systems such as Windows,<br />

Linux, iOS, or Android. The low cost nature of IoT nodes will<br />

result in many implementations using MCUs with limited on-chip<br />
memory, precluding the use of such high-level operating<br />

systems. While traditional MCU developers are often satisfied<br />

with a set of low-level libraries for managing the hardware<br />

functionality, such an approach will often be insufficient for<br />

IoT nodes for several reasons:<br />

Over the last decade, new silicon processes have created<br />

significantly more power leakage compared to devices built<br />

using older CMOS processes. To achieve the energy efficiency<br />

optimal for IoT nodes, more sophisticated power management<br />

features are being designed into MCUs aimed at IoT<br />

applications. Only providing a low-level software interface to<br />

these creates a learning curve for potential users, making it less<br />

likely they will exploit them.<br />

Achieving optimal energy efficiency will require using<br />

more complex power-down modes, where much of the SoC<br />
(CPU, peripherals, and memory) is shut down or power cycled.<br />

The silicon vendor should provide higher-level functions that<br />

implement these ultra-low power states reliably to insulate the<br />

user from device-specific complexities. In addition, these<br />

higher-level power management solutions should address<br />

issues such as maintaining a reliable timebase in applications<br />

that are spending significant time in sleep modes.<br />

Many IoT devices are originating from companies not<br />

traditionally associated with embedded systems development<br />

and it is anticipated that there will be insufficient traditional<br />

embedded developers to address all the opportunities available<br />

in the IoT marketplace. Developers of MCU-based IoT nodes<br />

who lack prior embedded development experience will<br />

certainly not want to be dealing with low-level register-abstraction<br />

APIs. They will expect something much closer to<br />

what is available in Windows or Linux where one can select a<br />

specific power down mode or have the operating system<br />

actively manage power.<br />

In the wireless MCU space, the Software Development Kits<br />

(SDKs) used by embedded developers commonly include<br />

multitasking kernels, network connectivity and device drivers.<br />

In this paper, we will examine the implementation of a power<br />

management framework that provides automated power<br />

management for wireless MCUs using popular embedded<br />

RTOS offerings such as FreeRTOS and TI-RTOS. We will<br />

demonstrate the effectiveness of this framework using power<br />

consumption data obtained from the Texas Instruments<br />



SimpleLink® CC2640R2 wireless MCU, which supports<br />

Bluetooth Low Energy (BLE) communication. This MCU<br />
is based on the widely used ARM® Cortex®-M3 core.<br />

II. HOW AN RTOS HELPS POWER MANAGEMENT<br />

Except for the simplest designs, using an RTOS has some<br />

inherent advantages for energy efficient designs. The first of<br />

these is that the preemptive multitasking design paradigm<br />

encourages interrupt-driven rather than polling-based drivers.<br />

This eliminates unnecessary CPU cycles spent simply polling<br />

peripheral registers. The second generic advantage is the OS<br />

automatically drops into an idle thread when there is nothing to<br />

do, clarifying when power saving techniques can be applied.<br />

Furthermore, as we will see in later discussion, some of the<br />

more advanced power management capabilities require that the<br />

device drivers communicate with a centralized database that<br />

tracks which resources are in use. This fits naturally into an<br />

OS, which typically manages some or all of a system’s<br />

peripherals. Beyond these natural advantages, a power-aware<br />

RTOS must offer numerous other capabilities to achieve an<br />

optimal low power operating performance. We will examine<br />

the specific power management techniques that combine to<br />

produce a comprehensive framework. However, before getting<br />

into the specifics of the software, we will briefly overview<br />

some of the essential hardware power management features that<br />

must be present on the device.<br />

III. HARDWARE POWER MANAGEMENT FEATURES<br />

To comprehend the software power management<br />

techniques explained later, it is necessary for the developer to<br />

have a basic understanding of some of the underlying hardware<br />

features that assist in effective power management:<br />


Clock Gating: Clock Gating enables the clock to be<br />

turned off for a particular peripheral, which in turn<br />

reduces the power consumed by the peripheral’s logic.<br />

Power Domains: Although turning off the clock to a<br />

peripheral eliminates most power consumption,<br />

depending on the process used to manufacture the<br />

device, there will often still be some power drain due<br />

to leakage. To address this issue, a SoC may<br />

implement power domains to completely shut off<br />

power to a particular circuit. Unlike clock gates, which<br />

will usually have a one-to-one correspondence to a<br />

peripheral, a power domain typically controls multiple<br />

peripherals, such as all the UARTs or all the serial I/O<br />

peripherals.<br />

Wake-up Generator: To implement very aggressive<br />

low power states, both the CPU and virtually all the<br />

peripherals domains are powered down. Since no<br />

interrupts can normally reach the CPU in these<br />

circumstances, additional logic that enables a subset of<br />

peripherals to wake up the CPU is required. The SoC<br />

designer must decide which interrupts can wake up the<br />

CPU and ensure that the wake-up generation logic is<br />

able to catch these interrupts, take the CPU out of reset<br />

so it can respond to the interrupt, and then forward the<br />

interrupt to the correct vector.<br />


CPU-independent High-resolution Timer: Since the<br />

great majority of embedded applications have some<br />

time-driven events, it is essential that an accurate<br />

timebase can be maintained across power saving<br />

modes. This requires a timer to be kept active while the<br />

rest of the SoC is powered down. This timer must have<br />

sufficient resolution to maintain something similar to a<br />

1ms tick count and sufficient width to avoid rollovers<br />

during periods of deep sleep. The required resolution<br />

and width will depend on the CPU clock rate and how<br />

long the application will sleep for.<br />

Fast wake up time and appropriate run-time<br />

performance: Although not explicitly used for power<br />

management, the ability of the SoC to wake up<br />

quickly, complete work quickly, and go back to a low-power<br />
state quickly is of paramount importance to<br />

maximize time in low power states. Important design<br />

choices here include having the high-frequency clock<br />

source stabilize quickly and selecting the right CPU<br />

speed and performance so that the work can be done<br />

quickly.<br />
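The width requirement from the timer feature above can be checked with a quick calculation: the counter needs enough bits to represent the longest sleep period at the chosen resolution. A sketch:<br />

```c
#include <assert.h>

/* Minimum counter width (in bits) to count max_ticks without rollover. */
static unsigned bits_needed(unsigned long long max_ticks)
{
    unsigned bits = 0;
    while (max_ticks) {
        bits++;
        max_ticks >>= 1;
    }
    return bits;
}
```

For example, a 1 ms tick over a 7-day sleep (604,800,000 ticks) needs 30 bits, while a 32-bit millisecond counter rolls over after roughly 49.7 days.<br />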

We will discuss how an RTOS power manager utilizes<br />

these features, beginning with a discussion on how to minimize<br />

run-time power consumption.<br />

IV. “CPU ACTIVE” POWER MANAGEMENT TECHNIQUES<br />

Minimizing power consumption while the CPU is active<br />

primarily means aggressively managing power consumed by<br />

peripherals such as timers, serial ports, and radios. To do so,<br />

the RTOS power manager is reliant on the clock gating and<br />

power domains designed into the CC2640R2 silicon, which<br />

enable inactive peripherals to be powered down. Leveraging<br />

this hardware requires knowing when a particular peripheral is<br />

in use or not. Such knowledge can be tracked by an operating<br />

system and its associated device drivers. Each device driver<br />

must declare a dependency on the specific peripheral it will<br />

use. For example, when the SPI driver is invoked, it declares a<br />

dependency to the OS power manager on the specific SPI port<br />

(e.g. SPI2). The OS power manager knows the clock gate and<br />

power domain that are associated with SPI2 and verifies that<br />

these are enabled. If they are not, it enables them. When the<br />

driver completes execution, it informs the OS power manager<br />

to release the dependency on the chosen SPI. The power<br />

manager maintains a database of dependency counts on the<br />

clock gates and power domains. Whenever the dependency<br />

count for a clock gate or power domain goes to zero, the power<br />

manager is responsible for disabling them to reduce power.<br />

These peripheral power downs are done during normal system<br />

run-time and help increase energy efficiency.<br />
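The dependency database described above amounts to reference counting per clock gate or power domain; a minimal sketch (illustrative names, not the actual TI-RTOS Power driver API):<br />

```c
#include <assert.h>

#define NUM_DOMAINS 4
static int dep_count[NUM_DOMAINS];   /* drivers currently using each domain */
static int domain_on[NUM_DOMAINS];   /* 1 = clocked/powered */

static void power_set_dependency(int d)
{
    if (dep_count[d]++ == 0)
        domain_on[d] = 1;            /* first user: enable clock gate / power domain */
}

static void power_release_dependency(int d)
{
    if (--dep_count[d] == 0)
        domain_on[d] = 0;            /* last user gone: power the domain down */
}
```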

V. MAXIMIZING CPU POWER STATE EFFICIENCIES<br />

In many IoT nodes, it will be common for the SoC to spend<br />

much or even most of its time in some form of sleep mode. To<br />

maximize energy efficiency, it is critical to not only maximize<br />

the amount of time spent in sleep modes, but also appropriately<br />

utilize the most power efficient sleep modes where possible.<br />

Achieving the most power efficient sleep state will typically go<br />

beyond just putting the CPU into a sleep state. It may be<br />

desirable to power down memories in addition to on-chip<br />

426


peripherals. It is also essential to have a real-time clock or<br />

high-resolution timer kept alive across power downs to ensure<br />

proper functioning of the application’s time-based functions. In<br />

the CC2640R2 implementation, the real-time clock is part of<br />

the “always on” hardware, so the application always has access<br />

to it. However, in other silicon implementations, it may be<br />

necessary for the power manager to specifically keep a timer or<br />

clock alive. There are a number of different techniques that can<br />

be utilized to ensure that sleep modes are as efficient as<br />

possible. We will begin with a discussion of tick suppression.<br />

VI. TICK SUPPRESSION<br />

Embedded applications typically employ a regular timer<br />

interrupt as a ‘heartbeat’. This timer interrupt is used as the<br />

basis for calculating when any time-based activities such as<br />

periodic functions or timeouts should occur. For RTOS-based<br />

applications, this timer interrupt is known as the system tick,<br />

but no-OS applications will typically have a similar regular<br />

timer tick.<br />

In practice, ticks execute periodically, at a rate sufficient for<br />

the most granular timing needed by the application. As a result,<br />

most system ticks will not result in a time-driven function<br />

being executed. In energy efficient applications, it is clearly<br />

undesirable to be woken up from a low-power state just to<br />

service the system tick timer interrupt and then find there is<br />

nothing to do. Fortunately the OS knows when any periodic<br />

functions or timeouts are due to occur. To implement tick<br />

suppression, the OS reprograms the timer associated with the<br />

system tick so the next timer interrupt only occurs when the<br />

next time-based function must run. As illustrated in figure 1,<br />

this approach can eliminate the majority of timer interrupts<br />

associated with the system tick.<br />

In the TI-RTOS implementation, the user simply has to set<br />

a configuration parameter to enable tick suppression. An<br />

alternative approach is to provide application-driven control<br />

through APIs. However, this forces the tick suppression logic<br />

into the application code as well as adding the overhead of<br />

APIs calls to a relatively simple operation. The core overhead<br />

of tick suppression is low as reprogramming the timer<br />

peripheral is simply a register write. TI-RTOS and most other<br />

RTOSs automatically track the next tick interval when work is<br />

scheduled for so this information is always available. A minor<br />

side effect is that it may take somewhat longer to execute OS<br />

system calls that must return tick counts, especially on<br />

architectures with poor math performance. This is because the<br />

count must be calculated, versus just returning a count variable<br />

that is simply incremented upon each timer interrupt.<br />
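In outline (illustrative names; the real logic is internal to the kernel), tick suppression programs a single timer compare for the next due event, and on wake-up derives the tick count from the free-running timer, which is the calculation noted above:<br />

```c
#include <assert.h>

static unsigned long tick_at_sleep;   /* tick count when suppression began */
static unsigned long timer_at_sleep;  /* free-running timer value at that point */

/* Program one wake-up instead of N periodic tick interrupts. */
static unsigned long ticks_to_sleep(unsigned long now, unsigned long next_event)
{
    return next_event - now;
}

/* Reconstruct the tick count on wake-up from elapsed timer cycles. */
static unsigned long ticks_on_wake(unsigned long timer_now,
                                   unsigned long timer_hz,
                                   unsigned long tick_hz)
{
    return tick_at_sleep + (timer_now - timer_at_sleep) * tick_hz / timer_hz;
}
```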

VII. A POWER POLICY MANAGER<br />

In earlier versions of the TI-RTOS power manager that<br />

worked on DSPs in mobile phone applications, decisions on<br />

when to go a particular low power state and which power state<br />

to select were pushed up to the application. Once a decision<br />

had been made to go to a specific power state, a register/notify<br />

framework enabled the power manager to notify relevant<br />

system entities such as device drivers, which would then take<br />

steps to complete any activities and prepare for a power state<br />

change. Once all the system entities had reported that they were<br />

ready, the power manager would then proceed with the power<br />

state change. This approach was sufficient in the mobile phone<br />

space where large application development teams incorporate<br />

power management experts and the non-deterministic nature of<br />

the notification process is acceptable when the main CPU is<br />

running a high-level operating system such as Android, which<br />

inherently has a lot of overhead.<br />

For IoT node applications, a simpler and lower-overhead<br />

approach is required. For the reasons discussed earlier<br />

concerning tick suppression, the OS power manager is well placed<br />

to make any decision about transitioning to a different<br />

power state. A function called a power policy manager was<br />

developed to provide a simple way to automatically decide on<br />

and manage power transitions. The register/notify framework<br />

was scaled back and greater use was made of a concept known<br />

as a constraint to simplify decisions about power state<br />

transitions. The power policy manager is configurable by the<br />

developer but comes with a set of default policies that can be<br />

used without the user having to understand significant levels of<br />

detail.<br />

When a multitasking OS-based application has nothing to<br />

do, it drops into an idle loop and the OS can invoke the power<br />



policy manager. The role of the power policy manager is to<br />

determine which low power state can be entered at this point. It<br />

is always safe to simply place the ARM core in a<br />

WaitForInterrupt (WFI) state as the core register contents are<br />

fully maintained and application execution can be resumed<br />

with minimal latency. However, since other power states offer<br />

much greater power savings, the policy manager will first<br />

determine if one of these can be entered.<br />

A common reason an application may drop into the idle<br />

loop is because one or more tasks are blocked waiting for<br />

peripheral IO operations to complete. If completing these IO<br />

operations or any other function is essential for the system’s<br />

correct operation, the application needs to be able to<br />

communicate this to the OS power manager. In the power<br />

manager implementation for the CC2640R2, the application<br />

informs the power manager of such critical functions by setting<br />

constraints. An example of when a constraint is appropriate<br />

would be when transmitting data over a BLE or 802.15.4 radio.<br />

An application that is waiting for acknowledgement or data<br />

from the wireless network would typically block on a<br />

semaphore. If no other application task needs to run, the<br />

application will then drop into the idle loop and the power<br />

policy would be run. Obviously, it would not be appropriate to<br />

shut down the radio and put the CPU into a long latency deep<br />

sleep mode, because this would result in the incoming BLE<br />

packets being lost. To prevent this from happening, the BLE<br />

stack or radio driver would set a constraint while it was<br />

operating. When its action was complete, it would release the<br />

constraint. The constraint should be limited to only the power<br />

down modes that would impair successful operation. For<br />

example, going into an IDLE state (see next section for more<br />

details of the different CC2640R2 power states) may be safe<br />

for a particular operation, but not going into a STANDBY<br />

state. The power manager tracks constraints in a relatively<br />

similar manner to dependencies. However, it is important to<br />

understand that the power policy only checks for constraints,<br />

not dependencies. The assumption is that power downs can be<br />

done regardless of on-going peripheral activity unless a<br />

peripheral’s associated stack or device driver sets a constraint.<br />
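To make the constraint mechanism concrete, the following is a minimal sketch in C. The names (`pwr_set_constraint`, `pwr_release_constraint`, `pwr_policy_choose`) are illustrative and are not the actual CC2640R2 driver API: constraints are reference-counted per power state, and the policy simply picks the deepest state that no driver has vetoed.<br />

```c
#include <assert.h>

/* Illustrative constraint tracking, loosely modeled on the power
 * manager described above; names and types are hypothetical. */
enum pwr_state { PWR_WFI, PWR_IDLE, PWR_STANDBY, PWR_NUM_STATES };

static unsigned constraint_count[PWR_NUM_STATES];

/* A driver sets a constraint while an operation would be impaired
 * by the given state, and releases it when the operation is done. */
void pwr_set_constraint(enum pwr_state s) { constraint_count[s]++; }

void pwr_release_constraint(enum pwr_state s)
{
    assert(constraint_count[s] > 0);
    constraint_count[s]--;
}

/* Policy: choose the deepest state not blocked by any constraint.
 * WFI is always safe, so it is the fallback. */
enum pwr_state pwr_policy_choose(void)
{
    if (constraint_count[PWR_STANDBY] == 0) return PWR_STANDBY;
    if (constraint_count[PWR_IDLE] == 0)    return PWR_IDLE;
    return PWR_WFI;
}
```

In this sketch, a BLE stack would call `pwr_set_constraint(PWR_STANDBY)` before a transfer and release it on completion, leaving IDLE and WFI available to the policy in the meantime.<br />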

Assuming constraints are not preventing the system from<br />

transitioning to a lower power state, the power policy manager<br />

must weigh information from various sources to decide on<br />

which power saving mode to invoke. Each power saving mode<br />

is characterized by a specific latency, obtained by<br />

combining the time taken to perform the power down operation<br />

and time required for the SoC to fully wake-up and be ready<br />

for normal system execution. Similar to the technique used in<br />

tick suppression, the power policy will check when the next<br />

periodic functions or timeouts are due to occur and then<br />

compare this time against the latencies of the different power<br />

states. It will then choose the lowest applicable power state and<br />

program the appropriate wake-up configuration. The power<br />

policy understands the wake-up latencies from each power<br />

state and therefore will program the wake-up to occur<br />

sufficiently early to ensure the processor is ready to respond<br />

instantly to perform the previously scheduled work. When the<br />

power policy triggers a transition to a new power state, it will<br />

invoke callback functions registered by drivers that need<br />

notification of sleep transitions to shut down the peripheral’s<br />

activity. The default implementations of these callbacks are<br />

minimalistic and based on the assumption it is safe (due to no<br />

constraint being set) to shut down the peripheral as quickly as<br />

possible.<br />
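The latency comparison described above can be sketched as a table walk from the deepest power state to the lightest, choosing the first state whose combined entry-plus-wake-up latency fits before the next scheduled event. This is an illustrative sketch only: the function and type names are hypothetical, and the latency figures are rough values derived from the wake-up times in Table 1 plus an assumed, equal entry time.<br />

```c
#include <stddef.h>

/* Hypothetical descriptor for one power saving mode. */
typedef struct {
    const char *name;
    unsigned    total_latency_us;  /* assumed enter + wake-up, combined */
} pwr_mode_t;

/* Ordered from deepest saving to lightest; figures are illustrative
 * round-ups of Table 1's wake-up latencies, not datasheet values. */
static const pwr_mode_t modes[] = {
    { "STANDBY", 28 },  /* ~14 us down + ~14 us up, assumed */
    { "IDLE",    3 },   /* ~1.4 us each way, rounded up       */
    { "WFI",     0 },   /* effectively free, always applicable */
};

/* Pick the deepest mode whose round-trip latency still leaves the
 * CPU ready before the next periodic function or timeout is due. */
const pwr_mode_t *pwr_select_mode(unsigned us_until_next_event)
{
    for (size_t i = 0; i < sizeof modes / sizeof modes[0]; i++)
        if (modes[i].total_latency_us < us_until_next_event)
            return &modes[i];
    return &modes[2]; /* WFI fallback */
}
```

The wake-up would then be programmed `total_latency_us` early, mirroring the tick-suppression technique described above.<br />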

VIII. SOC-SPECIFIC POWER STATES<br />

A key attribute of the Power Manager is that it provides<br />

proven implementations of a pre-defined set of power states for<br />

a device. These are extensively tested to ensure reliable<br />

transitions to and from the mode and eliminate the need for<br />

development from scratch.<br />

The power states for the CC2640R2 are listed in Table 1 as<br />

an example of those that can be present for a device power-optimized<br />

for an IoT node. As can be seen from the data in the<br />

table, to achieve ultra-low power consumption, it is important<br />

to implement SoC-specific power states that do much more<br />

than simply sleep the main CPU.<br />

WaitForInterrupt mode simply results in gating the clock to<br />

portions of the main CPU. This may be used in any situation as<br />

it has virtually no latency. The primary role of the power policy<br />

manager is to determine if the IDLE or STANDBY modes can<br />

be used, as these greatly reduce power consumption, especially<br />

the latter. The IDLE mode will additionally power off some<br />

CPU logic completely, while retaining state of vital registers. It<br />

should be noted that no actions are taken in either the<br />

WaitForInterrupt or IDLE implementations to turn off<br />

peripherals. As a result, the actual power usage will vary<br />

depending on which peripheral and associated power domains<br />

are active.<br />

In STANDBY mode, all peripheral domains are powered<br />

down, except for always on logic used for wake-up generation.<br />

The real-time clock in the ALWAYS ON domain is used to<br />

maintain an accurate time base while in this state. The device’s<br />

SRAM is put in retention mode and the power supply is duty-cycled<br />

to achieve further power savings, while maintaining<br />

sufficient charge to maintain vital state.<br />

The shutdown mode is provided for applications that wish<br />

to sleep for hours or even days. The main advantage of this<br />

mode compared to simply turning the whole SoC off is that any<br />

pin can be used to cause the SoC to power back up and there is<br />

no need for additional external circuitry to turn on the SoC.<br />

Because shutdown is used for very long power downs, the<br />

default power policy manager does not utilize it. The<br />

application can invoke it directly if appropriate or modify the<br />

power policy manager to use it.<br />

Power State      | Wake-up time to ACTIVE state | Current used<br />
ACTIVE           | Not applicable               | 4.145 mA<br />
WaitForInterrupt | A few cycles                 | 2.028 mA<br />
IDLE             | 1.4 µs                       | 796 µA<br />
STANDBY          | 14 µs                        | 1-2 µA<br />
SHUTDOWN         | 700 µs                       | 0.1 µA<br />
Table 1: Wake-up latencies and power consumption for the pre-defined power states of the TI CC2640R2, an MCU with integrated BLE<br />



IX. SUMMARY<br />

With the advent of the IoT triggering an explosion in<br />

battery-powered connected sensors and actuators, power<br />

management has become a critical technology for MCU<br />

developers. While aggressive power management strategies<br />

require specific features to be implemented in the silicon itself,<br />

it is equally important that a software layer be provided that<br />

enables such features to be easily leveraged. This is especially<br />

true in the IoT market, where many developers lack embedded<br />

experience. We illustrated RTOS-based power management<br />

components that provide low-level libraries for managing<br />

peripheral clocks and domains and transitioning to and from<br />

specific power states. These are complemented by power-aware<br />

drivers that enable the OS to understand when specific<br />

peripherals may be turned off. Finally, the OS power manager<br />

has the intelligence to decide when to transition to a lower<br />

power state, eliminating the need for the application to manage<br />

such details and simplifying the process for developers.<br />

ACKNOWLEDGMENT<br />

I would like to thank Scott Gary, Senior Member of Technical<br />

Staff at Texas Instruments, for providing technical insight on<br />

software power management in operating systems.<br />



The state of embedded open source software in 2018<br />

Rod Cope<br />

CTO<br />

Rogue Wave Software<br />

Louisville, CO, USA<br />

rod.cope@roguewave.com<br />

It’s no surprise that the adoption of open source software for<br />

embedded development has caught up to the rest of the world – the<br />

advantages are just too great – so what data, trends, and lessons<br />

can we learn? Like commercial software, open source presents<br />

technical, security, and quality challenges but it also adds skills,<br />

experience, and maintenance considerations into the mix. As<br />

developers of embedded devices with strict resource, performance,<br />

and reliability requirements, how do we ensure open source is<br />

managed and deployed effectively?<br />

Rod Cope, CTO of Rogue Wave Software, discusses the state<br />

of open source use in embedded device development today, using<br />

statistics, use cases, and examples from around the industry and<br />

specific technical support tickets. By delving into popular<br />

packages and tools across the application stack, common issues<br />

and solutions are extracted to form a representative framework<br />

for how open source is used in development environments and<br />

production devices. The use of open source has implications for<br />

package selection, integration, team staffing, and maintenance,<br />

and these topics and more are covered to provide specific best<br />

practices for teams to guide their development efforts:<br />

• How to identify potential risk areas for your project<br />

• Steps to better manage open source within the team<br />

• Where to find help if something goes wrong<br />

Keywords— open source software, software security, software quality, best<br />

practices, use cases, embedded software development<br />

I. INTRODUCTION<br />

Open source is everywhere and continues to be a growing<br />

trend for embedded systems development, presenting new and<br />

unique challenges to software teams. Open source software<br />

(OSS) is replacing commercial versions of development tools<br />

and packages, and slowly taking up residence on embedded<br />

target platforms. A recent survey of readers of EETimes and<br />

Embedded shows that the use of open source operating systems,<br />

without commercial support, has grown to 41 percent, up from<br />

31 percent in 2012 1 . Similarly, VDC Research states that “Free<br />

and/or publicly available, open source operating systems such as<br />

Debian-based Linux, FreeRTOS, and Yocto-based Linux<br />

continue to lead new stack wins, with nearly half of surveyed<br />

embedded engineers expecting to use some type of free, open<br />

source OS on their next project.” 2<br />

With more embedded devices connecting to the Internet of<br />

Things (IoT), back-end data processing and analytics are also<br />

embracing OSS. As teams get comfortable with, or bow to<br />

outside pressures to adopt open source, it’s important to<br />

understand where the risks are and how the industry is<br />

overcoming them.<br />

This paper examines data extracted from the Klocwork static<br />

code analysis tool and findings in industry literature (white<br />

papers, articles, and blogs) to identify major areas of open source<br />

risk for embedded systems and steps to better manage open<br />

source within development teams.<br />

II. WHERE OPEN SOURCE IS DEPLOYED<br />

There is a high degree of probability that an embedded<br />

software developer has seen or used open source software.<br />

Today, OSS is easy to get and often fills in technical gaps for<br />

which there is no commercial equivalent. Plus, developers<br />

prefer packages that are popular and have strong communities<br />

behind them.<br />

A. Popular repositories<br />

A survey of the most popular open source hosting sites is<br />

listed in Table 1, illustrating the popularity of OSS projects<br />

today.<br />

TABLE I. USAGE STATISTICS FROM POPULAR OPEN SOURCE HOSTING SITES<br />
Site        | Users          | Projects<br />
Bitbucket   | 6,000,000 a    | Unknown<br />
GitHub      | 27,000,000 b   | 75,000,000 b<br />
LaunchPad   | 4,140,275 c    | 41,141 d<br />
SourceForge | “Millions” e   | 500,000 e<br />
a blog.bitbucket.org/2016/09/07/bitbucket-cloud-5-million-developers-900000-teams/<br />
b github.com/about<br />
c launchpad.net/people<br />
d launchpad.net/projects<br />
e sourceforge.net/about<br />

1 m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />
2 www.vdcresearch.com/images/pr/2016/nov/EMB-Embedded-OS-11-29-16.html<br />



B. OSS deployments<br />

Given the less-stringent requirements on development<br />

environments versus embedded targets, it’s fair to say that most<br />

open source packages are used in the areas of software build and<br />

management tools. Yet there are other, perhaps surprising areas,<br />

where OSS is deployed:<br />

Hardware modelling tools – while electronic design<br />

automation (EDA) tools have been strongly held by proprietary<br />

vendors, open source alternatives are growing in popularity and<br />

features. Examples include Icarus Verilog, Verilator, and GNU<br />

Emacs 3 .<br />

Compiler tool chains – the GNU Compiler Collection (GCC)<br />

has been is use for 30 years and is, by far, the most popular code<br />

compiler supporting C, C++, Ada, Fortran, and other languages.<br />

Other open source compilers include Clang, Portable C<br />

Compiler (pcc), and Tiny C Compiler (TCC).<br />

Software libraries – focusing on embedded targets, there are<br />

many considerations when choosing which software libraries to<br />

include. Notably, with processor and memory resources at a<br />

premium, the library footprint is an important consideration.<br />

Popular examples include newlib, the C runtime library<br />

maintained by Red Hat 4 , and Qt for Device Creation 5 , a version<br />

of the widely-used Qt framework that supports various<br />

embedded targets and is free under the (L)GPL license.<br />

Debuggers – GDB is a popular open source option for source<br />

code debugging, and is often integrated into open source IDEs,<br />

such as Eclipse CDT, NetBeans, and SlickEdit 6 . Eclipse itself<br />

recently celebrated 15 years of supporting embedded systems<br />

development 7 .<br />

Version management systems – open source version control<br />

has a long history (RCS and Subversion, for example) and<br />

has evolved to include distributed and cloud-based systems,<br />

such as Git.<br />

Build systems – this is a broad category that ranges from build<br />

tools, such as GNU make, to modern continuous integration<br />

tools, such as Jenkins and Buildbot.<br />

Operating systems – real-time operating systems (RTOS) for<br />

embedded are characterized by their modularity and footprint,<br />

and there are several open source options available: FreeRTOS,<br />

eCos, and uClinux are a few examples. Linux is the most popular<br />

system, and the Yocto Project offers a complete development<br />

environment for embedded systems 8 .<br />

Databases – for development environments or back-end IoT<br />

servers, MySQL and PostgreSQL are widely-used database<br />

options, while SQLite is popular for embedded targets 9 .<br />

Web servers – while there’s a seeming disparity between the<br />

large processing and memory requirements of a web server and<br />

what’s available on a typical embedded device, connectivity is<br />

critical to IoT development and has driven the need for onboard<br />

HTTP. Busybox has a built-in httpd server and lighttpd is<br />

optimized for resource-constrained environments.<br />

III. THE CHALLENGES OF OSS USE<br />

The greatest strength of open source, and the reason it exists,<br />

also presents the biggest challenges to developers of embedded<br />

systems. As OSS packages can be developed and distributed by<br />

anyone, including commercial companies, to solve myriad<br />

technical needs, it is nearly impossible for a software<br />

development team to be able to support the ones they use. Most<br />

teams focus on the skills necessary to deliver new features, less<br />

so on the skills required to support any OSS packages integrated<br />

into environments and systems – especially if multiple packages<br />

are being used.<br />

The challenges of OSS use can be broken down into two<br />

areas, security risks and technical risks. Security risks are flaws<br />

in deployed software that can allow malicious entities access to<br />

program control or sensitive data, either on-board the system or<br />

through remote connections. A recent example is the Krack<br />

vulnerability in Wi-Fi devices, which allowed attackers to<br />

exploit a flaw in the WPA2 protocol and had the potential to<br />

affect “a seemingly infinite list of embedded and Internet of<br />

Things devices from companies like Linksys.” 10<br />

Technical risks are defined as coding, configuration, or<br />

architectural errors that can cause improper behavior or<br />

performance of one or more open source packages. While this<br />

covers a wide range of possibilities, an illustrative example is<br />

the Nest software bug, which caused the battery to drain<br />

prematurely and deactivate the device 11 .<br />

IV. TECHNICAL RISKS AND SOLUTIONS<br />

Running Klocwork static code analysis on the popular Boost<br />

C++ libraries, there is the potential for several bugs to impact<br />

the behavior and performance of code. Note that some of these<br />

bugs are also potential security flaws, categorized by the<br />

identified Common Weakness Enumeration (CWE) entry.<br />

TABLE II. POTENTIAL BUGS IN BOOST 1.62.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Operands of different size in bitwise operation (CWARN.BITOP.SIZE) | N/A | When bitwise operations have operands of different sizes, unexpected results may occur | 44<br />
Void function returns value (VOIDRET) | CWE-394: Unexpected Status Code or Return Value | Functions declared as void returning a value may indicate a logic problem in code | 23<br />

3 opencores.org/howto/eda<br />
4 sourceware.org/newlib/<br />
5 www.qt.io/download<br />
6 sourceware.org/gdb/wiki/GDB%20Front%20Ends<br />
7 www.eclipse.org/community/eclipse_newsletter/2017/october/article2.php<br />
8 www.yoctoproject.org/about<br />
9 www.embedded-computing.com/embedded-computing-design/the-ins-and-outs-of-embeddeddatabases-for-the-iot<br />
10 www.wired.com/story/krack-wi-fi-wpa2-vulnerability/<br />
11 www.engadget.com/2016/01/14/nest-software-bug/<br />



TABLE II (continued). POTENTIAL BUGS IN BOOST 1.62.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Possible dereference of end iterator (ITER.END.DEREF.MIGHT) | N/A | When an iterator is dereferenced when its value could be equal to end() or rend(), unexpected results may occur | 13<br />
Function returns address of local variable (LOCRET.RET) | CWE-562: Return of Stack Variable Address | When a function returns a pointer to a local variable, it returns a stack address that will be invalidated after return, potentially causing unexpected results | 12<br />
Uninitialized variable (UNINIT.STACK.MUST) | CWE-457: Use of Uninitialized Variable | Uninitialized data may contain values that cause unexpected results | 11<br />

The complexity of the Boost library code is too high to<br />

reproduce in this paper but, to illustrate one of the above<br />

findings, the following general example shows mismatched<br />

operands in a bitwise operation.<br />

1 typedef unsigned int u32;<br />

2 typedef unsigned long long u64;<br />

3 u32 get_u32_value(void);<br />

4 u64 get_u64_value(void);<br />

5 void example(void) {<br />

6 u32 mask32 = 0xff;<br />

7 u64 mask64 = 0xff;<br />

8 u32 value32 = get_u32_value();<br />

9 u64 value64 = get_u64_value();<br />

...<br />

10 value64 &= ~mask32;<br />

11 }<br />

Line 10 shows a 32-bit mask used with 64-bit data, which<br />

may cause unpredictable behavior.<br />
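The failure mode can be isolated in two small helper functions (illustrative names, not Boost code): the complement of the 32-bit mask is computed in 32 bits and then zero-extended, silently clearing the upper half of the 64-bit operand unless the mask is widened first.<br />

```c
#include <stdint.h>

/* Buggy: ~mask is evaluated as a 32-bit value (0xFFFFFF00 for
 * mask == 0xff), then zero-extended to 64 bits, so the upper
 * 32 bits of v are cleared as a side effect. */
uint64_t clear_mask_buggy(uint64_t v, uint32_t mask)
{
    return v & ~mask;
}

/* Fixed: widening the mask before complementing keeps the upper
 * 32 bits of v intact. */
uint64_t clear_mask_fixed(uint64_t v, uint32_t mask)
{
    return v & ~(uint64_t)mask;
}
```

With `v = 0x1122334455667788` and `mask = 0xff`, the buggy form yields `0x0000000055667700` while the fixed form yields `0x1122334455667700`.<br />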

While these types of technical risks apply to any open source<br />

code, there are several steps to consider when selecting and<br />

implementing any package:<br />

1) Identify any known bugs/issues with the package<br />

and, if necessary, check for newer versions to see if<br />

they have been mitigated<br />

2) Follow best practices for the set up, configuration,<br />

and deployment of packages<br />


3) Run functional tests against the package, in<br />

isolation, before integrating into the overall<br />

application<br />

4) Define a process for supporting the package,<br />

including identifying resources to solve issues in<br />

production<br />

V. SECURITY RISKS AND SOLUTIONS<br />

Code security is a popular subject across the software<br />

industry, no less so for embedded systems where user safety and<br />

privacy are of paramount concern. The types of code issues that<br />

can introduce vulnerabilities include buffer overflows, tainted<br />

data, uninitialized data, and dangling pointers, to name a few.<br />

Additionally, the configuration of open source packages can<br />

contribute to attack surfaces.<br />

The MySQL zero-day vulnerabilities in 2016 (CVE-2016-<br />

6662 and CVE-2016-6663) exemplify bugs that are both<br />

inherent in open source code and can be mitigated through<br />

package configuration. Both vulnerabilities could allow<br />

attackers to execute code with root privileges, even if kernel<br />

security were enabled with default active policies for the<br />

MySQL service on some major Linux distributions. While<br />

patches were eventually released, there were configuration<br />

changes proposed by the reporting researcher to protect servers<br />

in the meantime 12 . Being aware of, and having the skills to<br />

implement these changes is not necessarily something<br />

developers consider, but it’s essential to protecting critical<br />

embedded systems.<br />

Another example is SQLite, a popular database used in<br />

embedded systems. Running Klocwork static analysis on an<br />

older version, SQLite 3.15.0, yielded two, potentially<br />

significant, buffer overflow vulnerabilities.<br />

TABLE III. POTENTIAL VULNERABILITIES IN SQLITE 3.15.0<br />
Issue type (Klocwork checker name) | Related CWE enumeration | Description | Number of reported issues<br />
Buffer Overflow - Array Index Out of Bounds (ABV.GENERAL) | CWE-120: Buffer Copy without Checking Size of Input | Array bounds violation: Access to an array element that is outside of the bounds of that array | 17<br />
Buffer Overflow (class or structure) - Array Index Out of Bounds (ABV.MEMBER) | CWE-120: Buffer Copy without Checking Size of Input | Array bounds violation in a class or structure: Access to an array element that is outside of the bounds of that array | 4<br />

The complexity of the SQLite code is too high to reproduce<br />

in this paper but, to illustrate one of the above issues, the<br />

following general example shows an array bounds violation.<br />


12 thehackernews.com/2016/09/hack-mysql-database.html<br />



1 int main()<br />
2 {<br />
3     char fixed_buf[10];<br />
4     sprintf(fixed_buf, "Very long format string\n"); // Line 4. ABR<br />
5     return 0;<br />
6 }<br />

Line 4 shows a string of 24 characters being passed into an<br />

array fixed_buf[] of size 10, which may unintentionally<br />

overwrite adjacent memory.<br />
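A common mitigation, sketched below with an illustrative helper name, is to bound every write by the destination size using snprintf, which truncates rather than overrunning and NUL-terminates the destination whenever the size is non-zero.<br />

```c
#include <stdio.h>

/* Copy msg into dst without ever writing past dstsz bytes.
 * snprintf truncates the output to fit and, for dstsz > 0,
 * always leaves dst NUL-terminated. copy_message is an
 * illustrative name, not from the SQLite sources. */
void copy_message(char *dst, size_t dstsz, const char *msg)
{
    snprintf(dst, dstsz, "%s", msg);
}
```

Applied to the example above, a 10-byte buffer receives only the first nine characters ("Very long") plus the terminating NUL, instead of 25 bytes spilling into adjacent memory.<br />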

The OWASP Project has identified ten best practices for<br />

embedded application security, adding embedded-specific<br />

guidelines, such as firmware and included libraries, to general<br />

secure coding principles 13 . It’s up to the development team to<br />

decide how to implement policies and tests for these issues, but<br />

most organizations adopt static and dynamic analysis to offset<br />

development resources and prevent mistakes.<br />

TABLE IV. OWASP EMBEDDED TOP 10 BEST PRACTICES<br />
E1 – Buffer and Stack Overflow Protection<br />
E2 – Injection Prevention<br />
E3 – Firmware Updates and Cryptographic Signatures<br />
E4 – Securing Sensitive Information<br />
E5 – Identity Management<br />
E6 – Embedded Framework and C-Based Hardening<br />
E7 – Usage of Debug Code and Interfaces<br />
E8 – Transport Layer Security<br />
E9 – Data Collection, Usage, and Storage – Privacy<br />
E10 – Third Party Code and Components<br />
Steps to consider to secure applications that include open<br />

source code:<br />

1) Identify any known vulnerabilities within the package<br />

and version by searching the U.S. Government’s<br />

National Vulnerability Database<br />

2) Follow best practices for the secure configuration and<br />

deployment of packages<br />

3) Run security tests against the package, following the<br />

OWASP embedded top 10 practices above<br />

4) Train the development team to be able to understand,<br />

identify, and mitigate risks<br />

VI. WHERE DEVELOPERS GET HELP<br />

For technical and security risks, open source presents a<br />

unique challenge as packages typically do not offer commercial<br />

levels of support. There are three options for development teams<br />

to solve their issues:<br />

Self-support – relying on internal developers to be<br />

knowledgeable about packages, including staying ahead of<br />

technical issues, security vulnerabilities, and getting trained to<br />

deal with problems and the potential for poor documentation.<br />

Community support – relying on the community to help with<br />

set up, deployment, and production issues, and dealing with the<br />

possibility of slow response or the lack of a solution to a specific<br />

problem.<br />

Commercial support – relying on an outside organization<br />

that knows how to prevent and troubleshoot issues with the<br />

package or set of packages, following a guaranteed level of<br />

service and cost to meet project requirements.<br />

The best source of help is dependent on the unique<br />

requirements of the development team and often combines<br />

aspects of all three options. For embedded systems with strict<br />

mission- and safety-critical requirements, it’s highly<br />

recommended to find a level of support that is not only timely,<br />

but also offers the expertise necessary to cover the technical and<br />

security risks identified in this paper.<br />

VII. SUMMARY<br />

With over 37 million users across popular hosting sites and<br />

many different types of deployment scenarios, open source use<br />

continues to grow for embedded systems development,<br />

presenting technical, security, and support challenges that are<br />

different from traditional proprietary software packages. By<br />

identifying issues up front, following best practices for use, and<br />

running tests against the packages before integration, these<br />

challenges can be mitigated. By adopting a level of support that<br />

is timely and has the necessary package expertise, overall risks<br />

can be minimized.<br />

13 www.owasp.org/index.php/OWASP_Embedded_Application_Security#tab=Embedded_Top_10_Best_Practices<br />



Developing Safety Autonomous Driving Solutions<br />

Based on the Adaptive AUTOSAR Standard<br />

Leo Hendrawan<br />

Senior Member Technical Staff – Customer Support<br />

Wind River System, Germany<br />

Andrei Kholodnyi<br />

Senior Architect – CTO Office<br />

Wind River System, Germany<br />

ABSTRACT<br />

Since the first release of its standard in 2003, AUTOSAR[1]<br />

has established itself as one of the primary software development<br />

standards for the global automotive industry. As the automotive<br />

industry is now facing some of its greatest opportunities and<br />

challenges from the prospect of autonomous driving, new<br />

standards are needed to handle the complexity regarding<br />

software architecture for controlling the increasing number of<br />

E/E contents in the autonomous vehicle. The recent advent of the<br />

Adaptive AUTOSAR standard can help accommodate the<br />

extensive and complex requirements of autonomous driving by<br />

enabling a flexible, dynamic, and service-oriented platform while<br />

still complying with stringent functional safety<br />

standards and properly engaging with established platforms.<br />

The standard itself builds on technologies and standards which<br />

are already established in the industry, such as multi-core high-end<br />

processors with MMU support, high-speed Ethernet<br />

connectivity, hypervisor/ virtualization, POSIX PSE51, C++11<br />

for application development, ISO26262/ASIL compliance, etc.<br />

This presentation provides an example of an Adaptive<br />

AUTOSAR implementation based on VxWorks® RTOS from<br />

Wind River. As one of the very few solutions available on the<br />

market which is already fulfilling the requirements described<br />

above, VxWorks is a strong example of a foundational software<br />

platform for Adaptive AUTOSAR-based autonomous driving<br />

development. We will also explain how VxWorks<br />

features/profiles for Safety, Security, Connectivity, and Device<br />

Management fit the basic components of Adaptive AUTOSAR<br />

standard.<br />

Keywords— Autonomous Driving, Adaptive AUTOSAR, POSIX<br />

PSE51, VxWorks, Safety, ISO26262<br />

I. INTRODUCTION<br />

Thanks to a sustained industry push over the past several years, not<br />

only the electric car but also the self-driving car has become a<br />

near-term reality rather than something that exists only in<br />

science-fiction movies. However, this does not come without<br />

challenges. It is estimated that an autonomous car will generate<br />

around 4,000 GB (4 terabytes) of data per day in the future [2],<br />

coming from various sensors (cameras, LIDAR, radars, etc.) and<br />

high-speed communication links (5G, V2X, etc.). The complexity of<br />

processing and managing these data is therefore growing<br />

exponentially.<br />

The (classic) AUTOSAR (AUTomotive Open System<br />

ARchitecture) standard has become the de-facto software<br />

standard in the automotive industry for embedded software<br />

applications in ECUs over the last decade. However, its<br />

implementation still lacks the versatility needed for the<br />

complexity of connected, autonomous driving applications.<br />

The AUTOSAR consortium has therefore come up with<br />

the new Adaptive AUTOSAR platform to accommodate the<br />

challenges of implementing connected and autonomous vehicle<br />

applications while also bridging the classic AUTOSAR and<br />

infotainment applications in the vehicle.<br />

Furthermore, safety becomes the key issue in the<br />

implementation of the new standard, as driving applications<br />

concern human life.<br />

II. ADAPTIVE AUTOSAR<br />

Adaptive AUTOSAR was proposed by the AUTOSAR<br />

consortium in 2017. The main goal is to define a software<br />

standard for advanced driving assistance applications by<br />

offering a high degree of flexibility and modularity in the form<br />

of a service-oriented architecture. While the classic AUTOSAR standard is<br />

well defined for static and efficient implementation of<br />

application on top of microcontrollers and standard<br />

communication channel such as CAN bus, Adaptive<br />

AUTOSAR is defined on top of technologies which can cope<br />

with the high processing power and communication<br />

requirements such as multicore microprocessors, gigabit<br />

Ethernet communications, over-the-air update, etc. In order to<br />

have the flexibility between platforms and operating systems,<br />

Adaptive AUTOSAR also embraces other standards such as<br />

C++ and POSIX. As the intention is to have the same code<br />

running on top of any platform and operating systems, it is<br />

necessary to consider carefully the functional safety support of<br />

the underlying platform and operating system.<br />

Figure 1 shows the basic architecture of Adaptive<br />

AUTOSAR.<br />



A. Adaptive Applications (AA)<br />

Adaptive Applications (AA) are the applications that implement the connected and autonomous driving functionality. Each application is implemented as one or more operating-system processes, each of which may contain one or more threads and has a separate address and name space from every other application/process. To communicate with other AAs, an Adaptive Application may only use the ARA Communication Manager explicitly and no other means, such as conventional IPC (Inter-Process Communication).<br />

- REST: alternative communication management for AA based on a RESTful API.<br />

- Diagnostic: implementation of UDSonIP (Unified Diagnostic Services on Internet Protocol).<br />

- Persistency: mechanisms for storing information in non-volatile memory.<br />

- Platform Health Management: supporting fail-safe applications by means of supervision.<br />

- Update and Configuration Management: supporting<br />

flexible update of software and configurations through<br />

over-the-air updates.<br />

- Time Synchronization: offering a time-synchronization mechanism between applications.<br />

Fig. 1. Basic Adaptive AUTOSAR Architecture [3]<br />

B. AUTOSAR Runtime for Adaptive Applications (ARA)<br />

The AUTOSAR Runtime for Adaptive Applications<br />

(ARA) is an abstraction layer for the underlying hardware and<br />

operating system which is called Adaptive Platform (AP). The<br />

ARA abstraction layer is comparable to the AUTOSAR RTE<br />

(Run Time Environment) of the classic AUTOSAR. ARA<br />

provides standard C++ (or other language support in the<br />

future) interfaces to the Adaptive Platform, which consists of a collection of Functional Clusters.<br />

C. Adaptive Platform Foundation and Adaptive Platform<br />

Services<br />

The Functional Clusters of the Adaptive Platform can be categorized into two main groups: the Adaptive Platform Foundation, providing the fundamental functionality of the Adaptive Platform, and the Adaptive Platform Services, providing the standard services of the AP. From the Adaptive Application (AA) point of view, however, the two are almost indistinguishable due to the standard C++ interfaces.<br />

The common/basic Adaptive Platform Foundation and Services are [3]:<br />

- Execution Management: managing platform and application execution (based on the Machine and Application Manifests).<br />

- Communication Management: managing communication between Adaptive Applications, either in a service-oriented manner or via static language/network bindings.<br />

III. SAFETY COMPLIANT OS FOR ADAPTIVE AUTOSAR<br />

As mentioned earlier, the Adaptive AUTOSAR standard aims at a high degree of portability. It is therefore important for users to select the underlying platform and operating system carefully to ensure the required functional safety capabilities. The international functional safety standard for road vehicles is ISO 26262, which defines the Automotive Safety Integrity Levels (ASIL), ranging from level A (lowest) to D (highest). As the safety concept of autonomous driving is still evolving, the automotive industry can draw on safety-related concepts already established in other industries. Taking the example of the VxWorks RTOS (Real Time Operating System), a well-established COTS certifiable solution, the following features help users implement safety-critical applications for autonomous driving:<br />

A. Real Time Process (RTP) with Time and Space Partition<br />

Scheduling<br />

The VxWorks RTOS kernel gives application processes pre-emptive scheduling, complemented by time-partition and core/CPU-affinity policies. With pre-emptive scheduling, critical applications, which are usually implemented as high-priority tasks, get a predictable response time, ensuring the safety of the system. Time partitioning guarantees that RTP tasks have access to the CPU in their specified time windows. Figure 2 illustrates how the time-partitioning scheduler works in the VxWorks 7 Safety Profile. CPU affinity avoids migrating a task between cores during execution, again ensuring predictability.<br />
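As an informal illustration of the idea (this is not VxWorks code; the partition names and window lengths below are invented for the example), a time-partition schedule can be modelled as a repeating major frame of fixed windows, each owned by one partition:<br />

```cpp
#include <cstdint>
#include <string>
#include <vector>

// One entry of an illustrative time-partition schedule: the named
// partition owns the CPU for `duration_ticks` within the major frame.
struct Window {
    std::string partition;
    std::uint32_t duration_ticks;
};

// Returns the partition that owns the CPU at `tick`. The schedule
// repeats every major frame (the sum of all window durations).
std::string OwnerAt(const std::vector<Window>& schedule, std::uint32_t tick) {
    std::uint32_t frame = 0;
    for (const auto& w : schedule) frame += w.duration_ticks;
    std::uint32_t t = tick % frame;
    for (const auto& w : schedule) {
        if (t < w.duration_ticks) return w.partition;
        t -= w.duration_ticks;
    }
    return "";  // unreachable for a non-empty schedule
}
```

Because the window boundaries are fixed, a misbehaving partition cannot starve the others: it loses the CPU when its window ends, regardless of its own behaviour.<br />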

B. Resource Access Control<br />

To prevent a malfunctioning task from damaging the system and putting it into an unsafe state, it is necessary to control all resources available in the system, such as memory, objects (shared memory, message queues, semaphores, etc.), and even system calls. The VxWorks 7 Safety Profile supports this with hard-coded data structures that explicitly define the access control for each resource that needs to be protected.<br />
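The principle of statically defined access control can be pictured as a compile-time, deny-by-default table mapping each task to the resources it may use. The sketch below is purely illustrative: the task and resource names are invented, and the actual VxWorks 7 Safety Profile data structures differ.<br />

```cpp
#include <algorithm>
#include <string>
#include <vector>

// Illustrative static access-control entry: one task and the resources
// it is explicitly allowed to use. Names are invented for the example.
struct AccessEntry {
    std::string task;
    std::vector<std::string> allowed_resources;
};

// Hard-coded table, fixed at build time rather than built up at runtime.
const std::vector<AccessEntry> kAccessTable = {
    {"brakeControl", {"msgQ_brake", "shm_sensor"}},
    {"logging",      {"msgQ_log"}},
};

// Deny-by-default check: a task may use a resource only if the pair is
// present in the hard-coded table.
bool MayAccess(const std::string& task, const std::string& resource) {
    for (const auto& e : kAccessTable) {
        if (e.task != task) continue;
        return std::find(e.allowed_resources.begin(),
                         e.allowed_resources.end(),
                         resource) != e.allowed_resources.end();
    }
    return false;
}
```

Since the table is immutable, even a compromised task cannot grant itself access to resources it was not assigned at design time.<br />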



Fig. 3. helloAdaptiveWorld basic ara::com example<br />

In practice, the well-defined abstraction layer of Adaptive AUTOSAR also makes it possible to use multiple operating systems on multicore hardware with a hypervisor. Critical applications with hard real-time requirements can then run on top of a safety-certified operating system, while non-critical applications run on top of another operating system. Figure 4 illustrates what such a solution might look like, with the VxWorks 7 RTOS and the Linux operating system running on multicore hardware.<br />

Fig. 2. VxWorks Time Partitioning Scheduler Example [4]<br />

C. Support of Certified Hardware Platform and Software<br />

Tools<br />

The implementation of safety-critical applications also requires functional-safety-compliant software running on a safety-compliant hardware platform. Using appropriate software tools and development standards further improves confidence when developing safety-relevant applications. One of the common development standards used for automotive application development is Automotive SPICE (Software Process Improvement and Capability Determination). Automotive SPICE is used, for example, in the development of the DIAB compiler, one of the compiler tools of the VxWorks RTOS.<br />

IV. IMPLEMENTATION OF ADAPTIVE AUTOSAR ON VXWORKS 7<br />

The goal of Adaptive AUTOSAR is a high degree of flexibility and portability. Two key standard components are required for this: C++ and POSIX. As VxWorks 7 supports both standards, running the ARA stack on VxWorks is straightforward. Examples such as the basic ara::com helloAdaptiveWorld are already running on multiple hardware platforms. Figure 3 illustrates the basic helloAdaptiveWorld example.<br />
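To give a flavour of the service-oriented pattern, the sketch below is a hand-written stand-in, not the generated ara::com API: class and event names such as HelloServiceSkeleton are invented, whereas real proxy and skeleton classes are generated from the AUTOSAR service interface description. The skeleton side publishes an event and the proxy side consumes it through a registered handler.<br />

```cpp
#include <cstdint>
#include <functional>
#include <utility>
#include <vector>

// Hand-written stand-in for an ara::com-style event channel.
class HelloServiceSkeleton {
public:
    using Handler = std::function<void(std::uint32_t)>;

    // Proxy side: subscribe to the event with a reception handler.
    void Subscribe(Handler h) { handlers_.push_back(std::move(h)); }

    // Skeleton side: publish a new sample to every subscriber. This is
    // the only channel an Adaptive Application uses to reach another AA;
    // no raw IPC is involved at the application level.
    void Send(std::uint32_t value) {
        for (const auto& h : handlers_) h(value);
    }

private:
    std::vector<Handler> handlers_;
};

// Proxy-side application logic: remember the latest received sample.
std::uint32_t last_sample = 0;

void ConnectProxy(HelloServiceSkeleton& service) {
    service.Subscribe([](std::uint32_t v) { last_sample = v; });
}
```

Because the application touches only the communication-management interface, the same application code can be rebuilt unchanged on any platform that provides the ARA stack.<br />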

Fig. 4. Multiple OS Adaptive AUTOSAR Implementation<br />

V. CONCLUSIONS<br />

Adaptive AUTOSAR is defined to cope with the challenging requirements of implementing complex connected and autonomous vehicle applications. As it offers a high degree of flexibility, proven safety-compliant solutions must be considered for the underlying layers (the operating system) to ensure the success of its deployment.<br />

REFERENCES<br />

[1] www.autosar.org<br />

[2] B. Krzanich, “Data Is The New Oil In The Future of Automated<br />

Driving”. Retrieved November 2016, from:<br />

https://newsroom.intel.com/editorials/krzanich-the-future-of-automated-driving/<br />

[3] “Explanations of Adaptive Platform Design”. March 2017. AUTOSAR.<br />

[4] “RTP Time Partition Scheduling” Retrieved December 2017 from:<br />

https://knowledge.windriver.com/en-us/000_Products/000/020/000/020/020/000_Programmer's_Guide%2C_<br />

Edition_17/0F0/050<br />



Could Virtualization be the key to reducing<br />

complexity within the automotive E/E architecture?<br />

Nicholas Ayres, Daniel Hayes<br />

DIGITS<br />

De Montfort University<br />

Leicester, United Kingdom<br />

nick.ayres@dmu.ac.uk, daniel.hayes@dmu.ac.uk<br />

Dr Lipika Deka, Dr Benjamin N. Passow<br />

DIGITS<br />

De Montfort University<br />

Leicester, United Kingdom<br />

lipika.deka@dmu.ac.uk, benpassow@ieee.org<br />

Abstract—The vehicle embedded system also known as the<br />

electronic control unit (ECU) has transformed the humble motor<br />

car making it more efficient, environmentally friendly and safer,<br />

but has led to a system which is highly complex. The modern<br />

motor vehicle's electronic/electrical (E/E) architecture has become<br />

one of the most software-intensive machines we use in our day to<br />

day lives. As new technologies such as vehicle autonomy and<br />

connectivity are introduced and new features are added to<br />

existing Advanced Driver Assistance Systems (ADAS), an<br />

increase in overall complexity will no doubt continue. To address<br />

these future challenges the motor vehicle will require a radically<br />

new approach to the current E/E architecture. Virtualization has<br />

had a resurgence, transforming data centers and facilitating huge growth in cloud storage; as such, it can effectively address the increasing complexity of the vehicle E/E architecture. Converting a hardware- and software-based ECU into a virtual environment transforms it into a virtualized ECU (VCU), exploiting some of the major benefits of a virtualized environment.<br />

Keywords—Embedded System, ECU, Virtualization, E/E<br />

Architecture<br />

I. INTRODUCTION<br />

Since the introduction of the electronic control unit (ECU), the modern motor car can no longer be considered a solely mechanical device. ECUs, of which there are in excess of 70 [1], [2] embedded in the modern motor car, monitor and control a wide range of software-based functions and applications. This software incorporates over 100 million lines of code [3] responsible for sending and receiving data to and from numerous sensors and actuators, often with real-time constraints, across several automotive domains. In automotive terms, a domain is a “means to group mechanical and electronic systems” [4] connected over multiple in-vehicle networks such as CAN, LIN and MOST. As more tasks and functions come under the umbrella of the automotive electronic/electrical (E/E) architecture, there has been a huge growth in the number of ECUs deployed throughout the car, resulting in a highly decentralized, rigid and complex system.<br />

Virtualization, the creation of a virtual version of a device or resource, could provide the automotive E/E architecture with a number of key benefits, including flexibility, availability, scalability and security. Applied in an automotive context, these could address the overall increasing complexity of the E/E architecture. This paper explores some of the main potential benefits virtualization could provide and how complexity in the E/E architecture can be not only addressed but reduced.<br />

II. BACKGROUND<br />

Since Karl Benz built what is considered the first modern motor vehicle, the Benz Patent-Motorwagen, in 1886 [5], the humble car has been transformed, not just in looks but in function. 1977 saw General Motors release the Oldsmobile Toronado, regarded as the first car to include an electronic control unit (ECU); this first implementation managed the electronic spark timing [6] of the combustion process. Since ECUs were introduced, software has become an integral part of the motor car, much like any mechanical component that aids in its function and operation. ECUs benefit the driver with a safer, more efficient and more comfortable ride, but benefits can also be seen with regard to the vehicle itself, such as lower CO2 emissions, reduced mechanical wear and higher efficiency in operation. Vehicle systems are no longer mechanically linked together, but rather consist of software-driven hardware connected between driver input and vehicle output. Gone are the days when depressing the accelerator pedal would simply propel the vehicle into motion. In a typical modern motor vehicle, every time the accelerator pedal is depressed a whole plethora of electronic tasks is initiated; software algorithms ensure that parameters such as ignition timing, air-to-fuel ratios, temperatures and pressures are all kept at an optimum, ensuring that the vehicle accelerates as efficiently as possible [7].<br />

Autonomous technologies currently being developed, and in some cases deployed, in a number of makes and models of road vehicles form the safety-driven advanced driver assistance system (ADAS). These safety systems include technologies such as park assist, adaptive cruise control, and lane keeping and departure assist. Facilitating this new technology will require a substantial increase in hardware, software and network communication, putting more complexity and pressure on the already encumbered E/E architecture; it is clear that a new approach is required to tackle the inherent complexity of the E/E architecture.<br />

III. MOTIVATION FOR CHANGE, PAST & PRESENT<br />

The motor car has had to adapt from generation to<br />

generation in order to meet the challenges of new vehicle-based technology. Since their introduction, ECUs were<br />

connected to sensors and actuators using a point-to-point (P2P) wiring system, routing these dedicated wired connections through the vehicle's existing wiring harness, as shown in Fig. 1 below. As each original equipment manufacturer (OEM) began to incorporate more and more technology into the running and monitoring of their vehicles, the wiring harnesses became costly, unwieldy and overly complex.<br />

Fig. 1. ECU P2P Connection compared to CANBus Network<br />

A new method of connection-based information interchange was required to address this complexity. To cope with the issues surrounding point-to-point connections within early E/E architectures, the controller area network (CANbus) was developed in the mid-1980s by Bosch and became the standard in-vehicle communication technology in almost every modern motor vehicle.<br />

With the increasing use of ECU-based systems within the vehicle, development costs have escalated to the point where 30-35% of a vehicle's total cost is associated with its electronics and software [8]. The past and projected overall cost can be seen in Fig. 2 [9]. Again the motor industry rose to this challenge, and in 2003 the AUTomotive Open System ARchitecture (AUTOSAR) consortium was formed to provide a standardized framework promoting software reusability in an attempt to lower automotive software development costs.<br />

Fig. 2. Automotive E/E Cost vs Overall Vehicle Cost<br />

In an effort to reduce the number of ECUs and associated hardware, the "car of the future" will incorporate "centralized, multifunctional, multipurpose hardware" [10] that is less reliant on an increasing number of ECUs, sensors, actuators and communication media. System on chip (SoC) and multiprocessor system on chip (MPSoC) technologies are being introduced into the E/E architecture, often incorporating one ECU operation or function per core [11] and consolidating many low-level, individual ECUs. Reducing the number of ECUs with MPSoC/SoC could, however, be seen as only a temporary solution given new, more complex large-scale system integration applications [12]. As new technologies such as ADAS and vehicle autonomy enter the automotive arena, complexity is and will remain a fundamental issue, and to address these future challenges the motor vehicle will require a radically new approach to the current E/E architecture.<br />

IV. THE MAIN SOURCES OF COMPLEXITY WITHIN THE<br />

MODERN AUTOMOTIVE E/E ARCHITECTURE<br />

Although the motor industry has historically met the challenges facing it with key technologies such as vehicle networking and software architecture standardization, complexity still exists within the current E/E architecture, which now faces new challenges, including:<br />

A. Increasing Numbers of Physical ECUs<br />

The number of ECUs has grown dramatically [13] in the 40 years since their first introduction in 1977, and this trend is set to continue over the coming decades [10]. As more legacy features and functions move from mechanical to electronic control, and with the addition of new ADAS and vehicle-autonomy subsystems, this number is inevitably set to rise, bringing an associated rise in development and individual component costs, weight, required operating and application software, and network traffic.<br />

B. Decentralized E/E Architecture<br />

ECUs control a vast and different array of vehicle tasks<br />

and functions over four key functional domains including<br />

chassis, powertrain, comfort, and infotainment [14], [15].<br />

Vehicle functional domains represent a logical distribution of<br />

ECU hardware and in-vehicle functions throughout the<br />

vehicle. ECUs are typically located near the components they control or monitor, but many functions and operations are distributed over a number of ECUs across several domains, with communication between them achieved through in-vehicle networks. As more functions and features become available, ECUs increasingly require data from sensors and ECUs in other functional domains, which has led to a highly decentralized E/E vehicle architecture.<br />

C. Multiple In-vehicle Networks<br />

As the number of ECUs has grown from model to model over the years, the increasing amount of network traffic on the primary CANbus network has become an issue, as it struggles to cope with the volume and variety of applications generating network traffic [16]. This has resulted in multiple in-vehicle network media and protocols co-existing to handle the various types of in-vehicle communication. The local interconnect network (LIN) is a limited-node, small-bandwidth protocol which supports simple, non-critical, low-priority ECUs such as climate control and seat and wing-mirror position motors. In contrast, the media orientated systems transport (MOST) network has been designed and<br />



implemented to handle the high requirements of streaming video, voice, and other data suited to infotainment and hi-fidelity systems, which require a high-speed network.<br />

D. Embedded Vehicle Software<br />

Vehicle software is a major component of the modern motor car, from the ECU operating system supporting functional applications to the features consumed within the HMI device. ECUs within the E/E architecture perform over 2000 individual vehicle-related functions [17], from engine management to passenger comfort. As more ECUs are introduced into the E/E architecture, ever more lines of software code will inevitably be required to drive those embedded systems. As more and more directly linked mechanical functions are replaced with ECU functionality, their software inevitably has to interact with multiple sensors, actuators, and other ECUs, often across multiple domains of responsibility, increasing overall complexity. Over the coming decade, the modern motor vehicle will see an influx of new safety features, ADAS and vehicle autonomy. The autonomous motor vehicle has been hailed as offering a wealth of benefits not just to our day-to-day lives but to society in general, from making our streets safer through a reduction in traffic accidents to greater access to independent mobility solutions, including for the elderly and non-car owners/drivers [18]. The amount of embedded software, as well as the data generated to support these new systems, is set to increase dramatically.<br />

The scope for even more lines of software code and E/E complexity is rapidly becoming a pressing issue, one which requires a robust and secure strategy to ensure that the software within the motor car is correct, up to date and, above all, safe and secure.<br />

E. Embedded Software Updates<br />

ECUs and their associated functions are governed by software which is often designed and written months or even years before the vehicle is driven off the sales forecourt. ECU software is preloaded during the manufacturing process and is in many cases fixed to the specific hardware it has been written for. Software is now a critical component of the modern motor car, but all software code is vulnerable to errors, especially during the design, coding and implementation stages of the vehicle's development process, which must be addressed when discovered. If software errors are not addressed, they may expose the OEM or supplier to liability if something goes wrong due to an inherent flaw in their software. Once a flaw has been discovered and rectified, the new, updated code needs to be deployed and installed on the target vehicle or vehicles in a manner that is safe, secure and minimally disruptive for the customer, but also cost-effective for the OEM and supplier.<br />

There is no doubt that today's modern car has technologically advanced into a highly complex device [19]. The vehicle's E/E architecture has become one of the most hardware- and software-intensive machines we use in our day-to-day lives. There have been numerous initiatives, frameworks and standards to address this increasing complexity, but it is clear that a new approach is required before the E/E architecture reaches saturation point.<br />

V. ECU VIRTUALIZATION<br />

Virtualization is a technology which could address the fundamental challenges concerning the modern automotive E/E architecture. A classic example of where virtualization technology has not just enhanced but transformed an industry is the traditional datacenter: these used a model that relied on individual and often underutilized servers dedicated to a particular role or task within an organization, and as new tasks and roles were introduced, new dedicated hardware and software was typically installed. The modern motor car's E/E architecture is very much in tune with the traditional data center model, and many comparisons can be drawn: vehicle ECUs provide hardware and software resources to their dedicated clients, actuators and sensors, acting very much like individual servers. Virtualization technology can be applied to ECU functions, converting them to virtual instances and transforming an ECU into a virtual electronic control unit (VCU); see Fig. 3 below.<br />

Fig. 3: Automotive VCU Based System<br />

These VCUs could not only provide similar functionality to their hardware-based counterparts but also additional benefits: virtualization could enhance the vehicle E/E architecture with improvements in flexibility, availability, scalability, utilization, security, and software, as detailed below.<br />

A. Flexibility<br />

Hardware resources can be modified, often 'on the fly', to meet the peak demands of the system. In stark contrast, physical ECUs have their hardware fixed at design and subsequent manufacture, so that only the hardware required to run their embedded software is included. A virtual system can adapt to the current situation and provide additional resources as and when they are required, especially when the original software is upgraded or replaced.<br />

B. Consolidation<br />

It is clear from the modern data center that consolidation<br />

has played a large part in reducing individual bespoke servers as well as lowering the energy used to power those devices. Automotive virtualization not only consolidates but also addresses the decentralized nature of distributed ECUs, with the associated benefits for physical accessibility, maintenance, and replacement.<br />

C. Scalability<br />

A VCU is a software-based system and as such can be<br />

changed if an error in the code is discovered or additional code<br />

is added to provide additional functionality. The system is able<br />

to increase available memory, CPU cores, and other required<br />

resources to meet the demands of the newly updated system.<br />

Currently, ECU code that has been modified to correct a flaw or to provide additional functions or features may not scale to the underlying hardware, producing a reduction in overall performance.<br />

D. Utilization<br />

Physical ECUs have their hardware fixed at design and subsequent manufacture, so that only the hardware required to run their embedded software is included; given the large numbers of vehicles produced, this keeps ECU hardware costs to a minimum. In contrast, a system designed to run multiple virtual machines has a large pool of resources available to cope with peak workloads, and these virtual resources can be allocated as and when required.<br />

E. Security<br />

Virtualization provides a separation of services whereby<br />

these services can be separated into individual virtual machines<br />

[20]. If a service running in a particular VM becomes compromised for any reason, it will not directly affect other VMs on the same system. Critical vehicle functions can not only be segregated into their own VMs but can also be run on a dedicated system, whether that pertains to a particular domain or a common vehicle function.<br />

F. Software<br />

History has shown that with any software there is always<br />

the need to periodically update the code to fix previously<br />

undiscovered bugs and vulnerabilities or offer new features and<br />

enhancements for the consumer. Many embedded systems in a<br />

motor vehicle do not allow or provide any form of mechanism<br />

to update embedded software. Such embedded system code is fixed during production, making it more secure but isolated when it comes to any future software updates. A virtualized environment can be much more accessible, as a VM is, in essence, a stored software image containing all the files required for its operating system, applications, and overall configuration. VM images are held in some form of permanent storage; providing access to this centralized storage medium allows new software to be deployed easily, replacing faulty or obsolete VMs and their corresponding VCUs with updated code.<br />

Although virtualization offers many benefits, it does have drawbacks. One example is overhead: latency is introduced into the system by the additional layers of abstraction between the application software and the underlying hardware, which matters especially under real-time constraints, but this can be reduced with hardware assist. A single point of failure (SPOF) is also a concern, as multiple VMs operate on a single device, but secondary redundant systems can provide an increased level of redundancy.<br />

Virtualization has already begun to enter the automotive domain, but primarily within the human-machine interface (HMI) unit, which provides the vehicle occupants with infotainment services as well as key functions and features for controlling or adjusting vehicle parameters. Infotainment systems are often coupled with some form of connectivity mechanism, either built into the vehicle or provided by mirroring a smart device such as a smartphone. Although access to these different services is through the same HMI device, a clear separation of access has to be in place to provide not just shared functionality but, more importantly, security from external access and threats to the interconnected underlying critical systems.<br />

VI. CONCLUSION<br />

In summary, virtualization can bring many benefits to the E/E architecture as well as address aspects of its overall complexity. Virtualization is not a complete panacea; it has disadvantages, including overhead and SPOF, but some of its main drawbacks are being addressed. As ADAS and vehicle autonomy become mainstream technologies in upcoming vehicle makes and models, the challenges surrounding E/E architecture complexity must be addressed. Virtualization is one technology that can not only meet these challenges but also add further benefits, especially with regard to embedded software. The data center environment has benefitted vastly from virtualization, and many parallels can be drawn from that industry when applying it in an automotive context. ECUs can be consolidated onto centralized, high-specification yet redundant hardware which is flexible and scalable as and when the system demands.<br />

REFERENCES<br />

[1] S. Fürst, "AUTOSAR Adaptive Platform for Connected and Autonomous Vehicles," 1st December 2015. [Online]. Available: https://www.autosar.org/fileadmin/files/presentations/AUTOSAR_Adaptive_Platform_FUERST_Simon.pdf. [Accessed 11 October 2016].<br />

[2] J. A. Cook, "Control, Computing and Communications: Technologies<br />

for the Twenty-First Century Model T," IEEE, special issue on<br />

automotive power electronics and motor drives, vol. 95, pp. 334-355,<br />

2007.<br />

[3] A. Sangiovanni-Vincentelli and M. Di Natale, "Embedded system<br />

design for automotive applications," Computer 40, pp. 42-51, 2007.<br />

[4] F. Simonot-Lion and Y. Trinquet, "Vehicle Functional Domains and<br />

Their Requirements," in Automotive Embedded Systems handbook,<br />

Boca Raton, CRC Press, 2009, pp. 22-43.<br />

[5] S. J. C. Nixon, The Invention of the Automobile, Country Life, 1936.<br />

[6] R. N. Charette, "This Car Runs on Code," 1 February 2009. [Online]. Available: http://spectrum.ieee.org/transportation/systems/this-car-runs-on-code. [Accessed 13 October 2016].<br />

[7] D. Work, A. Bayen and Q. Jacobson, "Automotive Cyber Physical<br />

Systems in the Context of Human Mobility," National Workshop on<br />

High-Confidence Automotive Cyber-Physical Systems, pp. 3-4, 2008.<br />

[8] M. Shavit, A. Gryc and R. Miucic, "Firmware Update Over The Air<br />

(FOTA) for Automotive Industry," SAE, 2007.<br />

[9] R. Chitkara, W. Ballhaus, B. Kliem, S. Berings and B. Weiss, "Spotlight<br />

on Automotive," PwC, 2013.<br />



[10] M. Broy, "Challenges in Automotive Software Engineering," in<br />

Proceedings of the 28th international conference on Software<br />

engineering , Shanghai, 2006.<br />

[11] M. Urbina and R. Obermaisser, "Multi-core architecture for AUTOSAR based on virtual electronic control units," in Emerging Technologies & Factory Automation, Luxembourg City, 2015.<br />

[12] K. Suzuki, "Automotive Electronics Trend in Automotive Industry," Nikkei Automotive Technology, 28 January 2015. [Online]. Available: https://www.slideshare.net/kenjisuzuki397/car-electronization-trend-in-automotive-industry-44007679. [Accessed 16 January 2018].<br />

[13] D. Reinhardt and M. Kucera, "Domain Controlled Architecture," in<br />

Third International Conference on Pervasive and Embedded Computing<br />

and Communication Systems , Barcelona, 2013.<br />

[14] M. Strobl, M. Kucera, A. Foeldi, T. Waas, N. Balbierer and C. Hilbert,<br />

"Towards automotive virtualization," in International Conference on<br />

Applied Electronics (AE), 2013.<br />

[15] D. Reinhardt, D. Kaule and M. Kucera, "Achieving a scalable E/E-architecture using AUTOSAR and virtualization," SAE International Journal of Passenger Cars - Electronic and Electrical Systems, pp. 489-497, 2013.<br />

[16] D. Reinhardt and M. Kucera, "Domain controlled architecture A New<br />

Approach for Large Scale Software Integrated Automotive Systems," in<br />

Third International Conference on Pervasive and Embedded Computing<br />

and Communication Systems, Barcelona, 2013.<br />

[17] M. Broy, H. I. Kruger, A. Pretschner and C. Salzmann, "Engineering<br />

Automotive Software," Proceedings of the IEEE, pp. 356-373, 2007.<br />

[18] R. Ramos, "Self-Driving Vehicles -- Are We Nearly There Yet?," 10<br />

October 2016. [Online]. Available:<br />

http://www.eetimes.com/author.asp?section_id=36&doc_id=1330599&.<br />

[Accessed 11 October 2016].<br />

[19] G. de Boer, P. Engel and W. Praefcke, "Generic remote software update<br />

for vehicle ECUs using a telematics device as a gateway," in Advanced<br />

Microsystems for Automotive Applications, Berlin, Springer, 2005, pp.<br />

371-380.<br />



Cycle Approximate Simulation of RISC-V Processors<br />

Lee Moore, Duncan Graham and Simon Davidmann, Imperas Software Ltd., and<br />
Felipe Rosa, Universidade Federal do Rio Grande do Sul<br />

Abstract<br />

Historically, architectural estimation, analysis and optimization for SoCs and embedded systems have been done using manual spreadsheets, hardware emulators, FPGA prototypes, or cycle approximate and cycle accurate simulators. The precision of the latter comes at the cost of performance and modeling flexibility. Instruction accurate simulation models in virtual platforms have the speed necessary to cover the range of system scenarios, can be available much earlier in the project, and are typically an order of magnitude less expensive than cycle approximate or cycle accurate simulators. Previously, because of a lack of timing information, virtual platforms could not be used for timing estimation. We report here on a technique for dynamically annotating timing information onto the instruction accurate simulation results. This has achieved an accuracy of better than +/-10%, which is appropriate for early design architectural exploration and system analysis. This Instruction Accurate + Estimation (IA+E) approach is constructed using Open Virtual Platforms (OVP) processor models plus a library that can introspect the running system and calculate an estimate for the cycles taken to execute the current instruction. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can annotate calculated instruction cycle timing back into the simulation and affect the timing of the simulation.<br />

Introduction<br />

Performance and power consumption are two key attributes of any SoC and<br />

embedded system. Systems often have hard timing requirements that must be met, for<br />

example in safety critical systems where reaction time is of paramount importance.<br />

Other systems, particularly battery powered systems, have power consumption<br />

limitations.<br />

Because of the importance of these characteristics, many techniques have been<br />

developed for estimation of performance and power consumption. Recently, with the<br />

explosion of system scenarios that must be considered, this job has become much<br />

more difficult.<br />

Instruction accurate simulation has previously not been considered as a potential<br />

technique for timing and power estimation, because it is instruction accurate and does<br />

not model processor microarchitecture details: there is no information about timing or<br />

power consumption of instructions and actions in instruction accurate models and<br />

simulators. Recently some universities, using the Open Virtual Platforms (OVP)<br />

models and OVPsim simulator [1], have experimented with adding this information<br />

into the instruction accurate simulation environment as libraries, with no changes to<br />

the models or simulation engines [2]. These efforts have shown great promise, with<br />



timing estimation results within +/- 10% of the actual timing results for the hardware<br />

for limited cases.<br />

We report here on the further development of this technique, and the extension of this<br />

technique for RISC-V ISA based processors. This is critical for the RISC-V<br />

ecosystem, since for RISC-V semiconductor vendors to win embedded system sockets,<br />

their customers are going to want to know about the timing and power consumption of<br />

those SoCs when running different application software.<br />

Current State of the Art<br />

Historically, SoC architectural estimation, analysis and optimization has been done using manual spreadsheets, hardware emulators, FPGA prototypes, cycle approximate simulators, or cycle accurate and performance simulators such as Gem5 [3]. These all have significant drawbacks: insufficient accuracy, high cost, dependence on RTL availability (meaning that the technique is only usable later in the project, when the RTL design is complete), low performance, limited ability to support a wide range of system scenarios, or great complexity in obtaining good results. Table 1 summarizes the strengths and weaknesses of each technique.<br />

Technique | Strength | Weaknesses<br />
Manual spreadsheets | Ease of use | Lack of accuracy; inability to support estimations with real software<br />
Hardware emulators | Cycle accurate | High cost (millions USD); needs RTL; < 5 MIPS performance<br />
FPGA prototypes | Cycle accurate | High cost (hundreds of thousands USD); needs RTL<br />
Cycle approximate simulation | Good performance | Lack of accuracy; lack of availability of models<br />
Cycle accurate simulation | Cycle accurate | High cost (hundreds of thousands of USD); lack of availability of models<br />
Gem5 | Microarchitectural detail | A lot of work to develop a model of a specific microarchitecture and to get realistic traces of the SoC<br />

Table 1. Strengths and weaknesses of currently used techniques for timing and power estimation.<br />

Instruction Accurate Simulation<br />

Instruction set simulators (ISSs) have long been used by software engineers as a<br />

vehicle for software development. Over the last 20 years, this technique has been<br />

extended to support not only modeling of the processor core, but also modeling of the<br />

peripherals and other components on the SoC. The advantages of these simulators are<br />

their performance, typically hundreds of millions of instructions per second (MIPS),<br />

and the relative ease of building the necessary models. However, the simulator<br />

engines and models are instruction accurate, and are not built to support timing and<br />

power estimation.<br />



The performance of these simulators comes from the use of Just-In-Time (JIT) binary translation engines, which translate the instructions of the target processor (e.g. Arm) to instructions on the host x86 PC. This enables users to run the same executables on the instruction accurate simulator as on the real hardware, such that the software cannot tell that it is not running on hardware. Peak performance with these simulators can reach billions of instructions per second. A more typical use case, such as booting SMP Linux on a multicore Arm processor, takes less than 10 seconds on a desktop x86 machine.<br />
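As a sketch of what "instruction accurate" means in practice, the loop below steps a hypothetical two-instruction toy ISA one instruction at a time, advancing only architectural state with no notion of cycles. The opcodes and register file are invented for illustration and bear no relation to the OVP or Imperas APIs.

```c
#include <stdint.h>
#include <stddef.h>

/* Hypothetical 3-byte toy ISA: 0x01 rd, imm -> load immediate;
   0x02 rd, rs -> add register. Purely illustrative. */
typedef struct {
    uint32_t pc;
    uint32_t reg[8];
} cpu_t;

/* Execute one instruction; instruction accurate: architectural state
   only, no cycle or timing model. */
static void step(cpu_t *cpu, const uint8_t *mem)
{
    uint8_t op = mem[cpu->pc];
    uint8_t a  = mem[cpu->pc + 1];
    uint8_t b  = mem[cpu->pc + 2];
    switch (op) {
    case 0x01: cpu->reg[a] = b;            break; /* li  rd, imm */
    case 0x02: cpu->reg[a] += cpu->reg[b]; break; /* add rd, rs  */
    }
    cpu->pc += 3;
}

/* Run a fixed number of instructions; return register 0. */
uint32_t run(cpu_t *cpu, const uint8_t *mem, size_t ninstr)
{
    for (size_t i = 0; i < ninstr; i++)
        step(cpu, mem);
    return cpu->reg[0];
}
```

A JIT engine replaces the `switch` dispatch with translated host code, but the observable architectural behavior is the same.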

There are also significant libraries of models available, and it is easier to build<br />

instruction accurate models than models with timing or power consumption<br />

information, or real implementation details. One such library and modeling<br />

technology is available from OVP. The OVP processor model library includes<br />

models of over 200 separate processors (e.g. Arm, MIPS, Power, Renesas, RISC-V),<br />

plus a similar number of peripheral models. Most of these models are available as<br />

open source. The C APIs for building these models are also freely available as an<br />

open standard from OVP.<br />

Instruction Accurate Simulation Plus Estimation<br />

Instruction accurate simulation holds the promise of faster simulation performance to support examination of more system scenarios, plus lower cost and earlier availability. With the Imperas APIs and dynamic model introspection, it is easy to add timing and power estimation capabilities into the instruction accurate simulation environment. The approach of adding these capabilities as libraries combines the annotation techniques and binary interception libraries used with JIT simulation engines. Annotation techniques can be imagined as a full instruction trace which is then annotated with the timing or power information. However, annotation alone requires significant host PC memory and can slow the simulation.<br />

Binary interception libraries are used with the Imperas JIT simulators to enable the<br />

non-intrusive addition of tools, such as code coverage and profiling, to the simulation<br />

environment. Combining these techniques maintains the high simulator performance<br />

with minimal memory costs. This combined technique is being called Instruction<br />

Accurate + Estimation (IA+E).<br />

In the Imperas simulation products, which require the use of OVP models, it is possible to create a standalone library module with entry points that are called when instructions are executed. This library can introspect the running system and calculate an estimate for the cycles taken to execute the current instruction, taking into account the overhead of different memory and peripheral component latencies. Not only can these add-on libraries dynamically inspect the running system and estimate timing effects, they can annotate calculated instruction cycle timing back into the simulation and affect (i.e. stretch) the timing of the simulation. An overview of the simulation architecture is shown in Figure 1.<br />



Figure 1. Overview of the Imperas IA+E simulation environment.<br />

For processors, the instruction estimation algorithm includes:<br />

• a mixture of table look ups for simple instructions<br />

• dynamic calculations for data dependent instructions<br />

• adjustments due to code branches taken<br />

• taking into account effects of memory and register accesses<br />
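The estimation steps above can be sketched as a per-instruction callback that combines a static lookup table with dynamic adjustments. The opcode classes, cycle counts and penalties below are illustrative placeholders in the spirit of Figure 2, not the actual Imperas library interface.

```c
#include <stdint.h>

/* Illustrative opcode classes with base cycle counts, as would be
   calibrated from a reference CPU datasheet (cf. Figure 2). */
enum { OP_ALU, OP_LOAD, OP_STORE, OP_BRANCH, OP_CLASSES };

static const uint32_t base_cycles[OP_CLASSES] = {
    [OP_ALU]    = 1,
    [OP_LOAD]   = 2,
    [OP_STORE]  = 2,
    [OP_BRANCH] = 1,
};

/* Estimate cycles for one executed instruction: a table lookup for the
   class, plus a dynamic penalty when a branch is actually taken and an
   extra wait-state count reported by the memory model. */
uint32_t estimate_cycles(int op_class, int branch_taken, uint32_t mem_wait)
{
    uint32_t c = base_cycles[op_class];
    if (op_class == OP_BRANCH && branch_taken)
        c += 2;                /* hypothetical taken-branch penalty */
    if (op_class == OP_LOAD || op_class == OP_STORE)
        c += mem_wait;         /* back-annotated memory latency */
    return c;
}
```

The accumulated total is what the library annotates back into the simulation to stretch simulated time.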

A view of the timing estimation mechanism is shown in Figure 2.<br />

(Load, store, branch, jump, barrier, etc.)<br />

assembly code          ISA timing information (Instr. / Cycles)<br />
bcs 25d0               bcs / 3<br />
ldr r3,[pc,#172]       ldr / 2<br />
str r3,[r7,#16]        str / 2<br />
ldr r3,[pc,#168]       ldr / 2<br />
ldr r3,[r3]            ldr / 2<br />
str r3,[r7,#20]        str / 2<br />
b 255e                 b   / 3<br />

Calibration from a reference CPU datasheet.<br />

Figure 2. Simplified view of the timing estimation mechanism.<br />

For memory subsystems and peripheral components, table lookup and dynamic estimation can be made, and timing back-annotated into the simulation to simulate the delay effects of slow memories and other components.<br />



With this Instruction Accurate + Estimation (IA+E) approach, there is a separation of processor model functionality and timing estimation. This means that while building a functional model there is no need to worry about timing or cycle complexity. Only when more detailed timing is needed is it necessary to add the extra timing data that enables the Imperas IA+E timing tools to provide cycle approximate timing simulation for RISC-V processors.<br />

This extra timing data is added in two steps. First, the cycle information is added to<br />

the library. Second, the time per cycle, which is dependent upon the specific<br />

semiconductor process and physical implementation details, is added.<br />
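The second step then reduces to scaling the accumulated cycle count by the implementation's clock period; a minimal sketch, where the clock frequency is an assumed configuration parameter supplied by the processor designer.

```c
#include <stdint.h>

/* Convert an accumulated cycle count into simulated time. clock_hz
   depends on the specific semiconductor process and physical
   implementation, so it is supplied as configuration. */
uint64_t cycles_to_ns(uint64_t cycles, uint64_t clock_hz)
{
    /* time [ns] = cycles * 1e9 / f; multiply first to keep integer
       precision for realistic clock rates. */
    return (cycles * 1000000000ULL) / clock_hz;
}
```

For example, 100 cycles at an assumed 100 MHz clock corresponds to 1000 ns of simulated time.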

The approach of providing the timing data as a separately linked dynamic program<br />

enables RISC-V processor designers to create a cycle approximate timing simulation<br />

for their specific processor implementation - without sharing any internal information.<br />

IA+E simulation runs slower than normal instruction accurate simulation, with a typical overhead of about 50%. Still, this puts IA+E simulation performance at 100-500 MIPS.<br />

IA+E does have some limitations. The technique has currently been proven only for simple single-core processors with no cache and an in-order pipeline.<br />

Results<br />

This IA+E technique was first tested with Arm Cortex-M4 based processors. The results were much better than expected, with an average estimation error within +/- 5% of the actual device. The device was an STMicroelectronics STM32F on a standard development board, running the FreeRTOS real-time operating system, with 39 different benchmark applications used. Almost all timing estimation errors were within +/- 10% of actual timing values. Figure 3 shows these results.<br />

Figure 3. Timing estimation results for IA+E simulation show average errors of<br />

better than +/- 5% over 39 different benchmarks for Arm Cortex-M4.<br />



IA+E was recently extended to support RISC-V processors, using publicly available information (from the processor vendors' data books) to build the cycle data libraries. In the data below, showing processor implementations from Andes Technology, Microsemi and SiFive, only cycle data is presented, since comparing timing across the various implementations would not be an accurate comparison. In keeping with this theme, different benchmark applications were used for each of the processors. All benchmarks were run with a range of compiler optimization settings, and estimated cycles were reported first assuming 1 cycle per instruction (i.e. using IA), then using the IA+E technique. These results are shown in Figure 4.<br />

Figure 4a. IA+E cycle estimation results for the Andes N25 processor.<br />

Figure 4b. IA+E cycle estimation results for the Microsemi Mi-V RV32IMA<br />

processor.<br />

Figure 4c. IA+E cycle estimation results for the SiFive E31 processor.<br />



Conclusions<br />

The Instruction Accurate + Estimation (IA+E) technique developed here has shown<br />

excellent results for timing estimation of in-order processors. It also has the benefits<br />

of easy model building, high performance to enable examination of multiple<br />

benchmarks and system scenarios, and lower cost than other techniques. In this paper,<br />

the IA+E technique has been extended to support RISC-V processors. Further work<br />

is needed to apply this technique to power estimation, and to more complex<br />

processors.<br />

Acknowledgements<br />

We would like to thank Andes Technology, Microsemi, and SiFive for access to their<br />

processor datasheets/databooks.<br />

References<br />

1. www.OVPworld.org<br />

2. Felipe Da Rosa, Luciano Ost, Ricardo Reis, Gilles Sassatelli. Instruction-<br />

Driven Timing CPU Model for Efficient Embedded Software Development<br />

Using OVP. ICECS: International Conference on Electronics, Circuits, and<br />

Systems, Dec 2013, Abu Dhabi, United Arab Emirates.<br />

3. Gem5, www.gem5.org<br />



Comparing Automotive Secure Gateway Design<br />

Approaches<br />

Carmelo Loiacono<br />

Field Applications Engineer<br />

Green Hills Software<br />

Turin, Italy<br />

carmelo@ghs.com<br />

Abstract— Considering the complexity of today's cars, guaranteeing their security is not an obvious task. A hacker could attack the on-board networks of the car even without physical access, and sending malicious messages to ECUs over the CAN bus can potentially compromise the safety of the vehicle. To prevent such attacks, automotive architectures introduced the Secure Gateway. Since the Secure Gateway is a complex system, bad design can compromise the security, and potentially the safety, of the car. We focus on analyzing Secure Gateway design methods, compare different design approaches, and give guidelines to guarantee the security of the whole system. Finally, we discuss the advantages of using a separation kernel and the important hardware requirements for Secure Gateways.<br />

(Figure 1 connectivity labels: Cloud (3G-LTE-GPRS), V2V (IEEE 802.11p), V2I (IEEE 802.11p), External Devices (USB-WIFI-BT))<br />

Keywords—Automotive Secure Gateway, Separation Kernel,<br />

System Security<br />

I. INTRODUCTION<br />

The vehicle and mobility industry is dealing with the trend<br />

of bringing different electronic domains onto a single platform.<br />

This leads to the challenge of enabling applications with more<br />

strict security and safety requirements to work in a trusted<br />

environment on a single platform. Vehicle internal networks are<br />

now more connected to external devices, thereby exposing the<br />

internal network to the outside world.<br />

Moreover, the evolution of Vehicle to Everything (V2X) communication has increased data exchange with external resources via Wi-Fi, 3G, and LTE networks. Automotive ECUs could be subject to external attacks that aim to control their software behavior. Such attacks arrive as data over regular communication channels (e.g. an external network) and, once resident in program memory, trigger pre-existing hardware and software vulnerabilities. By exploiting such flaws, these attacks can subvert the execution of the software and gain control over its behavior [1]. Figure 1 shows how modern cars are connected to different sources, increasing the attack surface.<br />

Secure Gateways (SGs) are used to separate the internal vehicle networks from the external one, i.e. to protect the internal communications from potential attacks coming from external sources.<br />

Fig. 1 Modern Connected Cars<br />

SGs are crucial for the security of the vehicle, so compromising the security of the SGs can compromise the security of the whole system. There are two main aspects to consider for SG security: low-level security, related to the operating system running on them, and security of the data and applications. In this paper we analyze different possible ways to design SGs with respect to security and safety aspects. We also provide suggestions and guidelines for designing secure SGs.<br />

The rest of the paper is organized as follows. Section II compares SG designs and gives design guidelines. Section III presents methods to manage I/O devices in a virtualized system. Finally, Section IV concludes the paper with summarizing remarks.<br />

II. SECURE GATEWAYS SOFTWARE DESIGN<br />

With increasing intelligence, modern vehicles are equipped with more and more sensors, such as sensors for detecting road conditions and driver fatigue, sensors for monitoring tire pressure and water temperature in the cooling system, and advanced sensors for autonomous control [2].<br />

In addition, the increasingly interconnected nature of a vehicle's control modules means there is no safety without security. Security features must cover not just physical access and protection of confidential information, but also critical safety systems.<br />

For this reason, using a SG to protect the internal network from external attacks is very important for the security and safety of the vehicle. Figure 2 shows the main software components of a SG. The Application Environment is composed of non-critical applications with no safety or security requirements. Security services are used to guarantee the confidentiality and integrity of messages exchanged between SG components or between ECUs, using symmetric and asymmetric cryptography. Indeed, SGs are mixed-criticality systems in which jobs run with different security and safety requirements. SGs should be designed to ensure that the execution of non-trusted applications does not compromise the execution of the others. This can be achieved thanks to the separation properties offered by separation kernels.<br />

A. Separation Kernels<br />

Separation Kernels offer advanced features to embedded systems software developers who need to ensure that heterogeneous software components are free from interference, protect the information flow, and reinforce the car communication system with respect to security and safety requirements. A well-designed Separation Kernel must ensure that errors within a process do not propagate through the whole system; this can be done by confining the writing space of each process to a specific memory area. The Separation Kernel consists of "compartments" named partitions, and a process runs in each of these partitions. A process running in a partition can be composed of multiple tasks (threads). Inside a partition separation is not guaranteed, whereas separation is ensured between different partitions. The key benefits of the separation kernel are the following: to act as an error container, to allow different critical processes to execute without interference on a single hardware platform, to ensure confidentiality of sensitive data, and to allow new features to be integrated without having to re-test the entire system.<br />

Operating systems that do not use separation as their foundation can enter undefined states, deadlock, and exhibit non-deterministic execution flow. This can have serious consequences, especially in the automotive field. A separation kernel designed for use in critical systems, such as SGs, must ensure that computational and memory resources are always available to each process running in a partition. Another important property of security-oriented operating systems is prevention of denial-of-service attacks. Usually, such attacks are avoided by assigning each process a fixed amount of CPU and memory resources. Moreover, the static allocation of time resources ensures that each process executes in a given time window. This preserves the integrity of the processes by preventing execution outside their temporal windows.<br />
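Such a fixed time-window scheme can be pictured as a static table mapping the position in a repeating major frame to a partition; the partition IDs and window lengths below are invented for illustration, not taken from any particular separation kernel.

```c
#include <stdint.h>

/* Static partition schedule: the major frame repeats forever and every
   window is fixed at integration time, so no partition can starve
   another of CPU time (denial-of-service containment). */
typedef struct {
    int      partition;   /* partition scheduled in this window */
    uint32_t length_us;   /* fixed window length in microseconds */
} window_t;

static const window_t schedule[] = {
    { 0, 500 },   /* e.g. network stack / gateway filtering */
    { 1, 300 },   /* e.g. security services */
    { 2, 200 },   /* e.g. non-critical applications */
};
enum { NWINDOWS = sizeof schedule / sizeof schedule[0] };

/* Return the partition that owns the CPU at time t (microseconds). */
int partition_at(uint64_t t_us)
{
    uint64_t major = 0;
    for (int i = 0; i < NWINDOWS; i++)
        major += schedule[i].length_us;

    uint64_t off = t_us % major;          /* position in major frame */
    for (int i = 0; i < NWINDOWS; i++) {
        if (off < schedule[i].length_us)
            return schedule[i].partition;
        off -= schedule[i].length_us;
    }
    return -1;                             /* unreachable */
}
```

Because the table is immutable at run time, a misbehaving partition cannot extend its own window.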

The main requirements of SGs are: real-time behavior, safety, security, reliability, and performance. The microkernel architecture, adopted by some separation kernels such as the INTEGRITY RTOS [3] from Green Hills Software, ensures that the kernel is easy to test and verify, so that it can be shown free of bugs and security holes. In microkernel architectures only basic services are part of the kernel: support for communication between partitions (IPC), virtual memory management, and scheduling. More complex services run inside the partitions, which allows for a safer and more reliable kernel. Figure 3 shows a separation architecture built on a microkernel. Notice that some services, such as file system management and device drivers, run in separate partitions, independent of the microkernel, which implements only basic functions.<br />

B. Linux on Secure Gateways<br />

Embedded Linux, with its kernel and software packages consisting of millions of lines of code, provides an attractive set of ready-made software that is also useful for SG design. Since it is virtually impossible to test millions of lines of code, it is inevitable that Linux will continue to contain security vulnerabilities and software bugs. Also, the increasingly interconnected nature of embedded systems allows hackers to exploit those vulnerabilities, sometimes even letting them perform remote attacks.<br />

When it is not possible to replace Linux with a Separation Kernel operating system, a powerful method for improving the security of SGs that run Linux is to use a hypervisor that guarantees separation between the system software components.<br />

Fig. 2 Secure Gateways Software Components<br />

Fig. 3 Separation MicroKernel Architecture<br />

A hypervisor is a layer of software below the OS that runs at a higher privilege level than the OS and virtualizes the hardware resources. Because of the higher privilege level, the integrity of the hypervisor remains intact even if the OS is compromised. A hypervisor that is designed from the ground up to be secure and reliable offers significant advantages over hardware for implementing low-level security. Also, it can provide multiple levels of privilege so that a service with sensitive data can run in an isolated "compartment", or partition, alongside a service with less sensitive information. While these different levels of security can run concurrently, they are never able to see or modify each other's data.<br />

There are other approaches to supporting multiple OS contexts than using a Separation Kernel or a classic Type-1 hypervisor. Linux Containers (LXC) [4] is a method for running multiple isolated Linux systems (containers) on a control host using a single Linux kernel. FreeBSD Jails [5] follow a similar container scheme with a BSD-compatible userland. However, containers do not address kernel-level attacks, in particular against device drivers, which run privileged on Linux and BSD systems. Containers are less secure than Separation Kernels and hypervisors because the kernel that hosts the containers has a much larger attack surface. The smaller attack surface of the latter decreases the probability that a privilege escalation attack will allow an attacker to compromise the security of a virtual machine and affect other components of the system.<br />

III. I/O DEVICE SECURITY<br />

An important aspect of the security of a virtualized system is device management. In particular, we refer to devices that will be made available to a Guest OS (Linux) on the SGs described in Section II.B. Figure 4 shows a use case where a system such as a SG uses high-speed expansion ports that permit direct memory access (DMA devices). If the Separation Kernel were to give complete control of DMA devices to the Linux Guest, the security of the whole system could be compromised. Indeed, using DMA, the Guest OS could instruct the device to read or write directly to any area of main memory, including the kernel. Unless specific protection is in place, an attacker can use such a facility to gain direct access to part or all of the physical memory address space of the system, bypassing all security mechanisms.<br />

For this reason, many modern SoCs have introduced functionality to limit the scope of what a DMA device can access: the IOMMU. The IOMMU provides a programmatic interface to define which ranges of addresses the device can access. This allows device drivers to run purely in a Separation Kernel partition, or in a Guest OS. While direct device access from the Guest is strongly discouraged where an IOMMU is not capable of protecting the system, this is a common practice, taken as a compromise for the sake of either maintainability or time-to-market.<br />
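Conceptually, the IOMMU amounts to an allow-list of address windows that the kernel programs before a device is handed to a guest. The following simplified software model of that check (the window layout is invented for illustration) shows why a transfer must fall entirely inside a granted window.

```c
#include <stdint.h>

/* Simplified model of an IOMMU allow-list: a DMA transfer is permitted
   only if it falls entirely inside a window granted to that device. */
typedef struct {
    uint64_t base;   /* start of a granted address window */
    uint64_t size;   /* window length in bytes */
} dma_window_t;

int dma_allowed(const dma_window_t *win, int nwin,
                uint64_t addr, uint64_t len)
{
    for (int i = 0; i < nwin; i++) {
        /* Ordered to avoid unsigned overflow/underflow: first confirm
           the start and length fit, then the offset within the window. */
        if (addr >= win[i].base &&
            len <= win[i].size &&
            addr - win[i].base <= win[i].size - len)
            return 1;   /* whole transfer inside one granted window */
    }
    return 0;           /* would touch memory outside the allow-list */
}
```

A transfer that even partially escapes every window is rejected, which is exactly the property that stops a compromised guest from reaching kernel memory.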

Without such hardware protection, DMA devices should<br />

instead be managed by the Separation Kernel to ensure that a<br />


flaw in a Guest OS device driver cannot wrongfully program the<br />

DMA hardware and cause potentially fatal memory corruption.<br />

More precisely, the DMA requests need to be handled by the<br />

Separation Kernel, while the more complex part of the driver can<br />

still run in the Guest. This pushes the driver implementation toward a specific, para-virtualized model, to ensure the required behaviour can still be achieved. The added complexity is the price to pay for<br />

keeping the system robust, safe and secure, and experience<br />

demonstrates that the overhead is actually smaller than<br />

anticipated when the interface is correctly designed.<br />

Fig. 4 Improving security using a Separation Kernel and Virtualization<br />

IV. CONCLUSION<br />

Secure Gateways (SGs) are complex systems and they have<br />

to guarantee the security of the vehicle from external attacks.<br />

Using a secure and reliable Separation Kernel offers several advantages for improving the safety and security of the SGs, assuring the separation between critical and non-critical software components while providing multi-level protection.<br />

Typically such SW components can manage different types of<br />

buses, i.e., the whole purpose of the separation solution between<br />

different domains is to act as a gateway on the same hardware<br />

box, and possibly add filtering and gateway protection features.<br />

Finally, Separation technology opens up new scenarios in the automotive world and improves the security of the whole system.<br />

REFERENCES<br />


[1] Erlingsson Ú., Younan Y., Piessens F. (2010) Low-Level Software<br />

Security by Example. In: Stavroulakis P., Stamp M. (eds) Handbook of<br />

Information and Communication Security. Springer, Berlin, Heidelberg.<br />

[2] N. Lu, N. Cheng, N. Zhang, X. Shen and J. W. Mark, "Connected<br />

Vehicles: Solutions and Challenges," in IEEE Internet of Things Journal,<br />

vol. 1, no. 4, pp. 289-299, Aug. 2014.<br />

[3] https://www.ghs.com/products/rtos/integrity.html<br />

[4] https://linuxcontainers.org/it/<br />

[5] https://www.freebsd.org/it/<br />

www.embedded-world.eu<br />

451


Smart Contracts for Industry 4.0 Using<br />

Blockchain<br />

Christoph Reich<br />

University of Applied Science Furtwangen<br />

Christoph.reich@hs-furtwangen.de<br />

Abstract<br />

Because of the digital transformation of enterprises, stronger collaboration between companies is expected, in order to achieve customer-adapted, individual hybrid business models. These inter-enterprise business models need secure, reliable, and repeatable ways to log and monitor the information flow between manufacturing machines, users and service providers.<br />

Blockchain technology allows end-to-end trust chains, the creation of digitized Service Level Agreement (SLA) contracts, and evidential control of the data flow between the enterprises.<br />

Introduction<br />

The progressive conversion of the manufacturing industry to Industry 4.0 technologies and the<br />

associated networking of the individual components of production facilities, logistics and employees<br />

enable companies to collect detailed information about their processes and products, in order to<br />

analyse data for process optimization, condition monitoring, predictive maintenance, etc.¹ The use<br />

of data across company boundaries results in further innovative hybrid business models. According<br />

to vbw², strong growth in these Industry 4.0 hybrid business models is expected. Manufacturing<br />

companies are turning into industrial service providers who will offer their customers individual<br />

industrial services. Future value chains will be highly networked structures with a large number of<br />

involved people, IT systems, automation components and machines.<br />

A fundamental requirement for the acceptance of the provision and exchange of information between<br />

the customers/users and the service providers is the trust in the underlying systems, the strict<br />

adherence to Service Level Agreements (SLAs) and the observance of the protection goals<br />

availability, confidentiality, authenticity, integrity and traceability in information processing across<br />

organizational and corporate boundaries. Blockchain technology (such as Hyperledger or Ethereum), as a distributed database, ensures encrypted, immutable, permanent (persistent), traceable and auditable<br />

storage of cross-company information with guaranteed integrity. The essential basis of the blockchain<br />

concept is the technique of distributed consensus building, which replaces trust in a third party with<br />

trust in a collective of participants, technology and cryptography.<br />

This paper starts with an introduction of blockchain and smart contracts, followed by an overview of<br />

possible Industry 4.0 use cases, where blockchain is an interesting approach, and ending with a<br />

conclusion.<br />

¹ Institut der deutschen Wirtschaft Köln; „Digitalisierung und Mittelstand – Eine Metastudie“;<br />

https://www.iwkoeln.de/_storage/asset/312105/storage/master/file/10916485/download/IW-<br />

Analyse_2016_109_Digitalisierung_und_Mittelstand.pdf; Nov. 2016<br />

² vbw - die bayrische Wirtschaft; „Neue Wertschöpfung durch Digitalisierung Analyse und<br />

Handlungsempfehlungen“; 2017; https://www.vbw-bayern.de/Redaktion/Frei-zugaengliche-Medien/Abteilungen-<br />

GS/Forschung-Technologie/2017/Downloads/vbw_Zukunftsrat_Handlungsempfehlung-V14RZ-Ansicht.pdf<br />



Blockchain and Smart Contracts<br />

The blockchain technology and smart contracts have the capability to solve some of the industry’s<br />

crucial problems, like provable product traceability, autonomous payment, etc.<br />

Blockchain<br />

The blockchain technology should be tamper-proof thanks to a clever combination of proven<br />

technologies and encryption mechanisms. In addition, it should make its users independent of<br />

monolithic systems and the associated risks. The best-known blockchain application to date is Bitcoin, a worldwide, decentralized digital payment currency.<br />

The blocks of a blockchain are individual records strung together in chronological order. Each block contains a hash (a kind of checksum) of the previous block. Before a new block can be recorded in the blockchain, it has to be verified, including its hash, as shown in Figure 1 below:<br />

Figure 1: Blockchain<br />

The hash chaining guarantees that already written blocks can hardly be changed afterwards, or only with very high effort. A change in an older block causes the hashes of all following verified blocks to deviate, making the change immediately apparent. The blockchain is checked in a decentralized network by all authorized subscribers. Only if consensus among all subscribers is achieved can a new block be added. Blockchain features are:<br />

• Transaction transparency<br />

• Decentralized data records<br />

• Tamper-proof<br />
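The chaining and tamper-detection property described above can be sketched in a few lines (an illustrative Python model; a real blockchain additionally uses Merkle trees, signatures and a consensus protocol, and the field names here are invented):<br />

```python
import hashlib
import json

def block_hash(block):
    """Hash of a block's data together with the previous block's hash."""
    payload = json.dumps(block, sort_keys=True).encode()
    return hashlib.sha256(payload).hexdigest()

def append_block(chain, data):
    """Link a new record to the current end of the chain."""
    prev = chain[-1]["hash"] if chain else "0" * 64
    block = {"data": data, "prev_hash": prev}
    block["hash"] = block_hash(block)   # hash over data + prev_hash
    chain.append(block)

def verify(chain):
    """Every block must match its own hash and link to its predecessor."""
    for i, block in enumerate(chain):
        expected = block_hash({"data": block["data"],
                               "prev_hash": block["prev_hash"]})
        if block["hash"] != expected:
            return False
        if i > 0 and block["prev_hash"] != chain[i - 1]["hash"]:
            return False
    return True

chain = []
append_block(chain, "machine 7: temp 41 C")
append_block(chain, "machine 7: maintenance done")
assert verify(chain)

chain[0]["data"] = "machine 7: temp 20 C"   # falsify an old record
assert not verify(chain)                    # the deviation is detected
```

Changing the old record invalidates its hash, so every later block's `prev_hash` link no longer matches, which is why falsification is detectable.<br />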

Blockchain technology provides a platform for industry to irreversibly store data, values or properties of things in a network. For example, production data or important measured values can be registered, as well as contracts or agreements that have been concluded. This creates a platform that is trusted by all participants within a production or supply chain network. The blockchain thus creates trust between partners who are not yet familiar with each other or with each other's processes.<br />

Smart Contracts<br />

A smart contract is a contract implemented in the blockchain as a piece of software in which various contractual conditions can be stored. The contractual conditions defined in the digital contract are automatically monitored, and specified actions are carried out automatically based on the information received [Christidis16].<br />

Ensuring the traceability (notary function) and the immutability in the use of such smart contracts is<br />

thus a key requirement for the contractual implementation of collaborative processes between several<br />

companies. Previous prototype implementations using blockchain technologies, however, focus<br />

primarily on financial applications (Bitcoin) or applications in supply chain management<br />

[Tschorsch16]. Smart contracts and the associated automation can be used to improve many processes and, in some cases, to reduce the need for certified inspection bodies, provided the consistency of the information is ensured by a smart contract and audit-proof storage. Once information has been confirmed by the smart contract, it is documented in an audit-proof way and can be integrated in a variety of contexts. Thus, from<br />



a technological point of view, the blockchain is a natural tool for process optimization. If, for example, a video can only be imported into a community platform when the corresponding audio rights are available, the entire monitoring process can be omitted, since this consistency is easy to maintain through smart contracts. A simple example of a coin transfer that is checked to be balanced is shown in Figure 2.<br />

Figure 2: An Example Smart Contract on Ethereum [ethereum]<br />
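The logic of such a contract can be paraphrased outside the blockchain as follows (an illustrative Python sketch of the balance check only, not the Solidity source referenced in [ethereum]; the class and account names are assumptions):<br />

```python
class TokenContract:
    """Toy version of a 'minimum viable token': transfers may only
    move existing coins, so the total supply stays balanced."""

    def __init__(self, owner, supply):
        self.balances = {owner: supply}

    def transfer(self, sender, receiver, amount):
        # Reject transfers that would overdraw the sender.
        if amount <= 0 or self.balances.get(sender, 0) < amount:
            raise ValueError("insufficient balance")
        self.balances[sender] -= amount
        self.balances[receiver] = self.balances.get(receiver, 0) + amount

token = TokenContract("factory_A", 1000)
token.transfer("factory_A", "supplier_B", 250)
assert token.balances == {"factory_A": 750, "supplier_B": 250}
assert sum(token.balances.values()) == 1000   # supply stays balanced
```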

Industry 4.0 Blockchain Use Cases<br />

The prerequisite for using blockchain technology in the industrial environment is an infrastructure that operates the blockchain and ensures access to the blockchain for all subscribers. Blockchain infrastructures are divided into public (e.g. Ethereum) and private. A private blockchain is made available only to one's own network. The infrastructure for this can be operated by the company itself or by a cloud provider. It consists of distributed servers, each operating one node of the blockchain network. Ideally, each participant in a production network operates a blockchain node.<br />

Each node is connected to the others, and each node has a complete copy of all the data stored in the blockchain. If legacy systems, sensors, machines, etc. must be connected, a unique identity and protection against identity manipulation, a tamper-proof connection and protection against malware have to be ensured. This also applies to all other physical and logical blockchain subscribers, e.g. Manufacturing Execution Systems (MES) or gateways, that want to put information into the blockchain. In semi-automated manufacturing processes, it is also necessary to include people's activities in the blockchain, which requires support for the management of identities and the use of appropriate input and output devices.<br />

Recently, interest in exploiting blockchain in the manufacturing industry has increased dramatically. Applications of blockchain for supply chain management and auditing are usually mentioned first: smart contracts make each step of a supply chain more transparent, allowing product movement to be tracked from the factory to the store shelves or along the value chain. IoT devices can write location data straight to a smart contract, which simplifies the tracking process. Such a feature provides real-time visibility of an entire supply chain and may improve a business, e.g. by detecting products that are stuck at customs, and reduces the risk of fraud and theft as well.<br />

Before a detailed use case is described, additional use cases are summarized in the following table, based on the paper by Bahga and Madisetti [Bahga2016]:<br />



Application | Short Description<br />

On-Demand Manufacturing | Manufacturing services (such as CNC machining or 3D printing) provided by sending transactions to the machines.<br />

Smart Diagnostics & Machine Maintenance | Machines will be able to monitor their state, diagnose problems, and autonomously place service, consumables replenishment, or part replacement requests with the machine maintenance vendors.<br />

Product Certification | The manufacturing information for a product (such as the manufacturing facility details, machine details, manufacturing date and parts information) is recorded to prove the authenticity of the products.<br />

Tracking Supplier Identity & Reputation | Applications track various performance parameters (such as delivery times, customer reviews and seller ratings) for sellers.<br />

Registry of Assets & Inventory | Applications for maintaining records of manufacturing assets and inventory.<br />

Reliable interactive maintenance<br />

Regular maintenance work is depicted in so-called maintenance plans, which the machine and system builder creates for the respective system. Which maintenance tasks have to be carried out remotely by the machine builder, and which by the plant operator himself, is specified in service maintenance contracts. These contracts vary from no-service to full service by<br />

the machine builder. Figure 3 shows a small example of blockchain subscribers for a possible smart<br />

contract (pseudo code). The digitized contract checks the daily maintenance (cleancheck), logs<br />

the hourly temperature (temp) and allows remote maintenance between 15:00 and 18:00<br />

(productionLineCheck).<br />
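The rules described above can be paraphrased as follows (an illustrative Python sketch; the rule names cleancheck, temp and productionLineCheck are taken from the text, while the class and method structure are assumptions, not the pseudo code of Figure 3):<br />

```python
from datetime import datetime

class MaintenanceContract:
    """Toy version of the maintenance rules described in the text."""

    def __init__(self):
        self.log = []   # append-only service logbook (the ledger)

    def cleancheck(self, day, done):
        # Daily maintenance must be confirmed once per day.
        self.log.append((day, "cleancheck", done))
        return done

    def temp(self, timestamp, value_celsius):
        # The hourly temperature reading is simply recorded.
        self.log.append((timestamp, "temp", value_celsius))

    def production_line_check(self, timestamp):
        # Remote maintenance is permitted only between 15:00 and 18:00.
        allowed = 15 <= timestamp.hour < 18
        self.log.append((timestamp, "productionLineCheck", allowed))
        return allowed

contract = MaintenanceContract()
contract.temp(datetime(2018, 2, 27, 9, 0), 41.5)
assert contract.production_line_check(datetime(2018, 2, 27, 16, 30))
assert not contract.production_line_check(datetime(2018, 2, 27, 19, 0))
```

On a blockchain platform, every entry in the logbook would be a verified transaction, giving the maintenance transparency discussed below.<br />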

In addition to maintenance-specific smart SLA contracts, the individual maintenance plans are also<br />

to be mapped and monitored by smart contracts to attest to the fulfillment of the maintenance tasks.<br />

If damage occurs despite the maintenance measures, the service logbook of the blockchain platform (the ledger) provides maintenance transparency, and the responsibility for the repair is clearly defined.<br />

Figure 3: Pseudo Code of a Smart Contract for Maintenance<br />



In addition to the regular maintenance intervals, individual maintenance steps can be checked and the proof of actions immutably stored. The maintenance contracts can be easily customized according to the customer's requirements and to each individual machine. With information about the equipment (e.g., age, replacement parts, etc.), maintenance plans can be further optimized. All in all, this leads to individual maintenance contracts that change over time, implemented as smart contracts in the blockchain.<br />

Conclusion<br />

This paper introduced blockchain technology and described smart contracts. Thanks to the decentralized nature of blockchain technology, peers can interact in a trustless, auditable manner. Smart contracts allow us to automate complex multi-step processes to reach an agreement without involving any intermediaries. The paper concluded by describing some Industry 4.0 blockchain use cases. The full potential of blockchain technology has not yet been discovered, and further exploitation of the technology for the manufacturing industry is required.<br />

References<br />

[Christidis16] K. Christidis and M. Devetsikiotis, "Blockchains and Smart Contracts for the Internet<br />

of Things," in IEEE Access, vol. 4, pp. 2292-2303, 2016.<br />

[Tschorsch16] F. Tschorsch and B. Scheuermann, "Bitcoin and Beyond: A Technical Survey on<br />

Decentralized Digital Currencies," in IEEE Communications Surveys & Tutorials, vol. 18, no. 3,<br />

pp. 2084-2123, thirdquarter 2016.<br />

[ethereum] Minimum Viable Token Coin Example; Source: https://www.ethereum.org/token<br />

[Bahga2016] Arshdeep Bahga, Vijay K. Madisetti; “Blockchain Platform for Industrial Internet of<br />

Things”; Journal of Software Engineering and Applications; Vol. 9, No. 10 (2016), Article ID 71596, 14 pages; doi:10.4236/jsea.2016.910036<br />



How connected cars are driving connected payments<br />

James Carroll<br />

CTO, Solutions Team<br />

Mobica<br />

Wilmslow, United Kingdom<br />

Jim.carroll@mobica.com<br />

Abstract— Technologies associated with IoT are<br />

inexpensive, low powered and frequently based on common<br />

software platforms. Recent rapid development in the<br />

automotive/haulage industries allows the integration of this tech<br />

into cars/trucks. These developments have coincided with the<br />

simultaneous rise in FinTech with the use of technology<br />

supporting financial services such as payments, fostering a<br />

commonality of innovative development across these domains.<br />

This paper describes innovative opportunities created by<br />

merging one market with another and also driven by IoT.<br />

Illustrated by the implementation of an autonomous "pay at<br />

pump" use case as an example, the integration of IoT and<br />

payments technology into car IVI systems and fuel pumps will<br />

allow the financial transaction associated with buying fuel to be<br />

handled completely autonomously. The deployment of computer<br />

vision technology for authentication, and connectivity for<br />

communication with payment authorities, will facilitate<br />

autonomous payments in a large range of scenarios, including, for example, road-toll payments.<br />

I. INTRODUCTION<br />

Modern cars generally include an in-vehicle infotainment<br />

(IVI) system. These systems usually combine multimedia,<br />

navigation, radio and telephony functions. Software to execute<br />

these functions requires a sophisticated operating system (OS)<br />

to provide underlying services. Rather than develop a fully<br />

custom OS, automotive manufacturers (OEM) and Tier 1<br />

suppliers tend to adopt existing, licensed OSs, such as Linux,<br />

Windows or QNX on which to build IVI systems. Such OSs<br />

require significant processor power to execute applications and<br />

services at an appropriate level of performance. Suitable<br />

System-on-Chips (SoC) for such applications are provided by<br />

semiconductor companies such as Intel, Renesas, NVIDIA and<br />

Qualcomm. Each SoC typically includes multiple CPU cores, a<br />

DSP and a GPU, in addition to an array of on-chip peripheral<br />

hardware.<br />

This combination of application software, complex OS and<br />

SoC is similar to the approach taken to build mobile phones.<br />

Application software such as ApplePay, AndroidPay and<br />

SamsungPay in combination with Near Field Communication<br />

(NFC) or Bluetooth Low Energy (BLE) peripheral hardware and<br />

techniques like Host Card Emulation (HCE) and QR codes allow<br />

a mobile to be used as a payment device.<br />

It is therefore possible to build payment technologies into<br />

automotive IVI systems using a similar approach, enabling the<br />

car to become a payment device.<br />

Note that although NFC is commonly used in mobile phones<br />

to facilitate payments, it may be impractical for automotive use<br />

cases. The range for NFC communication is 20 cm at most with<br />

most systems working at a range of less than 4 cm between<br />

receiver and transmitter. Reliably positioning a car with this<br />

proximity to an NFC reader is difficult. BLE has a much larger<br />

range, so requires less accuracy; QR codes are based on cameras,<br />

and have a still larger range, but require “line of sight”; wired<br />

solutions are the most reliable and secure, but may be considered<br />

less “user friendly”. None of these alternatives are yet<br />

standardised for payment use cases.<br />

Some IVI systems, when combined with specific mobile<br />

phones, provide support for “projection” type functionality -<br />

based on Android Auto and Apple CarPlay. With these systems,<br />

application software running on a mobile phone and connected<br />

to the IVI system renders onto the IVI system; the application is<br />

fully usable using the IVI system only. In this use case, the car<br />

may use the phone as a payment device. This use case is not<br />

considered in this paper; here we focus only on integrated<br />

solutions.<br />

Additionally, there are specific variants of Android available<br />

for use in IVI systems - O.Car is a recent version. As these<br />

variants present the same Application Programming Interfaces<br />

(API) for developers, existing software will be readily portable<br />

to these devices. However, the underlying hardware or software<br />

enablers are not available in IVI devices at present.<br />

A car is of little use as a payment device if there are no devices available capable of accepting payments from it.<br />

Some devices such as toll road booths are already able to accept<br />

payment without human intervention. Other use cases, such as<br />

fuel payment at the pump and fast food drive-throughs are<br />

equipped for the use of card readers. Relatively simple changes<br />

are required to support IVI based payments.<br />



II. THE “PAY AT PUMP” USE CASE<br />

Fig. 1. Connected Car Pay at Pump Usage Model<br />

Figure 1 illustrates how a connected car pay at pump model would work:<br />

● The driver fuels the car;<br />

● When triggered by the driver, the IVI system communicates with the pump, using a secure, short range wireless connection, providing authentication. This may happen whilst refueling is in progress;<br />

● When the driver has completed refueling and the authentication process, the car transfers money to the fuel company electronically, over the internet.<br />

III. IN-CAR TECHNOLOGY<br />

Fig. 2. IVI Software Components<br />

Figure 2 illustrates a typical architecture of an IVI system. In common with all modern OS deployments, a layered approach is taken. The diagram is not exhaustive - it shows only the components relevant to the use of connected cars as payment devices. The middleware and Board Support Package (BSP) components can be considered as core OS components. The purpose of each of the layers is:<br />

• Application: provides the user with a method of accessing the payment services supported by the IVI system;<br />

• Middleware: provides hardware abstraction and protocols for the IVI system to communicate with external devices (pump and payment providers);<br />

• BSP: controls underlying hardware directly.<br />

Figure 2 shows that much of the required technology is already available (but not necessarily enabled) in the core OSs deployed in IVI systems:<br />

• Short range wireless and wired protocols: WiFi, BLE,<br />

NFC, USB;<br />

• Encryption algorithms;<br />

• Application environments: C/C++, Java, JavaScript,<br />

HTML5;<br />

• Internet connectivity - WiFi, 4G;<br />

• Hardware support - BLE, NFC, camera, image<br />

processing;<br />

• Security frameworks.<br />

The most significant software work required is porting and<br />

enabling of existing OS features and hardware support for the<br />

target devices. It is possible that device drivers for specific<br />

hardware modules may be unavailable directly, e.g. NFC,<br />

encryption, DSP hosted algorithms, MMU schemes, TEE.<br />

However, in such cases, it is likely that the semiconductor<br />

vendor will provide a base driver for a reference platform.<br />

Application software may need to be developed from<br />

scratch, depending on the underlying APIs and the nature of<br />

existing applications. At a minimum, applications will require<br />

porting from existing mobile platforms, considering different<br />

input methods, display geometry, hardware platforms and safety<br />

requirements.<br />

Software to be deployed in cars is subject to strict engineering processes - there is likely to be significant additional<br />

software test work associated with the deployment of these<br />

features. Similarly, all software handling financial transactions<br />

is subject to regulation. This may lead to additional certification<br />

work.<br />

Note that it is common to deploy software in a virtualised<br />

environment within an IVI system. This may imply an additional<br />

level of system validation - at both OS and fully integrated<br />

system levels.<br />

458


Where open source software supporting payment use cases<br />

is deployed in an IVI system, there are also additional policy and<br />

procedural aspects of the software to consider:<br />

• Is the license suitable and acceptable?<br />

• How will 3rd party changes to the software be handled<br />

(down streaming)?<br />

• Will changes to the software deployed in the IVI system<br />

be made available to 3rd parties (upstreaming)?<br />

IV. PUMP TECHNOLOGY<br />

For the connected car pay at pump use case, we also need to<br />

consider the necessary changes to the fuel pump. It is already<br />

commonplace to allow a user to pay for their fuel at the pump,<br />

using a payment card. In this regard, the pump includes the<br />

hardware and software of standalone payment terminals. These<br />

terminals frequently include an embedded OS, such as Windows<br />

or Linux - this implies that the pump is designed in such a way<br />

to make software modification relatively simple.<br />

Generic payment terminals usually permit chip, magnetic<br />

strip and contactless modes of payment. Existing Pay at Pump<br />

solutions are mostly based on chip and PIN; some form of<br />

enhanced contactless payment must be added to the pump. As<br />

the payment terminals are based on embedded OSs, the<br />

necessary protocols and hardware abstractions are already<br />

available. The addition of hardware support for contactless<br />

payment is similar to the IVI scenario.<br />

Assuming that a wireless protocol is to be deployed for ease<br />

of use, the key decision to be made is the transport for<br />

communication between IVI system and pump. The following<br />

table summarises available transports and their suitability for<br />

communications between car, pump and local network.<br />

Transport | IVI / Pump | Pump / Network<br />

NFC | N | N<br />

BLE | Y | N<br />

WiFi | Y | Y<br />

Camera | Y | N<br />

Fig. 3. Table of Connected Car Pay at Pump transports<br />

The table shows that only WiFi can support all of the required communication channels. It would be possible to use BLE or camera based solutions for IVI to pump communication and WiFi for pump to local network communication, but embedded devices are typically resource limited; it may not be practical to deploy multiple transports in the pump. Minimising software “footprint” is a common design goal.<br />

It may be tempting to add other forms of authentication to the pump, such as facial or fingerprint recognition. However, there are several drawbacks to this:<br />

• All require additional hardware, such as fingerprint sensors or cameras. These would add to the software “footprint” and the overall BOM cost of the pump. In the case of cameras, it may be possible to make use of camera technology already deployed at fuel stations for security purposes;<br />

• The hardware will have many users and will be exposed to the elements; it may be easily damaged;<br />

• The human element of security in financial transactions has been demonstrated to be the most easily compromised.<br />

V. SAFETY<br />

Software intended for deployment in cars is developed according to standardised production processes (e.g. ISO26262), coding guidelines (e.g. MISRA) and risk classification (e.g. ASIL). The primary reason for this is driver, passenger and road user safety. Engineering and quality processes are also employed to identify defects early. The cost of fixing defects late in the product lifecycle is very high in the automotive sector, where vehicle recalls can cost many millions of Euros. Early defect resolution is a key goal.<br />

In general, safety standards are applied to driver systems such as Engine Control Units (ECU), Instrument Clusters (IC), Advanced Driver Assist Systems (ADAS) and autonomous driving. IVI systems are not typically subject to the same safety standards. The main safety issue for IVI systems is one of driver distraction. There are clear regulations in this domain, which are addressed at the specification stage - no additional process requirements are made of the development process.<br />

Increasingly, clusters are being integrated with IVI systems, running on the same SoC. The most common approach in such a scenario is to isolate safety critical and non-safety critical functions using OS virtualisation, based on hypervisors. An example architecture is illustrated in Figure 4, below.<br />



Fig. 4. IVI and Cluster Virtualisation Architecture<br />

The safety issue for an IVI system in this architecture is one<br />

of shared hardware access. The safety critical system demands a<br />

guaranteed minimum access to underlying hardware. Where the<br />

hardware request is in conflict with an IVI system request, IVI<br />

performance may be compromised in favour of the safety critical<br />

system. An example of this is the use of the SoC’s GPU, which<br />

will be used to render both speedometer and payment<br />

applications simultaneously. In the pay at pump use case, this is<br />

unlikely to be an issue - the performance demands of the IVI<br />

system are not great and the speedometer will not be in use as<br />

the car is stationary!<br />

For future developments, this issue cannot be ignored:<br />

developments in the automotive industry are likely to introduce<br />

other conflicts. Cameras are increasingly being included in cars<br />

for driver assist (ADAS) applications - such cameras may also<br />

be used for IVI applications, including payment authentication.<br />

It is likely that this trend for “shared” hardware in the context of<br />

the car will continue.<br />

VI.<br />

SECURITY<br />

In the same way that automotive sector has safety at the heart<br />

of its technology, the financial sector focuses on security - fraud<br />

prevention and data security are crucial to the success of these<br />

businesses.<br />

In the context of software development, safety and security<br />

bear some similarities. Both aspects:<br />

● Aim to prevent damage or loss to individuals;<br />

● Are subject to standards and regulation;<br />

● Are implemented using technical and process measures.<br />

This means that industries experienced in the<br />

implementation of safety critical software should be able to<br />

adapt to the development of secure software (and vice versa).<br />

The main standards deployed for the development of<br />

financial software include PCI DSS and EMV. These standards<br />

are primarily concerned with specifying how the devices and<br />

processes should work, rather than how they are developed. PCI<br />

DSS describes information security; EMV describes how<br />

payment devices work and ensures compatibility across<br />

payment providers.<br />

Other important technologies and techniques in the<br />

implementation of financial software include:<br />

● Card tokenisation - for the substitution of sensitive data with non-sensitive data, minimising the handling of secure data;<br />
● Single-use keys - for the one-time encryption and decryption of data, minimising the risk associated with key loss or theft;<br />
● Encryption - for the conversion of data (in storage and during transmission) between readable and non-readable formats, preventing the use of fraudulently acquired data.<br />
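To make the first of these concrete, card tokenisation can be sketched in a few lines. This is a toy in-memory vault for illustration only; the class and its methods are hypothetical, and production systems use certified PCI token service providers backed by hardware security modules.

```python
import secrets

class TokenVault:
    """Illustrative token vault: substitutes a card number (PAN) with a
    random surrogate so downstream systems never handle the real PAN."""

    def __init__(self):
        self._token_to_pan = {}

    def tokenise(self, pan: str) -> str:
        # Non-sensitive surrogate: random digits, same length as the PAN.
        token = "".join(secrets.choice("0123456789") for _ in range(len(pan)))
        self._token_to_pan[token] = pan
        return token

    def detokenise(self, token: str) -> str:
        # Only the vault, inside the secure perimeter, can map back.
        return self._token_to_pan[token]

vault = TokenVault()
token = vault.tokenise("4111111111111111")
assert vault.detokenise(token) == "4111111111111111"
```

The merchant-facing side of the system stores and transmits only the token; a breach of those systems therefore yields no usable card data.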

These techniques are based on the implementation of<br />

algorithms; in modern SoC applications, these algorithms are<br />

frequently executed on a DSP or GPU rather than the host CPU.<br />

Although mobile payment applications may currently host such<br />

algorithms on CPUs, it may be advantageous to move these<br />

algorithms to a co-processor in an IVI system. Because the SoC is<br />
running many functions, performance enhancements may be<br />
achieved by sharing the load between processors in this way.<br />

OS deployments in IVI systems currently include security<br />

features associated with the OS itself - for example data caging<br />

and cryptography. Other related features provided by third<br />
parties are readily integrated, such as the Linux SMACK<br />
module or TrustZone-specific device drivers.<br />

The introduction of payment related features into cars may<br />

make the car a target for malicious hackers. The creation of the<br />

pay at pump use case introduces an additional point of failure<br />

into the payment chain in the form of the car itself. A focus of<br />

payment technology is to reduce such vulnerabilities; a full<br />

security assessment will be a precursor to the development and<br />

deployment of such technology. Although penetration testing is<br />

prevalent in automotive software, it is by no means mandatory.<br />

If this threat is realised, it is likely that penetration testing will<br />

become a required part of the development process.<br />

As vulnerabilities are identified and corrected, an effective<br />

way of deploying software updates to consumer owned vehicles<br />

will be required. Such mechanisms are used on all mobile<br />
platforms today, and are how mobile payment software is<br />
updated. Similar mechanisms are<br />
available for IVI systems today, but they lack "immediacy";<br />
updates are typically done by dealerships, not pushed out to<br />
users when available. There are no technical barriers to the use<br />
of such methods in IVI systems: the challenge is being able to<br />
guarantee that Over The Air (OTA) updates do not cause issues<br />
for users, which would result in costly recalls.<br />



VII. CONCLUSION<br />

All of the necessary software and hardware components and<br />

processes for the creation of the connected car pay at pump use<br />

case are currently available in some form. The creation of a<br />

commercially viable, fully integrated solution is the next step.<br />

There are no technical or regulatory barriers to doing so.<br />

This use case is merely an example of the possibilities of the<br />

concept. There are myriad other potential use cases:<br />

● Road tolls<br />
● Parking<br />
● Car taxation<br />
● Servicing<br />
● Regulatory periodic vehicle testing<br />
● Drive-through restaurants<br />
● Car rental<br />
● Insurance<br />
● In-car entertainment<br />

Autonomous driving extends potential use cases further;<br />

many use cases implemented for non-autonomous driving may<br />

also require modification when implemented in autonomous<br />

environments.<br />

Many of the technology vendors working in the automotive<br />

and financial technology sectors are already working on proofs<br />

of concept integrating payments into cars. This includes:<br />

● semiconductor vendors<br />
● automotive OEMs and tier 1s<br />
● platform and OS providers<br />
● payment providers<br />

It is anticipated that the overlap of the Automotive and<br />
FinTech sectors will come to fruition in 2018, as IVI systems<br />
mature to provide a more generic (if not open) platform on which<br />
3rd party developers can provide innovative new software and<br />
services. The IVI system is at the heart of the connected car; in<br />
the same way that the mobile market rapidly expanded as mobile<br />
platforms matured, so too will the connected car market. The<br />
majority of large businesses now have distinct digital and mobile<br />
strategies for selling and deploying their products and services.<br />
Within a few years, it is likely that many of these will also have<br />
a distinct connected car strategy. Such strategies will, as a matter<br />
of course, include an element of monetisation; this requires<br />
integrated payment solutions. Will mPayment technology spawn<br />
a sub-branch to support connected cars - cPayments?<br />
The rise of the FinTech companies, challenger banks and<br />
changes to regulation of the financial services market in the EU<br />
have all stimulated significant innovation. In particular, the<br />
Payment Services Directive (PSD2) is allowing smaller<br />
financial organisations to provide services competitively. PSD2<br />
is focused on electronic payments and will therefore extend the<br />
use of such payments further. PSD2 compels banks to allow 3rd<br />
parties to extend electronic payment options - for example to<br />
assign a payment card to the bank account or to initiate a money<br />
transfer from the account. The regulation changes are also<br />
allowing finance companies to sell each other's products and<br />
services, broadening competition and creating niches in which<br />
the smaller FinTech companies can innovate. This is changing<br />
the world of mobile payment technology. It is also likely to have<br />
the same impact on connected car payments. Further, the new<br />
FinTech services being developed for deployment on mobile<br />
will also be deployed in an automotive context.<br />
In the embedded and automotive domains, the most<br />
significant opportunities for innovation in this area arise from<br />
the integration of IoT solutions. Sensors being added to cars and<br />
the smart cities in which they drive provide a plethora of<br />
potential new services, such as smart parking and motor<br />
insurance. These services will require payment solutions - users<br />
will increasingly expect to be able to make such payments from<br />
within their cars.<br />
VIII. REFERENCES<br />
[1] Mobile & NFC Council, "Host Card Emulation (HCE)," Smart Card Alliance, June 18, 2015<br />
[2] Tod E. Kurt, "NFC & RFID on Android," ThingM, 2011, https://www.slideshare.net/todbotdotcom/nfc-rfid-on-android<br />
[3] Visa, "The Connected Car: Visa Looks Ahead," March 2015, https://usa.visa.com/visa-everywhere/innovation/visa-connectedcar.html<br />
[4] Chris Giordano & Jim Carroll, DiSTI & Mobica, "Hardware Convergence & Functional Safety: Optimal Design Methods in Today's Automotive Digital Instrument Clusters," June 2016, https://www.disti.com/hardwareconvergence-functional-safety-whitepaper/<br />
[5] MISRA, https://www.misra.org.uk/<br />
[6] National Instruments, "What is the ISO 26262 Functional Safety Standard?," April 2014, http://www.ni.com/white-paper/13647/en/<br />
[7] EMVCo, https://www.emvco.com/<br />


ATM Protection Using<br />

Embedded Machine Learning Solutions<br />

Antonio Rizzo, Francesco Montefoschi,<br />

Alessandro Rossi, Maurizio Caporali<br />

University of Siena<br />

Siena, Italy<br />

antonio.rizzo@unisi.it, francesco.montefoschi@unisi.it,<br />

alessandro.rossi2@unisi.it, maurizio.caporali@unisi.it<br />

Antonio J. Peña, Marc Jorda<br />

Barcelona Supercomputing Center (BSC)<br />

Barcelona, Spain<br />

antonio.pena@bsc.es, marc.jorda@bsc.es<br />

Gianluca Venere<br />

SECO Srl<br />

Arezzo, Italy<br />

gianluca.venere@seco.com<br />

Carlo Festucci<br />

Monte dei Paschi di Siena<br />

Siena, Italy<br />

carlo.festucci@mps.it<br />

Abstract— ATMs are an easy target for fraud attacks, like<br />

card skimming/trapping, cash trapping, malware and physical<br />

attacks. Attacks based on explosives are a rising problem in<br />

Europe and many other parts of the world. A report from the<br />

EAST association shows an 80% rise in such attacks between<br />
the first six months of 2015 and the same period of 2016. This trend is particularly<br />

worrying, not only for the stolen cash, but also for the significant<br />

collateral damages to buildings and equipment [1].<br />

We developed a video surveillance application based on Intel<br />

RealSense depth cameras that can run on Seco’s A80 Single<br />

Board Computer. The camera can be embedded in the ATM’s<br />

chassis, focusing on the area under the screen, where explosive-based<br />
attacks begin. The use of depth cameras avoids privacy-related<br />
regulatory issues. The computer vision analysis rests on<br />

Machine Learning algorithms. We designed a model based on<br />

Convolutional Neural Networks able to discriminate between<br />

regular ATM usage and breaking attempts. The dataset has been<br />

built by recording and tagging depth videos where different<br />

people stage withdrawals and attacks on a retired ATM,<br />

replicating the actions the thieves do, thanks to the knowledge of<br />

the Security Department of the Monte dei Paschi di Siena Bank.<br />

The results show that the implemented architecture is able to<br />

classify depth data in real-time on an embedded system, detecting<br />

all the test attacks in a few seconds.<br />

Keywords— Bank Security; Machine Learning; Convolutional<br />

Neural Networks; Computer Vision; Intel RealSense; Single Board<br />

Computer<br />

I. INTRODUCTION<br />

In recent years, global digitalisation and the<br />
consolidation of information technologies have markedly changed our<br />
daily life and the way we interact, both at the local and<br />
global level. This digital revolution is also changing how users<br />
access banks and financial services, turning a relationship<br />
based on personal trust into a mainly online service<br />
with sporadic human interaction. This shift, and the<br />
resulting change in bank branch structure, naturally affects<br />
criminal behaviour in this environment. International<br />
sector studies [2] show that, although the use of explosives<br />
and other physical attacks continues to spread, in the long term<br />
attacks will shift towards cyber and logical approaches. In<br />

by seven countries in Europe during the year 2017.<br />

Moreover, statistics from ABI (the Italian Banking<br />
Association) show a marked increase in attacks on ATMs<br />
alongside a reduction in bank branch robberies. This is<br />
due both to the juridical categorisation of the committed crime<br />
and to the lower amount of money that can be stolen in a<br />



robbery. Indeed, security systems are generally concentrated<br />

on the branch rather than on the ATM area, which is usually<br />

located outside of the buildings. This also allows perpetrators<br />

to perform their assaults at night. An important<br />
issue to consider about these attacks is their collateral<br />
effects: the violence involved often<br />
leads to serious physical damage to buildings and objects in the<br />
neighbourhood of the targeted area, such as cars - and that is the<br />
best-case scenario, in which no human is involved.<br />
Given these premises, it is clearly fundamental to<br />
develop technologies capable of preventing this<br />
kind of situation. Crucial features of such a system are a low<br />
false-alarm rate and promptness in detecting the<br />
potential risk, both to alert the relevant control systems and,<br />
in the first place, to automatically discourage the<br />
ongoing criminal action with deterrents.<br />

In this paper we propose ATMSense, an automatic<br />

surveillance system based on video stream analysis of depth<br />

frames. This approach allows the actions performed in front of<br />
the ATM to be analysed in real time, while preserving the privacy of<br />
customers. Depth images are processed by a Machine Learning<br />
algorithm in order to predict the nature of the ongoing situation.<br />
Although the tests were performed on data recorded in our<br />
laboratory, the quality of the obtained results lays the<br />
groundwork for in-depth experimentation in the field.<br />

II. RELATED WORKS<br />

A. Video Surveillance<br />

Recent advances in Deep Learning techniques, and in<br />
particular in approaches dedicated to Computer Vision<br />
[3][4], have led to cutting-edge improvements in Image and Video<br />
Analysis algorithms. Although methodologies for Video<br />
Surveillance and, more generally, for Action Recognition [5]<br />
based on other approaches have been investigated in the past,<br />
achieving good results in restricted scenarios, Deep<br />
Learning methods now provide state-of-the-art results, at<br />
least in the short term. Taking these results into account, and the<br />
possibility of fast and portable prototyping of such algorithms,<br />
it seems reasonable to follow this direction, towards<br />
technologies that should become even more widespread and<br />
consolidated in the future. Moreover, such approaches should<br />
also scale directly when facing new kinds of<br />
specific situations and types of attack.<br />

B. ATMs Protection<br />

As ATMs came to play a central role in customer<br />
services, many systems have been developed to improve<br />
the security of these interactions. Systems designed to<br />
deal with identity theft [6][7][8], interactions with forged<br />
documents and certificates [9], and the detection of various<br />
specific dangerous situations [10][11] have been developed<br />
through the investigation and integration of various<br />
hardware devices. However, the most common approach is<br />
analysis by surveillance cameras, trying to recognise the<br />
actions that characterise a potentially critical scenario [13]. In other<br />
cases, more specific systems have been oriented towards face<br />
detection and tracking [14] or to the recognition of partially<br />
occluded faces and bodies [15][16].<br />

Fig. 1. ATMSense uses a depth camera connected to a Single Board<br />

Computer to analyse the surroundings of an ATM.<br />

In our approach, we turn to a relatively new technology,<br />
image analysis through depth cameras, which is, to<br />
the best of our knowledge, unexplored in this context. This allows us to<br />
combine the representational capabilities of video processing with the<br />
need for customer privacy protection, for both ethical and<br />
legal reasons.<br />

III. ATMSENSE<br />

ATMSense is intended to discriminate people's behaviour<br />

exhibited in front of an ATM, in order to detect risky situations<br />

at an early stage. The sensor used to analyse the scene is the<br />

Intel RealSense depth camera. Using the depth image instead<br />

of the RGB one provides great advantages: we can avoid<br />

dealing with personal data and privacy issues; the image is<br />

unaffected by lighting conditions; from a computational point<br />

of view, we can rely on a slight improvement by reducing the<br />

input channels from three to one. Depth images are processed<br />

on a Single Board Computer (Seco A80) with image<br />

processing techniques and Convolutional Neural Networks.<br />

A. Intel RealSense<br />

Intel RealSense is a family of depth cameras providing<br />
several video streams: RGB, depth and infrared.<br />

ATMSense is compatible with two camera models. The<br />

short-range RealSense SR300 can be placed in the ATM<br />

chassis, focused on the ATM keyboard area. The long-range<br />
RealSense R200 camera is intended to be placed above the<br />
ATM, covering the whole scene of interest. As stated in the<br />
Results section, the performance is similar for both cameras.<br />
The short-range camera is suited to being embedded in new ATMs,<br />
while the long-range camera fits better as an external add-on for already<br />
installed ATMs.<br />

Whichever camera is used, the depth video stream is used<br />

to classify what is going on in the ATM area. For debugging<br />

purposes RGB streams can be collected, but they are used<br />
neither for training the system nor at runtime.<br />



Fig. 3. On the left, a frame from the Intel RealSense R200. On the<br />
right, the same frame after preprocessing to reduce the noise and subtract<br />
the background.<br />

Fig. 2. Seco A80 Single Board Computer.<br />

In fact, relying on the RGB stream would create a<br />
dependency on factors, such as lighting conditions, that we<br />
want to avoid. Moreover, dealing with faces and other<br />
personal images can be an issue under privacy laws. Having<br />
only a low-resolution shape of the person does not allow<br />
personal identification.<br />

B. Seco A80<br />

Seco A80 [17] (depicted in Figure 2) is a low power Single<br />

Board Computer based on the Intel Braswell CPU family, up<br />

to the quad-core Intel Pentium N3710. RAM is<br />
modular, with two DDR3L SO-DIMM slots. The board<br />
offers standard desktop connectivity: USB 3.0 ports, HDMI<br />
output, M.2 for SSDs and Gigabit Ethernet ports.<br />
By providing standard UEFI firmware, it runs<br />
mainstream x86 operating systems. Our tests were done on<br />

Ubuntu 16.04, although any modern Linux distribution<br />

providing Python 2.7 can be used.<br />

C. Image Processing<br />

Depth images collected from the cameras are preprocessed<br />

before the classification. In this phase we want to remove both<br />

the noise and the background objects. The noise is intrinsic in<br />

the camera sensor and is reduced using a cascade of standard<br />

image processing filters (i.e. median filtering, erosion, depth<br />

clipping and so on). This technique leads to the generation of<br />

one video frame starting from 5 frames read from the depth<br />

camera. Although the dynamics of the system scales down<br />

from 30 fps to 6 fps, the information necessary to classify the<br />

images is preserved. The background suppression is related to<br />

the environment in which the ATM is located, and includes<br />

the device itself. The background is subtracted (using kNN-based<br />
techniques), making the solution independent of the<br />
particular ATM machine and environment. Moreover, in order to<br />
improve the generalization capabilities of learning algorithms,<br />
it is better to provide only the necessary information.<br />

Fig. 4. Intel RealSense SR300 frames are less noisy. Background<br />

information is removed from the right image.<br />

The difference between the original image read from the<br />

camera and the cleaned version is visible in Figure 3 (Intel<br />

R200) and Figure 4 (Intel SR300).<br />
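The filter cascade above can be sketched as follows. This is a NumPy approximation for illustration: the temporal median stands in for the paper's filter chain, the simple thresholded difference stands in for the kNN background subtractor, and the clip range and threshold values are illustrative, not the authors' settings.

```python
import numpy as np

def preprocess(frames, background, clip_mm=(400, 3000), bg_thresh_mm=60):
    """Collapse a window of noisy depth frames (5, as in the paper)
    into one cleaned frame with the background removed."""
    stack = np.stack(frames).astype(np.float32)
    # Temporal median suppresses the sensor's speckle noise.
    frame = np.median(stack, axis=0)
    # Depth clipping discards readings outside the region of interest.
    frame = np.clip(frame, *clip_mm)
    # Background subtraction: keep only pixels that differ clearly
    # from the profiled background.
    mask = np.abs(frame - background) > bg_thresh_mm
    return np.where(mask, frame, 0.0)

# Toy demonstration: a static background plus a "person" in 5 noisy frames.
background = np.full((240, 320), 2000.0)     # flat scene at 2 m
person = background.copy()
person[60:180, 100:220] = 900.0              # nearer object in front of the ATM
noisy = [person + np.random.normal(0, 5, person.shape) for _ in range(5)]
clean = preprocess(noisy, background)
assert clean[120, 160] > 0                   # person region survives
assert clean[10, 10] == 0                    # background is suppressed
```

Collapsing every 5 frames into 1 is what reduces the effective rate from 30 fps to 6 fps while keeping the information needed for classification.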

D. Convolutional Neural Networks<br />

Once we get a cleaned stream from the camera, we need to<br />

perform computations needed to predict the state of the current<br />

scene. As already said, the algorithmic approach relies on<br />

Deep Learning techniques. In particular, Convolutional Neural<br />

Networks (CNNs) represent the state-of-the-art in almost all<br />

Computer Vision applications, such as Image Segmentation and<br />
Classification, Object Detection and Recognition. This kind of<br />
architecture is biologically inspired by the human visual<br />
system [18], and its characterising property is expressed<br />
through the concept of the receptive field. These elements are a<br />
sort of pattern detector, used to generate internal<br />
feature maps representing the presence of specific shapes in<br />
each region of the image. This process is reiterated<br />
throughout several layers (see Figure 5) to come up with a<br />
numerical 1-D vector by iteratively performing dimensionality<br />
reduction (Max-Pooling) and producing an encoding of the<br />
original image. Hence, the obtained representation can be fed<br />
to a standard Artificial Neural Network (ANN) classifier<br />
which performs the desired predictions. This<br />
composition provides high representational capability, a<br />
relatively simple training procedure (derived<br />
straightforwardly from the standard Back-Propagation<br />
algorithm), and a weight-sharing policy between hidden units<br />
that reduces the computational cost.<br />

However, the large number of parameters (of the order of<br />
tens of millions) of such algorithms requires a correspondingly<br />
large dataset to achieve effective training, leading to<br />
accurate and general predictions.<br />
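The receptive-field idea can be made concrete with a minimal NumPy sketch. The single hand-written edge filter below is purely illustrative; it is not a filter from the trained network, whose weights are learned by back-propagation.

```python
import numpy as np

def conv2d(image, kernel):
    """Valid 2-D convolution: each output pixel is the response of one
    receptive field, i.e. a local pattern detector slid over the image."""
    kh, kw = kernel.shape
    h = image.shape[0] - kh + 1
    w = image.shape[1] - kw + 1
    out = np.zeros((h, w))
    for i in range(h):
        for j in range(w):
            out[i, j] = np.sum(image[i:i + kh, j:j + kw] * kernel)
    return out

def max_pool(fmap, size=2):
    """Max-pooling: keep the strongest response in each block,
    halving the spatial resolution (the dimensionality-reduction step)."""
    h, w = fmap.shape[0] // size, fmap.shape[1] // size
    return fmap[:h * size, :w * size].reshape(h, size, w, size).max(axis=(1, 3))

image = np.zeros((6, 6))
image[:, 3] = 1.0                            # a vertical edge
kernel = np.array([[1.0, -1.0]])             # responds to horizontal change
fmap = np.maximum(conv2d(image, kernel), 0)  # ReLU activation
pooled = max_pool(fmap)
assert pooled.shape == (3, 2)
```

Stacking several such convolution + ReLU + pooling stages, then flattening the final feature maps into a 1-D vector for a dense classifier, gives exactly the layer composition described above.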



TABLE II. FRAME BY FRAME CLASSIFICATION ACCURACY<br />
Camera | Sequence dim. | Withdrawal | Attack | Average<br />
SR300 | 1 | 92.91% | 94.30% | 93.60%<br />
SR300 | 5 | 92.69% | 92.50% | 92.59%<br />
SR300 | 10 | 92.86% | 94.48% | 93.67%<br />
R200 | 1 | 90.78% | 90.75% | 90.76%<br />
R200 | 5 | 92.18% | 92.46% | 92.32%<br />
R200 | 10 | 91.16% | 92.19% | 91.67%<br />

Fig. 5. Convolutional Neural Network architecture.<br />

IV. EXPERIMENTAL SETTING<br />

In order to collect the required data, we reproduced in our<br />

laboratory the real working environment by installing<br />

ATMSense on a decommissioned ATM provided by the Monte dei<br />
Paschi di Siena Bank. As a prototype, we taped an Intel<br />
RealSense SR300 to the ATM frame, and we installed the<br />
R200 camera on a support above the ATM. With<br />
both cameras connected, we recorded 132 depth videos<br />
simulating both the withdrawal and the attack scenarios,<br />
representing the two classes to be discriminated by the classifier.<br />
To improve variability and generalisation, these videos have<br />
been staged by several actors in different sessions, using<br />

different light conditions (which only slightly affect the<br />

acquired images). Videos have been manually labelled at the<br />

single frame level. Background profiling has been carried out<br />

by recording 25 videos without any kind of interaction with<br />

the ATM.<br />

A. CNN Training<br />

In the training phase, pre-processed videos (as stated in<br />

section III.C) are split between Train and Test sets as reported in<br />
Table I. The dataset is then generated by separating and<br />
shuffling sequences of consecutive frames together with the<br />
corresponding labels. In this way we obtained about 250,000<br />

and 30,000 labelled samples for training and test respectively.<br />
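The sample-generation step can be sketched as follows. Labelling each sequence by its final frame is an assumption for illustration; the paper does not state the exact rule used to derive a sequence's label from the per-frame tags.

```python
import random

def make_samples(video, labels, seq_len):
    """Cut one labelled video into overlapping sequences of consecutive
    frames; here each sample inherits the label of its final frame
    (an assumption - the paper does not state the exact rule)."""
    return [(video[i:i + seq_len], labels[i + seq_len - 1])
            for i in range(len(video) - seq_len + 1)]

video = [f"frame{i}" for i in range(10)]
labels = ["withdrawal"] * 6 + ["attack"] * 4    # per-frame manual tags
samples = make_samples(video, labels, seq_len=5)
random.shuffle(samples)                         # shuffle before training
assert len(samples) == 6
```

Applying this to all 132 recorded videos, with the sequence lengths of 1, 5 and 10 used in the experiments, yields the roughly 250,000 training and 30,000 test samples reported above.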

The training phase has been performed within the Keras<br />

framework using the TensorFlow backend. This enables an<br />

easy implementation capable of exploiting the multi-GPU<br />

cluster (provided by Barcelona Supercomputing Center). Since<br />

this process requires several hours to complete, and<br />
considerable trial-and-error testing was necessary to find<br />
the best hyper-parameters and network configurations, we also<br />
investigated a few settings related to<br />
computational issues. In practice, preliminary tuning of a<br />
few variables (e.g. the mini-batch size of the network forward<br />
step) halved the execution time of the training<br />
phase. With this tuning, the overall train-validation-test<br />

process has been accelerated by a scaling factor of 1.86 while<br />

maintaining the same accuracy.<br />


V. RESULTS<br />

Different CNN architectures have been tested, but we report<br />
only the results of the best one, composed of three convolutional<br />
layers, with ReLU non-linear activations and Max-Pooling<br />
for dimensionality reduction. The fully connected<br />
classification layer is composed of 256 hidden units. All the<br />
architectures have been tested on different datasets, generated<br />
using different lengths of the input sequences. The<br />
classification accuracies of the best networks are reported in<br />
Table II.<br />

Since the predictions are, in practice, not perfect, we added<br />
an additional layer to refine the operational performance.<br />
This layer determines whether to raise an alarm, based on<br />
majority voting over a buffer of recent network predictions (of<br />
length varying from 10 to 20 elements): an alarm is<br />
raised only if more than 95% of the last predictions are<br />
classified as attacks. This allows each<br />
video in the test set to be classified correctly in a more realistic<br />
scenario. We can find many configurations in which no false<br />
alarm is raised on withdrawal videos while, at the same time, all<br />
the attacks are detected. In Table III we report statistics on the<br />
detection time with respect to the beginning of an assault. For<br />
brevity, we only report the best case for each sequence length.<br />
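The voting layer can be sketched in a few lines. The class name and the choice to stay silent until the buffer is full are assumptions for illustration; the paper specifies only the buffer length range and the 95% threshold.

```python
from collections import deque

class AlarmFilter:
    """Majority-voting layer over the last `size` CNN predictions:
    raise an alarm only when more than 95% of them are 'attack'."""

    def __init__(self, size=15, threshold=0.95):
        self.buffer = deque(maxlen=size)
        self.threshold = threshold

    def update(self, prediction: str) -> bool:
        self.buffer.append(prediction)
        if len(self.buffer) < self.buffer.maxlen:
            return False                      # assumed: wait for a full buffer
        attacks = sum(p == "attack" for p in self.buffer)
        return attacks / len(self.buffer) > self.threshold

f = AlarmFilter(size=10)
for _ in range(9):
    assert not f.update("attack")             # buffer not yet full
assert f.update("attack")                     # 10/10 attacks -> alarm
```

At 6 classified frames per second, a 10-to-20-element buffer corresponds to roughly 1.7 to 3.3 seconds of evidence, which is consistent with the detection times in Table III.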

The reported detection times are acceptable for a real<br />
situation, since a potential attack can be detected in a few<br />
seconds, giving the Surveillance Control Room enough time to<br />
analyse the scene and, possibly, take dissuasive action or call<br />
security. From a practical point of view, the additional layer<br />
that filters the network's predictions by majority voting is<br />
fundamental to reaching the final results. We can also observe<br />

that feeding the classifier with a sequence of frames (5 or 10 in<br />
our tests) rather than a single frame does not lead to a<br />
remarkable improvement; in the end, this choice only<br />
delays the system's response. This may be because<br />
the scene understanding task is collapsed to a two-class<br />
classification problem. It also seems reasonable, from an<br />
external point of view, that a human could<br />
decide from a single picture of the scene whether an assault is<br />
taking place or not.<br />
TABLE I. NUMBER OF VIDEOS USED FOR CNN TRAINING<br />
Set | Withdrawal | Attack<br />
Train | 54 | 42<br />
Test | 18 | 18<br />
Total | 72 | 60<br />
TABLE III. ASSAULT DETECTION TIMES<br />
Camera | Seq. dim. | Avg (sec) | Min (sec) | Max (sec)<br />
SR300 | 1 | 3.00 | 2.50 | 5.16<br />
SR300 | 5 | 2.69 | 2.33 | 5.50<br />
SR300 | 10 | 4.25 | 4.00 | 7.00<br />
R200 | 1 | 1.89 | 1.66 | 3.33<br />
R200 | 5 | 4.45 | 4.00 | 10.00<br />
R200 | 10 | 4.27 | 4.00 | 8.00<br />

A. Real-Time Classification<br />

After the training, we tested the real-time performance on<br />

a Seco A80 SBC. Having relatively little computing power<br />
available, the application creates separate threads to<br />
parallelise the computation. The first handles the USB<br />
connection with the Intel RealSense camera and stores the<br />
incoming video frames in a buffer; another preprocesses<br />
the incoming frames, subtracting the background and reducing<br />
the noise; the last classifies the image.<br />
The A80 SBC can execute all the computation in real time.<br />
The heaviest threads are the image preprocessor, which runs in<br />
23 ms, and the CNN classifier, which runs in 17 ms.<br />
Considering that we need to classify 6 frames per second, the<br />
available computational power is more than enough for real-<br />
time operation.<br />
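The three-thread design can be sketched with Python's standard library. The stage bodies below are placeholders for the real camera read, filter cascade and CNN inference; the queue sizes and sentinel mechanism are illustrative choices, not details from the paper.

```python
import queue
import threading

# Three-stage pipeline mirroring the description above:
# capture -> preprocess -> classify, decoupled by bounded queues.
raw_q = queue.Queue(maxsize=30)
clean_q = queue.Queue(maxsize=30)
results = []
STOP = object()                               # sentinel to shut the pipeline down

def capture(n_frames):
    for i in range(n_frames):
        raw_q.put(f"frame{i}")                # stand-in for a USB camera read
    raw_q.put(STOP)

def preprocess():
    while (frame := raw_q.get()) is not STOP:
        clean_q.put(frame + ":clean")         # stand-in for denoise + bg-subtract
    clean_q.put(STOP)

def classify():
    while (frame := clean_q.get()) is not STOP:
        results.append((frame, "withdrawal"))  # stand-in for the CNN forward pass

threads = [threading.Thread(target=capture, args=(6,)),
           threading.Thread(target=preprocess),
           threading.Thread(target=classify)]
for t in threads:
    t.start()
for t in threads:
    t.join()
assert len(results) == 6
```

Because the stages run concurrently, the per-frame budget is set by the slowest stage (23 ms for preprocessing) rather than the sum of all stages, comfortably under the 167 ms available at 6 fps.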

VI. CONCLUSION<br />

In this work we propose an application of Automatic<br />
Video Analysis to improve the surveillance and security of<br />
ATMs. In laboratory tests, the system detects attacks<br />
very quickly, both when the depth camera is integrated into<br />
the ATM itself and when it is installed nearby. Moreover, the<br />
approach employs off-the-shelf technologies whose total cost<br />
is low when compared with the cost of an ATM<br />
or with the potential financial and collateral damage. The<br />
software solution is general, although<br />
additional data collection and a re-training phase will be<br />
necessary depending on the particular needs of specific situations.<br />
Although the current solution is customised for a single<br />
mode of assault, the obtained results have allowed us to schedule,<br />
in the short term, a more realistic experimentation phase in the field.<br />
Indeed, the very fast attack detection time will allow the<br />
Surveillance Control Room to intervene promptly. Moreover,<br />
the high accuracy reduces the possibility of false alarms.<br />

VII. FUTURE WORK<br />

Detection accuracy in a real-world scenario could be<br />
improved by collecting further data, enlarging the statistical<br />
variety of events analysed by the system. In general, more<br />
training data helps the CNN to generalise better, rather<br />
than over-fitting the training examples.<br />
The depth footage recorded for training is focused on<br />
explosive-based attacks. New videos could be recorded with<br />
a view to detecting additional kinds of ATM assault,<br />
providing more complete surveillance coverage.<br />
The downside of having more depth videos is the need to<br />
manually tag the frames. A complementary approach<br />
could be to introduce a Novelty Detection algorithm running<br />
in parallel with the CNN. As an example, the solution we<br />
proposed in [19] for bank branch Audio-Surveillance could be<br />
adapted to this scenario. This algorithm would be totally<br />
unsupervised, and capable of detecting any kind of anomaly<br />
arising from unexpected user behaviour. An arbiter<br />
would take the outputs of both algorithms as input and<br />
make the final decision.<br />

ACKNOWLEDGMENT<br />

This work was supported by Monte dei Paschi Bank grant<br />

DISPOC017/6. We thank NVIDIA for its support through the<br />
BSC/UPC NVIDIA GPU Center of Excellence. Antonio J.<br />
Peña is co-financed by the Spanish Ministry of Economy and<br />

Competitiveness under Juan de la Cierva fellowship number<br />

IJCI-2015-23266.<br />

REFERENCES<br />

[1] European Association for Secure Transactions: ATM Explosive Attacks<br />

surge in Europe, https://www.association-secure-transactions.eu/atmexplosive-attacks-surge-in-europe/,<br />

2016<br />

[2] European Association for Secure Transactions: EAST Publishes<br />

European Fraud Update 3-2017, https://www.association-securetransactions.eu/east-publishes-european-fraud-update-3-2017/,<br />

2017<br />

[3] J. Deng el al. “A large-scale hierarchical image database,” IEEE<br />

Conference on Computer Vision and Pattern Recognition (CVPR), pp.<br />

248-255, 2009<br />

[4] A. Krizhevsky et al., “Imagenet classification with deep convolutional<br />

neural networks”, Advances in neural information processing systems<br />

(NIPS), pp. 1097-1105, 2012<br />

[5] S. Herath et al., “Going deeper into action recognition: A survey,” Image<br />

and Vision Computing, pp. 4-21, 2017<br />

[6] F. Puente et al., “Improving online banking security with hardware<br />

devices,” 39th Annual International Carnahan Conference on Security<br />

Technology (CCST), pp. 174-177, 2005<br />

[7] H. Lasisi and A.A. Ajisafe, “Development of stripe biometric based<br />

fingerprint authentications systems in Automated Teller Machines,” 2nd<br />

International Conference on Advances in Computational Tools for<br />

Engineering Applications (ACTEA), pp. 172-175, 2012<br />

[8] R. AshokaRajan et al., “A novel approach for secure ATM transactions<br />

using fingerprint watermarking,” Fifth International Conference on<br />

Advanced Computing (ICoAC), pp. 547-552, 2013<br />

[9] H. Sako et al., “Self-defense-technologies for automated teller<br />

machines,” International Machine Vision and Image Processing<br />

Conference (IMVIP), pp. 177-184, 2007<br />

[10] M.M.E. Raj and A. Julian, “Design and implementation of anti-theft<br />

ATM machine using embedded systems,” International Conference on<br />

Circuit, Power and Computing Technologies (ICCPCT), pp. 1-5, 2015<br />

[11] S. Shriram et al., “Smart ATM surveillance system,” International<br />

Conference on Circuit, Power and Computing Technologies (ICCPCT),<br />

pp. 1-6, 2016<br />

[12] A. De Luca et al. , “Towards understanding ATM security: a field study<br />

of real world ATM use,” Proceedings of the sixth symposium on usable<br />

privacy and security, 2010<br />

[13] N. Ding et al. “Energy-based surveillance systems for ATM machines,”<br />

8th World Congress on Intelligent Control and Automation (WCICA),<br />

pp. 2880-2887, 2010<br />

[14] Y. Tang et al. “ATM intelligent surveillance based on omni-directional<br />

vision,” World Congress on Computer Science and Information<br />

Engineering (WRI), pp. 660-664, 2009<br />

[15] I-P. Chen et al., “Image processing based burglarproof system using<br />
silhouette image,” International Conference on Multimedia Technology<br />
(ICMT), pp. 6394-6397, 2011<br />

www.embedded-world.eu<br />

466


[16] X. Zhang, “A novel efficient method for abnormal face detection in<br />

ATM,” International Conference on Audio, Language and Image<br />

Processing (ICALIP), pp. 695-700, 2014<br />

[17] Seco SBC A80, http://www.seco.com/prods/it/sbc-a80-enuc.html, 2017<br />

[18] K. Fukushima, “Neocognitron: A self-organizing neural network model<br />

for a mechanism of pattern recognition unaffected by shift in position,”<br />

Biological Cybernetics, pp. 193-202, 1980<br />

[19] A. Rossi et al. “Auto-Associative Recurrent Neural Networks and Long<br />

Term Dependencies in Novelty Detection for Audio Surveillance<br />

Applications,” IOP Conference Series: Materials Science and<br />

Engineering, 2017<br />



Powering the Processor<br />

Basics of power conversion<br />

George Slama<br />

Wurth Electronics Midcom Inc.<br />

Watertown, SD USA<br />

george.slama@we-online.com<br />

Abstract— This paper covers the basics of power conversion, from<br />
linear regulators to switching converters: the non-isolated buck, boost and<br />
SEPIC converters used in low-voltage and battery systems, and the<br />
isolated off-line flyback converters commonly used in adapters. It<br />
introduces the major components, various control methods,<br />
compensation, ancillary circuits, safety and electromagnetic<br />
interference.<br />

Keywords—power; microprocessor; linear; switch-mode;<br />
inductor; capacitor; buck; boost; SEPIC; flyback; EMI; module<br />

I. INTRODUCTION<br />

For a microprocessor to perform any useful function, it needs<br />

clean regulated power from some type of power source. This<br />

might come from the wall receptacle or an energy storage<br />

device like a battery. The power source is usually variable and<br />

subject to harsh conditions. To begin with, the AC power from<br />

a wall receptacle has the wrong voltage and is of the wrong type<br />

– fluctuating between zero and as high as 375 volts peak,<br />

positive and negative, 50 or 60 times a second. Additionally, it<br />
is subject to transients from other connected devices and from<br />
lightning strikes. Batteries, on the other hand, may have the right<br />
type of voltage, DC, but they do not stay at a constant level: the<br />
voltage drops as energy is depleted, and it changes with<br />
temperature and load.<br />

Modern microprocessors have extremely fine features and<br />

therefore require precise voltages to operate without being<br />

damaged. Gone are the days of 15 V tolerant CMOS digital<br />

logic ICs! Today processor cores need 3.3 V, 1.8 V or even 1.2<br />

V. These voltages must be tightly regulated while at the same<br />

time current demand is dynamic – suddenly changing from<br />

quiescent to full load when the system goes from sleep mode to<br />

active.<br />

II. VOLTAGE REGULATION<br />
A. Linear regulators<br />

The simplest form of regulated voltage is to use a Zener<br />
diode. This is a solution only for low power, and where strict<br />
voltage levels are not required.<br />

To increase the precision (or tightness) of the regulation and<br />

to extend the power capacity, a linear regulator can be used.<br />

These consist of a precision voltage reference, an error<br />

amplifier and a semiconductor device acting as a controllable<br />

resistance. It’s an adjustable power voltage divider. In the past,<br />

they would be discrete circuits but today linear regulators come<br />

as complete integrated circuits with many additional features<br />

built in such as thermal shutdown. Though not as efficient as<br />

switching regulators, they are quiet from an electrical noise<br />

perspective, fast and inexpensive.<br />

Fig. 1. Linear regulator.<br />

The main problem is efficiency: they must dissipate, as heat,<br />
the product of the voltage difference between input and output<br />
and the load current. For a large input and a small output this<br />
can exceed the power actually delivered! Linear regulators<br />
are often used in two situations: one is to regulate a low-current,<br />
poorly regulated secondary or auxiliary output on a switching<br />
power supply; the other is where only small shifts in voltage are<br />
required, for instance a processor that runs on 3.3 V but gets<br />
its power from a 5 V USB connection, or generating voltages<br />
for sensors.<br />
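This dissipation arithmetic is easy to make concrete. The following Python sketch assumes an ideal series regulator whose pass element drops the full input-output difference at the load current; the function name and the quiescent-current parameter are illustrative, not from any datasheet.<br />

```python
def linreg_power(v_in, v_out, i_load, i_ground=0.0):
    """Dissipation and efficiency of an ideal series linear regulator.

    The pass element drops (v_in - v_out) at the full load current, so
    p_diss = (v_in - v_out) * i_load, plus v_in * i_ground for the
    regulator's own ground (quiescent) current when it matters.
    """
    if v_in < v_out:
        raise ValueError("a linear regulator can only step down")
    p_out = v_out * i_load
    p_diss = (v_in - v_out) * i_load + v_in * i_ground
    return p_diss, p_out / (p_out + p_diss)
```

Dropping a 5 V USB rail to 3.3 V at 0.5 A wastes 0.85 W for 66 % efficiency; dropping 12 V to 3.3 V at the same current wastes 4.35 W for only 27.5 %, which is why switching regulators take over at larger voltage differences.<br />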

In their integrated form, three terminal regulators come with<br />

either fixed output voltages or a pin where a resistive divider<br />

can set the output voltage. Different package styles determine<br />

the power handling ability. Over time, the minimum voltage<br />

drop between input and output has decreased from about 3 V to<br />

0.1 V as the pass transistor has been replaced by MOSFETs.<br />

The very low voltage units are called LDOs, for low drop-out.<br />

B. Switching regulators<br />

Switching converters are the heart of modern power<br />

conversion. The concept is based on chopping up a DC voltage<br />

into pulses, storing or converting the pulse energy in capacitors,<br />




inductors or a transformer and then passing those pulses<br />

through a filter, which averages them back to a steady voltage.<br />

Regulation comes from being able to set and adjust the pulse<br />

width automatically. If the pulse width were zero, the output<br />

would be zero. Similarly, at 50% the output could equal half the<br />

input voltage. If the load current increases and causes the output<br />

voltage to drop, the pulse width increases to compensate.<br />

Similarly, if the input voltage were to increase, the pulse width<br />

decreases to compensate. Thus, there is both input line and load<br />

regulation. Clever use of inductors and capacitors as energy<br />

storage devices can even boost the output voltage higher than<br />

the input.<br />

III. MAIN POWER COMPONENTS<br />

It is important to understand the components that make up a<br />

power supply and how their characteristics affect the design. A<br />

large part of the time spent designing is the selection of these<br />

parts because they affect the ultimate performance and cost of<br />

the unit. There are three main categories – capacitors, magnetics<br />

(inductors and transformers) and switches (including diodes).<br />

A. Capacitors<br />

Capacitors come in a large variety of types and styles. They<br />

are energy storage devices and therefore their capacity is size<br />

and material dependent. The two most obvious characteristics<br />

are capacitance value and voltage rating. Additional<br />
considerations include equivalent series resistance (ESR),<br />

equivalent series inductance (ESL), peak and RMS current<br />

rating, tolerance, aging effects, temperature limits, maximum<br />

dv/dt rating and failure mechanisms.<br />

Capacitors are divided into two general types – polarized<br />

and non-polarized. Polarized capacitors are electrolytic and<br />

super caps, and must be connected correctly. They tend to be<br />

large, have wide tolerances, store more energy, and have higher<br />

ESR and ESL. They mostly serve as bulk storage devices. Nonpolarized<br />

types are ceramic and metal film capacitors. They<br />

tend to be smaller, hence lower energy capacity, have lower<br />

ESR and ESL making them suitable for higher frequency<br />

operation. Typically, they are used for decoupling and filtering<br />

high frequencies. Ceramic capacitors can fail from overvoltage<br />

and mechanical stresses – like cracking.<br />

B. Magnetics (Inductors and transformers)<br />

Inductive components are one of the most important<br />

components in a switching power supply. Inductors function to<br />

limit current rate of change and store energy that provides<br />

power when the switch is off. Coupled inductors used in flyback<br />
transformers perform the same function but allow a greater<br />
difference between input and output voltages and can provide<br />
galvanic isolation. Transformers are used to convert voltages and<br />

currents to different levels in real time. Inductors store energy<br />

in their magnetic fields and release it. Various specialized<br />

inductors also provide filtering of common-mode and<br />
differential-mode noise.<br />

C. Power Switches/diodes<br />

Fast transistors and diodes make switching power supplies<br />

possible. Diodes can be considered as switches because they are<br />

one-way devices. The main characteristics of interest are the<br />

forward voltage drop when conducting, the reverse recovery<br />

time and the breakdown voltage. The forward voltage drop plays<br />

into efficiency, and at low output voltages synchronous rectifiers<br />
are replacing them. Synchronous rectifier is a term applied to<br />
MOSFETs that are used as diodes. Reverse recovery time is the<br />

time it takes the diode to stop the current flow when the polarity<br />

reverses. A slow diode can allow large transient currents that<br />

reduce efficiency and can create noise.<br />

Most small switching power supplies today use metal oxide<br />

semiconductor field-effect transistors, commonly called<br />

MOSFETs. N-channel types are predominant because they are<br />

smaller, more rugged, and less expensive. The MOSFET is a<br />

voltage driven device with a high turn-on threshold whose gate<br />

capacitance requires a high current transient for fast switching.<br />

The voltage drop is the fixed on-state resistance (Rdson) times<br />

the current. The bipolar transistor (BJT) is a current driven<br />

device with low turn-on threshold and low voltage drop. Other<br />

devices like IGBTs and Thyristors are used in applications that<br />

are more specialized.<br />

IV. SWITCHING POWER TOPOLOGIES<br />

Switching topologies come in two classes. Non-isolated<br />
supplies share a common ground between the input and<br />
output, whereas isolated supplies have some form of galvanic<br />
isolation between the input and output. Aside from technical<br />
reasons where it might be necessary, isolation is required in<br />
power supplies connected to the mains (the AC power receptacle)<br />
for the safety of the user.<br />

In the following explanations, the switching element is<br />
represented as a simple switch. It could be a transistor, a P-channel<br />
MOSFET or an N-channel MOSFET, with or without a driver.<br />

In many cases the diode could be a synchronous rectifier to<br />

increase efficiency.<br />

A. Buck<br />

Buck or step-down converters always have lower output<br />

voltage than the input voltage. The output voltage is directly<br />

proportional to the duty cycle. Compact size, high efficiency,<br />

fast response, and the ability to be shut-off make them an<br />

attractive alternative to linear regulators.<br />

Fig. 2. Buck regulator.<br />

When S1 closes the current increases linearly in L1, storing<br />

energy in its magnetic field as it charges C1 and supplies the<br />

load. Diode D1 is back biased with its cathode at Vin. When S1<br />

opens the magnetic field in L1 starts to collapse, reversing its<br />

voltage polarity as it tries to maintain the current flow. Now D1<br />



becomes forward biased and current continues to flow to<br />

recharge the capacitor and to the load. The low pass filter formed<br />

by L1 and C1 smooths the pulses into a steady voltage. The<br />

voltage-time products of the on period and the off period must<br />
be equal.<br />

B. Boost<br />
Boost or step-up converters always have an output voltage<br />
higher than the input voltage. The circuit uses the same<br />
components, only rearranged. This time the output voltage is<br />
proportional to 1/(1-D), where D is the duty cycle. The practical<br />
limit for voltage boosting is 2-3 times the input. The input<br />
voltage must not rise above the output, for then D1 would<br />
conduct, connecting the input to the output without regulation.<br />
Fig. 3. Boost regulator.<br />
When S1 is closed the current flows through L1, increasing<br />
linearly, storing energy in its magnetic field. The load current<br />
comes solely from the capacitor C1. Diode D1 is back biased<br />
because its anode is tied to ground by S1. When S1 opens, the<br />
magnetic field of L1 starts to collapse, reversing the voltage.<br />
This voltage will rise until the diode D1 conducts, then<br />
passing current to recharge the capacitor C1 and to the load.<br />
C. Buck-boost<br />
Inverting buck-boost converters provide a stable output<br />
voltage whether the input voltage is higher or lower. One caveat<br />
is that the output voltage polarity is opposite to the input. This<br />
can still be used with batteries because the battery can be left<br />
floating. Because S1 does not have a ground connection, a level<br />
shifter is required, which adds complexity to the design.<br />
Fig. 4. Buck-boost regulator.<br />
In this circuit, when S1 closes the current once again flows<br />
through the inductor, increasing linearly, storing up energy in<br />
the magnetic field. Diode D1 is back-biased because its cathode<br />
is at the input voltage. The load current comes solely from the<br />
capacitor C1. When S1 opens, the magnetic field of L1<br />
collapses, reversing the voltage, allowing D1 to conduct,<br />
recharging the capacitor and supplying the load. Note, however,<br />
the polarity reversal from the input.<br />
D. SEPIC<br />
The single ended primary inductance converter (SEPIC) is a<br />
two-stage derivative of the buck-boost converter, used when the<br />
required output voltage could be higher or lower than the input<br />
voltage. This converter does not invert the output voltage<br />
polarity and has the advantage of low ripple current. This makes<br />
it ideal for power supplies that use batteries as a power source<br />
because, with a common ground rail, it can recharge the batteries<br />
and at the same time power the circuit. The capacitor C1 provides<br />
inherent output short-circuit protection.<br />
Fig. 5. SEPIC regulator.<br />
The SEPIC converter operates as follows. When S1 closes,<br />
the current flows through the inductor L1, increasing linearly,<br />
storing energy in its magnetic field. At the same time, capacitor<br />
C1, which would have previously been charged to Vin,<br />
discharges its energy into L2. The diode D1 is back-biased and<br />
capacitor C2 supplies the load. When S1 opens, L1 transfers its<br />
energy to C1 while L2 transfers its energy through D1 into C2<br />
and the load. L1 and L2 can be on the same core.<br />
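The ideal duty-cycle relations for these non-isolated topologies can be gathered into one Python sketch. These are the textbook lossless, continuous-conduction formulas only; real converters deviate, especially near the duty-cycle extremes, and the function and topology names are purely illustrative.<br />

```python
def ideal_vout(topology, v_in, duty):
    """Ideal CCM output voltage for the non-isolated topologies above.

    Lossless components are assumed; a negative result indicates the
    inverted output polarity of the buck-boost.
    """
    if not 0.0 <= duty < 1.0:
        raise ValueError("duty cycle must be in [0, 1)")
    d = duty
    formulas = {
        "buck": v_in * d,                     # always below v_in
        "boost": v_in / (1.0 - d),            # always above v_in
        "buck-boost": -v_in * d / (1.0 - d),  # inverted polarity
        "sepic": v_in * d / (1.0 - d),        # above or below, same polarity
    }
    return formulas[topology]
```

For example, a buck at 50 % duty halves its input, while a boost at the same duty doubles it.<br />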

E. Isolated Flyback<br />

The flyback converter is an isolated version of the buck-boost<br />
converter. Instead of a single-winding inductor, it uses<br />

two coupled windings on one core but it is still an inductor. One<br />

winding is used during the first half of the cycle to store the<br />

energy in the magnetic field and the other is used to harvest the<br />

energy for the output. This provides two useful benefits in<br />

situations like off-line converters - galvanic isolation for safety<br />

and a large turns ratio allowing the converter to operate with a<br />

large input range and still have a reasonable pulse width. If a<br />

buck converter had to drop 400 V to 5 V the pulse width would<br />

be extremely narrow, too narrow to be useful. The turns ratio<br />

allows the converter to operate at reasonable duty cycles. It<br />



makes the wide input voltage range seen in universal battery<br />

chargers that can operate from 85 to 400 VDC possible.<br />
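The 400 V-to-5 V argument can be checked numerically with the ideal relations. In this sketch the 20:1 turns ratio is a hypothetical value chosen purely for illustration.<br />

```python
def buck_duty(v_in, v_out):
    """Ideal buck duty cycle: D = Vout / Vin."""
    return v_out / v_in

def flyback_duty_ccm(v_in, v_out, turns_ratio):
    """Ideal CCM flyback duty cycle, with turns_ratio = Np / Ns.

    Volt-second balance on the coupled inductor gives
    Vin * D = turns_ratio * Vout * (1 - D), hence
    D = n * Vout / (Vin + n * Vout).
    """
    n_vout = turns_ratio * v_out
    return n_vout / (v_in + n_vout)

# Dropping 400 V to 5 V: a plain buck would need a 1.25 % pulse,
# while a flyback with a 20:1 winding runs at a comfortable 20 %.
```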

Fig. 6. Isolated flyback converter.<br />

The flyback converter works like the boost converter.<br />

When S1 closes, the current flows through the inductor L1,<br />

increasing linearly, storing up energy in the magnetic field.<br />

Diode D1 is back-biased because its anode is effectively at<br />

ground by the dot convention of the windings (the dotted ends<br />

of a winding are at the same polarity). Capacitor C1 supplies the<br />

load. When S1 opens, the magnetic field of L1 collapses,<br />

reversing the voltage polarity, allowing D1 to conduct,<br />

recharging the capacitor and supplying the load.<br />

F. Discontinuous and continuous mode<br />

Most of the converters mentioned can operate in several<br />

ways with respect to the currents flowing in the storage<br />

elements. In discontinuous current mode (DCM) the current<br />

returns to zero during each period. The advantage is lower turn<br />

on losses in the switch because no current is flowing when it<br />

turns on. Generally, the inductor is smaller but the ripple and<br />

peak currents are higher.<br />

In continuous current mode (CCM) the inductor current does<br />

not return to zero, so the current is switched while flowing, increasing<br />

switching losses. The inductor is larger due to the constant DC<br />

bias but the ripple is smaller resulting in lower peak currents.<br />

In the quest for better efficiency, other methods have been<br />

developed such as boundary mode, valley switching, multimode,<br />

pulse skipping and so on.<br />
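Whether a given design lands in CCM or DCM at a particular load can be estimated from the inductor ripple current. The sketch below does this for an ideal buck converter; the ideal relations are standard, but any specific component values used with it are illustrative only.<br />

```python
def buck_ripple_current(v_in, v_out, inductance_h, f_sw_hz):
    """Peak-to-peak inductor ripple current of an ideal buck in CCM.

    During the on-time D / f_sw the inductor sees (v_in - v_out), so
    delta_i = (v_in - v_out) * D / (L * f_sw), with D = v_out / v_in.
    """
    duty = v_out / v_in
    return (v_in - v_out) * duty / (inductance_h * f_sw_hz)

def buck_mode(v_in, v_out, inductance_h, f_sw_hz, i_load):
    """'CCM' while the average load current exceeds half the ripple;
    below that the inductor current reaches zero each cycle: 'DCM'."""
    ripple = buck_ripple_current(v_in, v_out, inductance_h, f_sw_hz)
    return "CCM" if i_load > ripple / 2.0 else "DCM"
```

A 12 V-to-5 V buck with a 22 uH inductor at 500 kHz, for instance, slips into DCM when the load falls below roughly 130 mA.<br />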

V. CONTROL METHODS FOR REGULATION<br />

A. Hysteretic control<br />


Hysteretic control, sometimes called “bang-bang” control,<br />
is the simplest and least expensive method to implement. It is<br />
merely a voltage comparator that compares the measured output<br />
voltage against a voltage reference and turns the power switch<br />
on if it is too low or off if it is too high. The hysteresis is the<br />
difference between the two levels and determines the amount of<br />
output voltage ripple that will be present.<br />
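The bang-bang behaviour is easy to see in a toy simulation. In the sketch below the plant is reduced to a voltage that ramps while the switch is on and sags while it is off; all step sizes are purely illustrative, not a real converter model.<br />

```python
def simulate_hysteretic(v_ref=3.3, hysteresis=0.1, v_start=3.0,
                        charge_step=0.02, discharge_step=0.01, steps=2000):
    """Toy discrete-time model of a bang-bang regulator.

    Returns (min_v, max_v) of the output after the start-up transient,
    i.e. the ripple band the comparator settles into.
    """
    low, high = v_ref - hysteresis / 2, v_ref + hysteresis / 2
    v, switch_on = v_start, True
    seen = []
    for _ in range(steps):
        v += charge_step if switch_on else -discharge_step
        if v >= high:        # too high -> turn the switch off
            switch_on = False
        elif v <= low:       # too low -> turn it back on
            switch_on = True
        seen.append(v)
    settled = seen[len(seen) // 2:]   # ignore the start-up transient
    return min(settled), max(settled)
```

With the defaults, the output settles into a ripple band of roughly the programmed hysteresis centred on the 3.3 V reference.<br />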

B. Constant on-time control<br />

This type of control also operates at a variable frequency.<br />

The on time is constant, set by a timer and the off time varies<br />

by comparing to the limits as in hysteretic control. The<br />
advantages are a simple, stable control system with few<br />
components, fast response, and high efficiency at light loads.<br />

C. Constant off-time control<br />

Though similar to constant on-time, this control scheme<br />

suffers under light load, where the frequency must increase and<br />

the pulse width decrease, causing performance issues. However,<br />
it has a place in some specialized applications like fast charging<br />

of flash capacitors.<br />

D. Voltage mode control<br />

This technique operates at a fixed frequency and varies the<br />

duty cycle in proportion to the error difference between the<br />

actual output voltage and a reference voltage. This means it can<br />

only respond to changes in the load voltage. Since it does not<br />

measure load current or input voltage, it must wait for the effect<br />

on the load voltage. There is always a delay of several clock<br />

periods before the control loop reacts and stabilizes. The control<br />

needs to be compensated to avoid instability and overshoot.<br />

Fig. 7 shows a typical implementation of a voltage-mode<br />

PWM controller. The error amplifier (EA) measures the<br />

difference between a highly accurate voltage reference and the<br />

output voltage scaled down by the voltage divider formed by R1<br />
and R2. The error amplifier output is proportional to the difference<br />

between the reference and the output voltage. This feeds to the<br />

PWM comparator where it is compared to a linear, periodic<br />

ramp voltage, which starts at zero at the beginning of each clock<br />

cycle. The latch also turns on the power switch at the beginning<br />

of each cycle. When the ramp voltage crosses the error voltage,<br />

the latch resets, turning off the power switch.<br />
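One cycle of that ramp-and-latch mechanism reduces to a simple comparison, sketched here in idealized form (no propagation delay, blanking or minimum on-time; the function name is illustrative).<br />

```python
def pwm_duty_from_error(v_error, ramp_peak=1.0, steps=1000):
    """One cycle of the voltage-mode modulator described above.

    The latch turns the switch on at the start of the cycle; stepping a
    linear ramp from 0 to ramp_peak, the comparator resets the latch at
    the first point where the ramp crosses the error voltage. Returns
    the resulting duty cycle.
    """
    on_steps = 0
    for i in range(steps):
        ramp = ramp_peak * i / steps
        if ramp >= v_error:       # comparator trips, latch resets
            break
        on_steps += 1             # switch stays on
    return on_steps / steps
```

The duty cycle simply tracks the error voltage as a fraction of the ramp peak, clipping at 0 and 100 %.<br />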

The controller can overshoot and then undershoot so the<br />

voltage is always oscillating around the desired level. The<br />

feedback response is often slowed down to reduce this behavior<br />

with the disadvantage the converter will be slower to respond<br />

to sudden changes.<br />

Fig. 7. Voltage mode control.<br />

E. Current mode control<br />


This control technique builds on voltage mode control by<br />

adding a second control loop based on the switch current. There<br />




is an inner control loop regulating the current and an outer<br />

control loop that regulates the voltage as in voltage mode. The<br />

loops run at different speeds. The current loop reacts on a<br />
pulse-by-pulse basis, whereas the voltage loop is slower, being after<br />

the output filter.<br />

Fig. 8 shows a typical implementation of a current-mode<br />
PWM controller. The principal difference is that the current sense<br />
voltage replaces the ramp voltage. Operation is similar. At the<br />

beginning of the cycle, the latch is set and the switch turned on.<br />

The current sense voltage, typically measured across a shunt<br />

resistor, rises until it meets the error voltage. The latch is reset<br />
and the power switch turns off until the cycle starts again.<br />

Fig. 8. Current mode control.<br />

One benefit is that the inductive element builds up the same<br />

energy level regardless of the input voltage. A change in input<br />

voltage affects the rate of rise and duration of the charging<br />

current – taking longer for lower voltages and less time for<br />

higher voltages. This scheme adjusts on a pulse-by-pulse basis<br />

without the voltage control loop.<br />

A second benefit is pulse-by-pulse current limit by merely<br />

clamping the maximum error amplifier’s output voltage level.<br />

A third benefit is faster response time to load changes. An<br />

increase in load would cause the error voltage to increase,<br />

extending the charging duration. The inner current loop would<br />

have limited effect until the output voltage rose to the regulated<br />

level again. Current control is the preferred mode for most<br />

designs.<br />
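The first two benefits can be illustrated with a small sketch of a single idealized pulse: the on-time scales inversely with the input voltage for the same peak current, and clamping the error voltage caps the peak current directly. All component values used with it are placeholders, not from any design.<br />

```python
def peak_cm_on_time(v_error, v_in, inductance, r_sense,
                    v_clamp=None, t_max=5e-6):
    """On-time of one idealized peak-current-mode pulse.

    The switch current ramps at v_in / L; the pulse ends when the
    sensed voltage i * r_sense reaches the (optionally clamped)
    error voltage. The clamp is the pulse-by-pulse current limit.
    """
    v_target = v_error if v_clamp is None else min(v_error, v_clamp)
    i_peak = v_target / r_sense          # current at the comparator trip
    di_dt = v_in / inductance            # inductor charging slope
    return min(i_peak / di_dt, t_max)    # bounded by the clock period
```

Halving v_in doubles the on-time while the peak current, and hence the energy stored per pulse, stays the same.<br />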

F. Multi-phase control<br />


One of the consequences of our ever-increasing digital<br />
world is processors that use low input voltages with high<br />
current loads. Additionally, to save energy, these loads are<br />
dynamic: the current demands are large and fast while a<br />
tight voltage level must be maintained. From a practical point of<br />
view, a single buck regulator can work up to around 20 A. Beyond<br />

that, multi-phase controllers have been developed to run<br />

multiple regulators in parallel but offset in phase. This divides<br />

the load, reduces the ripple and allows faster response to load<br />

changes. The components are smaller, with less stress and the<br />

thermal load is distributed over a larger area.<br />


G. Feedback loop compensation<br />

All voltage regulators using negative feedback rely on the<br />

corrective nature of the feedback signal to compensate for<br />

errors generated in the forward path. In practice, the gain and<br />

phase of both the forward and feedback paths vary with<br />

frequency, and therefore it is possible that at some frequency<br />

(or frequencies) the output voltage will respond too slowly<br />

resulting in inadequate performance, or too fast, causing<br />

oscillation or ringing.<br />

The term frequency compensation describes the design of<br />

feedback circuits that take into account the frequency response<br />

of the forward path and ensure that the frequency response of<br />

the feedback signal compensates it in such a way that the system<br />

provides adequate performance and is stable.<br />

There are three types of compensation schemes known as<br />

Type 1, Type 2 and Type 3, as shown in fig. 9. The criterion<br />
for stability in a system employing negative feedback is that the<br />
loop gain must be less than 0 dB when the loop phase reaches 360°.<br />

The term phase margin refers to the loop phase at the<br />

frequency where the loop gain equals 0 dB, and the term gain<br />

margin refers to the loop gain at the frequency where the loop<br />

phase equals 360° (see fig.10). Gain margin and phase margin<br />

are both terms that describe the stability of a negative feedback<br />

system in qualitative terms. In general, higher gain and phase<br />

margins indicate a more stable system.<br />

Generally, a system with a gain margin greater than 10 dB and<br />

a phase margin greater than 45° will perform adequately in most<br />

applications. Additionally, the loop gain should exhibit a slope<br />

of -20 dB per decade as it passes through the 0 dB axis.<br />

Explaining and determining the proper compensation is beyond<br />

the scope of this paper.<br />
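Even without going into compensator design, the gain crossover and phase margin are easy to evaluate numerically for a known loop transfer function. The sketch below, using only the Python standard library, sweeps an example two-pole loop; the pole frequencies are illustrative, and the phase margin is expressed in the usual form of 180° minus the loop's phase lag at crossover.<br />

```python
import cmath
import math

def phase_margin_deg(loop_gain, f_lo=1.0, f_hi=1e7, points=20000):
    """Find the 0 dB crossover of loop_gain(f) on a log sweep and
    return the phase margin there (180 deg minus the phase lag).

    loop_gain is any callable returning the complex loop gain at a
    frequency f in Hz.
    """
    crossover = None
    for i in range(points):
        f = f_lo * (f_hi / f_lo) ** (i / (points - 1))
        if abs(loop_gain(f)) <= 1.0:      # first frequency at/below 0 dB
            crossover = f
            break
    if crossover is None:
        raise ValueError("no 0 dB crossover in the swept range")
    lag_deg = -math.degrees(cmath.phase(loop_gain(crossover)))
    return 180.0 - lag_deg

def example_loop(f, f0=1e3, f_p=10e3):
    # Integrator with unity gain at f0, plus a power-stage pole at f_p.
    return (f0 / (1j * f)) * (1.0 / (1.0 + 1j * f / f_p))
```

With the crossover a decade below the second pole, this example loop shows a phase margin in the mid-80s of degrees, comfortably above the 45° rule of thumb quoted above.<br />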

Fig. 9. Types of compensation circuits (Type 1, Type 2, Type 3).<br />




Zener diode is used to trigger an SCR across the output terminals,<br />

effectively shorting them and causing a fuse to blow.<br />


E. Shorts<br />

Short circuits caused by a load failure (or accident) can be<br />

protected by fuses but it is preferred if the controller can limit<br />

the current and resume operation once the short is removed.<br />

Current foldback is one means; pulse-by-pulse current<br />
limiting, inherent in current mode control, is another.<br />

Fig. 10. Gain and phase margins illustrated on a Bode plot.<br />
VI. ANCILLARY CIRCUITS<br />
A. Under voltage lockout<br />

Under voltage lockout protection circuits allow for<br />

controlled operation during power-up and power-down<br />

sequences. The UVLO circuit ensures that Vcc is adequate to<br />

make the controller fully operational before enabling the output<br />

stage. This may be part of the controller, an external circuit or<br />

a combination of the two. Today microprocessors often use<br />

multiple voltages, and the turn-on sequence and timing are<br />
critically important. Specialized power controllers are often<br />
available to match a large microprocessor, with multiple<br />
switching regulator controllers, LDOs, timing and<br />

communication built into one chip.<br />

B. Soft start<br />

At power up, to limit inrush current due to uncharged<br />

capacitors or a connected load, it is desirable to increase the<br />

PWM pulse width gradually, starting at zero duty cycle. Most<br />

modern controller ICs have this important function built-in or a<br />

means to achieve it with external circuitry.<br />

C. Fault management<br />

Power supply faults can be divided into two categories –<br />

human safety and circuit failure. Human safety is covered by<br />

international and national standards such as IEC 62368-1 2nd edition,<br />

“Audio/video, information and communication technology<br />

equipment – Part 1: Safety requirements”. This standard and<br />

others cover all potential hazards – electrical, fire, chemical,<br />

mechanical, thermal and radiation. They impose extra safety<br />

features beyond what’s needed for functional operation. For<br />

example, off-line transformers are slightly larger to<br />

accommodate extra spacing (creepage and clearance<br />

requirements) and extra insulation (double or reinforced).<br />

D. Overvoltage protection<br />

Should the controller or some critical component fail it is<br />

desirable to have some means of protecting downstream circuits<br />

for systems where the input voltage is higher than the output. A<br />

typical over voltage protection circuit is the ‘crowbar’ where a<br />

VII. ELECTROMAGNETIC NOISE<br />

The many benefits of switching power supplies, primarily<br />

their small size and high efficiency come with the price of<br />

needing to deal with the fast switching voltage and current<br />

waveforms. These fast transients may cause noise, which can<br />

interfere with other electronic devices. Generally, noise below<br />

30 MHz is conducted – either as differential mode or as<br />

common mode, and noise above 30 MHz is radiated – either<br />

magnetic or electric.<br />

Differential mode noise, also known as normal mode, is the<br />

disturbance across the power or signal lines. It follows the<br />

normal current paths with current flowing down a wire in one<br />

direction and returning on another. In common mode noise the<br />

disturbance is across multiple lines with an external conduction<br />

path like earth ground or chassis. The currents return through a<br />

different path than the normal one.<br />

There are strict standards limiting the amount of noise<br />

allowed. The International Special Committee on Radio Interference (CISPR) publishes CISPR 22, the most widely accepted standard. Consequently, all power supplies must be<br />

designed and built to meet the standards. Most off-line power<br />

supplies need some type of additional input filtering in the form<br />

of inductors and capacitors. The entire topic is one of<br />

specialization requiring unique equipment and setups to test, all<br />

of which is beyond the scope of this paper.<br />

Fig. 11. Minimum basic mains line filter.<br />

VIII. MODULES<br />

Complete power supplies as modules have been available for<br />

some time. More recently with the increase in operating<br />

frequency and miniaturization, modules the size of large ICs<br />

have become available. These contain the complete power<br />

supply – controller with the feedback loop, the magnetics and<br />

even some capacitance. Only the bulk input capacitor and<br />

perhaps a voltage divider to set the output need to be added.<br />

473


This can save considerable time in designing, building and<br />

testing a power supply.<br />

IX. SUMMARY<br />

Linear regulators are simple and easy to use. Though less efficient and limited to reducing voltages, they offer low noise and fast transient response. Switching regulators are more complex and more efficient, and can be used to either reduce or increase the voltage; a prime example is extending the operating time of battery-based systems. Today, products with microcontrollers often use both in the same system, using each solution to its best advantage.<br />




Achieving Ultra Low Power in Embedded Systems<br />

Understand where your power goes and what you can do to make things better<br />

Herman Roebbers<br />

Embedded Systems<br />

Altran Netherlands B.V.<br />

Eindhoven, The Netherlands<br />

Herman.Roebbers@altran.com<br />

Abstract— Over the last years, the need to reduce energy consumption has been growing. This article focuses on the possibilities for reducing energy consumption in embedded systems. We argue that energy consumption is a system issue and therefore a matter of making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. There are many things that can be done to reduce energy consumption; the goal is to define an approach for achieving lower energy consumption. Criteria for the selection of an appropriate MCU are also presented. Conclusion: many (unexpected) things can have a big impact on your achievable battery lifetime. Look beyond just the CPU/processor and software in order to achieve better results.<br />

Keywords— Ultra Low Power; approach; embedded; system<br />

issue; reducing energy consumption<br />

I. INTRODUCTION<br />

In recent years the need to reduce energy consumption has grown. On the one hand this is instigated by governments (e.g. EnergyStar), on the other hand by the need to do more with the same or less energy (think of mobile phone battery lifetime, or Internet-of-Things node battery lifetime). In this article we will focus on the background of energy consumption in embedded systems and how to reduce this consumption (or its effect). This article covers part of a two-day Ultra Low Power workshop on this subject, available via the High Tech Institute (http://www.hightechinstitute.nl), T2prof and Altran.<br />

That energy consumption is an important issue is illustrated by how loudly chip manufacturers advertise their energy-efficient chips. There are even benchmarks for the energy efficiency of embedded processors: the EEMBC ULPMark™ (http://www.eembc.org/ulpbench) CP (Core Profile) and PP (Peripheral Profile), IoTMark-BLE, and the soon-to-be-released SecureMark.<br />

Energy consumption is an important point in all sorts of systems. It becomes ever more important in the IoT world, where the biggest consumer is usually the radio. All sorts of solutions are tried to keep the radio on for as short a time as possible. This leads to non-standard protocols that use much less energy than standard protocols.<br />

It is important to realize that energy consumption is a system issue, and a matter of weighing one thing against another and making compromises. Energy consumption can be reduced by software, but only as far as the hardware allows. It is also a multidisciplinary effort, because both the software and hardware disciplines must be involved in the design in order to achieve the desired goal.<br />

For this article we limit ourselves to smaller embedded<br />

systems like sensor nodes. These systems are typically asleep for<br />

a large proportion of the time. Depending on what functionality<br />

is required during sleep and how fast the system must wake up,<br />

the system can sleep lighter or deeper.<br />

There are many measures that can reduce energy<br />

consumption. The goal is to define an approach that should lead<br />

to less energy consumption. That approach is detailed in this<br />

article as well as in the workshop.<br />

II. CATEGORIES OF MECHANISMS FOR ENERGY REDUCTION<br />

The mechanisms for energy reduction fall into three main<br />

categories. TABLE 1 lists commonly used mechanisms per<br />

category. This list is not exhaustive. Different vendors may use<br />

different names for the same mechanism.<br />

A. Software only (includes compiler)<br />

The energy reduction mechanism is solely implemented in<br />

the software domain.<br />

B. Software and hardware combined<br />

Hardware and software together implement an energy<br />

reduction mechanism.<br />

C. Hardware only<br />

The energy reduction mechanism is implemented at the<br />

hardware level.<br />

Each of the hardware mechanisms mentioned in the table<br />

below may or may not be available in your system. If the<br />

hardware does not support it then software cannot use it.<br />



TABLE 1. POWER MANAGEMENT MECHANISMS (power management works at all these levels)<br />

Level | Mechanism | Category (Domain)<br />
Application | Event driven architecture; Use Low Power modes; Select radio protocol; … | A (Software)<br />
Operating System | Power API; Operation Performance Points API; Tickless operation | B (Software & Hardware)<br />
Driver | Use DMA; Use HW event mechanisms; Suspend / resume API | B (Software & Hardware)<br />
Board | Dynamic Voltage and Frequency Scaling; Power gating via I/O pin; Controlling voltage regulator via I/O pin; Clock frequency management; Controlling device shutdown pins by I/O pin | B (Software & Hardware)<br />
Chip | Power gating; Offer Low Energy Modes; (Automatic) clock gating; Clock frequency management | C (Hardware)<br />
IP block / chip | Dynamic Power Switching; Adaptive Voltage Scaling; Static Leakage Management | C (Hardware)<br />
IP block / RTL | Power Gating State Retention; Automatic power / clock gating | C (Hardware)<br />
Transistor | Body bias; FinFET; TriGate FET; Sub-threshold operation | C (Hardware)<br />
Substrate | SOI, FD-SOI | C (Hardware)<br />

III. SIMPLE THINGS TO DO<br />

A. Look at the OS configuration (if there is an OS)<br />

Operating Systems use a periodic scheduler invocation<br />

(‘tick’) to check whether the currently executing process is still<br />

allowed to use the processor or whether it should be descheduled in<br />

favor of some other process. This periodic invocation can take<br />

quite some time, and also happens if no processes are ready for<br />

execution. In this case a so-called idle task is executed, which<br />

usually consists of a simple while (1) {}; loop, just<br />

burning energy.<br />

Some Operating Systems (e.g. Linux and FreeRTOS) offer<br />

what is known as a tickless configuration to make the CPU sleep<br />

until either a timer expires or an interrupt occurs. The standard<br />

scheduler tick timer (default 100 Hz for Linux versions prior to<br />

version 3.10) is then no longer necessary. In versions before 3.10<br />

the kernel configuration option CONFIG_NO_HZ enables this behavior; in later versions it is CONFIG_NO_HZ_IDLE. For FreeRTOS to be used in this way, configUSE_TICKLESS_IDLE must be set to 1. When applicable,<br />

this is a very simple way to (possibly substantially) reduce<br />

power.<br />
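As a concrete example, in FreeRTOS the tickless idle mode is enabled from FreeRTOSConfig.h (the threshold below is the stock default; ports may also supply their own low-power sleep implementation):

```c
/* FreeRTOSConfig.h (fragment) */
#define configUSE_TICKLESS_IDLE               1  /* stop the tick while idle */
/* Only enter tickless sleep when at least this many idle ticks remain: */
#define configEXPECTED_IDLE_TIME_BEFORE_SLEEP 2
```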

B. Look at the architecture of the application<br />

If we look at the architecture of the application software we<br />

can distinguish two major types: Super loop or event driven. The<br />

super loop goes around one big loop all of the time, often not<br />

sleeping at any time. In order to reduce energy consumption we<br />

would like the system to sleep as long as possible between<br />

successive passes through the loop. It depends on the application<br />

whether sleeping is allowed at all and what the maximum<br />

sleeping time can be. It may, however, be quite possible to do<br />

some sleeping at the end of the loop without causing any<br />

problem and in doing so save substantial energy.<br />

IV. APPROACH FOR OBTAINING ULTRA LOW POWER<br />

We will now describe our approach toward achieving ultra-low power in a step-by-step fashion. Basically the strategy is: use the facilities the hardware offers. We can do this in steps, roughly in the order these features were offered over time.<br />

A. In the beginning<br />

In the beginning there was only one bus master in the system:<br />

the CPU. It could read data from instruction memory and read<br />

from and write data to data memory and peripherals. In order to<br />

check for an event the CPU had to resort to polling:<br />

while (!event_occurred())<br />

{};<br />

This piece of code keeps the CPU busy, as well as the code<br />

memory and the bus. Both the CPU and the code memory (flash<br />

usually) are big contributors to the total energy consumption,<br />

especially when code memory isn’t cached.<br />

B. Phase 2: Introducing Direct Memory Access (DMA)<br />

At some point in time a second bus master is introduced: The<br />

DMA unit. It is capable (after being programmed by the CPU)<br />

to access memory and peripherals autonomously. It can also<br />

generate an interrupt to the CPU to signal completion of its task,<br />

e.g. copying of peripheral data to memory or vice versa. This<br />

DMA unit can operate in parallel with the CPU, but they cannot<br />

access the bus simultaneously. While the DMA is copying data,<br />

the CPU can check a variable in memory for DMA completion.<br />

Pseudocode of the Interrupt Service Routine (ISR):<br />

void ISR_DMA_done(void)<br />
{<br />
    ... /* clear interrupt */<br />
    ready = true;<br />
}<br />



The main program:<br />

volatile bool ready = false;<br />
setup_peripherals_and_DMA();<br />
start_DMA();<br />
while ( ! ready )<br />
{<br />
    __delay_cycles(CHECK_INTERVAL);<br />
}<br />

Here we check another variable, but not continuously.<br />

The __delay_cycles() function executes NOP<br />

instructions during CHECK_INTERVAL. This keeps the data<br />

bus free so that the DMA unit isn’t hindered by the CPU’s data<br />

accesses and so may complete its assignment more quickly. The CPU<br />

is still fetching code from instruction memory, though.<br />

C. Stop the CPU clock when possible<br />

A relatively recent addition to the CPU’s capabilities is<br />

stopping the CPU clock until an interrupt occurs, saving power<br />

by doing so. This can be in the form of a<br />

WAIT_FOR_INTERRUPT instruction, which removes the<br />

clock from the CPU core until an interrupt occurs. ARM CPU<br />

cores offer the WFI instruction for this purpose, others such as<br />

MSP430 set a special bit in the processor status register to<br />

achieve the same effect. This does not affect our interrupt service<br />

routine. Our main program code changes thus:<br />

volatile bool ready = false;<br />
setup_peripherals_and_DMA();<br />
start_DMA();<br />
while ( ! ready )<br />
{<br />
    __WFI(); /* special insn, CPU sleeps */<br />
}<br />

In the new situation the CPU is stopped by disabling its clock<br />

until the interrupt occurs. This saves energy in several ways: The<br />

CPU is not active, instruction memory is not read and both the<br />

data bus and the instruction bus are completely available for the<br />

DMA unit to use. Most new processors support this trick.<br />

D. Events<br />

Later CPUs have the notion of events that also can be used<br />

to wake up the CPU from sleep. This mechanism is quite similar<br />

to that of using the interrupt, except that no ISR gets invoked.<br />

This saves some overhead if the ISR didn’t have to do anything<br />

other than wake the CPU. Using this mechanism requires that<br />

the CPU have an instruction to WaitForEvent. ARM Cortex<br />

processors have the WFE instruction; others, such as the MSP430, don’t have it.<br />

E. Passing events around: Event router<br />

When this event mechanism is coupled with peripherals that<br />

can produce and consume events using some programmable<br />

event connection matrix (‘event router’), a very powerful system<br />

emerges. In the case of Silabs EFM32 series the mechanism is<br />

referred to as the Peripheral Reflex System; Nordic calls it PPI (Programmable Peripheral Interconnect). The MSP430 has something a bit simpler than the other<br />

two.<br />

This mechanism allows quite complex interactions between peripherals to take place without CPU involvement. This allows the CPU to go into a deeper sleep mode and save more energy.<br />

As an example we can configure a system to do the following<br />

without any CPU interaction: On a rising edge on a given I/O<br />

pin an ADC conversion is started. The conversion done event<br />

triggers the DMA to read the conversion result and store it into<br />

memory, incrementing the memory address after each store.<br />

After 100 conversions the DMA transfer is done, generating an<br />

event to the CPU to start a new acquisition series and to process<br />

the buffered data.<br />

F. Controlling power modes<br />

The latest ULP processors have a special hardware block to<br />

manage energy modes and transitions between them in the<br />

system, combined with managing clocks and power gating<br />

peripherals in certain energy modes: The Energy Control Unit in<br />

EFM32, or Power Management Module for MSP430 for<br />

instance. They can save a lot of time otherwise required to<br />

program many registers when going to or coming out of sleep.<br />

They can also manage retaining peripheral register content at<br />

retention voltage (lower than operational voltage), such that the<br />

peripheral can immediately resume operation when power is<br />

restored. This hardware mechanism is called State Retention<br />

Power Gating.<br />

The main program is now:<br />

setup_hw_for_event_generation();<br />
configure_sleep(); /* this is the extra step */<br />
start_DMA();<br />
__WFE(); /* CPU sleeps, low power mode */<br />

Using a deeper sleep can make a difference of more than a<br />

factor of a thousand!<br />

We have just seen what stepwise refinements we can<br />

implement to reduce energy consumption. Each step can be<br />

implemented as a logical successor to the previous one.<br />

V. WHAT TO LOOK FOR WHEN SELECTING AN MCU<br />

There are a number of parameters that one can look at and<br />

compare to select the best MCU for the application at hand. Here<br />

is one set of parameters:<br />

1) What is the active current (µA/MHz), and at what voltage<br />

2) What is the performance of the CPU (CoreMark/MHz)<br />

3) What is the sleep current in each of the low power modes<br />

intended to be used<br />

4) What is the wake-up time from each of these low power<br />

modes.<br />

5) What is the power consumption of each of the<br />

peripherals used<br />

6) What peripherals are available in which low power<br />

modes<br />



7) Can peripherals operate autonomously (e.g. be<br />

controlled by a DMA engine)<br />

8) Is there a hardware event mechanism to orchestrate<br />

hardware-based event production and consumption<br />

9) Do the available low power modes fit well with the<br />

application<br />

10) Are the peripherals designed for ultra low power<br />

operation (e.g. Low Energy UART, Low Power Timer)<br />

11) Can sensors be operated with low energy consumption<br />

(e.g. Low Energy sensor interfaces)<br />

12) Are there “on-demand oscillators”<br />

The answers to these questions serve as a guide to an informed<br />

selection of the MCU type to use for best performance for the<br />

given application. They can be used as input for a power<br />

model of the application and, together with a battery model, can help predict the battery/charge lifetime for the application.<br />

VI. WHAT ELSE CAN ONE DO?<br />

There are still many more factors that can all play a role in the overall energy consumption. These are factors not obvious to many people, such as:<br />

• Regulator efficiency<br />
• Switching sensors off when not in use: prepare your hardware to be able to do so<br />
• Clocks: how to set them for lowest energy consumption<br />
• Voltages: lower is better, and the fewer the better<br />
• Compiler: can make a 50 % difference<br />
• Compiler settings: can make a 50 % difference<br />
• Where to locate critical code / data<br />
• How to measure the consumption<br />
• I/O pin settings<br />
• Battery properties in relation to the energy consumption profile<br />
• Look for possibilities to make use of energy harvesting to prolong battery lifetime<br />

During the workshop many of these issues and others will be<br />

addressed and illustrated through hands-on sessions.<br />

VII. CONCLUSIONS<br />

Ultra-Low Power is a system thing. Hardware alone or<br />

software alone cannot achieve the lowest consumption.<br />

We have shown a stepwise approach to reducing energy<br />

consumption.<br />

In order to realize the maximum energy reduction one has to<br />

understand the details of the hardware and write the software to<br />

use available features.<br />

Energy savings can be found in unexpected places.<br />

It is possible to reduce consumption by more than a factor of a thousand in certain scenarios.<br />

ACKNOWLEDGMENT<br />

The author wishes to thank Altran for giving him the<br />

opportunity to investigate this subject matter and his colleagues<br />

for helpful feedback during the development of the workshop<br />

and for reviewing related publications [1].<br />

REFERENCES<br />

[1] H. Roebbers, “Hoe spaar je energie in een embedded systeem?,” Bits &<br />

Chips 08, pp. 34-39, October 2015.<br />



Understanding Power Management and Processor<br />

Performance Determinism<br />

Ben Boehman<br />

Enterprise, Embedded, and Semi-Custom Business<br />

Advanced Micro Devices, Inc.<br />

Austin, TX USA<br />

Abstract—High-performance embedded systems crave the<br />

processing power of modern x86 processors, but current hardware<br />

architectures consistently prioritize peak performance over<br />

deterministic behavior. Advanced power management methods<br />

exploit inherent part-to-part variations, boosting core frequencies<br />

in unpredictable ways. Adding to this, PC architectures tend to<br />

target specific processor power constraints that can artificially<br />

clamp operating frequencies to maintain thermal and electrical<br />

specs. This creates scenarios where the power-density of the<br />

workload defines the effective operating frequency of the CPU,<br />

further reducing predictability. Real-time operating systems help address determinism in the software domain, but they cannot address it at the hardware level. Once these hardware<br />

implications are understood, designers will know what to look for<br />

when choosing processors for embedded systems where<br />

performance determinism is an important factor. Discover<br />

methods to disable features of modern processors that reduce<br />

hardware determinism.<br />

Keywords – determinism; deterministic; x86; performance;<br />

power; management; real-time; AMD;<br />

I. INTRODUCTION<br />

Embedded system applications span a tremendous range of<br />

uses and some of these devices become mission critical<br />

equipment where performance behavior must be highly<br />

predictable. Embedded system designers in these markets are<br />

familiar with the use of real-time operating systems to improve<br />

determinism at the software level, but variations introduced by<br />

hardware are often overlooked. For this work, hardware<br />

determinism is defined as a guaranteed, predictable response<br />

time to an event, assuming a fixed sequence of code and input<br />

stimuli. Deterministic systems can replicate that predictability<br />

across all units. The increased demand for high-performance<br />

embedded systems has also driven a trend toward usage of PC-compatible x86 processors from desktop and notebook product<br />

lines, though their power and performance architectures are not<br />

designed with determinism in mind. Even product variants<br />

targeted at embedded markets tend to retain the favoritism<br />

toward performance prevalent in the PC models. Power<br />

management behavior in leading x86 processors has consistently<br />

striven to squeeze the last drop of performance out of every<br />

device, including exploitation of inherent part-to-part variations.<br />

This paper will review the source of these variations, discuss<br />

common power management behaviors that exploit them, and<br />

review methods of mitigation. Focus will be on common<br />

desktop, notebook, and embedded processors in the 6-65W<br />

power range and may not be reflective of x86 server processors.<br />

II. SILICON BASICS<br />

a. DEFINING OPERATIONAL LIMITS<br />

Before power management behaviors can be discussed, it is<br />

important to understand the fundamental limitations of silicon<br />

integrated circuits. In fact, the primary purpose for power<br />

management in such devices is to ensure these limitations are<br />

not exceeded so that device reliability and functionality are<br />

maintained. There are many factors that affect silicon-based<br />

transistor performance, but the focus here is to briefly<br />

familiarize readers with the most significant factors affecting<br />

x86 processors in their typical operating ranges.<br />

Processor frequency is possibly the most obvious of<br />

performance limiting factors. Even consumers have become<br />

quite familiar with equating frequency to performance.<br />

Frequency defines how fast the logic of the device is clocked,<br />

and thus how fast instructions are executed. Performance will<br />

not be equivalent when comparing two processors of equivalent<br />

frequency and different architecture, but it is generally true that<br />

increasing frequency will increase execution performance.<br />

Frequency in a processor can be limited by several underlying<br />

factors, but the most basic are voltage and current. Those<br />

familiar with transistor mechanics know that voltage has a key<br />

relationship to frequency. Faster switching of the transistors<br />

requires increasing voltage to overcome the resistive and<br />

capacitive elements of the transistor. However, higher voltage<br />

increases ageing effects (Gielen, 2013), putting practical limits<br />

on voltage application to ensure product longevity. Faster<br />

switching of transistors also generates higher currents as those<br />

capacitive elements are charged and discharged. While<br />

individual transistor currents may be very small, modern<br />

processors can have several billion transistors (Cutress, 2017),<br />

so this current adds up quickly. The processor die is typically<br />

mounted on a package of some kind and there are also real,<br />

practical limitations to how much current can be delivered to the<br />

die effectively. Every digital IC must deal with the balancing act<br />

of transistor voltage and current to yield a useful frequency.<br />

The combination of Ohm's and Joule's laws teaches us that all<br />

this voltage and current generates power, and that both<br />

parameters have a direct relationship with power. In fact, most processor frequency limitations also boil<br />



down to power or current limits. Faster switching of transistors<br />

increases current and may also require increasing voltage, and<br />

doing either will increase power. Integrated circuits of every<br />

kind must provide designers with a maximum power<br />

consumption limit so that systems can be adequately designed to<br />

handle the current and cooling requirements. Power limits are<br />

often the most significant performance limiting factor,<br />

especially at the lower end of a device family’s power range.<br />

Modern processors based on the x86 architecture tend to be<br />

power limited rather than frequency limited with heavy<br />

workloads. The reasons will be discussed in later sections.<br />

Die temperature is a simple factor to consider, though not the<br />

most obvious. As the processor operates, consumed power is<br />

converted to heat. Heat affects transistor operating<br />

characteristics, as well as the rate of diffusion of the doping<br />

elements in the silicon that form the transistor junctions.<br />

Eventually, diffusion will change the electrical properties of the<br />

transistors until they fail to operate correctly and the processor<br />

will reach the end of its life. Limiting junction temperature in<br />

the device is critical for maintaining its expected longevity.<br />

Manufacturers will set maximum die temperatures for their<br />

products that must be followed. Maintaining this temperature<br />

limit is an important task for the power management entity in the<br />

processor.<br />

b. LEAKAGE POWER<br />

Another basic principle of silicon transistors is that they leak<br />

current across junctions and to the substrate (Kaushik, 2003).<br />

The amount of leakage current in a processor of a particular<br />

process type will vary largely by applied voltage and<br />

temperature, and it can become quite significant in today's high-performance processors. This is because the same factors that<br />

are required to make transistors switch faster (i.e., achieve<br />

higher frequency) also increase leakage. All this leakage current<br />

creates additional power that must be counted as part of the<br />

device’s total power consumption. Naturally, leakage power<br />

effectively reduces the amount of the device’s total power<br />

envelope that can be consumed as active power (i.e., power used<br />

in transistor switching that does work). Figure 1 below shows<br />

the leakage power distribution for a current, undisclosed AMD<br />

processor based on a 14nm FinFET process as a percentage of<br />

total processor power.<br />

Figure 1 - Leakage power distribution for an undisclosed AMD product based on a 14nm FinFET process.<br />

Leakage power is exponentially related to die temperature,<br />

often doubling several times over the operating temperature of<br />

an integrated circuit (Wolpert & Ampadu, 2012). This means<br />

that device power will increase as the device temperature rises,<br />

even if the rest of the operating scenario is unchanged (i.e., fixed<br />

clock frequency, voltage, and workload). CPU manufacturers<br />

must either leave enough headroom to accommodate this<br />

potential increase in power over the temperature range, or have<br />

a power management scheme that is dynamic with device<br />

temperature. Figure 2 below shows how leakage power is<br />

affected by temperature in that same AMD processor family.<br />

Figure 2 - Leakage power over temperature for a typical sample of an undisclosed AMD product based on a 14nm FinFET process.<br />

c. PART-TO-PART VARIATIONS<br />

The silicon photolithography process used to create<br />

semiconductors has inherent imperfections that manifest as<br />

variations in transistor construction and thus affect their<br />

operational characteristics. These variations not only exist<br />

between batches of silicon wafers, but even across a single<br />

wafer. Such variations may require die in one area of the wafer<br />

to have a higher voltage to achieve the same frequency than its<br />

neighbors, or cause its leakage power to be greater. Figure 1<br />

illustrates leakage power variations quite well. Since power is a<br />



key factor in determining achievable performance for a given<br />

device, performance variations follow suit.<br />

Processor manufacturers sort these die into groups targeting<br />

various product models with different specifications (e.g., 25W<br />

vs. 35W) to maximize yield. The amount of variation possible<br />

across units is defined by the specific model, and lower cost<br />

models will tend to allow wider variance. It is important to<br />

understand why these variations exist before discussing how<br />

power management exploits them.<br />
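A minimal sketch of the binning idea described above; the leakage thresholds and model names are hypothetical, not actual AMD or Intel binning criteria.

```python
# Hypothetical sketch: sort die into product bins by measured leakage power.
# Thresholds and model names are invented for illustration only.
def bin_die(leakage_w):
    if leakage_w <= 3.0:
        return "25W model"   # low-leakage die fit the tighter power budget
    if leakage_w <= 5.0:
        return "35W model"
    return "reject"

wafer = [2.1, 2.9, 3.4, 4.8, 5.6]        # leakage samples across one wafer
bins = [bin_die(w) for w in wafer]
```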

d. WORKLOAD POWER DENSITY<br />

Understanding power management behavior in complex<br />

microprocessors also requires understanding the concept of<br />

workload power density. This concept essentially means that<br />

different workloads (i.e., executed instruction sequences) will<br />

generate different amounts of power consumption in the<br />

processor, even at the same utilization level. This is to say that<br />

the central processing unit (CPU) core power incurred by two<br />

workloads can be significantly different even if the core is 100%<br />

utilized (i.e., consistently busy executing instructions) in both<br />

cases. This situation can occur because different instructions<br />

stimulate different amounts of transistor logic inside the core.<br />

As an example, one can imagine that a complex floating-point<br />

calculation will trigger more transistor activity in the CPU than<br />

a simple data movement operation. Data movement from one<br />

CPU general purpose register to another involves a minor<br />

number of gates while a complex AVX or SSE instruction to<br />

perform a multiply accumulate operation at 256 bits wide may<br />

activate many thousands of gates. Workloads may repeat such<br />

operations as part of an algorithm, compounding the power<br />

consumption increase. The potential difference in power<br />

between workloads becomes even larger when considering that<br />

nearly all x86 microprocessors sold today are multi-core, and<br />

most have integrated many other functions that were previously<br />

external. Integration of the graphics processing unit (GPU) is<br />

the most significant, as it is a very large processing core on its<br />

own. As consumer use-cases have become increasingly<br />

graphical, the GPU in some x86 processors can be even larger<br />

(i.e., more transistors) than the CPU cores. This is especially<br />

true for companies like AMD, who specifically target high<br />

performance integrated graphics in their microprocessors.<br />

Mixed workloads that execute a combination of CPU and GPU<br />

instructions simultaneously can experience the effects of<br />

workload power density differences on both core types.<br />

Allocation of the power budget to these various cores is one<br />

challenge of processor power management that will be explored<br />

further in later sections.<br />

To illustrate the difference in workload power density,<br />

power consumption was measured with two different CPU-only<br />

workloads on a random sample of an AMD embedded RX-<br />

421BD SoC based on the “Excavator” CPU core. Both<br />

workloads can saturate a single CPU core while sustaining max<br />

frequency, so utilization will stay at 100% for the core under test.<br />

The Prime95 workload represents an extreme case (often<br />

referred to as a “thermal virus”), and power values have been<br />

normalized to that level.<br />

Figure 3 - CPU core power, normalized to Prime95. Workloads: Prime 95 v29.3 b1 Large FFT; Microsoft SysInternals CPU Stress v1.0<br />

The data in figure 3 show that the power consumption of the<br />

less power-dense workload was only 57% of Prime 95 with a<br />

single CPU core active. When extrapolated across multiple<br />

physical cores, it is easy to see that power variation by workload<br />

can grow quite large. In this test case, the CPU was able to<br />

maintain maximum frequency (i.e., 3.5GHz) on the active core<br />

without reaching power or current throttling, so no frequency<br />

reduction was required.<br />
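The single-core measurements above can be extrapolated across cores to show how workload power density interacts with a power budget. Only the 0.57 ratio comes from Figure 3; the 8 W per-core figure and the 25 W budget below are invented for illustration.

```python
# Extrapolate package CPU power across active cores from a normalized
# single-core measurement, then compare against a hypothetical budget.
def package_cpu_power(per_core_norm, n_cores, core_max_w=8.0):
    # core_max_w: invented figure for one core running Prime95 at max frequency
    return per_core_norm * core_max_w * n_cores

budget_w = 25.0                            # hypothetical CPU power budget
prime95_4c = package_cpu_power(1.00, 4)    # thermal-virus class workload
stress_4c  = package_cpu_power(0.57, 4)    # lighter workload from Figure 3
```

With four cores active, the invented budget is exceeded by the thermal-virus workload but not by the lighter one, so only the former would force frequency throttling.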

The power density of GPU workloads can be compared in<br />

the same way. The graph below compares a simple 3D workload<br />

from the Microsoft DirectX 9 SDK (“blobs”) to Furmark, an<br />

extreme GPU workload falling in the thermal virus class. GPU<br />

frequency was artificially limited to 720MHz to avoid power<br />

limit throttling and expose the full potential power consumption<br />

difference. A comparison of the RX-421BD processor power<br />

for both workloads is shown in Figure 4.<br />

[Chart residue removed. Bar charts: "Workload Dependent CPU Power Consumption (1 Core)" (Prime 95 vs. CPU Stress, normalized to Prime95) and "Workload Dependent GPU Power Consumption" (Furmark vs. Blobs, normalized to Furmark); vertical axes 0%–120%.]<br />

Figure 4 - GPU power, normalized to Furmark. Workloads: Furmark v1.18.2.0; Microsoft DirectX 9 SDK "Blobs"<br />

The GPU power data shows the Blobs application consumed<br />

only 82% of the power of Furmark, confirming a difference in<br />

power density. It is also worth noting that the increase in power<br />

dissipation with the heavier workload will raise die temperature<br />

in a given system environment. The higher temperature will<br />

increase leakage power, adding to the power difference. Truly<br />



comparing the power difference caused only by the workload<br />

would require tight control of the die temperature, which was not<br />

attempted in this test. However, the few degrees of difference<br />

observed here do not significantly affect the results.<br />

III. PROCESSOR POWER MANAGEMENT<br />

Previous sections establish the key observation that power is<br />

inextricably linked to temperature, frequency and<br />

voltage/current. Power management in modern processors is all<br />

about controlling these parameters to control power<br />

consumption, while maximizing workload performance.<br />

Current processors from AMD and Intel contain dedicated<br />

microcontrollers that are independent of the x86 processor cores<br />

to administer power management. The firmware in the<br />

microcontroller is tailored in some ways for the product’s<br />

intended use case. For example, mobile products will be more<br />

aggressive in the use of power saving features like clock and<br />

power gating in the interest of improving battery life. Desktop<br />

and server processors that are always wall powered will tend to<br />

favor performance and only save power when it has minimal<br />

impacts on performance.<br />

a. DEFINING POWER LIMITS<br />

Definition of the maximum power consumption is a common<br />

starting point when defining processor models. Manufacturers<br />

choose power levels to address various use-cases with differing<br />

power restrictions, and performance (i.e., frequency) is largely<br />

derived from that. X86 processors are largely marketed by their<br />

Thermal Design Power (TDP), even though it is a specification<br />

related to the thermal solution requirement and not a maximum<br />

electrical power that the device can consume. Maximum<br />

sustainable power levels will be equal to or greater than TDP,<br />

depending on the product. This paper will focus on the<br />

maximum sustained power of the processor when discussing it<br />

as a limit.<br />

b. SCENARIO DEFINED PERFORMANCE<br />

The power management controller of the processor monitors<br />

key parameters to ensure the processor specifications for<br />

maximum power, current and temperature are not exceeded. If<br />

changes in the operating scenario cause any one parameter to<br />

approach its limit, the controller must throttle the processor’s<br />

performance to compensate. This throttling usually takes the<br />

form of reducing operating frequency of the core(s) consuming<br />

the largest amounts of power (i.e., CPU and GPU), as they have<br />

the biggest impact. Reducing frequency often allows voltage<br />

reduction for additional power savings. Reductions in power<br />

consumption will reduce temperature and current, helping the<br />

processor to stay within these specifications. These<br />

adjustments can happen as often as every millisecond for very<br />

quick response to changes in the operating environment or even<br />

the workload (Howse, 2015). Previously, x86 processors<br />

moved between discrete “performance states” (specific<br />

combinations of voltage and frequency at which cores can<br />

reliably operate) that differed by hundreds of megahertz and<br />

required suspension of execution during transitions. Newer<br />

Intel 7 th Generation Core Processors and AMD Ryzen<br />

Processor architectures allow much more granular frequency<br />

changes for better efficiency and, at least in the Ryzen case,<br />

uninterrupted execution.<br />
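A minimal sketch of the throttling behavior described above, evaluated once per control tick; the P-state table and limits are hypothetical, not a vendor's actual tables.

```python
# Minimal sketch of one power-management control step: each ~1 ms tick,
# compare telemetry against limits and step the P-state down (or back up).
# P-state table (frequency MHz, voltage V) and limits are invented.
PSTATES = [(3500, 1.20), (3000, 1.10), (2500, 1.00), (2000, 0.90)]

def next_pstate(idx, power_w, temp_c, power_limit_w=15.0, temp_limit_c=90.0):
    over = power_w >= power_limit_w or temp_c >= temp_limit_c
    if over and idx < len(PSTATES) - 1:
        return idx + 1            # throttle: lower frequency (and voltage)
    if not over and idx > 0:
        return idx - 1            # headroom available: restore frequency
    return idx
```

Because each lower P-state carries a lower voltage, stepping down reduces both frequency-proportional and voltage-squared components of dynamic power, which is why throttling recovers power margin quickly.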

Since power consumption varies with the workload, one can<br />

recognize why achieving maximum frequency of a core may<br />

not always be possible. What if a very power dense workload<br />

is run on a CPU core at maximum frequency and causes the<br />

device to exceed its power limit? What if that workload is then<br />

run on multiple cores further exceeding the limit? What if a<br />

graphics workload is suddenly introduced on the integrated<br />

GPU simultaneously? In these cases, the power management<br />

controller has no choice but to throttle frequencies to maintain<br />

power and current limits. Many system designers erroneously<br />

assume that processor manufacturers configure their products<br />

to ensure that cores can sustain maximum frequency for any<br />

workload in all configurations. This is definitely not the case.<br />

Doing so would require these vendors to continuously search<br />

out the worst-case (i.e., most power dense) workload in<br />

existence, characterize the power usage on their architecture,<br />

and set the product’s maximum frequency low enough to<br />

accommodate it safely across all units of that model (including<br />

their part-to-part variations). This “fixed frequency” model is<br />

no longer used by most x86 processors in the PC and embedded<br />

spaces. Ignoring the fact that the worst-case workload could<br />

keep changing over time, the reality is that defining the max<br />

frequency in this way would be extremely limiting and easily<br />

reduce the operating frequency of a CPU core to a fraction of<br />

its potential because of the wide variance in workload power<br />

density. The consequence would be that lighter workloads with<br />

less power density would also be limited to this reduced<br />

frequency, even if it would have been safe to execute them<br />

much faster. An artificial performance limitation would be<br />

created to guarantee a predictable maximum frequency that is<br />

achievable for all workloads. A better approach for general-purpose<br />

processors is to define the max frequency by silicon<br />

capability and allow the power management controller to<br />

dynamically provide the best performance possible for the<br />

specific operating scenario in real-time.<br />

Designers should remember that the operating scenario not<br />

only includes the workload (i.e., the exact instruction sequences<br />

running on processor cores) but also its timing and usage of<br />

integrated peripheral functions and I/O. With high levels of<br />

integration in modern processors, I/O power cannot be ignored<br />

(in this instance, logic power for the I/O interfaces will be put<br />

in the same category as the power used by the physical I/O<br />

pins). Interfaces like system memory, Serial-ATA, Ethernet,<br />

PCI Express, audio, and USB are commonly integrated and they<br />

all consume power. I/O power is largely dependent on the<br />

system configuration and usage model. For example, a network<br />

gateway device may not implement any SATA devices, while a<br />

network attached storage (NAS) system may have many. The<br />

NAS unit use-case will involve lots of ethernet activity<br />

(increasing power used in that logic), while a machine<br />

controller may have very little. The portion of the total power<br />

envelope consumed by I/O can’t be used by compute cores, so<br />

changes in configuration or usage model can impact achievable<br />

core performance when processors are power limited.<br />



Including the system configuration and I/O usage model in the<br />

workload definition is key when attempting to improve<br />

performance determinism.<br />
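The I/O budgeting argument above reduces to a simple subtraction; the per-interface power figures below are invented placeholders, not datasheet values.

```python
# Sketch: power available to compute cores is the package limit minus what
# the configured I/O consumes.  Per-interface figures are hypothetical.
IO_POWER_W = {"sata_port": 0.5, "gbe_port": 0.7, "usb3_port": 0.3}

def core_budget(package_limit_w, config):
    io_w = sum(IO_POWER_W[name] * count for name, count in config.items())
    return package_limit_w - io_w

nas     = core_budget(15.0, {"sata_port": 4, "gbe_port": 2})   # storage box
gateway = core_budget(15.0, {"gbe_port": 4})                   # network box
```

Even with identical processors and package limits, the two invented configurations leave different amounts of power for the cores, which is exactly why achievable core performance shifts with system configuration.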

c. EXPLOITING DEVICE VARIATIONS<br />

The natural result for the power limited (versus fixed<br />

frequency) model is that performance is maximized for each<br />

workload, but frequency is not predictable with workload<br />

changes. In any scenario where the workload triggers<br />

temperature or power-limit throttling, performance can be<br />

degraded relative to the fixed-frequency model. System designers<br />

can avoid temperature throttling by developing enough<br />

headroom into the thermal solution to ensure maximum<br />

temperature is never reached. After all, the maximum sustained<br />

power level is a known quantity and airflow and ambient<br />

temperature limits can be specified for the final system. Power<br />

throttling is a more difficult challenge due to the part-to-part<br />

variations discussed earlier that affect power consumption.<br />

Two samples of the same processor model could have<br />

differences in their leakage power, causing one unit to reach its<br />

power limit at a lower average frequency even when running an<br />

identical workload under identical operating conditions.<br />

Vendors happily exploit this difference by allowing the lower<br />

leakage units to spend more time at higher frequency, yielding<br />

better performance. Earlier discussions of voltage<br />

dependencies reveal why different processor units of the same<br />

model can also have different voltage requirements to achieve<br />

a given clock frequency. This difference can be exploited by<br />

fusing unit-specific voltage vs. frequency curves into each part<br />

that enable the power management controller to minimize core<br />

voltage. Reductions like this to active power allow those units<br />

to further increase average frequencies before reaching power<br />

limits. Fortunately, lower leakage devices tend to also require<br />

higher voltages to reach the same frequency as a higher leakage<br />

device, so these two factors work to cancel each other out rather<br />

than compound. Despite this, material differences in the<br />

consumed power can remain.<br />

Many real-world PC use-cases have been found to be<br />

bursty, where applications often sit idle waiting for user input<br />

and then perform some activity before waiting again. This<br />

could be a user starting a program or loading a new web page.<br />

Periods of inactivity will naturally coincide with low power and<br />

lower die temperature. Some processors take advantage of this<br />

situation by defining a maximum power limit that is greater<br />

than the sustained power limit. The processor can be allowed<br />

to reach this higher power consumption for a short amount of<br />

time that is “thermally insignificant”. Thermal solutions have<br />

a relatively large thermal inertia, meaning it takes a while for<br />

the processor to raise its temperature to a steady-state value.<br />

Increasing the power limit in this way allows for short periods<br />

of increased performance benefiting bursty workloads, but at<br />

the cost of performance determinism. The operating<br />

environment now has another mechanism by which to affect<br />

performance, and workloads may have to run for several<br />

minutes to reach a steady state behavior.<br />
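One way such a "thermally insignificant" burst allowance could work is a running-average limit, similar in spirit to the running-average power limits used on x86 parts. The algorithm and all constants below are an illustrative sketch, not a vendor's actual implementation.

```python
# Sketch of a burst budget: instantaneous power may exceed the sustained
# limit as long as an exponentially weighted running average stays below
# it.  Limits and the smoothing factor are invented for illustration.
def allows_burst(samples_w, sustained_w=15.0, burst_w=25.0, alpha=0.2):
    avg = samples_w[0]
    for p in samples_w:
        if p > burst_w:
            return False          # hard instantaneous cap always enforced
        avg = alpha * p + (1 - alpha) * avg
        if avg > sustained_w:
            return False          # average has exhausted the thermal budget
    return True
```

A short burst above the sustained limit passes because the average barely moves, while a sustained excursion fails almost immediately — mirroring the bursty-workload benefit and the loss of determinism described above.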

d. REGULATOR TELEMETRY<br />

Since processor performance limitations boil down to power<br />

in so many ways, accurately determining power consumption is<br />

critical to maximizing performance. Measuring processor<br />

power in real time requires accurate current sensing, which is not<br />

practical for implementation on high-speed digital process<br />

technologies. Until recently, processor power management<br />

technology relied on power curves derived from actual power<br />

measurements at manufacturing test time with a reference<br />

workload. Values were programmed into the processor and<br />

combined with run-time data from complex activity monitors in<br />

the logic. Management algorithms calculated power usage to<br />

ensure power limit adherence. This method allowed some<br />

exploitation of part-to-part variations but still required moderate<br />

guard-banding due to the inaccuracy of the activity monitors’<br />

power estimates. Conservative estimations of power<br />

consumption leave performance headroom untapped. A recent<br />

change seen with AMD processors is use of power telemetry<br />

data from the regulators powering the primary voltage rails.<br />

Real-time voltage and current data allows the power<br />

management unit to be much more accurate in its total power<br />

calculation. Doing so enables every variation of the unit that<br />

affects power consumption to be factored in along with<br />

instantaneous environmental circumstances (i.e., temperature)<br />

and exploited for performance gain. Naturally, maximizing<br />

performance in this way increases non-determinism across units.<br />
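The difference between the two accounting methods can be sketched as follows; the rail values, event weights, and 15% guard-band are all invented for illustration.

```python
# Sketch: total power computed from regulator telemetry (voltage, current
# per rail) versus an activity-monitor estimate that needs a guard-band.
def telemetry_power(rails):
    """rails: list of (volts, amps) samples reported by the regulators."""
    return sum(v * i for v, i in rails)

def estimated_power(activity_counts, w_per_count, guard_band=1.15):
    # activity-based estimate is inflated by a guard-band (hypothetical 15%)
    raw = sum(c * w for c, w in zip(activity_counts, w_per_count))
    return raw * guard_band

telem = telemetry_power([(1.2, 8.0), (1.0, 3.0)])   # measured: 12.6 W
est   = estimated_power([100, 40], [0.09, 0.05])    # estimated with margin
```

The guard-banded estimate reports more power than is actually consumed, and that gap is exactly the performance headroom that regulator telemetry recovers.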

IV. REDUCING EFFECTS OF POWER MANAGEMENT ON PERFORMANCE DETERMINISM<br />

Maximization of performance at the cost of determinism<br />

works well for consumer use-cases where the user does not rely<br />

on repeatable performance across multiple systems. Enterprise<br />

and embedded systems can be quite different and may not be<br />

able to tolerate performance variations across units. However,<br />

it is important to differentiate the need for a minimum performance<br />

versus true performance determinism. For digital signage or<br />

casino gaming machine examples, a minimum performance<br />

need likely applies. Functionality and user satisfaction are not<br />

affected if the frame processing time varies slightly across units<br />

as long as it is fast enough to meet the level of the content (e.g.,<br />

60fps) in all cases. Units that complete work faster may simply<br />

spend more time idle between frames, which would be<br />

unobservable to the user. Special cases like industrial machine<br />

controllers or military applications may require truly repeatable<br />

performance due to sensitive timing interactions. Even some<br />

datacenters desire such repeatability so that job execution time<br />

can be predicted regardless of which system it is scheduled on.<br />

It should be clarified that true hardware determinism is not<br />

possible with modern x86 CPU architectures. Small timing<br />

variations can exist because of interactions between hardware<br />

and software, and hardware interrupts can occur with<br />

unpredictable timing. Some amount of variation will always<br />

exist, but there are ways to improve the situation, particularly for<br />

variations caused by power management. One thing that system<br />

designers can count on is that improving performance<br />

determinism will come with a cost to peak performance.<br />

a. MINIMUM PERFORMANCE LEVEL<br />

Ensuring a minimum performance level begins with testing<br />

in a worst-case environment. The specific workload of interest<br />



must be run on a worst-case processor sample operated at<br />

maximum temperature. The frequency behavior of the<br />

processor and the resulting performance of the workload should<br />

represent the lowest level of any sample in the distribution. If<br />

the performance is still acceptable, then the processor model<br />

choice is sufficient. System designers can have confidence that<br />

all samples of the chosen model will perform at this level or<br />

better. Of course, changing processor models or modifying the<br />

workload means testing must be repeated. Unfortunately,<br />

worst-case samples are rare and processor vendors don’t<br />

usually supply them upon request. Holding the processor die<br />

under tight temperature control while running an active<br />

workload can also be difficult, and usually requires specialty<br />

thermal equipment like thermal stream blowers or oil baths.<br />

Many embedded designers will need an alternative method to<br />

ensure their minimum performance level.<br />

Most x86 processors sold today specify a base and boost<br />

frequency for CPU cores. A few models even do the same for<br />

integrated GPUs. A good rule of thumb has been that base<br />

frequency should be sustainable for all processor samples, but<br />

designers must understand when this can be broken. Processor<br />

vendors generally do intend for base frequency to be sustainable<br />

on all cores of a CPU under “real world” workloads.<br />

This distinction matters because the gap in worst-case power<br />

density between real and synthetic applications can be very large. The example<br />

provided earlier used two synthetic workloads for uniformity of<br />

power usage but it still illustrates the point. Real applications<br />

tend to have a mixture of compute, memory, and I/O operations<br />

while synthetic “power viruses” can intentionally loop on small,<br />

power dense instruction sequences. As previously discussed,<br />

defining base frequency with the most power dense workload<br />

available would be extremely conservative and would<br />

artificially limit performance of more typical workloads. The<br />

catch is that definition of “real-world” is subjective and varies<br />

by vendor and product. Vendors will choose reference<br />

workloads to represent the worst-case of the real-world, and then<br />

use it to define base frequency of the product. If the reference<br />

workload is known, both it and the custom workload can be<br />

compared on a random sample. If the power density of the<br />

custom workload is less than the reference workload, then it<br />

should be able to sustain base frequency across all units of the<br />

model distribution. To get an accurate measurement, testing<br />

should be performed on a specific sample/system in a fixed<br />

configuration and environmental conditions. CPU core<br />

frequency boost should be disabled to prevent reaching<br />

temperature or power/current limits (boost disable is commonly<br />

available in system BIOS 1 firmware options of most x86<br />

platforms). If either workload can reach these infrastructure<br />

limits then results will be skewed. Both AMD and Intel provide<br />

tools to log processor power, but they do not publicly disclose<br />

reference workloads. Such information must be obtained under<br />

non-disclosure agreements. If comparison is successful and the<br />

custom workload’s power density is less than the reference<br />

workload, then the performance of the boost-disabled scenario<br />

should be achievable across all units. Capping frequency in this<br />

way does reduce peak performance, but that is the sacrifice<br />

required for consistency.<br />
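The comparison workflow described above reduces to a simple check against the reference workload's measured power; the 5% margin below is an arbitrary example, not vendor guidance.

```python
# Sketch of the decision step: with boost disabled on one test unit,
# measure package power for the custom and reference workloads.  Base
# frequency should be sustainable if the custom workload draws no more
# power, minus a small margin (invented 5%) for measurement noise.
def base_freq_sustainable(custom_w, reference_w, margin=0.05):
    return custom_w <= reference_w * (1.0 - margin)

verdict = base_freq_sustainable(11.0, 13.0)   # hypothetical measurements
```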

1 Basic Input / Output System<br />

If characterization of the custom workload reveals it is more<br />

power dense than the reference workload, then further frequency<br />

reduction is required to ensure minimum performance across<br />

units. In addition to disabling boost states, frequency should be<br />

reduced until power/current measurements are below those of<br />

the reference workload on the test unit. Once a suitable<br />

frequency limit is determined, performance can be evaluated for<br />

acceptability. A challenge with this case is that setting a CPU<br />

frequency limit below base frequency can require more invasive<br />

software modification. On Linux, a userspace daemon (e.g.,<br />

cpufreqd) or the kernel’s cpufreq interface can hold cores at a<br />

specific P-state, and this mechanism can be used to limit CPU<br />

frequency with some processor architectures. For Windows<br />

operating systems, custom modifications must be made to the<br />

ACPI 2 PSS table in BIOS that communicates supported CPU P-<br />

states to the operating system (OS) (Unified Extensible<br />

Firmware Interface Forum, 2017). Higher, unwanted frequency<br />

states can be removed, and the table rebuilt. The same method<br />

can be used for other operating systems that support the ACPI<br />

_PSS table. Modification of the table takes significant BIOS<br />

expertise and access to source code. Once a P-state is identified<br />

that brings power density of the custom workload below that of<br />

the reference workload, performance can again be evaluated for<br />

acceptance. It should be noted that this method of comparing<br />

power density is less exact than the ideal method of using a true<br />

worst-case sample. Including some reasonable margin into the<br />

operating point is wise.<br />
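A sketch of the table-editing step described above, using a simplified two-field stand-in for real _PSS packages (which carry more fields, e.g., latency, control, and status values):

```python
# Sketch: remove _PSS entries above a chosen frequency cap, as a BIOS
# developer would when rebuilding the table.  Entries are simplified to
# (core_frequency_mhz, power_mw); real _PSS packages carry more fields.
def cap_pss(pss_table, max_mhz):
    capped = [entry for entry in pss_table if entry[0] <= max_mhz]
    if not capped:
        raise ValueError("cap would remove every P-state")
    return capped

pss = [(3500, 15000), (3000, 11000), (2500, 8000), (1600, 5000)]
limited = cap_pss(pss, 2500)   # drops the 3500 and 3000 MHz states
```

The OS then only ever sees the remaining states, so it cannot request a frequency above the cap — which is the point of rebuilding the table.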

Mixed workloads that highly utilize both CPU and GPU<br />

cores simultaneously complicate the ability to confirm if a<br />

custom workload is more power dense than the reference<br />

workload. If a GPU base frequency is defined at all, reference<br />

workloads for CPU and GPU are likely measured independently<br />

so there is not significant interaction. Using the workload power<br />

density comparison method would also require tools to provide<br />

power data separately for each core type, which processor<br />

vendors do not typically provide. Workloads of this type cannot<br />

establish a reliable minimum performance operating point<br />

without assistance from the processor manufacturer. In cases<br />

where GPU performance is not critical, its maximum frequency<br />

could be set to a very low value in the interest of limiting its<br />

contribution to processor power and possibly avoid<br />

power/current throttling of CPU cores (AMD integrated GPUs<br />

use a vendor-specific “PowerPlay” table in BIOS to define<br />

frequency states for the GPU, much like the ACPI PSS table for<br />

CPUs; they can be edited, but this approach is not universal to<br />

other vendors). However, without a way to quantify the power<br />

usage to a reference point, ensuring repeatable performance<br />

requires adding a large, arbitrary guard-band to core frequencies.<br />

There will always be some uncertainty about coverage for worst-case<br />

samples.<br />

b. DETERMINISTIC PERFORMANCE<br />

Any system that desires performance determinism from the<br />

processor will need to start disabling power management<br />

features to get as close as possible to the old fixed-frequency<br />

model. The first to go are those features that provide temporary<br />

performance improvements based on real-time environmental<br />

factors. Examples discussed earlier include temperature-based<br />

2 Advanced Configuration and Power Interface Specification<br />



boosting or time-based power excursions (e.g., AMD STAPM 3<br />

and sub-features of Intel DPTF 4 ). These features often don’t<br />

provide value anyway for embedded use-cases where a<br />

workload is run continuously. The ability to disable them may<br />

not be exposed in off-the-shelf embedded platforms, but most<br />

can be turned off by the system BIOS developer. Review the<br />

processor documentation thoroughly to understand which of<br />

these features can be disabled.<br />

Frequency boosting is also scenario driven and therefore<br />

must be disabled. Even if a frequency in the boost range is<br />

sustainable for the custom workload on a worst-case sample,<br />

current processors do not allow fixed frequency operation in this<br />

range. Operating systems are unaware of boost frequencies, so<br />

there are no OS-level mechanisms to set them. To fix CPU<br />

frequency, a value at or below base must be used. From there,<br />

evaluation of the workload power density compared to the<br />

reference workload can be used to determine if CPU frequency<br />

must be further reduced below base frequency. The method<br />

described earlier still applies.<br />

After the new maximum frequency has been established and<br />

set, states below this must also be eliminated to ensure fixed-frequency<br />

behavior of cores. Latency is increased when cores<br />

transition to low frequency during idle periods, reducing<br />

deterministic behavior. As described in the previous section, OS<br />

frequency governors can be used with Windows and Linux to<br />

set “performance” mode which will ensure hardware does not<br />

go below base frequency. This method has been verified on<br />

AMD Embedded R-series and G-series processors, as well as<br />

Intel 7 th Generation Core processors. Excursions below base<br />

frequency can still occur if triggered by thermal throttling, but<br />

proper design, as outlined here, can prevent it. Despite being<br />

effective for Windows and Linux, designers requiring<br />

determinism will likely be running a real-time OS. RTOSs with<br />

support for ACPI PSS tables can still use that method. Others<br />

with no CPU P-state management must rely on the platform<br />

BIOS to set the desired state before the OS handoff.<br />

If the workload is mixed for CPU and GPU, the same<br />

complications previously discussed apply. Extreme guard-banding<br />

could ensure fixed frequency operation across all units<br />

of a model distribution, but there is no reliable method to<br />

confirm that without manufacturer support. For applications<br />

that are willing to go the extra mile to secure deterministic<br />

performance, custom screening can be implemented where each<br />

sample is pre-tested to ensure operation within specific limits.<br />

Obviously, this kind of screening is very costly in both<br />

infrastructure and labor but has found use in military markets<br />

where cost sensitivity is low.<br />

V. VENDOR PROVIDED HARDWARE DETERMINISM<br />

AMD has recognized the demand for improved determinism in some enterprise and high-end embedded applications and has introduced dual operating modes in their EPYC line of enterprise processors to address differing needs. A “Power Determinism Mode” offers higher performance by taking advantage of many of the mechanisms described earlier in this paper, including part-to-part variations (Fruehe, 2017), though server processors are more conservative in this area. The processor will exploit some of these differences to reach a maximum (i.e., deterministic) power consumption for a given workload (at a given temperature) and thus maximize performance while maintaining infrastructure power limits.<br />

“Performance Determinism Mode” offers the unique ability to achieve the same performance with every processor of a given TDP. Creating repeatable performance requires part-specific power and frequency curve data to be fused into the device at production time. This data essentially provides a negative performance offset that can be used to make each individual unit replicate the performance of a worst-case unit of the entire model distribution. The power management controller also uses the predictable calculated-power method based on activity monitors instead of regulator telemetry data. Enablement of the feature is a simple compile-time option in the BIOS firmware.<br />

In performance determinism mode, part-to-part variations will only result in differences in power consumption for a given workload (at a given temperature) while performance (derived from frequency behavior) is minimally impacted and remains consistent. This type of reliable hardware determinism can only be provided by the processor manufacturer, and it ensures that only the minimum necessary performance sacrifice is made to achieve that determinism. The performance determinism mode certainly simplifies system architecture for designers looking for improved performance determinism for enterprise applications, and its existence is noteworthy given the topic of this paper. However, it is yet to be seen if this kind of feature will find its way into lower-power embedded processor products from AMD or Intel.<br />

3<br />

Skin Temperature Aware Power Management<br />

4<br />

Dynamic Power and Thermal Framework<br />

www.embedded-world.eu<br />



REFERENCES<br />

Cutress, I. (2017, February 22). AMD Launches Zen. Retrieved from Anandtech.com: http://www.anandtech.com/show/11143/amd-launch-ryzen-52-more-ipc-eight-cores-for-under-330-preorder-today-on-sale-march-2nd<br />

Fruehe, J. (2017). Power / Performance Determinism. Moor Insights and Strategy.<br />

Maricau, E., & Gielen, G. (2013). Analog IC Reliability in Nanometer CMOS. Analog Circuits and Signal Processing, DOI: 10.1007/978-1-4614-6163-0_2, 23-28.<br />

Howse, B. (2015, November 26). Examining Intel's New Speed Shift Tech on Skylake: More Responsive Processors. Retrieved from Anandtech: https://www.anandtech.com/show/9751/examiningintel-skylake-speed-shift-more-responsive-processors<br />

Roy, K., Mukhopadhyay, S., & Mahmoodi-Meimand, H. (2003). Leakage Current Mechanisms and Leakage Reduction Techniques in Deep-Submicrometer CMOS Circuits. Proceedings of the IEEE.<br />

Unified Extensible Firmware Interface Forum. (2017, May). Retrieved from Unified Extensible Firmware Interface Forum: http://www.uefi.org/sites/default/files/resources/ACPI_6_2.pdf<br />

Wolpert, D., & Ampadu, P. (2012). Managing Temperature Effects in Nanoscale Adaptive Systems (pp. 22-24). Springer.<br />



Understanding Where Power Goes in Energy<br />

Efficient Systems<br />

Rod Watt<br />

Director of System Architecture, Arm<br />

Cambridge, UK.<br />

Abstract—The traditional way of measuring the efficiency of a<br />

system is to measure the overall system power while running a<br />

series of synthetic benchmarks and see which scores highest<br />

while minimizing power consumption, or simply to run use<br />

cases and measure the battery drain.<br />

Of course, the entire system including the main processor, the<br />

memory, the Wi-Fi chip set, the LCD and the rest of the circuitry<br />

all contribute to the drain on the battery. Measuring the power at<br />

the battery level will certainly provide data on how the entire<br />

system is performing but will not give the granularity required to<br />

really understand where the power is going during the task.<br />

This paper will discuss techniques and procedures used to allow<br />

systems to be measured down to the SoC level, leading to a much<br />

deeper understanding of the system’s overall efficiency.<br />

It will also compare the differences in CPU activity and usage<br />

between synthetic benchmarks and traditional everyday use<br />

cases.<br />

Keywords—power, energy, efficient, synthetic, workloads<br />

I. INTRODUCTION<br />

When consumers are deciding to buy a new device, they will<br />

typically start off with a few basic requirements in terms of<br />

functionality. For example, the choice of a set top box may<br />

include requirements covering connectivity, picture quality,<br />

applications support and so on. Looking at what is available in the marketplace, several systems will no doubt meet these basic requirements. Looking beyond the top-level features, the consumer may choose to look at the detailed specifications of the system. Details such as the number and speed of the processors and the size of the memory in the system may give an indication of which system is the highest-performing solution, but it can be difficult to decide this based purely on the specifications. Equally, a system may come with a higher-specified power supply, which may suggest that this system consumes more power, but again, it may be naive to assume this.<br />

II. TRADITIONAL METHODS OF COMPARISON<br />

There are two basic methods for comparing the<br />

performance of a system.<br />

1) Use Cases<br />

Basically, this involves using the system and running typical<br />

workloads. For example, if it were a set top box, the user may<br />

choose to play some video content and check for picture<br />

quality and download speed. Playing a graphics-intensive game and assessing performance in terms of screen lag and sustained frames per second would indicate how the graphics processors compare in the different systems.<br />

2) Synthetic Benchmarks<br />

If the user wishes to stress the system further, synthetic<br />

benchmarks may be of use. Although these do not typically<br />

replicate what a user would actually do, they will stretch the<br />

system and attempt to push the compute subsystems to their<br />

limits.<br />

Although both methods will give an indication of the<br />

performance of the system, neither will provide any<br />

information on the power or energy that was consumed to<br />

attain that performance. Without an appreciation of the power<br />

consumption, these measurements will only provide part of the<br />

answer.<br />

III. POWER VS ENERGY<br />

Frequently, commentators will use the words “Power” and<br />

“Energy” when talking about system consumption. However,<br />

it’s important to understand the differences between these two<br />

and how they should be used when discussing overall system<br />

efficiency.<br />



Power is an instantaneous measurement that deals with a point<br />

in time.<br />

P(t) = V(t) x I(t) (1)<br />

where P(t) is the power in Watts at time t, V(t) is the voltage in Volts at time t, and I(t) is the current in Amps at time t.<br />

It is important to note that, in this equation, the time during which the current flows is not considered. The power being measured is purely the product of the voltage that is being applied and the current that is flowing.<br />

In contrast, energy, typically measured in Joules, is the amount of energy that is consumed over time:<br />

E = ∫ P(t) dt (2)<br />

where E is the energy in Joules, P(t) is the power in Watts at time t, and dt is the time interval.<br />

Alternatively, power can be described as the rate at which energy is consumed:<br />

P = E / dt (3)<br />

where P is the power in Watts, E is the energy in Joules, and dt is the time interval.<br />

What's important to note here is that energy takes time into consideration, whereas power is normally just an instantaneous measurement taken at a specific time.<br />

To illustrate the difference, consider two systems, “A” and “B”. System “A” consumes a high level of current for a short space of time, whereas system “B” consumes a lower average current over a much longer period.<br />

Comparing the power consumption of both systems (assuming both systems run off the same voltage), System “A”, with its higher average current, will consume the higher power. However, since System “A” only draws this high current for a short period of time, it may consume less energy than System “B”, which is drawing a (lower) current for a much longer time.<br />

It's energy that is the important measurement, not just power at an instantaneous time. It's energy that will drain the battery (current and power over time) and will also increase the electricity bill!<br />

Figure 1 Power vs. Energy<br />

IV. DEFINING EFFICIENCY<br />

Efficiency is defined as the ratio of useful work performed by a machine or a process to the total energy expended:<br />

Eff = W / E (4)<br />

where Eff is the efficiency, W is the work or task completed, and E is the energy.<br />

In this equation, efficiency is calculated as a relative number, not an absolute value. This is because how “Work” is measured will vary depending on the workload that is being run.<br />

For example, if the workload is a synthetic benchmark, the measure of work could be the score that is reported. Comparing the efficiency of both systems (calculated by dividing the benchmark score by the energy consumed to complete the benchmark) would allow the user to calculate the relative efficiency of both systems.<br />

Similarly, the systems' efficiency could also be compared using traditional use cases. In this example, the “Work” will be defined depending on what the use case actually is. For example, if the use case is a video, the work could be defined as the frames per second that can be achieved, or it could be defined as the speed at which the video was loaded. If the use case was an application, the work could be defined as the speed at which the application is started.<br />

As long as the definition of the work is consistent during testing, the calculated efficiency can be used to compare the two systems.<br />
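The distinction between power and energy can be illustrated with a short calculation. The following Python sketch is not from the paper; the voltage, currents and durations for systems “A” and “B” are invented purely to show that the higher-power system can be the lower-energy one.<br />

```python
# Power vs Energy for the hypothetical systems "A" and "B" described above.
# All numbers here are invented for illustration.

VOLTAGE = 5.0  # assume both systems run off the same voltage rail

def power_w(current_a):
    # Equation (1): P = V x I
    return VOLTAGE * current_a

def energy_j(current_a, duration_s):
    # Equation (2) for a constant draw: E = P x t
    return power_w(current_a) * duration_s

p_a, e_a = power_w(2.0), energy_j(2.0, 5.0)    # System A: high current, short time
p_b, e_b = power_w(0.5), energy_j(0.5, 60.0)   # System B: low current, long time

print(p_a, p_b)  # 10.0 W vs 2.5 W: A draws the higher power
print(e_a, e_b)  # 50.0 J vs 150.0 J: yet A consumes the lower energy
```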



V. EFFICIENCY – A REAL EXAMPLE<br />

For this example, consider two similar systems, System A and<br />

System B. The challenge is to decide which one is the “best”.<br />

Of course, being best can mean many things but, in this<br />

example, Efficiency will be considered the measure of choice.<br />

Firstly, running a simple synthetic benchmark will give an<br />

indication of performance.<br />

TABLE I.<br />

Measurement System A System B<br />
Score 43329 12350<br />

In this case, System A scores 43329 vs. 12350 on System B, suggesting that System A is approximately three times “better” than System B.<br />

This test can then be repeated; however, this time the system voltage and current will be noted. This allows the average system power to be calculated.<br />

TABLE II.<br />

Measurement System A System B<br />
Score 43329 12350<br />
Average Current 1898mA 1048mA<br />
Average Voltage 3.4V 2.78V<br />
Average Power 6495mW 2920mW<br />

Looking at the average power consumption shows that although System A did achieve the higher score, it did so while drawing more current and hence more power. So, although System A is providing higher performance, it's taking more power to achieve this. At this stage, an efficiency calculation could be done but, as discussed earlier, for this to be accurate the energy, not the power, needs to be considered.<br />

Running the tests again, but this time noting the time in addition to the other metrics, allows the energy, and hence the energy efficiency, to be calculated.<br />

TABLE III.<br />

Measurement System A System B<br />
Score 43329 12350<br />
Average Current 1898mA 1048mA<br />
Average Voltage 3.4V 2.78V<br />
Average Power 6495mW 2920mW<br />
Time For Test 5.59s 13.95s<br />
Consumed Energy 2527uAH 2836uAH<br />
Efficiency 17.1 4.4<br />

Figure 2 Relative Scores<br />

Even though System A consumed a higher current, it did so over a much shorter time, so the overall energy consumption was less. Calculating the relative energy efficiency (score/energy), it can be seen that System A is nearly four times more energy efficient than System B for this specific workload. Other workloads may give different results.<br />

So, it can be seen, in order to calculate the efficiency of a<br />

system, two main measurements are required:<br />

1) The energy consumption of the system<br />

2) A measure of work<br />
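The calculation above reduces to a single division once both measurements are in hand. The sketch below uses the figures from TABLE III; the helper name is our own, not from the paper.<br />

```python
# Relative efficiency from the TABLE III measurements: work (benchmark
# score) divided by consumed energy. Only the ratio between systems is
# meaningful, since "work" here is a unitless score.

def efficiency(score, energy_uah):
    return score / energy_uah

eff_a = efficiency(43329, 2527)   # System A
eff_b = efficiency(12350, 2836)   # System B

print(round(eff_a, 1), round(eff_b, 1))   # ~17.1 vs ~4.4, as in Table III
print(round(eff_a / eff_b, 1))            # System A is roughly 3.9x more efficient
```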

VI. THE PRACTICALITIES OF POWER MEASUREMENT<br />

Different systems will require different methods for measuring<br />

power. Ultimately, it will come down to a number of variables,<br />

including<br />

1) The type of system<br />

2) The accuracy required<br />

3) Time and money<br />

This paper will discuss a few of the more common methods.<br />

A. Fuel Gauges<br />

Many consumer systems, such as mobile phones and tablets<br />

contain a “Fuel Gauge” which monitors the battery capacity<br />

and the current flow from it. Typically, these are used in<br />

mobile devices to report the amount of energy that is<br />

remaining in the device.<br />

It is possible to access the fuel gauge via software and also to<br />

monitor the current flow from the battery. Although this is<br />

non-invasive, the results can vary greatly in accuracy.<br />
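On Linux-based consumer devices, the fuel gauge is often exposed through the power_supply sysfs class, so a software read can be very simple. The sketch below assumes that convention; the node path and names vary between kernels and handsets and are not taken from this paper.<br />

```python
# Minimal sketch of reading a fuel gauge through the Linux power_supply
# sysfs class. Node names and availability vary between devices; the
# path below is a common convention, not guaranteed to exist.
from pathlib import Path

BATTERY = Path("/sys/class/power_supply/battery")

def read_gauge(node):
    """Return the integer value of a gauge node, or None if unavailable."""
    try:
        return int((BATTERY / node).read_text())
    except (OSError, ValueError):
        return None

voltage_uv = read_gauge("voltage_now")   # microvolts
current_ua = read_gauge("current_now")   # microamps

if voltage_uv is not None and current_ua is not None:
    # Instantaneous battery power, P = V x I
    print((voltage_uv / 1e6) * (current_ua / 1e6), "W")
```

Sampling such readings repeatedly and integrating over time gives an energy estimate, subject to the accuracy caveats noted above.<br />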



Additionally, not all consumer systems have fuel gauges on-board.<br />

Figure 3 Fuel Gauge Accuracy<br />

B. Measure at the Battery<br />

When a consumer device is powered from a battery, it may be possible to bypass this battery and power the system from a metered power supply. The practicalities of removing and bypassing the battery will be system specific. Some systems, with removable batteries, make this a relatively easy process. In this case, the battery's terminals may be isolated and wires added to bypass it. In the case of a system which is not designed to have a removable battery, this process can be trickier. Physically removing the battery, getting access to the terminals and adding the bypass wires can be problematic. Assuming this process can be achieved, a metered power supply can be connected to power the system. Although this process is a good alternative when a fuel gauge is not accessible, it is invasive, with the level of difficulty depending on the specific system.<br />

C. Measure at Power Jack<br />

Typically, development boards will be powered from a DC power supply which connects to a power jack on the board. In this case, it's relatively easy to modify the cable to enable it to be plugged into a metered power supply. This has the advantage that the development board itself does not have to be modified, just the power lead. It may be advisable to purchase an additional power lead to keep the original one intact.<br />

D. Measure at USB Connector<br />

Some development boards, such as the Raspberry Pi, do not use a separate power lead for the system, electing instead to use a USB-style connector. Similar to modifying a normal power lead, the USB lead can also be modified to enable an external power supply to be used. In the case where the USB connector is also used to transmit data, for example as a debug port, this method has the disadvantage that this capability will be lost. Typically, this port would be connected to a personal computer that would supply power in addition to providing the debug port.<br />

E. USB Power Monitor<br />

Relatively inexpensive USB power monitors can be used to measure the power being provided to the USB port. These devices are totally non-invasive (there is no need to modify the board) and also enable data to be transmitted and received. Unfortunately, they only indicate the instantaneous current and voltage, not the average.<br />

F. Modified USB Cable<br />

To overcome the problem of only being able to monitor<br />

instantaneous current and voltage, another method involves<br />

modifying the USB cable to enable direct measurement. In<br />

this case, a shunt resistor, typically 5-10 mΩ, is inserted into<br />

the power lines within the USB cable. By measuring the<br />

voltage drop across this shunt resistor, it is possible to<br />

calculate the current flow. When the absolute voltage of this<br />

power line is also noted, (by tapping into one side of the shunt<br />

resistor and the ground wire), the power consumption can be<br />

calculated.<br />

This method does enable accurate measurements of the power<br />

but requires relatively expensive test equipment. Typically, an<br />

instrument known as a data acquisition unit, or DAQ, is<br />

used to monitor the power rail continually. This can be a<br />

relatively expensive piece of equipment. Development boards<br />

can use numerous USB cable types, for instance micro, mini and<br />

USB-C; in this case, a cable of each type used will need to be<br />

modified.<br />
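The arithmetic behind the shunt method can be sketched as follows. The shunt value and the DAQ samples are invented for illustration; a real capture would contain thousands of samples.<br />

```python
# Shunt-resistor power measurement as described above: the DAQ records the
# voltage drop across the shunt plus the rail voltage, from which current,
# power and energy are derived. Sample values are invented for illustration.

R_SHUNT = 0.005  # 5 mOhm shunt inserted into the USB power line

# (time_s, shunt_drop_v, rail_v) tuples as a DAQ might log them
samples = [(0.0, 0.0025, 5.01), (0.1, 0.0030, 5.00), (0.2, 0.0028, 5.00)]

energy_j = 0.0
prev_t = prev_p = None
for t, v_drop, v_rail in samples:
    current = v_drop / R_SHUNT    # Ohm's law: I = V_shunt / R
    power = v_rail * current      # instantaneous power: P = V x I
    if prev_t is not None:
        # trapezoidal approximation of E = integral of P dt
        energy_j += 0.5 * (power + prev_p) * (t - prev_t)
    prev_t, prev_p = t, power

print(energy_j, "J consumed over the capture window")
```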

G. Modified USB Power Monitor<br />

One solution to the problem of having to modify multiple<br />

USB cables, is to modify a USB power monitor. Although<br />

these devices are designed to show the instantaneous current<br />

and voltage, it is possible to tap into the internal power rails<br />

and shunt resistors, which can then in turn be monitored using<br />

a DAQ as previously described. As the USB interface to these<br />

devices tends to be a standard USB type A connector (plug in,<br />

socket out), these enable various USB cables to be used.<br />

H. Measure at SoC<br />

All of the processes described above allow the total system<br />

power and, by implication, total system energy to be<br />

measured. Although this is of great interest from a user point<br />

of view (a user wants to know when the battery will run out or<br />



how much the system costs to run), it may be of less interest to<br />

a developer or engineer. When trying to understand the<br />

efficiency of a system, of course the total system<br />

power/energy is of interest. But, to truly understand the<br />

efficiency of the system, it’s important to have greater<br />

granularity; a deeper insight into which part of the system is<br />

consuming most/least energy.<br />

In order to understand this level of detail, it’s necessary to tap<br />

into various sections of the system and monitor the energy at a<br />

more detailed level. For example, a system may have separate<br />

power rails for the central processing unit, the graphics<br />

processor, the memory interface and the other peripherals.<br />

By adding shunt resistors into these individual power rails and<br />

using a DAQ, similar to the process described for modifying<br />

the USB cable, it is possible to monitor the individual power<br />

rails around the system. Although this method is very<br />

invasive, and potentially tricky, it does provide a very detailed<br />

insight into the energy consumption around the system.<br />
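Once per-rail energies have been captured, turning them into a breakdown is straightforward. In the sketch below, the rail names and Joule figures are invented for illustration.<br />

```python
# Turning per-rail energy measurements into a percentage breakdown, as in
# the per-rail shunt approach described above. Values are invented.

rail_energy_j = {"cpu": 12.4, "gpu": 21.7, "dram": 6.1, "peripherals": 2.8}

total_j = sum(rail_energy_j.values())
for rail, e in sorted(rail_energy_j.items(), key=lambda kv: -kv[1]):
    print(f"{rail:12s} {e:6.1f} J  {100 * e / total_j:5.1f}%")
```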

I. Summary of Methods<br />

A summary of the various methods described is shown in the<br />

tables below. Each method will have its own challenges and<br />

will provide its own level of accuracy. Ultimately, cost, time<br />

and the level of accuracy needed will determine the best<br />

method to utilize.<br />

TABLE IV.<br />

Method | Advantages | Disadvantages<br />
Fuel Gauge | Non-invasive. | Not always available.<br />
Measure at Battery | Works with/without fuel gauge. Accurate. | Invasive.<br />
Measure at Power Connector | Dev board does not require modification. | Not all boards have separate power inputs.<br />
Measure at USB Connector | Allows access to boards without separate power connectors. | Lose access to debug port.<br />
USB Power Monitor | Non-invasive. Enables data port to be used. | Only shows instantaneous power.<br />
Modified USB Cable | Enables data port to be used. Provides continuous power monitoring. | Each USB cable needs to be modified. Additional, expensive equipment required.<br />
Modified USB Power Monitor | Enables data port to be used. Can be used with any USB cable. Provides continuous power monitoring. | Additional, expensive equipment required.<br />
Measure at SoC | Provides power breakdown. | Very invasive. Can be tricky.<br />

TABLE V.<br />

Method | Accuracy | Granularity<br />
Fuel Gauge | Variable | System level<br />
Measure at Battery | Accurate | System level<br />
Measure at Power Connector | Accurate | System level<br />
Measure at USB Connector | Accurate | System level<br />
USB Power Monitor | Not accurate | System level<br />
Modified USB Cable | Accurate | System level<br />
Modified USB Power Monitor | Accurate | System level<br />
Measure at SoC | Very accurate | SoC level<br />

VII. BENCHMARKS VS REAL-LIFE WORKLOADS<br />

When deciding which workloads to use to exercise the system<br />

under test, a user typically has two choices.<br />

1) Synthetic Benchmarks<br />

These are workloads that have been designed specifically to<br />

test certain aspects of the design such as processor<br />

performance, frames per second achievable from the<br />

graphics processor, memory bandwidth and so on<br />

2) Real-Life use cases<br />

These are general workloads that a user would typically run<br />

on the system. Typically, different systems would run<br />

different types of workloads. For example, for a mobile<br />

phone, a real-life use case could be a low-end game. For a<br />

tablet, a use case could be web browsing.<br />

The table below summarizes the general advantages and<br />

disadvantages of each.<br />

TABLE VI.<br />

Method | Advantages | Disadvantages<br />
Synthetic Benchmarks | Easily repeatable. Provide a “score”. Designed to stress the major components. | Synthetic - not necessarily representative of what the system actually does. Variable/dubious quality. Can be tuned for a target device.<br />
Real-Life Use Cases | Not synthetic, this is what users actually do. Use is important, not quality. Cannot be tuned for a target device. | Difficult to repeat. Don't always stress the subsystem. Don't typically provide a final score.<br />

In order to test which workload, synthetic or real-life, would<br />

best represent the efficiency of the system, the following tests<br />

were conducted.<br />



Three different mobile phones were used for the experiment.<br />

The phones varied in cost and specification.<br />

The table below summarizes the specifications of each of the<br />

handsets used.<br />

TABLE VII.<br />

Attribute | Premium | Mid-Tier | Entry<br />
Cost | > $400 | > $200, < $400 | < $200<br />
Thickness | < 8mm | < 9mm | < 10mm<br />
Screen | > 1080p | < 1080p | < 720p<br />
CPU | 4 x big + 4 x LITTLE | 4 x little + 4 x LITTLE | 4 x LITTLE<br />
GPU | > 70fps T-Rex | > 40fps T-Rex | < 40fps T-Rex<br />
Memory | > 2GB | > 1GB | < 1GB<br />
Camera(s) | > 12MP | > 8MP | < 8MP<br />

Each handset had a different set of attributes; these were categorized as follows.<br />

1) Cost<br />
The average cost of the handset, SIM free, measured in US Dollars.<br />

2) Thickness<br />
The thickness of the handset, measured in mm.<br />

3) Screen<br />
The screen density.<br />

4) CPU<br />
The type and number of CPU clusters supported in each device. “4 x big” refers to 4 x Cortex-A57 processors, “4 x LITTLE” refers to 4 x Cortex-A53 processors running at > 1.5GHz, and “4 x little” refers to 4 x Cortex-A53 processors running at < 1.5GHz.<br />

5) GPU<br />
The performance of the graphics processor was measured by running the GFX-Bench T-REX benchmark and noting the scores. The GPU performance was rated on the scores achieved on this test.<br />

6) Memory<br />
The amount of DRAM in the handset, measured in GB (Gigabytes).<br />

7) Camera(s)<br />
The number of pixels the camera can support, measured in MP (Megapixels).<br />

VIII. COMPARING BENCHMARKS<br />

A series of benchmarks were run on each of the three<br />

handsets. The tests targeted some of the main functions of<br />

each handset such as the processor, the graphics, memory and<br />

storage. In addition to noting the scores of each test, the<br />

amount of energy that was consumed during the test was also<br />

measured. This enabled the efficiency of each test to be<br />

calculated. At the end, the efficiency of all the tests was<br />

averaged out across the three handsets and the results<br />

compared. A summary of the results is as follows.<br />

1) The results were normalized around the mid-tier<br />

device. So, in terms of the mid-tier device, its performance,<br />

energy and efficiency were all normalized to 1.<br />

2) The premium device was 2.7 x more efficient than the<br />

mid-tier device. This advantage was a combination of higher<br />

performance and lower energy consumption<br />

3) The low-end device was 40% less efficient than the<br />

mid-tier device. This disadvantage was a combination of lower<br />

performance and higher energy consumption<br />
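The normalization described above can be sketched as follows. The raw performance and energy numbers are invented; only the method (dividing each device's efficiency by the mid-tier figure) reflects the text.<br />

```python
# Normalizing efficiency results around the mid-tier device, as described
# above. The raw scores and energies are invented for illustration.

raw = {
    "premium": {"perf": 3.4, "energy_j": 120.0},
    "mid":     {"perf": 1.9, "energy_j": 180.0},
    "entry":   {"perf": 1.1, "energy_j": 175.0},
}

def normalized_efficiency(raw, baseline="mid"):
    eff = {name: r["perf"] / r["energy_j"] for name, r in raw.items()}
    return {name: round(e / eff[baseline], 2) for name, e in eff.items()}

print(normalized_efficiency(raw))  # the baseline device maps to 1.0 by construction
```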

Not surprisingly, the premium handset outperformed the other<br />

two in terms of benchmark scores, undoubtedly due to the<br />

higher specified CPU, GPU, memory and screen. What may<br />

be more surprising, was that in general, the premium handset<br />

achieved these high scores while consuming less energy. The<br />

higher rated processors were able to complete the time bound<br />

synthetic benchmarks in a shorter time and hence minimized<br />

the energy consumption.<br />

IX. COMPARING WORKLOADS<br />

The tests were then repeated but this time using real-life<br />

workloads. A series of tests were run on each of the three<br />

handsets. The tests targeted various workloads such as social<br />

and messaging, web access and gaming. Again, the<br />

performance, energy consumption and efficiency were<br />

calculated and averaged across all the tests. A summary of the<br />

results is as follows.<br />

1) As before, the results were normalized around the mid-tier<br />

device. So, in terms of the mid-tier device, its<br />

performance, energy and efficiency were all normalized to 1.<br />

2) Again, the premium device was the most efficient, but<br />

this had now dropped to 1.8 x more efficient than the mid-tier<br />

device. This advantage was a combination of higher<br />

performance and lower energy consumption<br />



3) This time, the low-end device was 13% more efficient<br />

than the mid-tier device. This advantage was a combination of<br />

higher performance and lower energy consumption<br />

The results of running real-life workloads on the handsets were<br />

very different than running synthetic benchmarks. The<br />

performance advantage of the premium device was not as<br />

marked as previously; often very similar to the mid-tier and<br />

entry level phones.<br />

In some cases, the entry level device actually out-performed<br />

the mid-tier device. The entry level device had a smaller<br />

screen, in terms of pixel count. This reduces the amount of<br />

data the phone may have to process during the workloads.<br />

X. BENCHMARKS VS REAL-LIFE - CONCLUSIONS<br />

Benchmarks are designed to stress the main processing<br />

elements within a system. Although a premium handset will<br />

show a performance and efficiency advantage over the others,<br />

when running less stressful, real-life workloads, this advantage<br />

is reduced.<br />

As the cost of the phone goes up, so does its overall<br />

specification. For example, the premium device and the mid-tier<br />

device both have bigger (in terms of screen density)<br />

screens than the entry level device. More processing power is<br />

required to service these higher screen densities and more<br />

power is required to run them.<br />

Synthetic benchmarks are designed to stress individual<br />

subsystems and measure peak performance. As they tend to<br />

target different parts of the system, (central processor, graphics<br />

processor, memory), no one benchmark really has the complete<br />

answer. In order to get a clearer understanding of the<br />

efficiency of the system, it’s advisable to run a number of these<br />

tests and look at the overall trend.<br />

Workloads are more representative of how the overall device<br />

will perform as they tend to put less stress on the main<br />

subsystems, instead spreading the load around the entire system.<br />

From this point of view, running real-life workloads will<br />

provide a better indication of the system efficiency whereas<br />

synthetic benchmarks will give an indication of the peak<br />

performance that an individual sub-system within the device<br />

can attain.<br />

XI. OVERALL BENEFITS OF EFFICIENT SYSTEMS<br />

Perhaps the most obvious benefit of a more efficient system is<br />

extended battery life. From a marketing point of view, a claim<br />

that the system provides a longer battery life than its<br />

competitors is a clear advantage. Manufacturers will use these<br />

claims in their marketing campaigns, sellers will use this to<br />

compare product lines and technical journalists like to quote<br />

this data in systems reviews. However, the design of efficient<br />

systems will also provide some secondary advantages.<br />

1) Cheaper / Fewer Components<br />

Power Management ICs or PMICs – devices that can be used<br />

to regulate the power inside a consumer device – can be<br />

relatively expensive. A recent analysis of the Bill of Materials<br />

(BOM) of a modern mobile device found that 4% of the<br />

silicon cost could be directly attributed to voltage regulation<br />

and power generation.<br />

If a system is designed to be more efficient, and hence able to<br />

run off a lower current rating, the cost of these components<br />

can be reduced.<br />

Lower current/power devices tend to be less expensive. If the<br />

current required for a power rail can be reduced, it may be<br />

possible to share the power generation of this power rail<br />

between power regulators, hence reducing the component<br />

count and BOM cost.<br />

2) Printed Circuit Board Stack up<br />

When designing a Printed Circuit board (PCB), there are a<br />

number of rules that the designer follows.<br />

PCBs tend to have an even number of layers and these layers<br />

are typically symmetrical around the middle layers. When<br />

routing high speed signals, such as dynamic memory address<br />

and data lines, these signals should be routed adjacent to a<br />

solid power or ground plane. This ensures that the signal has a<br />

fast and uninterrupted return path.<br />

When there are numerous power planes within a PCB, some<br />

designers will choose to use split power planes where one<br />

plane on the PCB can support multiple power rails. Although<br />

this helps reduce the layer count, it does restrict the routing of<br />

traces on adjacent signal planes as these cannot be routed<br />

across these split power planes. The alternative to split power<br />

planes is to add additional layers – this obviously has a cost<br />

implication. Typically, each additional two layers will add<br />

10% to the overall cost of the PCB.<br />

An efficient design may be able to replace split power planes<br />

with thick signal traces, due to the reduction in current. As<br />

well as helping to reduce the overall layer count, these can be<br />

routed in such a way as to minimize the impact of the high-speed signals on the adjacent layers.<br />

In addition, if the current is reduced, power traces can be<br />

shared between components, again, easing the layout<br />

challenges.<br />

3) Easier Thermal Design<br />

With higher energy dissipation come thermal challenges. A system that consumes more energy will naturally require additional cooling. Ideally a system should be<br />

passively cooled, i.e. cooling naturally in the surrounding air,<br />

rather than be actively cooled by a fan. There are a number of<br />

reasons for this.<br />



a) Fans and heat sinks will add to the overall cost of the<br />

system.<br />

b) The introduction of moving parts will introduce<br />

another point of failure, potentially reducing the overall<br />

reliability of the system.<br />

c) Typically, a larger enclosure is required to house the<br />

additional fans, adding cost.<br />

d) To enable air flow to/from these fans, additional<br />

tooling costs will be incurred in adding air vents and slots.<br />

e) Overall this bigger, noisier and heavier design will<br />

be less elegant than a more efficient, passively cooled design.<br />

4) Component Failure<br />

Systems tend to fail following the classic “bathtub” curve<br />

where the majority of the failures happen at the start or the end<br />

of a system’s life cycle.<br />

The “infant mortality” failures seen early in a system’s life cycle are mainly due to faulty components. This<br />

can be for a number of reasons but will include manufacturing<br />

and process defects. Typically, if a system survives this early<br />

stage in its life, additional defects don’t tend to occur until<br />

well into the system’s life cycle. Again, these “wear out”<br />

failures can occur for a number of reasons but one of the most<br />

common is thermal stress.<br />

Thermal stress is caused by a component continually heating up and cooling down – one of the byproducts of an inefficient system in which excess current, energy and heat must be dissipated. If the system is designed to be more power efficient, this thermal stress can be reduced and the life cycle of the product extended.<br />

XII. SUMMARY<br />

When comparing systems and deciding “which is best”, efficiency is a key measurement. Performance is an important metric, but without understanding the “cost” of that performance, it’s of limited use.<br />

Although the two words are frequently used together, it’s important to remember that Power is not the same as Energy. Power tends to represent a point in time and is calculated by multiplying the instantaneous current by the instantaneous voltage. This does not give any indication of the duration for which the current was being drawn at any specific voltage, so it’s not a true representation of the “cost” of completing the workload.<br />

There are numerous ways of measuring and verifying your design’s energy consumption. These methods vary greatly in accuracy and complexity. When defining a strategy for measuring energy, the user must decide how much time, effort and cost they are willing to spend on the problem.<br />

Synthetic benchmarks will give an indication of performance, but they don’t tend to measure efficiency. They are great for showing peak performance of the main subsystems in a device but don’t highlight a subsystem’s combined impact on user experience. So, although they stress the system, they are not necessarily representative of use cases. Care must be taken when using them as they can be manipulated.<br />

Consumers run workloads, or “real-life use cases”, every day, not synthetic benchmarks. In terms of understanding the efficiency of the system, these can be of limited use as they don’t stress the system by pushing it to run at its peak. They don’t tend to provide an indication of performance (the movie just played!) and they can be difficult to repeat consistently. However, they do demonstrate that the total system is important, not just the processor subsystem: everything has an effect on the system efficiency.<br />

Although the obvious benefit of an efficient system is extended battery life or reduced cost to operate, there are additional benefits to making an energy-efficient system. The benefits of an energy-efficient, low-power design go beyond battery life to include reduced system costs, increased reliability and overall design simplicity.<br />



Internet of Threats? – A Code Quality Management<br />

Strategy<br />

Mark Rhind<br />

Senior Technical Consultant, PRQA<br />

Ashley Park House, 42-50 Hersham Road, Walton on Thames<br />

Surrey, KT12 1RZ, United Kingdom<br />

Mark_Rhind@prqa.com<br />

Abstract— HP Security Research (2015) found that 70% of the<br />

most commonly used IoT devices, such as smart thermostats and<br />

home security systems, contain serious security vulnerabilities.<br />

The rising number of complex connected devices invites attacks on<br />

multiple fronts, from client applications and cloud services to<br />

firmware and applications. We need to prevent the Internet of<br />

Things (IoT) from becoming the “Internet of Threats”.<br />

How should we protect ourselves?<br />

The answer lies in finding software vulnerabilities in the<br />

applications as early as possible in the development stage. This can<br />

be achieved by incorporating code quality management including<br />

static analysis into your software development process.<br />

In this paper, we will outline the different types of software<br />

verification and provide advantages and drawbacks for each of<br />

them. We will explain that the most effective and proven<br />

methodology is to use static analysis tools with a coding standard.<br />

We will then provide the added benefits of using a static analysis<br />

tool.<br />

Keywords—Software engineering; security; internet of things<br />

I. JUST BUGS?<br />

As a security person, you need to repeat this mantra:<br />

"security problems are just bugs"<br />

In what has become a somewhat infamous tirade on the<br />

Linux Kernel mailing list, Linus Torvalds asserted that "security<br />

problems are just bugs"; that the primary purpose of software<br />

hardening strategies is often debugging [1].<br />

Looking beyond the personal interests of the parties involved<br />

in this exchange, Torvalds makes a valid point. Studies have<br />

found that 64% of the vulnerabilities described in CERT<br />

National Vulnerability Database were the result of programming<br />

errors [2].<br />

However, it might also be fair to suggest that Torvalds'<br />

statements risk trivializing a complex and costly problem;<br />

creating bug-free software remains a significant challenge that<br />

is rarely - if ever - achieved. In this era of software flaws that<br />

have global implications and massive financial impact, is it<br />

appropriate to describe a programming error as "just" a bug? To<br />

take one example, the OpenSSL Heartbleed Bug has been<br />

estimated to have a cost in excess of $500M [3].<br />

HP Security research found that 70% of the most commonly<br />

used smart devices still contain serious security vulnerabilities<br />

[4]. With the number of internet-connected devices projected to<br />

reach more than 20 billion by 2020 [5], this raises an important<br />

question: How do organizations ensure that these devices are<br />

secure and bug-free?<br />

II. HARDENING VS. DEBUGGING<br />

A. Hardening<br />

The term software hardening is growing in popularity to<br />

describe a range of strategies and techniques to secure software<br />

or devices against intrusion or misuse. Presently, there is little<br />

formal consensus on what is covered by software hardening, but<br />

the term is frequently used to describe strategies for security-by-design, which may include:<br />

● Layered security, or defense in depth<br />
● Applying the principle of least privilege<br />
● Encrypting communication where possible<br />
● Securely storing sensitive data<br />
● Enforcing secure configuration, such as minimum password requirements<br />

These design strategies are commonly verified by automated<br />

or manual penetration testing.<br />

Conventional penetration testing methodologies for IoT<br />

devices rely on testing the complete electronic ecosystem for a<br />

specific device. This includes the hardware, software - including<br />

any operating system, communications protocols, mobile<br />

applications, cloud services, and so on [6][7]. Testing of this<br />

breadth is often very expensive to perform and can be of limited<br />

value until the product is almost ready for launch.<br />



hashOut.data = hashes + SSL_MD5_DIGEST_LEN;<br />

hashOut.length = SSL_SHA1_DIGEST_LEN;<br />

if ((err = SSLFreeBuffer(&hashCtx)) != 0)<br />

goto fail;<br />

if ((err = ReadyHash(&SSLHashSHA1, &hashCtx)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &clientRandom)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &serverRandom)) != 0)<br />

goto fail;<br />

if ((err = SSLHashSHA1.update(&hashCtx, &signedParams)) != 0)<br />

goto fail;<br />

^<br />

MISRA C:2012 Rule-15.6 (qac-9.4.0-2212) Body of control statement is not enclosed within<br />

braces.<br />

goto fail;<br />

if ((err = SSLHashSHA1.final(&hashCtx, &hashOut)) != 0)<br />

^<br />

MISRA C:2012 Rule-2.1 (qac-9.4.0-2880) This code is unreachable.<br />

goto fail;<br />

Fig. 1. Example MISRA C:2012 violations in file sslKeyExchange.c from the Apple SSL/TLS library. Retrieved from:<br />

https://opensource.apple.com/source/Security/Security-55471/libsecurity_ssl/lib/sslKeyExchange.c<br />

While the value of software hardening strategies in securing<br />

devices is widely accepted, it is also possible for them to be<br />

undermined by the presence of a programming error. This is<br />

clearly demonstrated by the 2014 Apple "goto fail" SSL bug [8].<br />

In this case, a duplicated goto statement, obscured by erroneous indentation,<br />

rendered Apple's official SSL/TLS library insecure, opening OS<br />

X and iOS devices to man-in-the-middle attacks.<br />

This flaw was present in released products, demonstrating<br />

that the software verification strategies used in this case were<br />

clearly ineffective. Research has shown that the cost of fixing<br />

defects doubles after the implementation phase and rises by six<br />

times for defects that must be fixed post-release [9].<br />

This suggests that even vulnerabilities that are discovered by<br />

penetration testing will cost significantly more to fix than those<br />

identified during the development of the code.<br />

B. Debugging<br />

A widely used strategy for debugging software is to employ<br />

a static analysis tool. In contrast to other testing tools, static<br />

analysis identifies issues in the source code without executing it.<br />

This allows static analysis to be utilized at any point in the<br />

development process.<br />

Taking the aforementioned Apple SSL vulnerability, it is<br />

apparent that this error could have been detected much earlier in<br />

the release process had the code been analyzed with a static<br />

analysis tool during development. Additionally, errors of this<br />

type are well known and documented in many coding standards,<br />

including MISRA and CERT (see Fig. 1.) The obvious<br />

implication is that, had an appropriate formal coding standard<br />

been used and enforced in the development of this library, this<br />

software would not have been released containing this<br />

vulnerability.<br />

Many well documented software vulnerabilities were<br />

previously recognized as serious software errors. A typical<br />

example of this is Buffer Overflow. "Classic Buffer Overflow"<br />

is ranked third on the CWE Top 25 Most Dangerous Software<br />

Errors [10] - however, issues of this type have been recognized<br />

for more than two decades and are addressed in coding standards<br />

as old as MISRA C:1998.<br />

Despite the awareness of buffer overflow as an attack vector,<br />

vulnerabilities of this nature are still frequently discovered.<br />

CVE-2017-1000251, published in September 2017, describes a<br />

potential buffer overflow vulnerability in the Bluetooth stack in<br />

the Linux kernel, which could be exploited to allow remote code<br />

execution [11].<br />

Static analysis tools have also been proven effective in<br />

identifying common vulnerabilities in modern software. Studies<br />

have found that even in addressing widely understood<br />

vulnerabilities such as SQL injection, static code analysis tools<br />

typically have much higher coverage than penetration testing<br />

tools [12].<br />

C. Hardening and Debugging<br />

By their nature, static analysis tools cannot entirely replace<br />

penetration testing. However, penetration testing is typically<br />

expensive - both in terms of time and money. Detection of<br />

common errors - such as buffer overflows - will rarely be the<br />

most effective use of the resources required for a comprehensive<br />

penetration test, particularly if it's possible to detect errors of this<br />

type much earlier in the development process. In addition, issues<br />

detected by penetration testing will be much costlier to resolve -<br />

particularly if follow-up testing is required.<br />

In contrast, static analysis tools can be built into the<br />

software development lifecycle at the implementation phase,<br />

allowing issues to be identified and resolved much earlier.<br />

Typically, organizations using static analysis tools report that<br />

they find - and fix - errors earlier in the development process and<br />



discover more defects overall than organizations not using static<br />

analysis [13].<br />

Therefore, it should be apparent that a combined approach is<br />

most effective. Static analysis will identify a large proportion of<br />

security issues and programming errors during a project's<br />

implementation phase. This can be supported by penetration<br />

testing during the integration and testing phase to verify the<br />

implementation of the design's security features.<br />

III. CODING STANDARDS<br />

Static analysis tools are often most effective when they are<br />

being used to enforce a well-defined set of coding guidelines. In<br />

industries that deal with safety-critical software, the use of a<br />

coding standard supported by suitable analysis tools is already<br />

standard practice. Industrial standards for the automotive and<br />

medical industries, such as ISO 26262 [14] and IEC 62304 [15],<br />

effectively mandate the use of these tools. However, in the wider<br />

field of embedded software, the use of these technologies is still<br />

relatively uncommon.<br />

A 2017 study by the Barr Group [16] reported that 60% of<br />

the organizations surveyed expected to be developing devices<br />

with some degree of network connectivity. Yet only two thirds<br />

of the organizations surveyed reported that they use a written<br />

coding standard, and only half reported that they use a static<br />

analysis tool.<br />

A. Enforcing a Coding Standard<br />

There is a significant range of free and commercial coding<br />

standards available. While selecting the correct one for any<br />

application is certainly not trivial, that decision is often less<br />

important than simply deciding to use a coding standard. It has<br />

been well established that coding conventions - including<br />

applying a coding standard - are only effective if the decision is<br />

made at the outset of the project [17]. Selecting a coding<br />

standard before beginning development and enforcing it<br />

throughout the software implementation will produce code that<br />

has fewer defects in addition to being consistent and easier to<br />

maintain. Developing a product and then attempting to make it<br />

safe and/or secure is costly and potentially dangerous.<br />

Yet the practical considerations are often more complex.<br />

Greenfield projects where all decisions can be made in advance<br />

are a rarity for the majority of organizations. Most product<br />

development involves some degree of pre-existing code.<br />

This is often true for consumer goods where the importance<br />

of time-to-market invites rapid development cycles, making<br />

code-reuse an appealing option.<br />

Conversely, industrial machinery typically has an expected<br />

lifetime measured in decades. It is costly and impractical to<br />

replace equipment of this nature, yet the benefits of connected<br />

infrastructure drive a desire to add this functionality into existing<br />

equipment [18].<br />

In both of these cases, products will contain, or interface<br />

with, a significant body of legacy code. In all probability, this<br />

code has not been developed in line with modern practices and<br />

almost certainly has not been developed with the security<br />

demanded by the IoT in mind.<br />

B. Legacy Code<br />

Historically, conformance of legacy code to a coding<br />

standard has not been enforced [19]. The conventional wisdom<br />

was that legacy code was "proven in the field" - if it had operated<br />

defect-free for a significant period of time, it was considered<br />

unlikely that any remaining errors in the code would lead to<br />

sudden failures.<br />

However, this principle cannot be applied when<br />

incorporating connective functionality. Additional interfaces<br />

open up a much broader range of attack vectors. It is possible for<br />

relatively innocuous defects that cannot cause a failure under<br />

normal operating conditions to become serious vulnerabilities if<br />

exploited by an attacker.<br />

This is particularly apparent - and frightening - in the 2016<br />

attack on Ukraine's power grid. Attackers succeeded in seizing<br />

control of power stations' SCADA software through public<br />

networks, resulting in a blackout in parts of the country that<br />

lasted for several hours. The attackers were able to move<br />

laterally through the network, first infiltrating the business<br />

networks and from there, gain access to the production<br />

networks.<br />

One element of this sophisticated and coordinated attack<br />

involved the hackers exploiting vulnerabilities in serial-to-ethernet connectors, commonly used to interface legacy<br />

industrial equipment with modern computers. The attackers<br />

were able to upload malicious firmware to these devices,<br />

compromising operators' ability to respond to the attack [20].<br />

C. Analysing Legacy Code<br />

While it is no longer acceptable to simply assume that legacy<br />

code is secure, it will typically be infeasible to make it fully<br />

compliant with any given coding standard. Fortunately, this is<br />

an issue that has been addressed in several industries and the<br />

solutions proposed in those cases can be applied to IoT devices.<br />

IEC 62304 introduces the concept of Software of Unknown<br />

Provenance (SOUP). This is defined as either "off-the-shelf"<br />

(also known as third-party) software, or software that has been<br />

previously developed without adequate records of the<br />

development process.<br />

This standard lays out a set of requirements for the use of<br />

software of this nature, largely addressing the process of<br />

incorporating this software into the device. IEC 62304 requires<br />

the manufacturer to:<br />

● Document the requirements that make it necessary to use this software<br />
● Define the software architecture that ensures this software operates in appropriate conditions<br />
● Monitor the software's lifecycle, including patches and new versions<br />
● Perform a risk analysis on the use of this software<br />
● Manage the configuration of the software<br />

These principles are just as relevant to IoT devices as they<br />

are to the medical industry. It would be most appropriate to<br />



address these requirements during the design and planning<br />

stages of the project.<br />

Specifically addressing compliance with a coding standard,<br />

the MISRA Compliance:2016 guidelines [21] make a distinction<br />

between native code - defined as the code developed within the<br />

scope of the project, and adopted code - which includes third-party, auto-generated and legacy code.<br />

These guidelines set out several key requirements for<br />

claiming MISRA compliance in projects that make use of<br />

adopted code. These include:<br />

● There shall be no violations of a Mandatory MISRA<br />

Guideline<br />

● Violations of a MISRA Required Guideline must be<br />

supported by a formal deviation<br />

However, it is recognized that adopted code is unlikely to<br />

have been developed following the same processes and criteria<br />

as the code under active development. This means that violations<br />

of MISRA guidelines are likely to be unavoidable, particularly<br />

in system-wide guidelines that consider both the code under<br />

active development and the legacy code.<br />

For this reason, the MISRA Compliance guidelines propose<br />

using a Guideline Re-Categorization Plan. This allows certain<br />

guidelines to be "disapplied" - essentially ignored altogether -<br />

while others may be escalated to being Mandatory.<br />

For legacy code, this principle allows an assessment to be<br />

made of the potential impact of non-compliance with any<br />

particular guideline in the coding standard, and appropriate<br />

requirements enforced based on this assessment.<br />

While the MISRA Compliance guidelines are written to<br />

complement the MISRA coding standards, the principles<br />

described could be applied to many other coding standards with<br />

little modification. The majority of coding standards incorporate<br />

some concept of the importance or severity of the guidelines -<br />

for instance, CERT's "Severity" - which can be trivially mapped<br />

to the MISRA categories.<br />

IV. ANALYSIS TOOLS<br />

All coding standards require at least one analysis tool to be<br />

effectively enforced. When addressing legacy code, tool<br />

selection becomes particularly important. Any analysis tool used<br />

in this manner must have a robust mechanism for suppressing<br />

warnings for guidelines that have been disapplied or deviated<br />

from without masking genuine defects in the code.<br />

PRQA's QA·Verify incorporates a dynamic suppression<br />

system, designed for this purpose. This allows for specific<br />

warnings to be suppressed, with supporting processes to record<br />

a formal deviation. In addition, these suppressions can be<br />

applied across multiple versions of the project with full<br />

traceability.<br />

In any project that is adopting and revising legacy code in<br />

this manner, there will come a point where all priority issues<br />

have been resolved in the legacy code in preparation for new<br />

development to begin. At this stage, it is important that the<br />

chosen coding standard is enforced in its entirety on the newly<br />

developed code. In addition, it is necessary to identify defects in<br />

the interface between the new and legacy code.<br />

This process can be greatly simplified by creating a baseline<br />

of the legacy code before new functionality is implemented -<br />

essentially a known state of the project before any new<br />

development begins.<br />

QA·Verify includes the functionality to create an intelligent<br />

baseline. This means that warnings will only be issued for new<br />

code that is added, or for issues arising from newly added code.<br />

This allows developers to easily identify new issues in the<br />

project, without having to manually filter out any remaining<br />

warnings in the legacy code.<br />

V. CONCLUSIONS<br />

It is clear that the number of internet-connected devices is<br />

continuing to grow at an incredible rate. This includes critical<br />

infrastructure, and devices and equipment responsible for safety- or mission-critical functions. Therefore, ensuring these devices<br />

are both defect-free and secure is of great importance.<br />

In many cases, it can be demonstrated that serious security<br />

vulnerabilities are caused by common programming errors. This<br />

means that, in order to ensure the security of connected devices,<br />

it is critical to ensure the code is free of errors.<br />

In the process of developing a secure device, penetration<br />

testing and code analysis are complementary verification<br />

techniques. Code analysis, in which a suitable coding standard<br />

is applied and enforced with a static analysis tool, will detect<br />

many programming errors and security vulnerabilities early in<br />

the development process, reducing the cost of fixing these<br />

defects.<br />

In projects that make heavy use of adopted code, fully<br />

enforcing a coding standard is often infeasible. However,<br />

existing strategies described by industrial standards can be<br />

reapplied to ensure that the product is robust and defect-free.<br />

REFERENCES<br />

[1] Torvalds, Linus, 2017. Re: [GIT PULL] usercopy whitelisting for v4.15-<br />

rc1 [Online]. Linux Kernel mailing list. Available at:<br />

http://lkml.iu.edu/hypermail/linux/kernel/1711.2/01701.html<br />

[2] Heffley, J. and Meunier, P., 2004. Can source code auditing software<br />

identify common vulnerabilities and be used to evaluate software<br />

security? Proceedings of the 37th Hawaii International Conference on<br />

System Sciences - 2004.<br />

[3] Kerner, Sean Michael, 2014. Heartbleed SSL flaw's true cost will take<br />

time to tally [Online]. eWeek. Available at:<br />

http://www.eweek.com/security/heartbleed-ssl-flaw-s-true-cost-willtake-time-to-tally<br />

[4] HP, 2015. Internet of things research study. Hewlett Packard Enterprise.<br />

[5] Van der Meulen, 2015. Gartner says 6.4 billion connected "things" will<br />

be in use in 2016, up 30 percent from 2015 [Online]. Gartner. Available<br />

at: https://www.gartner.com/newsroom/id/3165317<br />

[6] Tierney, Andrew, 2017. IoT security testing methodologies [Online].<br />

PenTestPartners. Available at: https://www.pentestpartners.com/securityblog/iot-security-testing-methodologies/<br />

[7] Francis, Ryan 2017. How to conduct an IoT pen test [Online].<br />

NetworkWorld. Available at:<br />

https://www.networkworld.com/article/3198495/internet-of-things/howto-conduct-an-iot-pen-test.html<br />



[8] Ducklin, Paul, 2014. Anatomy of a “goto fail” – Apple’s SSL bug<br />

explained, plus an unofficial patch for OS X! [Online]. Naked security by<br />

Sophos. Available at:<br />

https://nakedsecurity.sophos.com/2014/02/24/anatomy-of-a-goto-failapples-ssl-bug-explained-plus-an-unofficial-patch/<br />

[9] Briski, K. A. et al., 2008. Minimizing code defects to improve software<br />

quality and lower development costs. IBM Development solutions white<br />

paper.<br />

[10] Christey, Steve, 2011. 2011 CWE/SANS top 25 most dangerous software<br />

errors [Online]. CWE. Available at: https://cwe.mitre.org/top25/<br />

[11] CVE, 2017. Vulnerability details : CVE-2017-1000251 [Online]. CVE.<br />

Available at: https://www.cvedetails.com/cve/CVE-2017-1000251/<br />

[12] Antunes, N. and Vieira, M., 2009. Comparing the effectiveness of<br />

penetration testing and static code analysis on the detection of sql<br />

injection vulnerabilities in web services. 2009 15th IEEE Pacific Rim<br />

International Symposium on Dependable Computing.<br />

[13] Balacco, S. and Rommel, C., 2011. The increasing value and complexity<br />

of software call for the reevaluation of development and testing practices<br />

[Online]. VDC Research Whitepaper. Available at:<br />

http://info.prqa.com/hubfs/Whitepapers/PRQA-VDC-white-paper-<br />

2011.pdf<br />

[14] ISO 26262-6, 2011. Road vehicles — Functional safety — Part 6: Product<br />

development at the software level. International Organization for<br />

Standardization.<br />

[15] IEC 62304:2006. Medical device software - Software life-cycle<br />

processes. European Committee for Electrotechnical Standardization.<br />

[16] Barr Group, 2017. Embedded systems safety & security survey.<br />

[17] McConnell, S., 2004. Code complete second edition. Redmond,<br />

Washington: Microsoft Press.<br />

[18] Intel, 2014. Connecting legacy devices to the internet of things (IoT)<br />

[Online]. Intel Solution Brief. Available at:<br />

https://www.intel.com/content/dam/www/public/us/en/documents/soluti<br />

on-briefs/connecting-legacy-devices-brief.pdf<br />

[19] MISRA, 1998. Guidelines for the use of the C language in vehicle based<br />

software. The Motor Industry Software Reliability Association.<br />

[20] E-ISAC, 2016. Analysis of the cyber attack on the Ukrainian power grid<br />

[Online]. Industrial Control Systems. Available at:<br />

https://ics.sans.org/media/E-ISAC_SANS_Ukraine_DUC_5.pdf<br />

[21] MISRA Compliance:2016. Achieving compliance with MISRA coding<br />

guidelines. Motor Industry Software Reliability Association. Available at:<br />

https://www.misra.org.uk/LinkClick.aspx?fileticket=w_Syhpkf7xA%3d<br />

&tabid=57<br />



Combining Static and Dynamic Analysis<br />

Paul Anderson<br />

GrammaTech, Inc.<br />

Ithaca, NY. USA.<br />

paul@grammatech.com<br />

Abstract— Static analysis tools are useful for finding serious programming defects and security vulnerabilities in source and binary code. These tools inevitably report some false positives: bugs that are highly unlikely to manifest as real problems in deployed code. Consequently, results must be inspected by a human to determine whether they warrant action, and most tools provide program-understanding features to make this easier. This inspection process, known as warning triage or assessment, can be much more effective if it is guided by information from dynamic analyses such as code coverage, crash analysis, and performance profiling. For example, a static analysis report of a resource leak that occurs on a path that has not been tested is more likely to be a real undiscovered bug than one that occurs in code that has been tested much more comprehensively. Furthermore, the results of static analysis tools can be used to guide testing too: for example, a developer can save a great deal of effort if the static analysis can prove that it is fundamentally impossible to achieve full condition coverage.

This paper describes how the results of static analyses and dynamic analyses can be fused to allow developers to get more value from both processes, and to produce higher-quality software more efficiently.

Keywords—static analysis; dynamic analysis; test coverage; crash analysis; defect reduction

I. INTRODUCTION TO STATIC ANALYSIS

The examples in this paper use CodeSonar (the advanced static analysis tool that I work on) to illustrate how static and dynamic analysis tools can be integrated. However, the techniques and principles are not unique to CodeSonar: several other advanced static analysis tools are commercially available and have features similar to those described here.

Roughly speaking, advanced static analysis tools work as follows. First they create a model of the entire program, which they do by reading and parsing each input file. The model consists of representations such as abstract syntax trees for each compilation unit, control-flow graphs for each subprogram, symbol tables, and the call graph. Checkers that find defects are implemented in terms of various kinds of queries on those representations. Superficial bugs can be found by pattern matching on the abstract syntax tree or the symbol tables. The really serious bugs are those that cause the program to fail, such as null pointer dereferences and buffer overruns, and these require sophisticated queries to find. Those queries can be thought of as abstract simulations: the analyzer simulates the execution of the program, but instead of using concrete values, it uses equations that model the abstract state of the program. If an anomaly is encountered, a warning is generated.

The defects found fall into three main categories:

1. Bugs that violate the fundamental rules of the runtime, thereby causing the program's behavior to be undefined. These include memory errors such as null pointer dereferences and buffer overruns, concurrency errors such as data races, and bugs such as use of uninitialized memory.

2. Defects that arise because the program breaks the rules of a standard API. For example, the C library does not specify what happens when the same file descriptor is closed twice; doing this deliberately makes no sense, so it is probably a bug. Leaks of finite resources such as memory also fall into this category.

3. Inconsistencies or contradictions in the code. These may not cause the program to crash, but they likely indicate that the programmer misunderstood an important property of the code. For example, a condition that is either always true or always false is unlikely to be intentional because it leads to dead code.

Static analysis tools are also useful for finding violations of coding standards such as MISRA; these mostly fall into the third of the above categories. In addition, such tools allow users to define their own domain-specific rules.
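The pattern-matching style of checker mentioned above can be illustrated in miniature with Python's own ast module. The toy checker below flags `if` conditions that are literal constants, i.e. always true or always false, the kind of inconsistency described in category 3. It is a sketch for illustration only; real tools perform such queries on C/C++ program models, not on Python source.

```python
import ast

def constant_conditions(source: str) -> list[int]:
    """Report line numbers of `if` statements whose condition is a literal
    constant -- always true or always false, hence likely unintentional."""
    hits = []
    for node in ast.walk(ast.parse(source)):
        # An ast.Constant test (e.g. `if True:`) can never vary at runtime.
        if isinstance(node, ast.If) and isinstance(node.test, ast.Constant):
            hits.append(node.lineno)
    return sorted(hits)
```

The same shape (walk the tree, match a node pattern, report a location) underlies AST-based checkers in general.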



Figure 1. An example warning from an advanced static-analysis tool.

These tools are useful because they are good at finding defects that occur only in unusual circumstances, and because they can do so very early in the development process: they can yield value before the code is even ready to be tested. They are not intended to replace or supplant traditional testing techniques, but are complementary to them.

Figure 1 shows an example warning report from CodeSonar for a null pointer dereference. The report shows the path through the code that must be taken for the bug to trigger, with interesting points along the way highlighted. An explanation of the reasoning the tool used to conclude there was a bug is given at the point at which the pointer dereference occurs.

When a warning is generated, it is written to a database. Many advanced static analysis tools, including CodeSonar, allow users to annotate warnings: a user can mark a warning as a true or false positive, give it a priority, assign it to someone to fix, or attach a note. It is important that this information be persistent; that is, if a later run of the analysis detects the same warning, even if the code has changed, the information should continue to be associated with the warning. This is known as persistent triage.

Some of the value of integrating CodeSonar with dynamic analysis tools comes from allowing persistent triage to be used for the results of the dynamic analyses.

II. COMBINING STATIC AND DYNAMIC ANALYSIS

This section describes two ways in which the results of dynamic analysis can be integrated with static analysis: metrics about execution can be imported and associated with elements of the program model, and information about anomalies such as memory errors can be imported into the database of warnings.

Metrics about program executions are essential for understanding performance characteristics and for identifying parts of the code that are not adequately tested. These metrics can be used to help a user interpret the static results, which is especially useful for prioritizing them. Some examples are the following:

• Information about how many times a procedure is called can be very helpful in determining whether certain defects are serious or not. The best examples are resource leaks, which are often insignificant in code that is called rarely, but highly serious in code that is called a lot.

• Data about whether a path through the code has been tested can help an analyst determine whether a static analysis warning is a true or a false positive. A buffer overrun that is reported on a path that is not tested is more likely to be a latent defect than one that is reported on a path that is executed a lot.

• A memory profile, which shows which locations in the code are responsible for dynamically allocating memory, can be used to highlight which leaks reported by static analysis are most hazardous.

Similarly, some dynamic analyses generate reports of anomalies such as invalid memory accesses, resource leaks, and program crashes. The most obvious way to integrate these with a static analysis tool is to import those reports into the tool database.

The following sections describe how the results of several different classes of tool can be imported. For each example, the process of combining static and dynamic analysis results described in this paper is the following:

• A dynamic analysis is run, and the results are stored in a set of files.

• The static analysis is invoked, and the dynamic analysis results are used to augment the static results.

• The results of both analyses are presented through the same user interface.

In each of these examples, the integration is done either with a plug-in to CodeSonar, or by setting configuration parameters.

III. Time Profiling

One of the simplest forms of dynamic analysis is time profiling, which helps developers understand how much time each part of the program takes to execute. There are many tools, both commercial products and open source, available for collecting raw timing data from executions, and for converting that data into a form that is convenient for consumption. Profiles may be gathered at function and statement granularity. For simplicity, the following section describes function-level profiling information only.

Most profilers will gather data such as the following:

• The number of calls to a function

• The time spent in the function itself

• The time spent in the function and everything it calls transitively

A simple approach to combining the static and dynamic results is to import this data as metrics. The following examples show the results of running a profile on an open-source calculator program named bc, with the GNU profiler gprof.

Figure 2. A screenshot of a visualization of a call graph with metrics about dynamic execution superimposed.

After the program is executed, gprof is invoked as follows:

gprof -b -L -p --inline-file-names bc >gprof.txt

This writes the profiling information to the file gprof.txt. The first few lines of an example run are shown below. Note that the functions and the files in which they are found are identified by name.

Flat profile:

Each sample counts as 0.01 seconds.
  %   cumulative   self               self     total
 time    seconds  seconds     calls  ms/call  ms/call  name
18.75       0.09     0.09         3    30.01   160.04  execute (execute.c:67)
18.75       0.18     0.09   1021353     0.00     0.00  bc_multiply (number.c:639)
16.67       0.26     0.08    971853     0.00     0.00  bc_divide (number.c:742)
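Converting such a flat profile into comma-separated values takes only a short script. The sketch below assumes the column layout shown in the sample above (gprof with -b -L -p --inline-file-names); rows without a call count, and header lines, are simply skipped.

```python
import csv
import re

# Matches a gprof flat-profile row such as:
#   18.75  0.18  0.09  1021353  0.00  0.00  bc_multiply (number.c:639)
ROW = re.compile(
    r"^\s*(?P<pct>\d+\.\d+)\s+\d+\.\d+\s+(?P<self>\d+\.\d+)\s+"
    r"(?P<calls>\d+)\s+\S+\s+\S+\s+(?P<name>\S+)\s+\((?P<loc>[^)]+)\)")

def gprof_to_csv(profile, out):
    """Convert a gprof flat profile (read line by line) to CSV on `out`."""
    writer = csv.writer(out)
    writer.writerow(["function", "file", "line",
                     "percent_time", "self_seconds", "calls"])
    for line in profile:
        m = ROW.match(line)
        if not m:
            continue  # header lines and rows without call counts
        fname, lineno = m.group("loc").rsplit(":", 1)
        writer.writerow([m.group("name"), fname, lineno,
                         m.group("pct"), m.group("self"), m.group("calls")])
```

The resulting CSV is exactly the kind of per-procedure metric file that the plug-in described next can consume.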

It is relatively easy to write a simple program that can convert this information into a comma-separated-value (CSV) file. A plug-in for CodeSonar is available that can then read that file and create metrics for each procedure. Once these metrics are in the CodeSonar database, they can be viewed in several ways. The simplest way to see them is alongside some of the built-in metrics. They can also be shown in the visualization tool. Figure 2 shows a screen capture of a visualization of the call graph. In this particular instance, the size of the rectangles is proportional to the percentage of time spent in each function, and the intensity of the red is proportional to the number of static analysis warnings found in the item. From this, the user can easily pick out the places in the code that are both consuming most time during
execution (the size of the box), as well as those that are potentially most risky (the intensity of the red). This will help the user focus on the parts of the program most likely to benefit from increased scrutiny. Selecting the box reveals a link that allows the user to see all of those static analysis warnings.

Figure 3. A screenshot showing a warning generated from the test effectiveness metric.

IV. Code Coverage

Code coverage tools measure how much of the code is exercised during execution. There are several forms of coverage, the most popular of which are statement coverage and condition coverage. Again, there are both open-source tools (e.g., gcov) and commercial tools such as CTC Testwell from VerifySoft (http://www.verifysoft.com/en_ctcpp.html). The examples described below were generated using Testwell.

Coverage tools typically generate metrics on test effectiveness; a standard metric is the Test Effectiveness Ratio (TER), defined as the proportion of elements exercised by the tests, expressed as a percentage of the whole. In Testwell, a TER can be generated for each of the different kinds of coverage it supports. Additionally, Testwell can show which parts of the code did not get exercised by the tests.
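As a quick sanity check of the definition, TER is simply exercised-over-total expressed as a percentage. A minimal sketch (the zero-denominator convention here is our own assumption, not Testwell's):

```python
def ter(exercised: int, total: int) -> float:
    """Test Effectiveness Ratio: elements exercised by the tests as a
    percentage of all elements of that kind (statements, conditions, ...)."""
    if total == 0:
        return 100.0  # assumption: nothing to cover counts as fully covered
    return 100.0 * exercised / total
```

For example, with 40 of 50 statements executed, ter(40, 50) gives 80.0.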

Testwell will create a data file in a convenient format (JSON). Again, it is a simple matter to write a CodeSonar plug-in that will read this file and create metrics corresponding to the TER values. Those metrics will then show up in the CodeSonar user interface in the same way as demonstrated for the profiler metrics described above.

Those metrics can then be used to generate static analysis warnings. For example, generating a warning for a procedure whose TER is less than 80% is as simple as adding the following lines to the CodeSonar configuration file for the project:

METRIC_WARNING_CONDITION = TER[PROCEDURE] < 80
METRIC_WARNING_CLASS_NAME = Low Test Effectiveness
METRIC_WARNING_BASE_RANK = 5.0
METRIC_WARNING_SIGNIFICANCE = RELIABILITY

Figure 3 shows such a CodeSonar warning.

The previous example demonstrated how to specify a new warning class in CodeSonar using configuration-file parameters. CodeSonar also has an API in which it is possible to implement checkers in a more general fashion. Figure 4 shows a warning that was generated for a condition that was not exercised during the tests. This warning was generated by the same script that was used to import the test-execution metrics.

Plug-ins for CodeSonar can be written in several languages (Python, C++, C, Scheme, Java, and C#); the API for accessing the program model and for generating metrics and warnings is available in all of those languages. The plug-ins written for this paper were written entirely in Python. A few snippets from that script are shown below.

First, a warning class is created:

untested_condition = cs.analysis.create_warningclass(
    "Untested Condition",
    "", 2.0,
    cs.warningclass_flags.PADDING,
    cs.warning_significance.RELIABILITY)

When the Testwell file is read, the script identifies the file and line number where the untested condition is found. The warning is then reported as follows:

untested_condition.report(
    sfile.arbitrary_instance(),
    probe['line'], proc, str(msg),
    cs.report_flags.ALREADY_XML_ENCODED)

Figure 4. A Testwell untested condition imported as a CodeSonar warning.

Writing checkers such as these in Python is usually fairly straightforward.

V. Crash Analysis

If a program crashes during execution, the operating system may arrange for a memory dump of the process to be written to a file. On Linux and other Unix systems, this is referred to as a core dump; debuggers such as gdb may be used to examine the state of the program at the point when it crashed. The most useful information is usually the stack trace, and it is a fairly simple matter to import the stack trace into CodeSonar.

The screenshot in Figure 6 shows an example of a crash dump that was imported into CodeSonar. The mechanism for doing this is straightforward: a simple script looks for core files and invokes gdb in batch mode as follows:

gdb exe corefile --batch -q -ex bt

The output of this is read and converted into a form that can be imported into CodeSonar with a simple plug-in that reads the data and creates warnings.
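A sketch of such a script is below. It shells out to the exact gdb command shown above and assumes gdb's usual `#N  func (args) at file:line` backtrace format; frames without source information are skipped, and the frame strings in the comments are hypothetical.

```python
import re
import subprocess

# Matches backtrace frames such as (hypothetical examples):
#   #0  do_free (p=0x55) at crash.c:14
#   #1  0x000055d2 in main () at crash.c:23
FRAME = re.compile(r"^#(?P<n>\d+)\s+(?:0x[0-9a-fA-F]+ in )?(?P<func>\S+)"
                   r".*?\s+at\s+(?P<file>[^:\s]+):(?P<line>\d+)\s*$")

def parse_backtrace(text):
    """Turn gdb 'bt' output into (frame, function, file, line) tuples."""
    frames = []
    for raw in text.splitlines():
        m = FRAME.match(raw)
        if m:
            frames.append((int(m.group("n")), m.group("func"),
                           m.group("file"), int(m.group("line"))))
    return frames

def backtrace_from_core(exe, corefile):
    """Invoke gdb in batch mode, exactly as in the text, and parse the trace."""
    out = subprocess.run(["gdb", exe, corefile, "--batch", "-q", "-ex", "bt"],
                         capture_output=True, text=True).stdout
    return parse_backtrace(out)
```

The tuples returned by parse_backtrace are the file/line/function data that the import plug-in would turn into warnings.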

VI. Memory Analysis

Memory analysis tools find errors such as resource leaks, use of invalid addresses, and buffer overruns. Valgrind with the Memcheck module is a popular option for developers on Linux systems (http://valgrind.org).

Valgrind can be invoked in a manner that creates an XML file containing the report of errors; the following command runs the program named crash and writes the report to crash.vg.xml:

valgrind --leak-check=yes --xml=yes \
    --xml-file=crash.vg.xml ./crash
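The structure of such an import can be sketched with Python's standard XML parser. The element names below follow Memcheck's usual XML layout (`<error>` with `<kind>`, `<what>`/`<xwhat>`, and `<stack>`/`<frame>` children), but should be treated as assumptions to be checked against the protocol version actually produced:

```python
import xml.etree.ElementTree as ET

def valgrind_errors(fileobj):
    """Yield (kind, message, frames) for each <error> in a Memcheck XML
    report; frames are (function, file, line) triples from its stacks."""
    root = ET.parse(fileobj).getroot()
    for err in root.iter("error"):
        kind = err.findtext("kind", default="")
        # Leak errors describe themselves in <xwhat><text>, others in <what>.
        msg = err.findtext("what") or err.findtext("xwhat/text") or ""
        frames = [(f.findtext("fn", "?"), f.findtext("file", "?"),
                   f.findtext("line", "?")) for f in err.iter("frame")]
        yield kind, msg, frames
```

Each yielded record carries exactly the fields needed to create a warning with a stack of supporting events.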

The screenshot in Figure 5 shows a warning generated from having imported the report into CodeSonar. In this instance the memory had already been freed. The report shows the location where the second free took place (line 14); the stack trace at the point of the illegal free is represented by other events in the report. The report also shows two further stack traces: the stack at the point where the memory was previously freed (line 13), and the stack at the point where the memory was allocated (line 12).

In this case it was most convenient to convert the XML file into a SARIF file. SARIF (Static Analysis Results Interchange Format) is designed to facilitate integrating static analysis tools. A plug-in for CodeSonar for importing these files is in development and is available upon request.

In the case of Valgrind, persistent triage of results can be very useful. Normal practice calls for users of Valgrind to maintain "suppressions files", which tell the tool to refrain from generating certain reports. This is useful because although many reports are technically true positives, they have been judged to be either acceptable or harmless. Managing this file for a team of programmers on a large project can be tedious. A reasonable alternative is to mark Valgrind warnings in CodeSonar as False Positive or Don't Care, and then automatically generate a suppressions file for use in future runs.
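Generating the suppressions file from the triaged warnings is then mechanical. The sketch below emits Valgrind's standard suppression-entry syntax from the call-stack functions of each warning marked False Positive or Don't Care; the (name, kind, stack) record format is our own assumption, standing in for whatever the warning database exports.

```python
def make_suppression(name, suppression_kind, stack_functions):
    """Render one entry of a Valgrind suppressions file, e.g. for
    suppression_kind 'Memcheck:Free' with the offending call stack."""
    body = ["{", "   " + name, "   " + suppression_kind]
    body += ["   fun:" + fn for fn in stack_functions]
    body.append("}")
    return "\n".join(body)

def suppressions_file(triaged):
    """triaged: iterable of (name, kind, stack) records, one per warning
    marked False Positive or Don't Care in the warning database."""
    return "\n".join(make_suppression(*t) for t in triaged)
```

The result can be passed to later runs with Valgrind's --suppressions option.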



Figure 6. A crash report imported as a CodeSonar warning.

Figure 5. A report from Valgrind as a CodeSonar warning.

VII. CONCLUSIONS

Static analysis tools and dynamic analysis tools are powerful and complementary approaches to finding and eliminating programming errors. As demonstrated above, it is feasible to use the results of each style of analysis to help strengthen or augment the other. This is relatively easy to accomplish because modern tools are designed and built in a way that allows them to be integrated.

VIII. REFERENCES

CodeSonar: http://www.grammatech.com



Finding Safety Defects and Security Vulnerabilities by Static Analysis

Daniel Kästner, Laurent Mauborgne, Christian Ferdinand
AbsInt GmbH
66123 Saarbrücken, Germany

Abstract—Static code analysis has evolved to be a standard technique in the development process of safety-critical software. It can be applied to show compliance with coding guidelines, and to demonstrate the absence of critical programming errors, including runtime errors and data races. In recent years, security concerns have become more and more relevant for safety-critical systems, not least due to the increasing importance of highly automated driving and pervasive connectivity. While in the past static analyzers have primarily been applied to demonstrate classical safety properties, they are also well suited to address data safety and to discover security vulnerabilities. This talk gives an overview and discusses practical experience.

Keywords—static analysis, abstract interpretation, runtime errors, security vulnerabilities, functional safety, cybersecurity

I. INTRODUCTION

Some years ago, static analysis meant manual review of programs. Nowadays, automatic static analysis tools are gaining popularity in software development, as they offer a tremendous increase in productivity by automatically checking the code against a wide range of criteria. Many software projects are developed according to coding guidelines, such as MISRA C, CERT, or CWE, aiming at a programming style that improves clarity and reduces the risk of introducing bugs. Compliance checking by static analysis tools has become common practice.

In safety-critical systems, static analysis plays a particularly important role. A failure of a safety-critical system may cause high costs or even endanger human beings. With the growing amount of software-implemented functionality, preventing software-induced system failures becomes an increasingly important task. One particularly dangerous class of errors is runtime errors, which include faulty pointer manipulations, numerical errors such as arithmetic overflows and division by zero, data races, and synchronization errors in concurrent software. Such errors can cause software crashes, invalidate separation mechanisms in mixed-criticality software, and are a frequent cause of errors in concurrent and multi-core applications. At the same time, these defects are also at the root of many security vulnerabilities, including exploits based on buffer overflows, dangling pointers, or integer errors.

This is recognized by the MISRA C standard in a particular rule which recommends deeper analysis: “Minimization of runtime failures shall be ensured by the use of at least one of (a) static analysis tools/techniques; (b) dynamic analysis tools/techniques; (c) explicit coding of checks to handle runtime faults.” ([23], rule 21.1).
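Option (c) from the quoted rule can be illustrated with a one-line guard (a sketch in Python for brevity; in C the same pattern wraps the division operator, where an unguarded division by zero is undefined behavior):

```python
def checked_div(num: int, den: int, fallback: int = 0) -> int:
    """Explicit coding of a check to handle a runtime fault: the guard turns
    a potential division-by-zero fault into defined, handleable behavior.
    The fallback convention here is purely illustrative."""
    if den == 0:
        return fallback  # fault handled explicitly instead of propagating
    return num // den
```

The other two options (a) and (b), static and dynamic analysis, are the subject of the rest of this paper.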

In safety-critical software projects, obeying coding guidelines such as MISRA C is strongly recommended by all current safety standards, including DO-178B, DO-178C, IEC 61508, ISO 26262, and EN 50128. In addition, all of them explicitly consider demonstrating the absence of runtime errors as a verification goal. This is often formulated indirectly by addressing runtime errors (e.g., division by zero, invalid pointer accesses, arithmetic overflows) in general, and additionally considering corruption of content, synchronization mechanisms, and freedom from interference in concurrent execution [3]. Semantics-based static analysis has become the predominant technology to detect runtime errors and data races.

Abstract interpretation is a formal methodology for semantics-based static program analysis [8]. It supports formal soundness proofs (it can be proven that no error is missed) and scales to real-life industry applications. Abstract interpretation-based static analyzers provide full control and data coverage and allow conclusions to be drawn that are valid for all program runs with all inputs. Such conclusions may be that no timing or space constraints are violated, or that runtime errors or data races are absent: the absence of these errors can be guaranteed [16]. Nowadays, abstract interpretation-based static analyzers that can detect stack overflows and violations of timing constraints [28], and that can prove the absence of runtime errors and data races [10, 17], are widely used for developing and verifying safety-critical software. From a methodological point of view, abstract interpretation-based static analyses can be seen as equivalent to testing with full data and control coverage. They do not require access to the physical target hardware, can be easily integrated in continuous verification frameworks and model-based development environments [18], and allow developers to detect runtime errors as well as timing and space bugs in early product stages.

In the past, security properties have mostly been relevant for non-embedded and/or non-safety-critical programs. Recently, due to increasing connectivity requirements (cloud-based services, car-to-car communication, over-the-air updates, etc.), more and more security issues are arising in safety-critical software as well. Security exploits like the Jeep Cherokee hacks [30], which affect the safety of the system, are becoming more and more frequent. In consequence, safety-critical software development faces novel challenges which previously have been addressed only in other industry domains.

On the other hand, as outlined above, safety-critical software is developed according to strict guidelines which effectively reduce the relevant subset of the programming language used and improve software verifiability. As an example, dynamic memory allocation and recursion are often forbidden, or used only in a very limited way. In consequence, much stronger code properties can be shown for safety-critical software than for non-safety-critical software, so that security vulnerabilities can also be addressed in a more powerful way.

The topic of this article is to show that some classes of defects can be proven to be absent in the software, so that exploits based on such defects can be excluded. On the other hand, additional syntactic checks and semantic analyses become necessary to address security properties which are orthogonal to safety requirements. Throughout the article we focus on software aspects only, without addressing safety or security properties at the hardware level.

II. SECURITY IN SAFETY-CRITICAL SYSTEMS

Functional safety and security are aspects of dependability, in addition to reliability and availability. Functional safety is usually defined as the absence of unreasonable risk to life and property caused by malfunctioning behavior of the software. The main goals of information security or cybersecurity (for brevity denoted as ‘security’ in this article) traditionally are to preserve confidentiality (information must not be disclosed to unauthorized entities), integrity (data must not be modified in an unauthorized or undetected way), and availability (data must be accessible and usable upon demand).

In safety-critical systems, safety and security properties are intertwined. A violation of security properties can endanger the functional safety of the system: an information leak could provide the basis for a successful attack on the system, and a malicious data corruption or denial-of-service attack may cause the system to malfunction. Vice versa, a violation of safety goals can compromise security: buffer overflows belong to the class of critical runtime errors whose absence has to be demonstrated in safety-critical systems. At the same time, an undetected buffer overflow is one of the main security vulnerabilities, which can be exploited to read unauthorized information, to inject code, or to cause the system to crash [32]. To emphasize this, in a safety-critical system the definition of functional safety can be adapted to define cybersecurity as the absence of unreasonable risk to life and property caused by malicious misuse of the software.
and property caused by malicious misusage of the software.<br />

The convergence of safety and security properties also becomes apparent in the increasing role of data in safety-critical systems. There are many well-documented incidents where harm was caused by erroneous data, corrupted data, or inappropriate use of data; examples include the Turkish Airlines A330 incident (2015), the Mars Climate Orbiter crash (1999), and the Cedars-Sinai Medical Centre CT scanner radiation overdose (2009) [11]. The reliance on data in safety-critical systems has grown significantly in the past few years; consider, e.g., data used for decision-support systems, data used in sensor fusion for highly automated driving, or data provided by car-to-car communication or downloaded from a cloud. As a consequence, there are ongoing activities to provide specific guidance for handling data in safety-critical systems [11]. At the same time, these data also represent safety-relevant targets for security attacks.

A. Coding Guidelines

The MISRA C standard [23, 24] was originally developed with a focus on the automotive industry, but is now widely recognized as a coding guideline for safety-critical systems in general. Its aim is to avoid programming errors and to enforce a programming style that enables the safest possible use of C. A particular focus is on dealing with undefined/unspecified behavior of C and on preventing runtime errors. As a consequence, it is also directly applicable to security-relevant code.

The most prominent coding guidelines targeting security aspects are ISO/IEC TS 17961, the SEI CERT C Coding Standard, and the MITRE Common Weakness Enumeration (CWE).

The ISO/IEC TS 17961 C Secure Coding Rules [15] specify rules for secure coding in C. They do not primarily address developers, but rather aim at establishing requirements for compilers and static analyzers. MISRA C:2012 Addendum 2 [25] compares the ISO/IEC TS 17961 rule set with MISRA C:2012. Only 4 of the C Secure rules are not covered by the first edition of MISRA C:2012 [24]. MISRA C:2012 Amendment 1 [26] contains 14 additional guidelines (one directive and 13 rules) with a focus on covering additional security concerns, which now also cover the previously unhandled C Secure rules. This illustrates the strong overlap between the safety- and security-oriented coding guidelines.

The SEI CERT C Coding Standard belongs to the CERT Secure Coding Standards (https://www.securecoding.cert.org). While emphasizing the security aspect, CERT C [14] also targets safety-critical systems: it aims at “developing safe, reliable and secure systems”. CERT distinguishes between rules and recommendations, where rules are meant to provide normative requirements and recommendations are meant to provide general guidance; the book version [14] describes the rules only. A particular focus is on eliminating undefined behaviors that can lead to exploitable vulnerabilities. In fact, almost half of the CERT rules (43 of 99) target undefined behaviors according to the C standard.

The Common Weakness Enumeration (CWE) is a software community project (https://cwe.mitre.org) that aims at creating a catalog of software weaknesses and vulnerabilities. The goal of the project is to better understand flaws in software and to create automated tools that can be used to identify, fix, and prevent those flaws. There are catalogues for several programming languages, including C. In the latter, once again, many rules are associated with undefined or unspecified behaviors.

B. Vulnerability Classification

Many rules are shared between the different coding guidelines, but there is no common structuring of security vulnerabilities. CERT C roughly structures its rules according to language elements, whereas ISO/IEC TS 17961 and CWE are structured as flat lists of vulnerabilities. In the following, we list some of the most prominent vulnerabilities, which are addressed in all coding guidelines and which belong to the most critical ones at the C programming level. The presentation follows the overview given in [32].

1) Stack-based Buffer Overflows

An array declared as a local variable in C is stored on the runtime stack. A C program may write beyond the end of the array due to index values being too large or negative, or due to invalid increments of pointers pointing into the array. In that case a runtime error has occurred whose behavior is undefined according to the C semantics. As a consequence the program might crash with a bus error or segmentation fault, but typically adjacent memory regions will simply be overwritten. An attacker can exploit this by manipulating the return address or the frame pointer, both of which are stored on the stack, or by indirect pointer overwriting, thereby gaining control over the execution flow of the program. In the first case the program jumps to code injected by the attacker into the overwritten buffer instead of executing the intended function return. In case of overflows on array read accesses, confidential information stored on the stack (e.g. in temporary local variables) might be leaked. An example of such an exploit is the well-known W32.Blaster.Worm¹.
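As an illustrative sketch (our own code, not from the paper), the following C fragment shows the off-by-one write pattern described above, together with a bounds-checked variant; the function names are hypothetical:

```c
#include <stddef.h>
#include <string.h>

/* Vulnerable pattern: the loop condition uses <= and writes one
 * element past the end of buf -- undefined behavior that may
 * overwrite adjacent stack memory (e.g. the return address). */
void fill_unchecked(char *buf, size_t n, char c) {
    for (size_t i = 0; i <= n; i++)  /* BUG: should be i < n */
        buf[i] = c;
}

/* Defensive variant: the requested length is clamped to the real
 * capacity before writing, so no out-of-bounds access can occur. */
size_t fill_checked(char *buf, size_t cap, size_t n, char c) {
    size_t len = n < cap ? n : cap;
    memset(buf, c, len);
    return len;
}
```

Note that a sound static analyzer would flag the first function in any context where the off-by-one write is reachable.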

2) Heap-based Buffer Overflows

Heap memory is dynamically allocated at runtime, e.g. by calling the malloc() or calloc() implementations provided by dynamic memory allocation libraries. Just like stack-allocated arrays, dynamically allocated arrays may be read or written beyond their boundaries. In case of a read access, information stored on the heap might be leaked – a prominent example is the Heartbleed bug in OpenSSL (cf. CERT vulnerability 720951²). Via write operations attackers may inject code and gain control over program execution, e.g. by overwriting management information of the dynamic memory allocator stored in the accessed memory chunk.

3) General Invalid Pointer Accesses

Buffer overflows are special cases of invalid pointer accesses; they are listed here as separate points due to the large number of attacks based on them. However, any invalid pointer access is a security vulnerability – other examples are null pointer accesses and dangling pointer accesses. Accessing such a pointer is undefined behavior which can cause the program to crash or behave erratically. A dangling pointer points to a memory location that has been deallocated, either implicitly (e.g. data stored in the stack frame of a function after its return) or explicitly by the programmer. A concrete example of a dangling pointer access is the double-free vulnerability, where already freed memory is freed a second time. This can be exploited by an attacker to overwrite arbitrary memory locations and execute injected code [32].
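A minimal sketch of a common double-free mitigation (our own illustration): nulling the pointer after free, so that an accidental second free() becomes a defined no-op instead of undefined behavior. The helper name safe_free is hypothetical:

```c
#include <stdlib.h>

/* Freeing through this helper resets the caller's pointer, so an
 * accidental second call sees NULL -- and free(NULL) is defined
 * to do nothing -- instead of freeing the same chunk twice. */
void safe_free(void **p) {
    free(*p);
    *p = NULL;
}
```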

4) Uninitialized Memory Accesses

Automatic variables and dynamically allocated memory have indeterminate values when not explicitly initialized. Accessing them is undefined behavior which can cause the program to behave erratically or in unexpected ways. Uninitialized variables can also be used for security attacks: in CVE-2009-1888³, potentially uninitialized variables passed to a function were exploited to bypass the access control list and gain access to protected files [14].

5) Integer Errors

Integer errors are not exploitable vulnerabilities by themselves, but they can be the cause of critical vulnerabilities like stack- or heap-based buffer overflows. Examples of integer errors are arithmetic overflows and invalid cast operations. If, e.g., a negative signed value is used as the size argument of a memcpy() call, it will be interpreted as a large unsigned value, potentially resulting in a buffer overflow.
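The signed-to-unsigned conversion mentioned above can be sketched as follows (illustrative code, not from the paper; copy_checked is a hypothetical wrapper):

```c
#include <stddef.h>
#include <string.h>

/* A negative int converted to size_t wraps around to a huge value
 * (e.g. -1 becomes SIZE_MAX), so memcpy would run far past the
 * destination buffer. Rejecting negative lengths first avoids this. */
int copy_checked(char *dst, size_t cap, const char *src, int len) {
    if (len < 0 || (size_t)len > cap)
        return -1;              /* refuse suspicious lengths */
    memcpy(dst, src, (size_t)len);
    return 0;
}
```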

6) Format String Vulnerabilities

A format string is copied to the output stream, with occurrences of %-conversions representing arguments to be popped from the stack and expanded into the stream. A format string vulnerability occurs if attackers can supply the format string, because it enables them to manipulate the stack, once again making the program write to arbitrary memory locations.
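The classic defense against this pattern (an illustrative sketch, not from the paper) is never to pass untrusted input as the format argument, but only as data:

```c
#include <stdio.h>

/* Render a user-supplied message into a log line. If msg were
 * passed as the format string itself, any %-conversions it
 * contains would be interpreted (with %n even writing memory).
 * Passing it as the argument of "%s" treats it as opaque data. */
int render_log(char *out, size_t cap, const char *msg) {
    return snprintf(out, cap, "%s", msg);  /* msg is data, not format */
}
```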

7) Concurrency Defects

Concurrency errors may lead to concurrency attacks which allow attackers to violate the confidentiality, integrity, and availability of systems [31]. In a race condition the program behavior depends on the timing of thread executions. A special case is a write-write or read-write data race, where the same shared variable is accessed by concurrent threads without proper synchronization. In a Time-of-Check-to-Time-of-Use (TOCTOU) race condition, the checking of a condition and its use are not protected by a critical section. This can be exploited by an attacker, e.g., by changing the file handle between the accessibility check and the actual file access. In general, attacks can be run either by creating a data race due to missing lock/unlock protections, or by exploiting existing data races, e.g., by triggering thread invocations.
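A minimal sketch of the write-write data race described above and its lock-based repair (illustrative POSIX-threads code, not from the paper):

```c
#include <pthread.h>

/* Shared counter and the mutex protecting it. Without the lock,
 * two threads executing counter++ race: the read-modify-write
 * sequences interleave and updates are lost. */
static long counter = 0;
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *worker(void *arg) {
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);   /* critical section start */
        counter++;
        pthread_mutex_unlock(&lock); /* critical section end */
    }
    return NULL;
}

long run_two_workers(void) {
    pthread_t a, b;
    counter = 0;
    pthread_create(&a, NULL, worker, NULL);
    pthread_create(&b, NULL, worker, NULL);
    pthread_join(a, NULL);
    pthread_join(b, NULL);
    return counter;
}
```

With the lock/unlock pair removed, the final count would be nondeterministically smaller than expected — exactly the class of defect a sound interleaving analysis must report.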

Most of the vulnerabilities described above are based on undefined behaviors, and among them buffer overflows seem to play the most prominent role in real-life attacks. Most of them can be used for denial-of-service attacks by crashing the program or causing erroneous behavior. They can also be exploited to inject code and cause the program to execute it, and to extract confidential data from the system. It is worth noting that from the perspective of a static analyzer most exploits are based on potential runtime errors: when an unchecked value is used as an index into an array, the error will only occur if the attacker manages to provide an invalid index value. The obvious conclusion is that safely eliminating all runtime errors due to undefined behaviors in the program significantly reduces the risk of security vulnerabilities.

¹ https://en.wikipedia.org/wiki/Blaster_(computer_worm)
² http://www.kb.cert.org/vuls/id/720951
³ CVE-2009-1888: SAMBA ACLs Uninitialized Memory Read. https://nvd.nist.gov/vuln/detail/CVE-2009-1888

C. Analysis Complexity

While semantics-based static program analysis is becoming more widespread for safety properties, there is practically no such analyzer dedicated to security properties. This is mostly explained by the difference in complexity between safety and security properties. From a semantic point of view, a safety property can always be expressed as a trace property. This means that to find all safety issues, it is enough to look at each execution trace in isolation. If it were not for the overhead and the fact that the detection would occur too late, it would be possible to catch all safety issues using monitors that detect any safety issue during the execution of the critical software.

This is no longer possible for security properties. Most of them can only be expressed as properties of sets of traces, or hyperproperties [7]. A typical example is non-interference [27]: to express that the final value of a variable x can only be affected by the initial value of y, one must consider each pair of possible execution traces with the same initial value of y, and check that the final value of x is the same for both executions. It was proven in [4] that any other definition (tracking assignments, etc.) considering only one execution trace at a time would miss some cases or add false dependencies. This additional level of sets has direct consequences on the difficulty of tracking security properties soundly.

Other examples of hyperproperties are secure information flow policies, service level agreements (which describe acceptable availability of resources in terms of mean response time or percentage uptime), observational determinism (whether a system appears deterministic to a low-level user), and quantitative information flow.

Finding expressive and efficient abstractions for such properties is a young research field (see [30] for a promising approach), which is the reason why no sound analysis of such properties appears in industrial static analyzers. The best solution using the current state of the art consists in using dedicated safety properties as approximations of the security property, with non-standard semantics such as the taint propagation described in Sec. IV.B.

III. PROVING THE ABSENCE OF DEFECTS

In safety-critical systems the use of dynamic memory allocation and recursion typically is forbidden or only permitted in limited ways. This simplifies the task of static analysis, such that for safety-critical embedded systems it is possible to formally prove the absence of runtime errors, or to report all potential runtime errors which still exist in the program. Such analyzers are based on the theory of abstract interpretation [8], a mathematically rigorous formalism providing a semantics-based methodology for static program analysis.

A. Abstract Interpretation

The semantics of a programming language is a formal description of the behavior of programs. The most precise semantics is the so-called concrete semantics, which closely describes the actual execution of the program on all possible inputs. Yet in general the concrete semantics is not computable, and even under the assumption that the program terminates, it is too detailed to allow for efficient computations. The solution is to introduce an abstract semantics that approximates the concrete semantics of the program and is efficiently computable. This abstract semantics can be chosen as the basis for a static analysis. Compared to an analysis of the concrete semantics, the analysis result may be less precise but the computation may be significantly faster.

A static analyzer is called sound if the computed results hold for any possible program execution. Abstract interpretation supports formal correctness proofs: it can be proved that an analysis will terminate and that it is sound, i.e., that it computes an over-approximation of the concrete semantics. Imprecision can occur, but it can be shown that it will always occur on the safe side. In runtime error analysis, soundness means that the analyzer never omits to signal an error that can appear in some execution environment. If no potential error is signaled, definitely no runtime error can occur: there are no false negatives. If a potential error is reported, the analyzer cannot exclude that there is a concrete program execution triggering the error; if there is no such execution, this is a false alarm (false positive). This imprecision is on the safe side: it can never happen that there is a runtime error which is not reported.

The difference between syntactic, unsound semantic, and sound semantic analysis can be illustrated with the example of division by zero. In the expression x/0 the division by zero can be detected syntactically, but not in the expression a/b. When an unsound analyzer does not report a division by zero in a/b, it might still happen in scenarios not taken into account by the analyzer. When a sound analyzer does not report a division by zero in a/b, this is a proof that b can never be 0.
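A tiny example of the situation a sound analyzer must reason about (our own illustration, not from the paper):

```c
/* A sound analyzer that stays silent on the division below must
 * prove that b is never 0 on any path reaching it -- which the
 * explicit guard establishes. Without the guard, a sound tool must
 * raise an alarm, since some execution might pass b == 0. */
int safe_div(int a, int b) {
    if (b == 0)
        return 0;   /* error value chosen only for illustration */
    return a / b;
}
```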

B. Astrée

In the following we concentrate on the sound static runtime error analyzer Astrée [5, 22]. It reports program defects caused by unspecified and undefined behaviors according to the C norm (ISO/IEC 9899:1999 (E)) [6], program defects caused by invalid concurrent behavior, and violations of user-specified programming guidelines, and it computes program properties relevant for functional safety. Users are notified about:

• integer/floating-point division by zero
• out-of-bounds array indexing
• erroneous pointer manipulation and dereferencing (buffer overflows, null pointer dereferencing, dangling pointers, etc.)
• data races
• lock/unlock problems, deadlocks
• integer and floating-point arithmetic overflows
• read accesses to uninitialized variables
• unreachable code
• violations of optional user-defined assertions to prove additional runtime properties, e.g., to guarantee that output variables are within the expected value ranges
• violations of coding rules (MISRA C:2004/2012 incl. Amendment 1, ISO/IEC TS 17961, CERT, CWE) and code metric thresholds; the supported code metrics include the statically computable HIS metrics (HIS 2008), e.g., comment density and cyclomatic complexity
• non-terminating loops.

Astrée computes data and control flow reports containing a detailed listing of accesses to global and static variables, sorted by functions, variables, and processes, and containing a summary of caller/callee relationships between functions. The analyzer can also report each effectively shared variable, the list of processes accessing it, and the types of the accesses (read, write, read/write).

The C99 standard does not fully specify data type sizes, endianness, or alignment, which can vary with different targets or compilers. Astrée is informed about these target ABI settings by a dedicated configuration file in XML format and takes the specified properties into account.

The design of the analyzer aims at reaching the zero-false-alarm objective, which was accomplished for the first time on large industrial applications at the end of November 2003. For keeping the initial number of false alarms low, a high analysis precision is mandatory. To achieve high precision Astrée provides a variety of predefined abstract domains, including the following:

• The interval domain approximates variable values by intervals.
• The octagon domain [19] covers relations of the form x ± y ≤ c for variables x and y and constants c.
• Floating-point computations are precisely modelled while keeping track of possible rounding errors.
• The memory domain empowers Astrée to exactly analyze pointer arithmetic and union manipulations. It also supports a type-safe analysis of absolute memory addresses.
• The clock domain has been specifically developed for synchronous control programs and supports relating variable values to the system clock [9].
• With the filter domain [12], digital filters can be precisely approximated.

Fig. 1: Astrée GUI with alarm overview
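To give a flavor of the interval domain listed above, here is a minimal sketch (our own illustration, not Astrée code) of interval addition and join — the operations an analyzer applies at assignments and at control-flow merges:

```c
/* An interval [lo, hi] over-approximates the set of values a
 * variable may take at a program point. */
typedef struct { int lo, hi; } Interval;

/* Abstract addition: x + y lies in [x.lo + y.lo, x.hi + y.hi].
 * (A real domain would also account for overflow; omitted here.) */
Interval itv_add(Interval x, Interval y) {
    return (Interval){ x.lo + y.lo, x.hi + y.hi };
}

/* Join at a control-flow merge: the smallest interval containing
 * both inputs. It may include values present in neither branch --
 * the over-approximation that keeps the analysis sound. */
Interval itv_join(Interval x, Interval y) {
    return (Interval){ x.lo < y.lo ? x.lo : y.lo,
                       x.hi > y.hi ? x.hi : y.hi };
}
```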

Any remaining alarm has to be manually checked by the developers – and this manual effort should be as low as possible. Astrée explicitly supports investigating alarms in order to understand the reasons for their occurrence. Alarm contexts can be interactively explored, the computed value ranges of variables can be displayed for each context, the call graph is visualized, and a program slicer is available to identify the program parts contributing to a selected defect. By fine-tuning the precision of the analyzer to the software under analysis, the number of false alarms can be further reduced.

To deal with concurrency defects, Astrée has been extended by a sound low-level concurrent semantics [20] which provides a scalable sound abstraction covering all possible thread interleavings. The interleaving semantics enables Astrée, in addition to the classes of runtime errors found in sequential programs, to report data races and lock/unlock problems, i.e., inconsistent synchronization. The set of shared variables does not need to be specified by the user: Astrée assumes that every global variable can be shared, and discovers which ones are effectively shared and on which ones there is a data race. After a data race, the analysis continues by considering the values stemming from all interleavings. Since Astrée is aware of all locks held at every program point in each concurrent thread, it can also report all potential deadlocks.

In some situations data races may be intended behavior. As an example, a lock-free implementation where one process only writes to a variable and another process only reads from it may be correct, although there actually is a data race. However, a prerequisite is that all variable accesses involved are atomic. Astrée explicitly supports such lock-free implementations by providing means to specify the atomicity of basic data type accesses as part of the target ABI specification. Data race alarms explicitly distinguish between atomic and non-atomic accesses.
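In C11 terms, the single-writer/single-reader pattern mentioned above can be sketched with an _Atomic variable (illustrative code, not from the paper):

```c
#include <stdatomic.h>

/* One thread only calls publish(), another only calls observe().
 * Because every access is atomic, no load can observe a torn,
 * half-written value -- the precondition under which such a
 * lock-free data race can be acceptable. */
static _Atomic int sensor_value;

void publish(int v) { atomic_store(&sensor_value, v); }
int  observe(void)  { return atomic_load(&sensor_value); }
```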

Thread priorities are exploited to reduce the amount of spurious interleavings considered in the abstraction and to achieve a more precise analysis. A dedicated task priority domain supports dynamic priorities, e.g., according to the Priority Ceiling Protocol used in OSEK systems. Astrée includes a built-in notion of mutual exclusion locks, on top of which the actual synchronization mechanisms offered by operating systems can be modeled (such as POSIX mutexes or semaphores [13]); program-enforced mutual exclusion is also exploited by Astrée to reduce spurious interleavings. When these features are insufficient to match the concurrency semantics of the analyzed program, Astrée reverts to unrestricted preemption, which ensures sound analysis coverage for all concurrency models, including execution on multi-core processors. In particular, Astrée is not limited to collaborative threads or to discrete sets of preemption points.

Programs to be analyzed are seldom run in isolation; they interact with an environment. In order to soundly report all runtime errors, Astrée must take the effect of the environment into account. In the simplest case the software runs directly on the hardware, in which case the environment is limited to a set of volatile variables, i.e., program variables that can be modified by the environment concurrently and for which a range can be provided to Astrée by formal directives. More often, the program is run on top of an operating system, which it can access through function calls to a system library. When analyzing a program using a library, one possible solution is to include the source code of the library with the program. This is not always convenient (if the library is complex), nor possible, e.g. if the library source is not available, not fully written in C, or ultimately relies on kernel services (as for system libraries). An alternative is to provide a stub implementation, i.e., to write, for each library function, a specification of its possible effect on the program. Astrée provides stub libraries for the ARINC 653 standard, the OSEK/AUTOSAR standards [2, 1], and for POSIX threads.
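The idea of a stub can be illustrated generically (this sketch does not use Astrée's actual directive syntax; a real stub would use the analyzer's nondeterministic-choice primitives, and all names below are hypothetical):

```c
/* In a real analysis, this would be the analyzer's nondeterministic
 * choice primitive; for illustration we give a trivial executable
 * model that simply returns the lower bound. */
int nondet_int_in_range(int lo, int hi) {
    (void)hi;
    return lo;
}

/* Hypothetical stub for a temperature-sensor library call whose
 * source is unavailable: the stub states only the function's
 * possible effect -- it returns some value in [-40, 125] and
 * modifies no other program state. */
int read_temp(void) {
    return nondet_int_in_range(-40, 125);
}
```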

A particularity of OSEK is that system resources, including tasks, are not created dynamically at program startup; instead they are hardcoded in the system: a specific tool reads a configuration file in OIL format (OSEK Implementation Language) describing these resources and generates a dedicated version of the system to be linked against the application. To support this workflow, Astrée provides its own OIL file reader and automatically creates the implementation code from the OIL file. Combining the C sources of the OSEK application, the fixed OSEK stub provided with Astrée, and the C file automatically generated from the OIL file, we obtain a stand-alone application without any undefined symbols that can be analyzed with Astrée and faithfully models the execution of the application in an OSEK environment. This workflow enables a high level of automation with minimal configuration when analyzing OSEK applications.

Practical experience with avionics and automotive industry applications is reported in [21, 17]. These reports show that industry-sized programs of millions of lines of code can be analyzed in acceptable time with high precision for runtime errors and data races.

IV. CONTROL AND DATA FLOW ANALYSIS

Safety standards like DO-178C and ISO 26262 require performing control and data flow analysis as a part of software unit testing and in order to verify the software architectural design. Investigating control and data flow is also the subject of the Data Safety guidance [11], and it is a prerequisite for analyzing confidentiality and integrity properties as part of a security case. Technically, any semantics-based static analysis is able to provide information about data and control flow, since this is the basis of the actual program analysis. However, data and control flow analysis has many aspects, and for some of them tailored analysis mechanisms are needed.

Global data and control flow analysis gives a summary of variable accesses and function invocations throughout program execution. In its standard data and control flow reports, Astrée computes the number of read/write accesses for every global or static variable and lists the location of each access along with the function from which the access is made and the thread in which the function is executed. The control flow is described by listing all callers and callees for every C function, along with the threads in which they can run. Indirect variable accesses via pointers as well as function pointer call targets are fully taken into account. Astrée also provides a call graph enhanced by data flow and concurrency information, which can be interactively explored.

Fig. 2: Call Tree Visualization enhanced by Data Flow and Concurrency Information

More sophisticated information can be provided by two dedicated analysis methods: program slicing and taint analysis. Program slicing [29] aims at identifying the part of the program that can influence a given set of variables at a given program point. Applied to a result value, e.g., it shows which functions, which statements, and which input variables contribute to its computation. Taint analysis tracks the propagation of specific data values through program execution. It can be used, e.g., to determine program parts affected by corrupted data from an insecure source. In the following we give a more detailed overview of both techniques.

A. Program Slicing

A slicing criterion of a program P is a pair ⟨s, V⟩ where s is a statement and V is a set of variables of P. Intuitively, a slice is a subprogram of P which has the same behavior as P with respect to the slicing criterion ⟨s, V⟩. Computing a statement-minimal slice is an undecidable problem, but using static analysis, approximate slices can be computed. As an example, Astrée provides a program slicer which can produce sound and compact slices by exploiting the invariants from Astrée's core analysis, including points-to information for variable and function pointers. A dynamic slice does not contain all statements potentially affecting the slicing criterion, but only those relevant for a specific subset of program executions, e.g., only those in which an error value can result.

Computing sound program slices is relevant for demonstrating safety and security properties. It can be used to show that certain parts of the code or certain input variables might influence, or cannot influence, a program section of interest.

B. Taint Analysis

In the literature, taint analysis is often mentioned in combination with unsound static analyzers, since it allows efficiently detecting potential errors in the code, e.g., array-index-out-of-bounds accesses or infeasible library function parameters [14, 15]. Inside a sound runtime error analyzer this is not needed, since typically more powerful abstract domains can track all undefined or unspecified behaviors. Inside a sound analyzer, taint analysis is primarily a technique for analyzing security properties. Its advantage is that users can flexibly specify taints, taint sources, and taint sinks, so that application-specific data and control flow requirements can be modeled.

In order to leverage this efficient family of analyses in sound analyzers, one must formally define the properties that may be checked using such techniques. Then it is possible to prove that a given implementation is sound with respect to that formal definition, leading to clean and well-defined analysis results. Taint analysis consists of discovering data dependencies using the notion of taint propagation. Taint propagation can be formalized using a non-standard semantics of programs, where an imaginary taint is associated with some input values. Consider a standard semantics using a successor relation between program states, where a program state is a map from memory locations (variables, program counter, etc.) to values in V. The tainted semantics then relates tainted states, which are maps from the same memory locations to V × {taint, notaint}, such that projecting on V yields the same relation as the standard semantics.

To define what happens to the taint part of a tainted value, one must define a taint policy. The taint policy specifies:

• Taint sources: a subset of input values or variables such that in any state, the values associated with these input values or variables are always tainted.
• Taint propagation: describes how the taint gets propagated. Typical propagation is through assignment, but more complex propagation can take more control flow into account, and may not propagate the taint through all arithmetic or pointer operations.
• Taint cleaning: an alternative to taint propagation, describing all the operations that do not propagate the taint. In this case, all assignments not containing a taint-cleaning operation will propagate the taint.
• Taint sinks: an optional set of memory locations. This has no semantic effect, except to specify conditions under which an alarm should be emitted when verifying a program (an alarm must be emitted if a taint sink may become tainted for a given execution of the program).
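The policy above can be made concrete with a toy tainted-value type (our own illustration, not Astrée's implementation): each value carries a taint flag, arithmetic propagates it, and a sink check raises an alarm when a tainted value may reach the sink:

```c
#include <stdbool.h>

/* A value paired with its taint flag, mirroring V x {taint, notaint}. */
typedef struct { int v; bool tainted; } TVal;

/* Taint source: values read from an untrusted input are tainted. */
TVal from_untrusted(int v) { return (TVal){ v, true  }; }
TVal from_constant(int v)  { return (TVal){ v, false }; }

/* Propagation through arithmetic: the result is tainted if either
 * operand is. Projecting away the flag gives ordinary addition. */
TVal t_add(TVal a, TVal b) {
    return (TVal){ a.v + b.v, a.tainted || b.tainted };
}

/* Sink check: returns true (alarm) if a tainted value may reach
 * the sink, e.g. an array index or a format string argument. */
bool sink_alarm(TVal x) { return x.tainted; }
```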

A sound taint analyzer computes an over-approximation of the memory locations that may be mapped to a tainted value during program execution. The soundness requirement ensures that no taint sink warning will be overlooked by the analyzer. The tainted semantics can easily be extended to a mix of different hues of tainting, corresponding to an extension of the taint set associated with values. Propagation can then become more complex, with taints not just being propagated but also changing hue depending on the instruction. Such extensions lead to a rather flexible and powerful data dependency analysis while remaining scalable.

V. CONCLUSION

In this article we have given an overview of code-level defects and vulnerabilities relevant for functional safety and security. We have shown that many security attacks can be traced back to behaviors undefined or unspecified according to the C semantics. By applying sound static runtime error analyzers, a high degree of security can be achieved for safety-critical software, since the absence of such defects can be proven. In addition, security hyperproperties require additional analyses to be performed which, by nature, have a high complexity. We have given two examples of scalable dedicated analyses, program slicing and taint analysis. Applied as extensions of sound static analyzers, they allow further increasing confidence in the security of safety-critical embedded systems.

ACKNOWLEDGMENT

The work presented in this paper was funded within the project ARAMiS II by the German Federal Ministry for Education and Research with the funding ID 01|S16025. The responsibility for the content remains with the authors.

REFERENCES<br />

[1] AUTOSAR (AUTomotive Open System ARchitecture). http://-<br />

www.autosar.org.<br />

[2] OSEK/VDX Operating System. Version 2.2.3, 2005.<br />

[3] AbsInt GmbH. Safety Manual for aiT, Astrée, StackAnalyzer, 2015.<br />

[4] Mounir Assaf, David A. Naumann, Julien Signoles, Eric Totel, and<br />

Frédéric Tronel. Hypercollecting semantics and its application to static<br />

analysis of information flow. CoRR, abs/1608.01654, 2016.<br />

[5] B. Blanchet, P. Cousot, R. Cousot, J. Feret, L. Mauborgne, A. Miné, D.<br />

Monniaux, and X. Rival. A Static Analyzer for Large Safety-Critical<br />

Software. In Proc. of PLDI’03, pages 196–207. ACM Press, June 7–14<br />

2003.<br />

[6] JTC1/SC22. Programming languages – C, 16 Dec. 1999.<br />

[7] Michael R. Clarkson and Fred B. Schneider. Hyperproperties. Journal of<br />

Computer Security, 18:1157–1210, 2010.<br />

[8] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model<br />

for static analysis of programs by construction or approximation of<br />

fixpoints. In Proc. of POPL’77, pages 238–252. ACM Press, 1977.<br />

[9] Patrick Cousot, Radhia Cousot, Jérôme Feret, Antoine Miné, Laurent<br />

Mauborgne, David Monniaux, and Xavier Rival. Varieties of Static<br />

Analyzers: A Comparison with ASTRÉE. In First Joint IEEE/IFIP<br />

Symposium on Theoretical Aspects of Software Engineering, TASE 2007,<br />

pages 3–20. IEEE Computer Society, 2007.<br />

[10] D. Delmas and J. Souyris. ASTRÉE: from Research to Industry. In Proc.<br />

14th International Static Analysis Symposium (SAS2007), number 4634<br />

in LNCS, pages 437–451, 2007.<br />

[11] SCSC Data Safety Initiative Working Group [DSIWG]. Data Safety<br />

(Version 2.0)[SCSC-127B]. Technical report, Safety-Critical Systems<br />

Club, Jan 2017.<br />

[12] Jérôme Feret. Static analysis of digital filters. In Proc. of ESOP’04,<br />

volume 2986 of LNCS, pages 33–48. Springer, 2004.<br />

[13] IEEE Computer Society and The Open Group. Portable operating system<br />

interface (POSIX) – Application program interface (API) amendment 2:<br />

Threads extension (C language). Technical report, ANSI/IEEE Std.<br />

1003.1c-1995, 1995.<br />

[14] CERT, Software Engineering Institute. SEI CERT C Coding Standard –<br />

Rules for Developing Safe, Reliable, and Secure Systems. Carnegie<br />

Mellon University, 2016.<br />

[15] ISO/IEC. Information Technology – Programming Languages, Their<br />

Environments and System Software Interfaces – Secure Coding Rules<br />

(ISO/IEC TS 17961), Nov 2013.<br />

[16] D. Kästner. Applying Abstract Interpretation to Demonstrate Functional<br />

Safety. In J.-L. Boulanger, editor, Formal Methods Applied to Industrial<br />

Complex Systems. ISTE/Wiley, London, UK, 2014.<br />

[17] D. Kästner, A. Miné, L. Mauborgne, X. Rival, J. Feret, P. Cousot,<br />

A. Schmidt, H. Hille, S. Wilhelm, and C. Ferdinand. Finding All<br />

Potential Runtime Errors and Data Races in Automotive Software. In<br />

SAE World Congress 2017. SAE International, 2017.<br />

[18] D. Kästner, C. Rustemeier, U. Kiffmeier, D. Fleischer, S. Nenova,<br />
R. Heckmann, M. Schlickling, and C. Ferdinand. Model-Driven Code<br />
Generation and Analysis. In SAE World Congress 2014. SAE International, 2014.<br />

[19] A. Miné. The Octagon Abstract Domain. Higher-Order and Symbolic<br />

Computation, 19(1):31–100, 2006.<br />

[20] A. Miné. Static analysis of run-time errors in embedded real-time parallel<br />

C programs. Logical Methods in Computer Science (LMCS), 8(26):63,<br />

Mar. 2012.<br />

[21] A. Miné and D. Delmas. Towards an Industrial Use of Sound Static<br />

Analysis for the Verification of Concurrent Embedded Avionics<br />

Software. In Proc. of the 15th International Conference on Embedded<br />

Software (EMSOFT’15), pages 65–74. IEEE CS Press, Oct. 2015.<br />

[22] A. Miné, L. Mauborgne, X. Rival, J. Feret, P. Cousot, D. Kästner,<br />

S. Wilhelm, and C. Ferdinand. Taking Static Analysis to the Next Level:<br />

Proving the Absence of Run-Time Errors and Data Races with Astrée.<br />

Embedded Real Time Software and Systems Congress ERTS².<br />

[23] MISRA Limited. MISRA-C:2004 Guidelines for the use of the C<br />

language in critical systems, Oct. 2004.<br />

[24] MISRA Limited. MISRA-C:2012 Guidelines for the use of the C<br />

language in critical systems, Mar. 2013.<br />

[25] MISRA Limited. MISRA-C:2012 – Addendum 2. Coverage of MISRA<br />

C:2012 against ISO/IEC TS 17961:2013 "C Secure", Apr. 2016.<br />

[26] MISRA Limited. MISRA-C:2012 Amendment 1 – Additional security<br />

guidelines for MISRA C:2012, Apr. 2016.<br />

[27] A Sabelfeld and A. C. Myers. Language-based information-flow<br />

security. IEEE Journal on Selected Areas in Communications, 21(1):5–<br />

19, 2003.<br />

[28] Jean Souyris, Erwan Le Pavec, Guillaume Himbert, Victor Jégu,<br />

Guillaume Borios, and Reinhold Heckmann. Computing the worst case<br />

execution time of an avionics program by abstract interpretation. In<br />

Proceedings of the 5th Intl Workshop on Worst-Case Execution Time<br />

(WCET) Analysis, pages 21–24, 2005.<br />

[29] Mark Weiser. Program slicing. In Proceedings of the 5th International<br />

Conference on Software Engineering, ICSE ’81, pages 439–449. IEEE<br />

Press, 1981.<br />

[30] Wired.com. The jeep hackers are back to prove car hacking can get much<br />
worse. https://www.wired.com/2016/08/jeep-hackers-return-high-speed-steering-acceleration-hacks/, 2016.<br />

[31] Junfeng Yang, Ang Cui, John Gallagher, Sal Stolfo, and Simha<br />
Sethumadhavan. Concurrency attacks. In Fourth USENIX Workshop on<br />
Hot Topics in Parallelism (HotPar '12), 2012.<br />

[32] Yves Younan, Wouter Joosen, and Frank Piessens. Code injection in C<br />
and C++: A survey of vulnerabilities and countermeasures. Technical<br />

report, Departement Computerwetenschappen, Katholieke Universiteit<br />

Leuven, 2004.<br />



C++17: Analysis and risk mitigation of security<br />

vulnerabilities<br />

Walter Capitani<br />

Product manager, Klocwork<br />

Rogue Wave Software<br />

Ottawa, ON, Canada<br />

walter.capitani@roguewave.com<br />

Abstract— With the recent approval of the C++17 language standard,<br />
along with new features introduced in C++11 and C++14,<br />
embedded developers are writing code in all sorts of exciting new<br />
ways. However, these new features also introduce new points of<br />
failure and new attack vectors for hackers.<br />

This paper explores the impact of these new features on<br />

software quality and identifies new and expanded security<br />

vulnerabilities and attack vectors that can be exploited. Based on<br />

an analysis of the standards by language experts and actual<br />

running code, sample vulnerabilities and defects will be presented,<br />

and techniques and standards for reducing risk will be evaluated.<br />

Keywords— C++, software security, software quality, best practices, embedded software development<br />

I. INTRODUCTION<br />

C++ continues to be one of the most popular programming<br />

languages in the world, even 33 years after its first release. The<br />

latest version of the C++ standard, C++17, was released in<br />

December 2017, introducing many new features to simplify the<br />

language, support large-scale systems, and improve<br />

concurrency. With every version, the community tries to<br />

improve code security by adapting to strategies employed by<br />

malicious entities, yet there remain some ways in which the<br />

language can be exploited.<br />

TABLE I. C++ POPULARITY ACROSS DIFFERENT SITES<br />
<br />
Site | Method | Ranking<br />
GitHub | Opened pull requests | 6 (a)<br />
TIOBE | Search query hits | 3 (b)<br />
IEEE Spectrum | 12 different metrics | 4 (c)<br />
RedMonk | Data from GitHub and StackOverflow | 6 (d)<br />
<br />
a. octoverse.github.com<br />
b. tiobe.com/tiobe-index/<br />
c. spectrum.ieee.org/static/interactive-the-top-programming-languages-2017<br />

This paper examines the C++17 standard (ISO/IEC<br />

14882:2017) to identify potential security vulnerabilities and<br />

provide examples. It is hoped that this identification will assist<br />

compiler developers, application developers, and software<br />

testers in the remediation of the vulnerabilities.<br />

II. IMPROPER DEALLOCATION OF DYNAMICALLY-<br />

ALLOCATED RESOURCES<br />

Potential security issues can be introduced by the incorrect<br />

usage of smart pointers, a problem that was introduced in C++11<br />

and not addressed in C++17. The following code sample, from<br />

the CERT C++ Coding Standard (MEM51-CPP: Properly<br />

deallocate dynamically allocated resources 1 ), illustrates the<br />

issue:<br />

1 #include <memory><br />
2<br />
3 struct S {};<br />
4<br />
5 void f() {<br />
6   std::unique_ptr<S> s{new S[10]};<br />
7 }<br />

Here, a std::unique_ptr<S> is declared to hold a pointer to<br />
a single object but is initialized with an array of S objects. When the<br />
std::unique_ptr is destroyed as it goes out of scope on line<br />
7, undefined behavior results because, by default, delete is<br />
called instead of delete[]. This could cause abnormal<br />
application termination, memory leaks, or other issues.<br />

To avoid this issue, the std::unique_ptr should be declared<br />
to hold an array of S objects, to ensure the correct deleter is<br />
called upon destruction, and std::make_unique<S[]>() should be<br />
called to initialize the smart pointer, which alerts the user if the<br />
resulting std::unique_ptr is not of the correct type. This<br />
solution is below.<br />

d. redmonk.com/sogrady/2017/06/08/language-rankings-6-17<br />
1. wiki.sei.cmu.edu/confluence/display/cplusplus/MEM51-CPP.+Properly+deallocate+dynamically+allocated+resources<br />



1 #include <memory><br />
2<br />
3 struct S {};<br />
4<br />
5 void f() {<br />
6   std::unique_ptr<S[]> s = std::make_unique<S[]>(10);<br />
7 }<br />

III. LAMBDA OBJECTS<br />

Referencing EXP61-CPP from the CERT C++ Coding<br />
Standard 2 , undefined behavior may result when a lambda object<br />
captures the this pointer and then outlives its enclosing object.<br />
Although that rule is written against C++14, the following code<br />
sample illustrates the same issue for C++17.<br />

1 #include <iostream><br />
2 #include <memory><br />
3<br />
4 class C {<br />
5 public:<br />
6   C() : p(std::make_unique<int>(10)) { }<br />
7   ~C() { p.release(); }<br />
8<br />
9   auto f() {<br />
10     return [this] { return p.get(); };<br />
11   }<br />
12<br />
13 private:<br />
14   std::unique_ptr<int> p;<br />
15 };<br />
16<br />
17 int main()<br />
18 {<br />
19   auto myf = C().f();<br />
20   auto pp = myf();<br />
21   *pp += 2;<br />
22   return 0;<br />
23 }<br />

Here, the function f() returns a lambda that captures the<br />
this pointer of an object of class C. On line 19, the temporary<br />
object C() is destroyed, but the lambda still holds a reference to<br />
it. On line 20, the captured this object no longer exists, which<br />
may cause unpredictable behavior when the freed memory is used<br />
on line 21. In this example, a null pointer dereference was forced<br />
(which should cause a crash of the application) by the code on line<br />
7. Without line 7, the error is subtler and would be a use of freed<br />
memory.<br />

The fix for this is to either extend the lifetime of the object<br />

of class C or copy the object of class C when creating the lambda<br />

object. This sample illustrates the former idea:<br />

1 #include <iostream><br />
2 #include <memory><br />
3<br />
4 class C {<br />
5 public:<br />
6   C() : p(std::make_unique<int>(10)) { }<br />
7   ~C() { p.release(); }<br />
8<br />
9   auto f() {<br />
10     return [this] { return p.get(); };<br />
11   }<br />
12<br />
13 private:<br />
14   std::unique_ptr<int> p;<br />
15 };<br />
16<br />
17 int main()<br />
18 {<br />
19   C c;<br />
20   auto myf = c.f();<br />
21   auto pp = myf();<br />
22   *pp += 2;<br />
23   return 0;<br />
24 }<br />

IV. BREAKING OF BACKWARD COMPATIBILITY<br />

As a general goal, the creators of the C++ standards attempt<br />
to maintain backward compatibility between versions; however,<br />
there are cases where compatibility is broken to correct<br />
undesired behavior present in older versions. This can present<br />
potential issues when a user attempts to use newer behavior with<br />
a compiler enforcing an older standard.<br />

An example of this case is over-aligned types, which CERT<br />

has a rule for, MEM57-CPP: Avoid using default operator new<br />

for over-aligned types 3 . In this code sample from the CERT<br />

website, the new expression invokes the default operator new on<br />

line 6, which constructs an object of the user-defined type Vector<br />

with an alignment of 32 bytes (line 1), exceeding the typical<br />

alignment of 16 bytes for most implementations. This can cause<br />

unpredictable behavior if this object is passed into SIMD (single<br />

instruction, multiple data) vectorization instructions, which<br />

require specific aligned arguments.<br />

1 struct alignas(32) Vector {<br />

2 char elems[32];<br />

3 };<br />

4<br />

5 Vector *f() {<br />

6 Vector *pv = new Vector;<br />

7 return pv;<br />

8 }<br />

This behavior was fixed in the C++17 standard 4 but can still<br />

occur if using an older compiler. This can cause a security issue<br />

such as abnormal termination of the application.<br />

To avoid this, the best practice would be for developers to<br />

know which standard their compiler enforces, which standard<br />

they are coding to, and to avoid programming practices that<br />

differ between the two unless the implications are clearly<br />

understood. The use of a static analysis tool would help automate<br />

the detection of these issues.<br />

2. wiki.sei.cmu.edu/confluence/display/cplusplus/EXP61-CPP.+A+lambda+object+must+not+outlive+any+of+its+reference+captured+objects<br />
3. wiki.sei.cmu.edu/confluence/display/cplusplus/MEM57-CPP.+Avoid+using+default+operator+new+for+over-aligned+types<br />
4. open-std.org/jtc1/sc22/wg21/docs/papers/2016/p0035r4.html<br />



V. IMPRECISION WITH THE STANDARD<br />

With a standard as broad and complex as C++ (and the<br />
inclusion of many different opinions and practices), there is the<br />
possibility of imprecision in the guidelines that leaves room for<br />
interpretation. The C++ committee recognizes this and<br />
maintains a list of defect reports for future improvement 5 , but<br />
examples can still be observed today.<br />

Defect report 2176 documents an issue with throwing<br />
destructors that can still arise with the C++17-compliant Clang<br />
compiler. This code sample is from the defect report:<br />

1 #include <cstdio><br />
2<br />
3 struct X {<br />
4   X() { puts("X()"); }<br />
5   X(const X&) { puts("X(const X&)"); }<br />
6   ~X() { puts("~X()"); }<br />
7 };<br />
8<br />
9 struct Y { ~Y() noexcept(false) { throw 0; } };<br />

10<br />

11 X f() {<br />

12 try {<br />

13 Y y;<br />

14 return {};<br />

15 } catch (...) {<br />

16 }<br />

17 return {};<br />

18 }<br />

19<br />

20 int main() {<br />

21 f();<br />

22 }<br />

The issue is that the current compiler implementation prints<br />
X() twice but ~X() only once: an object is constructed twice but<br />
destroyed only once, which is incorrect. This can cause security<br />
issues such as memory leaks and incorrect resource handling.<br />

Note that there is a proposed resolution to correct this<br />

behavior 6 .<br />

VI. SUMMARY<br />

C++ continues to be a popular programming language and<br />

the release of the C++17 standard illustrates its longevity. With<br />

this release comes potential security vulnerabilities that<br />

developers should be aware of and take steps to prevent. Best<br />

practices for avoidance include following the general secure<br />

coding guidelines listed in the CERT C++ Coding Standard,<br />

ensuring dynamically allocated resources are declared and<br />

dereferenced correctly, and understanding any gaps or<br />

compatibility issues between the standard and compiler being<br />

used. A useful method for implementing these best practices to<br />

prevent potential security vulnerabilities is to use a static code<br />

analysis tool.<br />

5. open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html<br />
6. open-std.org/jtc1/sc22/wg21/docs/cwg_defects.html<br />



How can you sustain high performance with<br />

functional safety features?<br />

Jon Taylor<br />

Senior Technology Manager, Arm<br />

Cambridge, UK<br />

Abstract— Embedded systems markets are seeing a relentless<br />

push for higher performance, while at the same time markets such<br />

as drones and robotics are introducing requirements for<br />

functional safety. Software developers concerned about<br />

developing for safe systems may be using coding standards such as<br />

MISRA, or safety-certified compilers and operating systems. The<br />

underlying hardware is also adding more features to support<br />

functionally safe systems, and it’s imperative that software<br />

engineers understand these features to get best performance from<br />

the design and meet their safety goals.<br />

This paper considers methods to achieve both high<br />

performance and high levels of functional safety, and will help<br />

software engineers understand features in both existing and new<br />

hardware platforms. It will discuss features including software test<br />

libraries, error correcting memories and detecting both software<br />

and hardware faults at runtime. It also considers at a higher level<br />

how hardware features such as additional privilege levels for<br />

virtualization allow new software development methodologies.<br />

Keywords— Functional safety, embedded software, real-time<br />

I. INTRODUCTION<br />

Functional safety is concerned with the mitigation of faults<br />

that might cause hazards in a system in operation. With respect<br />

to hardware, there are two main sources of fault to consider. The<br />

first is systematic faults – faults that might occur due to mismatches<br />

between requirements and implementation. These are addressed<br />

with rigorous design and verification processes, and further<br />

discussion of this is outside the scope of this paper. The other is<br />

runtime, random faults. These can occur for many reasons, but<br />

the more common occurrences are due to events such as<br />

radiation particle strikes, or the system itself failing due to age.<br />

To detect these faults, additional hardware, software, or a<br />

combination can be used. This paper considers some of these<br />

features, and the effects they may have on software performance.<br />

Consideration is given to areas that particularly affect<br />

performance. This paper does not cover techniques to develop<br />

software in a manner to meet functional safety goals; it is about<br />

understanding hardware features and the effects they can have<br />

on software performance.<br />

New markets, particularly involving autonomous systems,<br />

are now mixing the requirement for very high levels of compute<br />

with high levels of functional safety, and this provides a<br />

challenge for system designers. This can be made more<br />

challenging when these systems also add hard real-time response<br />

requirements too.<br />

II. RUN TIME TESTING<br />

In safety-critical applications, there is a concern that latent faults<br />

can accumulate over time, which may eventually prevent safe<br />

operation of a device. Imagine a car air-bag. For the vast<br />

majority of its life, it will be quiescent, and the majority of its<br />

circuitry will not be used. But in an accident it will need to<br />

deploy, and previously unused circuitry will be used. If faults<br />

have accumulated in that circuitry over time, it may operate incorrectly.<br />

To protect against this occurring, diagnostic tests may be run<br />

during normal operation. These can include testing both<br />

memories and logic.<br />

Sometimes these tests can be run only at boot, but depending<br />

on application and safety analysis, there can also be a<br />

requirement for testing during normal operation at runtime.<br />

A. Memory Test<br />

While error-correcting (ECC) memory can detect and correct<br />

for some faults, this only happens when any specific address is<br />

accessed. Memory Built-in Self Test (BIST) can operate over<br />

the whole memory array, in what is often called “scrubbing”.<br />

This prevents faults accumulating over time. When a memory<br />

is being accessed for testing, it is unavailable to the processor<br />

core. This can potentially cause performance (or availability)<br />

problems, particularly if the operation runs for an extended<br />

period of microseconds or milliseconds. In the case of a cache,<br />

the processor could clean the cache, then continue operating<br />

from backing memory while the test completes, albeit at lower<br />

performance. However, companies such as Arm have<br />

introduced an alternative approach, called on-line MBIST. This<br />

allows one or two addresses to be tested in isolation, in a very<br />

short time period (a few tens of cycles). This can be done with<br />

minimal interference to normal operation, but over a longer<br />

time period can provide coverage of the whole memory. It can<br />

also be used to test whole memories efficiently during boot.<br />

For the developer, on-line MBIST is interesting as it allows<br />
fine-grained control over what testing happens and how frequently.<br />



In the case where an error is detected, it could also run a test<br />

sequence to check whether the error is transient or permanent.<br />

B. Software test<br />

Software test is an elegant solution to the requirement for runtime<br />

testing of logic. In the same way that an RTOS can<br />

interleave different tasks, the software test routines can be<br />

considered as other tasks. They consist of carefully crafted test<br />

functions, written in assembler to ensure that instruction<br />

sequences and register use don’t change. By doing this,<br />

the processor designers can measure fault coverage of these<br />

sequences against the processor logic to ensure the desired<br />

coverage targets are achieved (this can only be done in<br />

simulation – it’s not something that can be done on silicon). The<br />

functions are then wrapped with C to provide a more<br />

programmer friendly interface.<br />

The benefit of having a software test library is that a system<br />

designer can choose how often to interleave code sequences<br />

with normal operation to achieve their coverage goals, and<br />

these can be run as small sections of code, relatively frequently,<br />

so they don’t affect availability in the way logic BIST would.<br />

Both software and memory BIST can be controlled by software<br />

running on the processor itself and are readily interleaved with<br />

normal operation. The third kind of coverage (logic BIST) is<br />

somewhat different.<br />

C. Logic test<br />

Logic BIST involves using manufacturing test logic (scan) to<br />

load patterns in and out of the processor logic, and is therefore<br />

destructive of processor state. These patterns are designed to<br />

provide test coverage that all the processor logic is operating<br />

correctly. Often logic BIST will be used during boot to ensure<br />

a device is functioning correctly prior to normal operation (for<br />

instance, after turning a car's ignition on, the ECUs may be tested<br />

with LBIST).<br />

Running logic BIST will destroy any state in the processor, so<br />

it means taking a processor out of use while all the processor<br />

state is scanned out (and saved), the test run, then state restored.<br />

Unless you have multiple processors and some redundant<br />
capacity, this impact on availability usually makes run-time logic<br />
BIST prohibitive. It also requires additional capability<br />

outside the processor to manage the saving and restoring of<br />

processor state.<br />

In a many-processor system, it might be part of the availability<br />

strategy that processors are rotated out of and back into use to<br />

allow high levels of fault coverage during normal operation.<br />

Logic BIST is mostly outside of scope for software developers;<br />

however, if you are considering a scheme of taking cores in and<br />

out of use, then power sequences need to be considered, along<br />

with the ability to migrate tasks between processors. For system<br />

developers the consideration is around what level of availability<br />

must be maintained to achieve the total performance<br />

requirements needed, coupled with how often the testing needs<br />

to be performed.<br />

III. ERROR-CORRECTING MEMORY<br />

A common feature on many processors now is error-correcting<br />

memory. As process geometries have shrunk, memory has<br />

become more susceptible to bit flips caused by radiation, even<br />

on earth. One of the more common types of ECC protection is<br />

able to correct single bit errors, and detect double bit errors<br />

(SEC-DED). Usually the operation of this is transparent to the<br />

developer, however there are cases where it can affect<br />

performance, or even operation of the device.<br />

A single-bit error can usually be corrected in-line (the encoding<br />

used to detect the error contains enough information to correct<br />

it), i.e. the processor doesn’t have to perform any further<br />

memory accesses. However, the corrected value will need to be<br />

re-written to the memory to prevent an increase in the number<br />

of errors over time – i.e. single-bit correctable errors becoming<br />

double-bit uncorrectable errors. This will usually happen<br />

without affecting program operation – although if several errors<br />

occur in the middle of multiple back-to-back memory accesses,<br />

it may incur a slight performance penalty. A double-bit error<br />

may even be recoverable: if the access is a read from a clean<br />

cacheline, the data can be re-fetched from main memory –<br />

although this of course incurs a performance overhead.<br />

A performance impact is more likely when accessing<br />

small items of data. The ECC code adds an overhead in terms<br />

of memory storage. For example, SEC-DED ECC for a byte<br />

would require five additional check bits. The encoding of this<br />

code becomes more efficient as the data size increases – a 32-<br />

bit word can be protected with seven bits of ECC coding. The<br />

tradeoff is that ECC is calculated across the whole data – so if<br />

you write a byte to memory which has ECC chunk size of 32-<br />

bits, it will require a read-modify-write operation to calculate<br />

the new ECC value with the new byte of data. Again, this is<br />

always transparent to the software, but if many accesses are<br />

made close to each other in time, the processor may not be able<br />

to perform all of these operations without affecting<br />

performance. This is something that may need to be considered<br />

by the system designer too. For instance, an instruction cache<br />

may use 64-bit ECC encoding (as values are rarely written), but<br />

a data cache may use 32-bit ECC as this is a common size of<br />

access.<br />

Another benefit of using a smaller ECC chunk size is higher<br />

resilience to faults (fig 1). When a chunk size of 64-bits is used,<br />

a two-bit error within this word may result in an uncorrectable<br />

error. If 32-bit chunks had been used and one fault occurred in<br />

each half of the 64-bit data value, then this would be correctable<br />

(assuming a SEC-DED scheme). To know which to use comes<br />

back to the system designer to consider expected error rate and<br />

availability requirements, along with the most common size of<br />

memory access.<br />

[Figure content: two single-bit errors that fall in separate 32-bit ECC chunks are each corrected, while the same two errors within one 64-bit ECC chunk can only be detected.]<br />

Figure 1 - comparison of ECC schemes<br />



A final consideration for ECC is what happens if the processor<br />

encounters a permanent fault. In this situation, an error is<br />

detected during a read operation. The processor tries to correct<br />

this by performing a write of the corrected data, before<br />

performing the read again. However in the case of a permanent<br />

fault, the error will recur. A hard error cache allows this location<br />

to be marked as bad, and instead the hard error cache is used to<br />

replace this location. This allows the processor to continue to<br />

make forward progress. Having this hard error cache requires<br />

additional memory, so it may be only a very small number of<br />

entries that can be supported. So long as the errors encountered<br />

are only single bit, forward progress can still be made, albeit<br />

with reduced performance.<br />

As well as considering the performance impact of ECC,<br />

software developers may also want to track the rate of errors<br />

occurring. ECC may operate transparently to the developer, or<br />

it may be recorded in special registers such that software can<br />

track what errors are occurring and measure how often they<br />

occur. This could be used to predict failure and indicate to a<br />

user that a hardware failure might be imminent.<br />

IV. MEMORY PROTECTION AND MANAGEMENT<br />

Many embedded processors these days include a memory<br />

protection unit or memory management unit. Effective use of<br />

these is critical to safe software development. These are usually<br />

supported by operating systems, but may not be enabled by<br />

default or if you run a bare-metal environment. The advantages<br />
of being able to protect sections of memory or peripherals from<br />
code that shouldn't be able to access them should be obvious,<br />

but the performance aspect may need a little more<br />

consideration.<br />

A. Memory Protection Units (MPU)<br />

For hard real-time applications, a processor with a memory<br />

protection unit is a good solution. These typically have a fixed<br />

number of regions. The main advantage of an MPU is that<br />

lookups are deterministic, and usually place no additional<br />

overhead on a memory access. At a basic level, they can be used<br />

for stack protection (a stack overflow results in a trap, rather<br />

than data corruption), marking code as read-only, and data as<br />

execute-never. In a system with an operating system, the OS<br />

may need to reconfigure the MPU between different tasks.<br />

Depending on the frequency of the task switches and<br />

complexity of the memory map, this reconfiguration should take only a<br />
small amount of time, but it is something the developer may need to<br />

consider.<br />

B. Memory management units (MMU)<br />

Operating systems such as Linux use an MMU to abstract<br />

applications from the underlying memory system. In processors<br />

with hardware virtualization support, a second stage of<br />

translation is added such that only the hypervisor knows the<br />

physical memory map; guest OSs have their accesses translated,<br />

and applications within those guests have a second layer of<br />

translation. Like MPUs, this also provides isolation and<br />

freedom from interference between guest operating systems and<br />

applications.<br />

Having memory address translation is a very flexible approach,<br />

and makes software more easily portable between platforms,<br />

but there can be a hidden cost. Unlike an MPU, with a fixed<br />

number of regions, typically an MMU does not have the same<br />

limit. A processor cannot store all these translations internally,<br />

so it uses a cache (called a Translation Lookaside Buffer, or<br />

TLB). The performance variation occurs depending on whether<br />

it hits in this translation cache, or goes to the main page tables<br />

stored in main memory (this is called a “page table walk” and,<br />

because it accesses main memory, may take many cycles). Fig<br />

2 shows the different approaches of MPU and MMU and how<br />

applications see the memory map of the device.<br />

[Figure 2 – MMU vs MPU: with an MMU, applications (App1, App2) and the kernel occupy virtual address spaces that stage-1 translation maps onto physical addresses; with an MPU, tasks (Task1, Task2) and the kernel are checked against protection regions but the physical address equals the virtual address.]<br />
<br />
V. VIRTUALIZATION<br />
<br />
Memory protection relies on the concept of task privilege.<br />
Privileged code is trusted and validated to perform system<br />
control, while application code typically runs unprivileged and is<br />

www.embedded-world.eu<br />

519


therefore unable to alter the system configuration in case of an<br />

error. Good practice is to minimize the amount of code required<br />

to operate in privileged mode (bug rates are often considered<br />

proportional to code size, so a smaller code base reduces the<br />

probability of bugs in the code and a smaller code base is<br />

simpler to validate fully).<br />

Virtualization has become commonplace in servers, where it is<br />

used to run multiple guest operating systems on the same<br />

physical hardware. This has now become possible in embedded<br />

processors, where the introduction of an additional privilege<br />

level to newer processors allows a hypervisor to be used to<br />

control the system. While virtualization is possible without this<br />

additional privilege, it requires the guest OSs to be<br />

paravirtualized (paravirtualization is a software-based<br />

virtualization solution which requires guest OSs to be ported to<br />

the virtualization API). The extra privilege level allows full<br />

virtualization and higher levels of performance. The<br />

introduction of these processors is particularly helpful in mixed<br />

criticality applications, where instead of having to integrate<br />

software from multiple vendors into a single partition, a<br />

hypervisor can be used to run the different partitions<br />

individually, isolating them in both memory and time.<br />

From a performance perspective, there are several factors that<br />

can affect operation when running under a hypervisor, some<br />

related to the processor hardware, others related to the<br />

hypervisor software itself. While hypervisors are often<br />

designed to be transparent to software developers, it is still<br />

worth considering different use models as this may affect<br />

whether use of a hypervisor is appropriate for your application.<br />

The biggest factor is whether multiple guests are running on a<br />

single core or not. If the OS is pinned to a specific core, and that<br />

core is not shared with other guests, there should be absolutely<br />

minimal overheads. Once shared though, the two main<br />

considerations are:<br />

a) There will be times when your guest is not<br />

running, so longer interrupt latencies will be<br />

expected<br />

b) Access to peripherals may be slower, particularly<br />

if shared between guests<br />

Regarding peripherals, whether there are multiple guests<br />

sharing the same core or peripheral may affect which choice of<br />

access model is used, which in turn affects performance (see<br />

fig. 3, Peripheral access models). Direct access is the most<br />

performant, but least flexible, while virtualized is the most<br />

flexible, but has a higher overhead to all guests. A compromise<br />

is the shared model, where a single guest has direct access, with<br />

other guests accessing through the primary owner.<br />
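The trade-off between the three access models can be sketched as a simple cost function; the guest/owner numbering and the relative cost figures below are invented for illustration:<br />

```c
/* Sketch of the three peripheral access models discussed above.
 * The cost figures are illustrative assumptions only. */
typedef enum {
    ACCESS_DIRECT,      /* one guest owns the device outright        */
    ACCESS_SHARED,      /* one owner; other guests go via the owner  */
    ACCESS_VIRTUALIZED  /* hypervisor mediates every access          */
} access_model_t;

/* Relative cost of a device access for a given guest: lowest when
 * the guest reaches the device directly, higher when another party
 * (the owning guest or the hypervisor) must be involved. */
int access_cost(access_model_t model, int guest, int owner)
{
    switch (model) {
    case ACCESS_DIRECT:
        return 1;                         /* only the owner may access  */
    case ACCESS_SHARED:
        return (guest == owner) ? 1 : 3;  /* extra hop through the owner */
    case ACCESS_VIRTUALIZED:
        return 5;                         /* trap to the hypervisor     */
    }
    return -1;
}
```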

A further performance consideration is for systems using<br />

MMUs. The cost of a page table walk has already been<br />

discussed; however, with two stages of memory translation in a<br />

virtualized system, there is now the potential for two levels of<br />

page table walk. Understanding the cost of this can be<br />

particularly challenging when trying to work out worst-case<br />

execution times. It can be mitigated to some extent by managing<br />

the number of pages in use (if the number of pages used is<br />

smaller than the TLB size, then accesses should hit in the TLB).<br />

Some architectures may allow TLB entries to be locked,<br />

ensuring accesses to critical sections of code or data are<br />

consistent in timing, although this can reduce average<br />

performance as a fraction of the TLB is then not available for<br />

normal code.<br />
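The mitigation described above amounts to a simple arithmetic check. A sketch, with page size, TLB size and locked-entry count as parameters (typical values such as 4 KiB pages and a few dozen TLB entries are assumptions for illustration):<br />

```c
#include <stdbool.h>
#include <stddef.h>

/* Sketch: will a task's working set stay resident in the TLB? */
size_t pages_needed(size_t working_set_bytes, size_t page_bytes)
{
    /* Ceiling division: partial pages still occupy a TLB entry. */
    return (working_set_bytes + page_bytes - 1) / page_bytes;
}

bool fits_in_tlb(size_t working_set_bytes, size_t page_bytes,
                 size_t tlb_entries, size_t locked_entries)
{
    /* Entries locked for critical code or data shrink the share of
     * the TLB left for everything else. */
    return pages_needed(working_set_bytes, page_bytes)
           <= tlb_entries - locked_entries;
}
```

For example, with 4 KiB pages, 48 TLB entries and 8 locked entries, a working set of 40 pages still fits, but one page more would start forcing page table walks.<br />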

In complex systems, using hypervisors to isolate different<br />

components is often an easier approach than trying to integrate<br />

everything into a single OS.<br />

Microkernel hypervisors are typically very efficient and with a<br />

small code footprint that makes certification for safety or<br />

security more straightforward. Applications class processors<br />

have supported virtualization for a number of years, but now<br />

processors such as the Arm Cortex-R52 mean that virtualization<br />

can be used even where the application has very hard real-time<br />

requirements, as it uses a two-stage MPU rather than an MMU.<br />

The hypervisor may be used to merge multiple guest operating<br />

systems onto the same processor, or in a multi-processor system<br />

can manage the system configuration and error handling, using<br />

core-pinning such that each guest is tied to a particular core.<br />

A full discussion of virtualization techniques and<br />

considerations is outside the scope of this paper.<br />

VI. REDUNDANCY<br />

For the most demanding safety applications, the methods<br />

described above still do not provide adequate hardware fault<br />

coverage. Sometimes it is necessary to execute a program<br />


multiple times and ensure the results are consistent. Until now,<br />

systems requiring the highest levels of functional safety have<br />

generally had constrained processing requirements. With the<br />

rise of applications such as autonomous driving, there is now a<br />

need to consider how to achieve high levels of safety on high<br />

performance processors.<br />

A. Dual core lockstep (DCLS)<br />

As mentioned earlier, from a software developer’s perspective<br />

one of the simplest methods of improving fault coverage is to<br />

use dual-core lockstep processors. Two identical copies of the<br />

processor logic execute the same code, on the same data, and<br />

the results are continuously compared by the hardware. There<br />

is usually some temporal separation between these copies to<br />

avoid common mode failures (i.e. where a particle strike causes<br />

the same error in both processors and the outputs, although<br />

erroneous, still match).<br />

To software engineers, dual-core lockstep processors offer<br />

simplicity in that they are transparent during normal operation<br />

and software can run unmodified. The hardware will detect an<br />

error when the processors diverge (for example due to a<br />

hardware bit-flip caused by a radiation strike), and this can be<br />

handled by the system. However, there are still features within<br />

a system that software engineers need to consider that can affect<br />

performance. When a fault is detected, this has to be handled at<br />

the system level, and recovery would require the processors to<br />

be reset. There is no method to resynchronize the processors or<br />

detect which processor has the fault.<br />

While effective and simple, DCLS has some costs too. The first<br />

is the lack of flexibility – everything running on a DCLS core<br />

is executed twice, whether it needs to be or not. It also does not<br />

provide for any diversity in software. The other major cost is<br />

physical: it requires additional comparison logic, and a second<br />

copy of the execution logic (memories can be protected by ECC<br />

so don’t need duplicating).<br />

B. Software lockstep<br />

This is an alternative approach to DCLS, sometimes also called<br />

redundant execution. Both approaches take a single set of input<br />

data, operate on it twice, and check the results. However, while<br />

DCLS compares the processor outputs every cycle, in software<br />

lockstep the checking process is under software control. The<br />

main benefit of this approach is flexibility. In a DCLS system,<br />

everything running on the processor is checked, whether<br />

required from a safety perspective or not. Similarly, identical<br />

software is run on both copies of the logic; in the DCLS system,<br />

the checking process is transparent to software.<br />

Software lockstep allows this to be selectively managed – not<br />

everything has to be run in duplicate, either creating more CPU<br />

time for additional processing or allowing a processor to be put<br />

to sleep to save energy. The redundant processors could even<br />

be separate SoCs.<br />

So far, we have assumed that the same software will run on both<br />

processors, but another possibility with software lockstep is to<br />

have diverse software implementations. In this case the<br />

comparison may be checking that both answers are within an<br />

expected range, rather than identical.<br />
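The comparison step can be sketched in a few lines. The channel functions and the tolerance below are illustrative assumptions, not part of any particular lockstep framework:<br />

```c
#include <stdbool.h>

/* Sketch of a software-lockstep check: two channels compute the
 * same quantity and a comparison decides whether they agree. */
typedef double (*channel_fn)(double);

/* Example diverse channels: the same scaling computed two ways. */
double scale_v1(double x) { return x * 0.5; }
double scale_v2(double x) { return x / 2.0; }
/* A faulty channel, to show a detected divergence. */
double scale_bad(double x) { return x * 0.6; }

/* With identical software the tolerance would be zero (exact
 * match); with diverse implementations, agreement within an
 * expected range is accepted instead. */
bool lockstep_check(channel_fn a, channel_fn b, double input,
                    double tolerance)
{
    double diff = a(input) - b(input);
    if (diff < 0.0)
        diff = -diff;
    return diff <= tolerance;
}
```

Unlike DCLS, nothing forces every workload through this path: only the functions that carry a safety requirement need to be executed and checked twice.<br />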

While there are benefits of software lockstep, one of the biggest<br />

challenges of this approach is proving the fault coverage of the<br />

system, particularly if diverse software is used. Remember that<br />

the goal is ultimately to protect against errors in the system<br />

causing harm – if we can’t measure what fault coverage is<br />

achieved, we cannot meet the safety target.<br />

C. Heterogeneous platforms<br />

As application performance requirements continue to grow,<br />

developers are looking to heterogeneous platforms combining<br />

CPUs, GPUs and custom accelerators. At the same time,<br />

software running on these platforms (such as machine learning<br />

algorithms) may be hard to validate to the highest levels of<br />

functional safety.<br />

Use of decomposition techniques may be needed, splitting<br />

applications into different ASILs with appropriate hardware<br />

and software for each part of the system to achieve the safety<br />

goal at system level. This could include using either or both<br />

redundancy techniques already discussed.<br />

Heterogeneous solutions also allow tailoring of the compute to<br />

the workload – for instance using hard real-time processors<br />

mixed with applications class processors – seen in SoCs such<br />

as the Renesas R-Car H3 and Xilinx Zynq UltraScale+.<br />

From a performance perspective, unfortunately there is no<br />

single answer to what the best solution is. However, having a<br />

good understanding of the tradeoffs of the different solutions<br />

will help a system or software designer work out the best option<br />

for their use case.<br />

VII. CONCLUSIONS<br />

Safety certified software, tools and hardware have been in<br />

common use for some time. However, as the markets in which<br />

safety is important continue to grow, ever more developers will<br />

be required to think about these use cases. Through this paper<br />

we have discussed some of the most common hardware features<br />

used to achieve functional safety goals, and the impact they can<br />

have on software performance.<br />

Most important of all is for the developer to understand the<br />

safety requirement of their application use case – what level of<br />

fault coverage is required. Once developers know what their safety<br />

goals are, they can decide which combination of software and<br />

hardware features to use to achieve them.<br />

Many of the features are provided automatically by the<br />

hardware, with little requirement for the software developer to<br />

take action. However, it is important for the software developer<br />

to understand these mechanisms and how they can affect<br />

performance when they are active (e.g. ECC). Other features<br />

require active involvement from the software (such as memory<br />

management), but if done optimally, have minimal impact on<br />

performance, making it possible to sustain high performance<br />

with high levels of functional safety.<br />



Balancing functional safety with performance<br />

intensive systems<br />

Marcus Nissemark<br />

Field Applications Engineer<br />

Green Hills Software<br />

Sweden<br />

marcusn@ghs.com<br />

Abstract—When creating performance intensive systems that<br />

will be used in critical applications like autonomous driving,<br />

walking robots or semi-autonomous clinical operation machines<br />

there are many challenges. The performance requirements drive<br />

the need for fast multicore CPUs and the usage of GPUs for<br />

computation, creating heavy computing platforms running on<br />

general purpose operating systems. There is still a need for real-time<br />

behavior and meeting functional safety requirements in<br />

these scenarios, and such system challenges will be discussed in<br />

this paper.<br />

Keywords—hard real-time, functional safety, separation,<br />

virtualization<br />

I. REAL-TIME BEHAVIOR REQUIREMENTS<br />

A real-time system is one that has well-defined and fixed<br />

time constraints, making it highly deterministic. Processing<br />

must be done within the defined constraints or the system will<br />

fail. This need for predictability is one of the key factors that<br />

drives the need for real-time behavior of the high-performance<br />

system. The performance intensiveness of these systems<br />

relates to the ability to run computation algorithms for<br />

transforming and processing sensor data. The system can have<br />

direct coupled sensors, remote sensors, or even a combination<br />

of these. The input data of these sensors often need to be<br />

correlated in time as different sensors may detect the same<br />

object, but focusing on different properties. The correlation<br />

then requires determinism, possibly through time-stamping, to<br />

make sure that sensor sampling and correlation is done within<br />

a specific time window, to be used in the computation flow.<br />
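The time-window correlation described above can be sketched as follows; the microsecond timestamp unit and the window size are illustrative assumptions:<br />

```c
#include <stdbool.h>
#include <stdint.h>

/* Sketch: deciding whether two time-stamped sensor samples may be
 * fused in the same computation step. */
typedef struct {
    uint64_t timestamp_us; /* when the sample was taken */
    double   value;        /* the measured quantity */
} sample_t;

bool samples_correlatable(const sample_t *a, const sample_t *b,
                          uint64_t window_us)
{
    uint64_t delta = (a->timestamp_us > b->timestamp_us)
                   ? a->timestamp_us - b->timestamp_us
                   : b->timestamp_us - a->timestamp_us;
    /* Only samples taken within the same time window are fused;
     * anything older must be discarded or re-acquired. */
    return delta <= window_us;
}
```

The deterministic part is not this arithmetic but guaranteeing that sampling and this check run with bounded latency, which is what the RTOS scheduling discussed below provides.<br />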

For software execution, this means that we need to be able to<br />

schedule jobs to be executed when certain events occur, with<br />

minimal latency. This is typically something that a Real-Time<br />

Operating System (RTOS) can address and help sort out. In<br />

particular, the coordination of input acquisition, processing,<br />

and output to actuators or further analysis systems is one of<br />

the main drivers to stay away from general purpose OSes.<br />

The high-performance computing systems of today<br />

normally use large 64-bit multicore Systems-on-Chip (SoCs),<br />

running at gigahertz speed using gigabytes of live memory,<br />

optionally controlling GPU, FPGA and/or DSPs for building<br />

an Artificial Intelligence (AI) framework. The complexity of<br />

these systems in terms of hardware configuration drives the<br />

need for an advanced operating system to leverage<br />

controllability of the system, from an application point of<br />

view.<br />

Embedded Linux is frequently chosen as the operating<br />

system, and used in the industry to control such computing<br />

platforms. However, despite many claims that Linux can meet<br />

real-time requirements [1], Linux was never designed to do so,<br />

and you need to patch and change the default kernel operation<br />

to achieve this [2], rendering a lot of the middleware on the<br />

platform unusable. That, however, is a separate discussion.<br />

The general assumption is that you will need a production-grade<br />

RTOS to control your high-performance critical system,<br />

given the real-time constraints and hardware complexity.<br />

Several such operating systems are readily available, but we<br />

need to consider their ability to have or to be able to reach<br />

Functional Safety (FuSa) according to the relevant industrial<br />

standards, as later sections will show. The choice of operating<br />

system is beyond the scope of this paper, but worth noting is<br />

that there are only a handful of these RTOSes available which<br />

fulfill both hard real time and functional safety requirements.<br />

Typical examples, non-exhaustive, are INTEGRITY from<br />

Green Hills Software [3], QNX OS for Safety [5], and<br />

VxWorks from Wind River [4].<br />

II. SAFETY REQUIREMENTS<br />

The next challenge is the overall need for functional safety,<br />

which applies to both hardware and software of the high-performance<br />

system. In the automotive world the standard is<br />

called ISO 26262, which is derived from the industrial<br />

standard IEC 61508 [6]. In a simplified way, functional safety<br />

for software can be interpreted as a requirement that the software<br />

be proven to have a sufficiently low level of systematic faults to<br />

be usable in safety solutions. This level is the amount<br />

of risk reduction achieved by system safety processes and<br />


safety requirements. It is generally described as the safety<br />

integrity level, ASIL for automotive, or SIL for industrial, and<br />

can be illustrated as below in Fig. 1.<br />

[Risk plotted as probability (high–low) against severity (minor–major); greater risk requires a higher Safety Integrity Level]<br />

Fig. 1. Safety Integrity Level illustrated.<br />

In the context of a performance intensive system, possibly<br />

based on a general-purpose operating system solution like<br />

Linux, it means that all of the code running in the system<br />

needs to undergo extensive test and verification as well as deal<br />

with development process-related questions throughout the<br />

lifetime of the software product. The goal of such testing is to<br />

reduce the amount of systematic faults, aka bugs, which could<br />

cause the system to stop fulfilling its function adequately.<br />

Special care needs to be taken for code that is running in<br />

privileged mode (kernel), or that can affect the stability of the<br />

system (drivers, which on Linux also run in the kernel). This of course<br />

includes the operating system code, which in many cases can<br />

be millions of lines of code. Therefore, achieving safety with<br />

such operating systems is not very likely [7].<br />

One of the prerequisites of the software running on a given<br />

hardware platform is that the hardware itself fulfills its<br />

function and executes the code correctly. However, once<br />

deployed, the system may encounter some disturbances,<br />

maybe due to the environment or aging of the system.<br />

Consequently, the system itself needs to additionally take care<br />

of the random hardware faults that will occur, which drives the<br />

need for a safety architecture. Typically, high-performance<br />

systems use hardware that have not been designed with<br />

hardware fault tolerance or self-diagnostics capabilities. This<br />

means that a typical 1oo1D system architecture could be<br />

deployed [8], which adds separate diagnostic coverage to<br />

detect faults as seen in Fig. 2.<br />

[Environment hazards reduced to a remaining risk; Input → Logic → Output, with a separate Diagnostics channel monitoring the path]<br />

Fig. 2. Typical 1oo1D system architecture<br />

Diagnostics allow a detected dangerous event to be converted<br />

into a safe failure, which can be used to increase<br />

confidence in the system and underlying hardware. In turn,<br />

this can increase the safety integrity level of the system. This<br />

is an important and complex effort, and understanding the<br />

safety context is non-trivial. It has to be considered at design<br />

time since safety needs to be designed from the beginning.<br />

Furthermore, in any safety application, safety has precedence<br />

over performance, which means that the system designer must<br />

consider some performance being dedicated to the safety<br />

functions of the system, especially those related to diagnostics.<br />
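A minimal sketch of the 1oo1D idea referred to above: a separate diagnostic check converts an out-of-range (dangerous) output into a safe failure. The range limits and the zero fallback value are assumptions for illustration:<br />

```c
#include <stdbool.h>

/* Sketch of the 1oo1D pattern: a single logic channel whose output
 * is passed on only if a diagnostic check agrees it is sane;
 * otherwise the system falls back to a safe state. */
typedef struct {
    double out;        /* value sent to the actuator */
    bool   safe_state; /* true: output suppressed, safe fallback used */
} actuation_t;

actuation_t oneoo1d_step(double logic_output,
                         double diag_min, double diag_max)
{
    actuation_t act;
    if (logic_output >= diag_min && logic_output <= diag_max) {
        act.out = logic_output; /* diagnostics pass: use the output */
        act.safe_state = false;
    } else {
        act.out = 0.0;          /* detected dangerous event -> safe failure */
        act.safe_state = true;
    }
    return act;
}
```

Note that the diagnostic channel consumes some of the platform's performance on every cycle, which is exactly the design-time budget mentioned above.<br />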

III. SEPARATION AS MITIGATION FOR SAFETY<br />

So, there is a need for mitigations to solve these robustness<br />

challenges. The functional safety standards allow separation of<br />

functionality into different elements; each element can then be<br />

treated at a different level of criticality as long as the<br />

separation method guarantees freedom from interference.<br />

Diagnostic channels can also be separated, to simplify the<br />

safety architecture. The most straightforward separation would<br />

be the division into different hardware components, like<br />

multiple CPUs, or the division of functionality between<br />

heterogeneous CPU cores in modern SoCs.<br />

The division of functionality between homogeneous cores of<br />

a multicore SoC does not suffice, because sufficient separation is<br />

not achieved. For instance, such installations share<br />

caches, memory bus and other chip internals. These systems<br />

need to manage the separation and functionality division in<br />

software. Such separation can be done with techniques like a<br />

separation kernel or a hypervisor.<br />

A. Separate CPUs<br />

When using separate CPUs for performance intensive<br />

systems, they are often divided into high performance CPUs,<br />

possibly with a GPU, and separate lower speed MCUs that<br />

usually bear the safety compliance requirement. In other<br />

words, a performance domain and a safety domain are<br />

configured. These domains need to exchange data, feeding the<br />

calculations with input, and getting processed data back.<br />

Normally this transitions data to and from the safety domain<br />

through some physical connection, typically through simple<br />

buses like UART, SPI, or I2C, but designs using PCI, Ethernet<br />

or shared external memory can also be seen.<br />

Although the physical separation is good from the safety<br />

standard perspective, you still need to consider the<br />

communication path. Complex buses like Ethernet and PCI<br />

may require that the device driver also goes through the<br />

formal process of functional safety certification if there is any<br />

point of interference in that path, and even when running on<br />

safety MCUs that use an MPU for memory protection it is a non-trivial<br />

task. This is one of the reasons safety solutions tend to<br />

use the simpler buses, as they are easier to prove to be correct.<br />
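One reason the simpler buses are easier to argue about is that their framing can be protected with very small, analysable mechanisms. A sketch of an additive-checksum frame of the kind such inter-domain links often carry; the frame layout is an assumption for illustration, not any specific protocol:<br />

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* Sketch: payload integrity for a message sent over a simple bus
 * (UART/SPI/I2C) between the safety and performance domains. */
uint8_t frame_checksum(const uint8_t *payload, size_t len)
{
    uint8_t sum = 0;
    for (size_t i = 0; i < len; i++)
        sum = (uint8_t)(sum + payload[i]);
    return (uint8_t)(~sum + 1); /* two's complement of the byte sum */
}

bool frame_valid(const uint8_t *payload, size_t len, uint8_t checksum)
{
    /* The receiver recomputes and compares, so corruption on the
     * bus is detected (within the limits of an 8-bit additive
     * checksum; a CRC would give stronger coverage). */
    return frame_checksum(payload, len) == checksum;
}
```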

An example architecture of hardware separation can be seen in<br />

Fig. 3.<br />


Fig. 3. Example of hardware separation architecture<br />

However, those simpler buses may come with a<br />

performance penalty, compared to the more complex but faster<br />

buses. If your system needs to move a lot of data between<br />

different executing entities, i.e. between the safety and<br />

performance domains, the bus usage can become a bottleneck<br />

in both the performance and safety MCU, as not enough data<br />

can be transferred between the domains. Additionally, this<br />

architecture is not flexible in algorithm scalability; if some<br />

performance intensive algorithm needs to run in the safety<br />

domain there is a limit to the execution capabilities the smaller<br />

MCUs can handle.<br />

The smaller MCU may very well be running an RTOS to<br />

alleviate some of the constraints above, but the main drawback<br />

of running safety algorithms on a low performance MCU still<br />

exists. Furthermore, the performance CPU may also need to<br />

run a real-time operating system, not necessarily safety<br />

critical, but still capable of dealing with the real-time<br />

predictability requirements of the system.<br />

B. Hardware consolidation / Heterogeneous core<br />

architectures<br />

The other side of the hardware separation is hardware<br />

consolidation. Better integrated System-on-Chip (SoC) start to<br />

propose a small core with a dedicated memory, clock and<br />

power management on the same die as the bigger systems, to<br />

allow higher-performance communication. SoC vendors try to<br />

bring these separate CPUs and MCUs into one SoC, we have<br />

seen examples like Xilinx Zynq Ultrascale MPSoC [9],<br />

Renesas R-Car H3 [10], or NXP i.MX 8 [11]. An example can<br />

be seen in Fig. 4.<br />

[Fig. 4. Hardware consolidation separation architecture: a safety<br />

domain (ASIL A and ASIL C applications on a safety RTOS, on an<br />

MCU core) and a performance domain (QM applications on a<br />

performance RTOS, on the CPU core(s)), within a single SoC]<br />
When doing so they need to consider, design and test that<br />

the separation between the different cores is free from<br />

interference, as the performance domain and the safety domain<br />

separation still must exist. This means that no dynamic<br />

changes on one side shall allow the other side to behave<br />

differently. If the cores share configuration registers, memory<br />

buses or caches, these are apparent sources of interference which<br />

need extra protection. In these cases, the SoC vendor needs to<br />

assure us that the separation is proven for usage according to the<br />

safety standard. This is non-trivial, as these modern SoCs have<br />

vast functionality, and keeping everything under control is critical.<br />
Clearly this type of SoC removes some of the<br />

communication and data exchange bottlenecks, at least if they<br />

can safely use SoC-internal shared memory or similar paths.<br />

But, there is still the issue of scalability of algorithms between<br />

safety domain and performance domain, as the safety side<br />

normally is locked down to the smaller MCU that now is built<br />

into the large SoC.<br />

What is needed is a way to run both safety algorithms and<br />

performance algorithms on the same cores in any SoC. In<br />

other words, there is a need for high-level software that<br />

can provide isolation and separation. A common misconception about<br />

software separation is that virtualization using a<br />

hypervisor is the only way to achieve it. However, the separation<br />

kernel architecture must also be considered.<br />

C. Software separation kernels<br />

The software separation kernel was originally presented by<br />

John Rushby in a 1981 paper [12]. He describes this as "the<br />

task of a separation kernel is to create an environment which is<br />

indistinguishable from that provided by a physically<br />

distributed system: it must appear as if each regime is a<br />

separate, isolated machine and that information can only flow<br />

from one machine to another along known external<br />

communication lines. One of the properties we must prove of<br />

a separation kernel, therefore, is that there are no channels for<br />

information flow between regimes other than those explicitly<br />

provided." Therefore, a proper separation kernel provides<br />

isolation equivalent to hardware isolation.<br />
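Rushby's property can be pictured as a static channel table enforced on the kernel's send path: information may flow between regimes only along explicitly configured channels. The partition names and the channel table below are illustrative assumptions:<br />

```c
#include <stdbool.h>

/* Sketch of a separation-kernel flow policy.  The partitions and
 * the allowed channels are invented for illustration. */
#define NUM_PARTITIONS 3

enum { PART_SAFETY = 0, PART_PERF = 1, PART_DIAG = 2 };

/* allowed[src][dst] is true only for explicitly provided channels. */
static const bool allowed[NUM_PARTITIONS][NUM_PARTITIONS] = {
    /* to:  SAFETY PERF   DIAG  */
    { false, true,  true  },  /* from SAFETY */
    { true,  false, false },  /* from PERF   */
    { false, false, false },  /* from DIAG   */
};

/* The kernel's send path: any flow outside the table is refused,
 * so no information channel exists between regimes other than
 * those explicitly configured. */
bool kernel_send(int src, int dst)
{
    if (src < 0 || src >= NUM_PARTITIONS ||
        dst < 0 || dst >= NUM_PARTITIONS)
        return false;
    return allowed[src][dst];
}
```

In a certified separation kernel this policy is fixed at configuration time and itself forms part of the safety or security evidence.<br />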

Originally designed for security solutions, separation<br />

kernels can also be very useful in safety applications. The<br />

separation kernel solution allows for a division of applications<br />

into multiple levels of criticality, which significantly helps in<br />

the overall system architecture, i.e. creating a performance<br />

domain and a safety domain within the same RTOS. Green<br />

Hills Software’s INTEGRITY RTOS is an example of an<br />

RTOS separation kernel architecture, and Wind River’s<br />

VxWorks also provides a separation kernel profile. Both these<br />

operating systems, and a few others, have also undergone<br />

formal proof or functional safety pre-certifications, as<br />

previously mentioned. Therefore, these solutions already<br />

provide safety evidence that they have undergone the testing<br />

and scrutiny to claim that their solutions are adhering to the<br />

safety standard [13].<br />

On such a system, only a subset of the applications need to<br />

be assigned a safety integrity level, and the rest can be kept as<br />

regular quality code. This works in favor of scalable safety<br />


algorithms, as you can run such performance algorithms<br />

within the safety domain, but you do not have to run all of<br />

them there. Because code in applications is isolated by design<br />

from code in other applications, there is no longer a need to<br />

test and certify all the code at the highest level. This helps<br />

with process-related items when following ISO 26262 or other<br />

standards, because the effort in certification is as a<br />

consequence greatly lowered. Thus, the safety domain and the<br />

performance domain will run on the same SoC, as long as the<br />

separation kernel supports the actual cores of the SoC. This<br />

can be exemplified with Fig. 5.<br />

Fig. 5. Separation kernel architecture with a safety domain and a performance<br />

domain.<br />

The performance of these systems is also important, but it is<br />

assumed that most real-time operating systems actually work in<br />

favor of general performance. System performance is itself a vast<br />

topic, which means there is no trivial formula for measuring it. Operating<br />

systems can make heavy use of caching, with an expectation<br />

that some operations will be cache hits (probably fast) and<br />

some will be cache misses (probably slow). Interrupts often<br />

occur at unpredictable times, possibly resulting in altered<br />

performance of the code they break up. Different scheduling<br />

disciplines will insert varying delays into the performance of<br />

individual processes. Even if you have access to the operating<br />

system code, down at the hardware level there are caches and<br />

pipelines and other optimizations that you cannot see at all,<br />

except that they produce varying performance results. The<br />

end-user application still needs to be optimized in the context<br />

in which it will execute; only then can the real performance be<br />

measured.<br />

Another useful benefit of separation kernels is that, besides<br />

the original design idea to support security, they also allow for<br />

consolidation of multiple software functions on the<br />

same SoC, while still being logically separated; as long as the<br />

platform has sufficient performance capability, it is reasonable<br />

to integrate more and more independent functionalities on the<br />

platform without hardware redesign.<br />
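The consolidation described above is typically realized through strict time and space partitioning. The following sketch shows the idea of a fixed, cyclic partition schedule; the partition names and window lengths are hypothetical, not taken from any particular separation kernel:

```python
# Minimal sketch of the fixed, cyclic time partitioning a separation
# kernel can use to consolidate independent functions on one SoC.
# Partition names and window lengths are hypothetical examples.

MAJOR_FRAME = [            # one repeating major frame, in milliseconds
    ("ASIL-C partition", 4),
    ("ASIL-A partition", 2),
    ("QM partition",     4),
]

def partition_at(t_ms):
    """Return the partition that owns CPU time t_ms into the schedule."""
    frame_len = sum(width for _, width in MAJOR_FRAME)
    offset = t_ms % frame_len          # position inside the major frame
    for name, width in MAJOR_FRAME:
        if offset < width:
            return name
        offset -= width

# Adding another QM function only extends the major frame; the safety
# partitions keep their guaranteed windows without hardware redesign.
```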

D. Separation through virtualization<br />


The other method for software separation is generally called hardware virtualization, which is related to, but distinct from, the separation kernel approach. Hardware virtualization means that special execution modes are used to allow multiple<br />

operating systems to share the CPU and memory of the SoC.<br />

The framework that allows this is called a hypervisor, or<br />

virtual machine monitor, which creates these virtual machine<br />

instances that can run different operating systems on a single<br />

physical hardware platform. Virtualization can be hardware<br />

accelerated through the usage of modern CPU features like<br />

ARM-VE and Intel VT-x, generally called hardware-accelerated virtualization. This type of virtualization is<br />

different from operating system virtualization, where the<br />

instances, or so-called containers, share a single operating<br />

system kernel. The containers are not covered in this paper.<br />

There are several examples of hardware virtualization<br />

hypervisors, like the open-source Xen Project [17] and PikeOS<br />

from Sysgo [18]. Introducing hypervisors in the software<br />

architecture for safety is not without challenges because it<br />

adds complexity. The hypervisor needs to do memory<br />

separation and protection as well as scheduling of different<br />

workloads and management of the privilege levels of the<br />

virtualized guest operating systems. The hypervisor itself<br />

becomes the highest privileged software in the system, which<br />

means that in the context of safety applications, the hypervisor<br />

itself also needs to be considered safety relevant. In essence, a<br />

hypervisor is scheduling several operating systems like a<br />

separation kernel schedules applications.<br />

Hypervisors can also be categorized: into native or bare-metal hypervisors (Type 1) and hosted hypervisors (Type 2).<br />

Type 1 hypervisors run directly on the hardware, see Fig. 6,<br />

while hosted hypervisors run similarly to regular applications<br />

on the actual operating system. The distinction between the<br />

types is not clear as some configurations can be ambiguous. In<br />

some Type 1 implementations there is also an initial guest<br />

domain with higher privilege levels and dedicated peripheral<br />

access, called domain 0. The latter is typically seen in the Xen<br />

hypervisor solution. It is clear that such solutions would also<br />

require that the entire domain 0 guest is safety-relevant in a<br />

safety implementation.<br />

Fig. 6. Typical Type 1 hypervisor separation architecture.<br />

Besides the general need for separation to achieve safety,<br />

the main benefit of using virtualization in the context of safety is the ability to re-use high-performance algorithms<br />

designed for an operating system like Linux, while providing<br />

an efficient path to isolate such applications from safety<br />



applications running in the same system in a virtualized safety<br />

RTOS. Furthermore, the hypervisor should allow IPC (Inter<br />

Process Communication) mechanisms to support data transfer between the safety domain and the performance domain, as this type of data exchange is important.<br />
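As a rough illustration of such an inter-domain exchange, the sketch below models a bounded, validated channel between the performance and safety domains. All names are hypothetical; a real hypervisor would expose this via shared memory and virtual interrupts rather than Python objects:

```python
from collections import deque

# Sketch of a hypervisor-mediated IPC channel between a performance
# domain (producer) and a safety domain (consumer). The channel is
# bounded so the performance domain cannot exhaust safety-domain
# resources, and every message is validated before the safety domain
# consumes it. All names are illustrative, not a real hypervisor API.

class DomainChannel:
    def __init__(self, capacity=8):
        self.queue = deque()
        self.capacity = capacity

    def send(self, msg):
        """Called from the performance domain; drops when full."""
        if len(self.queue) >= self.capacity:
            return False               # back-pressure, never block the safety side
        self.queue.append(msg)
        return True

    def receive(self, validate):
        """Called from the safety domain; returns only validated data."""
        while self.queue:
            msg = self.queue.popleft()
            if validate(msg):
                return msg
        return None                    # nothing trustworthy available
```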

The drawback is that Type 1 virtualization of operating<br />

systems can add undesired latency to event handling, which<br />

causes issues in real-time application scenarios and harms determinism in general. Furthermore, scheduling of workloads across the<br />

cores on the SoC becomes non-trivial and can affect<br />

performance negatively.<br />

Type 2 hypervisors are normally seen as lacking performance, and are limited by the host operating system's safety and security solutions. This is because the hypervisor does not take advantage of hardware acceleration, does not allow direct device assignment (pass-through), and/or runs on a non-real-time operating system.<br />

Using a separation architecture microkernel can overcome<br />

those drawbacks, creating an interesting hypervisor solution.<br />

Furthermore, native applications in such a scenario can meet<br />

both safety and security requirements, as well as real-time low<br />

latency requirements. An example of this hypervisor<br />

architecture is Green Hills INTEGRITY Multivisor [16]. This<br />

alternative hypervisor architecture is illustrated below in Fig.<br />

7.<br />

Fig. 7. Separation kernel based Type 2 hypervisor architecture.<br />
IV. AI AND GPU PROCESSING IN A SAFETY CONTEXT<br />

Many recent high-performance computing systems do not use only the CPU for computation; it is also common to use GPU solutions to accelerate algorithms that can execute in multiple parallel implementations. This is normally done<br />

through programming extensions like OpenCL or the<br />

proprietary framework CUDA [15]. In fact, implementations<br />

of deep learning algorithms or other neural network<br />

techniques can make use of the massive parallelism that GPU<br />

acceleration provides, which basically is the driving force for<br />

the development of Artificial Intelligence (AI) solutions today<br />

[14].<br />

Normally, the GPU is under the control of a general-purpose operating system and its device drivers, which feed the GPU with massively parallel executable workloads. This is basically how frameworks like OpenCL work. In the<br />

context of a safety system, the control functions for these workloads and the resulting outcome may be the safety-critical information that needs extra protection. Therefore, it is a fair assumption that the controlling system must itself be capable of meeting safety requirements. Then workload control and workload results can be made safety-critical, while the actual computation is not.<br />

This further separates the safety and performance domains<br />

into separate processing entities, but can make use of fast<br />

CPUs for controlling the even faster GPU calculations in a safe manner.<br />
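The control path described above can be sketched as follows; the GPU computation is mocked by a plain function, since only the dispatch and the result checks lie in the safety domain. All function names are illustrative:

```python
# Sketch of the control path described in the text: the safety domain
# dispatches workloads and plausibility-checks the outcome, while the
# actual computation (mocked here by a plain Python function) remains
# outside the safety-critical boundary. All names are illustrative.

def gpu_compute(workload):
    """Stand-in for an OpenCL/CUDA kernel launch in the QM domain."""
    return [x * x for x in workload]

def safe_dispatch(workload, plausible):
    """Safety-side dispatch: submit work, then validate the results."""
    result = gpu_compute(workload)        # not safety-critical
    if len(result) != len(workload):      # safety-critical checks
        raise RuntimeError("result size mismatch")
    if not all(plausible(r) for r in result):
        raise RuntimeError("implausible GPU result")
    return result
```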

Both the separation kernel architecture and the<br />

virtualization solution can support this control path, although<br />

using virtualization adds an extra transition path for the required safety-critical data. Such data needs to travel to and from the safety domain via the virtualized guest operating system on its way from and to the GPU, creating a usable but complex path.<br />

V. CONCLUSION<br />

Balancing functional safety with performance-intensive<br />

systems requires separation through hardware or software. The<br />

hardware separation is limited in scalability and data<br />

transportation, but provides an easily proven safety versus<br />

performance domain separation. Software separation on the<br />

other hand provides scalable solutions for performance versus<br />

safety, but sometimes requires a complex software solution<br />

through virtualization. The middle path of using a separation<br />

kernel architecture provides the least complex path for<br />

software separation, allowing for scalable safety applications<br />

running on high performance CPU cores. When it comes to<br />

controlling GPU computations for safety, the controlling<br />

system most likely also needs to be capable of meeting safety<br />

requirements, or at least involve a hypervisor based control<br />

solution which incorporates safety.<br />

VI. REFERENCES<br />

[1] Intro to Real-Time Linux for Embedded Developers<br />

https://www.linuxfoundation.org/blog/intro-to-real-timelinux-for-embedded-developers<br />

[2] HOWTO setup Linux with PREEMPT_RT properly<br />

https://wiki.linuxfoundation.org/realtime/documentation/<br />

howto/applications/preemptrt_setup<br />

[3] INTEGRITY Real-Time Operating System<br />

https://www.ghs.com/products/rtos/integrity.html<br />

[4] VxWorks Safety Profile Product Overview<br />

https://www.windriver.com/products/product-<br />

overviews/Safety-Profile-for-VxWorks_Product-<br />

Overview/<br />

[5] QNX OS for Safety<br />

http://blackberry.qnx.com/en/products/certified_os/safekernel<br />

[6] ISO26262 Wikipedia<br />

https://en.wikipedia.org/wiki/ISO_26262<br />

www.embedded-world.eu<br />

526


[7] R H Pierce, “Preliminary assessment of Linux for safety<br />

related systems”, 2002. ISBN 0 7176 2538 9<br />

[8] 1oo1D Architecture,<br />

http://www.globalspec.com/reference/76370/203279/1oo<br />

1d-architecture<br />

[9] Zynq UltraScale+ MPSoC product site<br />

https://www.xilinx.com/products/silicondevices/soc/zynq-ultrascale-mpsoc.html<br />

[10] Renesas Generation 3 Automotive Computing Platform<br />

https://www.renesas.com/enus/solutions/automotive/products/rcar-h3.html<br />

[11] i.MX 8 Series Application Processors<br />

https://www.nxp.com/products/processors-andmicrocontrollers/applications-processors/i.mxapplications-processors/i.mx-8-processors:IMX8-<br />

SERIES<br />

[12] John Rushby, "The Design and Verification of Secure<br />

Systems," 8th ACM Symposium on Operating System<br />

Principles, pp. 12-21, Asilomar, CA, December 1981.<br />

(ACM Operating Systems Review, Vol. 15, No. 5).<br />

[13] Safety Automation Equipment List by Exida<br />

http://www.exida.com/SAEL/Green-Hills-Software-<br />

INTEGRITY-RTOS<br />

[14] Jensen Huang, “Accelerating AI with GPUs: A New<br />

Computing Model”, 2016<br />

https://blogs.nvidia.com/blog/2016/01/12/accelerating-aiartificial-intelligence-gpus/<br />

[15] General purpose computing on graphics processing units<br />

https://en.wikipedia.org/wiki/Generalpurpose_computing_on_graphics_processing_units<br />

[16] INTEGRITY Multivisor, Virtualization Architecture for<br />

Secure Systems<br />

https://www.ghs.com/products/rtos/integrity_virtualizatio<br />

n.html<br />

[17] The open source standard for hardware virtualization<br />

https://xenproject.org/users/virtualization.html<br />

[18] PikeOS Hypervisor<br />

https://www.sysgo.com/products/pikeos-hypervisor/<br />



Obtaining Worst-Case Execution Time Bounds on<br />

Modern Microprocessors<br />

Daniel Kästner, Markus Pister, Simon Wegener, Christian Ferdinand<br />

AbsInt GmbH<br />

D-66123 Saarbrücken, Germany<br />

info@absint.com<br />

Abstract—Many embedded control applications have real-time requirements. If the application is safety-relevant, worst-case execution time bounds have to be determined in order to demonstrate deadline adherence. If the microprocessor is timing-predictable, worst-case execution time guarantees can be computed by static WCET analysis. For high-performance multi-core architectures with degraded timing predictability, WCET<br />

bounds can be computed by hybrid WCET analysis which<br />

combines static analysis with timing measurements. This article<br />

summarizes the relevant criteria for assessing timing predictability,<br />

gives a brief overview of static WCET analysis and<br />

focuses on a novel hybrid WCET analysis based on non-intrusive<br />

real-time instruction-level tracing.<br />

Keywords— worst-case execution time, static analysis, real-time<br />

tracing, timing predictability, path analysis, functional safety<br />

I. INTRODUCTION<br />

In real-time systems the overall correctness depends on the<br />

correct timing behavior: each real-time task has to finish<br />

before its deadline. All current safety standards require reliable<br />

bounds of the worst-case execution time (WCET) of real-time<br />

tasks to be determined.<br />

With end-to-end timing measurements, timing information is determined only for one concrete input. Due to caches and<br />

pipelines the timing behavior of an instruction depends on the<br />

program path executed before. Therefore, usually no full test<br />

coverage can be achieved and there is no safe test end criterion.<br />

Techniques based on code instrumentation modify the code<br />

which can significantly change the cache and pipeline behavior<br />

(probe effect): the times measured for the instrumented<br />

software do not necessarily correspond to the timing behavior<br />

of the original software.<br />
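The probe effect can be illustrated with a toy direct-mapped instruction cache: a single inserted probe instruction shifts the addresses of the code behind it and changes the conflict-miss pattern, so the instrumented binary's timing no longer reflects the original. All addresses and cache parameters are invented:

```python
# Toy demonstration of the probe effect: a direct-mapped instruction
# cache with 4 lines. Inserting one probe instruction shifts later code
# addresses, which changes the conflict-miss pattern, so the measured
# (instrumented) behavior differs from the original program's behavior.
# All sizes and addresses are invented for illustration.

def miss_count(trace, lines=4):
    cache = [None] * lines
    misses = 0
    for addr in trace:
        idx = addr % lines            # direct-mapped index
        if cache[idx] != addr:
            misses += 1
            cache[idx] = addr
    return misses

original = [0, 4] * 10       # loop: both instructions collide in line 0
instrumented = [0, 5] * 10   # one probe shifts the second instruction
```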

One safe method for timing analysis is static analysis by<br />

Abstract Interpretation which provides guaranteed upper<br />

bounds for WCET of tasks. Static WCET analyzers are<br />

available for complex processors with caches and complex<br />

pipelines, and, in general, support single-core processors and<br />

multi-core processors. A prerequisite is that good models of the<br />

processor/System on-Chip (SoC) architecture can be<br />

determined. However, there are modern high performance<br />

SoCs which contain unpredictable and/or undocumented<br />

components that influence the timing behavior. Analytical<br />

results for such processors are unrealistically pessimistic.<br />

A hybrid WCET analysis combines static value and path<br />

analysis with measurements to capture the timing behavior of<br />

tasks. Compared to end-to-end measurements the advantage of<br />

hybrid approaches is that measurements of short code snippets<br />

can be taken which cover the complete program under analysis.<br />

Based on these measurements a worst-case path can be<br />

computed. The hybrid WCET analyzer TimeWeaver avoids the<br />

probe effect by leveraging the embedded trace unit (ETU) of<br />

modern processors, like Nexus 5001 [16], which allows a<br />

fine-grained observation of a core’s program flow.<br />

TimeWeaver reads the executable binary, reconstructs the<br />

control-flow graph and computes ranges for the values of<br />

registers and memory cells by static analysis. This information<br />

is used to derive loop bounds and prune infeasible paths. Then<br />

the trace files are processed and the path of longest execution<br />

time is computed. The computed time estimate provides<br />

valuable feedback for assessing system safety and for<br />

optimizing worst-case performance. TimeWeaver also provides<br />

feedback for optimizing the trace coverage: paths for which<br />

infeasibility has been proven need no measurements; loops for<br />

which the analyzed worst-case iteration count has not been<br />

measured are reported.<br />
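The final step, computing the path of longest execution time from measured snippet times, can be sketched as follows. This is a simplified illustration of the principle (longest path over a loop-collapsed control-flow graph), not TimeWeaver's actual algorithm; all block names and cycle counts are hypothetical:

```python
# Simplified hybrid-WCET illustration: each basic block is weighted
# with its worst observed trace time, loops with their statically
# derived iteration bound, and the longest path through the acyclic,
# loop-collapsed CFG is computed. Not TimeWeaver's actual algorithm.

def wcet_estimate(cfg, entry, exit, worst_time, loop_bound):
    """cfg: block -> successor list; times in cycles; bounds >= 1."""
    order, seen = [], set()
    def dfs(b):                       # post-order DFS for a topological order
        seen.add(b)
        for s in cfg.get(b, []):
            if s not in seen:
                dfs(s)
        order.append(b)
    dfs(entry)
    longest = {b: float("-inf") for b in seen}
    longest[entry] = worst_time[entry] * loop_bound.get(entry, 1)
    for b in reversed(order):         # relax edges in topological order
        for s in cfg.get(b, []):
            cost = worst_time[s] * loop_bound.get(s, 1)
            longest[s] = max(longest[s], longest[b] + cost)
    return longest[exit]
```

For a diamond-shaped CFG A → {B, C} → D with worst observed times A=10, B=50, C=30, D=5 cycles and a static bound of 4 iterations on loop block B, the estimate follows the B branch: 10 + 4·50 + 5 = 215 cycles.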

In this article we give an overview of timing predictability<br />

in general and provide criteria for selecting suitable WCET<br />

analysis methods. We will outline the methodology of hybrid<br />

WCET analysis and report on practical experience with the tool<br />

TimeWeaver.<br />

II. TIMING PREDICTABILITY<br />

In general, a system is predictable if it is possible to predict<br />

its future behavior from the information about its current state.<br />

We consider predictability under the assumption that the<br />

hardware works without unexpected errors. Hardware faults<br />

like soft errors or transient faults have to be addressed by<br />

specific error handling mechanisms to ensure overall system<br />

safety.<br />

In [4] the program input and the hardware state in which<br />

execution begins are identified as the primary sources of<br />

uncertainty in execution time. Hardware-related timing<br />

predictability can be expressed as the maximal variance in<br />

execution time due to different hardware states for an arbitrary<br />

but fixed input. Analogously, software-related timing<br />

predictability corresponds to the maximal variance in<br />

execution time due to different inputs for an arbitrary but fixed<br />



hardware state. A basic assumption is uninterrupted program<br />

execution without interferences. In a concurrent system,<br />

interferences due to concurrent execution additionally have to<br />

be taken into account.<br />

To ensure the correct timing behavior it is necessary to<br />

demonstrate the deadline adherence of each task. To this end,<br />

the worst-case execution time of each task has to be<br />

determined, i.e. the concept of software-related predictability<br />

as defined above can be reduced to the predictability of the<br />

worst-case execution path.<br />

This leads to the following two main criteria for execution<br />

time predictability:<br />

• It must be possible to determine an upper bound of the<br />

maximal execution time which is guaranteed to hold.<br />

• To enable precise bounds on the maximal execution time to be determined, the behavioral variance, i.e. the maximal variance in execution time due to different hardware states, has to be as low as possible. In general, the larger the behavioral variance is,<br />
o the more the execution time depends on the execution history,<br />
o the less meaningful is one particular execution time measurement in a specific execution context, and<br />
o the larger can be the gap between the largest measured execution time and the true worst-case execution time.<br />
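The role of behavioral variance can be made concrete with invented numbers: measuring the same code path from different initial hardware states yields a wide band of execution times, and any single measurement may badly underestimate the true worst case:

```python
# Toy illustration of behavioral variance: execution times (cycles) of
# the same code path measured from different initial hardware states.
# All numbers are made up for illustration only.

samples = {
    "cold caches": 900,
    "warm caches": 120,
    "warm caches, mispredicted branch": 310,
}

# The spread across hardware states is the behavioral variance band.
variance_band = max(samples.values()) - min(samples.values())

# One arbitrary measurement can miss the worst case by a wide margin.
single_measurement = samples["warm caches, mispredicted branch"]
underestimation = max(samples.values()) - single_measurement
```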

Even in single-core processors timing predictability is<br />

compromised by performance-enhancing hardware<br />

mechanisms like caches, pipelines, out-of-order execution,<br />

branch prediction and other mechanisms for speculative<br />

execution, which can cause significant variations in timing<br />

depending on the hardware state. Interestingly, hardware<br />

speculation has recently been discovered to constitute a critical<br />

security vulnerability [21, 19].<br />

For multi-core processors all challenges to timing<br />

predictability are relevant that apply to single-core processors.<br />

In addition, there are new challenges imposed by the multi-core<br />

design. In the following we will first discuss timing<br />

predictability on single-core processors and then address<br />

specific challenges for multi-core processors.<br />

A. Single-Core Processors<br />

For simple non-pipelined architectures adding up the<br />

execution times of individual instructions is enough to obtain a<br />

bound on the execution time of a basic block. However,<br />

modern embedded processors try to maximize instruction-level parallelism by sophisticated performance-enhancing<br />

features, like caches, pipelines, or speculative execution.<br />

Pipelines increase performance by overlapping the executions<br />

of consecutive instructions. For timing measurements this<br />

means that there may be big variations between the execution<br />

times measured with different starting states of the hardware.<br />

Furthermore, there may be a significant gap between the largest<br />

measured execution time and the true worst-case execution<br />

time. For a timing analyzer it means that it is not feasible to<br />

consider individual instructions in isolation. Instead, they have<br />

to be analyzed collectively—together with their mutual<br />

interactions—to obtain tight timing bounds. In the following<br />

we will give an overview of timing-relevant hardware features<br />

and discuss their effect on timing measurements and on static<br />

analysis methods.<br />

In general, the challenges for timing analysis of single-core<br />

architectures originate from the complexity of the particular<br />

execution pipeline and the connected hardware devices.<br />

Commonly used performance-enhancing features are caches,<br />

pipelines, out-of-order execution, speculative execution<br />

mechanisms like static/dynamic branch prediction and branch<br />

history tables, or branch target instruction caches. Many of<br />

these hardware features can cause timing anomalies [29] which<br />

render WCET analysis more difficult. Intuitively, a timing<br />

anomaly is a situation where the local worst-case does not<br />

contribute to the global worst-case. For instance, a cache miss<br />

—the local worst-case—may result in a globally shorter<br />

execution time than a cache hit because of hardware scheduling<br />

effects. In consequence, it is not safe to assume that the<br />

memory access causes a cache miss; instead both machine<br />

states have to be taken into account. An especially difficult class of timing anomalies are domino effects [22]: a system exhibits a<br />

domino effect if there are two hardware states s, t such that the<br />

difference in execution time (of the same program starting in s,<br />

t respectively) may be arbitrarily high. E.g., given a program<br />

loop, the executions never converge to the same hardware state<br />

and the difference in execution time increases in each iteration.<br />

In consequence, loops have to be analyzed very precisely and<br />

the number of machine states to track can grow high. For<br />

timing measurements this means that the difference between<br />

measured and true worst-case execution time caused by an<br />

incomplete hardware state coverage can grow arbitrarily high.<br />
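A contrived toy model illustrates the anomaly: here the local worst case (a first access that misses) produces the globally shorter run, because a hit would leave the pipeline free for speculative prefetching that evicts a line needed later. All latencies are invented:

```python
# Contrived toy model of a timing anomaly: a first access that HITS in
# the cache leaves the pipeline free to speculatively prefetch, which
# evicts the line needed by the second access. The local worst case (a
# miss on the first access) therefore leads to the globally shorter
# execution. All latencies are invented for illustration.

HIT, MISS = 1, 100          # access latencies in cycles
FIRST_MISS_STALL = 10       # the first access misses in a small buffer

def total_cycles(first_access_hits):
    if first_access_hits:
        t = HIT               # fast first access ...
        second_hits = False   # ... but speculation evicted the second line
    else:
        t = FIRST_MISS_STALL  # slow first access squashes speculation
        second_hits = True    # the second line survives in the cache
    t += HIT if second_hits else MISS
    return t
```

The local worst case (miss, 11 cycles total) is globally better than the local best case (hit, 101 cycles total), so an analysis may not simply assume the miss.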

The article [37] categorizes the timing compositionality of<br />

computing architectures according to the presence of timing<br />

anomalies. Fully compositional architectures, such as the<br />

ARM7, contain no timing anomalies; individual components,<br />

e.g., basic blocks, can be considered separately and their worst-case information can be combined. Compositional architectures<br />

only contain bounded timing effects, i.e., additional delays<br />

(e.g., due to an access to a shared resource or due to a<br />

preemption or interrupt) can be bounded by a constant and<br />

added to the local worst-case figures (e.g. TriCore 1797). Non-compositional architectures contain domino effects, i.e.,<br />

unbounded anomalies (e.g. PowerPC 755). Depending on the<br />

state of the pipeline and the predictors, the occupancy of<br />

functional units, and the contents of the caches—i.e., the<br />

execution history—an instruction needs only a few or several<br />

hundred cycles to complete its execution [8]. A rigorous<br />

definition of compositionality is given in [14].<br />

As the runtime of embedded control software often is<br />

dominated by load/store operations, memory subsystems<br />

nowadays introduce queues before the caches to buffer these operations<br />

and overcome early stall conditions like cache misses. Often<br />

this is complemented by fast data forwarding for consecutive<br />

accesses into cache lines that have already been requested by<br />

previous pending instructions, where the requested data might<br />

already be present in the core. This helps to reduce the number<br />

of transactions over the slow system bus. In the abstract model<br />

of the timing analysis, the representation of these hardware<br />

features has to be close to the concrete hardware to achieve<br />



satisfactory analysis precision. Due to their size, especially the<br />

dynamic branch prediction and the branch history tables<br />

consume a significant number of bits in the abstract state<br />

representation which increases the memory consumption of the<br />

analysis. Unknown or not precisely known effective addresses<br />

of memory requests further increase the timing analysis search<br />

space due to the number of possible scenarios (cache hit/miss,<br />

fast data forward or not, …). Concerning processor caches,<br />

both precision and efficiency depend on the predictability of<br />

the employed replacement policy [28, 8]. The Least-Recently-<br />

Used (LRU) replacement policy has the best predictability<br />

properties. Employing other policies, like Pseudo-LRU<br />

(PLRU), First-In-First-Out (FIFO), or Random, yields less<br />

precise WCET bounds because fewer memory accesses can be<br />

precisely classified. Furthermore, the efficiency degrades<br />

because the analysis has to explore more possibilities. Another<br />

deciding factor is the write policy. Typically, there are two<br />

main options: write-through where a store is directly written to<br />

the next level in the memory hierarchy, and write-back where<br />

the data is written into the next hierarchy level if the concrete<br />

memory cell is evicted from the cache. The write-back policy<br />

induces timing uncertainty because the precise point in time<br />

when the write-back occurs is hard to predict; for example, it<br />

might happen after a task switch and slow down a different<br />

(and possibly higher-priority) task than the one that issued the<br />

store operation in the first place. Another timing analysis<br />

challenge is to model processor external devices which are<br />

typically connected with the caches over the system bus. Such<br />

devices are memory controllers for static (SRAM, Flash) or<br />

dynamic memory (DRAM, DDR or QDR) or controllers for<br />

system communication (CAN, FlexRay, AFDX). The<br />

corresponding bus protocol and memory chip timing have to be<br />

modeled precisely.<br />
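The predictability advantage of LRU noted above can be demonstrated with a small simulation of a single fully-associative set: after k distinct accesses, a k-way LRU set reaches a state that is independent of its initial contents, while a FIFO set keeps depending on its starting state:

```python
# Sketch of why LRU is more predictable than FIFO: after accessing k
# distinct blocks, a k-way LRU set holds exactly those blocks in
# recency order regardless of its initial contents, whereas a FIFO set
# still depends on the starting state. Model: one fully-associative
# set, list index 0 = most recent / newest entry.

def run_lru(state, accesses, ways=4):
    state = list(state)
    for b in accesses:
        if b in state:
            state.remove(b)          # hit: move to most-recent position
        elif len(state) >= ways:
            state.pop()              # miss: evict least-recently-used
        state.insert(0, b)
    return state

def run_fifo(state, accesses, ways=4):
    state = list(state)
    for b in accesses:
        if b not in state:           # a hit leaves FIFO order unchanged
            if len(state) >= ways:
                state.pop()          # miss: evict the oldest entry
            state.insert(0, b)
    return state
```

A static analysis can therefore classify LRU contents precisely after a bounded access history; for FIFO the uncertainty about the initial state persists.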

Individually, each of the above features can be modeled<br />

without complexity problems. Only their combination can<br />

actually result in a large number of possible system states<br />

during the abstract simulation of a basic block. Smart system<br />

configurations as described in [18] can decrease both the<br />

execution time variability and the analysis complexity. In<br />

consequence, the complexity of timing analysis decreases such<br />

that highly complex processors like the Freescale PowerPC<br />

7448 can be handled. At the same time the accuracy of timing<br />

measurements will be improved.<br />

Some events in modern architectures are either<br />

asynchronous to program execution (e.g., interrupts, DMA) or<br />

not predictable in the model (e.g., ECC errors in RAM or some<br />

hardware exceptions). Their effect on the execution time has to<br />

be incorporated externally, i.e., by adding penalties based on<br />

the worst-case occurrence of the events to the computed<br />

WCET, or by statistical means.<br />

B. Multi-Core Processors<br />

Whereas timing analysis of single-core architectures<br />

already is quite challenging, the timing behavior of multi-core<br />

architectures is even more complex. A multi-core processor is a<br />

single computing component with two or more independent<br />

cores; it is called homogeneous if it includes only identical<br />

cores, otherwise it is called heterogeneous. Thus, all<br />

characteristic challenges from single-cores are still present in<br />

the multi-core design, but the multiple cores can independently<br />

run multiple instructions at the same time. Some multi-core<br />

processors can be run in lockstep mode where all cores execute<br />

the same instruction stream in parallel. This typically<br />

eliminates interferences between the cores, so from a timing<br />

perspective the processor behaves like a single-core.<br />

When the processor is not run in lockstep mode, the inter-core parallelism becomes relevant. To interconnect the several<br />

cores, buses, meshes, crossbars, and also dynamically routed<br />

communication structures are used. In that case, the<br />

interference delays due to conflicting, simultaneous accesses to<br />

shared resources (e.g. main memory) are the main cause of<br />

imprecision. On a single-core system, the latency of a memory<br />

access mostly depends on the accessed memory region (e.g.<br />

slow flash memory vs. fast static RAM) and whether the<br />

accessed memory cell has been cached or not. On a multi-core system, the latency also depends on the memory accesses of<br />

the other cores, because multiple simultaneous accesses might<br />

lead to a resource conflict, where only one of the accesses can<br />

be served directly, and the other accesses have to wait. The<br />

shared physical address space requires additional effort in order<br />

to guarantee a coherent system state: Data resident in the<br />

private cache of one core may be invalid, since modified data<br />

may already exist in the private cache of another core, or data<br />

might have already been changed in the main memory. Thus,<br />

additional communication between different cores is required<br />

and the execution time needed for this has to be taken into<br />

account. Multi-core processors which can be configured in a<br />

timing-predictable way to avoid or bound inter-core<br />

interferences are amenable to static WCET analysis [18, 36].<br />

Examples are the Infineon AURIX TC275 [17], or the<br />

Freescale MPC 5777.<br />

The Freescale P4080 [13] is one example of a multi-core platform where the interference delays have a huge impact on<br />

the memory access latencies and cannot be satisfactorily<br />

predicted by purely static techniques. It consists of eight<br />

PowerPC e500mc cores which communicate with each other<br />

and the main memory over a shared interconnect, the CoreNet<br />

Coherency Fabric. The main problem for static analysis<br />

approaches is that the publicly available documentation about<br />

the CoreNet is not enough to statically predict its behavior.<br />

Nowotsch et al. [24] measured maximal write latencies of 39<br />

cycles when only one core was active, and maximal write<br />

latencies of 1007 cycles when all eight cores were running.<br />

This is more than 25 times longer than the observed best case.<br />

A sound WCET analysis must take into account the interference delays caused by resource conflicts. Unless<br />

interference is avoided by means of the overall software<br />

architecture, ignoring these delays might result in<br />

underestimation of the real WCET whereas assuming full<br />

interferences at all times might result in huge overestimation.<br />
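A coarse full-interference bound in the spirit of [24] can be written down directly: every shared-resource access of the analyzed core may be delayed once by each other active core. This is a simplified illustration, not the exact analysis of the cited paper, and all example numbers are hypothetical:

```python
# Simplified full-interference bound in the spirit of [24]: every
# shared-memory access of the analyzed core may be delayed by one
# pending access of each other active core. A coarse illustrative
# model, not the exact analysis of the paper.

def interference_bound(wcet_isolation, n_accesses, n_cores, worst_delay):
    """Upper bound on the WCET when n_cores share the interconnect."""
    extra = n_accesses * (n_cores - 1) * worst_delay
    return wcet_isolation + extra
```

With hypothetical numbers (10000 cycles in isolation, 200 shared accesses, 8 cores, 39 cycles worst per-access delay) the bound grows to 64600 cycles, showing how quickly full-interference assumptions inflate the estimate.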

To improve predictability of avionics systems the<br />

Certification Authorities Software Team (CAST) [5] advocates<br />

either deactivating or controlling existing interference channels. If<br />

deactivation is not possible the software architecture has to be<br />

able to prevent or bound the interferences. One hardware<br />

element where such mechanisms are required is the<br />

interconnect, i.e., the Network-on-Chip (NoC) or shared bus<br />

connecting main memory to the individual cores. Several<br />



approaches to address interference on shared memory accesses<br />

have been discussed in literature, most of them in the context<br />

of Integrated Modular Avionics (IMA). They typically rely on<br />

a time-triggered static scheduling scheme, e.g., corresponding<br />

to the avionics standard ARINC 653. As an example, with the<br />

approaches of [30] or [24] precise static WCET bounds can be<br />

computed, albeit at the cost of high computational complexity.<br />

For systems which do not implement such rigorous software<br />

architectures or where the information needed to develop a<br />

static timing model is not available, hybrid WCET approaches<br />

are the only solution.<br />

III. WCET GUARANTEES ON PREDICTABLE PROCESSORS<br />

The most successful formal method for WCET computation<br />

is Abstract Interpretation-based static program analysis. Static<br />

program analyzers compute information about the software<br />

under analysis without actually executing it. Semantics-based<br />

static analyzers use an explicit (or implicit) program semantics<br />

that is a formal (or informal) model of the program executions<br />

in all possible or a set of possible execution environments.<br />

Most interesting program properties—including the WCET—<br />

are undecidable in the concrete semantics. The theory of<br />

abstract interpretation [6] provides a formal methodology for<br />

semantics-based static analysis of dynamic program properties<br />

where the concrete semantics is mapped to a simpler abstract<br />

model, the so-called abstract semantics. The static analysis is<br />

computed with respect to that abstract semantics, enabling a<br />

trade-off between efficiency and precision. A static analyzer is<br />

called sound if the computed results hold for any possible<br />

program execution. Applied to WCET analysis, soundness<br />

means that the WCET bounds will never be exceeded by any<br />

possible program execution. Abstract interpretation supports<br />

formal soundness proofs for the specified program analysis.<br />

Like model checking and theorem proving, it is recognized as a<br />

formal method by the DO-178C and other safety standards (cf.<br />

Formal Methods Supplement [26] to DO-178C [27]). It is<br />

based on a mathematically rigorous concept and provides the<br />

highest possible confidence in the correctness of the results (cf.<br />

IEC-61508, Ed. 2.0 [15], Table C.18).<br />

In addition to soundness, further essential requirements for<br />

static WCET analyzers are efficiency and precision. The<br />

analysis time has to be acceptable for industrial practice, and<br />

the overestimation must be small enough to be able to prove<br />

the timing requirements to be met.<br />

Over the last few years, a more or less standard architecture<br />

for timing analysis tools has emerged [9, 11]. It requires neither<br />

code instrumentation nor debug information and is composed<br />

of three major building blocks:<br />

• control-flow reconstruction and static analyses for control<br />

and data flow,<br />

• micro-architectural analysis, computing upper bounds on<br />

execution times of basic blocks,<br />

• path analysis, computing the longest execution paths<br />

through the whole program.<br />

The data flow analysis of the first block also detects<br />

infeasible paths, i.e., program points that cannot occur in any<br />

real execution. This reduces the complexity of the following<br />

micro-architectural analysis. Basic block timings are<br />

determined using an abstract processor model (timing model) to<br />

analyze how instructions pass through the pipeline, taking<br />

cache-hit/cache-miss information into account. This model<br />

defines a cycle-level abstract semantics for each instruction's<br />

execution, yielding a certain set of final system states. After<br />

the analysis of one instruction has been finished, these states<br />

are used as start states in the analysis of the successor<br />

instruction(s). Here, the timing model introduces nondeterminism<br />

that leads to multiple possible execution paths in<br />

the analyzed program. The pipeline analysis has to examine all<br />

of these paths.<br />

In the following sections we will focus on the commercially<br />

available tool aiT [1] which implements the architecture<br />

described above. It is centered around a precise model of the<br />

microarchitecture of the target processor and is available for<br />

various 16-bit and 32-bit single-core and multi-core<br />

microcontrollers. aiT determines the WCET of a program task<br />

in several phases corresponding to the reference architecture<br />

described above, which makes it possible to use different<br />

methods tailored to each subtask [34]. In the following we will<br />

give an overview of each analysis stage.<br />

• In the decoding phase the instruction decoder reads and<br />

disassembles the input executable(s) into its individual<br />

instructions. Architecture specific patterns decide whether<br />

an instruction is a call, branch, return or just an ordinary<br />

instruction. This knowledge is used to reconstruct the basic<br />

blocks of the control flow graph (CFG) [33]. Then, the<br />

control flow between the basic blocks is reconstructed. In<br />

most cases, this is done completely automatically.<br />

However, if a target of a call or branch cannot be statically<br />

resolved, the user can provide annotations to guide<br />

the control flow reconstruction.<br />

• The combined loop and value analysis determines safe<br />

approximations of the values of processor registers and<br />

memory cells for every program point and execution<br />

context. These approximations are used to determine<br />

bounds on the iteration number of loops and information<br />

about the addresses of memory accesses. Contents of<br />

registers or memory cells, loop bounds, and address ranges<br />

for memory accesses may also be provided by annotations<br />

if they cannot be determined automatically. Value analysis<br />

information is also used to identify conditions that are<br />

always true or always false. Such knowledge is used to<br />

infer that certain program parts are never executed and<br />

therefore do not contribute to the worst-case execution<br />

time or the stack usage.<br />

• In the micro-architectural analysis phase cache and pipeline<br />

analysis has to be combined because the pipeline analysis<br />

models the flow of instructions through the processor<br />

pipeline and therefore computes the precise instant of time<br />

when the cache is queried and its state is updated. The<br />

combined cache and pipeline analysis represents an<br />

abstract interpretation of the program's execution on the<br />

underlying system architecture. The execution of a<br />

program is simulated by feeding instruction sequences<br />

from a control-flow graph to the timing model which<br />

computes the system state changes at cycle granularity and<br />

keeps track of the elapsing clock cycles. The correctness<br />

proofs according to the theory of abstract interpretation<br />

have been conducted by Thesing [35]. The cache analysis<br />

presented by [10] is incorporated into the pipeline analysis.<br />

At each point where the actual hardware would query and<br />

update the contents of the cache(s), the abstract cache<br />

analysis is called, simulating a safe approximation of the<br />

cache effects. The result of the cache/pipeline analysis<br />

either is a worst-case execution time for every basic block,<br />

or a prediction graph that represents the evolution of the<br />

abstract system states at processor core clock granularity<br />

[7].<br />

• The path analysis phase uses the results of the combined<br />

cache/pipeline analysis to compute the worst-case path of<br />

the analyzed code with respect to the execution timing.<br />

The execution time of the computed worst-case path is the<br />

worst-case execution time for the program. Within the aiT<br />

framework, different methods for computing this worst-case<br />

path are available.<br />

aiT has been successfully employed in the avionics [12, 11, 31]<br />

and automotive [23] industries to determine precise bounds on<br />

execution times of safety-critical software. It is available for a<br />

variety of microprocessors ranging from simple processors like<br />

ARM7 to complex superscalar processors with timing<br />

anomalies and domino effects like the Freescale MPC755 or<br />

MPC7448, and multicore processors like Infineon AURIX<br />

TC27x.<br />

IV. HYBRID WCET ANALYSIS<br />

Techniques to compute worst-case execution time<br />

information from measurements are either based on end-to-end<br />

measurements of tasks, or they construct a worst-case path<br />

from timing information obtained for a set of smaller code<br />

snippets in which the executable code of the task has been<br />

partitioned. With end-to-end timing measurements, timing<br />

information is only determined for one concrete input. As<br />

described above, due to caches and pipelines the timing<br />

behavior of an instruction depends on the program path<br />

executed before. Therefore, full test coverage usually cannot be<br />

achieved and there is no safe test end criterion. Approaches that<br />

instrument the code to obtain timing information about the<br />

code snippets of a task modify the code, which can significantly<br />

change the cache and pipeline behavior (probe effect): the<br />

times measured for the instrumented software do not<br />

necessarily correspond to the timing behavior of the original<br />

software.<br />

The solution which is implemented in the hybrid WCET<br />

analysis tool TimeWeaver [2] combines static context-sensitive<br />

path analysis with non-intrusive real-time instruction-level<br />

tracing to provide worst-case execution time estimates. By its<br />

nature, an analysis using measurements to derive timing<br />

information is aware of timing interference due to concurrent<br />

execution and multicore resource conflicts, because the effects<br />

of asynchronous events (e.g. activity of other running cores or<br />

DRAM refreshes) are directly visible in the measurements. The<br />

probe effect is completely avoided since no code<br />

instrumentation is needed. The computed estimates are safe<br />

upper bounds with respect to the given input traces, i.e.,<br />

TimeWeaver derives an overall upper timing bound from the<br />

execution time observed in the given traces. Thus, the coverage<br />

of the analyzed code by the input traces is an important metric<br />

that influences the quality of the computed WCET estimates.<br />

The trace information needed for running TimeWeaver is<br />

provided out-of-the-box by embedded trace units of modern<br />

processors, like NEXUS IEEE-ISTO 5001 [16] or ARM<br />

CoreSight [3]. They allow the fine-grained observation of a<br />

program execution on single-core and multicore systems.<br />

Examples for processors supporting the NEXUS trace interface<br />

are the NXP QorIQ P- and T-series processors (using either an<br />

e500mc or an e5500/e6500 core).<br />

A. NEXUS Traces<br />

On the PowerPC architecture TimeWeaver relies on<br />

NEXUS program flow trace messages. Such traces consist of<br />

trace segments separated by trace events. TimeWeaver maps<br />

the events to points in the control-flow graph (trace points) and<br />

the segments to program paths between these points. This is<br />

done for those parts of the trace that reach from the call of the<br />

routine used as analysis entry till the end of that routine or any<br />

other feasible end of execution. Such parts are called trace<br />

snippets. A single trace may contain several trace snippets.<br />

TimeWeaver can operate on one or more traces given as trace<br />

files, each containing one or more trace snippets.<br />

A NEXUS trace event encodes its type, a time stamp<br />

containing the elapsed CPU cycles since the last trace event<br />

and the contents of the branch history buffer, which can be<br />

used to reconstruct execution path decisions and to map<br />

trace segments to the control-flow graph of the corresponding<br />

executable.<br />

Microprocessor debugging solutions like the Lauterbach<br />

PowerDebug Pro [20] make it possible to record NEXUS trace events as<br />

they are emitted during program execution and to export them<br />

in various formats. TimeWeaver can process those exports for<br />

its timing analysis as described below.<br />

Here is a sample NEXUS trace excerpt (with some<br />

information removed) in ASCII format:<br />

+056 TCODE=1D PT-IBHSM F-ADDR=F1F4 HIST=2 TS=8847<br />

+064 TCODE=21 PT-PTCM EVCODE=A TS=88F1<br />

+072 TCODE=1C PT-IBHM U-ADDR=03DC HIST=1 TS=8D62<br />

+080 TCODE=21 PT-PTCM EVCODE=A TS=8E2F<br />

+088 TCODE=21 PT-PTCM EVCODE=A TS=8FBA<br />

+096 TCODE=21 PT-PTCM EVCODE=A TS=9105<br />

+104 TCODE=1C PT-IBHM U-ADDR=02CC HIST=1 TS=9275<br />

+112 TCODE=1C PT-IBHM U-ADDR=01F0 HIST=1 TS=93BF<br />

+120 TCODE=21 PT-PTCM EVCODE=A TS=997B<br />

+128 TCODE=1C PT-IBHM U-ADDR=0044 HIST=1 TS=9B02<br />

+136 TCODE=21 PT-PTCM EVCODE=A TS=9F21<br />

This output has been generated using the following<br />

command in the Lauterbach Trace32 tool:<br />

Trace.export.ascii nexus /showRecord<br />

Each line corresponds to a trace event. The number at the<br />

beginning of the line is the trace record number. The second<br />

and third columns represent the particular trace event type<br />

followed by type-specific information like branch history and<br />

program address information associated with the event. The TS<br />

number at the end is a time stamp.<br />

Debugging solutions differ in the format in which they<br />

export trace data. Some debuggers allow the user to configure<br />

the output. TimeWeaver can currently import traces which<br />

have been exported by Lauterbach, PLS or iSYSTEM<br />

debuggers. Whenever the format is configurable, we have<br />

identified a minimal set of information needed to perform the<br />

TimeWeaver analysis. Additionally, TimeWeaver can be easily<br />

extended to support other trace formats.<br />

B. TimeWeaver Toolchain<br />

The main inputs for TimeWeaver are the fully linked<br />

executable(s), timed traces and the location of the analyzed<br />

code in memory (entry point, which usually is the name of<br />

a task or function). Optionally, users can specify further<br />

semantic information for the analysis, such as targets of computed<br />

calls, loop bounds, values of registers and memory cells. This<br />

information is used to fine-tune the analysis. The analysis<br />

proceeds in several stages: decoding, loop/value analysis, trace<br />

analysis, and path analysis. Most steps in this tool chain are<br />

shared with aiT, leveraging its powerful static analysis<br />

framework.<br />

The decoding phase of TimeWeaver is mostly identical to<br />

the decoding phase of aiT. One important difference is that<br />

when encountering call targets which cannot be statically<br />

resolved, TimeWeaver can be instructed to extract the targets<br />

of unresolved branches or calls from the input traces. To this<br />

end there is a feedback loop between the CFG reconstruction<br />

and the trace analysis step (cf. Fig. 1). As an alternative, the<br />

same user annotations can be used as in the aiT tool chain.<br />

In the next phase, several microarchitectural analyses are<br />

performed on the reconstructed CFG starting with the<br />

combined loop and value analysis, again identical to the aiT tool<br />

chain. It determines possible values of registers and memory<br />

cells, addresses of memory accesses, as well as loop and<br />

recursion bounds. Based on this, statically infeasible paths are<br />

computed, i.e., parts of the program that cannot be reached by<br />

any execution under the given configuration. This is important<br />

because each detected infeasible path increases the trace<br />

coverage. Such paths are pruned from further analysis. If the<br />

value analysis cannot compute a loop bound or if the computed<br />

bound is not precise enough, users can specify custom bounds<br />

by means of annotations which are used by the analysis. The<br />

loop transformation allows loops in the CFG to be handled as<br />

self-recursive routines to improve analysis precision [32].<br />

After value analysis, the analyzer has annotated each<br />

instruction in the control-flow graph with context-sensitive<br />

analysis results. This context-sensitivity is important because<br />

the precision of an analysis can be improved significantly if the<br />

execution environment is considered [32]. For example, if a<br />

routine is called with different register values from two<br />

different program points, the execution time in both situations<br />

might be different. Depending on the context settings, this is<br />

taken into account leading to higher precision in the analysis<br />

result.<br />

Fig. 1. TimeWeaver tool chain structure<br />

In the trace analysis step the given traces are analyzed such<br />

that each trace event is mapped to a program point in the<br />

control-flow graph. This mapping defines the trace points and<br />

segments mentioned above and is not only necessary for the<br />

whole analysis but also ensures that the input trace matches the<br />

analyzed binary. In case a preemptive system has been traced,<br />

interrupts are detected and reported. The extracted timing<br />

information, i.e., the clock cycles that elapsed<br />

between two consecutive trace points, is annotated to the CFG<br />

in a context-sensitive manner.<br />

After the trace conversion, a CFG which combines the<br />

results of value analysis and traced execution timings (both<br />

context-sensitive) is available. This graph is the input for the<br />

next step, the path analysis phase. Here, the trace segment<br />

times alongside the control-flow graph are used to generate an<br />

integer linear program (ILP) formulation to compute the worst-case<br />

execution path with respect to the traced timings. At this<br />

point, the recorded times for each pair of trace segment and<br />

analysis context are maximized. The ILP formulation is<br />

structurally the same as in the path analysis of aiT [33] with the<br />

exception that the involved execution times are not computed<br />

by a micro-architectural pipeline analysis but are extracted<br />

from the input traces. The generated ILP is fed to a solver<br />

whose solution is the worst-case execution path alongside its<br />

costs, i.e., the WCET estimate of the analyzed task. This<br />

solution is annotated to the CFG for the final step, namely<br />

reporting and visualization.<br />
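
Structurally this follows the well-known implicit path enumeration technique: writing x_i for the execution count of CFG edge i and t_i for the maximal observed time of the corresponding trace segment, the ILP maximizes the accumulated time subject to flow conservation. A generic sketch; the concrete constraint set used by aiT and TimeWeaver is tool-specific:<br />

```latex
\max \sum_i t_i \, x_i
\quad \text{subject to} \quad
\sum_{e \in \mathrm{in}(b)} x_e \;=\; \sum_{e \in \mathrm{out}(b)} x_e
\;\; \text{for every basic block } b,
\qquad x_{\mathrm{entry}} = 1,
\qquad x_\ell \le \mathrm{bound}_\ell
\;\; \text{for every loop back edge } \ell .
```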

As mentioned above, the input traces might contain<br />

asynchronous events like DRAM refreshes which can lead to<br />

exceptionally high trace segment times. TimeWeaver can<br />

address these with a filter for trace segment times based on<br />

their cumulative frequency (CF), i.e. their occurrence<br />

percentage. The threshold refers to a percentage of occurrences<br />

ordered by execution times (as in the survival graph, see<br />

below). A threshold of 0% is passed by all occurrences. A<br />

threshold of 5% is passed by all but the 4 most expensive ones<br />

(in terms of execution time) if there are 100 occurrences, by all<br />

but the 9 most expensive ones if there are 200 occurrences, etc.<br />

Trace segment times that do not pass the specified threshold<br />

are ignored in the ILP generation. The filter function is applied<br />

for each trace segment separately. TimeWeaver can also<br />

simulate the effect of the CF filter in its statistics view to<br />

experiment with different filter values.<br />

C. TimeWeaver Result Reporting and Visualization<br />

Besides the global WCET estimate and the execution path<br />

triggering it, TimeWeaver offers a variety of reporting<br />

facilities:<br />

• WCET estimate per routine (including cumulative<br />

information of called sub routines),<br />

• Context-specific WCET estimate per routine (including<br />

cumulative information of called sub routines),<br />

• Determined loop bounds (distinguishing between traced,<br />

analyzed, and effective bounds) including loop scaling<br />

conflicts,<br />

• Variance of trace segment times (context-sensitive),<br />

• Trace coverage with respect to the number of basic<br />

blocks and instructions in the analyzed code, and<br />

• Memory access information along the computed worst-case<br />

path.<br />

In addition to the statistics described above, TimeWeaver<br />

provides the following visualizations:<br />

• Analysis result graph to interactively explore the results,<br />

• Per trace segment distribution graph for the recorded<br />

segment times (cf. Fig. 2), and a<br />

• Per trace segment survival graph to show the<br />

cumulative frequency of the recorded segment times<br />

(cf. Fig. 3).<br />

Fig. 2. Sample distribution graph of a trace segment<br />

Fig. 3. Sample survival graph of a trace segment<br />

D. WCET Estimate Extrapolation<br />

As mentioned above, TimeWeaver computes the global<br />

WCET estimate based on the observed execution times of trace<br />

segments. The times are maximized per trace segment and the<br />

maximized times are composed to identify the worst-case path<br />

with respect to those figures.<br />

Whereas, in general, one would need to measure all possible<br />

execution paths of the analyzed program for coverage<br />

reasons (which is impractical for real-world applications),<br />

TimeWeaver can compute an upper bound on the global<br />

execution time of the analyzed program based on the trace<br />

segment times extracted from the input traces.<br />

This way, it is only necessary to trace all possible execution<br />

paths between two consecutive trace points. By inserting<br />

custom trace points, the user can further decrease the required<br />

number of measurements. Fig. 4 illustrates this by showing<br />

three consecutive trace points (TP1, TP2, and TP3) and the<br />

possible execution paths between each of them. TimeWeaver<br />

composes the WCET estimate for the time between TP1 and<br />

TP3 by the sum over the maximized trace segment time<br />

between TP1→TP2 and the maximized trace segment time<br />

between TP2→TP3. Thus, the measurements need to cover the<br />

four execution paths between TP1→TP2 as well as the<br />

three execution paths between TP2→TP3. Without that time<br />

composition, all 12 execution paths between TP1→TP3 need<br />

to be measured.<br />

E. Loop Scaling<br />

For loops, there might be a gap between the maximum of<br />

the observed iteration counts in the input traces (traced bound)<br />

and the statically possible maximum iteration count (analyzed<br />

bound) which is computed by the value analysis. The bound<br />

actually used for the ILP generation is the so-called effective<br />

bound which is the analyzed bound if it is finite and applicable<br />

(cf. scaling conflicts below) and otherwise the traced bound.<br />

On request, the intersection of the analyzed and traced bounds<br />

is used.<br />

If the effective bound is higher than the traced bound, the<br />

maximum observed execution time (context-sensitively) for<br />

one loop iteration is scaled up to the effective bound. This<br />

overcomes the necessity to trace each loop in the analyzed task<br />

with its worst-case iteration count, which might be hard to<br />

achieve because loop conditions often are data-dependent and<br />

thus can be complex to trigger.<br />

However, loop scaling as described above is not always<br />

directly applicable. It requires each trace to pass a trace point<br />

inside the loop body. If there is at least one traced execution<br />

path through the loop body without a trace point, scaling<br />

cannot be done and only the traced bounds are used for this<br />

loop. Such a situation is called an event loop scaling conflict.<br />

The solution is to either trace the worst-case loop iteration<br />

count or to ensure that each traced path through the loop body<br />

passes a trace point (by inserting custom trace points).<br />

There is another situation which triggers a loop scaling<br />

conflict: if due to the context settings of the analysis a loop is<br />

virtually unrolled more times than the corresponding loop body<br />

has been executed in the trace, scaling cannot be applied<br />

either. The reason is that the scaling is applied in the last loop<br />

context, i.e., in that context which represents the last loop<br />

iteration(s). In that case, there is no traced loop body time in<br />

the trace mapped to this context which prevents scaling. Such a<br />

conflict is called an unroll loop scaling conflict. To solve this<br />

conflict, one can either trace the worst-case iteration count of<br />

the corresponding loop or decrease the (virtual) loop unrolling<br />

during analysis of this particular loop to the traced<br />

bound.<br />

Fig. 4. Execution paths between trace points<br />

V. EXPERIMENTAL RESULTS ON TIMEWEAVER<br />

To evaluate TimeWeaver for PowerPC, we recorded<br />

program executions on an NXP T1040 [25] evaluation board<br />

using a Lauterbach PowerDebug Pro JTAG debugger.<br />

A. Loop scaling<br />

Execution times for loops can be scaled up from the<br />

maximum observed execution time of the loop body. This can<br />

be seen in the analysis of the following program:<br />

1 volatile int sensor;<br />

2<br />

3 int helper (int x)<br />

4 {<br />

5 int result = x;<br />

6 result += sensor;<br />

7 return result + 3;<br />

8 }<br />

9<br />

10<br />

11 int main (void)<br />

12 {<br />

13 int result = 0;<br />

14<br />

15 result += helper(256);<br />

16<br />

17 int i;<br />

18 int loop_bound = (sensor-0xDEADBEEF)+5;<br />

19<br />

20 /* Loop with statically unknown bound */<br />

21 for (i=0 ; i


Application Trace [cycles] Estimate [cycles] Diff [%]<br />

crc 809068 829039 2.47<br />

edn 4788025 4791420 0.07<br />

eratosthenes sieve 368345 369803 0.40<br />

dhrystone 168093 177314 5.49<br />

md5 127857 131718 3.02<br />

nestedDepLoops 2747357 2747359 0.00<br />

sha 23426161 23815350 1.66<br />

Avionics Task 420677 498028 18.38<br />

Automotive Task 1 65058 71964 10.62<br />

Automotive Task 2 27215 28967 6.44<br />

Automotive Task 3 17386 18595 6.95<br />

Automotive Task 4 101749 109302 7.42<br />

Tab. 1. TimeWeaver Result Comparison<br />

For each application, the maximum observed end-to-end<br />

time has been extracted from the traces and compared with the<br />

WCET estimate computed by TimeWeaver. The difference<br />

represents the overestimation of TimeWeaver resulting from<br />

the composition of trace segment times to a global estimate. On<br />

average, the TimeWeaver results from the table above are<br />

5.24% above the maximum observed end-to-end times from the<br />

traces.<br />

VIII. CONCLUSION<br />

In this article we have given a definition of timing<br />

predictability and discussed hardware features which increase<br />

the difficulty of obtaining safe and precise worst-case<br />

execution time bounds, both on single-core and multicore<br />

processors. We have described the methodology of static<br />

worst-case execution time analysis which can provide<br />

guaranteed WCET bounds on complex processors, if the<br />

timing behavior of the processor is well specified, and<br />

asynchronous interferences can be controlled or bounded.<br />

Hybrid worst-case execution time analysis allows to obtain<br />

worst-case execution time bounds even for systems where<br />

these conditions are not met. We have given an overview of<br />

the hybrid WCET analyzer TimeWeaver which combines<br />

static value and path analysis with timing measurements based<br />

on non-intrusive instruction-level real-time traces. The trace<br />

information covers interference effects, e.g., by accesses to<br />

shared resources from different cores, without being distorted<br />

by probe effects since no instrumentation code is needed. The<br />

analysis results include the computed WCET bound with the<br />

time-critical path, and information about the trace coverage<br />

obtained. They provide valuable feedback for optimizing trace<br />

coverage, for assessing system safety, and for optimizing<br />

worst-case performance. Experimental results show that with<br />

good trace coverage safe and precise WCET bounds can be<br />

efficiently computed.<br />

ACKNOWLEDGMENT<br />

This work was funded within the project ARAMiS II by the<br />

German Federal Ministry for Education and Research (BMBF)<br />

with the funding ID 01IS16025B, and within the BMBF<br />

project EMPHASE with the funding ID 16EMO0183. The<br />

responsibility for the content remains with the authors.<br />

REFERENCES<br />

[1] AbsInt GmbH. aiT Worst-Case Execution Time Analyzer Website.<br />

http://www.AbsInt.com/ait.<br />

[2] AbsInt GmbH. TimeWeaver Website. http://www.AbsInt.com/timeweaver.<br />

[3] ARM Ltd. CoreSight Program Flow Trace PFTv1.0 and PFTv1.1<br />

architecture specification, 2011. ARM IHI 0035B.<br />

[4] Philip Axer, Rolf Ernst, Heiko Falk, Alain Girault, Daniel Grund, Nan<br />

Guan, Bengt Jonsson, Peter Marwedel, Jan Reineke, Christine<br />

Rochange, Maurice Sebastian, Reinhard von Hanxleden, Reinhard<br />

Wilhelm, and Wang Yi. Building timing predictable embedded systems.<br />

ACM Transactions on Embedded Computing Systems, 13(4):82:1–<br />

82:37, 2014.<br />

[5] Certification Authorities Software Team (CAST). Position Paper<br />

CAST-32A Multi-core Processors, November 2016.<br />

[6] P. Cousot and R. Cousot. Abstract interpretation: a unified lattice model<br />

for static analysis of programs by construction or approximation of<br />

fixpoints. In 4th POPL, pages 238–252, Los Angeles, CA, 1977. ACM<br />

Press.<br />

[7] Christoph Cullmann. Cache persistence analysis for embedded real-time<br />

systems. PhD thesis, Universitaet des Saarlandes, Postfach 151141,<br />

66041 Saarbruecken, 2013.<br />

[8] Christoph Cullmann, Christian Ferdinand, Gernot Gebhard, Daniel<br />

Grund, Claire Maiza, Jan Reineke, Benoît Triquet, and Reinhard<br />

Wilhelm. Predictability considerations in the design of multi-core<br />

embedded systems. In Proceedings of Embedded Real Time Software<br />

and Systems, pages 36–42, May 2010.<br />

[9] Andreas Ermedahl. A Modular Tool Architecture for Worst-Case<br />

Execution Time Analysis. PhD thesis, Uppsala University, 2003.<br />

[10] Christian Ferdinand. Cache Behavior Prediction for Real-Time Systems.<br />

PhD thesis, Saarland University, 1997.<br />

[11] Christian Ferdinand, Reinhold Heckmann, Marc Langenbach, Florian<br />

Martin, Michael Schmidt, Henrik Theiling, Stephan Thesing, and<br />

Reinhard Wilhelm. Reliable and precise WCET determination for a<br />

real-life processor. In Proceedings of EMSOFT 2001, First Workshop<br />

on Embedded Software, volume 2211 of Lecture Notes in Computer<br />

Science, pages 469–485. Springer-Verlag, 2001.<br />

[12] Christian Ferdinand and Reinhard Wilhelm. Fast and Efficient Cache<br />

Behavior Prediction for Real-Time Systems. Real-Time Systems, 17(2-<br />

3):131–181, 1999.<br />

[13] Freescale Inc. QorIQ P4080 Communications Processor Product<br />

Brief, September 2008. Rev. 1.<br />

[14] Sebastian Hahn, Jan Reineke, and Reinhard Wilhelm. Towards<br />

compositionality in execution time analysis: Definition and challenges.<br />

SIGBED Rev., 12(1):28–36, March 2015.<br />

[15] IEC 61508. Functional safety of electrical/electronic/programmable<br />

electronic safety-related systems, 2010.<br />

[16] IEEE-ISTO. IEEE-ISTO 5001 TM -2012, The Nexus 5001 TM Forum<br />

Standard for a Global Embedded Processor Debug Interface, 2012.<br />

[17] Infineon Technologies AG. AURIXTM TC27x D-Step User’s Manual,<br />

2014.<br />

[18] D. Kästner, M. Schlickling, M. Pister, C. Cullmann, G. Gebhard,<br />

R. Heckmann, and C. Ferdinand. Meeting Real-Time Requirements<br />

with Multi-Core Processors. Safecomp 2012 Workshop: Next<br />

Generation of System Assurance Approaches for Safety-Critical<br />

Systems (SASSUR), September 2012.<br />

[19] Paul Kocher, Daniel Genkin, Daniel Gruss, Werner Haas, Mike<br />

Hamburg, Moritz Lipp, Stefan Mangard, Thomas Prescher, Michael<br />

Schwarz, and Yuval Yarom. Spectre attacks: Exploiting speculative<br />

execution. ArXiv e-prints, January 2018.<br />

[20] Lauterbach GmbH. Lauterbach Website. http://www.lauterbach.com.<br />

[21] Moritz Lipp, Michael Schwarz, Daniel Gruss, Thomas Prescher,<br />

Werner Haas, Stefan Mangard, Paul Kocher, Daniel Genkin, Yuval<br />

Yarom, and Mike Hamburg. Meltdown. ArXiv e-prints, January 2018.<br />

[22] Thomas Lundqvist and Per Stenström. Timing anomalies in<br />

dynamically scheduled microprocessors. In Real-Time Systems<br />

Symposium (RTSS), December 1999.<br />

www.embedded-world.eu<br />

536




Missing Relationship between Software FTAs and<br />

System FTA on Multi-Core Platforms<br />

– Identification and Resolving<br />

Hossam H. Abolfotuh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Hossam.Abolfotuh@ejad.com.eg<br />

Esam Mamdouh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Esam.Mamdouh@ejad.com.eg<br />

Abstract—Functional safety is a key player in the<br />

development of Advanced Driver Assistance Systems (ADAS).<br />

The primary objective of applying safety analysis on software<br />

architectural design is to anticipate potential scenarios of failure.<br />

This kind of analysis aims to identify how failures originate at the<br />

low-levels of the design and how combinations or sequences of<br />

such low-level failures propagate to higher levels leading to a<br />

safety goal violation. Such analysis can be realized by applying the software Fault Tree Analysis (FTA) method. Applying software FTA to ADAS architectures is challenging, because ADAS software is mainly developed on multi-core platforms. This paper discusses how software FTA can be performed on a multi-core platform, taking the dependencies between the cores into consideration; it also discusses how these software FTAs are linked with the system FTA to reach a consistent analysis.<br />

Keywords—Software; Automotive; Multi-Core; Functional Safety; ISO 26262; FTA; Fault Tree Analysis<br />

I. INTRODUCTION<br />

Functional safety highly impacts the automotive industry, especially now that autonomous driving is being adopted. In critical systems, such as the radar and camera applications that participate in autonomous driving, functional safety is a must. In fact, there are many systematic failures that can lead to a violation of the safety goals, which in turn may put passengers’ lives at risk. This raises the need for performing a systematic safety analysis on the system, software and hardware levels. The primary objective of applying safety analysis is to anticipate potential scenarios of failure. This kind of analysis aims to identify how failures originate at the low levels of the design and how combinations or sequences of such low-level failures propagate to higher levels, leading to a safety goal violation. Such analysis can be realized by applying the software Fault Tree Analysis (FTA) method according to ISO 26262 [1]. FTA is a top-down approach, which is more appropriate for software applications than bottom-up approaches. This paper discusses how software FTA can be performed on a multi-core platform, taking the dependencies between the cores into consideration; it also discusses the linkage of these software FTAs with the system FTA to reach a consistent safety analysis.<br />

II. FAULT TREE ANALYSIS (FTA)<br />

A. What is Fault Tree Analysis?<br />

In general, the FTA is a deductive top-down analysis<br />

approach, in which an undesired state of a system is analyzed<br />

using Boolean logic to combine a series of lower-level events.<br />

The undesired states of the system are defined as a set of Top Level Events (TLEs) that represent the failure events which lead to these states and affect the critical system outputs. The analysis then traces these events down to their root causes, which are known as Basic Events (BEs). After defining these BEs, a list of safety mechanisms is provided to tolerate them.<br />

B. How is Fault Tree Analysis performed?<br />

The FTA is performed separately for each TLE in the set. Each TLE will be the starting point of a fault tree, as shown in Fig. 1. For instance, consider (TLE1) as a “Failure in<br />

Object Detection” in a radar system. Working backward from<br />

this top event it might be determined that it is caused by one of<br />

two events (E); the first one is a failure in transmission of<br />

messages containing list of detected objects (E1), while the<br />

second one is a failure in the object detection algorithm (E2).<br />

This condition is represented in the fault tree diagram as a<br />

logical OR between these possible causes as shown in Fig. 1.<br />

Following (E2), it might be determined that it is caused by<br />

one of two events; the first one is a failure in radar signal<br />

processing algorithm to create the list of detected objects (E3),<br />

while the second one is a memory corruption in the list of<br />

detected objects (E4). This is another logical OR. A design<br />

improvement using “Memory Protection Unit” can be<br />

implemented to protect the critical data – such as list of<br />

www.embedded-world.eu<br />

538


Fig. 1. Fault Tree diagram example<br />

detected objects – from corruption. This is a safety mechanism<br />

(SM1), added in the form of a logical AND with the memory corruption basic event (BE1).<br />
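The Boolean structure of this example tree can be expressed directly in code. The following is a minimal illustrative sketch (the event names follow the example above; the encoding is ours, not part of any FTA tool):<br />

```c
#include <stdbool.h>

/* Toy encoding of the fault tree example above: TLE1 ("Failure in
 * Object Detection") fires if message transmission fails (E1) OR the
 * object detection algorithm fails (E2). E2 in turn fires if signal
 * processing fails (E3) OR memory corruption of the objects list (BE1)
 * occurs AND the Memory Protection Unit safety mechanism (SM1) fails
 * to contain it. */
static bool tle1_fires(bool e1, bool e3, bool be1, bool sm1_failed)
{
    bool e4 = be1 && sm1_failed; /* AND gate: basic event plus failed SM */
    bool e2 = e3 || e4;          /* OR gate below E2 */
    return e1 || e2;             /* OR gate at the top of the tree */
}
```

With SM1 in place, the memory-corruption path contributes to TLE1 only when the basic event occurs and the safety mechanism fails at the same time, which is exactly what the AND gate in Fig. 1 expresses.<br />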

III. CHALLENGES FACING FTA<br />

The first challenge in performing FTA is that the TLEs are defined on the system level, where the safety goals are defined. After performing the FTA on the system level, there will be some events that need deeper analysis on either the software or the hardware level. This requires clear identification of the relations between the different applied analyses (e.g. system FTA and software FTA) to obtain a consistently integrated FTA at the end. Problems in this step will lead to difficulties in the integration phase of the different FTAs performed on the different development levels.<br />

Another challenge facing FTAs on multi-core platforms is that FTA is usually performed separately on each core, without considering the inter-dependencies between the cores during the safety analysis phase. This leads to missing possible failures resulting from these inter-dependencies, and consequently to missing safety mechanisms.<br />

A. Complications during linking Software to System FTA<br />

Starting the FTA at the software level apart from the system FTA is a big mistake. The software may have its own TLEs, but it is never independent from the system. Any system FTA will end up with some BEs that need to be analyzed in depth in the software architecture. If these events are not taken into consideration when performing the software FTA, a gap will arise in the integration phase of the system and software FTAs. In section IV, a proposal illustrates how this gap can be covered.<br />

B. Missing dependencies between FTAs on multi-cores<br />

On a multi-core platform, the relations between the cores create dependencies between the events. These dependencies must be taken into consideration during the FTA of each core. As a result, FTA cannot be performed on each core in complete separation from the other cores. Separation may nevertheless be needed, for instance when each core is the responsibility of a different team. Separation is therefore allowed, but only after considering the dependencies and translating them into new events.<br />

For example, the data transferred through Inter Process Communication (IPC) is a critical source of dependencies that should be considered during FTA on a multi-core platform.<br />

IV. THE PROPOSED METHODOLOGY<br />

In this paper, a systematic methodology is proposed to address the previously mentioned challenges. This methodology is based on separating the work as flexibly as possible without dropping any dependency. The system FTA can be performed by a team separate from the software team. Even the software FTAs themselves can be split among different cores. In the end, however, the dependencies must be handled in a systematic, well-organized way.<br />

A. Integrating System FTA with Software FTA<br />

Starting with the challenge of linking the system FTA to the software FTAs, the dependency is clear, since the system FTA generates additional TLEs for the software. In order to make sure no links are missed between the system and software analyses, the set of software TLEs must be stated clearly and must include two categories. The first category contains the software TLEs created to correspond to those BEs of the system FTA that need deeper analysis in the software, for example the generation of correct critical output by the software algorithms. The second category contains the TLEs that arise from the software architecture itself and have no clear mapping to system BEs; new BEs shall be created in the system FTA and linked to these software TLEs, for example the sequence and timing constraints of executing a specific software algorithm.<br />

B. Performing Software FTA on multi-core<br />

The second challenge is performing the analysis on multi-core platforms. The dependencies between the cores shall be studied and defined. In order to make sure no links are missed between the cores, a set of TLEs must be stated clearly for each core, including two categories of TLEs. The first category contains this core's share of the entire software set of TLEs, based on the core functionality, for example the generation of the list of objects by the main core. The second category shall cover all the dependencies in which this core is the source of a dependency. It adds a new TLE on the source side and expects a corresponding BE on the receiver side of the dependency, for example the transmission of car information such as speed and yaw rate from the main core (core-0) to the other cores (core-1 and core-2), which use this information in their algorithms, as shown in Fig. 2.<br />

Fig. 2. IPC on a tri-core platform must be taken into consideration during FTA on each core.<br />



V. CASE STUDY<br />

In the case study, these solutions are applied to a Medium Range Radar system for Emergency Braking Assistance, developed on the tri-core platform "MPC5774-RaceRunner" [2]. Core-0 is the main core, which communicates with the centralized ECU via the CAN bus; it receives car information such as speed and steering angle, and sends the list of detections. Core-1 is responsible for radar signal processing. Core-2 is responsible for executing the algorithms that detect the objects. This system has a main TLE defined as "Existing object not reported by Radar ECU". The system and software FTAs were performed using Medini Analyze.<br />

On the system level, a high-level failure – identified by<br />

system FTA – is the transmission of a wrong list of detected<br />

objects. On the software level, this can be easily mapped to a<br />

TLE – (TLE2) in this example – and deeply analyzed in<br />

software. This link is highlighted in green color. On the other<br />

hand, based on the nature of the software, another failure arises from the transmission of a correct but obsolete list of detected objects. This failure can occur due to a timing violation. It can be tolerated using a timing protection safety mechanism such as flow control monitoring. Accordingly, a<br />

BE shall be added in the system FTA – highlighted in purple –<br />

as shown in Fig. 3 in order to be linked to (TLE4) that is<br />

created in the software FTA.<br />

Later in the project, it was required to apply software FTA on the three cores. During the software FTA on core-2, a TLE related to the generation of the objects list is analyzed. This list is later sent to core-0, so a link between core-0 and core-2 – highlighted in blue color – shall appear in the different software FTAs: a BE in the software FTA of the receiving core (core-0), as shown in Fig. 4, is linked to (TLE5) in the source core (core-2), as shown in Fig. 5.<br />

Afterwards, by analyzing the dependencies between the two cores, it was found that the transfer of critical vehicle information – such as speed and yaw rate – from core-0 to core-2 was missing in the FTAs. A BE appeared in the software FTA of the receiving core (core-2), as shown in Fig. 5, which was not linked to any TLE. Accordingly, an additional (TLE3) was created in the source core (core-0), as shown in Fig. 6, and the two were linked together, as highlighted in orange color.<br />

Fig. 3. System FTA diagram<br />

Fig. 5. Software FTA diagram on Core-2-TLE5<br />

Fig. 4. Software FTA diagram on Core 0-TLE2<br />

Fig. 6. Software FTA diagram on Core-0-TLE3<br />



VI. CONCLUSION<br />

In this paper, a systematic methodology was introduced to identify missing relationships when performing FTA on multi-core platforms. In order to obtain consistently integrated system and software FTAs, an additional set of TLEs must be defined from the system FTA to make sure they link with the software FTAs.<br />

Furthermore, when performing the software FTAs across different cores, the dependencies between the cores shall be considered, because they add more TLEs to each core in addition to its original share of the software TLEs.<br />

Covering these two challenges ensures the integrity of the different FTAs performed on the system and software levels.<br />

VII. REFERENCES<br />

[1] International standard, “Road Vehicles – Functional Safety”, ISO<br />

Standard 26262, first edition, Nov. 2011.<br />

[2] NXP Datasheet, “MPC5775K Reference Manual”, Document Number:<br />

MPC5775KRM, Rev. 2, 2/2014.<br />



Stopping Buffer Overruns<br />

Connecting static and dynamic code analysis<br />

Mark Hermeling<br />

GrammaTech, Inc.<br />

Ithaca, NY, USA<br />

mhermeling@grammatech.com<br />

Abstract—Buffer overruns are abundant in many deployed<br />

software systems, open source or commercial, enterprise and<br />

embedded. They are causing an embarrassing number of<br />

software security issues. A system is only as secure as its weakest<br />

link and a buffer overrun may provide the attacker a foothold<br />

into the system. Static analysis has been used for decades to<br />

detect buffer overruns, but they still occur, as static analysis is not perfect and is prone to both false positives and false negatives. This<br />

paper will explain why buffer overruns are hard to detect and<br />

propose how we can combine static and dynamic analysis to help<br />

detect and resolve them.<br />

Keywords—buffer overrun; static analysis; dynamic analysis; functional testing; security; quality<br />

I. INTRODUCTION<br />

Static analysis has been a key technology in software<br />

developer’s toolboxes for decades. Static analysis is the<br />

technique of performing detailed analysis on source code<br />

without actually executing the software. This means that the<br />

technique can be applied very early in the software<br />

development lifecycle, before the team is even able to perform<br />

significant unit, integration, or system testing dynamically.<br />

Static analysis finds serious programming mistakes, such as<br />

buffer overruns, which can lead to security vulnerabilities. It<br />

can find these mistakes with very little effort so the use of<br />

static analysis is highly recommended by many practitioners,<br />

organizations as well as standards bodies that concern<br />

themselves with high quality software products in practically<br />

all verticals such as automotive, transportation, industrial,<br />

medical, networking, consumer electronics, aerospace and<br />

defense.<br />

Static analysis has evolved significantly from the flexible<br />

linters of the early days that scanned for simple code patterns to<br />

today’s advanced, whole program, static analysis tools that<br />

explore paths through the code and perform abstract analysis of<br />

that code. Even these advanced tools are not perfect on<br />

significantly sized, real-world code bases. Static analysis tools<br />

suffer from false positives, warnings that the tool emits on code<br />

that is not defective, as well as false negatives, where the tool fails to emit a warning on code that is actually defective.<br />

In this paper, we will explore buffer overruns and explore<br />

what makes some buffer overruns difficult to detect by static<br />

analysis tools. We will then explore techniques that can be<br />

employed to detect these buffer overruns dynamically during<br />

run-time.<br />

II. SIMPLE BUFFER OVERRUNS<br />

A buffer overrun in its simplest form is a read or write of<br />

data after the end of an object in memory. This can happen in<br />

many different programming languages, but this paper will<br />

focus on C and C++. Objects can of course be allocated either<br />

statically or dynamically. A very simple buffer overrun can be<br />

of the form in Figure 1.<br />

char buf[10];<br />

…<br />

buf[10] = 'a';<br />

Figure 1 -- Basic buffer overrun<br />

In this case, memory for an array of 10 characters is allocated, and later the 11th element of the array (index 10) is accessed. The result of this access is undefined; in most implementations the character ‘a’ will be written to whatever memory area is next to this buffer, overwriting what was there before.<br />

These types of faults are easy for static analysis tools to detect: the index value is hard-coded, and static analysis tools will easily flag this as a buffer overrun. In the real world, though, indexes are not hard-coded and can come from a variety of sources: device input, user input, network input, file reads, random variables and the like. There are many patterns that static analysis tools have a hard time reasoning through, such as the code in Figure 2.<br />

int i;<br />

char * s;<br />

s = (char *) malloc(100);<br />

...<br />

i=0;<br />

while (s[i] != '\0')<br />

i++;<br />

Figure 2 -- More complex buffer overrun<br />



In this example, it is harder for a static analysis tool to<br />

detect a buffer overrun. Whether or not there is an overrun here<br />

will depend on the value of the string pointed to by the variable s, and especially on whether it is null-terminated. The string can be populated anywhere in the program and can come from user, network or other input that the tool may not be able to track.<br />

Advanced static analysis tools will catch this in some cases, but<br />

not all.<br />
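One way to make the loop in Figure 2 both analyzable and safe even for non-terminated input is to carry the allocation size along and bound the scan explicitly. A minimal sketch (the function name is ours, not from any tool or standard):<br />

```c
#include <stddef.h>

/* Bounded variant of the scan in Figure 2: the capacity of the buffer
 * is passed explicitly, so the loop cannot run past the allocation
 * even when the contents are not null-terminated. */
static size_t bounded_strlen(const char *s, size_t cap)
{
    size_t i = 0;
    while (i < cap && s[i] != '\0')
        i++;
    return i;
}
```

POSIX systems offer strnlen with the same contract; the point here is that the explicit bound gives a static analysis tool a provable loop invariant, so the warning disappears for the right reason.<br />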

III. DATA TAINT<br />

Instead, what a static analysis tool may do is flag a warning to the user, indicating that a particular variable has been read from a suspicious source and has not been sufficiently checked for erroneous or suspicious values. A suspicious source could, for example, be user input, network input or file input. Take the example in Figure 3.<br />

int c;<br />

char buf[10];<br />

c = getchar();<br />

buf[c] = 'a';<br />

Figure 3 -- Data taint<br />

The last line could lead to a buffer overrun, depending on the user input read on the third line. Static analysis tools cannot predict user input, but they can flag line 4 as tainted data. This<br />

code will lead to problems in fielded systems. Data taint is<br />

further explained in the white paper ‘Protecting against Tainted<br />

Data in Embedded Apps with Static Analysis’ 2 .<br />
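The usual remedy for a tainted index is to range-check it before use. A minimal sketch of Figure 3 with the check added (the function name and error convention are ours, for illustration only):<br />

```c
#define BUF_LEN 10

/* Range-checks the tainted value before it is used as an index into
 * buf, rejecting anything outside [0, BUF_LEN). getchar() can also
 * return EOF (-1), which the lower bound rejects as well. */
static int store_at(char buf[BUF_LEN], int c)
{
    if (c < 0 || c >= BUF_LEN)
        return -1;      /* reject out-of-range or EOF input */
    buf[c] = 'a';
    return 0;
}
```

Once the tainted value flows through an explicit check like this, the data-taint warning is resolved: every path to the array access is provably in bounds.<br />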

IV. CONTROL FLOW<br />

In the previous, very simple examples, file name and line<br />

number would be sufficient information to understand the<br />

problem. In real-world code, though, it is not sufficient to just provide file name and line number. As mentioned before, advanced static analysis tools perform whole-program analysis; for a specific problem they will indicate why a particular statement is considered a problem, and the tool will also provide the path of execution that it analyzed. This path is important for the software developer to understand the reasoning and come up with a fix for the problem.<br />

The control flow of a particular problem may include<br />

multiple different function calls across different compilation<br />

units, if statements, switch statements, for loops and the like.<br />

The control flow can also contain pointer dereferences,<br />

including function pointers. All these constructs can make<br />

analysis quite complex.<br />

V. RECALL AND PRECISION<br />

Static analysis tool vendors work hard at creating tools that<br />

can do deep analysis and find as many problems as they can.<br />

The goal is always to have a high recall, where recall is defined<br />

as the percentage of real-world problems the tool is able to<br />

identify. A problem that the tool is not able to identify is<br />

referred to as a false negative. However, the tool also needs to<br />

have high precision, which is defined as the proportion of<br />

results that are true positives. A false positive is where the tool<br />

reports a warning where no problem exists.<br />
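In terms of counts, with tp true positives, fn false negatives and fp false positives, the two measures defined above work out as follows (a small illustrative sketch):<br />

```c
/* Recall: fraction of real defects the tool finds (tp out of tp+fn).
 * Precision: fraction of emitted warnings that are real (tp out of
 * tp+fp). Both lie in [0, 1]; higher is better. */
static double recall(int tp, int fn)
{
    return (double)tp / (double)(tp + fn);
}

static double precision(int tp, int fp)
{
    return (double)tp / (double)(tp + fp);
}
```

For example, a tool that finds 80 of 100 real defects while raising 20 spurious warnings has recall 80/100 = 0.8 and precision 80/100 = 0.8: the 20 missed defects are false negatives, the 20 spurious warnings false positives.<br />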

For safety and security-critical code recall is typically more<br />

important than precision. A false negative that is lurking in a<br />

fielded product can have disastrous impacts. Still, a tool needs<br />

to have sufficient precision, or developers lose trust in it. This<br />

means that a static analysis tool cannot flag every construct that it thinks may be problematic; it needs sufficient evidence that there is a case where the construct is a true problem.<br />

Take for example the code in Figure 2. Should the tool<br />

issue a warning here or not if it is unable to trace the origin of<br />

the content of the string?<br />

Static analysis is focused on preventing programming<br />

mistakes. Buffer overruns are one type of these, but there are<br />

many others such as null-pointer dereferences, dead code,<br />

wrong type casts and the like. There are typically four<br />

categories of problems that static analysis tools can catch:<br />

1. Behavior that is undefined by the language. This is<br />

the category that buffer overruns fall in.<br />

2. API misuse. An example would be to do a send<br />

without opening a socket.<br />

3. Suspicious behavior. Dead code, for example.<br />

4. Coding standard violations.<br />

Using static analysis to catch these programming mistakes<br />

significantly improves the quality of the code in the source<br />

code repository. While many senior programmers sometimes<br />

complain that they do not need a tool to watch over their<br />

shoulder, the reality is that a) everybody makes mistakes and b)<br />

not everybody is a senior programmer. Static analysis helps<br />

everybody write better code. Static analysis does not verify<br />

functional correctness, though. That is what functional testing<br />

is supposed to address.<br />

VI. FUNCTIONAL TESTING<br />

Once code is sufficiently fleshed out, it can be tested.<br />

Testing typically happens at different levels, from unit testing,<br />

where a single function or set of functions is tested, to integration testing, where multiple components come together, to system testing, where the system is tested in its entirety.<br />

Testing mostly focuses on functional correctness, which<br />

means verifying whether an input has the desired effect. The<br />

effect could be an output, or a change in system state. Testing<br />

typically starts at the unit-test level and is driven through<br />

testing harnesses, either hand-written, or built through<br />

automation tools from vendors like VectorCast, QA Systems,<br />

VerifySoft, or the like. These tools not only make creating test<br />

harnesses easier, they also facilitate execution of the test cases<br />

on desktop, host or embedded targets and collecting and<br />

reporting of the results.<br />

2 https://resources.grammatech.com/whitepapers/protecting-against-tainteddata-in-embedded-apps-with-static-analysis<br />



The challenge with functional testing is that it can easily<br />

overlook the state corruption caused by buffer overruns. There<br />

are two reasons for this:<br />

1) Problems may only occur in corner cases;<br />

2) State corruption is not detected.<br />

The first problem can be dealt with by exhaustive testing.<br />

Functional testing needs to test not just the ‘happy path’, where<br />

all input is correct and expected and we are making sure the<br />

algorithm works. Testing also needs to try and break the<br />

algorithm by providing malformed input, or going outside of<br />

data ranges. One of the famous examples of this is the<br />

Heartbleed bug in OpenSSL, caused by a simple programming<br />

error, where malformed input could trick a server into a buffer<br />

overrun and share too much sensitive information.<br />

Testing tools help with this, as do techniques such as fuzz<br />

testing (fuzzing) 3 , which generates input values in a way that<br />

tries to steer an algorithm into corner cases. We will not delve deeper into this in this paper.<br />

The second problem is due to the fact that functional testing<br />

tools do not directly look for state corruption. They<br />

generally look for the right output that corresponds to a given<br />

input. They may not detect even the simplest buffer overrun<br />

examples presented earlier unless the overwrites cause<br />

incorrect output or abnormal termination. This is often not the<br />

case with buffer overruns.<br />

VII. CATCHING BUFFER OVERRUNS DYNAMICALLY<br />

Let’s assume that we have proper unit testing that tests the happy path, but that also tests corner cases, as is often the case for code that is security- and safety-critical. Projects that build these types of products have the focus, and are given the time and resources, to make sure that their software is exhaustively tested. Projects drive this by making sure that they have<br />

complete code coverage, meaning that they have executed (and<br />

hence tested) every statement or condition outcome in the code<br />

at least once. While this is good, this is not sufficient to prove<br />

that there are no buffer overflows in the program; for that, you<br />

would have to test all paths through the source<br />

code. 100% statement or condition coverage is no guarantee<br />

that you also have 100% path coverage.<br />
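
A small C example (invented for illustration) makes the gap concrete: the first two test cases below execute every statement of build_message, yet the one path that overflows an 8-byte buffer, header and footer combined, is never exercised by them:<br />

```c
#include <string.h>

#define N 8  /* intended size of the output buffer */

/* Two independent branches. Tests (1,0) and (0,1) together reach 100%
 * statement coverage, but only the path taking BOTH branches writes a
 * 9th byte into an N-byte buffer. */
int build_message(char *buf /* assumed to hold N bytes in production */,
                  int add_header, int add_footer) {
    int len = 0;
    if (add_header) {
        memcpy(buf + len, "HDR:", 4);
        len += 4;
    }
    memcpy(buf + len, "data", 4);
    len += 4;
    if (add_footer) {
        buf[len++] = '!';  /* 9th byte when the header is also present */
    }
    return len;  /* 9 > N on the header-plus-footer path */
}
```

The tests in the sketch use an oversized buffer so the overflowing path can be counted safely; in production, only full path coverage would reveal the 9-byte case.<br />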

Still, we have to detect when the program writes or reads<br />

outside of a buffer and corrupts the state of a program. This<br />

typically involves special treatment of memory allocations, the<br />

addition of canaries around memory areas and inspection of<br />

memory accesses into these areas. There are a number of<br />

different existing tools for this, each with their own benefits<br />

and disadvantages. These tools monitor memory accesses<br />

during execution and when they see a suspicious access they<br />

provide some amount of feedback in a log file, or on standard<br />

output. The output is generally a memory region where the<br />

problem happened and a short stack trace.<br />
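
The canary technique can be sketched in a few lines of C; this is a deliberately minimal illustration with made-up names, not how any of the tools below actually implement it:<br />

```c
#include <stdlib.h>
#include <string.h>

/* Surround each allocation with known guard bytes and check them later. */

#define CANARY 0xAAu
#define GUARD  8   /* guard bytes on each side */

unsigned char *guarded_alloc(size_t n) {
    unsigned char *raw = malloc(n + 2 * GUARD);
    if (!raw) return NULL;
    memset(raw, CANARY, GUARD);              /* front canary */
    memset(raw + GUARD + n, CANARY, GUARD);  /* rear canary */
    return raw + GUARD;                      /* pointer handed to the program */
}

/* Returns 1 if both canaries are intact, 0 if the buffer was overrun. */
int guarded_check(const unsigned char *p, size_t n) {
    const unsigned char *raw = p - GUARD;
    for (size_t i = 0; i < GUARD; i++) {
        if (raw[i] != CANARY || raw[GUARD + n + i] != CANARY)
            return 0;
    }
    return 1;
}

void guarded_free(unsigned char *p) { free(p - GUARD); }
```

Writing one byte past a 4-byte buffer destroys the rear canary, so guarded_check reports the overrun even though the program neither crashed nor produced wrong output.<br />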

Valgrind 4 is an instrumentation framework for dynamic<br />

analysis tools. It is popular and extremely flexible and people<br />

3 https://en.wikipedia.org/wiki/Fuzzing<br />

4 http://valgrind.org/<br />

have built a number of different tools on top of it, including a<br />

memory error detector, which would suit our needs. However,<br />

the execution-time overhead that Valgrind requires is<br />

significant, which makes it not always feasible to use.<br />

AddressSanitizer (ASan) 5 is another popular solution that is<br />

faster than Valgrind and available with Clang and GCC<br />

compilers.<br />

Both Valgrind and ASan are considered debugging tools:<br />

tools that developers use when they hit problems and try to<br />

figure out how to resolve them. Both report on memory<br />

problems by giving addresses that can then be resolved through<br />

the debugger to point to a location in the source code.<br />

GrammaTech has also recently announced a product to<br />

detect these state corruptions, CodeSonar/X 6 , an addition to<br />

GrammaTech’s static analysis tool CodeSonar. This solution<br />

differs from Valgrind and ASan in that it is more performant<br />

in both the time and space dimensions, can be used during<br />

the development cycle, and can be left in deployed systems<br />

as well. It supports different<br />

operating systems (including embedded operating systems like<br />

VxWorks) and can be made to support additional compilers.<br />

The technology behind GrammaTech’s CodeSonar/X is<br />

derived from its participation in DARPA’s Cyber Grand<br />

Challenge 7 .<br />

VIII. PUTTING IT ALL TOGETHER<br />

So far this paper has argued that both static analysis and<br />

dynamic analysis are required. Combining static and<br />

dynamic analysis is the next step to assist projects to build<br />

better software faster.<br />

Static analysis can be done early in the development<br />

lifecycle and it catches programming mistakes early in the<br />

process and improves the source code that ends up in the<br />

source control repository.<br />

Functional testing is always required and should exercise<br />

the code as much as possible, touching as many code paths as<br />

is realistic, as early in the software development lifecycle as<br />

possible. Different projects will have different requirements for<br />

the depth of analysis here: an internet-connected fridge will<br />

spend less time on low level functional testing compared to the<br />

auto-pilot function of an airplane, or algorithms for self-driving<br />

cars.<br />

Dynamic state corruption detection is a great asset and can<br />

be integrated into all layers of the testing cycle. Combined with<br />

proper test coverage, this provides an additional layer of fault<br />

detection.<br />

Detecting the problems is one part of the puzzle; the second<br />

part is to help developers understand the problems. To do this,<br />

GrammaTech CodeSonar can combine the output of state<br />

corruption tools (Valgrind, ASan and of course CodeSonar/X)<br />

with its static analysis results.<br />

5 https://github.com/google/sanitizers/wiki/AddressSanitizer<br />

6 https://www.grammatech.com/products/codesonar<br />

7 https://www.darpa.mil/program/cyber-grand-challenge<br />

www.embedded-world.eu<br />



Any state corruptions are reported in the static analysis<br />

tool’s user interface and combined with the static analysis<br />

warnings. This delivers two main benefits:<br />

• Confirmation of existing warnings<br />

• Detection of false negatives<br />

The confirmation of existing warnings as true positives<br />

happens when a dynamically found warning appears on the<br />

same line as a similar warning that was found statically. This is<br />

an immediate sign to the software engineer that the problem is<br />

a serious one and should be high priority to fix.<br />

The detection of false negatives happens when a state<br />

corruption occurs where static analysis had not previously<br />

reported a problem. This will also result in a high priority<br />

warning report and provides not just filename and line-number,<br />

but reports on the execution trace as well.<br />

IX. EXAMPLES<br />

A couple of examples demonstrate combining static and<br />

dynamic analysis. Figure 4 shows a traditional static buffer<br />

overflow warning on line 30. Intermixed in the output are two<br />

dynamically detected warnings on lines 21 and 30. The ‘Invalid<br />

Write’ warnings were detected by CodeSonar/X during runtime.<br />

The warning on line 30 shows the power of combining<br />

static and dynamic analysis. The static warning would have<br />

been found first in the software development lifecycle. Once<br />

the developer checks in the code, this warning would have<br />

been flagged immediately, even before the code is executed.<br />

With the dynamic tests, though, there is now proof that this<br />

problem has been hit during testing, which should increase its<br />

priority.<br />

The warning on line 21 shows that dynamic tests can find<br />

things that were missed statically.<br />

Figure 4 -- Static and dynamic warnings intermixed<br />

As a second example, we can combine static analysis on<br />

source code with the dynamic results from Valgrind and get<br />

something similar, see Figure 5. In this case not a buffer<br />

overrun, but an ‘abort()’ call hit during execution.<br />

Figure 5 -- Crash observed through Valgrind, reported in CodeSonar<br />

X. SUMMARY<br />

Buffer overruns can lead to exploitable vulnerabilities, and<br />

these can be costly: cyber vulnerabilities cost a company<br />

approximately $15.4 million per instance according to Forbes 9 .<br />

Any reasonable effort that we can make to reduce the number<br />

of buffer overruns that make it into fielded products seems<br />

justified.<br />

Static analysis is not new, functional testing is not new,<br />

state corruption detection is not new, but the combination of<br />

the three together provides exciting new capabilities to the<br />

software development teams. Applying these three<br />

technologies requires proper investing in testing infrastructure<br />

and investment in proper test cases and is in no way free.<br />

However, combining these technologies promises to find<br />

difficult-to-find problems earlier and hence reduce the number<br />

of fielded buffer overruns, which handsomely justifies the<br />

investment.<br />

9 https://www.forbes.com/sites/moneybuilder/2015/10/17/an-average-cybercrime-costs-a-u-s-company-15-4-million/#2bdf663032cb<br />



X-Ray Your Software Supply Chain<br />

Creating Automated Security Gates<br />

Ralf Huuck<br />

Software Integrity Group<br />

Synopsys<br />

Sydney, Australia<br />

ralf.huuck@synopsys.com<br />

Abstract— Software security has become a key challenge for<br />

embedded systems. This is particularly true for connected<br />

products such as those that can be found in the IoT space or the<br />

autonomous driving market. One of the big unknowns is third-party<br />

and open source software. In this work we present the results<br />

of the analysis of over 120,000 software artifacts. For each we<br />

identified the open source components and compared them with<br />

the known software vulnerabilities. The results are striking.<br />

Moreover, we advise on how to integrate such a security scanning<br />

activity into the software development lifecycle (SDLC) and how to manage the supplier<br />

relationship.<br />

Keywords—software composition; security; automated security<br />

gates; CVE; CVSS; open source security study<br />

I. INTRODUCTION<br />

As seen with the IoT-based MIRAI botnet, security<br />

vulnerabilities can have their root cause several layers down in<br />

the supply chain. This is particularly threatening for complex<br />

and deep supply chains as prevalent in domains such as<br />

automotive and industrial control systems.<br />

In this work, we present our results from security scanning<br />

over 120,000 embedded software packages across a wide<br />

range of application domains. We automatically decomposed<br />

each software package into its components and cross-matched<br />

each component with its known security vulnerabilities as<br />

recorded in the National Vulnerability Database (NVD). We<br />

explain the purpose of the NVD, how to use it and how to<br />

make sense of the recorded Common Vulnerability Exposure<br />

(CVE) entries.<br />

We detail our findings by listing the components that are<br />

most commonly used and those that most commonly have a<br />

vulnerability, as well as their age and the likelihood of existing<br />

patches that would remedy the situation. Moreover, we give an<br />

overview of the most critical vulnerabilities and the<br />

prevalence of “celebrity” bugs still active in embedded<br />

software.<br />

To remedy the situation, we explain what an automated,<br />

trustworthy supply chain process could look like, built<br />

on various scanning and security gates across embedded<br />

suppliers, integrators and vendors. In particular, we take into<br />

account that many embedded vendors are not security experts.<br />

II. THE STATE OF OPEN SOURCE COMPONENTS<br />

A. Background<br />

Synopsys regularly publishes research into vulnerabilities<br />

in open source components. It summarizes the results of<br />

software uploaded to the Protecode platform [2]. Protecode is<br />

an analysis software that examines binary files for existing<br />

open source components, determines the version numbers of<br />

the open source components, and compares these components<br />

with existing databases of vulnerabilities. The primary<br />

database of vulnerabilities is the National Vulnerability<br />

Database (NVD) as maintained by NIST [3]. This database<br />

contains some 90,000 entries that document known<br />

vulnerabilities (CVEs), their causes and their implications.<br />

Moreover, each CVE is assigned a vulnerability score in the<br />

Common Vulnerability Scoring System (CVSS) [4]. The<br />

CVSS score ranges between 0 and 10; the higher the score, the<br />

more critical the vulnerability.<br />
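
For orientation, the qualitative bands can be sketched as a small C helper; the cut-offs below follow the CVSS v3.0 rating scale (None 0.0, Low 0.1–3.9, Medium 4.0–6.9, High 7.0–8.9, Critical 9.0–10.0), and other CVSS versions band scores differently:<br />

```c
/* Maps a CVSS score to its qualitative severity band (CVSS v3.0 scale). */
typedef enum { SEV_NONE, SEV_LOW, SEV_MEDIUM, SEV_HIGH, SEV_CRITICAL } severity_t;

severity_t cvss_severity(double score) {
    if (score <= 0.0) return SEV_NONE;
    if (score <= 3.9) return SEV_LOW;
    if (score <= 6.9) return SEV_MEDIUM;
    if (score <= 8.9) return SEV_HIGH;
    return SEV_CRITICAL;
}
```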

B. Open source component security study 2016/2017<br />

The 2016/2017 Software Composition Study included about<br />

130,000 uploads to the Protecode platform. Over 16,000<br />

different components and versions were automatically<br />

identified. Figure 1 shows a breakdown of the most frequently<br />

identified components by task and application area. About two-thirds<br />

of all components are utilities for Windows and Linux<br />

tools, network protocols such as SSL and HTTP, and media<br />

libraries for jpg, png or XML.<br />

While this is not unexpected, it is interesting to note that<br />

common utilities are implicitly trusted. In fact, such basic<br />

utilities are often not even considered 3rd-party software that<br />

would be subjected to a rigorous security analysis. At the same<br />

time these utility functions are part of standard deployments to<br />

establish network connections, parse data formats and read from<br />

files or databases. Any weakness in these components can easily<br />

be imagined to have larger security implications.<br />



In addition, the security vulnerabilities found are<br />

generally not new, as Figure 2 shows: About 50% of all<br />

vulnerabilities are four years old or older. In most cases, there is<br />

a newer and safer version for the component in question<br />

available. It is, however, not used. It is worth noting that<br />

security vulnerabilities are typically discovered over time. This<br />

means a component that could be considered perfectly fine<br />

today might not be secure tomorrow as new research and<br />

insights are obtained. This is particularly difficult for<br />

manufacturers to control and correct.<br />

A typical example of outdated components is the Heartbleed<br />

vulnerability. First widely publicized in 2014, it gained a lot of<br />

press as an SSL vulnerability affecting a large number of the<br />

world’s web servers. In our study about 3 years later we find that<br />

Heartbleed is still in the top 50% of all found CVEs. This means<br />

it is still widely prevalent. Other celebrity bugs such as<br />

Stagefright or Ghost, however, only occur sporadically and can<br />

be assumed to be generally addressed.<br />

Fig. 1. Overview of Top 20 components detected.<br />

In our study we were able to identify around 9,000 security<br />

vulnerabilities with corresponding CVEs in the overall 16,000<br />

different components. This means a large number of the<br />

components identified cannot be considered secure. We note,<br />

however, that having a security vulnerability is not the same as<br />

being exploitable. It only means there exists an exploit<br />

possibility for that component under the right circumstances.<br />

Whether these circumstances are present cannot necessarily be<br />

verified automatically and was not part of this study.<br />

Fig. 2. Age of CVEs by initial detection year.<br />

For the software supply chain these findings have serious<br />

implications: It cannot be assumed that third-party<br />

software is generally secure. In fact, the opposite is more likely.<br />

Furthermore, given the delay between the introduction of a<br />

security vulnerability and its detection through researchers there<br />

is a likelihood that even in the best case secure products that are<br />

out today might need some updates in the future. As a result, a<br />

strong and lasting supplier-manufacturer relationship is<br />

advisable.<br />

In the following, we make some suggestions on how to<br />

structure an automated security scanning process.<br />

III. AUTOMATION & INTEGRATION INTO THE SDLC<br />

It is unrealistic to dispense with components from open<br />

source and third-party providers. These are useful<br />

ingredients to deliver products less costly and relatively quickly.<br />

Moreover, it is by no means proven that open source<br />

components are worse than off-the-shelf software. In fact, the<br />

reverse is often the case. Our study shows, however, that the<br />

security of third-party components needs to be vetted.<br />

The vetting of third-party software used to be a complex and<br />

specialized domain. However, new software solutions such<br />

as Protecode, Sonatype, or Black Duck are available in the<br />

marketplace to perform these evaluations automatically [5].<br />

Moreover, these software solutions can be integrated<br />

automatically into the development process. This means that, for<br />

example, a DevOps Jenkins process can be started that runs<br />

an automated analysis with every product build, discovering<br />

open source components and comparing them with<br />

their known security vulnerabilities. These results can then be<br />

made available promptly to the software development teams and<br />

quality teams.<br />

In order to achieve this automated vetting pipeline, the right<br />

understanding and processes must be in place. This means the<br />

organization must be set up to have this vetting process as part<br />

of their release or development plan. There needs to be an owner<br />

to define the acceptable policies and actions that need to be taken<br />

should a component fail the security qualification. Moreover, it<br />

is advisable to integrate this vetting process early and<br />



continuously into the SDLC, as a once-off check just before<br />

release often does not leave time to apply appropriate fixes.<br />
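
The gate logic at the heart of such a pipeline can be sketched in C; the component names, scores, and threshold below are purely illustrative, and a real gate would consume the scanner's report rather than a hard-coded list:<br />

```c
/* Automated security gate: the build fails when any identified component
 * carries a known CVE whose CVSS score reaches the policy threshold. */
typedef struct {
    const char *name;
    double max_cvss;  /* highest CVSS score among the component's known CVEs */
} component_t;

/* Returns 0 when the gate passes, otherwise the number of violating components. */
int security_gate(const component_t *c, int n, double threshold) {
    int violations = 0;
    for (int i = 0; i < n; i++)
        if (c[i].max_cvss >= threshold)
            violations++;
    return violations;
}
```

The policy owner mentioned above would set the threshold and decide what happens on failure, for example breaking the Jenkins build or opening a ticket.<br />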

IV. SUMMARY<br />

In this work we presented our insights of scanning software<br />

products for known third-party components and their<br />

vulnerabilities. We showed that a large number of security<br />

vulnerabilities can be identified in common products.<br />

Moreover, we showed that software components are often<br />

used that are outdated and for which newer and patched<br />

versions exist.<br />

We indicated what an automated SDLC approach to vetting<br />

software components against known vulnerabilities might look like.<br />

Finally, we believe it is advisable to communicate this security<br />

vetting process with the suppliers to increase awareness on the<br />

supplier side as well, establish contractual terms for mitigation<br />

and patches, and encourage the suppliers to proactively initiate<br />

their own scanning to avoid passing down low-security<br />

components. As a result, higher quality and more secure<br />

software can be produced without much overhead, enabling<br />

any market player to stand out as a premium vendor.<br />

REFERENCES<br />

[1] Constantinos Kolias, Georgios Kambourakis, Angelos Stavrou and<br />

Jeffrey Voas. DDoS in the IoT: Mirai and Other Botnets. IEEE Computer,<br />

Volume 50/7, 2017.<br />

[2] Synopsys Software Integrity Group. The State of Software Composition<br />

2017. https://www.synopsys.com/software-integrity/resources/analystreports/state-of-software-composition-2017.html<br />

[3] Harold Booth, Doug Rike, Gregory A. Witte. The National Vulnerability<br />

Database (NVD): Overview. ITL Bulletin, December 2013.<br />

[4] Peter Mell, Karen Scarfone, and Sasha Romanosky. 2006. Common<br />

Vulnerability Scoring System. IEEE Security and Privacy 4, 6<br />

(November 2006), 85-89.<br />

[5] Millar S. Vulnerability Detection in Open Source Software: The Cure<br />

and the Cause. Queen's University Belfast, 2017.<br />



My Processor is Inside of an FPGA<br />

What Do I Do Now?<br />

Glenn Steiner<br />

Xilinx, Inc.<br />

San Jose, CA USA<br />

Abstract—With the drive to increase integration, reduce<br />

system costs, accelerate performance, and enhance reliability,<br />

software developers are discovering the processor they are<br />

targeting may be embedded inside of an FPGA. This paper will<br />

help you, the system architect or software developer, understand<br />

how you can architect and develop software, and even accelerate<br />

code via FPGA accelerators.<br />

Keywords— FPGA, Programmable Logic, SoC, System on a<br />

Chip, Extensible Processing Platforms, Multicore, Reconfigurable<br />

Architectures, Reconfigurable Systems, Programmable Systems<br />

I. INTRODUCTION<br />

As a software developer, you may have just been told that<br />

your next software project will be targeting a processor inside<br />

of an FPGA. How will this impact your development process<br />

and what benefits might you gain with this tight integration<br />

of processor and FPGA? Starting from the basics of what<br />

FPGAs are (in terms of software programming), this paper<br />

provides a simple-to-understand primer of what modern<br />

FPGAs with embedded processors can do. Next we will<br />

describe how one develops and debugs embedded processor<br />

applications. Finally, we will wrap up with examples of how<br />

high level synthesis tools can move software to<br />

programmable logic hardware enabling dramatic software<br />

acceleration.<br />

As product designs increase in complexity, there is a need<br />

to use integrated components, such as Application Specific<br />

Standard Products (ASSPs), to address design requirements.<br />

Years ago, engineers chose individual components for<br />

processor, memory, and peripherals, and then pieced these<br />

elements together with discrete logic. More recently,<br />

engineers search through catalogs of ASSP processing<br />

systems attempting to find the nearest match to meet system<br />

requirements. When additional logic or peripherals are<br />

required, an FPGA is frequently mated with an ASSP to<br />

complete the solution. Over the last few years, FPGA sizes<br />

have increased, providing sufficient space to accommodate<br />

complete processor and logic systems within a single device.<br />

Software engineers are now faced with developing and<br />

debugging code targeting a processor inside of an FPGA and<br />

in some cases fear doing so. In this paper we will describe<br />

FPGAs and the process of creating and debugging code for<br />

FPGA embedded processors.<br />

II. WHAT IS AN FPGA?<br />

A Field Programmable Gate Array (FPGA) is an integrated<br />

circuit containing logic that may be configured and connected<br />

after manufacturing or “in the field”. Where in the past<br />

engineers purchased a variety of logic devices and then<br />

assembled them into a system design via connections on a<br />

printed circuit board, today hardware designers can implement<br />

complete system designs within a single device. In their<br />

simplest form FPGAs contain:<br />

• Configurable Logic Blocks<br />

• AND, OR, Invert & many other logic functions<br />

• Configurable interconnect enabling Logic Blocks to be<br />

connected together<br />

• I/O Interfaces<br />

With these elements an arbitrary logic design may be created.<br />

Note: With the transition of embedded processors integrated<br />

with FPGAs, and the concept of both programmable processors<br />

and programmable FPGAs, the idea that FPGAs are now<br />

Programmable Logic aligns with the terminology of<br />

Programmable Processors. Thus, in this paper we will use<br />

Programmable Logic to describe the FPGA logic inside of an<br />

All Programmable device.<br />

Hardware engineers usually write code in HDL (typically<br />

either Verilog or VHDL) and then “compile” the design into an<br />

“object file” which is loaded into the device for execution. On<br />

the surface the HDL programs can look very much like High<br />

Level Languages such as C.<br />

The following is an implementation of an 8 bit counter<br />

written in Verilog courtesy of www.asic-world.com. One can<br />

see many constructs taken from today’s high level languages:<br />

© Copyright 2017 Xilinx<br />



//----------------------------<br />

// Design Name : up_counter<br />

// File Name : up_counter.v<br />

// Function : Up counter<br />

// Coder : Deepak<br />

//----------------------------<br />

module up_counter (<br />

out ,// Output of the counter<br />

enable ,// enable for counter<br />

clk ,// clock Input<br />

reset // reset Input<br />

);<br />

//---Output Ports--------------<br />

output [7:0] out;<br />

//----Input Ports--------------<br />

input enable, clk, reset;<br />

//---Internal Variables--------<br />

reg [7:0] out;<br />

//------Code Starts Here-------<br />

always @(posedge clk)<br />

if (reset) begin<br />

out <= 8'b0;<br />

end else if (enable) begin<br />

out <= out + 1;<br />

end<br />

endmodule<br />


IP power-gating options. The fourth power domain is the<br />

programmable logic (PL).<br />

o 2 tightly coupled memories (TCM): Connected to the Cortex-R5s, each with 4 individually power-gated banks<br />

o On-Chip Memory (OCM): 4 individually power-gated banks<br />

o 2 USBs: Each individually power-gated<br />

Figure 1: Zynq UltraScale+ MPSoC Power Domains<br />

1) Battery Power Domain<br />

The battery power domain, which can be powered by an<br />

external battery, contains battery-backed RAM (BBRAM) for<br />

an encryption key, and a real-time clock with external crystal<br />

oscillator to maintain time even when the device is off.<br />

2) Full-Power Domain<br />

The full-power domain consists of the Application Processor<br />

Unit, with the ARM® Cortex-A53 processors, the Graphics<br />

Processing Unit, the DDR memory controller, and the high-performance<br />

peripherals including PCI Express®, USB 3.0,<br />

DisplayPort, and SATA.<br />

3) Low-Power Domain<br />

The low-power domain consists of a Real-time Processor<br />

Unit (RPU) with the ARM Cortex-R5 processors, static On-Chip<br />

Memory (OCM), the Platform Management Unit (PMU), the<br />

Configuration and Security Unit (CSU), and the low-speed<br />

peripherals.<br />

4) Programmable Logic<br />

The Programmable Logic power domain consists of logic<br />

cells, block RAMs, DSP blocks, XADC, I/Os, and high-speed<br />

serial interfaces. Some devices include the video codec, PCIe<br />

Gen-4, UltraRAM, CMAC, and Interlaken.<br />

C. Power Islands for Fine-Grain Power Management<br />

Within the full- and low-power domains, there are multiple<br />

power islands. Each island is capable of being power-gated<br />

locally within the device. The following islands can be power-gated:<br />

• Full-Power Domain<br />

o 4 ARM Cortex-A53 application processors: Each can be individually power-gated<br />

o L2 cache servicing the Cortex-A53 processors<br />

o 2 pixel processors in the Graphics Processing Unit: Each can be individually power-gated<br />

• Low-Power Domain<br />

o 2 ARM Cortex-R5 processors: Power-gated as a pair<br />

VI. HOW THE HARDWARE ENGINEER IMPLEMENTS A PROCESSING SYSTEM DESIGN<br />

Tools allow the rapid assembly of processor systems via<br />

wizards. Using drop-down lists or check boxes, one simply<br />

specifies the targeted part, the desired processor, and<br />

peripherals. The processor and data processing systems in the<br />

device can be connected by graphically connecting bus<br />

interfaces.<br />

VII. HOW THE SOFTWARE ENGINEER CREATES AND DEBUGS<br />

CODE<br />

The software development process follows the following steps:<br />

1. Create a software development workspace and import<br />

the hardware platform.<br />

2. Create the software project and Board Support<br />

Package<br />

3. Create the software<br />

4. Run and debug the software project<br />

5. Optional: Profile the software project<br />

Steps 3, 4 and 5 are familiar to most developers. Steps 1 and<br />

2 may be new to some developers but are straightforward. We<br />

will use the Eclipse development environment as an example for<br />

the above steps.<br />

1. Creating a Software Development Workspace and<br />

Importing the Hardware Platform:<br />

After starting Eclipse the user is prompted for a workspace<br />

to use. A workspace is simply a directory path where project<br />

files are to be stored. Next the user specifies the hardware<br />

platform (design). This file is automatically generated by the<br />

hardware development tools and describes the processor system<br />

including memory interfaces and peripherals including memory<br />

maps. The file is output from the hardware development tools<br />

and the hardware engineer will typically supply this file to the<br />

software developer. Once specified, the hardware platform is<br />

imported and this step is complete.<br />

2. Creating the Software Project and Board Support<br />

Package (BSP)<br />

The Board Support Package (BSP) contains the libraries and<br />

drivers that software applications can utilize when using the<br />

provided Application Program Interfaces (APIs). A software<br />

project is the software application source and settings.<br />

For Xilinx C projects, Eclipse automatically creates<br />

Makefiles that will compile the source files into object files and<br />

link the object files into an executable.<br />



Next the system generates the BSP and automatically loads<br />

the applicable drivers based upon the defined hardware platform<br />

and operating system. These drivers are then compiled.<br />

3. Creating the Software<br />

At this point one may either import a software example or<br />

create code from scratch. As one saves code Eclipse<br />

automatically compiles and links the code reporting out any<br />

compiler or linker errors.<br />

4. Running and debugging the software project<br />

With FPGAs there is one step that must be completed prior<br />

to executing code; the FPGA must be programmed. In Eclipse<br />

the user simply selects Tools > Program FPGA. This step takes<br />

the hardware design created by the hardware engineer and<br />

downloads it to the FPGA. Once completed the user may select<br />

the type of software to be built:<br />

Debug – Turns off code optimization and inserts<br />

debugging symbols<br />

Release – Turns on code optimization<br />

Note: For profiling one uses the –pg compile option.<br />

Finally the user may run the code by selecting Run and<br />

defining the type of run configuration and compiler options. If<br />

Release has been selected the processor will immediately begin<br />

code execution. Otherwise, the processor will execute a few<br />

boot instructions and will stop at the 1st line of source code and<br />

the Debug perspective will appear in Eclipse.<br />

From the Debug perspective the user may view the source or<br />

object code, registers, memory and variables. They may single<br />

step code at either the source or object level and may set<br />

breakpoints for code execution.<br />

5. Profile the software project<br />

Should the user desire they may profile code and view the<br />

number of function calls as well as see the percentage of time<br />

spent in any given function.<br />

VIII. SOFTWARE ACCELERATION VIA PROGRAMMABLE LOGIC<br />

With All Programmable Devices, one has the unique<br />

opportunity of turning software code into hardware accelerators.<br />

In the past one had to do such via tedious manual steps of<br />

creating a hardware engine that performed the desired software<br />

function; attach DMA engines to move data between the<br />

accelerator and memory; and create software interfaces between<br />

the replaced function(s) and the hardware accelerators and<br />

associated memory. Today there are modern C to HDL tools<br />

such as the Xilinx SDSoC environment that automate this<br />

process. When developing accelerators with such a tool the user<br />

performs the following steps:<br />

a. Profile and identify time critical functions<br />

b. Use the C to HDL tool to automatically create:<br />

i. The hardware representation and HDL code<br />

of the function to be accelerated<br />

ii. Attached DMA engines to move data to and<br />

from the accelerator<br />

iii. Replacement hardware functions for the<br />

original software functions<br />

c. Tune the design using provided performance data<br />

including logic utilization, estimated clock cycles and<br />

latency<br />

Design tuning allows the user to optimize the design for<br />

performance via increased pipelining, which allows more<br />

computations to be done in parallel per clock cycle, or via<br />

parallelizing computations by having multiple computation<br />

pipes running at the same time.<br />

Dramatic acceleration of software functions can be obtained<br />

using this methodology. A few examples include:<br />

Algorithm – Hardware Acceleration vs. Software<br />

MRI Back Projection Algorithm – 8x<br />

16k Fast Fourier Transform (FFT) – 10x<br />

Optical Flow – 25x<br />

Stereo Local Block Matching – 25x<br />

2D Video Optical Filter – 30x<br />

Binary Neural Network – 9,000x<br />

IX. CONCLUSION<br />

For cost, power, size, and overall system efficiency, embedded<br />

processors with programmable logic are becoming primary<br />

design choices. Software engineers do not need to consider an<br />

FPGA embedded processor as a mystery or any more difficult to<br />

program than an external processor. Industry standard<br />

development environments such as Eclipse are now being<br />

provided by FPGA vendors at competitive costs and are<br />

customized for FPGA embedded processing. Within these<br />

environments users can create, compile, link and download<br />

code, and as necessary debug their designs in the same manner<br />

as they have done in the past with external processors. FPGA<br />

embedded processors have extensive IP libraries, drivers and OS<br />

support. Finally, modern C to HDL tools enable software<br />

engineers to automatically build hardware accelerators for<br />

software functions yielding orders of magnitude improvement in<br />

software performance.<br />

REFERENCES<br />

[1] Xilinx, Inc., “Zynq UltraScale+ Device Technical Reference Manual,”<br />

December, 2017.<br />

[2] Xilinx, Inc., “Zynq UltraScale+ MPSoC Software Developer Guide,”<br />

November, 2017.<br />

[3] Xilinx, Inc., “Xilinx Software Development Kit (SDK),”<br />

[4] Xilinx, Inc., “SDSoC Environment User Guide,” December, 2017.<br />


© Copyright 2017 Xilinx<br />



Yocto Project Linux as a Platform for<br />

Embedded Systems Design<br />

Alex González García<br />

Software Engineering Manager<br />

Digi International Inc.<br />

Logroño, Spain<br />

Abstract—Given the wide variety and individuality of<br />

embedded devices, choosing an operating system is not simple.<br />

This paper examines the process of selecting an embedded device<br />

operating system by highlighting the most important decision<br />

factors and weighing the available options. It discusses the benefits<br />

of using the Yocto Project to build a custom Linux-based<br />

embedded operating system. Keywords—yocto; debian;<br />

embedded; buildroot; operating system<br />

I. INTRODUCTION<br />

The choice of an operating system (OS) is one of the most<br />

critical decisions in embedded product design. Embedded<br />

systems, as opposed to general-purpose computers, are not a<br />

homogeneous group of devices and cannot be treated as a single<br />

entity. Every embedded device is unique. For instance, a single<br />

widely used architecture such as ARM encompasses embedded<br />

devices ranging from 32-bit microcontrollers to 64-bit multicore<br />

CPUs.<br />

While they range in features and complexity, embedded<br />

devices also share common OS considerations:<br />

• Power consumption<br />

• Security, particularly with always-connected devices and the internet of things<br />

• Quick start-up time<br />

• Networking stacks<br />

• Some amount of real-time (RT) determinism and low latency<br />

• User interface, possibly graphical<br />

The choice of an embedded OS is also influenced by cost and<br />

time to market, both of which are directly proportional to system<br />

complexity.<br />

¹ https://www.freertos.org/<br />

² https://www.mbed.com/<br />

II. MICROCONTROLLER- OR MICROPROCESSOR-BASED SYSTEMS<br />

Embedded devices exist on a spectrum of complexity. At the<br />

low end, which generally equates to lower cost and faster time<br />

to market, are embedded devices with a microcontroller<br />

(MCU). These devices typically lack memory management<br />

(MMU-less). Embedded developers working with MCUs are<br />

intimate with the hardware and microprocessor architecture and<br />

make extensive use of JTAG debuggers. The application is<br />

usually bundled with the OS on a single flat memory model.<br />

MCUs usually run in-house developed, bare-metal OSs. There<br />

has also been a recent uptick in the use of open source<br />

alternatives such as FreeRTOS¹, mbed², or Zephyr³.<br />

Microprocessor (CPU)-based systems are higher up the<br />

complexity spectrum. For example, in an ARM architecture,<br />

systems-on-chip (SoCs) extend the available MCU interfaces with<br />

more complex blocks like HDMI and USB controllers. They may also<br />

provide graphical, video, or cryptographic acceleration.<br />

III. REAL TIME OR GENERAL PURPOSE OPERATING SYSTEM<br />

The most important consideration when choosing an embedded<br />

OS is determining the optimal amount of real-time capabilities.<br />

Determinism and low latency requirements will prescribe either<br />

a real time OS (RTOS) or a general purpose OS (GPOS). An<br />

RTOS is the most complex; hence, cost and time to market are<br />

also the highest. Software development for RTOS and GPOS<br />

also requires substantially different skill sets.<br />

Hybrid approaches that use the real-time responsiveness of<br />

MCUs with the high processing and graphical capabilities of a<br />

GPOS on a CPU are also possible.<br />

On the GPOS side, embedded Linux, in its multiple facets, has<br />

become the standard choice. This is also true for soft RT<br />

³ https://www.zephyrproject.org/<br />



capabilities (hard RT Linux is possible but significantly<br />

increases the complexity). In a 2017 embedded market survey<br />

[1], 62% of projects were using embedded Linux in one form<br />

or another, and 82% were thinking of using embedded Linux in<br />

2018.<br />

IV. EMBEDDED LINUX<br />

The choice of embedded Linux already implies a jump in<br />

software complexity, and the software skills needed on<br />

embedded Linux teams are broader than those of traditional<br />

embedded developers. Disregarding the steep learning curve of<br />

embedded Linux for an RTOS embedded developer is a common<br />

mistake that can significantly increase the time to market of<br />

embedded Linux projects.<br />

As system complexity increases, the embedded developer role<br />

changes. Embedded Linux teams are bigger and more<br />

heterogeneous than traditional embedded teams and typically<br />

consist of three distinct roles:<br />

• BSP developer<br />

• Application developer<br />

• System developer<br />

BSP developers work on bootloaders and the Linux kernel. The<br />

bootloader developer is very close to the traditional embedded<br />

developer—intimate with the hardware and running on a flat<br />

memory system. However, the development work to be done in<br />

a bootloader is limited to hardware bring-up and the execution<br />

of the Linux kernel. Embedded Linux kernel developers, on the<br />

other hand, have more in common with desktop PC kernel<br />

developers than with traditional embedded developers. JTAG<br />

devices are of little use beyond bring-up, and the CPU board<br />

support package (BSP) is usually provided by the manufacturer.<br />

Even drivers are usually provided either by the community or<br />

the device manufacturer, so an embedded Linux kernel<br />

developer does device tree customization and maybe some<br />

driver debugging and development.<br />

Application developers work at a high level, abstracted by the<br />

Linux kernel. Application development work is very similar to<br />

desktop application development; in many cases, applications<br />

can initially be developed on a PC before cross-compiling them<br />

for the embedded hardware. User interfaces typically use the Qt<br />

framework⁴ or are actually web-based interfaces.<br />

Programming languages are no longer limited to C and C++.<br />

Applications are now developed in Python, Node.js, and even<br />

Java.<br />

An embedded Linux application is not a traditional monolithic<br />

entity. Rather, it usually comprises a collection of<br />

applications that communicate and cooperate. Typical<br />

embedded Linux application services include:<br />

• Process monitoring and watchdog services, which are in charge of monitoring the rest of the system and restarting other applications or the whole system when a failure occurs<br />

• Messaging services like D-Bus or similar<br />

• Network managers, including Wi-Fi, cellular, and other technologies<br />

• Configuration managers, which translate user interface commands into configuration changes<br />

• User interfaces, and possibly CLIs or web interfaces<br />

• Logging services<br />

• A main application<br />

All these services collaborate to provide the experience of a<br />

single embedded application.<br />
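The process-monitor role described above can be sketched in a few lines of shell; a real product would use systemd, runit, or a dedicated watchdog daemon, and the `supervise` helper and its retry limit here are purely illustrative:

```shell
# Toy process monitor: rerun a command whenever it exits, up to a retry
# limit, as an init system or watchdog service would for a failed app.
supervise() {
    cmd=$1
    max=$2
    n=0
    while [ "$n" -lt "$max" ]; do
        $cmd                          # run the supervised application
        echo "app exited, restart #$n"
        n=$((n + 1))
    done
}

supervise false 3                     # a failing app is restarted 3 times
```

In a real system the monitor would also escalate (for example, reboot the whole device) once the per-application retry budget is exhausted.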

System developers are the people in charge of system<br />

integration and the build system, including root filesystem<br />

customization and software development kit (SDK) generation.<br />

While a traditional embedded application would be bundled<br />

with the OS, Linux requires a user space or root filesystem<br />

which contains the runtime set of applications and libraries.<br />

Building this root filesystem has always been complex and is<br />

one of the reasons why embedded Linux adoption has been<br />

slow.<br />

The SDK is used as an interface between the development roles,<br />

allowing teams to scale in size and specialize to a certain<br />

degree. For example, application and BSP developers use and<br />

update an SDK but do not usually need to be concerned with<br />

root filesystem customization.<br />

A. Choosing an embedded Linux distribution<br />

A Linux distribution is an operating system based on the Linux<br />

kernel and GNU⁵ software, most importantly the GNU<br />

toolchains, libraries, and development tools.<br />

A distribution sets the policies for the system and includes<br />

components such as:<br />

• The selection of supported packages<br />

• The initialization system to use<br />

• The graphical backend<br />

• System-wide choices, like the Bluetooth stack<br />

• Graphical environments<br />

An embedded Linux distribution provides:<br />

• The bootloader<br />

• The Linux kernel<br />

• The user space or root filesystem<br />

• A software development kit<br />

⁴ https://www.qt.io/<br />

⁵ https://www.gnu.org/<br />



A distribution is generated in one of two ways:<br />

1. Customize an existing binary Linux distribution such<br />

as Debian⁶<br />

2. Build a Linux distribution from source<br />

1) Binary Linux distributions<br />

Binary distributions contain pre-built binary packages that are<br />

added (downloaded from the cloud) or removed from a system<br />

using a package manager. Systems are usually bootstrapped on<br />

target, and on-target compilation is easy and common.<br />

Binary-based distributions have the lowest complexity and<br />

quickest time to market if the hardware is already supported by<br />

the distribution, so they initially appear to be an easy solution.<br />

However, they have several drawbacks that make them<br />

inadequate for embedded products.<br />

Because package maintenance is taken care of by the<br />

distribution provider, binary distributions offer very limited<br />

package configuration. Packages are also generic, so they are<br />

usually heavily patched to cover a wide range of use cases<br />

instead of focusing on embedded application needs.<br />

The policies and architectural choices are pre-defined and offer<br />

few customization options. Embedded products are unique, so<br />

customization of the binary distribution is often necessary. This<br />

leads to manual non-standard builds that are difficult to<br />

reproduce and trace. Even if very little customization is needed,<br />

package maintenance becomes a problem once the distribution<br />

maintenance period ends and manual non-standard builds are<br />

required.<br />

Performing package updates via package managers is also<br />

unsuited for embedded devices. After several updates, there is<br />

no way to guarantee that the deployed system is the same as the<br />

tested system. Also, losing power while updating could leave<br />

the system in an inconsistent state.<br />

Binary distributions produce bigger systems, so images are<br />

larger and slower to boot. They are also more complex systems<br />

that require more resources to run and are more difficult to<br />

secure.<br />

Finally, binary distributions are not easily portable, so they do<br />

not scale well to run on multiple platforms.<br />

In summary, binary distributions have a high maintenance burden<br />

and low reproducibility, which are disadvantages for embedded<br />

systems.<br />

2) Build from source<br />

Building from source is more complex and has traditionally<br />

meant longer development cycles. However, it allows<br />

maximum flexibility to architect the system with full package<br />

configuration and no pre-determined choices. It also allows for<br />

package maintenance as long as necessary.<br />

A system built specifically for an embedded system provides a<br />

more compact system with smaller images that are faster to<br />

boot. Also, reduced system complexity allows it to run on fewer<br />

resources and makes it easier to secure.<br />

It is also highly portable with good scalability to multiple<br />

hardware platforms.<br />

In summary, a system built from source offers good<br />

maintainability and reproducibility.<br />

The longer development cycle of the do-it-yourself approach,<br />

even with the help of projects like Cross Linux From Scratch⁷<br />

and crosstool-NG⁸, placed embedded Linux projects at a<br />

disadvantage. However, standard tools have emerged that<br />

greatly simplify the building of custom Linux systems. The two<br />

most prominent tools are Buildroot⁹ and the Yocto Project¹⁰.<br />

a) Buildroot<br />

Buildroot is an easy to learn tool for small projects and small<br />

teams. It uses the kbuild system (like the Linux kernel) as its<br />

configuration tool, which means the configuration is only kept<br />

in one place. It can be thought of as an image generator more<br />

than a distribution builder. It does not generate binary packages,<br />

and does not support package managers or native on-target<br />

compilation.<br />
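Because the configuration lives in a single kbuild-style file, a Buildroot system can be captured in a defconfig. The fragment below is a hypothetical sketch: the symbol names are Buildroot kconfig options, but the selection and hostname are illustrative:

```shell
# Sketch of a Buildroot configuration fragment (a defconfig).
# Typical flow (not run here):
#   make BR2_DEFCONFIG=./demo_defconfig defconfig
#   make          # builds toolchain, kernel, rootfs into output/images/
BR2_aarch64=y                       # target architecture
BR2_TOOLCHAIN_EXTERNAL=y            # use a prebuilt cross-toolchain
BR2_TARGET_GENERIC_HOSTNAME="demo"  # system-wide policy lives here too
BR2_PACKAGE_BUSYBOX=y               # package selection
BR2_TARGET_ROOTFS_EXT2=y            # image format to generate
```

Keeping the whole system in one such file is what makes Buildroot configurations easy to review and reproduce.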

It also has a good selection of well-maintained packages, and<br />

custom external packages can be added. Buildroot only<br />

performs full system updates; this is good for production<br />

systems but not for development, as the images must always be<br />

updated as a whole.<br />

It also has no concept of a build cache and often needs to<br />

perform full system rebuilds instead of incremental builds.<br />

Buildroot has a three-month release cadence and a long term<br />

support (LTS) release every year.<br />

Although at this point it has diverged considerably, OpenWrt¹¹<br />

is an example of a distribution that originated in Buildroot. It is<br />

focused on networking devices, particularly routers.<br />

Buildroot is a good choice for small projects and teams, as it<br />

keeps the complexity low while significantly reducing time to<br />

⁶ https://www.debian.org/ports/<br />

⁷ http://trac.clfs.org/<br />

⁸ http://crosstool-ng.github.io/<br />

⁹ https://buildroot.org/<br />

¹⁰ https://www.yoctoproject.org/<br />

¹¹ https://openwrt.org/<br />



market. However, it does not scale as well as Yocto for multiple<br />

platforms or bigger teams.<br />

b) The Yocto Project<br />

The Yocto Project is a distribution builder that provides a<br />

reference distribution called Poky. Its OpenEmbedded¹² build<br />

system is based on BitBake, a task scheduler written in Python,<br />

with package recipes that are structured in layers. It supports a<br />

large number of packages, and the layers facilitate software<br />

reuse. But since the layer ownership is distributed, maintenance<br />

can be problematic.<br />

Configuration is scattered in distro, machine, image, and local<br />

configuration files. Bitbake parses all configuration and recipes,<br />

resolves dependencies, and prepares and executes a list of tasks.<br />

The build output is binary packages, which are then installed<br />

into a root filesystem image. Package managers running on the<br />

target are supported and especially useful for development. The<br />

Yocto Project can be used to create binary-based distributions,<br />

but it is mostly used as an image generator.<br />
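The recipe format can be illustrated with a minimal hypothetical recipe, for example `hello_1.0.bb` in a custom layer. All names are placeholders and `<checksum>` stands in for a real license-file checksum; this is a sketch of the format, not a recipe from an actual project:

```bitbake
# Minimal illustrative BitBake recipe: builds one C file and installs it.
SUMMARY = "Example hello application"
LICENSE = "MIT"
LIC_FILES_CHKSUM = "file://${COMMON_LICENSE_DIR}/MIT;md5=<checksum>"

SRC_URI = "file://hello.c"
S = "${WORKDIR}"

do_compile() {
    ${CC} ${CFLAGS} ${LDFLAGS} hello.c -o hello
}

do_install() {
    install -d ${D}${bindir}
    install -m 0755 hello ${D}${bindir}
}
```

BitBake schedules the recipe's fetch, compile, install, and package tasks, then installs the resulting binary package into the root filesystem image.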

The Yocto Project has a six-month release cadence, and<br />

maintains both the current and previous software releases.<br />

It has a steeper learning curve than Buildroot. However, once<br />

proficiency is achieved it also reduces the system complexity<br />

and the time to market while scaling well to multiple platforms<br />

and bigger teams.<br />

An example of a Yocto Project-based distribution is<br />

Ångström¹³.<br />

V. CONCLUSION<br />

Although there is no one-size-fits-all embedded operating<br />

system, embedded Linux covers the majority of use cases for<br />

microprocessor-based solutions. Even so, real-time<br />

considerations must be taken into account.<br />

However, the increased software complexity and different<br />

software skillset required for embedded Linux development are<br />

also important to consider.<br />

The choice between using a binary distribution and taking on<br />

the complexity of building a Linux distribution from source is<br />

made easier with the use of system builders like Buildroot and<br />

the Yocto Project. Even though Buildroot is a great tool for<br />

smaller projects, the scalability of platforms and development<br />

workflows with the Yocto Project makes it the de facto standard<br />

for embedded Linux systems.<br />

A more detailed discussion of the differences between the<br />

Yocto Project and Buildroot can be found at [2], while a<br />

detailed comparison with Debian can be found at [3].<br />

REFERENCES<br />

[1] EETimes/Embedded 2017 Embedded Market Study by AspenCore<br />

https://m.eet.com/media/1246048/2017-embedded-market-study.pdf<br />

[2] Alexandre Belloni and Thomas Petazzoni, "Buildroot vs.<br />

OpenEmbedded/Yocto Project: A Four Hands Discussion". Embedded<br />

Linux Conference 2016.<br />

https://elinux.org/images/7/7a/Bellonipetazzoni.pdf<br />

[3] Mads Doré Hansen, "Yocto/Debian Comparison White Paper".<br />

https://www.prevas.dk/download/18.58aaa49815ce6321a327da/1506087244328/Yocto_Debian_Whitepaper.pd<br />

¹² https://www.openembedded.org/<br />

¹³ http://www.angstrom-distribution.org/<br />



Boot Time<br />

Benefits & Drawbacks of Linux Sleep and Hibernate<br />

Thom Denholm<br />

Technical Product Manager<br />

Datalight, Inc.<br />

Bothell, WA<br />

Thom.Denholm@datalight.com<br />

Abstract— Embedded designs need to start up quickly, and<br />

those based on Linux and Android must overcome challenges of<br />

initializing many peripherals and complex applications. Many<br />

designs today rely on a sleep or hibernate solution. What are the<br />

risks of these options, and is there a better alternative?<br />

This session will examine strategies to optimize boot time<br />

including a detailed discussion of trade-offs to consider when<br />

working to perfect your users' experience. Bill of materials costs,<br />

power budgets, and required development team expertise will be<br />

examined.<br />

Keywords—Linux kernel; boot time; hibernate; sleep; Android<br />

I. INTRODUCTION<br />

Consumers desire an instant-on experience, so embedded<br />

designs need to start up quickly. Systems that once had no<br />

processor or a simple startup ROM set that standard for the<br />

consumer, who now expects similar start times from embedded<br />

devices that are considerably more complex. The Linux<br />

Kernel adds significantly to this overhead.<br />

Android, running on top of Linux, further adds to the<br />

burden. Google is aware of the problem, and even targeted<br />

faster boot speed with their latest release, Oreo. [1] Above both<br />

the Linux Kernel and Android environment is the application,<br />

which may need to initialize graphic environments and/or load<br />

databases to start up.<br />

The goal of this paper is to survey available options to<br />

improve overall device startup time, and also shed some light<br />

on the risks and benefits of the various approaches.<br />

II. DEFINING THE PROBLEM<br />

For an embedded device to be ready for input, it must start<br />

the hardware, drivers and Kernel, then any application<br />

environment and finally the application. When people focus on<br />

startup time, it is usually the hardware and Kernel that get the<br />

most attention.<br />

Boot tracer and other utilities can be used to measure the<br />

boot process. When that data is routed through the<br />

bootgraph.pl script, the result is a colored chart that breaks<br />

down those results [2]. Some steps can be removed or postponed,<br />

and others may be shortened through code changes. Another<br />

technique is to parallelize initialization, allowing two or<br />

more operations to start together.<br />
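The measurement flow above can be sketched as a pair of on-target commands; the kernel-source path is a placeholder, and the device must have booted with initcall timing enabled:

```shell
# Hedged sketch, run on the target device (not on a build host).
# Boot with "initcall_debug printk.time=1" on the kernel command line so
# each driver initcall is timestamped, then render the chart offline.
dmesg > boot.log
perl /path/to/kernel-source/scripts/bootgraph.pl < boot.log > bootgraph.svg

# On systemd-based images, a quick per-service breakdown of user space:
systemd-analyze blame
```

The resulting SVG shows which initcalls dominate the boot, pointing at the drivers worth deferring or trimming.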

NAND media startup can be particularly pernicious.<br />

Drivers must usually scan the entire media to remap the wear<br />

leveling, and that is before whatever scans and checks are<br />

required by the file system for reliability validation.<br />

Work done to reduce Linux startup time can minimize the<br />

time spent in cold boot, but it can have unexpected costs for<br />

longer term projects. Modifications will usually apply only to<br />

the current Linux kernel used on the project; any changes will<br />

require much of this work be performed again. This knowledge<br />

can also be something of a special skill, meaning staffing<br />

changes could significantly impact your project.<br />

III. APPLICATION BEYOND THE KERNEL<br />

Just starting the Kernel isn’t enough for most embedded<br />

designs; any required applications need to start as well. Any of<br />

these could involve loading data files and databases, each of<br />

which takes time to read from the media. Other major startup<br />

hurdles include graphic libraries and external communication.<br />

It is very difficult to optimize the startup time for a complex<br />

application to any great degree.<br />

The Android environment is really a specialized user<br />

interface application, and it must start up as well. Like Windows<br />

and iOS, this environment loads applications and initializes<br />

information. The system may not be fully ready for tens of<br />

seconds (or even minutes). It contains its own task manager<br />

which can be used to disable some app startup to improve that<br />

speed somewhat, but this knowledge is also subject to change<br />

with new versions.<br />



One interesting option is to store a suspended version of the<br />

application (sometimes called a snapshot). This is popular with<br />

virtual machines, which have their state suspended instead of<br />

starting from scratch each time. For the entire design, this<br />

would be known as hibernation; more on that later.<br />

IV. OVERALL DESIGN<br />

While the entire design is going through various stages of<br />

startup, it is drawing power. This is being used to read from the<br />

media and allow the processor to work, but most designs<br />

drive a display screen as well. Whether your customers see a<br />

splash screen or the status of the boot cycle, this additional<br />

display time draws current for the device. Reducing the entire<br />

system startup time has a secondary benefit of reducing power<br />

usage and improving device lifetime. This is especially useful<br />

if the device must initialize many times on one battery charge.<br />

V. SLEEP MODE<br />

One solution is to use sleep mode. Here, the processor<br />

suspends nearly all activity (exceptions include DRAM<br />

refresh). To the end user, the device appears to be off, though it<br />

is in fact consuming a small amount of power to maintain<br />

operation. This can eventually drain a battery. It is less risky<br />

when the device is plugged in, but power can (and will) be<br />

interrupted.<br />

Many devices will add a small indication that the device is<br />

not “off” but merely sleeping – a slowly blinking light, for<br />

example – to make sure the customer is aware of this risk, and<br />

will not arbitrarily unplug a device. This is often referred to as<br />

a heartbeat, which is appropriate – when the power is lost, the<br />

device is dead. When power is lost, the status of everything in<br />

memory will also be lost – loaded applications, open files and<br />

non-committed program operations.<br />

In addition to adding a visible heartbeat, there are other<br />

changes required to support sleep mode. Of the options<br />

surveyed here, this is the most complex from a hardware<br />

design standpoint.<br />

VI. HIBERNATION<br />

The alternative to sleep is hibernation. In this case, rather<br />

than suspending operations in memory, the machine state<br />

contents are committed to the media. Committing and restoring<br />

that state can take some time, and should be done in a<br />

power-fail-safe manner. If the commit is not complete, the device will<br />

have to cold boot – the same as if power was lost in sleep<br />

mode.<br />
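Both modes map onto the standard Linux power-management interface under /sys/power; which states are actually available depends on the kernel configuration and BSP support. A sketch of the on-target commands (not meant for a build host):

```shell
# On-target sketch: requires kernel power-management and BSP support.
cat /sys/power/state          # lists supported states, e.g. "freeze mem disk"

echo mem  > /sys/power/state  # suspend-to-RAM: the "sleep" mode above
echo disk > /sys/power/state  # suspend-to-disk: hibernation
```

Writing `disk` triggers the commit of the machine state to the configured swap or image partition before powering down.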

The major advantage of both sleep and hibernate is that<br />

both allow the device to avoid the time needed to reload the<br />

applications and their associated files. In contrast, optimizing<br />

system loading for a faster cold boot does very<br />

little to improve the startup time of the application. Also,<br />

neither of these methods requires significant updates when the<br />

Kernel or Android version changes, and no special Linux<br />

development expertise is required.<br />

Hibernate requires support in the BSP and drivers. One<br />

disadvantage of Hibernate is the amount of data that needs to<br />

be committed and restored, which grows as the system DRAM<br />

footprint grows. The additional time required for this I/O also<br />

draws power, and could require a status screen to reassure the<br />

user, which requires even more. There are a few techniques<br />

which could be used to improve this operation. Each will<br />

require changes to the code to perform both the hibernate and<br />

the restore.<br />

VII. SKIP UNUSED PORTIONS<br />

Large portions of memory in a system are blank – either<br />

unused or allocated as part of a program’s stack or heap. Not<br />

only is there no point in committing those unused portions to<br />

the media, but those writes (and subsequent reads) are costly in<br />

terms of time and flash media life. The solution is to shrink<br />

the committed image by skipping those unused portions.<br />

Performing this operation requires knowledge of the<br />

individual drivers and the application, but in this case a little<br />

knowledge can yield significant savings. One source for this<br />

knowledge is the Linux kernel source code, which is freely<br />

available. Here the hibernate code can simply skip the unused<br />

portions; the restore code should initialize that memory to the<br />

values expected by the various drivers and applications.<br />

VIII. FURTHER COMPRESSION<br />

While storing blocks to the media, why not take advantage<br />

of compression algorithms to further reduce the footprint? This<br />

operation would require changes to both the hibernation and<br />

restoration code, but would be well worth it. Allocated<br />

program space in memory is far more compressible than media<br />

and images, and the time required to read back the data from<br />

the disk is even further reduced.<br />
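The effect is easy to demonstrate: data dominated by zero-filled (unused) pages compresses to a tiny fraction of its size. The file below is synthetic, standing in for a hibernation image, and the sizes are illustrative:

```shell
# Toy demonstration: 1 MiB of zeros (a stand-in for unused RAM pages)
# compresses to roughly a kilobyte, so far less data is written to and
# later read back from the flash media.
dd if=/dev/zero of=ram.img bs=1024 count=1024 2>/dev/null
gzip -c ram.img > ram.img.gz   # compress, keeping the original
wc -c ram.img ram.img.gz       # compare raw vs. compressed sizes
```

Real memory images contain non-zero program data as well, so the ratio is smaller in practice, but allocated program space still compresses far better than already-compressed media files.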

IX. ANOTHER ALTERNATIVE: REMOVE THE TIME REQUIRED TO CREATE THE HIBERNATION IMAGE<br />

For devices which present the same interface to the<br />

customer each time they start up, another alternative would be<br />

to create the hibernation image just one time, then restore that<br />

image each time. This allows for the fastest possible device<br />

ready state – Linux Kernel and application running, files preloaded.<br />

This "factory configuration" is the same one used each<br />

time, with no per-user or site customization. Going further, an<br />

application could be modified to use a small configuration file<br />

for this data, reading it again when the system detects it has<br />

been restored.<br />



X. BILL OF MATERIALS<br />

None of the hibernation techniques discussed comes without a<br />

cost. Some additional storage media is required to save these<br />

images, with the amount required dependent on how much<br />

overall DRAM is in the system, and whether compressed<br />

images are an option. Fortunately, media storage has come<br />

down in price, and media vendors are pushing larger size parts<br />

while using the same board footprint.<br />

XI. CONCLUSION<br />

Devices with Linux are often far more complex (and<br />

slower to start up) than those designed with an RTOS. Many<br />

techniques are available to accelerate the startup on Linux.<br />

Only hibernation keeps the application loaded AND protects<br />

against power failure. Some clever modifications to hibernate<br />

are available to speed the restore time and reduce the storage<br />

time, even removing it completely. This solution delivers<br />

improved overall device start time with the least additional<br />

Linux kernel modifications.<br />

REFERENCES<br />

[1] Sameer Samat, Aug 21 2017,<br />

https://blog.google/products/android/android-oreo-superpowers-comingdevice-near-you/<br />

[2] Chung-Yeh Wang, Dec 31 2012,<br />

http://linuxonarm.blogspot.com/2012/12/boottime-patchubuntunexus7.html<br />



Connecting Sub-1 GHz Low-Power IoT Nodes to the<br />

Internet Using 802.15.4<br />

Nick Lethaby<br />

Connected Microcontrollers<br />

Texas Instruments<br />

Goleta, USA<br />

nlethaby@ti.com<br />

Abstract—The Sub-1 GHz spectrum provides publicly<br />

available bands that allow long-range low-power communication,<br />

rendering it ideal for many IoT applications. Unfortunately, the<br />

Sub-1 GHz band supports many different PHYs and lacks a<br />

standard networking solution. This has limited its use in the IoT<br />

market because developers must have implementation expertise<br />

in low-level RF communications. IEEE extended support to the<br />

Sub-1 GHz band in the ‘g’ amendment of the IEEE 802.15.4<br />

specification, creating an opportunity for standards-based<br />

protocol implementations. We will overview an 802.15.4g-based<br />

protocol stack implementation that enables sensors and actuators<br />

to connect to the cloud using Sub-1 GHz radios, including which<br />

802.15.4 standards are useful and which additional proprietary<br />

implementation is needed for a full stack. This includes a Linuxbased<br />

stack and gateway implementation to bridge the wireless<br />

network to the internet using a serial port abstraction for the<br />

Sub-1 GHz radio device. We will conclude with benchmark data<br />

demonstrating network reliability and potential battery life for<br />

application usage scenarios with an ARM Cortex M-based Sub-1<br />

GHz wireless microcontroller.<br />

Keywords—IoT; Sub-1 GHz; 802.15.4;<br />

I. INTRODUCTION<br />

With Internet of Things (IoT) applications triggering large<br />

scale deployment of wirelessly connected sensor and actuator<br />

nodes, cost-effective implementation will depend on keeping<br />

deployment costs low. The choice of wireless technology will<br />

significantly affect these costs as it will determine how many<br />

gateways or intermediate routers are required, whether mains<br />

power is required, and the hardware costs, such as processing<br />

power and memory, of the end node.<br />

In many IoT applications, mains power may not be<br />

conveniently available, adding to the deployment costs unless a<br />

node can function for long periods on a battery or similar<br />

power source. In addition, in applications such as agriculture, warehouse asset tracking, or industrial plants, nodes must be placed over a wide area and in locations where metal or concrete may attenuate or obstruct the signal. This creates challenges for wireless technologies such as Wi-Fi or BLE, which have somewhat limited range and do not easily adapt to radio environments where the signal must turn corners.<br />

The Sub-1 GHz spectrum provides publicly available<br />

bands that allow long-range communication and good<br />

penetration. This band has been extensively proven in IoT<br />

application segments such as smart metering and home alarms,<br />

which are based on proprietary network implementations. It is<br />

also the band used by emerging Low Power Wide Area<br />

Network technologies like Sigfox and LoRa. Unlike<br />

connectivity technologies such as Ethernet, BLE, or Wi-Fi,<br />

which have a very limited number of PHYs, Sub-1 GHz allows<br />

a very wide range of different PHYs. As a result, there is no standard networking solution. This has limited its use in the broader IoT market because developers have needed implementation expertise in low-level RF communications and the associated protocol stacks.<br />

In 2011, IEEE extended support to the Sub-1 GHz band<br />

with the ‘g’ amendment of the IEEE 802.15.4 specification,<br />

creating an opportunity for more standards-based protocol<br />

implementations. Since 802.15.4 is purely a MAC-layer<br />

standard, it will always require additional custom stack layers<br />

to be developed for a fully functional implementation.<br />

However, using 802.15.4 as the starting point gives the many engineers who know this standard a far more familiar stack for Sub-1 GHz applications than a fully custom implementation would.<br />

Since connecting low power wireless networks to the<br />

internet requires a gateway, the 802.15.4 stack implementation<br />

for Sub-1 GHz will be discussed for both low power sensor<br />

nodes and a Linux-based gateway that provides internet<br />

connectivity, including which components of 802.15.4 were<br />

used and which custom stack elements needed to be<br />

implemented.<br />

However, we will begin with a summary of the network<br />

requirements since these strongly influenced many of the<br />

implementation choices.<br />

II. NETWORK REQUIREMENTS SPECIFICATION<br />

Any wireless network design must first determine the<br />

optimal blend of cost, range, data rate, and power consumption<br />

for the targeted applications as these factors heavily influence<br />

the implementation path. These requirements are summarized below. In the specific context of Sub-1 GHz, a key requirement was very low power operation, as this involves a trade-off with transmission range:<br />

- Robustness: For IoT applications such as smoke alarms or predictive maintenance, sensor data must be delivered reliably to enable a response.<br />
- Scalability: IoT applications will often have dozens to hundreds of sensors on an individual wireless network.<br />
- Latency: An IoT application like a home alarm must deliver data in a timely manner.<br />
- Security: It is very important to minimize the opportunity for unauthorized access or eavesdropping.<br />
- Regulatory compliance: Since the Sub-1 GHz band is subject to national or regional telecommunication rules, the network implementation must conform to these in areas such as FCC channel occupancy requirements.<br />
- 1 km range: The ability to deploy IoT sensors over a whole building or factory in a simple star network topology avoids the cost and complexity of mesh networking with intermediate routers or range extenders. It is important to understand the relationship between range and data rate in wireless transmission: as the range increases, the possible data rates decrease. As a result, for an equivalent amount of data, devices must stay active longer (and thereby consume more power) when transmitting at longer range. To achieve very long battery life, the range requirement was therefore limited to enable a data rate of 50 kbps.<br />
- Low power nodes: Since many IoT sensors and actuators lack access to mains or solar power and it is not cost-effective to frequently replace batteries, multiple years of operation on a coin cell battery was required. To achieve this, the network must allow devices to sleep for long periods without being woken purely for network synchronization purposes.<br />
- Two-way communication: Some nodes will be actuators that wait for commands or related input from the network. The network must be able to support sending out commands as well as simply receiving sensor data.<br />
- Low cost nodes: To achieve low cost, it was required that the network implementation work on an embedded MCU with<br />


enables the receiver to verify that the packet contents were not altered (countering man-in-the-middle attacks).<br />

Asynchronous mode: 802.15.4 is designed for<br />

low power operation and, unlike many network<br />

implementations, avoids the need for a device to<br />

regularly synchronize with the network. This<br />

allows a sensor node to sleep until it has a reason<br />

to connect to the network, such as for transmitting<br />

data.<br />

Broadcast mode: While sensors may be able to<br />

sleep until they must transfer data, an actuator,<br />

such as an LED lighting controller, needs to be able<br />

to respond quickly to a user command. The<br />

802.15.4 broadcast mode allows actuators to act as<br />

beacons waiting for input from the network.<br />

While 802.15.4 provided standards-based solutions that<br />

addressed many of the network implementation requirements,<br />

there were several areas where additional standards or<br />

proprietary techniques were utilized in the implementation:<br />

- Frequency hopping: Although 802.15.4 includes frequency hopping standards, we chose the Wi-SUN (Wireless Smart Ubiquitous Networks) frequency hopping scheme as the basis for our implementation, as it is simpler and designed directly for 802.15.4g. We enhanced the Wi-SUN implementation to add support for sleepy devices and broadcast (beacon) mode.<br />
- Logical Link Controller: 802.15.4g is purely a MAC layer standard and does not address the functions of the Logical Link Controller or of higher layers in the network stack, such as network formation and management. It was therefore necessary to implement a proprietary application to provide the additional networking functions needed to bridge sensor data from 802.15.4g to the internet.<br />
- Security: 802.15.4 does not define standards for device authentication or secure key exchange. These are generally regarded as essential in modern IoT network implementations.<br />

We will discuss the Logical Link Controller and the security enhancements to 802.15.4 in greater detail in subsequent sections.<br />

V. LOGICAL LINK LAYER<br />

As discussed earlier, 802.15.4g is purely a MAC layer<br />

standard. A significant amount of additional functionality must<br />

be added to have a viable wireless network. Although we titled this software module the Logical Link Controller in our implementation, it is important to understand that it provides much more than what is typically thought of as logical link layer functionality. The key functions implemented were:<br />

- Network Management: This starts and closes the network and ensures that the network connections are functional. This functionality is entirely implemented on the gateway.<br />
- Device Management: The gateway adds devices that wish to join, removes devices that wish to leave or are not responding, and maintains device connections. On the device side, a device must verify that its connection is still open. If not, it should look for a new network.<br />
- Service Discovery: This identifies which types of devices are connected to the network, such as temperature or soil moisture sensors, so that these can be configured by applications for appropriate reporting.<br />

VI. INTERNET CONNECTIVITY<br />

Connecting the 802.15.4 wireless network to the internet<br />

requires a gateway. A software stack that performed the<br />

gateway function was implemented on an embedded Linux<br />

board, which connected to the Sub-1 GHz radio using a USB<br />

cable (see Figure 1).<br />

The Sub-1 GHz radio was operated as a network processor and treated as a Linux character device. The 802.15.4g MAC-layer functions continue to run on the MCU in the network processor, just as on a node device. In addition, the network processor required additional software to enable it to communicate with the Linux gateway stack through a serial device driver. On the Linux side, the gateway stack contains a network processor interface (NPI) layer that serializes data and commands between the network processor and the gateway application.<br />

The Linux gateway application implements the Logical<br />

Link Controller functions described earlier such as network<br />

formation and management and service discovery. It is also<br />

responsible for data transmission from the higher level<br />

applications to the radio. It provides a simple API layer that offers calls such as transmit, receive, network open and close, and device join and remove. The gateway also encapsulates data received from the NPI into JSON objects, which can easily be consumed by the cloud. For sensor objects, we chose the Internet Protocol for Smart Objects (IPSO) definitions, as these<br />

provide a standard for formatting sensor data. The end user can<br />

create an application that posts the JSON data into an MQTT<br />

queue or whatever other mechanism is required to pass the data<br />

into the cloud for storage and processing.<br />

VII. NETWORK SECURITY<br />

As described earlier, AES-CCM addresses several aspects<br />

of network security. However, to offer the security expected in<br />

a modern IoT network, these capabilities must be enhanced<br />

further. Since AES is a symmetric cipher, it suffers from the<br />

drawback of all symmetric encryption: namely how to securely<br />

exchange the key required for subsequent secure transmission.<br />

A second weakness of AES-CCM concerns device authentication: while AES-CCM can authenticate the message contents, it has no mechanism for verifying that a device joining the network is legitimate rather than a potential bad actor.<br />

To authenticate devices, a password mechanism was<br />

chosen. The device manufacturer must embed a unique IEEE<br />

address and an 8- or 16-character alphanumeric passcode into<br />

each device. When a user wishes to commission the node into<br />

the network they take that information, which will typically be<br />

encapsulated in a Quick Response (QR) code, and input them<br />

into the gateway application. The gateway application<br />

computes the unique token ID by performing a hash operation<br />

of the concatenated string of the passcode and IEEE address<br />

and then opens the network for joining. The node computes the<br />

unique token ID from its embedded IEEE address and passcode<br />

information, and then attempts to join using a TLS-like<br />

exchange over 802.15.4 that is secured over the token-derived<br />

link. If the device-generated token matches the token generated<br />

on the gateway, then the connection is allowed. An alternative<br />

to a password-based approach would be to use certificates.<br />

These would offer even stronger device authentication, as certificates contain additional information such as the manufacturer's identity or the manufacturing location. A certificate-based approach is also more efficient when a large number of devices must join at the same time, as it eliminates the need to manually enter the device credentials.<br />

As intimated above, a TLS-like sequence is used to pass the AES key between the node and the gateway. Elliptic-curve Diffie-Hellman (ECDH) was the chosen method to generate and securely exchange the AES key. The same AES key is used to encrypt all future communication between the device and the gateway, unless the node is completely power cycled, in which case it will generate and exchange a new key.<br />

VIII. POWER CONSUMPTION RESULTS<br />

Table 1 shows the current consumption and projected battery life for a node, based on testing the stack in specific application scenarios. The results indicate that a node can potentially operate for multiple years on a coin cell battery. It should be stressed that battery life is completely dependent on the application profile: how frequently, and for how long, a node is awake. However, the profiles illustrated are quite reasonable for a sensor.<br />

Node Application Profile | Without Frequency Hopping (Average Current / Predicted Battery Life) | With Frequency Hopping (Average Current / Predicted Battery Life)<br />
Sends data every 3 mins | 1.9 µA / 13.9 years | 2.2 µA / 11.9 years<br />
Sends data every 3 mins and polls every minute | 4 µA / 6.6 years | 6.7 µA / 4 years<br />
Table 1: Predicted coin cell battery life for two different application profiles using the Sub-1 GHz 802.15.4g stack<br />

IX. SUMMARY<br />

The Sub-1 GHz band offers many attributes that are highly suitable for IoT sensor network applications, including long range, robustness, and very low power operation. The long range and low power can significantly reduce deployment and operating costs by allowing simple star-network configurations and eliminating the need to make mains power available or frequently install new batteries.<br />

The Sub-1 GHz band has been used for some years in select<br />

IoT segments based on proprietary networks, such as smart<br />

meters and home alarms. A major barrier to wider adoption has<br />

been the need for developers to have the requisite experience to<br />

implement low-level RF protocols. 802.15.4g provides a basis<br />

for a robust, reliable standards-based wireless networking stack<br />

for Sub-1 GHz with low-power operation. Although 802.15.4g<br />

usage limits Sub-1 GHz range to around 1 km, this is still<br />

significantly more than competing low-power wireless<br />

technologies.<br />

Since 802.15.4 focuses on the MAC layer only, additional<br />

custom implementation effort is required to produce a stack<br />

that can make IoT data easily available for transmission to the<br />

cloud. We described the encapsulation of the radio into a Linux<br />

device driver, a network formation and management<br />

application, and the use of the IPSO data formats to provide<br />

sensor data in an easily consumable format for IoT platform<br />

agents.<br />

ACKNOWLEDGMENT<br />

I would like to thank Roberto Sandre, of the Connected MCU<br />

organization at Texas Instruments, for providing technical<br />

insight into the 802.15.4-based stack for Sub-1 GHz.<br />



emb::6: An Open-Source IoT stack for Multiple<br />

IPv6 Communication Protocols<br />

Nidhal Mars, Lukas Zimmermann, Manuel Schappacher and Axel Sikora<br />

Institute of Reliable Embedded Systems and Communication Electronics,<br />

University of Applied Sciences Offenburg, D77652 Offenburg, Germany<br />

{nidhal.mars, lukas.zimmermann, manuel.schappacher, axel.sikora}@hs-offenburg.de<br />

Abstract—6LoWPAN (IPv6 over Low-Power Wireless Personal Area Networks) is attracting more and more attention for the seamless connectivity of embedded devices in the Internet of Things. It can be observed that most of the available solutions follow an open-source approach, which significantly accelerates the development of technologies and markets. Although the currently available implementations are in pretty good shape, all of them come with some significant drawbacks. It was therefore decided to start the development of our own implementation, which takes the advantages of the existing solutions but tries to avoid their drawbacks and supports multiple communication protocols. This paper describes the emb::6 implementation and its characteristics. It also covers the extension to support the Thread protocol and 6TiSCH (IPv6 over the Time-Slotted Channel Hopping mode of IEEE 802.15.4e) networks. The presented implementation is available as an open-source project under [1].<br />

Keywords—6LoWPAN; ContikiOS; IEEE802.15.4; Thread<br />

Network; 6Tisch<br />

I. INTRODUCTION<br />

The Internet Protocol is the building block not only of the legacy Internet but also of the upcoming Internet of Things (IoT), which enables small, reasonably powerful, energy- and cost-efficient embedded devices to communicate not only with each other but also seamlessly with the existing Internet. On one side, these devices have interfaces to the physical world; on the other, they connect to the virtual world of databases and servers in the Internet and are thus a cornerstone of a Cyber-Physical System (CPS).<br />

The lower three layers of the protocol stack are already well defined with IEEE 802.15.4 [2] and 6LoWPAN [3][4][5]. 6LoWPAN was developed to enable IPv6 connectivity for constrained embedded devices that use 802.15.4 low-power wireless communication. Although open issues remain with regard to the selection of the physical layer, unified commissioning procedures, routing functions and parameters, and security, a reasonable level of interoperability has already been achieved that is comparable with other, more homogeneous protocol stacks.<br />

It can be observed that most of the available solutions follow an open-source approach, which significantly accelerates the development of technologies and markets. Examples include BLIP from TinyOS, RIOT OS [6], OpenWSN [7], and µIPv6 from Contiki. Although the currently available implementations are in pretty good shape, all of them come with some significant drawbacks.<br />

After a thorough analysis of existing 6LoWPAN solutions with regard to their maturity and maintenance, the authors of this article decided to develop a new network stack that fulfills industry-grade application requirements and provides comprehensive parametrization and commissioning capabilities.<br />

II. PROPOSED SOLUTION<br />

A. Design Principles<br />

The initial development of the emb::6 network stack started as a fork of Contiki OS including µIPv6; however, to meet the requirements of an industry-grade network stack, several Contiki-related core parts were removed or reworked. The most important aspects are:<br />
Architecture part:<br />
- event-driven paradigm for network stack management<br />
- improved modularity for the use of different Data Link Layer implementations, e.g. to also support a separately developed Wake-On-Radio-enabled IEEE 802.15.4 stack [10] or optional security enhancements such as (D)TLS [11]<br />
- clear separation between functional parts<br />
- seamless integration into other software environments<br />
Implementation part:<br />
- reduced usage of macros<br />
- a flexible, modular, and clear build system (SCons)<br />
- improved possibilities for parameterization thanks to extended APIs<br />
- improved portability due to extended abstraction at the HAL; any combination of transceivers, MCU, sensors, and periphery is possible, even simulation on a PC<br />
- conservation of the µIPv6 core in a manner suitable for regular bug fixing<br />



B. Architecture<br />

Figure 1 shows the basic architecture of the emb::6 network stack, with its networking core in the middle of the block diagram. The networking core handles the network-related tasks, mainly the communication part, and these tasks have been split up into several layers. Beginning at the top with the Application Layer (APL), which usually serves as the interface to the device application, requests are forwarded layer by layer down to the physical layer (PHY), which is responsible for the implementation of the RF-module drivers.<br />

Figure 1: Protocol Stack of emb::6<br />

A brief description of the layers follows:<br />

- Application layer. The application layers (APLs) are the highest layers of the emb::6 networking stack and are located above the transport layer (TPL). The APL is an optional part of the stack. Depending on the application, different APLs may be used; the following are currently included:<br />
o CoAP. CoAP is an HTTP-like protocol adapted and optimized for the Internet of Things (IoT). It is based on RESTful services [12].<br />
o ETSI M2M. According to the Global Standards Collaboration Machine-to-Machine Task Force, more than 140 organizations around the world are involved in M2M standardization. A considerable effort is made by ETSI to decrease M2M market fragmentation by defining a horizontal service platform for M2M interoperability. The proposed solution provides a RESTful Service Capability Layer (SCL) [13] accessible via open interfaces to enable developing services and applications independently of the underlying network.<br />
o LWM2M. LightweightM2M is a device management protocol standardized by the Open Mobile Alliance, designed to meet the requirements of applications. LightweightM2M is not restricted to device management; it is also able to transfer service and application data.<br />
- Transport layer. The transport layer is based on the µIPv6 embedded TCP/IP stack. By default only UDP is supported, but TCP can be enabled on request.<br />
- Network layer. The network layer contains two sublayers, the upper IPv6 layer and the lower 6LoWPAN adaptation layer. The IPv6 layer includes the routing protocol (RPL), ICMPv6, and the neighbor discovery protocol (NDP). The 6LoWPAN adaptation layer provides IPv6 and UDP header compression and fragmentation to transport IPv6 packets with a maximum transmission unit (MTU) of 1280 bytes over IEEE 802.15.4 with an MTU of 127 bytes.<br />
- MAC layer. A reduced-functionality implementation of IEEE 802.15.4.<br />
- PHY layer. The physical layer is represented by the radio-interface driver and supports hardware-dependent functionality of the transceiver, e.g. CSMA and auto-retransmission.<br />

Besides the networking core, a separate so-called Utility Module implements common functionality such as timer and event handling, which is used by all other layers and modules.<br />

C. Implementation and Parametrization<br />

To support different hardware platforms, including different microcontrollers, RF modules, and target boards, all hardware-dependent parts of the emb::6 networking stack are encapsulated in a separate so-called Board Support Package (BSP), which accesses a hardware-dependent hardware abstraction layer (HAL). This allows the emb::6 networking stack to be ported easily across different hardware platforms.<br />

The implementation of the emb::6 networking stack makes use of so-called structure-based interfaces to build up the complete stack. Each of the stack, HAL, and utility parts therefore has a well-defined interface description; the parts can be connected during initialization (cf. Figure 3), depending on the required configuration, using e.g. C structures as shown in Figure 2 for the network stack. This makes it possible to dynamically change, add, or remove functionality (e.g. change the compression algorithm) by providing different modules conforming to the given interface.<br />

The emb::6 networking stack was designed to be highly scalable and configurable. Configurations can therefore be made at compile time as well as at runtime, whereby compile-time parameters mainly reduce the functionality of the stack in exchange for lower memory and performance requirements. This makes it possible to use the stack on more constrained devices as well.<br />



typedef struct netstack {<br />
    const struct netstack_headerCompression* hc;<br />
    const struct netstack_highMac* hmac;<br />
    const struct netstack_lowMac* lmac;<br />
    const struct netstack_framer* frame;<br />
    const struct netstack_interface* inif;<br />
} s_ns_t;<br />
Figure 2: emb::6 network stack interface structure<br />

Figure 3: Example emb::6 initialization sequence.<br />

Runtime parameters have been implemented in layer- and utility-based configuration structures. Parameters are set during stack initialization and can be changed at runtime. Common use cases for such runtime parameters are e.g. transceiver output power or device addresses.<br />

In order to configure the emb::6 stack at compile time and to manage the overall complexity of different software and hardware configurations, a module-based approach was designed, handled by the SCons build system [14].<br />

D. Code Size<br />

Since the emb::6 network stack was mainly developed for use with resource-constrained embedded devices, benchmarks, especially regarding memory consumption in flash and RAM, are key points of the stack implementation. As the emb::6 network stack can be configured in many ways and every change in a configuration affects the resulting memory usage, it is nearly impossible to provide a single common number here. However, Figure 4 gives a basic overview of the memory consumption of a full function device (FFD) in comparison to a reduced function device (RFD). The different configurations are based on a sample implementation for different targets, built with the GNU GCC compiler and code optimization enabled.<br />
activated.<br />

Stack<br />

Configuration<br />

Setup initial networking<br />

stack parameters<br />

loc_initialConfig()<br />

Setup application dependent<br />

layer types<br />

loc_demoAppsConf()<br />

Initialize stack layers<br />

emb6_init()<br />

Initialize application<br />

loc_demoAppsInit()<br />

stk3600<br />

Flash/RAM<br />

45.7 / 4.3kB<br />

Stack parameters (SConsTargets):<br />

- MAC-Address<br />

- TX Power<br />

- RX Sensivity<br />

- Modulation<br />

Application Configuration (SConsTargets):<br />

- COAP (client/server)<br />

- UDP-Alive<br />

Stack layers:<br />

- all emb6 layers<br />

- BSP with radio driver<br />

Application Initialization (SConsTargets):<br />

- COAP (client/server)<br />

- UDP-Alive<br />

xpro_212b<br />

Flash/RAM<br />

46.9 / 4.3kB<br />

atany900<br />

Flash/RAM<br />

46.6 / 2.8kB<br />

atany900_rfd<br />

Flash/RAM<br />

26.4 / 1.0kB<br />

COAP: 11.6 / 2.2kB 11.9 / 2.2kB 12.5 / 2kB<br />

RPL: 13.2 / 0.3kB 14.4 / 0.3kB 13.3 / 0.2kB 10.1 / 0.1kB<br />

IPV6: 16.4 / 1.5kB 15.5 / 1.5kB 15.3 / 1.3kB 10.5 / 0.7kB<br />

6LOWPAN: 4.5 / 0.3kB 5.1 / 0.3kB 5.8 / 0.3kB 5.8 / 0.2kB<br />

Figure 4: Memory overview estimation for the emb::6 networking stack<br />

E. Demo Applications<br />

To provide usability and an easy entry into the emb::6 networking stack, it comes with a number of demo applications providing basic functionality, e.g. to establish a network. Simple UDP-based socket applications are included, as well as demos using application protocols such as CoAP.<br />

III. EXTENSION TO SUPPORT THREAD PROTOCOL<br />

A. Overview<br />

To enrich our emb::6 stack, we chose to support a recent development based on 6LoWPAN named Thread. It comes with extensions toward a more media-independent approach, which additionally promises true interoperability.<br />
Our extension mainly covers the layer 2 and layer 3 requirements of the Thread specification [15]. The implementation covers Mesh Link Establishment (MLE) and network layer functionality as well as the 6LoWPAN mesh-under routing mechanism based on MAC short addresses. The development has been verified on a virtualization platform and allows dynamic establishment of network topologies based on Thread's partitioning algorithm. Note that the parts related to commissioning and security are not supported yet.<br />

B. Thread protocol<br />

The Thread protocol is an open standard for reliable, cost-effective, low-power, wireless device-to-device communication. It is designed specifically for connected home applications where IP-based networking is desired and a variety of application layers can be used on the stack. The Thread standard is based on the IEEE 802.15.4 (2006) MAC and physical layer operating at 250 kb/s in the 2.4 GHz band.<br />

Figure 5 illustrates a general overview of our Thread stack implementation architecture [16]. This work mainly concentrates on MLE and the network layer as described in chapters 4 and 5 of the Thread specification.<br />

Figure 5 Thread network stack<br />

www.embedded-world.eu<br />



C. Thread device types<br />

The Thread network uses different types of devices, as illustrated in Figure 6.<br />
- Border Router: A specific type of router that supports multiple interfaces besides IEEE 802.15.4 in order to connect with other networks, e.g. Wi-Fi, Ethernet, etc.<br />
- Router: Provides routing services to the network and handles joining and security services for devices trying to join the network. Routers are not allowed to operate as sleepy end devices, but may downgrade their functionality and become REEDs (Router-eligible End Devices).<br />
- Leader: The device that makes decisions within the Thread network and manages router ID assignments. The Leader is the first active router on the network; a new Leader can be elected if connectivity is lost.<br />
- Router-eligible End Devices: REEDs have the capability to become routers without user interaction, if necessary.<br />
- End Devices: End devices communicate only through their parent router and cannot forward messages to other devices. To save energy they can sleep for a period of time and poll their associated router for data once they are awake.<br />

MLE resolves such asymmetric links by allowing a node to send periodic link-local multicast messages containing an estimated link quality for all its links. In addition, MLE exchanges link costs between nodes by sending MLE advertisement messages.<br />

MLE advertisement messages are used to exchange bidirectional link quality between neighboring routers. All routers periodically exchange single-hop MLE advertisement packets containing link cost information. These periodic advertisements allow routers to quickly detect changes in the set of neighboring routers, for instance when a new router joins the network, an existing router has been downgraded to a REED, or a router has lost its connection to the Thread network.<br />

Regarding the architecture, MLE cannot be clearly placed in the OSI model. Instead, it operates alongside the stack, using UDP (User Datagram Protocol) as its transport protocol. The same architecture is found in other systems that make use of MLE, such as ARM mbed OS [19]. Figure 7 shows the different protocol modules used by ARM mbed OS and the interaction of the MLE protocol with the existing layers.<br />

Figure 6 Thread device types [17]<br />

D. Mesh Link Establishment<br />

The existence of many asymmetric radio links within an IEEE 802.15.4 network is one of the main issues when establishing links between nodes. Thread uses the Mesh Link Establishment (MLE) protocol [18] to resolve such problems, among other capabilities.<br />

In this section, we give an overview of the capabilities of the MLE layer and its architecture. Furthermore, we highlight the main MLE processes that have been implemented.<br />

MLE capabilities and architecture<br />

MLE is a protocol used to configure and secure radio links dynamically as the topology and physical environment change. This is done by exchanging IEEE 802.15.4 radio parameters between nodes, such as addresses, node capabilities and frame counters.<br />

MLE allows all nodes to synchronize periodically and share radio link parameters to adapt to any change in the topology, such as the joining of new devices. Furthermore, MLE can detect unreliable links before any effort is spent on configuring them. For example, a link between two devices that is strong in one direction may be unusable due to weak signal strength in the other direction; MLE detects and resolves such asymmetry.<br />

Figure 7 ARM 6LoWPAN stack alongside OSI model [19]<br />

MLE processes and test cases<br />

In Thread networks, all devices join the network either as an end device or as a REED. Joining devices always try to attach to an active Thread router, from which they are allocated a 16-bit short address. In case such a join attempt fails, a second request is sent to both routers and REEDs.<br />

Figure 8 shows such a mesh link establishment scenario. We use four nodes whose MAC addresses end in 0xAA, 0xB0, 0xC0 and 0xB1, respectively. Possible radio links are defined statically by the environment and are delineated by the gray line.<br />

Figure 8 Testing scenario for MLE joining process.<br />



Figure 9 shows the trace output of the nodes for this<br />

scenario. Window A corresponds to node 1 (0xAA) and<br />

window B to node 2 (0xB0).<br />

- Node 1 (0xAA) is the first active node in the network. The<br />

joining process fails since no other routers are available<br />

(line A.12). Consequently, the node creates a new partition<br />

and starts operating as a parent (lines A.13 and A.14).<br />

- Node 2 (0xB0) attaches to node 1 after exchanging four<br />

handshake messages. Node 2 operates as a child after<br />

receiving the CHILD ID RESPONSE (lines B.18 and<br />

B.20).<br />

- Node 3 (0xC0) sends a multicast PARENT REQUEST. Node 1 and node 2 receive the message (lines A.22 and B.13), but only node 1 replies, due to the scan mask TLV (the first request should be answered only by active routers). In case more than one parent responds, the joining device compares them and selects the best device to be its parent, using the connectivity TLV received in the parent response and the calculated two-way link quality (derived from the link margin TLV in the parent response and the RSSI of the response itself, as explained in Table 1). The handshake then continues normally between node 3 and node 1. Finally, node 3 (0xC0) starts operating as a child.<br />

- Node 4 (0xB1) has only one possible radio link, to node 2. However, node 2 (0xB0) only replied to the second PARENT REQUEST (line B.24). This is explained by the fact that only the active router should reply to the first request (node 2 is operating as a child at that moment). Once node 2 receives a CHILD ID REQUEST, it sends a request to the leader to become a router (line B.26). Finally, node 2 switches its mode to active router (line B.27) and node 4 starts operating as a child.<br />

Figure 9: Trace output of the MLE joining process (window A: node 1, 0xAA; window B: node 2, 0xB0). Only the window A trace is recoverable here; the line numbers A.x and B.x referenced in the text belong to the original figure:<br />

MLE UDP initialized :<br />
lport --> 19788<br />
rport --> 19788<br />
MLE protocol initialized.<br />
[+] JP Send mcast parent request to active router<br />
==> MLE PARENT REQUEST sent to : ff02::2<br />
[+] JP Waiting for incoming response from active router<br />
[+] JP Send mcast parent request to active Router and REED<br />
==> MLE PARENT REQUEST sent to : ff02::2<br />
[+] JP Waiting for incoming response from active Router and REED<br />
Joining process failed.<br />
Starting new partition.<br />
MLE : Node operating as Parent.<br />
[+] SNY process: Send Link Request to neighbor router.<br />
==> MLE LINK REQUEST sent to : ff02::2<br />
[+] SNY process: Synchronization process finished.<br />
MLE PARENT RESPONSE sent to : fe80::250:c2ff:fea8:b0<br />
MLE PARENT RESPONSE sent to : fe80::250:c2ff:fea8:c0<br />
MLE CHILD ID RESPONSE sent to : fe80::250:c2ff:fea8:b0<br />
MLE CHILD ID RESPONSE sent to : fe80::250:c2ff:fea8:c0<br />
Child linked with id : 1 and timeout is : 10<br />
Child linked with id : 2 and timeout is : 10<br />

Window B repeats the MLE initialization and the first parent request before it is cut off.<br />


An EID is a stable IPv6 address that uniquely identifies a Thread interface within a Thread partition. EIDs are not directly routable, because the Thread routing protocol only exchanges route information for RLOCs. To deliver an IPv6 datagram with an EID as the IPv6 destination address, a Thread device must perform an EID-to-RLOC lookup. When attaching to a partition, a node must retrieve an RLOC IPv6 address from a router. The RLOC's 16 least significant bits are called the RLOC16 and encode the router ID and child ID of the node. Routers themselves carry child ID 0. Figure 10 shows the RLOC16 structure.<br />

Figure 10: RLOC16 structure<br />
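The RLOC16 packing can be sketched in C. The 6-bit router ID / 10-bit child ID split used below is an assumption based on the addressing scheme described here (up to 32 active routers, child ID 0 for routers); the helper names are illustrative and not taken from the emb::6 sources.<br />

```c
#include <stdint.h>

/* Assumed RLOC16 layout: router ID in the upper 6 bits, child ID in the
 * lower 10 bits. Routers themselves carry child ID 0. */
#define RLOC16_CHILD_ID_BITS 10u
#define RLOC16_CHILD_ID_MASK 0x03FFu

static uint16_t rloc16_make(uint8_t router_id, uint16_t child_id)
{
    return (uint16_t)(((uint16_t)router_id << RLOC16_CHILD_ID_BITS) |
                      (child_id & RLOC16_CHILD_ID_MASK));
}

static uint8_t rloc16_router_id(uint16_t rloc16)
{
    return (uint8_t)(rloc16 >> RLOC16_CHILD_ID_BITS);
}

static uint16_t rloc16_child_id(uint16_t rloc16)
{
    return (uint16_t)(rloc16 & RLOC16_CHILD_ID_MASK);
}
```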

A router retrieves its router ID from the partition leader by sending a CoAP address query message. RLOC addresses are only used for communicating control traffic and delivering IPv6 datagrams to their destinations. Since no RLOC address is available when initially sending an address query message, the EID is used; intermediate nodes must perform an EID-to-RLOC lookup in order to forward the packet to the partition leader and vice versa. The child ID is allocated by the parent node and communicated through the MLE attachment process.<br />

Routing algorithm<br />

A Thread network has up to 32 active routers that use next-hop routing for messages based on their routing database. The path cost calculation in this database is performed by applying a distributed Bellman-Ford algorithm (cf. RIPng) [20]. The routing database is a set consisting of the neighbor router table (Link Set), the routing table (Route Set) and all valid router IDs (Router ID Set). All routers advertise their routing table periodically; the rate at which routing advertisements are sent is determined by an instance of the Trickle algorithm. In order to keep track of the validity of shared data in the network, routers attach an incrementing ID sequence number to the routing data. After looking up the shortest path for a route, a router generates the IPv6 RLOC address of the destination router using its router ID.<br />
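The distributed Bellman-Ford update reduces to a relaxation step per received advertisement: a router adopts a neighbor as next hop only if the incoming link cost plus the neighbor's advertised cost improves on the stored path cost. A minimal sketch, where the structure, names and the cost cap standing in for the "unreachable" metric are all illustrative rather than the emb::6 data layout:<br />

```c
#include <stdint.h>

#define COST_INFINITE 0xFFu  /* illustrative 'unreachable' metric */

typedef struct {
    uint8_t next_hop_router_id;
    uint8_t path_cost;
} route_t;

/* One Bellman-Ford relaxation step; returns 1 if the route improved. */
static int route_relax(route_t *r, uint8_t neighbor_id,
                       uint8_t link_cost, uint8_t advertised_cost)
{
    uint16_t cost = (uint16_t)link_cost + (uint16_t)advertised_cost;

    if (cost >= COST_INFINITE || cost >= r->path_cost)
        return 0;                      /* no improvement, keep old route */

    r->next_hop_router_id = neighbor_id;
    r->path_cost = (uint8_t)cost;
    return 1;
}
```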

All tables that are part of the routing database have been implemented as linked lists. When looking for a routing entry, linked list structures can be used as a mask when iterating through all list entries; the benefit of this approach is that predefined fields can be accessed easily. Since embedded devices are usually subject to memory constraints, we implemented a least recently used (LRU) replacement policy for the Link Set and Route Set. The most recently accessed item is moved to the head of the linked list by modifying the appropriate pointers. As a result, when transmitting fragmented packets the lookup iteration for subsequent fragments terminates after the first list element. When inserting a new element, the last element of the linked list is removed if the number of elements would otherwise exceed a defined maximum.<br />
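The move-to-front behaviour can be sketched as follows; the entry type and field names are illustrative, not the actual emb::6 structures. Touching an entry unlinks it and re-inserts it at the head, so a repeated lookup (e.g. for subsequent fragments) hits the first element:<br />

```c
#include <stddef.h>
#include <stdint.h>

typedef struct route_entry {
    uint16_t dest_rloc16;          /* illustrative lookup key */
    struct route_entry *next;
} route_entry_t;

/* Look up 'dest' and move the matching entry to the head of the list
 * (LRU move-to-front). Returns the possibly new head. */
static route_entry_t *lru_touch(route_entry_t *head, uint16_t dest)
{
    route_entry_t *prev = NULL;
    route_entry_t *cur = head;

    while (cur != NULL && cur->dest_rloc16 != dest) {
        prev = cur;
        cur = cur->next;
    }
    if (cur == NULL || prev == NULL)
        return head;               /* not found, or already at the head */

    prev->next = cur->next;        /* unlink ... */
    cur->next = head;              /* ... and re-insert at the head */
    return cur;
}
```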

Link cost determination<br />

The Link Set stores information about neighboring routers, including the measured link margin (RSSI) in dB. The link margin also plays a leading role in parent selection during the attachment process. The measured one-way link margin may change at runtime due to the noise floor or altered environmental conditions. To smooth out short-term volatility, Thread devices must apply an exponentially weighted moving average (EWMA) to the link margins of each neighbor. Equation (1) shows the EWMA calculation, where M_{t−1} is the currently stored link margin for a specific neighbor, Y_t is the most recently measured link margin and M_t is the newly calculated link margin for that neighbor.<br />

M_t = α · Y_t + (1 − α) · M_{t−1}    (1)<br />

To avoid costly floating point computations on the microcontroller, equation (1) has been rewritten as equation (2).<br />

M_t = (Y_t + (1/α − 1) · M_{t−1}) / (1/α)    (2)<br />

The exponential smoothing factor α (equation (3)) is used as the weighting and is defined as either 1/8 or 1/16 [21].<br />

α = {α ∈ R | 0 ≤ α ≤ 1}    (3)<br />

Since 1/α is a power of two, the multiplication and the division in equation (2) reduce to bit shifts. This allows exploiting the benefits of integer calculations without glaring rounding errors.<br />
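With 1/α = 2^k, equation (2) becomes a shift-and-add update. A minimal sketch, where shift = 3 corresponds to α = 1/8 and shift = 4 to α = 1/16 (function name and integer widths are illustrative):<br />

```c
#include <stdint.h>

/* Integer EWMA of the link margin, avoiding floating point:
 * M_t = (Y_t + (2^shift - 1) * M_{t-1}) >> shift, i.e. alpha = 1/2^shift.
 * Truncation in the final right shift is the only rounding that occurs. */
static uint16_t link_margin_ewma(uint16_t m_prev, uint16_t y, unsigned shift)
{
    uint32_t num = (uint32_t)y + (((1u << shift) - 1u) * m_prev);
    return (uint16_t)(num >> shift);
}
```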

Routing advertisement<br />

Distributed routing algorithms reduce per-node computational costs by sharing route data; with non-distributed algorithms, each node has to expand a graph by incrementally improving path costs. In Thread, nodes acting as routers use MLE advertisements to advertise their routing table to neighboring routers. A practicable approach to determining the rate at which advertisements are sent is to make it depend on the rate of change of the routing data. Thread uses the Trickle algorithm to generate dynamic, randomized transmission windows: if the routing entries are stable, the rate is reduced to a minimum.<br />

The flowchart in Figure 11 shows our implementation of the Trickle algorithm. We use a timer that recalculates its expiration time after each timeout. The limits of the time slots can be defined via C macros. After initialization, the Trickle timer runs independently of other processes.<br />

Figure 11: Flowchart of Trickle timer implementation<br />
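The timer recalculation can be sketched as follows; the interval bounds stand in for the C macros mentioned above and the randomness source is an illustrative placeholder (cf. RFC 6206 for the full Trickle algorithm, which additionally uses a redundancy counter):<br />

```c
#include <stdint.h>
#include <stdlib.h>

#define TRICKLE_IMIN_MS  1000u   /* illustrative lower interval bound */
#define TRICKLE_IMAX_MS 64000u   /* illustrative upper interval bound */

typedef struct { uint32_t interval_ms; } trickle_t;

/* Routing data changed (inconsistency): fall back to the minimum rate. */
static void trickle_reset(trickle_t *t)
{
    t->interval_ms = TRICKLE_IMIN_MS;
}

/* Recompute the next expiration: fire at a random point in the second
 * half of the current interval, then double the interval (capped), so a
 * stable network advertises less and less often. */
static uint32_t trickle_next_timeout(trickle_t *t)
{
    uint32_t half = t->interval_ms / 2u;
    uint32_t timeout = half + (uint32_t)rand() % half;

    if (t->interval_ms < TRICKLE_IMAX_MS)
        t->interval_ms *= 2u;
    return timeout;
}
```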



Unicast packet forwarding<br />

Routing inside a Thread network is performed using RLOC IPv6 addresses. Unicast packets are forwarded by applying a mesh under strategy on the 6LoWPAN layer: routed packets include a 6LoWPAN mesh header carrying the originator and final RLOC16 addresses. When receiving a packet that includes a 6LoWPAN mesh header, a routing table lookup is performed without decompressing the packet. We therefore extended the emb::6 implementation to support mesh under routing for unicast packets. During the MLE joining process, the 16-bit short MAC address is set to the RLOC16 assigned to the router. IPv6 packets from higher layers, e.g. the application layer, usually use the EID of the destination device; routers must then perform an EID-to-RLOC lookup to retrieve the router ID of the destination router. The EID-to-RLOC lookup mechanism consists of CoAP messages targeting CoAP resources provided by routers [22]. The router responsible for the given EID sends a response message including its router ID. Each router maintains an EID-to-RLOC map cache holding a list of recently used lookups; this avoids frequently sending lookup messages when transmitting fragmented packets. An end device, by contrast, has no routing capability and must forward packets to its parent router; in this case the packet is sent without a 6LoWPAN mesh header.<br />
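For illustration, writing a 6LoWPAN mesh header with 16-bit originator and final addresses looks roughly as follows (cf. RFC 4944, section 5.2: dispatch bits 10, then the V and F flags set to 1 for 16-bit short addresses, then 4 bits of HopsLeft). Buffer handling and names are illustrative:<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Write a 6LoWPAN mesh header (RFC 4944) carrying 16-bit originator and
 * final RLOC16 addresses. Returns the number of bytes written. */
static size_t mesh_header_write(uint8_t *buf, uint8_t hops_left,
                                uint16_t orig_rloc16, uint16_t final_rloc16)
{
    /* dispatch: 10 | V=1 | F=1 | HopsLft -> 0xB0 | hops (hops < 15) */
    buf[0] = (uint8_t)(0xB0u | (hops_left & 0x0Fu));
    buf[1] = (uint8_t)(orig_rloc16 >> 8);
    buf[2] = (uint8_t)(orig_rloc16 & 0xFFu);
    buf[3] = (uint8_t)(final_rloc16 >> 8);
    buf[4] = (uint8_t)(final_rloc16 & 0xFFu);
    return 5;
}
```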

Slot Number (ASN). The pairwise assignment of a directed communication between two devices, in a given timeslot on a given channel offset, is called a link.<br />

During a timeslot, one node typically sends a frame, and another sends back an acknowledgement if it successfully receives that frame. If an acknowledgement is not received within the timeout period, retransmission of the frame waits until the next transmit timeslot (in any active slotframe) assigned to that address. Figure 13 shows the structure of transmit and receive timeslots in TSCH mode. Note that CCA before transmission is a configurable option in timeslots.<br />

IV. INTEGRATION OF 6TISCH PROTOCOL<br />

A. 6TiSCH Overview<br />

Slotted networks with guaranteed quality of service are increasingly required and represent an elegant solution for many industrial applications. We therefore decided to integrate the 6TiSCH protocol. Figure 12 shows where this protocol operates within the stack. The 6top layer is a logical link control sitting between the IP layer and the TSCH MAC layer, providing the link abstraction that is required for IP operations. The 6top operations are specified in [23]. The 6top sublayer hides the complexity of the schedule from the upper layers. Time Slotted Channel Hopping (TSCH) is a MAC layer defined in the IEEE 802.15.4e-2012 amendment [24].<br />

Figure 12: 6TiSCH architecture<br />

Figure 13: The structure of transmit and receive timeslots in IEEE 802.15.4e TSCH mode [25].<br />

C. Channel Hopping<br />

One advantage of channel hopping is that it mitigates channel impairments: frequency diversity reduces the effects of interference and multipath fading. It also increases network capacity, since one timeslot can be used by multiple links at the same time.<br />

Figure 14 shows an example of how to calculate the current channel of each link in a given timeslot. In this example, each link rotates through 6 available channels over 6 cycles. The channel is calculated using the following equation:<br />

Ch = CH_Table[(ASN + ChannelOffset) % Number_of_Channels]<br />
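This computation is a direct table lookup; a minimal sketch with an assumed 6-entry hopping table (the IEEE 802.15.4 channel numbers chosen here are only an example):<br />

```c
#include <stdint.h>

/* Illustrative 6-entry channel hopping table (2.4 GHz channel numbers). */
static const uint8_t ch_table[6] = { 11, 14, 17, 20, 23, 26 };

/* Ch = CH_Table[(ASN + ChannelOffset) % Number_of_Channels] */
static uint8_t tsch_channel(uint64_t asn, uint8_t channel_offset)
{
    return ch_table[(asn + channel_offset) % 6u];
}
```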

B. Time Slotted Operation<br />

All nodes in the network are synchronized on a slotted time base. A slotframe is a collection of timeslots repeating in time; the number of timeslots in a given slotframe determines how often each timeslot repeats. The total number of timeslots that has elapsed since the start of the network is called the Absolute<br />

Figure 14: Frequency calculation<br />



D. Synchronization<br />

Device-to-device synchronization is necessary to maintain<br />

connection with neighbors in a slotframe-based network. There<br />

are two methods for a device to synchronize to the network:<br />

- Acknowledgment-based synchronization involves the receiver calculating the delta between the expected time of frame arrival (explained in Figure 15) and its actual arrival, and providing that information to the sender in its acknowledgment. This allows a sender node to synchronize to the clock of the receiver.<br />

1. The transmitter node sends a packet, timestamping the start symbol.<br />
2. The receiver timestamps the actual time of reception of the start symbol.<br />
3. The receiver calculates: TimeAdj = Expected Time − Actual Measured Time<br />
4. The receiver informs the sender of TimeAdj.<br />
5. The transmitter adjusts its clock by TimeAdj.<br />
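These steps reduce to a signed delta that travels back in the acknowledgment; a minimal sketch (tick units and function names are assumptions, not part of any cited stack):<br />

```c
#include <stdint.h>

/* Receiver side: delta between expected and measured start-symbol time.
 * Positive TimeAdj means the frame arrived early, negative means late. */
static int32_t time_adjust(uint32_t expected_ticks, uint32_t actual_ticks)
{
    return (int32_t)(expected_ticks - actual_ticks);
}

/* Transmitter side: apply the TimeAdj reported in the acknowledgment. */
static uint32_t clock_apply(uint32_t clock_ticks, int32_t time_adj)
{
    return clock_ticks + (uint32_t)time_adj;
}
```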

Configuration [I-D.ietf-6tisch-minimal] specification, and does not preclude other scheduling operations from co-existing on the same 6TiSCH network.<br />

- Neighbor-to-Neighbor Scheduling refers to the dynamic adaptation of the bandwidth of the links that are used for IPv6 traffic between adjacent routers.<br />
- Remote Monitoring and Schedule Management refers to the central computation of a schedule and the capability to forward a frame based on the cell of arrival.<br />
- Hop-by-hop Scheduling refers to the possibility of reserving cells along a path for a particular flow using a distributed mechanism.<br />

As Figure 16 shows, with RPL either static or neighbor-to-neighbor scheduling can be used. However, the current implementation supports only static scheduling.<br />

- Frame-based synchronization involves the receiver calculating the delta between the expected time of frame arrival and its actual arrival, and adjusting its own clock by the difference. This allows a receiver node to synchronize to the clock of the sender.<br />
1. The receiver timestamps the actual time of reception of the start symbol.<br />
2. The receiver calculates: TimeAdj = Expected Time − Actual Time<br />
3. The receiver adjusts its own clock by TimeAdj.<br />

Figure 16: Routing, Forwarding and scheduling [26]<br />

V. SUMMARY AND OUTLOOK<br />

With emb::6, a flexible and modular 6LoWPAN stack supporting multiple protocols has been developed, and its basic functionality and performance have been tested using an automated testbed. Next development steps will include the missing commissioning and security parts of the Thread protocol and support for further scheduling mechanisms of the 6TiSCH protocol. The implementation has already been released and is available as an open-source project on GitHub [1].<br />

Figure 15: Time Adjustment calculation [25]<br />

A node will only synchronize to its time-parent, where the tree formed by the time parents is rooted at the gateway. This forms a synchronization tree and ensures that all nodes in the network have a common notion of time.<br />

E. Scheduling<br />

The 6TiSCH architecture identifies four ways a schedule can be managed and CDU cells can be allocated:<br />
- Static Scheduling refers to the minimal 6TiSCH operation whereby a static schedule is configured for the whole network for use in a slotted-aloha fashion. The static schedule is distributed through the native methods in the TSCH MAC layer. It is specified in the Minimal 6TiSCH<br />

REFERENCES<br />

[1] emb::6, https://github.com/hso-esk/emb6.<br />

[2] IEEE 802.15.4-2011, "Part 15.4: Low-Rate Wireless Personal Area Networks (LR-WPANs)," September 2011.<br />

[3] https://tools.ietf.org/html/rfc4944.<br />

[4] https://tools.ietf.org/html/rfc6282.<br />

[5] https://tools.ietf.org/html/rfc6775.<br />

[6] http://www.riot-os.org/.<br />

[7] https://openwsn.atlassian.net/wiki/.<br />

[8] https://openwsn.atlassian.net/wiki/display/OW/uRES.<br />

[9] http://dunkels.com/adam/pt/.<br />

[10] N. M. Phuong, M. Schappacher, A. Sikora, Z. Ahmad, A. Muhammad, "Real-Time Water Level Monitoring using Low-Power Wireless Sensor Network", Embedded World Conference, Feb. 2015, Nuremberg.<br />

[11] A. Yushev, A. Walz, A. Sikora, "Securing Embedded Communication with TLS1.2", embedded world Conference 2015, Nuremberg, 24.-26. Feb. 2015.<br />

[12] RFC7252, The Constrained Application Protocol<br />

(CoAP), June 2014.<br />

[13] ETSI TS 102.690 v2.1.1. Machine-to-Machine<br />

communications (M2M); Functional architecture.<br />

October 2013.<br />

[14] Knight, Steven, "Building Software with SCons", Computing in Science and Engineering 7.1 (2005): 79-88.<br />

[15] Thread Group, Inc., Thread Specification, Revision<br />

1.1.0 (July 2016).<br />

[16] A Sikora, Funknetzwerke für das Internet der Dinge:<br />

6LoWPAN OpenSource-Projekt: emb6, Elektronik<br />

Wireless 2016, (2016).<br />

[17] B Curtis, S Ashon. Thread Open House. Thread Group,<br />

(May 2016).<br />

[18] R. K. Kelsey, Mesh Link Establishment, (Oct. 2013).<br />

[19] ARM mbed 6LoWPAN Stack Overview,<br />

https://docs.mbed.com/docs/arm-ipv66lowpanstack/en/latest/02_N_arch/,<br />

(04.02.2017).<br />

[20] Malkin, G and R Minnear, RIPng for IPv6, RFC 2080,<br />

(Jan. 1997).<br />

[21] T Agami Reddy, Applied data analysis and modeling for<br />

energy engineers and scientists, in. Boston, MA:<br />

Springer US, pp. 253–288 (2011).<br />

[22] Z Shelby, K Hartke, C Bormann, The Constrained<br />

Application Protocol (CoAP), RFC 7252, (2014).<br />

[23] https://tools.ietf.org/html/draft-ietf-6tisch-6top-protocol-<br />

09.<br />

[24] http://ieeexplore.ieee.org/stamp/stamp.jsp?arnumber=61<br />

85525.<br />

[25] Kris Pister, Chol Su Kang, Kuor Hsin Chang, Rick<br />

Enns, Clint Powell, José A. Gutierrez, Ludwig Winkel,<br />

Time Slotted, Channel Hopping MAC, 1 Sep, 2008.<br />

[26] https://tools.ietf.org/html/draft-ietf-6tisch-architecture-<br />

13.<br />



Unraveling Mesh Networking Options<br />

Benchmarking Zigbee, Thread and Bluetooth Low Energy Mesh Protocol Stacks<br />

Tom Pannell<br />

Senior Director of IoT Marketing<br />

Silicon Labs<br />

Austin, Texas<br />

tom.pannell@silabs.com<br />

Skip Ashton<br />

Vice President of Software<br />

Silicon Labs<br />

Boston, Massachusetts USA<br />

skip.ashton@silabs.com<br />

Abstract—Developers of home and building automation<br />

products have many wireless protocol choices. Zigbee and<br />

proprietary wireless controls dominate these markets today.<br />

Thread and Bluetooth mesh are new entrants to this market.<br />

Bluetooth and Wi-Fi are mature protocols that are also popular.<br />

The deployed networks, regardless of the underlying protocol,<br />

must be robust. The robustness of a network is quantified by<br />

measuring throughput, latency and reliability. These<br />

measurements depend on installation size and other system-level<br />

requirements. “One size does not fit all” when it comes to mesh<br />

networking protocol choices. Each protocol presents unique<br />

characteristics and advantages, depending on the use case and end<br />

application. Understanding the inner workings of mesh technology<br />

goes beyond a list of key features. More importantly, developers<br />

need to understand how these network protocols perform in the<br />

key areas of power consumption, throughput, latency, scalability,<br />

security and Internet Protocol (IP) connectivity. Zigbee, Thread<br />

and Bluetooth mesh are all designed differently from the ground<br />

up, and how the mesh is implemented can have an impact on<br />

performance and robustness.<br />

Keywords— mesh network; Bluetooth, Bluetooth mesh,<br />

embedded IP; low power, IEEE 802.15.4, Zigbee, Thread, BLE<br />

I. HUMANITY’S SEARCH FOR COMFORT AND SAFETY<br />

The desire to control our environment is a central aspect of<br />

human history and behavior, which has led to the establishment<br />

of permanent houses, farms, transportation and communications<br />

infrastructure, and cities. Life as we know it is a result of humans<br />

searching for comfort, convenience and a connection to the<br />

world around them. It is fundamental to the human condition to<br />

want more, to make things easier and create more comfort.<br />

Wireless technologies have developed in our modern era to<br />

enable humans to communicate over long distances and control<br />

aspects of their life to enhance comfort, convenience and<br />

security.<br />

Wireless communication has become part of the fabric of our<br />

daily lives. We have Bluetooth in our phones, Zigbee to control<br />

the buildings where we work and live, proprietary wireless is in<br />

our factories and Z-wave in the security systems that protect our<br />

homes. These wireless technologies exist to make our lives<br />

easier and more efficient. This trend of wireless connectivity has<br />

no end in sight as common objects are becoming increasingly<br />

more connected.<br />

II. WIRELESS CONNECTIVITY<br />

A. Wireless SoCs<br />

Wireless SoCs have become cost effective enough to be<br />

added to the “things” that provide us convenience, safety and<br />

comfort daily. A “thing” becomes an “IoT” device when<br />

wireless connectivity is added. Many of today’s IoT devices<br />

were previously things that didn’t have wireless connectivity to<br />

the Internet. Changing regulations and consumer expectations<br />

are forcing product manufacturers to add wireless connectivity<br />

to a myriad of products and systems to meet regulatory<br />

requirements, stay competitive or create the potential for new<br />

revenue streams.<br />

When developers choose to build IoT devices, they must<br />

consider how the end product is used and the ecosystem in which<br />

these products will operate.<br />

B. Types of Wireless Networks<br />

There are many competing wireless technologies in the IoT.<br />

Two basic topologies exist: mesh and star (Figure 1). Mesh is<br />

often preferred over star networks in home and building<br />

automation due to mesh’s ability to scale to many more nodes<br />

and cover long distances. Star networks rely on a point-to-point<br />

connection between an end-node and a central device. If the<br />

environment changes after the network is installed, a star<br />

network can fail. Mesh, on the other hand, is distributed and self-healing.<br />

If the environment changes or a node fails after the<br />

network is deployed, the mesh network can heal itself.<br />



C. Which Network is Best for Home and Building Automation?<br />

Zigbee is commonly used in building and home automation.<br />

More recently Thread and Bluetooth mesh are being considered<br />

for these applications. Z-wave is a mesh technology that is<br />

popular in home security and home automation. However, this<br />

paper does not cover Z-wave due to the lack of access to a<br />

comparable test network where results can be verified.<br />

Home and building automation includes a combination of<br />

energy harvesting devices, battery-powered devices and line-powered<br />

devices. Lighting and thermostats are typically line<br />

powered because they are part of the infrastructure, but that<br />

doesn’t mean power consumption can be ignored. Devices that<br />

are part of the infrastructure and are AC-powered must be<br />

managed carefully due to new government regulations limiting<br />

“vampire power.” Batteries usually power remote sensors and<br />

control elements. That means the mesh must comprehend two<br />

fundamentally different use cases from a power perspective.<br />

III. USE CASES<br />

There are many possible use cases in home and building<br />

automation. A few are discussed below.<br />

A. Comfort Use Case<br />

Consider, for example, lighting and environmental control in<br />

a theater or museum. These installations usually have hundreds<br />

to thousands of nodes. The lights, motors for curtains and blinds<br />

need to be controlled in a precise and choreographed way. All<br />

the lights need to dim simultaneously, and the motors controlling<br />

the curtains should all work in concert. Slight differences are<br />

noticeable and would detract from the experience of the<br />

audience. The home has similar requirements. If you are creating<br />

a scene with lights and window shades, the user expects a<br />

seamless and choreographed experience where all lights dim<br />

simultaneously and all window shades move in unison.<br />

B. Safety Use Case<br />

An environment like a warehouse may have different<br />

lighting needs than a theater. Often the lights are turned on in a<br />

section simultaneously. However, it doesn’t really matter if<br />

those lights turn on together or if it takes a few seconds for all to<br />

illuminate. The user experience and expectation are different.<br />

On the other hand, if certain lights need to turn on quickly due<br />

to a power outage, suddenly time does matter.<br />

C. Convenience Use Case<br />

A developer may want to add additional services to the<br />

wirelessly controlled lights in the warehouse described above. It<br />

may not matter if every light turns on in unison in the<br />

installation. However, it could matter how robust the network is<br />

if the developer wants to add additional services. A service that<br />

is gaining popularity in mesh installations is asset tracking. In<br />

this instance, the designer relies on the control network to also<br />

transmit data about the assets being tracked by the installed<br />

infrastructure. In this example, throughput and latency matter in<br />

terms of how quickly the asset information will propagate<br />

through the network.<br />

D. Which Mesh Protocol Is Best?<br />

The answer is not so simple. There are fundamental<br />

architectural differences between Zigbee, Thread and Bluetooth<br />

mesh. Zigbee and Thread can use flooding when required but<br />

generally use a routing mesh to minimize network overhead that<br />

can interfere with messaging. Bluetooth mesh uses a flooding<br />

mesh but allows configuration of the devices to act as routers to<br />

reduce the impact of the flooding. The Bluetooth Special Interest<br />

Group (SIG) calls this “managed flooding” [1].<br />

Zigbee and Thread networks include routing nodes and end<br />

nodes. The routing nodes are usually line powered and serve as<br />

the backbone to the mesh. The end nodes are normally battery<br />

powered, operating on the periphery of the mesh, and use routers<br />

to relay messages for them. The routing table is established when<br />

the mesh is created. The routing table is a directory of sorts that<br />

tells each device how to communicate to other devices in the<br />

mesh. In this manner, one node can efficiently communicate to<br />

another node by sending messages in a precise route through the<br />

mesh. This has a positive effect on throughput of the mesh and<br />

can reduce latency as the mesh grows.<br />
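The next-hop routing just described can be sketched as a per-node table lookup. The node names, table contents, and `route` helper below are hypothetical illustrations, not Zigbee or Thread stack code.

```python
# Sketch: each node holds a routing table mapping a destination to the
# next hop, so a message follows a precise route instead of flooding.

def route(routing_tables, src, dst, max_hops=16):
    """Follow next-hop entries from src to dst; return the path taken."""
    path = [src]
    node = src
    while node != dst:
        if len(path) > max_hops:
            raise RuntimeError("routing loop or unreachable destination")
        node = routing_tables[node][dst]  # look up the next hop toward dst
        path.append(node)
    return path

# A hypothetical 4-node line topology: A - B - C - D
tables = {
    "A": {"D": "B"},
    "B": {"D": "C"},
    "C": {"D": "D"},
}

print(route(tables, "A", "D"))  # ['A', 'B', 'C', 'D']
```

Because the tables are built when the mesh forms, each forwarding step is a constant-time lookup, which is why routing scales better than flooding as the mesh grows.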

Routing mesh is historically preferred to a flooding mesh<br />

because it provides more efficient communications and<br />

predictable performance. On the other hand, it is more difficult<br />

to implement for the developers of the stack.<br />

IV. PACKET STRUCTURE<br />

A. Zigbee and Thread Packet Structure<br />

Both Zigbee and Thread use IEEE 802.15.4 with 127 byte<br />

packets and an underlying data rate of 250 kbps. While the PHY<br />

headers are the same, the packet structure is different, resulting<br />

in slightly different payload sizes. Zigbee packet format is<br />

shown in Figure 2 and results in a 68 byte payload. For payloads<br />

above 68 bytes, Zigbee fragments into multiple packets. Thread<br />

packet format is shown in Figure 3 and results in a 63 byte<br />

payload. For payloads above 63 bytes, the Thread stack<br />

fragments using 6LoWPAN. Silicon Labs’ mesh performance<br />

data is based on payload size as this is the design parameter of<br />

concern when building an application.<br />

As noted above, each of these networks fragments larger<br />

messages into smaller ones. For Zigbee, fragmentation occurs at<br />

the application layer and is performed end to end from the source<br />

to the destination. For Thread, the fragmentation is done at the<br />

6LoWPAN layer, also end to end from source to destination.<br />



For unicast forwarding within these networks, the message<br />

is forwarded as soon as the device is ready to send. For multicast<br />

forwarding, there generally are networking requirements for<br />

how messages are forwarded. These include:<br />

a. For Zigbee devices, a multicast message is forwarded<br />

by a device only after jitter of up to 64 milliseconds. However,<br />

the initiating device has a gap of 500 milliseconds before<br />

retransmitting the initial message.<br />

b. For Thread devices, RFC 7731 MPL forwarding is<br />

used. The trickle timer is set to 64 milliseconds so the devices<br />

back off a random amount up to this time before retransmitting.<br />
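The multicast timers above can be sketched numerically. The 64 ms relay jitter, 500 ms initiator gap, and 64 ms trickle interval come from the text; the model itself (uniform jitter, worst-case per-hop accumulation) is an illustrative simplification, not either stack's actual scheduler.

```python
# Sketch of the multicast forwarding timers described in the text.
import random

RELAY_JITTER_MS = 64           # max random delay before a device forwards
ZIGBEE_INITIATOR_GAP_MS = 500  # gap before the initiator retransmits
THREAD_TRICKLE_MS = 64         # RFC 7731 MPL trickle timer interval

def relay_delay_ms(rng=random):
    """Random forwarding delay for one relaying device (Zigbee or Thread)."""
    return rng.uniform(0, RELAY_JITTER_MS)

def worst_case_flood_ms(hops):
    """Upper bound on multicast propagation if every hop waits the full jitter."""
    return hops * RELAY_JITTER_MS

print(worst_case_flood_ms(7))  # 448 ms across a 7-hop network
```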

B. Bluetooth LE Packet Structure<br />

Bluetooth low energy has the following packet structure to<br />

minimize time on air and energy consumption.<br />

Bluetooth mesh further refined this packet structure to add<br />

the mesh and security capabilities.<br />

IVI/Network ID (1 byte) | CTL/TTL (1 byte) | Sequence Number (3 bytes) | Source Address (2 bytes) | Dest Address (2 bytes) | Packet Payload (12 or 16 bytes) | NWK MIC (4 or 8 bytes)<br />

Figure 4 – Bluetooth Mesh Packet Format<br />

This means Bluetooth mesh has only 12 or 16 bytes available<br />

for payload, and beyond this the packets are segmented into<br />

individual packets and reassembled at the destination. This<br />

segmented packet carries a header identifying the segment and<br />

12 bytes of application payload except for the last segment<br />

which can be shorter. However, there are additional backoff<br />

requirements in the Bluetooth mesh specification that space out<br />

these segmented packets, increasing latency and decreasing<br />

throughput. As all of our throughput and latency analysis is<br />

based on application payload, we can see that Bluetooth mesh<br />

will require more packets than Zigbee or Thread because of this<br />

lower packet payload size.<br />
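The fragmentation arithmetic above can be made concrete. The payload limits (68 bytes per Zigbee packet, 63 per Thread packet, 12 bytes per Bluetooth mesh segment once a message must be segmented) come from the text; the helper is a sketch of the ceiling division, not any stack's implementation.

```python
# Sketch: how many over-the-air packets/segments a payload fragments into.
import math

MAX_PAYLOAD_BYTES = {"zigbee": 68, "thread": 63, "bt_mesh_segment": 12}

def packets_needed(protocol, payload_bytes):
    """Packet count for a given application payload (ceiling division)."""
    return math.ceil(payload_bytes / MAX_PAYLOAD_BYTES[protocol])

for proto in ("zigbee", "thread", "bt_mesh_segment"):
    print(proto, packets_needed(proto, 100))
# A 100-byte payload needs 2 Zigbee packets, 2 Thread packets, and
# 9 Bluetooth mesh segments -- the root of the latency and throughput
# penalty discussed for Bluetooth mesh.
```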

V. ROUTING VS FLOODING MESH<br />

As stated previously, Zigbee, Thread and Bluetooth mesh<br />

were designed for home and building automation. Zigbee<br />

supports several routing techniques including flooding of the<br />

mesh for route discovery or group messages, next hop routing<br />

for controlled messages in the mesh, and many-to-one routing to<br />

a gateway, which then uses source routing out to devices. It is<br />

normal for a Zigbee network to use all of these methods<br />

simultaneously.<br />

Thread also supports next hop routing as well as flooding.<br />

However, thread networks maintain next hop routes to all routers<br />

as part of normal network maintenance instead of a device<br />

performing route discovery. Thread also minimizes the number<br />

of active routers to address scalability to large networks.<br />

Previously, this has been viewed as a limitation for embedded<br />

802.15.4 networks because the network flooding in the presence<br />

of a large number of routers limited the frequency and reliability<br />

of multicast traffic. Note that the thread network manages the<br />

number and spacing of active routers, and user intervention or<br />

management is not required.<br />

Bluetooth mesh supports managed flooding. This is a slight<br />

spin on flooding mesh in that the user can designate which<br />

powered devices participate in the flooding. This will reduce the<br />

impact of flooding but requires the user to determine the<br />

appropriate density and topology for routers in their network and<br />

this can be difficult. As network conditions change over time,<br />

which devices participate in the flood may also need to change<br />

and this would require user intervention. Bluetooth also has end<br />

devices similar to Zigbee or Thread, and these are called<br />

“friendship” devices. A friendship device is coupled with an<br />

adjacent powered node, and packets for the friend are stored by<br />

the line-powered node. The friend will wake periodically to ask<br />

the neighbor if there are any packets. The powered node only<br />

saves the packet for a defined period of time so the “friend”<br />

needs to check in with its paired relay node.<br />

Figure 5 – Bluetooth Mesh Example [2]<br />
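The friendship behavior just described amounts to store-and-forward with an expiry: the line-powered node buffers packets for its sleepy "friend", which polls periodically, and buffered packets are only held for a defined time. The class, names, and timings below are illustrative, not taken from the Bluetooth mesh specification.

```python
# Sketch of a friend node's buffer with a hold-time limit.
from collections import deque

class FriendQueue:
    def __init__(self, hold_time_ms):
        self.hold_time_ms = hold_time_ms
        self.q = deque()  # entries of (arrival_ms, packet)

    def store(self, now_ms, packet):
        """Line-powered node buffers a packet destined for the friend."""
        self.q.append((now_ms, packet))

    def poll(self, now_ms):
        """Friend wakes and collects packets; expired ones are dropped."""
        fresh = [p for t, p in self.q if now_ms - t <= self.hold_time_ms]
        self.q.clear()
        return fresh

fq = FriendQueue(hold_time_ms=1000)
fq.store(0, "on")
fq.store(900, "off")
print(fq.poll(1100))  # ['off'] -- the first packet aged out before the poll
```

This is why the friend must poll often enough: anything buffered longer than the hold time is lost.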

Our study of mesh topologies analyzes both small and large<br />

networks. These networks can behave very differently, and the<br />

routing and management techniques often need to change when<br />

considering a 10-node network or a 200-node network.<br />

Typically, in a small network, devices are within 1 or 2 hops<br />

and very simple routing or flooding can be suitable. As the<br />

network grows in size, it adds complexity such as more hops<br />

between devices, density of devices which may interfere with<br />

each other when sending messages, and more concern over<br />

latency and reliability. If a flood type message is used to turn on<br />

100 lights, it is normally not acceptable for 98 or 99 of the 100<br />

lights to turn on or off. This type of problem is rare in a 10-node<br />

network and may become common in a 100-node network.<br />
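A back-of-the-envelope model shows why this failure mode scales with node count: if each light independently receives a group message with probability p, the chance that all N lights respond is p**N. The p value below is an assumption for illustration, not a measured figure.

```python
# Sketch: probability that every node in a group responds to one command.
def all_nodes_ok(p_per_node, n_nodes):
    """Chance that all n_nodes receive the message, assuming independence."""
    return p_per_node ** n_nodes

p = 0.999  # assumed per-node delivery probability (illustrative)
print(round(all_nodes_ok(p, 10), 3))   # ~0.99: failures rare at 10 nodes
print(round(all_nodes_ok(p, 100), 3))  # ~0.905: roughly 1 in 10 group
                                       # commands leaves a light unchanged
```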

VI. FIGURES OF MERIT<br />

In the previously cited use cases, the designer desires a<br />

robust network for the application. The figures of merit to be<br />

measured in assessing the robustness of a network are<br />

throughput, latency and reliability. These three measurements<br />

can accurately predict the robustness of a network for a given<br />

installation.<br />

Throughput defines the scalability of the network (how<br />

many devices can be sending normal traffic) and also the<br />

behavior for higher data operations such as pushing a firmware<br />

update to devices.<br />

Latency describes how long it takes for an action to happen.<br />

It is a critical parameter for any interaction involving end users<br />

(as opposed to machine-to-machine communications) as most<br />

people can detect operations that take longer than 100<br />

milliseconds. For processes where simultaneous operation is<br />



desired, such as turning on multiple lights, the timing must be<br />

lower than 100 ms so that end users do not complain of a<br />

“popcorn” effect as lights turn on in succession.<br />

Reliability is taken for granted, but when interacting with<br />

everyday devices such as lights and switches, users expect<br />

nearly 100 percent reliability. As a matter of practice, Silicon<br />

Labs tests to 99.999 percent reliability.<br />

These are the most critical aspects of the mesh network to<br />

measure and strongly relate to the design goals for devices and<br />

wireless systems no matter what technology or underlying<br />

wireless is used.<br />

VII. TEST SETUP<br />

To minimize the variability of device testing, the test can be<br />

performed in fixed topologies where the RF paths are wired<br />

together through splitters and attenuators to ensure the topology<br />

does not change over time and testing. This is used for the 7 hop<br />

testing to ensure the network topology. MAC filtering can also<br />

be used to achieve the network topology.<br />

Large network testing is best conducted in an open-air<br />

environment where device behavior is based on existing and<br />

varying RF conditions. The Silicon Labs lab in Boston,<br />

Massachusetts (USA) is used for this open-air testing process.<br />

The wireless conditions in the open-air testing environment<br />

have typical Wi-Fi and Zigbee traffic present as noise. This<br />

traffic comes from a typical building control system that operates<br />

independently of the test network and of any tests being performed.<br />

[Chart: latency in milliseconds (0–200) versus hops (1–7); series: 10- and 20-byte Thread, 8- and 11-byte Bluetooth unsegmented, 8- and 16-byte Bluetooth segmented]<br />

Figure 6 – Latency vs Hops<br />

This chart (Figure 6) shows the average latency per hop for<br />

a Thread network versus Bluetooth mesh unsegmented and<br />

segmented packets. Zigbee data is not included as it is similar to<br />

Thread. In this example, we can see for these smaller payloads<br />

the Bluetooth unsegmented and Thread latency is very similar<br />

out to 6 hops. As we add the Bluetooth segmented packet and<br />

increase the payload to 16 bytes, the latency increases<br />

substantially due to the additional packets being transmitted.<br />

[Chart: latency in milliseconds (0–1200) versus payload size in bytes (10–290) at 4 hops, comparing Thread and Bluetooth mesh]<br />

Figure 7 – Thread vs Bluetooth Mesh Latency<br />

Looking at 4 hop data with increasing payload (Figure 7),<br />

Bluetooth mesh has higher latency as it has to use segmented<br />

messages. This shows the importance of Bluetooth mesh devices<br />

trying to keep payloads within one packet to avoid this increased<br />

latency in applications where it is an important factor.<br />

Silicon Labs has published additional details about mesh<br />

network performance testing in the application note, AN1132.<br />

VIII. TEST RESULTS<br />

Comprehensive test results are available in AN1132<br />

published by Silicon Labs.<br />

IX. CONCLUSION<br />

The choice of mesh network depends on the end application<br />

or ecosystem. There are many established ecosystems such as<br />

Philips Hue, Amazon Echo Plus, Comcast Xfinity and countless<br />

others. If a device manufacturer wants to interoperate with these<br />

ecosystems, Zigbee is an optimal choice.<br />

If the ecosystem has not been specified for the application,<br />

then many protocol choices are available. Thread and Bluetooth<br />

mesh are both viable options and the most commonly considered<br />

aside from Zigbee. Development tools provided by the IC<br />

vendor matter greatly in terms of how quickly a mesh can be<br />

developed. Tools such as packet trace and a multi-node energy<br />

profiler can ensure whichever mesh is chosen will be robustly<br />

designed. Ultimately, the network size, the required latency,<br />

desired throughput and overall reliability will drive the choice of<br />

mesh protocols.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank Matteo Paris, Dave Fiore,<br />

Alex Showers, Hannu Mallat and Petri Pitkanen, all from Silicon<br />

Labs, for their tireless work to collect and analyze mesh network<br />

performance.<br />



REFERENCES<br />

[1] Woolley, M. (2017, August 1). “An Intro to Bluetooth Mesh Part 2.”<br />

https://blog.bluetooth.com/an-intro-to-bluetooth-mesh-part2<br />

[2] Di Marco, P., Skillermark, P., Larmo, A., & Arvidson, P. (2017, July 22).<br />

“Bluetooth Mesh Networking.”<br />

https://www.ericsson.com/en/publications/white-papers/bluetooth-mesh-networking<br />



Meeting the Challenge of Coexistence<br />

in the Connected Home<br />

Brian G. Bedrosian<br />

VP of Marketing, IoT Business Unit<br />

Cypress<br />

San Jose, California<br />

Abstract—The increasing number of IoT-based devices in the<br />

home has led to a densely populated network. Without coexistence<br />

measures at the hardware, software, and system levels, the<br />

performance, reliability, and quality of the user experience will be<br />

negatively affected. This paper details coexistence requirements,<br />

considers the impact of mesh networks, and explores technologies<br />

like Real Simultaneous Dual Band (RSDB) that enable<br />

collaborative coexistence.<br />

Keywords—IoT; Connected Home; Coexistence; RSDB;<br />

Bluetooth Mesh; Collaborative Coexistence; Managed Arbitration;<br />

Global Coexistence Interface; GCI; Connected Car; Connected<br />

Auto;<br />

I. INTRODUCTION<br />

As could be seen across nearly the entire show floor at the<br />

Consumer Electronics Show this year, the connected smart<br />

home is a reality. The proliferation of connected devices and<br />

the variety of Internet of Things (IoT) applications is<br />

staggering. According to Cisco, we can expect there to be as<br />

many as 50 billion “things” and more than 50 million homes<br />

connected to the Internet by 2020. Everything seems to be<br />

getting connected, from our alarm clocks to our lights to our<br />

kitchens. Our bodies are becoming connected in a multitude of<br />

manners through a wide range of sensors. Even our pets and<br />

livestock have been affected by the IoT.<br />

The IoT was initially driven by smartphones. Smartphones<br />

provided a simple way for users to interface with connected<br />

systems. In addition, the availability of standardized and<br />

TCP/IP networked wireless technologies has led to the<br />

adoption of many protocols in the 2.4 GHz band, including Wi-Fi,<br />

Bluetooth, Bluetooth Low Energy (BLE), and 802.15.4 applied as<br />

ZigBee and Thread.<br />

Early smart home applications focused on adjusting the<br />

temperature, controlling lights, and streaming media. With low<br />

cost, low power wireless technology, it has become possible to<br />

add intelligence to the home in a way that goes far beyond<br />

simple sensors or digital entertainment. More advanced<br />

applications driving greater innovation today include home<br />

security, water quality monitoring, pollution detection, smart<br />

appliances, and many others.<br />

II. COMPETITION WITHIN THE CONNECTED HOME<br />

Increasing the number of connected devices, however, is<br />

making the home a densely populated network (see Figure 1).<br />

Further complicating the issue is the potential of other nearby<br />

densely populated networks. In an apartment building, for<br />

example, the connected home could be surrounded on all sides<br />

by other networks attempting to utilize the same spectrum<br />

simultaneously.<br />

Fig. 1. The increasing number of connected devices has turned the home into<br />

a densely populated network.<br />

Additionally, the proliferation of voice services like Alexa<br />

require greater quality of service (QoS) amidst increasing<br />

multimedia streaming to maintain their value. The need to<br />

support advanced capabilities like voice services is on the<br />

rise. According to Consumer Intelligence Research Partners,<br />

Amazon has sold 20 million Echo units since 2015, with 15<br />

million of these sold in the last 12 months. Similarly, Google<br />

Home units have seen tremendous growth, with sales estimated<br />

at 7 million since their debut last year [1]. The connected<br />

home, one that can listen and respond to us, continues to<br />

evolve, as does its wireless needs.<br />

With so many wireless technologies sharing spectrum in the<br />

connected home, there is a critical need for robust wireless<br />



coexistence measures to ensure throughput performance,<br />

reliability, and quality across multiple radios and use cases.<br />

Consider the increasing importance of reliability and fidelity in<br />

the connected home. Audio distribution across several<br />

simultaneous channels is becoming the norm, and any<br />

significant interruption or interference will negatively impact<br />

the user experience.<br />

Without some form of coexistence measures in place, the<br />

connected home will not be able to provide the level of<br />

responsiveness, reliability, or fidelity consumers are<br />

demanding. There is often too much competition for<br />

bandwidth. There are a great many protocols in use, and as<br />

each uses different methods for securing bandwidth, this<br />

creates additional contention. Aggravating this problem is that<br />

many devices are designed as if they are the only connected<br />

device in the room. They don’t take into account how crowded<br />

the home is getting.<br />

For example, some devices “talk” too much, consuming<br />

more than a reasonable share of the available bandwidth. They<br />

might also interrupt other devices in their communications<br />

because their data is of a “higher” priority. Some devices<br />

broadcast with more power than they need, effectively shouting<br />

over quieter, more cooperative devices. This creates<br />

undesirable contention, triggering retransmissions, reducing<br />

effective range, and increasing the difficulty of finding a clear<br />

slot in which to broadcast. This in turn lowers the<br />

quality of real-time streaming, potentially resulting in glitches a<br />

user can hear or see.<br />

The bottom line is that when devices don’t coexist with<br />

each other, bandwidth is wasted, reliability drops, and quality<br />

suffers. In truth, as the number of connected devices continues<br />

to rise in our homes, coexistence may well determine the<br />

success and rate of adoption of IoT technologies in the home.<br />

III. THE NEED FOR COEXISTENCE<br />

Coexistence refers to well-defined measures that manage<br />

medium access and connection when radios in the same<br />

location are operating simultaneously in adjacent or<br />

overlapping radio frequency spectrums using different<br />

protocols.<br />

The three most common coexistence issues are, from most<br />

to least severe:<br />

Overlapping spectrum use, such as happens in the 2.4<br />

GHz band between Wi-Fi, Bluetooth, and 802.15.4<br />

Adjacent frequency spectrum<br />

Harmonics and intermodulation distortion<br />

To provide effective coexistence that addresses these<br />

issues, the following requirements typical of area-constrained<br />

environments like the connected home must be met:<br />

Multi-use: Users perceive that they need simultaneous<br />

operation across several devices, such as viewing<br />

photos while listening to high-fidelity audio and issuing<br />

voice commands to adjust the lights.<br />

Frequency Domain: It may be the case that the<br />

requirements for applications coexisting together are<br />

higher than is required for standalone operation.<br />

Devices that cannot provide enough performance to<br />

meet these additional requirements may end up bringing<br />

down the performance of every device with which<br />

they are sharing bandwidth.<br />

Time Domain Arbitration: Allocation of bandwidth<br />

must take into account the varying QoS requirements of<br />

each device, application, and potentially user.<br />

To be able to meet these requirements, coexistence must be<br />

considered at the beginning of the design process, otherwise<br />

devices may not operate properly when they are deployed in<br />

densely populated networks. In addition, developers must also<br />

consider usage scenarios at the outset of design so these<br />

requirements can be clearly outlined, understood, and<br />

addressed. If usage scenarios are not considered, developers<br />

may find that they are locked out of certain systems due to<br />

incompatible design choices. Finally, sufficient bandwidth<br />

must be allocated for each operation/application with<br />

appropriate quality of service (QoS) capabilities to ensure a<br />

quality user experience.<br />

A traditional approach to coexistence is to use frequency<br />

domain methods that prevent radios from using the same<br />

spectrum. Overlapping bands between protocols, however,<br />

prevents this approach from being the sole effective approach<br />

in the connected home.<br />

IV. COLLABORATIVE COEXISTENCE<br />

Devices targeted for the connected home will need to be<br />

able to operate under a wide range of challenging conditions.<br />

Failure to do so may prevent devices from achieving sufficient<br />

coexistence. Such devices may have poor market acceptance<br />

from consumer reviews once it is discovered that they offer<br />

substandard performance compared to devices that have been<br />

designed with coexistence as an early design priority. And, if<br />

the industry cannot work together as a whole to ensure better<br />

coexistence, this will delay overall adoption of the connected<br />

home.<br />

For devices to work together to share bandwidth, they need<br />

a mechanism to manage arbitration across devices, protocols,<br />

and applications. Collaborative coexistence provides a<br />

methodology by which Wi-Fi, Bluetooth, and 802.15.4 can be<br />

collocated. To be successful, managed arbitration must be<br />

applied at all levels of design:<br />

Hardware: Apply spatial separation and filters<br />

Software: Time domain multiplexing of radios with<br />

time synchronization and radio frames<br />

System: Combined intelligence between devices<br />

through mechanisms like Packet Traffic Arbitration and<br />

the Global Coexistence Interface<br />

Collaboration at the hardware level takes place between<br />

radios collocated in the same device. Leading Wi-Fi /<br />

Bluetooth combination chips implement highly sophisticated<br />

hardware mechanisms and algorithms that provide enhanced<br />

collaborative coexistence between subsystems. These<br />



mechanisms enable Wi-Fi and Bluetooth to operate<br />

simultaneously while ensuring maximum access time<br />

utilization and high throughput. This is essential to guarantee<br />

QoS for applications such as wireless audio.<br />

Integrating collaborative coexistence capabilities in<br />

connected devices can provide optimal performance and a<br />

superior user experience. Coexistence begins with the interface<br />

between the Wi-Fi / Bluetooth and 802.15.4 subsystems.<br />

Collaboration between Wi-Fi and Bluetooth can be<br />

implemented according to IEEE 802.15.2 Packet Traffic<br />

Arbitration and using the Global Coexistence Interface (GCI).<br />

This interface must be simple to facilitate effective arbitration.<br />

Instead of using a system bus where coexistence signals would<br />

have to complete with other system messages, a dedicated 3-<br />

wire coexistence interface enables optimal signaling between<br />

subsystems.<br />

Figure 2 shows the GCI in action. Only three signals are<br />

used for handshaking. The Bluetooth subsystem asserts<br />

RF_ACTIVE to request antenna access before transmitting and<br />

uses STATUS to indicate both the priority and the Tx/Rx slots.<br />

TX_CONF then allows or denies the request: if antenna access<br />

is granted, TX_CONF is asserted; if not, it remains de-asserted.<br />

When the Bluetooth subsystem completes its transaction, it<br />

de-asserts RF_ACTIVE.<br />

Fig. 3. The Global Coexistence Interface (GCI) is a dedicated 3-wire<br />

interface that works with an external ZigBee radio to maximize Wi-Fi<br />

throughput, voice quality, and link performance.<br />

Managed arbitration is essential to the success of the<br />

connected home. A unified connectivity coexistence<br />

framework like collaborative coexistence is needed to enable a<br />

superior IoT experience across multiple radio technologies.<br />

This widely supported industry effort encompasses best<br />

practices like APIs, programmability to accommodate different<br />

use cases and environments, and support for multiple OSes and<br />

platforms. Collaborative coexistence makes optimal use of the<br />

spectrum to provide the best user experience. It also facilitates<br />

an effective software ecosystem with interoperability and<br />

openness to inspire innovation and new business models.<br />

V. REALIZING THE PROMISE OF BLE MESH<br />

New Bluetooth Mesh technology promises to accelerate the<br />

adoption of connected home technology by simplifying<br />

provisioning and management of nodes in applications such as<br />

smart lighting control and home medical applications. BLE<br />

meshes are being used to deploy a great variety of wireless<br />

sensors and controllers, both line-powered and battery-operated.<br />

Users will be able to easily set up secure networks<br />

and directly control devices through their mobile devices (see<br />

Figure 4).<br />

Fig. 2. The Global Coexistence Interface (GCI) uses only three signals for<br />

handshaking.<br />

The decision-making part of this exchange is done by the<br />

Packet Traffic Arbitrator (PTA). The PTA allows or denies the<br />

Bluetooth subsystem’s initial request, as well as its request for<br />

access to the antenna. Note that the PTA has a real-time<br />

information exchange with the Bluetooth subsystem to<br />

determine its priority. In general, high-priority requests are<br />

granted.<br />

Note that the GCI can be applied by external radios such as<br />

Thread and ZigBee to access the PTA. Figure 3 shows how the<br />

dedicated 3-wire coexistence interface works with an external<br />

ZigBee radio. Three GPIO signals are used for the interface:<br />

WLAN Request, WLAN Priority, and WLAN Grant. Using the<br />

prioritization approach between data types and applications,<br />

optimal performance can be achieved, resulting in maximum<br />

Wi-Fi throughput, voice quality, and link performance.<br />
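The arbitration decision described above can be sketched as a priority comparison: each radio asserts a request line with a priority, and the PTA grants the antenna to the highest-priority active request. The signal names echo the text (request/priority/grant), but the policy below is a deliberate simplification, not the actual chip logic.

```python
# Sketch of a Packet Traffic Arbitrator (PTA) grant decision.

def pta_grant(requests):
    """requests: dict of radio -> (request_asserted, priority).
    Returns the radio granted antenna access, or None if no requests."""
    active = {r: prio for r, (asserted, prio) in requests.items() if asserted}
    if not active:
        return None
    # Highest priority wins; ties broken by name for determinism.
    return max(sorted(active), key=lambda r: active[r])

# Bluetooth asserts its request with high priority while a lower-priority
# WLAN request is pending:
grant = pta_grant({"bluetooth": (True, 2), "wlan": (True, 1)})
print(grant)  # bluetooth
```

In the real interface the grant is signalled back by asserting TX_CONF (or WLAN Grant for an external radio); here the return value stands in for that line.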

Fig. 4. BLE Mesh technology simplifies provisioning and management of<br />

nodes in applications such as smart lighting control and home medical<br />

applications. However, meshes without coexistence measures can create<br />

bursts of interference that can disrupt real-time functions within the connected<br />

home like voice control and audio streaming.<br />

The increasing presence of BLE Mesh in the home,<br />

however, is only going to aggravate coexistence issues and<br />

introduces a unique set of challenges for developers. Consider<br />

a mesh-based home lighting system. Rather than connecting<br />

through the power line, each light has its own wireless radio.<br />

To keep cost and energy consumption down, this radio uses<br />

BLE and communicates to the home controller through other<br />

BLE-enabled lights, thus creating a mesh. To turn on a specific<br />

light, the controller sends out a message to the nearest<br />

connected light, which then relays the message to another light<br />

until the desired light is reached.<br />

Without some point of intelligent control, this mesh can<br />

wreak havoc on the connected home environment. A mobile-based<br />

application might broadcast commands continuously to<br />

ensure the message gets through in a timely fashion. Each node<br />

in the mesh will do the same. As a result, such commands may<br />



create a burst of interference that visibly disrupts real-time<br />

communications such as voice control or audio playback.<br />

Mesh coexistence is ideally implemented outside of any application requirements, so that the reliability of the home environment does not depend on every app developer correctly applying Bluetooth Mesh coexistence measures. Rather, coexistence technology is built into a dedicated mesh controller.<br />

Figure 5 shows the command flow for an Android device<br />

controlling a BLE mesh-based lighting system. The Host<br />

Controller Interface (HCI) enables the Android device to<br />

communicate with the dedicated mesh controller via a UART<br />

connection, thus completely abstracting all wireless transmit,<br />

receive, and coexistence functionality. The dedicated mesh<br />

controller is then able to schedule mesh communications<br />

collaboratively with the rest of the Wi-Fi / Bluetooth connected<br />

home environment to implement coexistence measures<br />

effectively and ensure reliable communications.<br />
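The HCI-over-UART link mentioned above uses the standard Bluetooth UART (H4) framing: a one-byte packet indicator, a 16-bit little-endian opcode, a one-byte parameter length, and the parameters. The sketch below frames such a command; the mesh opcode and payload layout are hypothetical, since the real command set is defined by the mesh controller firmware:

```python
import struct

# Illustrative HCI command framing for a UART (H4) transport:
# 0x01 packet indicator, 16-bit little-endian opcode, 1-byte
# parameter length, then the parameters themselves.
def hci_command(opcode: int, params: bytes) -> bytes:
    return struct.pack("<BHB", 0x01, opcode, len(params)) + params

# Hypothetical vendor-specific opcode for a "set light" mesh command;
# the actual opcodes are defined by the mesh controller firmware.
MESH_SET_LIGHT = 0xFC10
packet = hci_command(MESH_SET_LIGHT, bytes([0x05, 0x01]))  # node 5, ON
print(packet.hex())  # 0110fc020501
```

The Android host simply writes such frames to the UART; all wireless transmit, receive, and coexistence behavior stays inside the controller.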

Fig. 5. Coexistence can be transparently implemented in BLE meshes with a<br />

dedicated mesh controller using a Wi-Fi / Bluetooth enabled platform like<br />

Cypress’ Wireless Internet Connectivity for Embedded Devices (WICED) as<br />

shown here. Rather than require every application to apply coexistence<br />

correctly, the dedicated mesh controller abstracts coexistence functionality<br />

and collaborates with the rest of the connected home network to ensure<br />

reliable communications.<br />

VI. REAL SIMULTANEOUS DUAL BAND (RSDB)<br />

Work continues to be done to improve the efficiency of<br />

wireless technology through coexistence. One of the recent<br />

innovations available to developers is WLAN Real<br />

Simultaneous Dual Band (RSDB) technology.<br />

RSDB brings the capabilities of a high-end router to IoT<br />

applications. By collocating 2.4 GHz Wi-Fi, 5 GHz Wi-Fi, and<br />

Bluetooth, a single WLAN RSDB controller can implement coexistence measures that ensure optimal use of bandwidth, real-time fidelity, and overall network reliability.<br />

Collocating wireless radios in this way greatly simplifies<br />

many elements of the connected home network. Users have a<br />

more satisfying experience because they can seamlessly use<br />

any part of the bandwidth all of the time. Because wireless<br />

traffic aggregates in the RSDB controller, capacity can be<br />

efficiently allocated across all available spectrum to multiple<br />

users and for all use cases. A centralized controller also makes<br />

it possible to support multiple independent streaming channels<br />

in a manner that eliminates contention to maximize QoS and<br />

quality of content.<br />

Bandwidth utilization is optimized through fully concurrent<br />

operation in 2.4 GHz and 5 GHz bands between Wi-Fi and<br />

Bluetooth across all active applications. Because the controller<br />

manages coexistence for all traffic it carries, the losses due to<br />

interference from contention can be substantially reduced. As a<br />

result, RSDB is capable of providing full Bluetooth throughput<br />

(>2 Mbps), 802.11n throughput (> 50 Mbps), and 802.11ac<br />

throughput (> 300 Mbps), all at the same time without<br />

degradation.<br />

System design can also be simplified through the use of a<br />

dongle-based architecture. This refers to the ability of the<br />

controller to offload certain tasks from the host processor and<br />

simplify system integration. For example, 802.11 processing of<br />

Ethernet packets exchanged between the controller and host<br />

can be handled on-chip. Additional offloading includes<br />

Preferred Network Offload (PNO) and Address Resolution<br />

Protocol (ARP) processing.<br />

To further ease design and integration, RSDB technology<br />

can be implemented in a processor- and operating-system-agnostic manner. This makes it much easier to introduce RSDB<br />

into environments like the connected home.<br />

VII. THE CONNECTED CAR<br />

One of the key markets for RSDB beyond the connected<br />

home is the connected car. Rising use of Wi-Fi in vehicles,<br />

added to existing Bluetooth usage, is only going to increase<br />

capacity demands. In addition, the densely populated nature of<br />

the car presents a highly challenging environment for<br />

coexistence, and RSDB is a key technology for supporting true<br />

use-case concurrency.<br />

For example, a family of four will typically have two to<br />

four cell phones in addition to a tablet or two. One cell may be<br />

delivering navigation information, several streaming music,<br />

while the tablets stream video. The car itself could have active voice controls, be tethered to a phone, and share displays.<br />

Any of these devices could also be accessing the Internet or<br />

providing hotspot capabilities for another device. This is a<br />



tremendous number of radios and real-time data streams to accommodate simultaneously in such a confined space.<br />

To provide reliable connectivity that can stream quality<br />

video and high-fidelity audio, the connected car needs<br />

technology like RSDB. Its dual-band capabilities can provide<br />

the needed throughput and reliability. For example, the 2.4<br />

GHz band could be used for real-time audio streaming and data<br />

delivery while the 5 GHz band is used to carry streaming<br />

video. Aggregating data in this way makes it easier to<br />

interweave data without negatively impacting quality (see<br />

Figure 6).<br />

Fig. 6. The dual-band capabilities of RSDB enable the 5 GHz band to carry<br />

video and the 2.4 GHz band to interweave real-time audio streaming and data<br />

delivery without negatively impacting quality.<br />
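Such a band-assignment policy can be sketched as a simple lookup. The rules below are illustrative assumptions; a real RSDB controller schedules traffic dynamically based on load and coexistence state:

```python
# Illustrative static band-assignment policy for an RSDB controller:
# latency-sensitive audio and bulk data on 2.4 GHz, high-bandwidth
# video on 5 GHz (assumed rules matching the example in the text).
def assign_band(stream_type: str) -> str:
    policy = {
        "audio": "2.4GHz",
        "data":  "2.4GHz",
        "video": "5GHz",
    }
    return policy.get(stream_type, "2.4GHz")  # default to 2.4 GHz

print(assign_band("video"))  # 5GHz
```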

The IoT is growing quickly, and our homes and cars are<br />

only going to get more crowded. Wireless technologies like<br />

BLE, 802.11ac, and RSDB are essential for the IoT to move<br />

forward. By implementing collaborative coexistence measures<br />

in hardware, software, and at the system level, developers can<br />

ensure the performance, reliability, and fidelity of the<br />

connected home.<br />

REFERENCES<br />

[1] CIRP, 2017<br />

[2] Cypress WICED ® IoT Developer Community:<br />

www.cypress.com/wicedcommunity; 2017<br />



Building IoT Solution Effectively<br />

Simon Chudoba<br />

IQRF Alliance z.s., CEO<br />

Jicin, Czech Republic<br />

simon.chudoba@iqrf.org<br />

The Internet of Things is a young but very promising market segment that is catching the attention of many companies all around the world. Technicians and business people alike want to realize a simple, fast and cost-effective Proof of Concept project to evaluate both the technical and business aspects of a specific use case. This is not a simple task, since IoT is a very complex area with hundreds of elements that must fit together. The goal of the members of the IQRF Alliance is to provide these elements, from end devices through gateway hardware and software up to clouds and mobile apps, so that building up an IoT project is a matter of a couple of days. How far we have come, what is ready and what challenges lie ahead are the key questions answered in this paper.<br />

Internet of Things, IQRF Technology, IQRF Alliance,<br />

IQRF Ecosystem, Wireless Mesh Network, fog/edge<br />

computing<br />

I. INTRODUCTION<br />

IoT seems to be, and at the end of the day must be, very simple. For the user it should be just a matter of using a smart phone or tablet to monitor, manage and control a home, business, city or any other “thing”. On the other hand, if you take a closer look at the IoT ecosystem, you realize it is a large puzzle of dozens or rather hundreds of pieces that must fit together.<br />

Building a well-working solution, and doing it easily, quickly and cost-effectively, is a big challenge even for a very experienced team. There is no company worldwide that can realize an IoT project from A to Z: manufacturing all components, writing all software, running its own clouds, providing its own mobile apps, marketing the solutions, deploying, maintaining and supporting them.<br />

Fig. 1. Internet of Things puzzle<br />

This is why you need 1) an open community providing 2) an ecosystem of ready elements for building an IoT solution quickly and effectively.<br />

With this challenge in mind, and with the proven wireless mesh technology IQRF [1] in hand, we started building the IQRF Alliance [2] a couple of years ago, so that you can find all the necessary IoT elements in one place and get your IoT pilot project up and running within a couple of days.<br />

II. ALLIANCE – BUILDING IOT COMMUNITY<br />

Although we are talking about the Internet of Things here, first of all you need to bring together people who will analyze customer needs, develop and manufacture appropriate devices, put together reasonable solutions and provide valuable services to end customers. We believe that the best way to do this is to build a community of cooperating commercial and non-profit entities sharing the same goals and values.<br />

IQRF Alliance is an open international community of IoT<br />

professionals (developers, manufacturers, cloud providers,<br />

telco operators, system integrators, research and innovation<br />

centers, technical high schools and universities) providing<br />

wireless solutions for IoT and M2M communication based on<br />

the IQRF platform.<br />

The IQRF Alliance focuses on three areas: community, interoperability and promotion.<br />

COMMUNITY<br />

In the community area we focus on real and effective cooperation among the members: system integrators share the needs arising from their market opportunities with manufacturers and SW and cloud providers, so that these partners develop what the end customer really needs. The IQRF Alliance also supports joint pilot projects, since we see them as the most effective way to build and sell IoT solutions with significant added value for the end customer. Two examples of joint IoT projects can be found in Section V of this document, and more at [3].<br />

The IQRF Alliance currently (October 2017) has around 80 members from 17 countries [4], and the number is steadily growing. The member portfolio is very broad, ranging from global corporations through successful SMEs to small start-ups.<br />

Fig. 2. Members of the IQRF Alliance, October 2017<br />

INTEROPERABILITY<br />

The IQRF platform, specifically the IQRF DPA framework [5], provides built-in wireless compatibility, so devices from different manufacturers can communicate in one wireless mesh network. The trouble was that each device was usually controlled with different commands and provided data in a slightly different structure (based on manufacturer preference). This made integration of devices from different manufacturers more complex and prevented the use of key IQRF functions such as Fast Response Commands [6].<br />

The IQRF Alliance members therefore agreed to standardize the most commonly used commands and sensor/meter quantities. In October 2017 the IQRF Alliance released the first version of the IQRF Interoperability Standard and published it on its website [7]. The standardization enables controlling devices without integrating special commands, and reading sensor/meter data without special parsing algorithms.<br />

Every certified device gets a unique HWPID (Hardware Profile ID), so a gateway or cloud can recognize what type of device is connected. Currently (October 2017) the IQRF Alliance is testing the IQRF Repository, which contains all relevant information about certified products so that a gateway or cloud can download it automatically. In the second stage, the Repository will include drivers for IQRF-certified devices, so a gateway can start controlling these devices automatically.<br />
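A gateway-side HWPID lookup can be sketched as follows. This is a hypothetical, simplified view of a repository-style lookup; the IDs and record fields are illustrative assumptions, not the actual IQRF Repository schema:

```python
# Hypothetical, simplified view of an IQRF-Repository-style lookup:
# the gateway reads the HWPID a node reports and uses it to fetch the
# matching product record (the IDs and fields below are illustrative).
REPOSITORY = {
    0x002A: {"product": "CO2/temperature/humidity sensor", "standard": "IQRF Sensor"},
    0x0FC1: {"product": "Relay actuator", "standard": "IQRF Binary Output"},
}

def identify(hwpid: int) -> str:
    record = REPOSITORY.get(hwpid)
    if record is None:
        return "unknown device (not certified or not in repository)"
    return f"{record['product']} ({record['standard']})"

print(identify(0x002A))  # CO2/temperature/humidity sensor (IQRF Sensor)
```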

Fig. 3. IQRF Ecosystem<br />

PROMOTION<br />

The third key area covered by the IQRF Alliance is promotion of products and solutions based on IQRF Technology. The IQRF Alliance uses different channels to communicate the benefits of the IQRF Ecosystem to IoT professionals, such as its website, social media, participation in conferences and exhibitions, organization of the IQRF Summit and local meet-ups, and much more.<br />

III. ECOSYSTEM – BUILDING IOT PORTFOLIO<br />

In order to build your IoT solution effectively, you need ready-made components so you don't have to waste time developing everything from A to Z. That would not only be a very time- and money-consuming process, but would also require considerable skills and know-how.<br />

With this in mind, the IQRF Alliance supports its members in preparing ready-made devices, software, clouds, services, mobile apps, etc., so that putting together an IoT solution really is a job of just a couple of days.<br />

Fig. 4. IQRF Ecosystem [8]<br />

In the following text we will describe the key attributes of all levels of an IoT solution. That said, as the Alliance is focused on wireless connectivity, we will not go into much detail on the cloud level in this document.<br />

A. Wireless connectivity<br />

One of the first challenges of any IoT solution is last-mile communication. The well-known and massively used wireless technologies such as GSM/LTE, WiFi or Bluetooth do not fit the needs of most IoT use cases well: low power, high numbers and high density of connected devices, low data rates, reliability, security, and so on.<br />

Thus, there is a boom of new technologies for IoT, especially in the area of Wireless Wide Area Networks (WWAN), such as LoRa or Sigfox. These technologies are designed mainly for collecting data from remote sensors. On the other hand, there are many IoT use cases that the features and parameters of WWAN technologies do not fit well either. These are typically real-time (local) control applications (lights, heating, air conditioning, motors) and deep-indoor applications (large buildings, underground, tunnels, industrial operations, etc.).<br />

For these types of applications, wireless mesh networking technologies are a much better fit.<br />



Fig. 5. Positioning of IQRF Technology<br />

IQRF<br />

IQRF [1] is a mature technology connecting devices to the IoT via wireless mesh networks. IQRF provides simple integration, standards-based security, interoperability of end devices, robust and reliable mesh networking, low-power operation and full bidirectional communication.<br />

TABLE I. BASIC IQRF PARAMETERS<br />

SW: OS + DPA + Appl. + SDK<br />
Band: 433 / 868 / 916 MHz<br />
Network topology: mesh<br />
Range (device-to-device): 500+ meters<br />
Range (device-to-gateway): tens of kilometers<br />
Native multi-hop: 240 hops per packet<br />
Routing algorithm: oriented flooding<br />
Security: multilayer, AES-128, dynamic keys<br />
Directionality: bidirectional<br />
End devices OTA management: for all operations needed<br />
Main benefit: easy adoption / reliability<br />
Low power: several years on a battery<br />

There is no technology fitting every use case. Table II lists the typical parameters of projects that IQRF fits best:<br />

TABLE II. IQRF BEST-FIT TYPE OF PROJECTS<br />

Data acquisition: sensor / operation data – tens of bytes<br />
Control: actuators (ON/OFF, dimming, rotation, ...)<br />
Gateway: local control and data processing (fog/edge computing)<br />
Number of nodes per GW: tens / hundreds<br />
Ready infrastructure and signal coverage: not needed<br />
Cost of wireless operation: free of charge<br />
Density of nodes: < 200 m from each other to ensure robust (redundant) mesh networking<br />
Environment: outdoor / indoor / deep indoor / RF harsh<br />
Power: ultra-low-power – 5+ years on battery (a)<br />
OTA upgrades: yes, all levels (OS, plug-ins, custom app.)<br />
Robustness and reliability: very high due to mesh networking<br />
Cloud: any cloud, standard protocols (MQTT, https)<br />
(a) Depends on use case, type of battery, etc.<br />

As a consequence of these typical project parameters, Table III lists the typical use cases of the IQRF technology:<br />

TABLE III. IQRF USE CASES<br />

Smart City: street lighting, street parking, traffic monitoring and control, environment sensors, waste management, ...<br />
Smart Building: indoor / emergency / design lighting, HVAC control, environment monitoring, metering, operation monitoring, ...<br />
Industry 4.0: machine and tool monitoring, employee and forklift tracking, infrastructure monitoring, ...<br />

B. End devices<br />

In order to be flexible when putting together your IoT project, you need a wide range of interoperable sensors and actuators.<br />

Interoperable means that the devices not only communicate in one network, but also that the actuators are controlled with the same commands and the sensors provide data in the same structure. Interoperability thus significantly simplifies the integration of devices from multiple manufacturers in one network.<br />

An overview of available IQRF end devices can be found at [8].<br />

C. Gateways<br />

In the IQRF Ecosystem, gateways are the key component of the whole design. Gateways do not only provide a link from the IQRF network to the Internet; they are the control unit of the complete IQRF network. This means that they collect data from sensors, analyze them, and control actuators in the network based on the results. Naturally, they also report data up to a connected cloud and receive commands from the cloud or users.<br />



This “fog/edge computing” approach enables far greater flexibility and reliability than standard cloud-controlled installations and is the future of real-time IoT.<br />

When talking about IQRF gateways, we mean not only hardware but also the included software and remote management.<br />

HARDWARE<br />

Regarding gateway hardware, the goal is to be as independent as possible of any specific hardware and to let the integrator choose it according to his priorities. Nowadays virtually any Linux computer can operate as an IQRF gateway. In general, two things are needed:<br />

an IQRF transceiver connected to the gateway through an SPI or USB connector/protocol<br />

the IQRF Daemon – universal software which can control the IQRF network and communicate with a cloud or a mobile app<br />
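The daemon's link to a cloud or app typically carries small JSON messages over MQTT. The sketch below shows what such a request might look like; the field names are illustrative assumptions for the sketch, not the exact iqrf-daemon JSON API:

```python
import json

# Sketch of a request that a cloud service or mobile app could publish
# over MQTT for the gateway daemon to execute. The field names below are
# illustrative assumptions, not the exact iqrf-daemon JSON API.
def make_request(node_addr: int, command: str, request_id: int) -> str:
    msg = {
        "msgid": request_id,  # lets the app match the asynchronous response
        "addr": node_addr,    # network address of the target IQRF node
        "cmd": command,       # e.g. "READ_SENSORS" or "SET_OUTPUT"
    }
    return json.dumps(msg)

payload = make_request(node_addr=3, command="READ_SENSORS", request_id=1)
print(payload)
# An MQTT client would publish `payload` to the gateway's request topic
# and subscribe to a response topic carrying the matching "msgid".
```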

IQRF DAEMON<br />

Fig. 6. IQRF Daemon [9]<br />

The IQRF Daemon is the second key building block of an IQRF gateway. It provides all the necessary services for controlling an IQRF network: gateway configuration, remote access through the UDP protocol, a link to local user applications through MQ messaging, and communication with a remote cloud or app through MQTT messaging.<br />

REMOTE GATEWAY MANAGEMENT<br />

A very important, or rather must-have, service is remote management of the gateways. If you need to perform an upgrade or change a configuration, you must be able to do it not only remotely but also for dozens or even hundreds of gateways simultaneously. You can check the IQRF-ready remote gateway management systems from RehiveTech at [10].<br />

D. Remote visualization and control<br />

Another important layer of any IoT solution is data storage, analysis, visualization and the user control interface. Currently these tasks are usually covered by cloud solutions, mobile applications and integration platforms.<br />

The IQRF Ecosystem is fully open to any cloud solution that communicates over standard protocols such as MQTT or HTTPS. Thanks to this, you as a system integrator or customer have full flexibility to use any cloud or platform.<br />

The IQRF Alliance cooperates with providers and integrators of the key cloud services, such as Microsoft Azure or IBM Bluemix, as well as with smaller cloud service providers such as Inteliments, CIS or CTI software.<br />

Part of the IQRF Ecosystem is also a universal mobile app by Master Internet that enables you to build and control an IQRF network directly from your cell phone.<br />

IV. HOW TO BUILD YOUR IOT PILOT PROJECT<br />

In the previous paragraphs we described what you need as a baseline to start your IoT pilot project and realize it effectively. In this chapter we focus on a step-by-step guide to building a simple IoT solution and extending it into a real IoT installation.<br />

A. Start with the IoT Starter Kit<br />

Members of the IQRF Alliance joined forces and put together a starter kit [11] containing all you should need to start your IoT project.<br />

Fig. 7. IoT Starter Kit by IQRF Alliance members<br />

There are two IQRF wireless kits – a sensor kit providing temperature, illumination and potentiometer inputs, and a relay kit. These kits are enough for you to learn how to collect data from sensors and how to control actuators. You can get the kits up and running, connected in a wireless mesh network, using the IQRF IDE by following the online video tutorials [12].<br />

The UP board is a computer for makers and professionals, bridging the gap between hobby and industrial computers [13]. It usually serves as a gateway controlling the IQRF wireless mesh network and connecting it to the Internet through Ethernet, WiFi, GSM or LTE.<br />

STEP-BY-STEP GUIDE<br />

To make the UP board work as an IQRF gateway, you need to do the following steps:<br />

1. Install and configure Linux<br />



2. Install and configure IQRF Daemon that will handle<br />

the control of your IQRF network<br />

3. Install Node-RED for basic control of your network.<br />

4. Install MQTT broker so you can get connected to one<br />

of the supported cloud services such as Microsoft<br />

Azure, IBM Bluemix, etc.<br />

Everything you need to realize these steps is available on the IoT Starter Kit GitHub [14].<br />

B. Add more end-devices<br />

There is a growing portfolio of IQRF interoperable devices – both sensors and actuators. You can see the complete portfolio of IQRF-related products, solutions and services on the IQRF Marketplace [8] and purchase end-device samples at the IQRF Alliance e-shop [15]. You can select the devices you need, purchase them in one single e-shop and bond them to your wireless network.<br />

C. Test different Software from Github<br />

Just as you can extend your solution with different end devices, you can also test different software for your gateway. Go to the IQRF GitHub extensions [16], where you can download software and/or demo access to different services free of charge.<br />

D. Test different clouds and mobile apps<br />

There are a number of cloud and mobile app providers and integrators in the IQRF Alliance providing access to Microsoft Azure, IBM Bluemix, Inteliglue, Master App, etc. Based on the documentation available on the IQRF GitHub, you can test different products and find the one that best fits the goals of your IoT project [16].<br />

E. Work with a system integrator<br />

Potential cooperation with a system integrator depends on the scale of your project, the experience of your team and the timeframe you have for realizing it. You can do everything yourself, or you can cooperate with system integrators or consultants who can help you get your project up and running much faster and more effectively.<br />

V. CASE STUDIES<br />

This paper would be mere theory without mentioning real case studies in which the approach described above was taken. More case studies can be found at [3].<br />

A. Air quality monitoring in a Prague school<br />

IDEA<br />

Based on the assumption that the air in schools is bad, and that students therefore have concentration problems, Protronix and its partners (O2 IT Services, IQRF Alliance, MICRORISC, Camea, ...) decided to carry out a four-month measurement campaign. The CO₂, temperature and relative humidity values were monitored. The data were continuously analyzed, followed by recommendations for ventilation and other corrective actions.<br />

SOLUTION<br />

This solution consists of:<br />

10 combined sensors of CO₂, temperature and relative humidity<br />

an IQRF wireless mesh network for data transfer<br />

a UP-board-based gateway enabling data transfer from the IQRF network to the TCP/IP network<br />

O2 data storage and a web application with visualization of the measured data<br />

RESULTS<br />

Fig. 8. CO2 concentration graph of a monitored classroom<br />

As a result, it was found that minimum recommended<br />

values of relative air humidity had not been reached for most of<br />

the school time and maximum allowed CO₂ values had been<br />

exceeded for almost half of the time. These variables and their<br />

values are directly linked to the concentration and health of<br />

students.<br />

CONCLUSION<br />

Thanks to the use of ready-made end devices, a gateway and a remote management system, this project was very easy and cost-effective to realize.<br />

B. Water metering<br />

IDEA<br />

CETIN, the key Czech telecommunication infrastructure provider, wanted to evaluate IQRF technology and compare it with W-MBUS for reading data from water meters. The goal was to do so cost- and time-effectively.<br />

SOLUTION<br />

CETIN involved five members of the IQRF Alliance in this project:<br />

Mainstream technologies providing integration services,<br />

data analysis in MS Azure and visualization in<br />

PowerBI;<br />

AAEON providing UP board as a gateway;<br />

IQRF Tech with IQRF Daemon and customization<br />

services;<br />

RehiveTech with their remote management system of<br />

gateways<br />

Bitspecta providing W-MBUS / IQRF protocol bridge<br />

RESULTS<br />

As a result, using the services and products of other IQRF Alliance members, CETIN was able to evaluate the benefits of mesh networking in the area of water metering within a very limited budget and time frame.<br />

There are many other running projects where the cooperation of members and the use of ready IQRF ecosystem elements are the key to successful and effective pilots.<br />



VI. FURTHER DEVELOPMENT<br />

As you might expect, we are definitely not stopping where we are. There is plenty of work ahead to make your life, as a user of the IQRF Ecosystem, even easier.<br />

A. Community<br />

As mentioned, the community is the base of the whole ecosystem. We will involve more partners with more skills in the IQRF community, so that the overall flexibility of the Alliance steadily grows.<br />

B. Ecosystem<br />

From the IQRF Ecosystem perspective, we see the weakest point in the limited portfolio of ready end devices and gateways. This is a challenge and an opportunity for manufacturers to develop and produce the sensors and actuators the market needs.<br />

C. Standard<br />

The IQRF Alliance will not only extend the current standard but will also develop an online repository of IQRF-certified devices, so building up a wireless network will literally be “plug-and-play”.<br />


VII. CONCLUSION<br />

The Internet of Things is a very complex ecosystem, and enabling pilot projects to be done quickly and effectively is not a simple task. The IQRF Alliance has taken a number of steps to prepare a ready and open ecosystem that includes not only end devices and gateways but also gateway software, clouds, mobile apps, services and development tools.<br />

These days (October 2017), using the IoT Starter Kit you can build your device-to-screen IoT solution in a matter of hours, and within a couple of days extend it, using different end devices and software, into a pilot-project solution. There is a limited number of end devices you can use for your project at the moment, but the portfolio is growing quickly, following market needs.<br />

You can join us in our effort to make the IoT really useful and cost-effective for the end user. You are always welcome to come on board the IQRF Alliance.<br />

VIII. REFERENCES<br />

[1] IQRF Technology, www.iqrf.org<br />

[2] IQRF Alliance, www.iqrfalliance.org<br />

[3] IQRF Alliance, Case studies http://iqrfalliance.org/case-studies/<br />

[October, 2017]<br />

[4] IQRF Alliance members http://iqrfalliance.org/alliance<br />

[5] IQRF DPA Framework http://iqrf.org/technology/dpa<br />

[6] Fast Response Command, Youtube tutorial<br />

https://www.youtube.com/watch?v=kK48A9MMfQU<br />

[7] IQRF Interoperability Standard http://www.iqrfalliance.org/techDocs/<br />

[8] IQRF Ecosystem http://iqrfalliance.org/products<br />

[9] IQRF Daemon https://github.com/iqrfsdk/iqrf-daemon<br />

[10] RehiveTech Management System<br />

http://iqrfalliance.org/case-studies/remote-iqrf-network-managementsystem-by-rehivetech<br />

[11] IoT Starter Kit, http://iqrfalliance.org/product/iot-starter-kit<br />

[12] IQRF Tutorials http://iqrf.org/support/video-tutorial-set<br />

[13] Up Board http://www.up-board.org/up<br />

[14] IoT Starter Kit GitHub https://github.com/iqrfsdk/iot-starter-kit<br />

[15] IQRF Alliance eshop https://iqrf.shop/<br />
[16] IoT Starter Kit SW and cloud extensions https://github.com/iqrfsdk/iot-starter-kit/tree/master/extensions<br />



Supporting multiple protocols (BLE/IEEE 802.15.4)<br />

concurrently in a single chip<br />

Steve Urbanski<br />

Secure Connected MCU<br />

NXP Semiconductors<br />

Chicago, IL<br />

Steve.Urbanski@nxp.com<br />

Abstract—The Internet of Things (IoT) landscape is expanding, driving the need for devices to support multiple<br />

protocols in a single chip. Bluetooth Low Energy (BLE) is readily<br />

available in many personal devices today, which makes it a good technology to use for command and control of IoT networks. IEEE 802.15.4 is a mature technology that enables low-power mesh networks such as Zigbee and Thread, which makes it a good technology to use for an IoT network.<br />

This paper addresses ways to enable multi-protocols running<br />

BLE and IEEE 802.15.4 concurrently on a single chip. It reviews<br />

techniques that need to be considered in both hardware and<br />

software and the limitations encountered when supporting<br />

networks running in both technologies. This paper will also<br />

review practical use cases that require the use of these<br />

techniques and the need for supporting multiple protocols<br />

concurrently in a single chip.<br />

Keywords—Multiprotocol; Bluetooth Low Energy; BLE; IEEE<br />

802.15.4; IoT; Thread; Zigbee<br />

I. INTRODUCTION<br />

One of the challenges of supporting multiple protocols<br />

concurrently in a single chip is that the radio resource needs to<br />

be shared. Therefore the utilization of the radio for each<br />

technology needs to be carefully considered, more specifically,<br />

how much time each technology needs the radio resource for<br />

normal operation.<br />

This paper will review the communication fundamentals for<br />

normal BLE and 802.15.4 operation, focusing on the amount of<br />

time the radio resource is needed to complete fundamental<br />

tasks. It will then discuss concurrent operation of BLE and<br />

802.15.4 and techniques to use for supporting concurrent<br />

operation of these technologies in a single chip.<br />

Lastly, this paper will analyze a practical use case using an<br />

NXP KW41Z device running BLE and 802.15.4 concurrently<br />

while utilizing some of the techniques described in this paper.<br />

It will review the packet error rates (PER) of different<br />

experiments and discuss what network parameters make a<br />

difference and how to configure an error free network.<br />

II. BLE COMMUNICATION FUNDAMENTALS [1]<br />

The BLE Link Layer has five defined states. Four of them<br />

are non-connected states – Standby, Advertising, Scanning and<br />

Initiating – and one is defined as the Connection State. This<br />

paper will focus on operations in the Connection State and the<br />

timing fundamentals associated with this state as it relates to<br />

sharing the radio resource on a single chip.<br />

A. Connection State<br />

The Connection State is entered when an Advertiser and an<br />

Initiator successfully exchange connection Protocol Data Units<br />

(PDUs). When the two devices enter the Connection State, one<br />

takes on the Master Role, the other takes on the Slave Role.<br />

The master is in control of the timing of the Connection Event.<br />

Each Connection Event starts with the master sending a Data<br />

PDU to the slave. The slave may respond depending on the<br />

timing. This paper will review three timing parameters of the<br />

connection state – the Connection Interval (CI), the Slave<br />

Latency (SL) and the Supervision Timeout (STO).<br />

The Connection Interval is the time between connection<br />

events. Its value is a multiple of 1.25 ms in the range of 7.5 ms<br />

to 4.0 s. In many systems, a value around 50 ms is commonly<br />

observed.<br />

The Slave Latency is the number of consecutive connection<br />

events a slave can ignore before it needs to respond to the<br />

master. This helps a slave device save power by allowing it to<br />

sleep for longer periods of time. Its value is an integer in the<br />

range of 0 to ((STO / (CI * 2)) – 1) but can be no larger than<br />

500.<br />

If the connection gets lost for any reason, the Supervision<br />

Timeout is used as a fallback mechanism to prevent a device<br />

from getting stuck in the connection state. If no communication<br />

is received within the supervision timeout period, the device<br />

will exit the connection state and transition to the standby state.<br />

Its value is a multiple of 10 ms in the range of 100 ms to 32.0 s<br />

and needs to be larger than (1 + SL) * 2 * CI.<br />
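The interplay of these three parameters can be captured in a short validity check. The sketch below (illustrative Python, not any vendor API) encodes exactly the constraints just described:

```python
# Validate BLE connection parameters against the constraints above:
# CI is a multiple of 1.25 ms in [7.5 ms, 4.0 s]; STO is a multiple of
# 10 ms in [100 ms, 32 s] and must exceed (1 + SL) * 2 * CI; SL is at
# most 500 and at most STO / (CI * 2) - 1. Illustrative helper only.

def validate_ble_connection(ci_ms, sl, sto_ms):
    if not (7.5 <= ci_ms <= 4000) or (ci_ms * 100) % 125 != 0:
        return False  # CI must be a multiple of 1.25 ms in range
    if not (100 <= sto_ms <= 32000) or sto_ms % 10 != 0:
        return False  # STO must be a multiple of 10 ms in range
    if sto_ms <= (1 + sl) * 2 * ci_ms:
        return False  # STO must be larger than (1 + SL) * 2 * CI
    if sl < 0 or sl > min(500, sto_ms / (ci_ms * 2) - 1):
        return False  # SL capped at 500 and by the supervision timeout
    return True
```

For instance, the commonly observed CI of 50 ms with SL = 2 is valid with a 4 s supervision timeout, while a 7.5 ms CI with a large slave latency quickly violates the STO bound.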

Fig. 1 shows an example of the connection state with a<br />

slave latency of 2.<br />

www.embedded-world.eu<br />



Figure 1: An example of the connection state with a slave latency of 2 (Supervision Timeout = 6 × Connection Interval).<br />

III. IEEE 802.15.4 COMMUNICATION FUNDAMENTALS<br />

The IEEE 802.15.4 communication protocol [2] is used as<br />

the lower Medium Access Control (MAC) and Physical (PHY)<br />

software layers of Zigbee [3] and Thread networks [4]. They<br />

both use a contention based access mode to access the shared<br />

channel which utilizes a Carrier Sense Multiple Access with<br />

Collision Avoidance (CSMA-CA) backoff algorithm.<br />

There are four frame types defined:<br />

- Beacon: Used for synchronization and broadcast of data<br />

- Data: Used for data transmission. Maximum payload size is 127 octets<br />

- MAC command: Used to carry MAC management commands<br />

- Acknowledgement: Used to acknowledge data and command frames. Frame size is 8 octets (256 us)<br />

This paper will focus on the Data Frame type since it is the<br />

largest frame type and predominately used in an active<br />

network. The maximum data frame size consists of 127 bytes<br />

for the MAC PDU and 6 bytes for the PHY header. This equals<br />

a max frame size of 133 bytes (4.256 ms).<br />
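These durations follow directly from the 2.4 GHz 802.15.4 PHY rate of 250 kbit/s, at which one octet takes 32 us on air; a quick check:

```python
# Frame durations at the 802.15.4 2.4 GHz PHY rate of 250 kbit/s:
# one octet = 8 bits / 250 kbit/s = 32 us on air.

US_PER_OCTET = 32

def frame_duration_us(octets):
    return octets * US_PER_OCTET

assert frame_duration_us(133) == 4256  # max data frame -> 4.256 ms
assert frame_duration_us(8) == 256     # acknowledgement frame as sized above
```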

The transmission of Data and MAC Command frames<br />

utilize the unslotted CSMA-CA backoff algorithm whereas the<br />

Beacon and Acknowledgement frames use no checking<br />

mechanism for transmission.<br />

The unslotted CSMA-CA algorithm maintains two<br />

variables for each transmission attempt<br />

NB: The number of times the CSMA-CA algorithm<br />

was required to backoff while attempting the current<br />

transmission. This value shall be initialized to zero<br />

before each new transmission attempt, where:<br />

o NB = NB + 1 for every unsuccessful transmission<br />

attempt<br />

o If NB > macMaxCSMABackoffs, the transmission is<br />

considered a failure, where<br />

• macMaxCSMABackoffs is an integer value<br />

between 0 and 5<br />

BE: The Backoff Exponent defines how many backoff<br />

periods a device shall wait before attempting to assess a<br />

channel. BE is initialized to the value of macMinBE,<br />

where<br />

o BE = min(BE + 1, macMaxBE) for every<br />

unsuccessful transmission attempt<br />

o macMinBE is an integer value between 0 and<br />

macMaxBE<br />

o macMaxBE is an integer value between 3 and 8<br />

o one backoff period is equal to aUnitBackoffPeriod<br />

symbols which is defined to be 20. This translates to<br />

320 us.<br />

Note that if macMinBE is set to zero, collision<br />

avoidance will be disabled during the first iteration of<br />

this algorithm.<br />

For each transmission attempt, the transmitter waits for a<br />

random number of backoff units between 0 and (2^BE – 1). After<br />

the delay, a Clear Channel Assessment (CCA) is performed. If<br />

the channel is clear, the transmitter proceeds. If not, the NB<br />

and BE variables are updated as follows<br />

BE = min(BE + 1, macMaxBE)<br />

NB = NB + 1<br />

If NB > macMaxCSMABackoffs, the transmission is<br />

considered a failure, otherwise the transmitter waits<br />

again using a new backoff delay value as described<br />

above.<br />
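The backoff loop above can be sketched as follows. `cca_clear` is a stand-in callable for the radio's Clear Channel Assessment, and the parameter defaults mirror the NXP example values; this is an illustrative sketch, not the NXP implementation:

```python
import random

# Sketch of the unslotted CSMA-CA backoff loop described above.
# Returns (success, number_of_backoffs_used).

UNIT_BACKOFF_US = 320  # aUnitBackoffPeriod = 20 symbols at 16 us/symbol

def csma_ca_transmit(cca_clear, mac_min_be=3, mac_max_be=5,
                     mac_max_csma_backoffs=4):
    nb, be = 0, mac_min_be
    while True:
        # wait a random number of backoff units in [0, 2^BE - 1];
        # in a real MAC the radio would idle for this long
        delay_us = random.randint(0, 2 ** be - 1) * UNIT_BACKOFF_US
        if cca_clear():
            return True, nb        # channel clear: proceed to transmit
        nb += 1
        be = min(be + 1, mac_max_be)
        if nb > mac_max_csma_backoffs:
            return False, nb       # channel access failure
```

With the defaults, a permanently busy channel fails after five CCA attempts, matching the maximum-backoff sequence shown in Fig. 2.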

Fig. 2 shows an example of the NXP stack transmitting a<br />

data frame preceded by a maximum number of backoffs<br />

allowed by the system. Here macMinBE = 3, macMaxBE = 5<br />

and macMaxCSMABackoffs = 4.<br />

Figure 2: NXP stack transmitting a data frame preceded by a maximum number of backoffs (each CCA takes 128 us; the backoff wait windows grow from 0–2.2 ms at BE = 3 to 0–9.92 ms at BE = 5, followed by the 4.256 ms data frame).<br />



Any data or MAC command frame can be sent with an<br />

acknowledgement request. This requires the recipient to send<br />

an acknowledgement frame back to the sender when the<br />

message has been properly received. Without this feedback<br />

mechanism, the transmission of the frame is just assumed to be<br />

successful.<br />

The acknowledgement frame needs to be sent within<br />

aTurnaroundTime = 12 symbols (192 us) after the reception of<br />

the last symbol of the data or MAC command frame. However,<br />

the originator will wait up to macAckWaitDuration = 54<br />

symbols (864 us) for the acknowledge frame to be received<br />

before the transmission attempt is considered failed.<br />

If the transmission attempt is considered failed, the<br />

originator will repeat the process of transmitting the frame and<br />

waiting for the acknowledgement up to a maximum of<br />

macMaxFrameRetries times, where<br />

macMaxFrameRetries = integer value between 0 and 7<br />

If the maximum number of frame retries has been reached, a<br />

transmission failure notice is sent to the next higher layer of the<br />

software stack.<br />
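The retry behaviour condenses into a few lines; `send_and_wait_ack` is a hypothetical stand-in for transmitting the frame and waiting up to macAckWaitDuration for the acknowledgement:

```python
# Sketch of the MAC-level retry loop described above: the frame is
# retransmitted until an acknowledgement arrives or macMaxFrameRetries
# is exhausted, after which the failure is reported upward.

def transmit_with_ack(send_and_wait_ack, mac_max_frame_retries=3):
    for attempt in range(1 + mac_max_frame_retries):
        if send_and_wait_ack():
            return True, attempt           # acknowledged on this attempt
    return False, mac_max_frame_retries    # notify the next higher layer
```

With the default of 3 retries, a frame is sent at most four times, as in the worst-case sequence of Fig. 3.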

Fig. 3 shows an example of the NXP stack transmitting a<br />

data frame with acknowledgement preceded by a maximum<br />

number of acknowledgement failures. Here<br />

macMaxFrameRetries = 3 and it is assumed that all CCAs pass<br />

with a zero delay wait time prior to transmitting.<br />

Figure 3: NXP stack transmitting a data frame with acknowledgement preceded by a maximum number of acknowledgement failures (each 4.256 ms data frame is followed by an 864 us ACK wait; retries 0 through 3 with macMaxFrameRetries = 3).<br />

IV. CONCURRENT OPERATION OF BLE AND 802.15.4<br />

When using a single chip for concurrent operation of BLE<br />

and 802.15.4, the best use case is when the BLE side is<br />

configured as a slave and the 802.15.4 side is configured as an<br />

end device (a device without children that communicates<br />

only with a single parent). This configuration is the least timing<br />

restrictive for both technologies. On the BLE side, the slave<br />

can take advantage of the Supervision Timeout period (see Fig.<br />

1). On the 802.15.4 side, the end device is in control of when<br />

data is transferred. See Fig. 4.<br />

Figure 4: How data is transmitted in a nonbeacon-enabled 802.15.4 PAN. A child transmits data to its parent directly, acknowledged if requested; a parent delivers data only after the child polls it with a Data Request.<br />

As Fig. 4 illustrates, every 802.15.4 data transfer is initiated<br />

by the child device and can be scheduled when it’s convenient.<br />

This means that the end device can schedule all of the 802.15.4<br />

communication around the BLE communication, which has a<br />

stricter timing regimen. Recall that in the Connection State, the<br />

BLE master device must transmit at every connection interval<br />

and the slave must respond within the supervision timeout<br />

period. See Fig. 1.<br />

Since not all situations can take advantage of this best case<br />

configuration, another prominent use case to consider is when<br />

the 802.15.4 device is configured as a coordinator. Here the<br />

control of when the 802.15.4 side communicates is forfeited<br />

and the controller must respond to the end device(s) attached to<br />

it, all while maintaining the BLE connection.<br />

For this concurrent operation, one strategy that can be used<br />

is to give the BLE operation higher priority over the 802.15.4<br />

operation – essentially letting BLE run as needed while filling<br />

the time gaps with 802.15.4 operations. Like in the previous<br />

example, this strategy is effective because the 802.15.4<br />

protocol is less restrictive in its timing requirements.<br />

This strategy has been used in the NXP KW41Z hybrid<br />

device. This device supports BLE and 802.15.4 protocols<br />

concurrently in a single chip. The software has a Mobile<br />

Wireless System (MWS) Coexistence block that arbitrates the<br />

use of the radio hardware resource. It is essentially a set of<br />

APIs that allow higher layers of the software to request access<br />

to the radio resource. The MWS natively gives priority to BLE<br />

allowing it to abort ongoing 802.15.4 transactions even if they<br />

have already been started. If this happens, the 802.15.4<br />

transaction will be restarted once the BLE transaction has been<br />

completed.<br />

While this strategy is bulletproof for any BLE<br />

communication and any 802.15.4 transmission, it is vulnerable<br />

to 802.15.4 receptions. As described above, in this<br />

configuration, the device is not in control of the 802.15.4<br />

transactions and must listen for end node (child) devices. If a<br />

child tries to communicate while the parent is in the middle of<br />

a BLE transaction, the 802.15.4 packet could be lost.<br />

To measure the effects of this, a test was conducted using<br />

the NXP KW41Z hybrid device [5]. In this experiment, there<br />

were 3 devices – the KW41Z hybrid device, a Smartphone and<br />

another KW41Z device configured as an 802.15.4 end device.<br />

See Fig. 5.<br />



The results of this experiment demonstrate that the 802.15.4<br />

PER falls to around 1% as long as acknowledgement is used<br />

and the CI is kept around 50 ms or higher. The other variables<br />

do not play as big of a role in the outcome.<br />

To further reduce this 1% PER, there are a number of<br />

techniques that can be used to minimize any packet loss. One<br />

technique is to relax the 802.15.4 parameters:<br />

macMaxFrameRetries – This is the number of retries if<br />

no ACK is received. The default value is 3 but can go as<br />

high as 7. Having more retries improves the chance of<br />

getting the 802.15.4 data packet through.<br />

CCA Backoff time – This is the amount of time the<br />

transmitter waits before it performs a CCA. This<br />

essentially spreads out the transmission events which<br />

gives it more time to clear a BLE event if it’s in progress.<br />

Other techniques can be used to eliminate or reduce the<br />

PER. This involves adding retry mechanisms at higher layers<br />

of the software, such as the network and/or application layers.<br />

This is a common technique used in Zigbee and Thread<br />

systems.<br />
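A rough model shows why stacking retry mechanisms is so effective: if attempts fail independently with probability p, each additional retry multiplies the residual PER by p. The numbers below are illustrative, not measurements from this paper:

```python
# Residual PER with r MAC retries, assuming independent attempts with
# per-attempt loss probability p: the frame is lost only if the original
# transmission and every retry all fail, i.e. p ** (1 + r).

def residual_per(p_single, retries):
    return p_single ** (1 + retries)

# e.g. a 20% single-attempt loss with the default 3 retries leaves a
# residual PER of 0.2 ** 4 = 0.16%; an extra application-layer retry
# on top squares that again.
```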

Figure 5: Packet Error Rate (PER) Experiment Setup<br />

For this experiment, an 802.15.4 Packet Error Rate (PER)<br />

test was performed to determine the impact to this protocol<br />

when BLE is running in the same device. The Smartphone was<br />

configured as the BLE Master and established a connection<br />

with the hybrid device using a selected connection interval.<br />

Then the Hybrid device created an 802.15.4 network. The end<br />

device connected to the Hybrid device on the 802.15.4<br />

network. With both networks running, the end device sent 802.15.4<br />

data packets to the hybrid device with a selected time interval<br />

between packets. The Hybrid device measured the PER.<br />

This experiment was run 36 different ways, varying the<br />

following parameters<br />

- 802.15.4 Acknowledgement Enabled (Yes, No)<br />

- 802.15.4 Payload Size (0, 100 bytes)<br />

- BLE Connection Interval (7.5, 50, 360 ms)<br />

- 802.15.4 Message Interval Rate -- MIMS (10, 50, 100 ms)<br />
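The 36 runs are simply the Cartesian product of these four parameter sets (2 × 2 × 3 × 3 = 36):

```python
from itertools import product

# Enumerate the experiment configurations as the Cartesian product of
# the four parameter sets listed above.

ack_enabled = (True, False)
payload_bytes = (0, 100)
conn_interval_ms = (7.5, 50, 360)
message_interval_ms = (10, 50, 100)

configs = list(product(ack_enabled, payload_bytes,
                       conn_interval_ms, message_interval_ms))
assert len(configs) == 36
```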

The results are shown in Fig. 6. Observe the clear<br />

difference between the Message Acknowledge Enabled versus<br />

Disabled results. This is because in the enabled case, the<br />

message is retransmitted if the acknowledgement is not<br />

received, which significantly lowers the PER.<br />

The next biggest noticeable difference is in the BLE<br />

Connection Interval (CI). As Fig. 6 shows, when the CI is at<br />

the lowest allowed value of 7.5 ms, the PER is dramatically<br />

higher than when a more typical rate such as 50 ms is used.<br />

V. CONCLUSION<br />

This paper reviewed the communication fundamentals of<br />

BLE and IEEE 802.15.4, focusing on the amount of time the<br />

radio resource is needed to complete fundamental tasks. It then<br />

reviewed techniques to support multiple protocols concurrently<br />

in a single chip. Lastly, it showed a practical use case applying<br />

these strategies and the effects it had on packet error rate of the<br />

802.15.4 network.<br />

ACKNOWLEDGEMENT<br />

The author would like to thank the team members of the<br />

NXP Microcontroller Systems Engineering team for their help<br />

in running the experiments and providing their valuable<br />

feedback.<br />

REFERENCES<br />

[1] Bluetooth SIG, Inc, “Bluetooth Core Specification,” v5.0, December<br />

2016,<br />

https://www.bluetooth.com/specifications/bluetooth-core-specification<br />

[2] IEEE Standards Association, “Wireless Medium Access Control (MAC)<br />

and Physical Layer (PHY) Specifications for Low-Rate Wireless<br />

Personal Area Networks (WPANs),” IEEE802.15.4-2006,<br />

http://standards.ieee.org/findstds/standard/802.15.4-2006.html<br />

[3] Zigbee Alliance, zigbee Specification, Revision 22 1.0, zigbee<br />

Document 05-3474-22, April 19, 2017, http://www.zigbee.org/zigbee-for-developers/network-specifications/zigbeepro/<br />

[4] Thread Group, Inc, “Thread 1.1.1 Specification,”<br />

https://www.threadgroup.org/ThreadSpec<br />

[5] S. Lopez and J.C. Pacheco, “Thread + Bluetooth Low Energy<br />

Coexistence,” unpublished<br />



Figure 6: Experiment Results. Packet error rate versus 802.15.4 message interval (MIMS) for BLE connection intervals of 7.5, 50 and 360 ms and payload sizes of 0 and 100 bytes, with message acknowledge enabled and disabled.<br />



Automatic Tracking of Li-Fi Links for Wireless<br />

Industrial Ethernet<br />

René Kirrbach<br />

Fraunhofer Institute for Photonic Microsystems IPMS<br />

Dresden, Germany<br />

Rene.kirrbach@ipms.fraunhofer.de<br />

Michael Faulwaßer, Tobias Schneider, Robert<br />

Ostermann, Dr. Alexander Noack<br />

Fraunhofer Institute for Photonic Microsystems IPMS<br />

Dresden, Germany<br />

{michel.faulwaßer, robert.ostermann, tobias.schneider,<br />

alexander.noack}@ipms.fraunhofer.de<br />

Abstract— The ongoing digitalization of our environment<br />

leads to continuously increasing data traffic. Especially in<br />

industrial environments, automation is an omnipresent<br />

trend. Autonomous systems incorporate a rising amount of<br />

sensors as well as continuous machine-to-machine (M2M)<br />

communication.<br />

Wireless communications can simplify the data<br />

transmission and enable connectivity to dynamic parts like<br />

moving, vibrating or rotating components. Due to the open<br />

nature of the communication channel, engineers have to<br />

face a number of challenges, e.g. security issues,<br />

interferences and regulation of irradiated power.<br />

Radio frequency (RF) technologies are used in manifold<br />

applications, but in certain scenarios they are still<br />

cumbersome, because of signal interference and hard<br />

real-time requirements.<br />

The so-called Li-Fi technology is ideal for autonomous<br />

systems in Industry 4.0 since optical communications offer<br />

reliable and high data rate communication links with low-latency<br />

characteristics. However, the engineer typically<br />

has to face a trade-off between the link’s range, coverage<br />

and data rate. This contradiction can be overcome by<br />

forming a small, steerable spot.<br />

In this paper we present a compact Li-Fi tracking system<br />

based on a steerable optical wireless link, which enables<br />

real-time full-duplex bi-directional data communication<br />

with a data rate of 1.289 Gbit/s. This approach shows the<br />

feasibility and handling of an energy efficient wireless link,<br />

thanks to its 12-bit-precise beam alignment by using micro<br />

mirrors. We describe the optical setup and introduce a<br />

tracking algorithm which enables fully autonomous link<br />

establishment and thus simple installation. Data rate<br />

measurements underline the high performance of the<br />

wireless link whereas the system’s mobility is<br />

characterized by measurements of the settle time of the<br />

steered beam.<br />

Keywords— Li-Fi; optical wireless; infrared; real-time; light<br />

communication; 1 Gbps; IrDA; mobile; IoT; Industry 4.0; M2M<br />

I. INTRODUCTION<br />

The ongoing digitalization of our environment leads to<br />

continuously increasing data traffic. Especially in industrial<br />

environments, automation is an omnipresent trend.<br />

Autonomous systems incorporate a rising amount of sensors as<br />

well as continuous machine-to-machine (M2M)<br />

communication. The resulting enormous data volumes are<br />

often transmitted using wired interconnections.<br />

Wireless communications can simplify data transmission<br />

and enable the connectivity to dynamic parts like moving,<br />

vibrating or rotating components. Due to the open nature of the<br />

communication channel, engineers have to face a number of<br />

challenges, e.g. security issues, interferences and regulation of<br />

irradiated power.<br />

Radio frequency (RF) technologies are used in manifold<br />

applications, but in certain scenarios they are still cumbersome.<br />

For instance, wireless hard real-time operation requires a careful<br />

analysis of the channel and its environment to avoid<br />

interferences. Easier handling by using exactly defined<br />

communication spots is possible by utilizing light.<br />

The so-called Li-Fi technology is ideal for autonomous<br />

systems in Industry 4.0 since optical communications offer<br />

reliable and high data rate communication links with<br />

low-latency characteristics. However, there is a trade-off<br />

between communication range and coverage. The latter is<br />

determined by the system’s field-of-view (FOV) which<br />

describes the defined light spot where the link is established.<br />

Unfortunately, a large FOV reduces the received power and the<br />

related maximum communication range. This contradiction can<br />

be overcome by forming a small, steerable spot as shown in<br />

Fig. 1.<br />



magnification can be achieved with additional lenses at the<br />

transmitter's exit.<br />

Our Li-Fi tracking system is equipped with a<br />

1 Gbit/s Ethernet adapter, which enables easy integration.<br />

Fig. 1: Optical wireless tracking scenario. If the position of one or both<br />

transceivers change, then the beam is deflected accordingly and<br />

interruption-free data transmission is possible.<br />

Manifold principles of beam deflection and their practical<br />

feasibility have been shown including tiltable mirrors [1],<br />

Risley-Prisms [2], decentered lenses [2], Pockels- and<br />

Kerr-cells [3], acousto-optic gratings [4], liquid crystal based<br />

spatial light modulators [5] and many more. Brandl et al. [6]<br />

and Wang et al. [7] already demonstrated the feasibility of<br />

micro mirrors for beam steering of optical wireless links.<br />

Here we present a similar approach based on micro mirrors,<br />

but with different system design. Thanks to their small form<br />

factor and fast response, we can design compact and dynamic<br />

systems. In contrast to Brandl et al. [6], we introduce a tracking<br />

algorithm for fully autonomous link establishment without the<br />

need for a special photodetector. Unlike Wang et al. [7], our<br />

tracking approach does not require spot size<br />

adjustment and thus enables a simpler system design.<br />

Our paper is structured as follows: In section II, we<br />

describe our Li-Fi tracking system. We give detailed<br />

information about the used micro mirrors and our search<br />

algorithm. In section III, we present an analysis of the steerable<br />

beam as well as data rate and latency measurements that prove the<br />

practical feasibility of the concept.<br />

II. SYSTEM DESCRIPTION<br />

A. Overview<br />

Fig. 2 shows a 3D model of our tracking system. We<br />

combine a Fraunhofer IPMS standard transceiver with two<br />

1D micro mirrors from our institute. The term “1D” indicates<br />

that the mirror is able to statically keep its tilt around one axis.<br />

The communication beam is reflected at the mirrors. If a mirror<br />

is tilted by the angle θ, the beam is deflected by an angle 2θ,<br />

because of the law of reflection. In order to enable 2D beam<br />

steering, one mirror enables deflection along the horizontal axis<br />

and the other one along the vertical axis. Additional lenses are<br />

used for beam forming. These lenses influence the actual<br />

steering angle of the beam. Optimally, further angle<br />

Fig. 2: Fraunhofer IPMS tracking system. The PCB of the transceiver is<br />

highlighted green. The micro mirror PCBs are mounted vertically and colored<br />

yellow. The steered beam is highlighted blue. The receiver optic is not shown<br />

here.<br />

B. MEMS Mirror<br />

Each micro mirror has its own controller called<br />

Fraunhofer QsDrive. The controller has a USB interface and is<br />

connected to a host system, which can easily control the<br />

mirrors by simple commands given by its API. For analysis<br />

and loopback tests the algorithm is controlled via<br />

Matlab R2015a. In a second step the tracking algorithm will be<br />

moved to an arbitrary microcontroller by porting it to C code.<br />

The mirrors can be tilted by 5° in positive and negative<br />

direction. This could form a FOV with full angle of 20°.<br />

However, because of the influence of the additional lenses the<br />

actual FOV will be slightly smaller. The settle time t_s of the<br />

micro mirrors is specified as 5 ms. The 12-bit-precise<br />

addressing scheme provides us 4096 steps in beam deflection<br />

for horizontal and vertical direction separately. A window in<br />

front of the mirror's surface is used for hermetic encapsulation.<br />
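To put the 12-bit addressing in perspective, a back-of-the-envelope calculation (ignoring the lenses, which the authors note alter the actual steering angle):

```python
# Angular resolution of the 12-bit mirror addressing described above.
# A mechanical tilt of theta deflects the beam by 2 * theta (law of
# reflection), so 4096 steps over the +/-5 deg mechanical range give
# the optical step size below.

MECHANICAL_RANGE_DEG = 10.0   # -5 deg .. +5 deg
STEPS = 2 ** 12               # 12-bit addressing -> 4096 steps

mechanical_step_deg = MECHANICAL_RANGE_DEG / STEPS
optical_step_deg = 2 * mechanical_step_deg  # reflection doubles the tilt
# optical_step_deg ~= 0.00488 deg per addressing step
```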

For this project we use two 1D micro mirrors because of<br />

their availability. For the next revision we will use one tiltable<br />

2D micro mirror, which simplifies the optical concept and<br />

improves the optical performance of our tracking system.<br />

C. Tracking Algorithm<br />

A search algorithm is necessary for fully automated link<br />

establishment. Fig. 3 illustrates the principle of our algorithm.<br />

First, both transceivers scan their FOV coarsely and<br />

transmit their current mirror tilt angles in each point. When the<br />

beam reaches the receiver of the opposite transceiver by<br />

chance, this opposite transceiver detects the angle position of<br />

device one in that point. From now on transceiver two<br />

transmits both mirror positions, its own ones and the received<br />

ones from transceiver one. As soon as the second transceiver's<br />

beam hits transceiver one, the first transceiver knows the<br />

correct mirror tilts for its mirrors. With this information, the<br />

mirrors are configured and the received mirror position of<br />

transceiver two is transmitted back. Lastly, transceiver two<br />

receives these mirror tilts and configures its mirrors<br />

correspondingly.<br />

Fig. 4: Relative irradiance E_rel of the spot profile at a distance of 15 cm.<br />

Fig. 3: Coarse-scanning part of the search algorithm. The sequence is analogous if the<br />

beam of transceiver 2 first hits transceiver 1.<br />

At this point a data link is established. However, the mirror<br />

tilts may not be ideal, i.e. only the edge of the beam hits the<br />

opposite transceiver. Therefore, a fine-scanning algorithm with<br />

smaller step size is initiated in order to find the maximum. The<br />

fine-scanning starts as soon as the transceiver knows its right<br />

mirror positions. We suggest the gradient ascent or hill-climb<br />

algorithm. Both methods may not find the global maximum of<br />

the spot. However, if the spot is homogeneous enough, even a<br />

local maximum should be sufficient. Therefore, we investigate<br />

the shape of the spot in section III.A.<br />
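A minimal hill-climb sketch for this fine-scanning step might look as follows; `measure(h, v)` is a hypothetical stand-in for a received-signal-strength readout at a pair of 12-bit mirror positions, and the step size is an assumed value (the paper does not prescribe this implementation):

```python
# Hill-climb fine scan: starting from the coarse mirror position, step
# to whichever neighbour (in mirror addressing steps) yields a stronger
# received signal, until no neighbour improves on the current position.

def hill_climb(measure, h, v, step=8, max_iters=100):
    best = measure(h, v)
    for _ in range(max_iters):
        moved = False
        for dh, dv in ((step, 0), (-step, 0), (0, step), (0, -step)):
            val = measure(h + dh, v + dv)
            if val > best:
                best, h, v, moved = val, h + dh, v + dv, True
                break
        if not moved:
            break  # local maximum reached
    return h, v, best
```

Re-running the same loop from the found position (as the paper suggests) lets the link follow a slowly moving spot.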

As soon as the fine-scanning algorithm is finished, it is<br />

initiated again. Thereby, we can follow a moving spot. This<br />

enables communication within a dynamic scenario. Both<br />

devices can be moved within their corresponding FOVs<br />

without interrupting the communication link.<br />

The required time for link establishment depends on the<br />

transceivers' positions and thus on the number of scanned<br />

points until the coarse scanning algorithm finds the opposite<br />

device. The time can be approximated with N ∙ t_s, where N is<br />

the number of scanned points and t_s the settle time of the<br />

mirror. The settle time is measured in section III.C.<br />
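Under this approximation, the worst-case link-establishment time scales linearly with the number of coarse-scan points. The settle time below follows the worst-case measurement in section III.C; the 16 × 16 grid is an assumed illustration, not a value from the paper:

```python
# Worst-case coarse-scan link-establishment time: N points times the
# mirror settle time t_s (assumed 7.5 ms, the worst case measured in
# section III.C).

def link_establishment_time_ms(points_h, points_v, settle_time_ms=7.5):
    n = points_h * points_v  # points scanned in the worst case
    return n * settle_time_ms

# e.g. a 16 x 16 coarse grid: 256 * 7.5 ms = 1920 ms worst case
```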

III. EXPERIMENTAL RESULTS<br />

A. Beam profile<br />

Fig. 4 shows the spot in a distance of 15 cm. As expected, a<br />

circular spot is formed. It exhibits a donut profile, i.e. the<br />

minimum in the center of the spot is surrounded by a<br />

ring-shaped maximum of the irradiance. Moreover, a speckle<br />

pattern all over the spot and a weak ghosting effect at the left<br />

side can be observed.<br />

B. Field of View and Bit Error Rate<br />

Next, the functionality of the system is evaluated.<br />

Therefore, both transceivers are placed in front of each other.<br />

Transceiver 2 is moved within the plane perpendicular to the<br />

optical axis along the horizontal x-axis (red) and along the<br />

vertical y-axis (blue) respectively. Next, the tracking algorithm<br />

is initiated and the mirrors are tilted correspondingly. Fig. 5<br />

illustrates the measurement setup and the bit error rate (BER)<br />

over displacement to the optical axis. Data transmission takes<br />

place in bi-directional full-duplex mode with a data rate of<br />

1.289 Gbit/s in both directions.<br />

If we assume BER < 10^-8, we can establish a robust link<br />

over an area with an extent of 22.5 cm in horizontal and<br />

23.5 cm in vertical direction. This corresponds to full angles of<br />

12,7° and 13,22° respectively.<br />

Fig. 5: Top: Measurement setup. Bottom: BER over displacement along<br />

horizontal x and vertical y axis separately. The distance in this scenario was<br />

80 cm for practical reasons.<br />



C. Settle time<br />

Table I shows the settle times of the mirrors for different mirror tilts. The settle times range from 7.1 ms to 7.5 ms. They increase only slightly when the mirrors are tilted by larger angles.<br />

TABLE I: MEASURED SETTLE TIMES FOR DIFFERENT MIRROR TILTS.<br />

Delta Mirror Tilt | Settle Time t<sub>s</sub><br />

0.1° | 7.1 ms<br />

0.5° | 7.2 ms<br />

1.0° | 7.3 ms<br />

5.0° | 7.3 ms<br />

10.0° | 7.5 ms<br />

As soon as a communication link is established, the system<br />

provides real-time communication. One communication<br />

channel exhibits an electrical latency of about 5 ns, plus the time for the light to travel through the optical channel. The value<br />

was measured from the signal input port of transceiver 1 to the<br />

signal output port of transceiver 2.<br />
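The end-to-end latency of one channel can be estimated from the figures above. Only the 5 ns electrical latency is from the text; the second term is ordinary free-space time of flight.<br />

```python
# One-way channel latency: measured electrical latency plus the
# free-space time of flight added by the optical channel.

C_M_PER_S = 299_792_458.0  # speed of light

def channel_latency(distance_m, electrical_latency_s=5e-9):
    return electrical_latency_s + distance_m / C_M_PER_S

# At the 80 cm distance of the Fig. 5 setup, time of flight adds about
# 2.7 ns, i.e. roughly 7.7 ns in total.
```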

IV. DISCUSSION<br />

A. Beam profile<br />

The shape of the beam profile is satisfactory. The speckle<br />

pattern results from the laser diode and cannot easily be<br />

avoided without influencing the shape of the spot.<br />

However, by adjusting the collimator lenses, the donut<br />

profile can be avoided and a more homogeneous power density<br />

within the spot can be achieved. This is generally useful for the<br />

fine-scanning algorithm.<br />

The ghosting effect results from reflections at the windows<br />

of the micro mirror packages. It could be minimized by<br />

applying an anti-reflection coating to the windows. However,<br />

since the ghosting effect is quite weak, the additional expense<br />

of such a coating is not justified.<br />

B. Field of View and Bit Error Rate<br />

According to Fig. 5, our system is able to find the opposite transceiver within the FOV. However, the achieved FOV is still below 20° in full angle. This is mainly due to the following factors:<br />

• Additional lenses within the optical setup influence the actual deflection angle<br />

• The optical gain of the receiver optics decreases at higher angles<br />

• A larger deflection angle causes spot distortion, which results in a larger spot size and therefore in lower irradiance<br />

C. Settle time<br />

The measured settle times are slightly larger than the 5 ms specified by the manufacturer. We controlled the mirrors with Matlab R2015a in this measurement setup. As a result, we introduced additional overhead and thus additional latency. If the logic of the tracking algorithm is integrated into a chip on the PCB, we should be able to reduce the latency to nearly 5 ms.<br />

Both devices can change their position within the FOV without losing the data link. However, the settle time of the mirrors fundamentally limits this movement. If device one is fixed and the other one is moving, the theoretical maximum angular speed ω of device two along one axis is given by equation (1), where Δθ is the step size of the fine-scanning algorithm.<br />

ω = Δθ / t<sub>s</sub> (1)<br />

A large step size Δθ in the fine-scanning algorithm results in higher dynamics; for example, with Δθ = 0.5° and the measured t<sub>s</sub> = 7.2 ms, equation (1) yields a maximum trackable angular speed of about 69°/s. However, the precision of the scanning algorithm decreases with larger steps.<br />

V. CONCLUSION<br />

In this paper we introduced a fully automatic Li-Fi tracking system which provides a large FOV and large communication distances at the same time. Our tracking algorithm allows fully automated link establishment without any additional information. The transceivers allow full-duplex, bi-directional data communication with a data rate of 1.289 Gbit/s and BER < 10<sup>−8</sup>. As soon as the link is established, real-time-capable communication with latencies of only about 5 ns is possible. Therefore, our Li-Fi system is ideal for wireless industrial communication.<br />

For the next revision, we plan to replace the 1D micro mirrors with a single 2D micro mirror. This will further simplify the optical setup and lead to further miniaturization.<br />

VI. ACKNOWLEDGEMENTS<br />

The authors thank Fraunhofer for funding this system within the framework of the project 600601 Autotrack.<br />

VII. REFERENCES<br />

[1] Davis, S. R.; Farca, G.; Rommel, S. D.; Martin, A. W.; Anderson, M. H.: Analog, Non-Mechanical Beam-Steerer with 80 Degree Field of Regard. In: Proceedings of SPIE 6971 (2008), 24 March. DOI 10.1117/12.783766<br />

[2] Gibson, J.; Duncan, B.; Bos, P.; Sergan, V.: Wide-angle beam steering for infrared countermeasures applications. In: SPIE Proceedings 4723 (2002), 3 September, pp. 100–111<br />

[3] Nakamura, K.; Miyazu, J.; Sasaki, Y.; Imai, T.; Sasaura, M.; Fujiura, K.: Space-charge-controlled electro-optic effect: optical beam deflection by electro-optic effect and space-charge-controlled electrical conduction. In: Journal of Applied Physics 104 (2008), No. 14. DOI 10.1063/1.2949394<br />



[4] Römer, G. R. B. E.; Bechtold, P.: Electro-optic and acousto-optic laser beam scanners. In: Physics Procedia 56 (2014), 4 September, pp. 23–39. DOI 10.1016/j.phpro.2014.08.092<br />

[5] Tholl, H. D.: Novel Laser Beam Steering Techniques. In: SPIE Proceedings 6397 (2006), 22 September, No. 639708-14<br />

[6] Brandl, P.; Zimmerman, H.: Optoelectronic Integrated Circuit for Indoor Optical Wireless Communication with Adjustable Beam. (2013), 3 July. ISBN 978-1-4673-5822-4<br />

[7] Wang, K.; Nirmalathas, A.; Lim, C.; Skafidas, E.: 4 × 12.5 Gb/s WDM Optical Wireless Communication System for Indoor Applications. In: Journal of Lightwave Technology 29 (2011), pp. 1988–1996<br />



Design of On-Chip RFID<br />

Transponder Antennas<br />

Dr. Andreas Heinig<br />

Drahtlose Mikrosysteme<br />

Fraunhofer IPMS<br />

Dresden, Germany<br />

andreas.heinig@ipms.fraunhofer.de<br />

Abstract— Manufacturing the transponder antenna directly<br />

on top of the silicon substrate of the transponder integrated<br />

circuit has the advantage of a cheap and miniaturized<br />

transponder tag. No additional mounting and joining processes are<br />

necessary. For passive transponder tags the complete system can<br />

be fabricated using the standard CMOS process.<br />

The size of the antenna depends on the wavelength of the<br />

transmission frequency. In the presented project a frequency of<br />

61 GHz was chosen. This makes it possible to design a chip of<br />

about one square millimeter. In modern silicon technologies,<br />

there is more than enough space behind the area of the antenna<br />

for the necessary electronic circuit. Due to the high frequency<br />

and the small antenna diameter, only a very small amount of<br />

energy is expected to be available for powering the electronic<br />

circuit. Therefore, only identification and authentication with<br />

embedded cryptographic functions are planned for applications<br />

with a reading range of about five millimeters in the first steps.<br />

More energy-critical functions like passive sensor measurements<br />

will be addressed in the future.<br />

In general, the design process of the transponder antenna is an iterative process based on high-frequency electromagnetic field simulations. The process is similar to the design of customized UHF antennas. The antenna type is a slot antenna, consisting of the antenna slot itself and a surrounding frame. The conductive substrate material results in unavoidable losses. Nevertheless, an antenna gain of 1.5 dB is achieved. The antenna gain can be boosted up to 5.6 dB with a metallic backplane of the right thickness on the chip. Additional losses originating from the metallic filling structures in the circuit are already included in these numbers. The filling structures are necessary for technological reasons in the manufacturing process: the metallic coverage has to be between a minimum and a maximum value. The filling area behind the frame does not reduce the antenna quality; the electronic circuit is also placed in this area. The filling directly below the antenna structure itself should be at the minimum allowed value. A critical point is the influence of the filling on the antenna impedance parameters. These parameters have to be fitted to the input parameters of the electronic circuit for an optimal match. Unfortunately, the filling structures are restricted by the technology rules and are too complex to be simulated in full in the field simulations. On the other hand, the antenna and the electronic parts are manufactured in the same steps, so post-process matching is impossible. Therefore, a first antenna on a silicon substrate is manufactured and verified to find differences between simulation and real measurement. This will help to derive a simplified model of the filling structures for the simulation and to allow a precise match of the antenna impedance to the complex-conjugate circuit impedance after manufacturing.<br />

Keywords— GHz-Transponders, RFID, Antenna-on-Silicon,<br />

Embedded Antenna, Antenna Design<br />

I. INTRODUCTION<br />

RFID systems are currently being actively developed<br />

worldwide by various research and industrial companies and<br />

represent a multi-billion dollar future market. The goals are<br />

often to speed up processes or reduce logistics costs in tracking<br />

and controlling the corresponding mobile assets. Another area<br />

of application is real-time monitoring and reliable<br />

administration in the pharmaceutical, medical or military<br />

industries. RFID offers many advantages over the widely used<br />

barcode or data matrix code. Reading several tags can be done<br />

simultaneously and fully automatically. The information can<br />

not only be read unidirectionally, but also actively written. This<br />

is an advantage, especially for poorly networked processing<br />

chains, because the information can be passed on to subsequent<br />

stages, such as validation. In addition, other systems such as<br />

sensors or cryptographic modules can be integrated in the ID<br />

tags, which can enormously expand the range of functions and<br />

possible applications. An RFID system generally consists of a<br />

transponder, the ID tag that carries the information and a reader<br />

that is required for reading. There are roughly three different<br />

types, which differ in the operating principle of the transponders. The<br />

transponders working in the lower frequency ranges 125 kHz<br />

and 13.56 MHz are based on the physical operating principle of<br />

the magnetic coupling (loosely coupled transformer), while the<br />

systems from 868 MHz use the backscatter principle known<br />

from radar technology. In this case, an electromagnetic wave<br />



from the antenna of the reader spreads freely in space and, depending on the digital information to be transmitted (0 or 1), is reflected from the antenna of the transponder back to the reader or not. The development of the magnetically coupled transponder can today be considered technically complete. In practice, however, there are large fields of application (such as travel passports, identity cards, EC and credit cards, but also medical implants). With the use of near field communication (NFC) in connection with smartphones, further applications such as the electronic purse or telemedicine devices have been developed for the 13.56 MHz frequency range. Most of the new developments are currently taking place in the frequency band from 850 MHz to 950 MHz (internationally not uniformly standardized). In addition to pure identification systems, transponders with sensors and, in the future, with cryptography functions are increasingly gaining in importance. The IPMS plays a key role in this development. For applications to protect against piracy these transponders are not suitable, because in this frequency range the antenna is still several centimeters in size; a simple dipole antenna is 17 cm long. The frequency range of 2.45 GHz, which is also released for RFID applications, has no significance in practical application. Only in the frequency range above 60 GHz do antenna dimensions of one to two millimeters in length become so small that integration on the silicon chip is technically and economically sensible. The frequency of 61 GHz allows an antenna size that makes integration on a chip with an area of 1 mm² to 2 mm² practical. This frequency band is released for long-range devices.<br />

A particular challenge is the energy supply of such<br />

transponders. Since the chip is to work passively, without its<br />

own energy source, the necessary power must be transmitted<br />

via the carrier field. At 61 gigahertz, very little power will be<br />

available on the chip for the electrical function. Since the<br />

transmission power of the reader is limited by standard<br />

regulations to 100mW, a very efficient low-power circuit must<br />

be developed for the transponder tag.<br />

The antenna has a particular influence on the efficiency of the transponder. Appropriate geometries must be developed for on-chip antennas (OCA). Another challenge is the optimal<br />

adaptation of the antenna to the electrical circuit. Only this<br />

makes it possible to minimize the transmission losses and to<br />

optimally use the minimum power budget. The integration of<br />

the antennas is realized in the standard CMOS process. Only in<br />

this way can the chip be realized within the envisaged cost<br />

range. Therefore, the technological requirements in the<br />

backend process have to be taken into account during<br />

development.<br />

There is still no standard protocol for the communication<br />

between reader and transponder in this frequency range. The<br />

starting point for the development will be the EPC-G2V2<br />

standard, which was developed for the UHF frequency range. It<br />

also already contains the command-state definition that is<br />

required for the authentication function. When<br />

implementing the function in the 61GHz chip, the power<br />

consumption must be significantly reduced.<br />

II. COMMON SYSTEM DESCRIPTION<br />

The project will develop a CMOS chip and a reader (Figure<br />

1). Both parts communicate with each other and thus represent<br />

the transponder system. The selection of the 60 GHz band<br />

enables the integration of all components in one chip on the<br />

chip side. The antenna on top of the transponder chip is the<br />

topic of this presentation.<br />

Fig. 1. Overview of the 61GHz Transponder System.<br />

The target application of the system is the wireless<br />

identification and authentication of assets. The reader should be<br />

brought close to the transponder; ranges on the order of 5 mm<br />

are being considered.<br />

The FD-SOI technology used is based on an ultra-thin silicon layer, which is separated from the silicon substrate by a thin buried oxide layer. The transistor channel is made in the ultra-thin silicon layer. Due to the small silicon thickness, the transistor channel is fully depleted in the off state ("fully depleted"), which significantly improves the turn-off behavior of the transistor. Due to the complete depletion of the silicon layer, implantation steps in the channel can be almost completely eliminated, whereby the mobility of the charge carriers in the channel, and thus also the on-current, is increased. The buried oxide layer reduces the parasitic capacitances in the transistor structure, improves the field penetration of the gate onto the channel, and produces a nearly ideal sub-threshold slope, which is reflected in a further improved on-current.<br />

III. ANTENNA DESIGN<br />

The aim of the antenna development is to obtain the<br />

smallest possible antenna geometry that can be produced<br />

directly on the chip with the technology also used as standard<br />

for the integrated circuit, which achieves an antenna gain > 0<br />

dB. A particular challenge is the influence of the silicon<br />

substrate and the metal structures contained therein on the<br />

antenna located in the uppermost levels. The selected working<br />

frequency of 61GHz determines the size of the antenna. The<br />

optimal antenna length is in the usual antenna types in the order<br />

of half the wavelength of the operating frequency, plus a<br />

dependent of material parameters truncation factor. The half<br />

wavelength is in our case at λ / 2 ~ 2.44mm. The materials<br />

specified by the chip substrate still lead to a significant<br />

shortening factor, so that λ / 2 antennas are suitable for the<br />

selection of an antenna in the project. Through a literature<br />

review and the input and simulation of simplified models of<br />

different antenna geometries, a double slot antenna was chosen<br />

as the most promising. In contrast to the example of the normal<br />

dipole antenna, in which a conductor itself represents the<br />

antenna, in a slot antenna a recess in the metallic conductor is<br />

the actual antenna structure. This makes the antenna<br />



characteristics relatively independent of metal around the<br />

antenna, such as the electronic circuit structures of the chip.<br />
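The half-wavelength figure quoted above can be reproduced directly. The effective permittivity used below only illustrates the shortening caused by the chip materials; the actual value is not stated in the paper, so 4.0 is a placeholder.<br />

```python
# Quick check of the half-wavelength estimate; eps_eff = 4.0 is an
# illustrative placeholder for the material-dependent shortening.

def half_wavelength_mm(freq_hz, eps_eff=1.0):
    c = 299_792_458.0
    return c / (2 * freq_hz * eps_eff ** 0.5) * 1e3

free_space = half_wavelength_mm(61e9)              # ~2.46 mm, in line with ~2.44 mm in the text
shortened = half_wavelength_mm(61e9, eps_eff=4.0)  # ~1.23 mm, illustrative only
```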

The structure of the antenna is shown in Figure 2.<br />

Fig. 2. Used antenna geometry (antenna slots, surrounding frame, and matching waveguide).<br />

In the illustration, metallic surfaces are shown in blue. At the top, horizontally, the two slots of the antenna are visible in the metal. In the middle, two further slots connect downwards, which form a waveguide (CPW = coplanar waveguide). This couples the signal from the antenna and feeds it to the transponder circuit, which is later electrically contacted at the lower end of the waveguide. The geometry of the waveguide can additionally be used to adapt the antenna to the electrical circuit. In addition to the antenna characteristics themselves, the impedance of the antenna must be the complex conjugate of the input impedance of the transponder circuit. This is necessary in order to achieve an optimum energy transfer from the electric field via the antenna into the transponder circuit and to have a good reflection coefficient for the data transmission from the transponder back to the reading device. The geometrical distances entered in the figure affect the antenna properties and the impedance of the antenna.<br />

In order to develop the antenna for the 61 GHz transponder chip, a model adjustable in all geometric dimensions has been created. With this, the antenna can be entered into the high-frequency simulation software "HFSS", a product of ANSYS, Inc. An analytical calculation of the geometries is not possible because of the complex relationships. Due to the large number of variables, a large number of time-consuming simulations is necessary in order to iteratively approach the best possible solution. The antenna structure is located in the highest metallic conductor level of the technology used, "22FDX". All levels of the silicon wafer are implemented in the simulation model, which together comprise 32 different substrate, trace and via levels. This leads to a very complex simulation model, which places high demands on the computer technology used and requires a significant amount of time per simulation. First, in an iterative process, the geometries were determined by varying the antenna length, the width of the slot, and the extent of the enclosing frame to give optimal antenna characteristics. Thereafter, the antenna was adapted to the input impedance of the transponder circuit. Since the circuit was developed in parallel with the antenna, its input impedance was known neither from simulation nor from measurement at that time, so an estimated target value of Z<sub>A</sub> = (5 + j10) Ω was given for the antenna. Since separate chips were provided for the first trials on silicon for antenna and circuit development, this is not a limitation; the important question to clarify is whether the properties of the developed antenna can also be metrologically verified, and the specific adaptation to the chip then takes place in further steps. The impedance characteristics of the antenna were adjusted substantially via the length and width of the waveguide. Since any resizing affects all parameters, an iterative process of readjusting all parameters is necessary. Figure 3 shows as an example the dependence of the antenna impedance on the length of the waveguide.<br />

Fig. 3. Complex impedance with respect to the matching feed length.<br />
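The conjugate-match requirement described above can be made concrete with the power-wave reflection coefficient. The (5 + j10) Ω value is the estimated antenna target from the text; the 50 Ω comparison point is our illustration.<br />

```python
# The conjugate-match condition in numbers: the power-wave reflection
# coefficient vanishes exactly when the antenna impedance equals the
# complex conjugate of the circuit input impedance.

def reflection_coefficient(z_antenna, z_circuit):
    return (z_antenna - z_circuit.conjugate()) / (z_antenna + z_circuit)

gamma_matched = reflection_coefficient(5 + 10j, 5 - 10j)       # conjugate match -> 0
gamma_detuned = abs(reflection_coefficient(5 + 10j, 50 + 0j))  # badly mismatched
```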

Combined with the other variables, this results in a multidimensional result field which, due to its complexity, makes it necessary to use the computer-assisted optimization options offered by the HFSS software and to develop end-to-end optimization programs. As a result, it was possible to develop a chip antenna which, with an antenna gain of ~1.55 dB, is significantly better than the target value. Figure 4 shows the directional characteristic (radiation pattern) of the designed antenna.<br />

Fig. 4. Antenna Gain characteristic with high impedance wafer material.<br />

It was evaluated to what extent a metal surface behind the antenna focuses the radiation and leads to a higher antenna gain. This was confirmed: the antenna gain increases to ~5.9 dB when the reader is above the chip, which corresponds to the normal application. Figure 5 shows the corresponding diagram. It is technologically easy to later mount the transponder chip on a metallic substrate or to metallize the back of the wafer to take advantage of this. However, the gain value depends on the thickness of the material; the values shown are valid for a substrate thickness of 700 μm, which corresponds to a wafer thickness of<br />



approximately 730 μm during production. If the distance decreases, the antenna gain decreases. The change is very small down to a 300 μm substrate, but from 300 μm to 100 μm substrate thickness the antenna gain drops back to the value without backside metallization. That is, if the additional antenna gain is to be exploited, the wafer must not subsequently be thinned below 320 μm thickness. For initial measurements of the antenna, it was planned to mount the antenna chip on a PCB of 0.5 mm thickness as a carrier for easier handling and to use unthinned chips. The circuit board has a copper surface as a backside metallization. The simulation model was adjusted accordingly; Figure 6 shows this in the appropriate size ratio.<br />

Fig. 5. Antenna gain characteristic with reflector and high impedance wafer.<br />

Fig. 6. Chip carrier with reflector for test.<br />

IV. TRANSFER TO CHIP LAYOUT<br />

For the design of the test chips, test pads were added in order to be able to measure the antenna via needle probes. On the later transponder chip these pads are not necessary. The test pads affect the impedance of the antenna. Since an adaptation to a specific target impedance is not necessary for the probe chip and the resulting impedance is in the desired window, a correction of the influence of the test pads was omitted for reasons of expense. Figure 7 shows the resulting test chip including the technologically required marginal and auxiliary structures. The chip size predetermined by the antenna is 1.3 mm x 0.7 mm, which is within the desired target range. The areas under the antenna frame can later be used for the transponder circuit. For organizational reasons, a chip size of 2 mm x 2 mm had to be occupied for the production of the test chip, so the remaining area was left empty in order to exclude an influence on the antenna properties.<br />

Fig. 7. Antenna test chip layout.<br />

Very problematic for the antenna design is the technological requirement that all metal levels have a certain minimum and maximum metal coverage; otherwise the chip cannot be produced. Excluded from this rule is the top metal level, in which the antenna itself is located. However, the additional metal surfaces a few nanometers to micrometers below the antenna have a large impact on the antenna parameters. For the antenna gain, the filling under the actual antenna has a negative effect, while the filling under the frame has a positive effect. In total, this leads to a loss of antenna gain, which must be accepted. The above-mentioned antenna gain of 5.9 dB is already the final value reached; the losses due to the filling structures of approx. 3 dB are already included here. Without filling structures, the antenna with a metallic background would even achieve over 8 dB antenna gain. The antenna impedance is also affected by the filling. Here, however, the change is not tolerable, and it is necessary to counteract the influence with changes in the antenna geometry in order to ensure the match between antenna and transponder circuit. Here arises the problem that the technologically predefined filling structure is so delicate and complex that the space requirement and the computing time for its simulation are beyond any reasonable scope. The filling structure was therefore modeled in a greatly simplified form. The measurements of the test chip must show whether this simplification emulates the influence of the filling structure accurately enough. Despite the simplification, the simulation of the antenna with filling structure remains very computationally intensive; more than 60 hours of simulation time are necessary for a single simulation point. This dramatically complicates the parameter optimization of the antenna. A suitable procedure for this is still to be determined; for the test chip, the adjustment was not necessary and was not carried out for scheduling reasons. The impedance shown in Figure 8 results for the implemented test chip as a simulation result.<br />
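Returning to the reflector result above, one plausible sanity check — our interpretation, not a statement from the paper — is that the favourable 700–730 μm substrate is close to half a guided wavelength in silicon, assuming ε<sub>r</sub> ≈ 11.7.<br />

```python
# Sanity check (our interpretation): half a guided wavelength in
# silicon at 61 GHz, with eps_r = 11.7 assumed for the substrate.

def guided_half_wave_um(freq_hz, eps_r):
    c = 299_792_458.0
    return c / (2 * freq_hz * eps_r ** 0.5) * 1e6

reflector_spacing = guided_half_wave_um(61e9, 11.7)  # ~720 um
```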



Fig. 8. Simulated complex impedance values of the antenna.<br />

V. MEASUREMENT<br />

The designed antenna test chip was manufactured under the name "autag1a" in the target technology. The test chips were then mounted on a circuit-board carrier (Figure 9), and the impedance was measured with a needle-probe test station and a network analyzer. The measurement setup is first calibrated on a specially designed calibration substrate. For this purpose, structures for "short circuit" (0 Ω), "match" (50 Ω) and "open" (∞) are located on the calibration substrate.<br />
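The short/match/open calibration corresponds to the classic three-term one-port error model; a sketch under the assumption of ideal standards (Γ = −1, 0, +1), which a real calibration substrate would refine with manufacturer-supplied standard definitions.<br />

```python
# Three-term one-port error model behind a short / match / open
# calibration, assuming ideal standards: -1 (short), 0 (match), +1 (open).

def one_port_cal(gm_short, gm_load, gm_open):
    """Solve for directivity e00, source match e11 and the combined
    term delta = e00*e11 - e10*e01 from the three raw measurements."""
    e00 = gm_load                       # a perfect load reflects nothing
    e11 = (2 * e00 - gm_short - gm_open) / (gm_short - gm_open)
    delta = gm_short * (1 + e11) - e00  # from the short-standard equation
    return e00, e11, delta

def corrected(gm, e00, e11, delta):
    """De-embed a raw reflection measurement through the error model:
    gamma_actual = (gm - e00) / (gm * e11 - delta)."""
    return (gm - e00) / (gm * e11 - delta)
```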

The measured values obtained are very stable and reproducible; they are also very homogeneous across the various test chips on the printed circuit boards as well as for the test chip without PCB assembly. The measured values initially deviated significantly from the simulated values. After a further refinement of the model with structures in the top metal layer that were initially not included in the simulation model (pads, crack stop), and with better modelling of the filling structure and the measurement environment such as needles and mountings (Figure 11), the measurement could be exactly matched with the simulation (


extensive electronic circuits in the direct area of the antenna<br />

can be taken into account (Figure 13).<br />

All technologically necessary structures, such as fillings to<br />

achieve the required metal coverages, pads and other essential<br />

metal surfaces, which can potentially influence the antenna,<br />

were considered in advance in the design process and allow an<br />

optimal match between antenna and circuit. The influences on<br />

the antenna gain could be recognized and optimized for the<br />

best possible gain.<br />

The design methodology can also be applied to other<br />

transponder antennas in other frequency ranges. Figure 15<br />

shows a set of printed board antennas for a other IPMS<br />

transponder chip in the 869MHz frequency band.<br />

Fig. 13. UHF printed-board antenna with complex electronics.<br />

VIII. CONCLUSION<br />

The article shows the development of an on-chip antenna for the 61 GHz frequency range. The selected target frequency enables chip sizes in the range of 1 mm². The transponder electronics are also located on the chip, so that a very small transponder can be created. The parameters determined by simulation in the design process could be verified on a first sample chip. This is particularly important for the matching of antenna and transponder circuit, since both are inseparable from each other and a subsequent change is not possible. Reasonable antenna parameters can also be achieved on standard wafers with high conductivity of the material. Figure 14 shows a photograph of the realized chip on standard wafer material.<br />

Fig. 15. Photo of several UHF antenna designs.<br />

ACKNOWLEDGMENT<br />

The work on this topic was supported by the project<br />

“PROSECCO: PROduct SECurity and COmmunication” of the<br />

Collaborative R & D project funding of the SAB (Sächsische<br />

AufbauBank) and the Fraunhofer internal Project “Radar-Tag:<br />

System for the authentication of assets”.<br />

REFERENCES<br />

[1] Heiß, M.: "Antennenentwurf für Radio-Frequency-Identification (RFID)-Sensor-Transponder" [Antenna design for RFID sensor transponders], Dissertation, Technical University of Dresden, 2014<br />

[2] Lischer, S.: "A 24 GHz RFID System-on-a-Chip with On-Chip Antenna, Compatible to ISO 18000-6C / EPC C1G2", IEEE COMCAS 2015<br />

[3] Lischer, S.: "Ein ISO 18000-6C / EPC C1G2 kompatibles 24-GHz-RFID-Ein-Chip-System mit integrierter Antenne" [An ISO 18000-6C / EPC C1G2 compatible 24 GHz single-chip RFID system with integrated antenna], MST-Kongress 2015<br />

[4] Fonte, A.; Saponara, S.; Pinto, G.; Neri, B.: "Feasibility Study and On-Chip Antenna for Fully Integrated µRFID Tag at 60 GHz in 65 nm CMOS SOI", IEEE International Conference on RFID-Technologies and Applications, 2011, pp. 457–462<br />

[5] Guo, L. H.: "A Small OCA on a 1 × 0.5-mm² 2.45-GHz RFID Tag – Design and Integration Based on a CMOS-Compatible Manufacturing Technology", IEEE Electron Device Letters, No. 2, February 2006, pp. 96–98<br />

[6] Dagan, H.: "A Low-Power Low-Cost 24 GHz RFID Tag with a C-Flash Based Embedded Memory", IEEE Journal of Solid-State Circuits, Vol. 49, No. 9, September 2014, pp. 1942–1957<br />

Fig. 14. Chip photo of the test antenna.<br />



Demystifying Why Your ADC Does Not Perform To<br />

The Datasheet And What You Can Do To Improve<br />

Performance<br />

Christy She<br />

Connected MCU Systems<br />

Texas Instruments<br />

Dallas, Texas, U.S.<br />

Chris Sterzik<br />

Connected MCU Applications<br />

Texas Instruments<br />

Dallas, Texas, U.S.<br />

Abstract—Noise is a complex problem that challenges even<br />

the most experienced analog engineer working with sensor nodes<br />

in the IoT. The complexity comes from the number and types of<br />

noise sources. These sources can be within the microcontroller<br />

(MCU), on the board, or in the environment. As MCU<br />

integration and speeds increase, the internal noise has<br />

increased as well. Additionally, the rise of the IoT and wireless<br />

connectivity has increased the noise in the environment. This<br />

paper explains why the datasheet doesn’t tell the whole<br />

story of integrated performance and generally represents a<br />

subset of use cases. This subset represents characterization of<br />

individual peripherals and functions with the remainder of the<br />

SoC peripherals and functions in a sleep or idle state. This<br />

paper shows data from real-world examples implementing<br />

different techniques to reduce noise. These examples include<br />

noise introduced by SPI and sub-1 GHz wireless connectivity,<br />

which is generalized to I2C and UART as well as BLE and<br />

Wi-Fi. Using specific use cases and showing how to generalize to<br />

other wired or wireless configurations, MCU developers can<br />

apply the concepts discussed in this paper to successfully<br />

integrate precision analog measurements into their sensor node<br />

designs.<br />

Keywords— Microcontroller; MCU; ADC; analog to digital<br />

converter; differential; high performance ADC; coexistence;<br />

electromagnetic compatibility (EMC); sensor nodes; IoT;<br />

ratiometric measurements; ADC calibration<br />

I. INTRODUCTION<br />

Most analog to digital converters (ADCs) have<br />

configurability that affects their performance. Thus, a single<br />

datasheet value may not cover performance for all possible use<br />

case configurations. ADCs integrated into a microcontroller<br />

(MCU) often have even more configurability in order to<br />

optimize the ADC for power and performance across varied<br />

use cases.<br />

Integrated circuit (IC) manufacturers want to show the best<br />

performance possible; thus, they select the configuration that<br />

shows the best performance. In a few cases, manufacturers will<br />

split parameters to show how specific configurations affect<br />

performance. Therefore, you must pay careful attention to test<br />

conditions including the typical test conditions to know if the<br />

data sheet performance for a parameter of interest applies to<br />

your use case.<br />

The next section goes into the details of why the ADC’s<br />

datasheet performance may not represent your use case and<br />

gives guidance on what performance to expect. It is followed by<br />

a section describing how to maximize ADC performance for<br />

your use case.<br />

II. WHY ADC DATASHEET PERFORMANCE MAY NOT BE APPLICABLE TO YOUR USE CASE<br />

This section lists the common configuration parameters that<br />

affect performance, with some guidelines on how to take a data<br />

sheet whose parametric conditions do not match your use case<br />

and still know what performance you can expect.<br />

A. Reference Choice<br />

There are two main parts to the reference voltage which<br />

affect the performance of the ADC: accuracy and voltage.<br />

1) Accuracy: Accuracy is driven by what reference is<br />

used. For ADCs integrated on an MCU, the reference options<br />

may include (in order of increasing accuracy): the supply,<br />

internal reference, or an external reference (separate chip).<br />

The supply as the reference is the lowest current option but<br />

is usually noisier, as it supplies the digital circuitry (which has<br />

switching noise). One common technique to mitigate or protect<br />

the analog supply from digital switching noise is to use a filter<br />

between the analog and digital supplies if there are separate<br />

pins. Similarly, to isolate noise on the supply from the<br />

reference, connect the external supply to the ADC’s external<br />

reference pin using a ferrite bead (a passive electric<br />

component) and decoupling filters to reduce noise, as shown in<br />

Fig. 1.<br />

Using a ferrite bead is a common practice to isolate noise,<br />

especially between analog and noisy switching digital signals.<br />

Reference [1] provides details about the use of a ferrite bead<br />

and although it is written around a phase-locked-loop (PLL) it<br />



Fig. 1. Example connection of the supply voltage to the ADC external<br />

reference pin. (The digital supply is filtered through a ferrite bead and<br />

decoupling capacitors into the analog supply/reference domain; a reference<br />

multiplexer selects between the external and internal references.)<br />

is applicable to an ADC as well. Also, the supply used for the<br />

ADC reference generally cannot be a direct connection from a<br />

battery, because the voltage will decay over the battery’s<br />

lifetime, whereas the ADC reference voltage must be<br />

known to calculate the ADC converted voltage.<br />

The internal reference typically provides lower noise than<br />

the supply at the cost of increased current consumption. Even<br />

when filtering the supply and applying it to the external<br />

reference path, as described earlier, the internal reference is<br />

typically a lower noise option.<br />

Applications that need better accuracy, especially over a<br />

wide temperature range, may require an external reference.<br />

External references are available with better accuracy and a<br />

lower temperature coefficient/drift (generally the two dominant<br />

error factors). External reference voltages are available with a<br />

temperature coefficient in the single-digit parts-per-million<br />

(ppm)/°C range, versus 25 ppm/°C to 50 ppm/°C for references<br />

integrated into an MCU. For more details on how to select a<br />

voltage reference and example calculations of total reference<br />

voltage error, refer to [2].<br />

There are two alternatives to using an external reference to<br />

improve DC reference accuracy across temperature:<br />

a. Calibration: In production, create a lookup table (or a<br />

single point if the temperature range is small) of the actual<br />

reference voltage (on select devices, some manufacturers<br />

actually measure this during device production and store it on<br />

chip) and use it in software to either correct the raw ADC code<br />

or adjust the ADC result for the inaccurate reference voltage.<br />

Equation (1) is the correction equation:<br />

ADC_corrected = ADC_raw × (measured_VREF / VREF) (1)<br />

where VREF is the ideal ADC reference voltage and<br />

measured_VREF is the measured ADC reference voltage. If<br />

you are correcting across temperature, a temperature<br />

measurement must be taken at the time of the ADC<br />

measurement to know which measured reference voltage value<br />

to use in the lookup table.<br />
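As a sketch of this calibration flow, the following applies equation (1) with a small, purely hypothetical lookup table of measured reference voltages (the table temperatures and voltages are invented for illustration, not taken from any device):

```python
# Sketch of reference-voltage calibration per equation (1).
# The lookup-table values are hypothetical production-calibration data.
VREF_IDEAL = 2.5  # ideal ADC reference voltage (V)

# measured_VREF at a few calibration temperatures (degrees C -> volts)
vref_cal_table = {-40: 2.504, 25: 2.500, 85: 2.496}

def corrected_code(adc_raw, temp_c):
    """Correct a raw ADC code for the actual (measured) reference voltage."""
    # pick the calibration point closest to the measured die temperature
    nearest = min(vref_cal_table, key=lambda t: abs(t - temp_c))
    measured_vref = vref_cal_table[nearest]
    return adc_raw * measured_vref / VREF_IDEAL

print(corrected_code(2048, 30))  # uses the 25 C point: 2048 * 2.500/2.5 = 2048.0
```

A real implementation would typically store the table in nonvolatile memory during production test and interpolate between points rather than picking the nearest one.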

b. Ratiometric measurement: In applications where the<br />

voltage used to excite the sensor is the same voltage used as<br />

the reference for the ADC, the measurement is called<br />

ratiometric. Because the same voltage excites the sensor and<br />

serves as the ADC reference, any error in that voltage is<br />

canceled out. For ratiometric measurements, either an external<br />

reference can be used, or the internal reference if it can be<br />

made available outside the device. You can also take a<br />

ratiometric measurement with a current source exciting the<br />

sensor: place a resistor between the positive and negative<br />

ADC reference pins and route the excitation current through<br />

that resistor. For a detailed example with a resistance<br />

temperature detector (RTD), refer to [3].<br />
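The cancellation can be illustrated numerically. The sketch below assumes a hypothetical resistive sensor in a voltage divider whose excitation voltage also serves as the ADC reference; all component values are invented for illustration:

```python
# Sketch showing why ratiometric measurements cancel excitation-voltage error.
FULL_SCALE = 4095  # 12-bit ADC

def adc_code(v_in, v_ref):
    """Ideal ADC transfer function."""
    return round(FULL_SCALE * v_in / v_ref)

def divider_code(r_sensor, r_top, v_exc):
    # sensor voltage tracks the excitation...
    v_sensor = v_exc * r_sensor / (r_sensor + r_top)
    # ...and the same excitation is the ADC reference, so v_exc cancels
    return adc_code(v_sensor, v_exc)

# A 5% error in the excitation voltage does not change the result:
print(divider_code(1000, 3000, 3.3))    # 1024
print(divider_code(1000, 3000, 3.135))  # 1024 as well
```

Only the resistor ratio reaches the ADC result; the excitation value drops out of the math entirely.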

2) Voltage: If the integrated ADC supports a range for the<br />

input reference voltages, then understanding how the voltage<br />

level affects performance is important. Selecting a lower<br />

reference voltage reduces the least significant bit (LSB) size<br />

so that the overall (full-scale) range is decreased in order to<br />

resolve smaller changes in voltage. This reduction of the<br />

signal via the reference voltage level affects performance, as<br />

shown in the signal-to-noise ratio (SNR) equation (2):<br />

SNR(dB) = 20 × log10(rms_SIGNAL / rms_NOISE) (2)<br />

where rms_SIGNAL is the rms value of the full-scale ADC input<br />

(at most the reference voltage) and rms_NOISE is the rms noise.<br />

Figure 2 shows how the SNR decreases as the reference<br />

voltage decreases. Given the same noise, when the signal is<br />

smaller (in the case of a lower reference voltage), SNR is<br />

lower. Thus, to maximize performance keep in mind the full<br />

dynamic range of the ADC and, if required, to pre-condition or<br />

amplify the ADC input to use the full ADC dynamic range.<br />

When cost is more important than performance, choose the<br />

smallest reference voltage level that will always be larger than<br />

the input signal. For example, assuming an ideal reference, if<br />

an input signal is 1 V max and voltage references of 1 V and 2 V<br />

are available, then amplifying the input by a factor of 2 and<br />

using the 2 V reference would provide better SNR than<br />

measuring the 1 V directly with a 1 V reference.<br />
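Applying equation (2) with an assumed, purely illustrative noise floor shows the benefit quantitatively: doubling the signal (2x gain with the 2 V reference) buys about 6 dB of SNR.

```python
import math

# Sketch of equation (2): SNR for a fixed noise floor as the signal
# amplitude (bounded by the reference voltage) changes.
RMS_NOISE = 100e-6  # 100 uV rms, an assumed constant noise floor

def snr_db(rms_signal):
    return 20 * math.log10(rms_signal / RMS_NOISE)

# a sine spanning a 1 V range has rms amplitude (1/2)/sqrt(2)
snr_1v = snr_db(0.5 / math.sqrt(2))
# amplifying the input 2x and using a 2 V reference doubles the signal
snr_2v = snr_db(1.0 / math.sqrt(2))
print(round(snr_2v - snr_1v, 1))  # 6.0 dB improvement
```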

B. Supply Voltage<br />

MCUs have a fairly wide operating range to support many<br />

applications – specifically battery-powered applications. This<br />

wide range does not always propagate to the ADC, which may<br />

require a higher minimum supply voltage. If a device has this<br />

limitation, then you can find the minimum supply voltage for<br />

ADC operation in the data sheet, usually in an ADC parametric<br />

table row.<br />

Depending on the ADC’s architecture and design, there<br />

may be performance degradation at lower supplies so look<br />

carefully at the test conditions. Data sheets show test<br />

conditions in different ways including: footnotes, a column in<br />

the data sheet, or in the table title. Some datasheets supplement<br />

table entries with graphs that show how performance changes<br />

over voltage or temperature. In a battery powered application,<br />

understanding the performance over the range of the<br />

operational battery voltage is critical to a successful design. If<br />

your application needs a lower supply than the datasheet shows<br />

the ADC parametric at, you should measure the performance at<br />

the minimum supply of your application to know if it meets<br />

your performance requirements.<br />

Also note that when the supply varies, as is the case of a<br />

direct battery connection, some parametric values can change<br />



Fig. 2. SNR vs. reference voltage. (Vertical axis: Signal-to-Noise Ratio,<br />

58 dB to 72 dB; horizontal axis: Reference Voltage, 1.00 V to 3.00 V.)<br />

across the supply voltage range. The power supply rejection<br />

ratio (PSRR) is one measure, but also look for any parameter<br />

with units of per V of supply. Examples of parameters which<br />

may be affected by the supply are gain and offset errors, though<br />

this is ADC-architecture dependent. Some ADCs may be<br />

sub-regulated (with an internal low-dropout regulator (LDO),<br />

for instance) so that they always see the same supply voltage<br />

independent of the device supply; in that case, the ADC only<br />

sees the small ripple at the LDO output.<br />

C. Multiple Modes<br />

MCUs typically offer multiple modes to allow you to<br />

customize ADC tradeoffs such as speed, performance, and<br />

current. Unless your use case matches the data sheet test<br />

conditions, your application will have different performance,<br />

current, and sample-rate limits.<br />

Several things affect the sample rate, including the mode,<br />

conversion clock frequency, and sample time. The device data<br />

sheet will list the minimum sample time for a specific source<br />

resistance and capacitance. But if the source you are measuring<br />

has a larger source resistance, the ADC needs a longer sample<br />

time to maximize its performance. The manufacturer should<br />

document a minimum sample time equation for the ADC in the<br />

datasheet and/or reference manual. Reference [4] provides a<br />

minimum sample time equation and an example calculation for<br />

a specific device.<br />
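The exact equation is device specific, but a generic first-order estimate (settling of the sampling capacitor to within 1/2 LSB) illustrates how source resistance stretches the required sample time; all component values below are assumptions, not taken from any datasheet:

```python
import math

# Generic first-order settling estimate for the minimum sample time.
# The sampling capacitor must charge through the source and internal
# resistance to within 1/2 LSB of an N-bit result, which takes
# (N + 1) * ln(2) time constants.
def min_sample_time(r_source, r_internal, c_sample, n_bits):
    tau = (r_source + r_internal) * c_sample  # RC time constant
    return tau * (n_bits + 1) * math.log(2)

# 10 kOhm source, hypothetical 1 kOhm internal resistance, 10 pF cap, 12 bits
t = min_sample_time(10e3, 1e3, 10e-12, 12)
print(f"{t * 1e9:.0f} ns")  # on the order of 1 microsecond
```

Doubling the source resistance roughly doubles the required sample time, which is why high-impedance sensors often need a buffer in front of the ADC.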

Use the datasheet maximum current values that apply to your<br />

configuration; typical currents can be obtained by characterizing<br />

the current for your application across devices. Some<br />

datasheets will have typical curves showing how current varies<br />

with different configurations. Current is often the result of<br />

multiple parameters. For a more detailed list of low power<br />

features and configurability of one specific ADC refer to [5].<br />

D. Datasheet Use Case Only Has the ADC Operating<br />

To showcase best-case ADC performance, datasheet<br />

performance numbers often use a low-power mode where the<br />

central processing unit (CPU) is not active, to minimize on-chip<br />

noise. And if there is an option to choose between an internal<br />

low-dropout (LDO) regulator and a direct-current-to-direct-current<br />

(DC/DC) converter, the LDO is used to minimize on-chip<br />

noise. Note that some ADC architectures/layouts may be less<br />

sensitive to on-chip noise.<br />

If you have the luxury of limiting what is on during the ADC<br />

measurement, the datasheet performance may be a good<br />

indicator of the level of performance you can reach. But if you<br />

have noisy signals (i.e., high-speed signals, especially clocks)<br />

on your board or in the CPU, it is good to bench-test the<br />

performance early on to make sure the ADC is meeting your<br />

needs. See Section III, How Differential Signaling Can Address<br />

Noise, for more details on what you can do to help this,<br />

including board layout techniques and differential inputs.<br />

E. Datasheet only Considers Noise from the ADC<br />

The previous section discussed additional on-chip noise not<br />

accounted for in datasheet performance numbers. This section<br />

discusses additional noise off-chip (prior to the ADC input<br />

coming on chip). For datasheet performance measurements,<br />

signal generators are used and directly connected to the ADC<br />

so the input signal has very low noise. In a real application, the<br />

input signal has noise from the board/external environment in<br />

addition to any noise from preconditioners in the analog front<br />

end. In the signal chain, the noise of each component in front<br />

of ADC degrades the signal into the ADC; thus, you must<br />

consider the noise of each component to determine the signal<br />

chain performance, not just the ADC performance alone. If<br />

your input signal is only 10 bits due to noise, an 11 effective<br />

number of bits (ENOB) ADC will still only give you 10 bits of<br />

information because the rest of the bits are noise. Examples of<br />

the additional components in front of the ADC include<br />

operational amplifiers (for amplification, filters, or current-to-voltage<br />

conversion), passives for resistor-capacitor (RC)<br />

filtering, and bias voltages.<br />
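One way to budget this is to combine the uncorrelated noise sources by root-sum-of-squares and convert the result to effective bits; the noise figures below are illustrative only, not measurements of any particular part:

```python
import math

# Sketch of a signal-chain noise budget: uncorrelated rms noise sources
# combine by root-sum-of-squares, and the total sets the effective
# number of bits (ENOB) via the standard ENOB = (SINAD - 1.76) / 6.02.
def chain_enob(full_scale, noise_sources_rms):
    total_noise = math.sqrt(sum(n**2 for n in noise_sources_rms))
    # SINAD of a full-scale sine (rms = FS / (2*sqrt(2))) over this noise
    sinad_db = 20 * math.log10((full_scale / (2 * math.sqrt(2))) / total_noise)
    return (sinad_db - 1.76) / 6.02

# ADC alone (assumed 150 uV rms) vs ADC plus a noisy front end (500 uV rms)
print(round(chain_enob(3.3, [150e-6]), 1))          # ADC limit
print(round(chain_enob(3.3, [150e-6, 500e-6]), 1))  # chain limit, ~2 bits worse
```

The larger source dominates the root-sum-of-squares, which is the numerical form of the statement above: a noisy front end caps the information the ADC can deliver regardless of its own ENOB.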

III. HOW DIFFERENTIAL SIGNALING CAN ADDRESS NOISE<br />

(GETTING CLOSE TO THE DATASHEET NUMBERS)<br />

The last two sections (D and E) of the previous section<br />

highlight the significance of noise in hindering the achievement<br />

of datasheet performance. This section focuses on differential<br />

signaling as a means to address both of these components.<br />

Differential signaling is an invaluable tool in the engineers’<br />

toolbox for addressing noise during analog measurements. The<br />

strength of differential signaling is in the simplicity of<br />

removing noise as common-mode. The challenge is designing<br />

a circuit so that the noise is in fact common to both conductors<br />

of the differential pair. This challenge is extended to both the<br />

embedded hardware engineer and the integrated circuit (IC)<br />

designer. In an IC design, a great example of this challenge is<br />

substrate noise. The substrate acts as the bridge or ‘medium’<br />

between a component or peripheral generating noise and the<br />

integrated ADC. Similarly, at the board level, neighboring<br />

digital signals can couple with the analog traces. The strength<br />

of that coupling is often augmented by poor ground structures,<br />

forcing long return paths, which increase electromagnetic field<br />

fringing. Finally, with radiated immunity, the differential<br />

spacing should be relatively small compared to the distance<br />

from the radio. This highlights the use of symmetry in<br />

differential signaling in order to cancel or reject common-mode<br />

signals, such as noise.<br />

A. Addressing ADC Noise Internal to the MCU<br />

At the IC level, the power management architecture can<br />

contribute noise to the system and should be considered when<br />



Fig. 3. A) Internal LDO regulator with single-ended measurements. B) Internal DC/DC regulator with single-ended<br />

measurements. C) Internal DC/DC regulator with differential measurements.<br />

comparing the benefits of one architecture over another. For<br />

internal voltage regulation, the IC may use an LDO or a<br />

DC/DC. While the DC/DC is often the more efficient of the<br />

two, Fig. 3.B shows that the DC/DC also contributes more<br />

noise relative to the LDO in Fig. 3.A. Noise appears as an<br />

increase in the difference between the minimum and maximum<br />

voltages returned by the ADC. In both Fig. 3.A and Fig. 3.B,<br />

the ADC is measuring a DC voltage at approximately 250 ksps<br />

for 32 ms. The variation in the conversion result is more than 6<br />

times greater with the DC/DC than with the LDO.<br />

By comparison, if you were to make the same measurement<br />

with the DC/DC in differential mode, (see Fig. 3C), the overall<br />

noise is decreased and the difference between the LDO and<br />

DC/DC performance is minor. Fig. 3 shows performance in<br />

volts instead of LSBs, with the vertical axis converted to mV,<br />

since the LSB of the differential mode is twice that of the<br />

single-ended mode to account for the support of signed results.<br />

The variance in the differential measurement is less than half<br />

the variance in the single-ended implementation, showing that<br />

the majority of the noise from the DC/DC is seen as<br />

common-mode by the ADC.<br />

The DC voltage being measured is treated as a differential<br />

input, where Vss is the negative input to the ADC. So even<br />

though the signal itself is a single-ended signal, measuring in<br />

differential mode enabled a reduction in the noise and<br />

moreover reduced the noise penalty of using the DC/DC<br />

regulator. This is very good news: engineers can take<br />

advantage of the benefits of the DC/DC while eliminating the<br />

associated noise cost.<br />
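A simple numerical model illustrates the mechanism: ripple that appears identically on both ADC inputs survives a single-ended measurement but cancels in the differential subtraction. The signal level and ripple amplitude are arbitrary assumptions:

```python
import random

# Idealized model of common-mode rejection in a differential measurement:
# regulator ripple couples equally onto both the A+ input and the Vss
# (A-) input, so it cancels in the subtraction.
random.seed(1)
V_SIGNAL = 1.2  # DC voltage under measurement

single_ended, differential = [], []
for _ in range(1000):
    cm_noise = random.gauss(0, 0.010)  # assumed 10 mV rms common-mode ripple
    v_pos = V_SIGNAL + cm_noise
    v_neg = 0.0 + cm_noise             # the Vss input sees the same ripple
    single_ended.append(v_pos)         # measured against a quiet ground
    differential.append(v_pos - v_neg) # common-mode term cancels

spread_se = max(single_ended) - min(single_ended)
spread_diff = max(differential) - min(differential)
print(spread_se, spread_diff)  # the differential spread collapses toward 0
```

In this idealized model the cancellation is perfect; the residual noise seen in Fig. 3.C corresponds to the portion of the real ripple that is not perfectly common to both inputs.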

B. Addressing ADC Noise External to the MCU (Neighboring<br />

Signals)<br />

The noise from the internal regulator is only one possible<br />

source of noise. Other possible noise sources can be<br />

neighboring digital signals, such as I2C or SPI<br />

communications, as well as digital stimuli like a pulse-width<br />

modulated (PWM) waveform. As a general rule it is<br />

recommended to keep these signals as physically far away as<br />

possible from the ADC pins and if possible, inactive during the<br />

ADC measurements. Typically, most IC manufacturers<br />

intentionally keep digital signals away from the analog by<br />

creating dedicated analog pins. In smaller packages, however,<br />

some digital functions may be multiplexed with analog pins or<br />

the digital input/output (I/O) pins can be neighboring the<br />

analog pins. In Fig. 4, the analog input is located immediately<br />

next to a 48-MHz clock output (full rail-to-rail swing) to<br />

represent an SPI clock.<br />

As shown in Fig. 5 and Fig. 6, the increase in noise<br />

(variance) seen with the addition of the neighboring clock<br />

output is greater with the single-ended measurement as<br />

compared with the differential. In the single-ended case, only<br />

the signal A+ is used and the complementary input is left in<br />

general-purpose I/O (GPIO) mode and actively driven low, to<br />

DVSS. In the differential case, the complementary input is<br />

externally connected to AVSS (see Fig. 6).<br />

Although small when compared to the single-ended<br />

example, the differential result indicates that noise is still<br />

present. As a point of discussion, it is important to notice that<br />

the clock is relatively close to the positive leg of the differential<br />

measurement when compared to the separation between the<br />

positive and negative signals of the differential pair. Therefore<br />

the relative coupling will not be equal and the noise will not<br />

appear completely as common-mode. Additionally, the printed<br />

circuit board (PCB) layers below the top signal layer are not<br />

shown which would show the signal return paths. This is a<br />

four-layer PCB, with the 3rd layer providing an almost<br />

completely solid plane so that return currents may follow the<br />

‘path of least impedance’ [6], which for high frequency signals,<br />

such as the 48MHz clock will be directly below the trace. The<br />

second layer provides reference voltages and is split in several<br />

places complicating the coupling between the signal and<br />

ground plane return path. While a more complicated (greater<br />

than 4 layer) PCB can be used to help bring the ground plane<br />

closer to the signal, most issues can be resolved by simply<br />

moving the ‘aggressor’ signal (SPI clock) farther away from<br />

the ‘victim’ (A+/-). Another point to make is the orientation<br />

of the clock signal relative to the analog input(s). Keeping the<br />

Fig. 4. Clock adjacent to ADC input.<br />



Fig. 5. Crosstalk: ‘A’ induced from the adjacent clock onto the single-ended ADC input vs. ‘No Noise’.<br />

Fig. 6. Crosstalk: ‘A’ induced from the adjacent clock onto the differential ADC input vs. ‘No Noise’.<br />

signals separated may not always be possible or worse, the<br />

signals may need to cross. In order to keep coupling to a<br />

minimum, signals should not run in parallel and when needed<br />

cross at a 90 degree angle [6].<br />

As a final note, the idea of noise coupling into the ADC<br />

from neighboring signals is not limited to the ADC input pins.<br />

In the case of an external reference, noise could also be<br />

coupled into the reference before entering the IC and similar<br />

precautions should be taken.<br />

C. Addressing ADC Noise from RF Sources<br />

Making analog measurements coincident with wireless<br />

communication is typically not recommended; in practice, any<br />

communication is done after a measurement, in order to<br />

convey a subset or summary of the<br />

measurement(s). The radio source used in Fig. 7 was an<br />

evaluation module (EVM) which was transmitting 100 random<br />

packets at 50kB (868 MHz, 2-GFSK, 2 kHz deviation). The<br />

EVM was placed adjacent to the MCU test board, so that the<br />

MCU (and ADC) under test was approximately 6cm from the<br />

EVM PCB antenna. Fig. 7 shows that the differential<br />

configuration is superior in noise immunity to the single-ended<br />

one.<br />

Again, the key is that the energy is induced or coupled<br />

uniformly on both the positive and negative inputs of the<br />

differential ADC, so the signal is rejected as common-mode.<br />

And again, the differential is far from ideal and merits<br />

discussion on the potential sources.<br />

The most notable difference between the experiments with<br />

the clock and sub-1GHz radio is the relative coupling area,<br />

shown in Fig. 8. In the case of the clock the coupling area was<br />

most related to where the clock trace ran parallel with the ADC<br />

input lines. After this parallel run the signals diverged: the<br />

ADC signals went off-board to the voltage source being<br />

measured, while the SPI signal terminated at another receiver<br />

input.<br />

It is the off-board connection with minimal shielding which<br />

provides a potential path for the radio energy to couple into the<br />

ADC. Moreover, any differences in electrical length between<br />

the positive and negative inputs to the ADC can cause the<br />

coupled noise to be differential rather than common-mode. One<br />

powerful way to minimize differences in electrical length<br />

between the positive and negative inputs of the ADC is by<br />

designing signal paths which are symmetrical. Fig. 9 is taken<br />

from [8], with the different axes of symmetry for the inputs and<br />

outputs highlighted.<br />

The testing in this section was intended to show the breadth<br />

of improvement made available by differential signaling. The<br />

improvement was seen at an application or implementation<br />

level with interference from a neighboring radio, which can be<br />

applied to Bluetooth and Wi-Fi applications where<br />

electromagnetic compatibility (EMC) is needed.<br />

Fig. 7. Noise induced from a nearby radio.<br />

Fig. 8. Off-board routing of differential inputs.<br />

Improvement was also seen at the board level with<br />

cross-coupling (crosstalk) from a neighboring digital signal.<br />

And finally, improvement was even seen at the IC level, where<br />

a noisy regulator was chosen to achieve lower-power operation<br />

and the sacrifice in ADC performance was mitigated.<br />

IV. CONCLUSION<br />

While differential signaling can be a great tool to achieve<br />

the ADC performance found in datasheets, it cannot supersede<br />

the need to understand the datasheet parameters. Interpreting<br />

ADC datasheet performance to see which device will meet<br />

your needs can be difficult due to all of the nuances and<br />

dependencies of performance described in this paper. This<br />

paper touched on some of the main performance dependencies<br />

and provided tips and trends to help you decide from the<br />

datasheet whether an ADC may meet your performance needs.<br />

In some cases, bench tests in your lab using your specific<br />

configuration may still be required. This paper covered the<br />

main points to look for, but remember there are more.<br />

An application using an ADC cares about the whole analog<br />

front end’s performance. This paper covered the ADC and the<br />

voltage reference, but additional analog front-end pieces which,<br />

if present, must be considered are gain stages, filters, and bias<br />

voltages. The ADC is the last piece of the analog front end, but<br />

additional post-conversion digital filtering can further improve<br />

performance. Also, if the ADC samples well above the<br />

Nyquist rate of the input signal, over-sampling can be<br />

implemented at the system level to improve SNR, as out-of-band<br />

quantization and thermal noise can be filtered out [9].<br />

Fig. 9. An example of signal path symmetry.<br />
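The averaging gain can be sketched numerically: with uncorrelated noise, averaging an oversampled signal reduces the rms noise by the square root of the oversampling ratio (roughly one extra bit per 4x), which is the system-level effect the text describes:

```python
import random

# Monte Carlo sketch of oversampling and averaging with uncorrelated,
# unit-variance noise (values are illustrative, not device data).
random.seed(0)

def rms_error(n_samples, n_trials=2000):
    """rms of the mean of n_samples noisy readings of a zero signal."""
    err2 = 0.0
    for _ in range(n_trials):
        avg = sum(random.gauss(0.0, 1.0) for _ in range(n_samples)) / n_samples
        err2 += avg * avg
    return (err2 / n_trials) ** 0.5

# noise of the average drops by sqrt(OSR): 16x oversampling -> ~4x lower
print(round(rms_error(1) / rms_error(16), 1))  # close to 4.0
```

This only holds while the noise is uncorrelated between samples; correlated interference (a tone from a nearby clock, for example) does not average away and must be handled by the layout and filtering techniques discussed earlier.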

A good starting point to learn more about ADCs is Texas<br />

Instruments’ Precision Labs online classroom, with on-demand<br />

courses and tutorials. The Introduction to Analog and Digital<br />

Converters section explains many of the electrical parametric<br />

values you find in ADC datasheets [10].<br />

REFERENCES<br />

[1] K. Mustafa, “Filtering Techniques: Isolating Analog and<br />

Digital Power Supplies in TI’s PLL-Based CDC<br />

Devices,” Texas Instruments, Application report<br />

SCAA048, October 2011. [Online]. Available:<br />

http://www.ti.com/lit/an/scaa048/scaa048.pdf<br />

[2] D. Megaw, “Voltage Reference Selection Basics,”<br />

SNVA602. [Online]. Available:<br />

http://www.ti.com/lit/an/snva602/snva602.pdf<br />

[3] C. Hall, “It’s in the math: how to convert an ADC code<br />

to a voltage (part 2),” Texas Instruments E2E<br />

Community blog post [Online]. Available:<br />

https://e2e.ti.com/blogs_/archives/b/precisionhub/archive/<br />

2016/04/29/it-39-s-in-the-math-how-to-convert-an-adccode-to-a-voltage-part-2<br />

[4] MSP432P4xx Family Technical Reference Manual page<br />

848 22.2.6.3 Sample Timing Considerations, SLAU356<br />

December 2017. [Online]. Available:<br />

http://www.ti.com/lit/ug/slau356h/slau356h.pdf<br />



[5] C. She, “Top 12 ways to achieve low power using the<br />

features of an integrated ADC,” [Online]. Available:<br />

http://e2e.ti.com/blogs_/b/msp430blog/archive/2016/06/0<br />

6/top-12-ways-to-achieve-low-power-using-the-featuresof-an-integrated-adc<br />

[6] N. Gray, “The problem of ADC and mixed-signal<br />

grounding and layout for dynamic performance while<br />

minimizing RFI/EMI,” Texas Instruments, SNAA113.<br />

[Online]. Available:<br />

http://www.ti.com/lit/wp/snaa113/snaa113.pdf<br />

[7] DAC3482 Data Sheet, Section 10.1 Layout Guidelines,<br />

Texas Instruments, SLAS7487768. [Online]. Available:<br />

http://www.ti.com/product/DAC3482/datasheet/layout#S<br />

LAS7487768<br />

[8] X. Ramus, “The PCB Layout for Low Distortion High-<br />

Speed ADC Drivers,” Texas Instruments, application<br />

report SBAA113, April 2014. [Online]. Available:<br />

http://www.ti.com/lit/an/sbaa113/sbaa113.pdf<br />

[9] “Precision ADC with 16-bit Performance,” Texas<br />

Instruments, application report SLAA821. [Online].<br />

Available: http://www.ti.com/lit/an/slaa821/slaa821.pdf<br />

[10] Texas Instruments Precision Labs - ADCs [Online].<br />

Available: https://training.ti.com/ti-precision-labs-adcs<br />



Embedded Algorithms for Motion Detection and<br />

Processing<br />

Smart sensors with embedded configurable algorithms and machine learning<br />

processing software pave the way to advance innovation and reduce consumption at<br />

system level<br />

M. Castellano¹, R. Bassoli², M. Bianco¹, A. Cagidiaco¹, C. Crippa², M. Ferraina¹, M. Leo¹, S.P. Rivolta¹<br />

¹STMicroelectronics, Castelletto, Italy; ²STMicroelectronics, Agrate, Italy<br />

marco.castellano@st.com<br />

Abstract—MEMS inertial modules are powerful and versatile<br />

converging technologies: mechanical and electronic functions are<br />

merged into a single component, ready to offer physical data to<br />

users about the environment (through wearables or equipment on<br />

which a sensor is mounted). In recent years, drastic reduction in<br />

the power consumption of inertial sensors has opened the door to<br />

a new world of applications. IoT is certainly one, but not the only<br />

example of what can be achieved using battery-operated devices.<br />

This technology is ubiquitous, and innovative smart sensors, able<br />

to further reduce energy consumption and recognize and interpret<br />

their environment autonomously, are on the horizon. New sensors<br />

are able to provide the application with the right feedback<br />

precisely when the application needs it. This paper introduces a<br />

programmable and configurable embedded digital module which<br />

further reduces system power consumption, moving part of the<br />

intelligence into the sensor, and thus keeping the main processor<br />

in sleep mode. The digital module is composed of two embedded<br />

reconfigurable blocks able to solve two main sets of application<br />

requirements. The first block has been developed for systematic<br />

motion recognition using a reconfigurable Finite State Machine;<br />

application examples are motion/no-motion, human gesture and<br />

industrial applications. The second block has been developed for<br />

statistical-based context awareness; using a decision-tree<br />

approach it is possible to perform human activity recognition<br />

(stillness, walking, vehicle motion, etc.), carry-position detection<br />

(on wrist, in pocket, on table, etc.) and machine activity and<br />

movement recognition. These two blocks can be programmed and<br />

mutually concatenated by using a simple GUI running on a<br />

common PC to exploit the full configurability of the digital module<br />

and to meet user needs easily, quickly and effectively. The<br />

embedded digital module allows moving all or part of the<br />

algorithm elaboration to a custom, low-power environment on the<br />

sensor side, reducing communication to the main processor, and<br />

thus reducing overall power consumption.<br />

Index Terms— MEMS, smart sensor, sensor networks, low<br />

power, autonomous system, embedded algorithms, gesture<br />

recognition, context awareness, machine learning, decision tree,<br />

IoT.<br />

I. INTRODUCTION<br />

During the past 10 years, the number of IoT applications has<br />

increased exponentially. Most IoT applications involve<br />

measuring a physical quantity in a location that may not already<br />

have a power source available. It’s often not feasible to add<br />

wiring, so a battery solution is a preferred option, and wireless<br />

connectivity for data transmission is a must. At a minimum, the<br />

IoT application needs a sensor to get data, a medium over which<br />

to transmit, and a battery to supply power to both operations. In a design of this type, a trade-off arises: maximize battery life, or maximize the frequency of data transmission?<br />

A key tool available to the application designer to manage<br />

this trade-off is an elaboration unit, which can perform the<br />

measurements and transmission effectively and efficiently. The<br />

computational unit is usually a general-purpose microcontroller,<br />

targeted for low-power consumption. Since wireless data transmission dominates power consumption with respect to the other processes involved, the strategy in IoT application design is to move computation to the node side whenever doing so reduces communication. For example, let’s<br />

suppose that we have to design a healthcare product which<br />

sounds an alarm when the standard deviation of a certain<br />

measured parameter exceeds a given threshold. A good design choice, considering battery longevity, is to run the algorithm on the transmitter side so that wireless transmission occurs only during the alarm event.<br />
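The healthcare example above can be sketched in a few lines (an illustration only; `radio_send` is a hypothetical callback standing in for the radio driver, not an API from the paper):<br />

```python
import math

def stddev(samples):
    """Population standard deviation of one measurement window."""
    n = len(samples)
    mean = sum(samples) / n
    return math.sqrt(sum((x - mean) ** 2 for x in samples) / n)

def process_window(samples, threshold, radio_send):
    """Run the algorithm on the transmitter side: power the radio
    only when the alarm condition is met, otherwise stay silent."""
    if stddev(samples) > threshold:
        radio_send(b"ALARM")  # hypothetical radio driver callback
        return True
    return False              # no transmission, no radio energy spent
```

Only the windows that actually trigger the alarm cost radio energy; every quiet window is handled entirely on the node.<br />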

The objective of this paper is to introduce a new step in the<br />

reduction of product power consumption thanks to innovative<br />

sensors. The new inertial module LSM6DSOX from<br />

STMicroelectronics allows moving all or part of the algorithm<br />

elaboration to a custom low-power environment in the sensor.<br />

The broad configurability of this approach guarantees a wide<br />

spectrum of applications. This paper is organized as follows: the next section presents the rationale behind the embedded algorithms and their advantages in an application example. Two sections then describe the embedded algorithms themselves. The last section is dedicated to the supporting software, which is user-friendly and can be rapidly adapted to the creation of new applications.<br />



II. EMBEDDED ALGORITHMS SCENARIO<br />

As introduced in the previous section, a simple model of an<br />

IoT application is composed of a transmitter/receiver apparatus,<br />

an elaboration unit, actuators or sensors connected to the<br />

elaboration unit, and a battery. In order to show the advantages<br />

of embedded algorithms on the sensor side, we’ll introduce an<br />

application use case. Although we provide an example, the<br />

subject can be easily extended to many other use cases.<br />

The analyzed use case is a smart bracelet able to perform user-activity recognition and give feedback on it: how long the user has walked, has been in a vehicle, etc. Of course the smart bracelet should also show the date and time to the user, prompted by a wrist-tilt gesture. A key component to perform both transmission and elaboration is a Bluetooth low-energy system-on-chip [4].<br />

This solution embeds a complete Bluetooth network<br />

processor and an elaboration unit for running application code.<br />

The elaboration unit is composed of a low-power<br />

microcontroller, NVM memory for user programs, memories<br />

for data and programming (mirror of NVM) and common<br />

interfaces (SPI, I²C, etc.). With this kind of solution, the<br />

application is able to read/write to the sensor and actuators from<br />

the interfaces, execute an algorithm computation and connect<br />

to an end component (computer/smartphone) by means of<br />

Bluetooth communication protocol. A rough power budget of<br />

this solution can be estimated from the following proposed<br />

system example. The “smart” Bluetooth module with<br />

embedded microcontroller in general has different power<br />

modes. The most common modes are listed below:<br />

a) Sleep mode: this mode is used to minimize power<br />

consumption by turning off or putting most of the internal<br />

blocks in low-power conditions. In sleep mode an interrupt is<br />

monitored for waking up, or an internal RTC timer can be used.<br />

Plenty of options can be configured for blocks in the system in<br />

this mode. Exiting from this mode requires some time to regain<br />

full operative mode (0.5-2 ms). Current consumption of this<br />

mode is in the 0.5-2 µA range.<br />

b) Microcontroller active mode: radio transmitter/receiver<br />

off, microcontroller fully-active elaborating. Current<br />

consumption range of this mode is 1-3 mA.<br />

c) Radio transmitting/receiving mode: Device is<br />

communicating, power consumption is 3-20 mA.<br />

The “smart” Bluetooth power consumption values above are just a rough estimation based on the product datasheet; the purpose of this exercise is to show the advantage of embedding algorithms on the sensor side. Before making the full computation of current consumption in this smart-bracelet case, other useful assumptions must be defined. First, the microcontroller in the smart Bluetooth module is connected to an inertial module using an I²C/SPI interface, configured to<br />

generate sensor data at 25 Hz Output Data Rate. The<br />

microcontroller exits from sleep-mode, reads data from the<br />

sensor and executes the activity-recognition algorithm every<br />

time a sample is generated using an embedded 16 MHz clock<br />

domain. The high-quality activity-recognition algorithm case<br />

requires a mean elaboration time of 4 ms. The Bluetooth<br />

transmission is sporadic on user request (once a day).<br />

Figure 1: Microcontroller sleep to active timing<br />

Figure 1 shows the timing of the duty cycle of the microcontroller while running the algorithm. Tstart is the turn-on time of the microcontroller, Talgo is the execution time of the algorithm, and Todr is the time between sensor reads.<br />

A basic formula for the mean current ITOT, containing its main contributors, is:<br />
ITOT = IBUS + ISLEEP + falgo · IUCORE · (Tstart/2 + Talgo)<br />
IBUS is the current related to the interface bus read; for the SPI bus the contribution should be < 1 µA, while for I²C the range is roughly 2-5 µA. The other variables have already been introduced. A power budget can now be estimated for the sensor-plus-microcontroller system, since radio transmission has been considered negligible due to its sporadic nature. Taking the middle of the declared range of each parameter, an ITOT of around 230 µA is obtained.<br />

The LSM6DSOX embedded algorithm, reconfigured for implementing “activity recognition”, requires less than 8 µA. We would like to point out that we are referring to the same high-quality algorithm, with exactly the same performance as when running on a microcontroller. A significant advantage of the embedded solution is that the data is already available inside the component, so the IBUS contribution is absent. Another contribution which is completely missing in the embedded solution is Tstart, needed for safely exiting the microcontroller sleep state. Estimating ITOT with the two terms Tstart and IBUS set to zero leaves about 200 µA, essentially the falgo · IUCORE · Talgo term; compared with the 8 µA of the embedded solution, porting the algorithm from the microcontroller to the sensor thus yields roughly a 25-fold reduction in power consumption.<br />
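Plugging the mid-range values from the text into the formula gives the budget above (a back-of-the-envelope sketch; the exact figures depend on the actual devices chosen):<br />

```python
# Mid-range parameter values from the text (units: A, s, Hz)
I_BUS   = 3.5e-6    # I2C read current, middle of the 2-5 uA range
I_SLEEP = 1.25e-6   # sleep current, middle of the 0.5-2 uA range
I_UCORE = 2.0e-3    # active MCU current, middle of the 1-3 mA range
T_START = 1.25e-3   # wake-up time, middle of the 0.5-2 ms range
T_ALGO  = 4.0e-3    # mean algorithm execution time
F_ALGO  = 25.0      # 25 Hz output data rate -> one run per sample

def mean_current(i_bus, t_start):
    """ITOT = IBUS + ISLEEP + falgo * IUCORE * (Tstart/2 + Talgo)."""
    return i_bus + I_SLEEP + F_ALGO * I_UCORE * (t_start / 2 + T_ALGO)

mcu_budget = mean_current(I_BUS, T_START)  # ~236 uA, "around 230 uA"
no_overhead = mean_current(0.0, 0.0)       # ~200 uA with IBUS, Tstart at zero
ratio = no_overhead / 8e-6                 # vs. <8 uA on the sensor: ~25x
```

The ~200 µA remainder divided by the 8 µA of the sensor-side implementation is where the roughly 25-fold figure comes from.<br />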

Where does the magic come from? The main consideration<br />

comes from the general-purpose microcontroller versus the<br />

application-specific digital logic. STMicroelectronics has been<br />

a pioneer and a leader in inertial MEMS modules since the<br />

beginning of the MEMS era. The ST software library and<br />

customer requests are well known and consolidated, so the<br />

strategy has been to collect and divide the most common<br />

application use cases into two sets. The first set is composed of<br />

algorithms well-suited to using a Finite State Machine, the<br />

second set is based on applications which need statistical<br />

analysis (based on pattern analysis) and that can be implemented<br />

with a decision tree in an effective way. For the two sets,<br />

a collection of “metacommands” has been implemented, in order<br />

to cover existing algorithms, and to guarantee wide reconfigurability<br />

for new custom requests. At the end of the process, an arithmetic analysis was carried out to find the most effective low-power custom arithmetic logic.<br />

614


Arithmetic simplification has been done to tailor to application<br />

needs, without impacting algorithm performance. In the two<br />

following sections the two blocks and the metadata are<br />

presented.<br />

III. MOTION DETECTION FINITE STATE MACHINE<br />

The purpose of the FSM block is to provide tools that allow<br />

writing compact programs able to recognize user gestures. Each<br />

gesture requires a specific PROGRAM, thus many programs<br />

can be written and concatenated, making an array of programs<br />

named PROGMEM, as shown in Figure 2, to be processed by<br />

an interpreter resident in ROM.<br />

Each program is made of two parts, a data section and an<br />

instructions section. In more detail, a data section is made of a<br />

fixed length part, present in all the programs, and a variable<br />

length part, whose size is specific for each program. Finally, the<br />

instructions section, i.e. the executable part, is made of<br />

conditions and commands, the latter sometimes requiring<br />

parameters to be executed.<br />

If neither the reset nor the next condition is true, the next sample is awaited and both conditions are evaluated again.<br />

Inside the fixed data section, two bytes store respectively the<br />

address of the reset instruction and the address of the current<br />

instruction, i.e. the program pointer is updated every time a next<br />

condition is true or forced to the reset address in case a reset<br />

condition is true.<br />

Since a condition is coded over four bits, a maximum of sixteen<br />

different conditions can be coded. There are four types of<br />

conditions, namely timeouts, threshold comparisons, zero-crossing detection and decision-tree checks. Timeout conditions are true when a counter, preset with a timeout value, reaches zero, while threshold comparisons are true when enabled inputs (such as the accelerometer X, Y, Z axes or the norm V = √(X² + Y² + Z²)) are higher (or lower) than a programmed threshold; zero-crossing detection is true when an enabled input crosses zero, and the decision-tree check condition is true when<br />

the tree result matches the expected result. If the counter has not<br />

yet reached zero or an enabled input is not yet higher (or lower)<br />

than a programmed threshold, or no zero-crossing event has<br />

been detected, or the decision-tree result does not match, then<br />

the condition is false and the program pointer is not updated:<br />

when the next input sample arrives, the conditions are evaluated<br />

again until one of the two (reset or next) becomes true.<br />
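The reset/next evaluation described above can be modeled behaviorally (a sketch for illustration, not the silicon interpreter; only four of the sixteen conditions are shown, with nibble codes taken from TABLE I.):<br />

```python
# Condition nibbles from TABLE I (only a subset is modeled here)
NOP, TI1, GNTH1, LNTH1 = 0x0, 0x1, 0x5, 0x7

def eval_condition(code, sample, ctx):
    """Return True when the 4-bit condition holds for the current sample."""
    if code == NOP:
        return False                  # never advances on its own
    if code == TI1:
        ctx["t1"] -= 1                # timeout: preset counter reaches zero
        return ctx["t1"] <= 0
    if code == GNTH1:
        return sample > ctx["thrs1"]  # enabled axis above threshold 1
    if code == LNTH1:
        return sample <= ctx["thrs1"] # enabled axis at/below threshold 1
    raise ValueError("condition not modeled in this sketch")

def step(cond_byte, sample, ctx):
    """One FSM step: reset has priority over next; otherwise wait."""
    reset, nxt = cond_byte >> 4, cond_byte & 0x0F
    if reset != NOP and eval_condition(reset, sample, ctx):
        ctx["pp"] = ctx["rp"]         # jump back to the reset pointer
    elif eval_condition(nxt, sample, ctx):
        ctx["pp"] += 1                # advance to the next line of code
    # else: program pointer unchanged; wait for the next input sample
```

With the condition byte 0x73, for instance, the high nibble (LNTH1) is the reset condition and the low nibble (TI3 in the real device, TI1 in this sketch) is the next condition.<br />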

Figure 2: Programs organization in memory<br />

The gesture recognition interpreter decodes each instruction of<br />

a program’s instructions section and executes it by operating on<br />

data located in the data section. Each program recognizing a<br />

specific gesture realizes a simple, programmable FSM and is totally independent of the other programs.<br />

The instructions section is the operative part of the FSM. It is<br />

made of conditions and commands. Instructions are also called<br />

states. Each condition is coded in one byte; the highest nibble<br />

codes a reset condition, while the lowest nibble codes a next condition.<br />

A condition is one line of code. Any time a program is executed,<br />

each condition is evaluated in both its parts, i.e. reset and next,<br />

with this priority. If a reset condition is true, the code is<br />

restarted from the beginning, whereas if a next condition is true<br />

the execution progresses to the next line of code.<br />

Figure 3: Example of a program instruction section<br />

TABLE I. shows the possible conditions, TABLE II. shows all<br />

the available commands. TABLE III. shows an example of the<br />

instruction section of a PROGRAM, mixing conditions and<br />

commands.<br />

TABLE I. CONDITIONS<br />
0x0 NOP: No execution on current sample<br />
0x1 TI1: Timeout 1 expired<br />
0x2 TI2: Timeout 2 expired<br />
0x3 TI3: Timeout 3 expired<br />
0x4 TI4: Timeout 4 expired<br />
0x5 GNTH1: Any triggered axis > THRS1<br />
0x6 GNTH2: Any triggered axis > THRS2<br />
0x7 LNTH1: Any triggered axis ≤ THRS1<br />
0x8 LNTH2: Any triggered axis ≤ THRS2<br />
0x9 GLTH1: All triggered axes > THRS1<br />
0xA LLTH1: All triggered axes ≤ THRS1<br />
0xB GRTH1: Any triggered axis > −THRS1<br />
0xC LRTH1: Any triggered axis ≤ −THRS1<br />
0xD PZC: Any triggered axis crossed zero, positive slope<br />
0xE NZC: Any triggered axis crossed zero, negative slope<br />
0xF CHKDT: Check result from decision tree vs. expected<br />
TABLE II. COMMANDS<br />

0x00 STOP: Stop execution and wait for new start<br />
0x11 CONT: Continue execution from reset-point<br />
0x22 CONTREL: Like CONT but reset temporary mask<br />
0x33 SRP: Set reset-point to next address/state<br />
0x44 CRP: Clear reset-point to first program line<br />
0x55 SETP: Set parameter in the program data section<br />
0x66 SELMA: Select MASKA and TMASKA as current mask<br />
0x77 SELMB: Select MASKB and TMASKB as current mask<br />
0x88 SELMC: Select MASKC and TMASKC as current mask<br />
0x99 OUTC: Write the temporary mask in the output register<br />
0xAA STHR1: Set new value to THRESH1<br />
0xBB STHR2: Set new value to THRESH2<br />
0xCC SELTHR1: Select THRESH1 instead of THRESH3<br />
0xDD SELTHR3: Select THRESH3 instead of THRESH1<br />
0xEE SISW: Swap sign to opposite in selected mask<br />
0xFF REL: Reset temporary mask to default<br />
0x12 SSIGN0: Set UNSIGNED comparison mode<br />
0x13 SSIGN1: Set SIGNED comparison mode<br />
0x14 SRTAM0: Do not reset temporary mask after a next condition is true<br />
0x21 SRTAM1: Reset temporary mask after a next condition is true<br />
0x23 SINMUX: Set input multiplexer<br />
0x24 STIMER3: Set new value to TIMER3 register<br />
0x31 STIMER4: Set new value to TIMER4 register<br />
0x32 SWAPMSK: Swap mask selection MASKA ↔ MASKB<br />
0x34 INCR: Increase long counter +1<br />
0x41 JMP: Jump address for two Next conditions<br />
0x42 CANGLE: Clear angle<br />
0x43 SMA: Set MASKA and TMASKA<br />
0xDF SMB: Set MASKB and TMASKB<br />
0xFE SMC: Set MASKC and TMASKC<br />
0x5B SCTC0: Clear the Time Counter TC on next condition true<br />
0x7C SCTC1: Don’t clear the Time Counter TC on next condition true<br />
0xB5 SETR: Set external registers at given address with given data<br />
0xC7 UMSKIT: Unmask interrupt generation when setting OUTS<br />
0xEF MSKITEQ: Mask interrupt if OUTS does not change<br />
0xF5 MSKIT: Mask interrupt generation when setting OUTS<br />

TABLE III. EXAMPLE OF A PROGRAM INSTRUCTION SECTION<br />
STATE 0 (NOP GNTH1): go to the next state if an enabled input is greater than TH1; otherwise wait<br />
STATE 1 (LNTH1 TI3): stay over TH1 for TI3 seconds; if the input goes down before, restart from STATE 0<br />
STATE 2 (OUTC): after TI3 seconds, output the temporary mask and the interrupt<br />
STATE 3 (SRP): set the reset pointer to STATE 4<br />
STATE 4 (NOP LNTH1): go to the next state if the triggered input is lower than TH1; otherwise wait<br />
STATE 5 (GNTH1 TI3): stay under TH1 for TI3 seconds; if the input goes up before, restart from STATE 4<br />
STATE 6 (CRP): after TI3 seconds, clear the reset pointer to STATE 0<br />
STATE 7 (CONTREL): output the temporary mask and the interrupt, reset the temporary mask and continue from STATE 0<br />


The interpreter decodes instructions and executes them; actions<br />

are based on data stored inside the data section.<br />

Each program realizing an FSM able to recognize a specific<br />

gesture has its own data set. For example, the instruction section in the previous example needs the following data to work properly:<br />

- Threshold 1 value<br />

- Timeout 3 value<br />

- Mask to enable the relevant axis among XYZV to trigger<br />

the events<br />

- Timer to count wait times<br />

These are the most commonly used data, also called<br />

resources; however five more data are available to be declared<br />

and used:<br />

- Hysteresis value, to be added/subtracted to/from threshold<br />

values when performing comparisons of the kind “greater<br />

than” / ”less than”<br />

- Decimation mechanism, in case the FSM has to be<br />

executed not at every input sample but at a lower<br />

frequency; in these cases two bytes must be reserved, one<br />

with the decimation value and another for the decimation<br />

counter<br />

- Previous axis sign (PAS), declared and used in case a zero-crossing condition PZC or NZC is present in the<br />

instructions section, to store the previous input sample<br />

XYZV signs<br />

- Memory locations to store gyroscope integrated angles<br />

and ODR period duration when using FSM with such<br />

input data<br />

- Decision-tree interface handling<br />

Simpler programs than the example above could use less data, whereas more complex programs could use more (e.g. additional thresholds, timeouts and/or masks). In order to implement an efficient and effective data-instructions structure, a fixed data section, present in all the<br />

programs, stores information about the amount of resources to<br />

be used by the program. The user must carefully fill it and<br />

reserve memory locations accordingly in the variable data<br />

section.<br />

Six bytes store information about the variable-data section and<br />

the instructions section:<br />

CONFIG_A: masks, thresholds, long timeouts, short timeouts<br />

CONFIG_B: decimation, hysteresis, gyro angles, PAS<br />

SIZE: length in bytes of the whole program data + instructions<br />

SETTINGS: flags used in the instruction section processing<br />

PP: program pointer<br />

RP: reset pointer<br />

The variable data section is normally different in size between<br />

two FSM programs. It collects all parameters needed by the<br />

program, such as masks (1-3), thresholds (1-3), timeouts (1-4)<br />

etc. The resources in the variable data part are declared in the<br />

CONFIG_A and CONFIG_B bytes belonging to the fixed data<br />

part. In this way the interpreter can easily process the variable<br />

data part of a given program knowing exactly what is stored<br />

there owing to the information obtained from the fixed data<br />

part.<br />

Returning to the previous TABLE III. example, TABLE IV.<br />

and TABLE V. show the data section consistent with the<br />

instruction section.<br />

TABLE IV. EXAMPLE OF A FIXED DATA PART<br />
CONFIG_A = 01010001: 1 mask, 1 threshold, 1 short timeout<br />
CONFIG_B = 00000000: no other resources<br />
SIZE = 00010100: 20 bytes length<br />
SETTINGS = 00000000: no special flags<br />
PP = 00000000: starting value of the program pointer<br />
RP = 00000000: starting value of the reset pointer<br />

TABLE V. EXAMPLE OF A VARIABLE DATA PART<br />

THRESH1_<br />

LSB<br />

THRESH1_<br />

MSB<br />

00000000<br />

00111000<br />

MASKA 10 00 00 00<br />

0.5 g accelerometer X+<br />

axis threshold<br />

X+ accelerometer axis<br />

mask<br />

TMASKA 00 00 00 00 Temporary mask A<br />

TC 00000000 Timer starting value<br />

TI3 00001111 Timeout3: 15 samples<br />

The whole example FSM program is thus made of 20 bytes: 12 bytes of data section (shown in TABLE IV. and TABLE V.) and 8 bytes of instruction section (shown in TABLE III.). Starting from the data of a three-axis accelerometer, this 20-byte FSM example program is able to detect wrist tilt on/off, useful in a smartwatch or fitness bracelet application.<br />
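To make the byte layout concrete, the 20-byte wrist-tilt program can be assembled as follows (a sketch based on TABLES I.-V.; the opcodes are taken from the tables, while the overall packing order is an assumption for illustration, so the device documentation remains the authoritative format):<br />

```python
# Condition nibbles (TABLE I) and command opcodes (TABLE II)
NOP, TI3, GNTH1, LNTH1 = 0x0, 0x3, 0x5, 0x7
OUTC, SRP, CRP, CONTREL = 0x99, 0x33, 0x44, 0x22

def cond(reset, nxt):
    """Pack a condition byte: reset in the high nibble, next in the low."""
    return (reset << 4) | nxt

# Fixed data section (TABLE IV): CONFIG_A, CONFIG_B, SIZE, SETTINGS, PP, RP
fixed = bytes([0b01010001, 0b00000000, 20, 0, 0, 0])

# Variable data section (TABLE V): THRESH1 (0.5 g), MASKA, TMASKA, TC, TI3
variable = bytes([0x00, 0x38, 0b10000000, 0, 0, 15])

# Instruction section (TABLE III): the wrist-tilt on/off state machine
instructions = bytes([
    cond(NOP, GNTH1),  # STATE 0: wait until input > TH1
    cond(LNTH1, TI3),  # STATE 1: stay over TH1 for TI3 samples, else reset
    OUTC,              # STATE 2: output temporary mask and interrupt
    SRP,               # STATE 3: set reset pointer to STATE 4
    cond(NOP, LNTH1),  # STATE 4: wait until input <= TH1
    cond(GNTH1, TI3),  # STATE 5: stay under TH1 for TI3 samples, else reset
    CRP,               # STATE 6: clear reset pointer back to STATE 0
    CONTREL,           # STATE 7: output, clear mask, continue from STATE 0
])

program = fixed + variable + instructions
assert len(program) == program[2] == 20  # SIZE byte matches the total length
```

Note how STATE 1 packs LNTH1 (reset) and TI3 (next) into the single byte 0x73, matching the nibble convention described earlier.<br />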

IV. MACHINE LEARNING PROCESSING (MLP)<br />

The Finite State Machine presented in the previous section relies on deductive reasoning: it starts out with a hypothesis and examines the possibilities to reach a specific logical state. For motion detection algorithms this<br />

implies finding “rules” to be satisfied in a sequence of events.<br />

This approach works for most gesture detection algorithms, but surely not for all. For example, a phone-up to phone-down gesture algorithm can be solidly based on the fact that the gravity detected by the accelerometer in the phone lies mainly on one axis and will be inverted on that axis over time.<br />

Gesture definition can be changed based on a few parameters:<br />

definition of axis, threshold and time to complete the sequence.<br />

A different motion algorithm like walking detection could<br />

hardly be defined by means of a simple state machine, since the<br />

number of variables would dramatically increase: sensor<br />

positioning, frequency, terrain and personal behavior render the<br />



sensed signal widely variable. From the last example it is possible to extract a more general concept: while the statistical variance of the phone-up to phone-down gesture across a population is narrow, allowing deductive-reasoning application design, a walking gesture exhibits broad statistical variance, so deductive reasoning should be abandoned in favor of inductive reasoning.<br />

The idea behind Machine Learning Processing is to allow the<br />

implementation on silicon of data-driven algorithms, exploiting<br />

the capability of building a model from input patterns. Over the<br />

last decade the explosion of the Internet and IOT has made<br />

available an enormous quantity of information. Following the<br />

increase in quantity of data, tools to manage these collections of<br />

data have been developed, in order to make them effective for<br />

applications. MLP is considered a suitable solution to implement<br />

data-driven algorithms on inertial sensors. MLP is highly<br />

reconfigurable, effective in the field of inertial sensors,<br />

implemented in an ultra-low-power domain, and therefore suitable for battery-operated products, for example IoT devices.<br />

An important branch of machine learning is data mining:<br />

“data mining is an interdisciplinary field bringing together<br />

techniques from machine learning, pattern recognition, and<br />

statistics" [1][2] with the aim of knowledge discovery.<br />

The output of the data-mining tool is a decision tree: the application design starts from a collection of patterns, and ends with loading the resulting decision tree onto the MLP. The<br />

entire process of the application design is supervised by<br />

supporting software that is described in the next section. In the<br />

present section the set of basic blocks behind the MLP is<br />

introduced.<br />

The general scheme is illustrated in Figure 4.<br />

As inputs for the algorithm, the user can configure data from up to 3 sensors. The gyroscope and accelerometer blocks are internal to the sensor, but data from an external sensor such as a magnetometer can be read over an embedded I²C master. Input sensor data is composed of the axis and the magnitude values of the physical sensor (TABLE VI.).<br />

TABLE VI. INPUT TYPES FOR MLP<br />
Accelerometer: accX, accY, accZ (axis); accV, accV² (magnitude)<br />
Gyroscope: gyX, gyY, gyZ (axis); gyV, gyV² (magnitude)<br />
External sensor: magX, magY, magZ (axis); magV, magV² (magnitude)<br />

A wide set of configurable filters is available to condition the input data, as illustrated in the following table (TABLE VII.).<br />
TABLE VII. FILTER TYPES IN MLP<br />
Order 1: High Pass, Generic IIR<br />
Order 2: Band Pass, Generic IIR<br />

Both raw and filtered data can be used as inputs for the<br />

feature block: this block performs statistical computation of<br />

data, and can be configured to output up to 19 different statistical<br />

features. The list of the available features is given in TABLE<br />

VIII. There are two main sets of features, triggered and<br />

windowed: the former are elaborated at a feature event, the latter<br />

at fixed window time intervals. While all features can be<br />

calculated as windowed or triggered depending on user<br />

configuration, only a subset of these features can generate a<br />

trigger.<br />

TABLE VIII. AVAILABLE STATISTICAL FEATURES IN MLP<br />

Figure 4: MLP General Scheme<br />

From the figure it is possible to deduce the boundary between<br />

software and hardware layers. The application starts from<br />

patterns of sensor data which describe the knowledge that MLP<br />

has to understand while running. For example an activity<br />

recognition algorithm starts from patterns involving activities to<br />

be recognized (walking, running, moving vehicle, no motion,<br />

etc.), with the aim that the MLP outputs the result of the current activity directly from the sensor data.<br />

FEATURE: TRIGGER GENERATION<br />
Mean: No<br />
Variance: No<br />
Energy: No<br />
Peak to Peak: No<br />
Zerocross: No<br />
Zerocross trigger gen: Yes<br />
Positive Zerocross: No<br />
Positive Zerocross trigger gen: Yes<br />
Negative Zerocross: No<br />
Negative Zerocross trigger gen: Yes<br />
Peak detector: No<br />
Peak detector trigger gen: Yes<br />
Positive peak detector: No<br />
Positive peak detector trigger gen: Yes<br />
Negative peak detector: No<br />
Negative peak detector trigger gen: Yes<br />
Min: No<br />
Max: No<br />
Duration: No<br />
Clock Feature: Yes<br />
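As a behavioral sketch of the feature block (my own minimal reimplementation for illustration; the silicon computes these in dedicated low-power hardware), a few of the TABLE VIII. features over one window could be computed as:<br />

```python
def window_features(samples):
    """Compute a few TABLE VIII-style statistics over one sample window."""
    n = len(samples)
    mean = sum(samples) / n
    variance = sum((x - mean) ** 2 for x in samples) / n
    energy = sum(x * x for x in samples) / n
    peak_to_peak = max(samples) - min(samples)
    # Zero crossings: sign changes between consecutive samples
    # (one simple definition among several possible ones)
    zerocross = sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )
    return {
        "mean": mean,
        "variance": variance,
        "energy": energy,
        "peak_to_peak": peak_to_peak,
        "zerocross": zerocross,
        "min": min(samples),
        "max": max(samples),
    }
```

Windowed features would be evaluated at fixed time intervals, while triggered features would be evaluated whenever one of the trigger-generating features fires.<br />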

At the end of the feature configuration step, the software tool described in the following section can output a configuration file to be loaded on the device for MLP configuration, and an ARFF file for the data-mining tool. The ARFF file obtained matches the silicon implementation of the MLP computation. The data-mining tool processing the ARFF file is able to refine (or “determine”) the best set of features for a specific application case, and outputs a decision tree together with its statistical performance.<br />

After elaboration and feedback from the data-mining tool, it<br />

is possible to reprocess the data and optimize the set of features.<br />

When the performance matches the expectation, the decision<br />

tree can be loaded on the MLP by means of a configuration file<br />

produced by the STM software tool.<br />
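Conceptually, the decision tree loaded on the MLP is a chain of threshold checks on the computed features. A minimal sketch (the node layout, feature names and class labels here are invented for illustration and are not the device’s binary format):<br />

```python
# A node is either a class label (leaf) or a (feature, threshold, left, right)
# tuple; left is taken when feature <= threshold, right otherwise.
TREE = (
    "variance", 0.05,
    "stillness",           # low variance -> not moving
    ("mean", 1.2,
     "walking",            # moderate mean acceleration
     "vehicle"),           # high mean -> vehicle-like vibration profile
)

def classify(node, features):
    """Walk the tree until a leaf (class label string) is reached."""
    while not isinstance(node, str):
        feature, threshold, left, right = node
        node = left if features[feature] <= threshold else right
    return node
```

Each evaluation costs only a handful of comparisons, which is what makes a decision tree a good fit for an ultra-low-power datapath.<br />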

V. SUPPORTING SOFTWARE<br />

Two dedicated tools have been developed to allow the<br />

programmability of the MEMS sensor, the first for the Finite<br />

State Machine configuration, the second for decision-tree<br />

configuration using a statistical-based / machine-learning<br />

approach. These tools make the device configuration process<br />

easy and fast.<br />

The tools for Finite State Machine and decision-tree<br />

configuration work as an extension of the Unico GUI (the<br />

Graphical User Interface for all the MEMS sensor<br />

demonstration boards available in the STMicroelectronics<br />

portfolio [5]). Unico interacts with a motherboard [5][6] based on<br />

the STM32 microcontroller, which enables the communication<br />

between the MEMS sensor and the PC GUI. The software<br />

visualizes the output of the sensors in both graphical and<br />

numerical format, and allows the user to save or generally<br />

manage data coming from the device.<br />

Unico allows access to the MEMS sensor registers, enabling<br />

fast prototype of register setup and easy testing of the<br />

configuration directly on the device. It is possible to save the<br />

current registers configuration in a text file, and load a<br />

configuration from an existing file. In this way, the sensor can<br />

be reprogrammed in few seconds.<br />

The Finite State Machine and Machine Learning tools abstract the process of register configuration by automatically generating configuration files for the device. The user just needs to set a few parameters in the GUI and the configuration file is generated automatically. A set of configuration files is already available and can be distributed to users. The user can modify these configurations, and can also create their own library of configuration files by generating new configurations with the tools.<br />

A. Finite state machine tool<br />

The State Machine tool extension of Unico allows the user to configure the state machines and test their functionality. Several tabs are available in this tool:<br />

- A configuration tab for setting up the state machines, writing the configuration to the MEMS sensor, and loading and saving configuration files (Figure 5).<br />

- An interrupt tab showing sensor data and the interrupts generated by the state machine execution (Figure 6).<br />

- A debug tab for injecting data into the sensor and debugging the state machine execution sample by sample (Figure 7).<br />

Figure 5: Finite State Machine configuration tab<br />

Figure 6: Sensor data and Interrupt generation tab<br />


www.embedded-world.eu


Figure 7: Debug tab and step-by-step data insertion<br />

B. Machine learning tool<br />

The statistics-based / machine-learning algorithms require the collection of data logs, which is possible using the Unico GUI. An expected result must be associated with each data log (e.g. no motion, walking, running). The tool collects these data patterns to compute the features.<br />

Figure 9: Configuration tab<br />

The ARFF file is the starting point for the decision-tree generation process. The decision tree can be generated by different machine-learning tools. Weka [7], software developed by the University of Waikato, is able to generate a decision tree starting from the Attribute-Relation File. Through Weka it is possible to evaluate which attributes are good for the decision tree, and different decision-tree configurations can be implemented by changing the parameters available in Weka.<br />
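At each node of such a decision tree, the attribute whose split best separates the classes is selected. As a rough illustration of one common selection criterion, the sketch below computes the information gain of a threshold split on a toy feature; the feature values, labels and threshold are invented for illustration and are not taken from the paper's dataset or from Weka's internals.<br />

```python
import math

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    ent = 0.0
    for c in set(labels):
        p = labels.count(c) / total
        ent -= p * math.log2(p)
    return ent

def information_gain(values, labels, threshold):
    """Gain of splitting (feature value, label) pairs at `threshold`."""
    left = [l for v, l in zip(values, labels) if v <= threshold]
    right = [l for v, l in zip(values, labels) if v > threshold]
    n = len(labels)
    split_ent = (len(left) / n) * entropy(left) + (len(right) / n) * entropy(right)
    return entropy(labels) - split_ent

# Toy feature: mean acceleration norm per window; labels: activity class.
feature = [0.1, 0.2, 0.15, 1.2, 1.4, 1.1]
labels = ["still", "still", "still", "walk", "walk", "walk"]
print(information_gain(feature, labels, 0.5))  # perfect split -> gain 1.0
```

A split with zero gain would leave the classes as mixed as before; Weka's tree builders apply this kind of criterion recursively over all candidate attributes.<br />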

Figure 8: Data Patterns tab<br />

The tool allows selecting filters to be applied to the raw data, and features to be computed from the filtered data. The computed features will be the attributes of the decision tree. After a few steps, an Attribute-Relation File (ARFF) is generated by the tool.<br />
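An ARFF file pairs a header of attribute declarations with comma-separated data rows. The following minimal sketch emits such a file for a set of computed features; the relation name, attribute names and values are hypothetical, chosen only to illustrate the format consumed by Weka, not the output of the ST tool.<br />

```python
def write_arff(relation, attributes, classes, rows):
    """Emit a minimal ARFF document as a string.

    attributes: numeric feature names; classes: nominal class values;
    rows: (feature_values, class_label) pairs.
    """
    lines = [f"@RELATION {relation}", ""]
    for name in attributes:
        lines.append(f"@ATTRIBUTE {name} NUMERIC")
    lines.append("@ATTRIBUTE class {" + ",".join(classes) + "}")
    lines += ["", "@DATA"]
    for values, label in rows:
        lines.append(",".join(str(v) for v in values) + "," + label)
    return "\n".join(lines)

# Hypothetical feature set for an activity-recognition log.
arff = write_arff(
    "activity",
    ["acc_mean", "acc_peak"],
    ["still", "walking", "running"],
    [([0.98, 1.02], "still"), ([1.1, 2.4], "walking")],
)
print(arff)
```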

Figure 10: Attributes view in Weka<br />



Figure 11: Decision-tree generation in Weka<br />

Once the decision tree has been generated, it can be uploaded to the ST tool to complete the generation of the register configuration for the MEMS sensor.<br />

The Unico GUI, by accessing the sensor registers, can read the status of the decision-tree outputs.<br />

VI. APPLICATION CASE EXAMPLE<br />

Starting from the example presented in the second section, some current consumption measurements have been taken. An activity recognition algorithm has been chosen as the example because it offers two benefits: its performance can be unambiguously evaluated on a patterns database, and its current consumption, when running on common general-purpose microcontrollers, is on the order of hundreds of µA. The MLP can easily be configured, by means of the supporting software presented in the previous section, to run the activity recognition algorithm.<br />

TABLE IX. CURRENT REQUIREMENTS<br />

Implementation: Current mean [µA]<br />

MLP on LSM6DSOx (additional current consumption): 7<br />

Cortex-M3 STM32L152RE @ 32 MHz: 240<br />

TABLE IX. summarizes the current requirement of the activity recognition algorithm running on a Cortex-M3 [8][9][10], and the additional current requirement for the same algorithm running on the LSM6DSOx MLP.<br />

VII. CONCLUSIONS<br />

The world is becoming more connected: devices are linked together to exchange massive quantities of data. IoT applications rely on three key building blocks: sensing, intelligence and connectivity. In this paper a highly configurable digital module embedded in an inertial sensor has been introduced. The digital module adds intelligence to the sensor, allowing significant power savings at system level. To make application prototyping immediate, supporting configuration software for the digital module is supplied along with the hardware. The application case in the previous section clearly shows that, thanks to the digital module, the reduction in current consumption is dramatic: roughly 240 µA on a Cortex-M3 versus 7 µA of additional consumption on the MLP. Smart sensors are enablers for new applications where battery life is crucial.<br />
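As a back-of-the-envelope illustration of the TABLE IX. figures, the sketch below converts the two current draws into battery lifetimes. The 220 mAh coin-cell capacity is a hypothetical assumption, and it pretends that only the recognition function draws current; real systems have many other consumers, so these numbers only illustrate the ratio.<br />

```python
# Rough battery-life illustration of the TABLE IX. figures (assumption:
# a 220 mAh coin cell, with only the recognition function drawing current).
CAPACITY_MAH = 220.0

def lifetime_hours(current_ua):
    """Hours until the cell is drained at a constant current in microamps."""
    return CAPACITY_MAH / (current_ua / 1000.0)

mcu_hours = lifetime_hours(240.0)  # algorithm on the Cortex-M3
mlp_hours = lifetime_hours(7.0)    # additional current of the MLP

print(f"Cortex-M3: {mcu_hours / 24:.0f} days")
print(f"MLP:       {mlp_hours / 24:.0f} days")
print(f"ratio:     {mlp_hours / mcu_hours:.1f}x")
```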

ACKNOWLEDGMENT<br />

The authors thank the STMicroelectronics Analog MEMS Sensor division for discussions, encouragement and support.<br />

REFERENCES<br />

[1] S. Sumathi and S.N. Sivanandam: Introduction to Data Mining<br />

Principles, Studies in Computational Intelligence (SCI) 29, 1–20<br />

(2006).<br />

[2] V. Sze, Y. H. Chen, J. Emer, A. Suleiman and Z. Zhang,<br />

"Hardware for machine learning: Challenges and opportunities,"<br />

2017 IEEE Custom Integrated Circuits Conference (CICC),<br />

Austin, TX, 2017, pp. 1-8.<br />

[3] V. Sze, "Designing Hardware for Machine Learning: The<br />

Important Role Played by Circuit Designers," in IEEE Solid-State<br />

Circuits Magazine, vol. 9, no. 4, pp. 46-54 , Fall 2017.<br />

[4] STMicroelectronics, “Bluetooth® low energy wireless system-on-chip,”<br />

BlueNRG-2 datasheet, November 2017,<br />

[DocID030675 Rev 2].<br />

[5] STMicroelectronics Analog MEMS Sensor Application Team, Unico GUI User manual, Rev. 5, October 2016.<br />

[6] STMicroelectronics Technical Staff, STEVAL-MKI109V3<br />

Professional MEMS Tool motherboard for MEMS adapter<br />

boards, July 2016<br />

[7] Ian H. Witten, Eibe Frank, and Mark A. Hall. 2011. Data Mining:<br />

Practical Machine Learning Tools and Techniques (3rd ed.).<br />

Morgan Kaufmann Publishers Inc., San Francisco, CA, USA.<br />

[8] STMicroelectronics, “Ultra-low-power 32-bit MCU ARM®-based Cortex®-M3 with 512KB Flash, 80KB SRAM, 16KB EEPROM, LCD, USB, ADC, DAC,” STM32L151xE STM32L152xE datasheet, Rev. 9, August 2017.<br />

[9] STMicroelectronics Technical Staff, STM32 Nucleo-64 boards,<br />

NUCLEO-XXXXRX NUCLEO-XXXXRX-P data brief, Rev. 10<br />

December 2017.<br />

[10] STMicroelectronics Technical Staff, Sensor and motion algorithm software expansion for STM32Cube, X-CUBE-MEMS1 data brief, Rev. 10, November 2017.<br />



Parallel Architectures for Object-Based Sensor<br />

Fusion on Automotive Embedded Systems<br />

Florian Fembacher<br />

Infineon Technologies AG<br />

Neubiberg, Germany<br />

Email: florian.fembacher@infineon.com<br />

Abstract—Autonomous or highly automated driving has been an emerging development in science and industry over the last decades. Sensor fusion, in which information from different sensors is combined to achieve higher measurement accuracy or to create new information, is a key enabler for this technology. Although much research has been done on developing and improving algorithms for advanced driver assistance systems (ADAS), the development of automotive embedded hardware capable of running those applications is still in its beginnings. Automotive embedded systems have to meet special requirements compared to other embedded applications; important factors are unit cost, energy consumption and safety requirements (cf. ISO 26262). Since power consumption increases proportionally with CPU frequency, it seems reasonable to use multi-core architectures running at a lower frequency to meet the computational requirements.<br />

This paper studies the benefits of using parallel architectures for object-based sensor fusion on automotive embedded systems. For the evaluation, a simulation using Kalman filtering for the state estimation and an auction algorithm for the data association was implemented. All simulations were performed on an NVIDIA DRIVE PX 2 board containing four ARM A57 cores and a Pascal GPU. The results show that multi-core processors can be used to efficiently speed up object-based sensor fusion in embedded systems, whereas a GPU-based implementation largely suffers from high latency caused by memory accesses.<br />

Keywords—sensor fusion, embedded system<br />

I. INTRODUCTION<br />

In recent years a lot of effort has been made to develop highly automated vehicles [1], [2], [3]. Many advanced driver assistance systems (ADAS) such as brake assist, lane keeping or traffic jam assistants have already come onto the market in the last couple of years. The promise of all these systems is to offer more safety by preventing accidents and to increase the driver’s comfort by carrying out simple driving tasks. Currently most progress in autonomous driving can be seen on highways, since driving in urban areas is much more challenging due to the number of monitoring and decision tasks that have to be performed in such a scenario.<br />

Table I illustrates the six levels of automation defined by<br />

SAE J3016. In the first three levels the human driver is<br />

actually monitoring the driving environment, whereas in the<br />

last three levels the automated driving system is monitoring<br />

the environment. At level 0 the driver is responsible for all<br />

driving tasks. At level 1, simple systems are assisting the driver<br />

by either steering or accelerating, but still the human driver is<br />

needed to perform all other driving tasks. At level 2, which is<br />

TABLE I<br />

AUTOMATION LEVEL ACCORDING TO SAE J3016<br />

Level 0: Driver only, no assistance<br />

Level 1: Driver Assistance<br />

Level 2: Partial automation<br />

Level 3: Conditional automation<br />

Level 4: High automation<br />

Level 5: Full automation<br />

already available in premium vehicles, the system is capable of both steering and accelerating the vehicle. At level 3 all driving tasks are performed by the system, but the human driver has to intervene if necessary. At level 4 the system is essentially driving autonomously, but the driver still has to take over the dynamic driving tasks if the system cannot handle a certain situation. Finally, at level 5 the system performs all driving modes and no human interaction is needed any more. The following work focuses on level 2 and its requirements for an automotive ECU.<br />

For autonomous driving capabilities it is necessary to model the surrounding world of a vehicle in real time. At level 2, radar and camera sensors are used to monitor objects in the surrounding world. These objects are used to create a model that serves as an abstract representation of the real world. An object model usually contains geometric and dynamic information. To create a global state, all the sensor input has to be fused on a central unit. To keep track of all objects over time, an association between sensor measurements and detected objects has to be performed. For complex rural or urban areas this approach might not be sufficient, since it lacks important information such as free and occupied space. For this purpose, occupancy grids [4], which model the occupancy of the environment by a probability grid, seem more promising. In contrast to the object-based approach, this representation lacks dynamic information about objects. To compensate for the respective advantages and disadvantages, a combination of both approaches is possible [5].<br />

In 2016 Rakotovao et al. published two papers in which they describe approaches to integrating grid-based sensor fusion on less powerful automotive embedded systems. In [6] they present an efficient traversal algorithm to find cells covered by a sensor beam, and in [7] an integer-based implementation for ECUs without floating-point units.<br />

The objective of this paper is to analyze the suitability of parallel<br />



architectures for object-based sensor fusion with a dynamically changing number of objects on less powerful automotive embedded systems. For this purpose a general multi-target tracking software framework for the evaluation of the computational requirements of object-based sensor fusion was developed. Multi-target tracking is a well-studied subject from the 80s and 90s and will not be discussed in great detail; further information can be found in [8], [9] and [10]. A fully working multi-target tracking system implemented entirely with recurrent neural networks was published in 2016 by Milan et al. Although this approach showed promising results, it is not suitable for current automotive embedded system architectures.<br />

The Kalman filter (KF) and its variants are probably the most widely used approach for target tracking. For the data association, the auction algorithm (AA) is a suitable choice for embedded systems because of its low computational requirements. For the implementation, the KF was chosen for the state estimation and the AA for the data association; both will be described in further detail in section II, including a detailed complexity analysis. In section III, results for different levels of parallelism are presented.<br />

In [11] a parallel architecture for Kalman tracking in FPGAs is presented, resulting in a significant speed-up. An optimization method for graphics processing units can be found in [12]. Nevertheless, the presented approaches are mainly suitable for state models with large dimensions and do not focus on the computation of several hundred KFs in parallel. In the given multi-target tracking application, typically a constant-velocity or constant-acceleration model with a low dimension is used. For this reason a SIMD accelerator was used to speed up the matrix computations, while the KF computation for different targets was distributed over the available cores. A profound theoretical discussion of parallel implementations of the AA can be found in [13].<br />

II. BACKGROUND<br />

A. Sensor Fusion<br />

1) Framework: Fig. 1 shows a general setup of a multi-target tracking system. The process can be described in four basic steps that have to be executed repeatedly. In the first step, the sensor input and the predictions of the existing tracks have to be associated. If a track can be associated with a measurement, it is updated in the second step; otherwise a new track has to be initialized. In the next step, track management is needed to delete invalid tracks or to fuse tracks that are recognized as belonging to the same object. In the last step, the existing tracks are predicted.<br />

2) Time Dependencies: In this section we elaborate on the time requirements and dependencies that arise for a multi-target tracking system. First we define the following sets:<br />

Definition 1: S = {s_1, ..., s_n | s_i (i ∈ {1, ..., n}) is the i-th sensor in the multi-target tracking system}.<br />

Definition 2: D = {d_1, ..., d_n | d_i (i ∈ {1, ..., n}) is the dimension of a measurement vector of sensor s_i}.<br />

Definition 3: M = {m_1, ..., m_n | m_i (i ∈ {1, ..., n}) is the maximal number of objects that are observable by sensor s_i}.<br />

Fig. 1. Framework of a multi-target tracking system (blocks: Sensor Data → Association → State Update on association, or State Initialisation on no association → Track Management → State Prediction).<br />

Definition 4: ∆T = {δt_1, ..., δt_n | δt_i (i ∈ {1, ..., n}) is the time span between two consecutive measurements of sensor s_i}.<br />

Definition 5: P = {p_1, ..., p_n | p_i (i ∈ {1, ..., n}) is the maximal processing time needed for an object list measured by sensor s_i}.<br />

As depicted in Fig. 2, we make the assumption that all sensors send their object lists asynchronously and in correct time order; that is, if the acquisition time of one sensor is earlier than that of a second one, the first list is also sent before the second one. Let δt_max = max_i {δt_i} be the longest update interval existing in the system and s_max = argmax_{i=1,...,n} {δt_i} the corresponding sensor. To keep the time dependencies, all measurements generated in this time interval have to be processed before s_max sends the next measurement list:<br />

T_p = Σ_{i=1}^{n} p_i ≤ δt_max    (1)<br />

The total processing time T_p depends directly on the number of sensors |S|, the corresponding measurement dimensions d_i, and the maximal number of observable objects m_i for each sensor. In general it can be expected that both the number of sensors in an automotive multi-target tracking system and the number of observed objects will increase in the near future. For this reason it is indispensable to use algorithms that offer real-time performance while still providing sufficient precision for the ADAS functions that rely on the provided object tracks.<br />
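Inequality (1) can be checked with a few lines of code. The sensor set below, with its update intervals and worst-case processing times, is purely hypothetical and only illustrates the schedulability test.<br />

```python
# Schedulability check of inequality (1): the total per-cycle processing
# time T_p must fit within the longest sensor update interval.
# All numbers are hypothetical, for illustration only.
sensors = {
    "front_radar":  {"dt_ms": 50.0, "p_ms": 8.0},
    "front_camera": {"dt_ms": 66.0, "p_ms": 20.0},
    "rear_radar":   {"dt_ms": 50.0, "p_ms": 8.0},
}

def schedulable(sensors):
    total_p = sum(s["p_ms"] for s in sensors.values())  # T_p
    dt_max = max(s["dt_ms"] for s in sensors.values())  # longest update interval
    return total_p <= dt_max, total_p, dt_max

ok, total_p, dt_max = schedulable(sensors)
print(f"T_p = {total_p} ms, dt_max = {dt_max} ms, schedulable: {ok}")
```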

In the following, we chose the KF for the track estimation and the AA for the data association, as depicted in Fig. 1. Both algorithms, together with a complexity analysis, are discussed in the following sections. The KF and its variations for nonlinear object models are commonly used estimation filters in autonomous driving. The AA belongs to the class of nearest-neighbor association algorithms. Other well-known techniques, which are much more complex and therefore not targeted in this study, are the Joint Probabilistic Data Association Filter (JPDAF) [14] and the Multiple Hypothesis Tracking filter (MHT) [15].<br />

B. Kalman Filter<br />

The KF is a discrete-time, recursive, linear implementation of the Bayes filter, originally published by R. E. Kalman in 1960 [16]. It is well studied and used in numerous<br />




Algorithm 1: Discrete Kalman filter<br />

1: Input: A_k, P_k, Q_k, R_k, x_k, z_k<br />

2: Output: P_{k+1}, x_{k+1}<br />

3: Prediction Step:<br />

4: x_{k+1} ← A_k x_k<br />

5: P_{k+1} ← A_k P_k A_k^T + Q_k<br />

6: Correction Step:<br />

7: K_k ← P_{k+1} H_k^T (H_k P_{k+1} H_k^T + R_k)^{−1}<br />

8: x_{k+1} ← x_{k+1} + K_k (z_k − H_k x_{k+1})<br />

9: P_{k+1} ← (I − K_k H_k) P_{k+1}<br />

Fig. 2. Time dependencies between n sensors sending measurements to a<br />

fusion unit.<br />

applications. A more detailed introduction can be found in [17], [18] and [19]. The filter addresses the problem of predicting a state x_k ∈ R^n, given only a measurement z_k ∈ R^m. The relationship between two states is expressed by the linear difference equation<br />

x_{k+1} = A_k x_k + B u_k + w_k.    (2)<br />

The square matrix A ∈ R^{n×n} maps the state at time step k to the state at time step k+1. Accordingly, matrix B ∈ R^{n×l} relates a control vector u_k ∈ R^l to the state vector at time k. In the following multi-target tracking application the input vector is assumed to be zero; therefore equation (2) can be shortened to<br />

x_{k+1} = A_k x_k + w_k,    (3)<br />

with a Gaussian noise vector w_k that models the system’s uncertainty. The relation between the measurement vector and the state vector is modeled by<br />

z_k = H_k x_k + v_k    (4)<br />

with matrix H ∈ R^{m×n} relating the state vector x_k to the measurement z_k. Again, uncertainty is modeled by a Gaussian noise vector v_k.<br />

The complete KF is shown in Algorithm 1. It consists of two steps. First, in the prediction step, the a priori state x_{k+1} is computed using the difference equation given in equation (3). Additionally, in line 5 the a priori covariance matrix P_{k+1} is computed using the transition matrix A and a matrix Q modeling the process uncertainty. In the correction step, the updated state and covariance are computed. First, in line 7, the gain K_k is computed, minimizing the a posteriori error covariance. With the gain and the innovation, the a posteriori state is computed in line 8. Finally, in line 9, the a posteriori covariance is updated.<br />
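One predict/correct cycle of Algorithm 1 can be sketched in a few lines of NumPy, here with the 4-dimensional constant-velocity model used in section III; the numeric values of dt, Q, R and the measurement are placeholders, not the paper's tuning.<br />

```python
import numpy as np

def kalman_step(A, H, Q, R, x, P, z):
    """One predict/correct cycle of the discrete Kalman filter (Algorithm 1)."""
    # Prediction step: a priori state and covariance
    x_pred = A @ x
    P_pred = A @ P @ A.T + Q
    # Correction step: gain, innovation, a posteriori state and covariance
    K = P_pred @ H.T @ np.linalg.inv(H @ P_pred @ H.T + R)
    x_new = x_pred + K @ (z - H @ x_pred)
    P_new = (np.eye(len(x)) - K @ H) @ P_pred
    return x_new, P_new

# Constant-velocity model: state [px, py, vx, vy], all four measured.
dt = 0.1
A = np.array([[1, 0, dt, 0],
              [0, 1, 0, dt],
              [0, 0, 1, 0],
              [0, 0, 0, 1]], dtype=float)
H = np.eye(4)
Q = 0.01 * np.eye(4)   # placeholder process noise
R = 0.10 * np.eye(4)   # placeholder measurement noise

x, P = np.zeros(4), np.eye(4)
z = np.array([1.0, 2.0, 0.5, -0.5])
x, P = kalman_step(A, H, Q, R, x, P, z)
print(x)
```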

1) Complexity: Summarizing the equations given in Algorithm 1, we have 8 matrix-matrix multiplications, 3 matrix-vector multiplications, 3 matrix-matrix additions, 2 vector additions and one matrix inversion. The complexity of these<br />

TABLE II<br />

MATRIX AND VECTOR DIMENSIONS<br />

Matrices: A (n×n), B (n×l), P (n×n), Q (n×n), R (m×m), K (n×m), I (n×n), H (m×n)<br />

Vectors: x (n), z (m), u (l)<br />

operations is listed in Table III. As can be seen from the table, the total runtime is bounded by O(n³). The matrix and vector dimensions are given in Table II. Using single-precision floating-point format, the memory requirement for all matrices is n(3n+4m) × 32 bit, and (n+m) × 32 bit for the state and measurement vectors. Additional memory space needed for temporary results is not considered.<br />
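Plugging the constant-velocity dimensions of section III (n = m = 4) into the stated formulas gives the per-filter footprint; the 512-target figure is only an illustrative extrapolation, not a number from the paper.<br />

```python
def kf_matrix_bits(n, m):
    """Matrix storage for one KF instance, per the formula in the text."""
    return n * (3 * n + 4 * m) * 32

def kf_vector_bits(n, m):
    """State plus measurement vector storage."""
    return (n + m) * 32

n = m = 4  # constant-velocity model of section III
total_bytes = (kf_matrix_bits(n, m) + kf_vector_bits(n, m)) // 8
print(total_bytes)        # bytes per filter instance
print(512 * total_bytes)  # illustrative total for 512 tracked targets
```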

C. Auction Algorithm<br />

One important step in a target tracking application is to assign measurements to existing tracks, and to create a new track if none exists. In this scenario the AA [20] is chosen, since it can be easily parallelized and in general delivers an optimal solution. The algorithm solves the problem of assigning n existing tracks to m measurements; the assignment between tracks and measurements is bijective. Algorithm 2 shows the AA for symmetric problems, i.e. the number of tracks equals the number of measurements. A multi-target tracking application will in general be asymmetric, but still the<br />

TABLE III<br />

COMPLEXITY<br />

Multiplication A_k x_k: O(n²)<br />

Multiplication A_k P_k A_k^T: O(n³)<br />

Addition A_k P_k A_k^T + Q_k: O(n²)<br />

Multiplication H_k P_k H_k^T: O(mn²)<br />

Addition H_k P_k H_k^T + R_k: O(m²)<br />

Inversion (H_k P_k H_k^T + R_k)^{−1}: O(m³)<br />

Multiplication P_k H_k^T (H_k P_k H_k^T + R_k)^{−1}: O(mn²)<br />

Multiplication H_k x_k: O(mn)<br />

Subtraction z_k − H_k x_k: O(m)<br />

Multiplication K(z_k − H_k x_k): O(mn)<br />

Addition x_k + K(z_k − H_k x_k): O(n)<br />

Multiplication K_k H_k: O(mn²)<br />

Subtraction I − K_k H_k: O(n²)<br />

Multiplication (I − K_k H_k) P_k: O(n³)<br />



symmetric approach can be used by adding dummy tracks or measurements. Other possible modifications for the asymmetric assignment problem are discussed in [20]. The AA runs iteratively and stops when a feasible assignment S has been found or some time bound is reached. An assignment S is called feasible if every track is assigned to exactly one measurement.<br />

The AA consists of three phases. In the first phase a gating area for each track is computed. This approach allows us to reject assignments that are highly improbable: only measurements that lie within the gating area G are considered for an assignment,<br />

d²_ij = y_ij S^{−1} y_ij^T < G    (5)<br />

with y_ij being the innovation vector between track i and measurement j, and S being the innovation covariance.<br />
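The gating test of equation (5) is a Mahalanobis-distance check. A small sketch follows, with an invented innovation covariance; the gate value 9.21 is the 99% chi-square quantile for two degrees of freedom, an assumed (not the paper's) threshold choice.<br />

```python
import numpy as np

def gate(residual, S_inv, G):
    """Mahalanobis gating test of equation (5): accept if d^2 < G."""
    d2 = float(residual @ S_inv @ residual)
    return d2 < G, d2

S_inv = np.linalg.inv(np.diag([0.5, 0.5]))  # inverse innovation covariance
ok, d2 = gate(np.array([0.3, -0.2]), S_inv, G=9.21)
print(ok, d2)
```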

In the bidding phase, for each track i the measurements with the best and second-best net value are searched:<br />

j_i = argmax_{j∈A(i)} {a_ij − p_j}    (6)<br />

v_i = max_{j∈A(i)} {a_ij − p_j},  w_i = max_{j∈A(i), j≠j_i} {a_ij − p_j}    (7)<br />

where a_ij = 1/d²_ij is the assignment value and p_j the current price of measurement j. If there is no second-best value, then w_i is set to a value much smaller than v_i. The bid of track i for measurement j_i is finally computed as<br />

b_{i j_i} = p_{j_i} + v_i − w_i + ε    (8)<br />

with ε being a constant that limits the error with respect to the optimal solution.<br />

The third phase is the assignment phase, in which the new price p_j is set to the highest bid that was given for measurement j. Then every pair (i, j) whose measurement received a higher bid is removed from the assignment set S, and the new pair (i_j, j) is added.<br />

1) Complexity: The AA consists of only a small number of mathematical operations. As Table IV shows, the gating phase performs, for every track i, one matrix-vector multiplication and one vector subtraction to form the residual, plus two further matrix-vector multiplications for the quadratic gating form. In the following bidding phase, 4 scalar additions are performed for each track until a feasible solution is found or some time bound is reached. In the last phase no mathematical operation is performed.<br />

III. RESULTS<br />

For the implementation the aforementioned KF and AA were used. To ensure the functional correctness of the implementation, it was verified whether targets were correctly assigned to their tracks. The KF used a constant-velocity model, so the state and measurement vector both had an equal dimension of 4 (position in x, position in y, velocity in x and velocity in y). The implementation was run for different numbers of targets on an NVIDIA DRIVE PX 2; its hardware specification is shown in Table V. The DRIVE PX 2 has 4 ARM Cortex-A57<br />

Algorithm 2: Data Association<br />

1: Input: predicted states X_{k+1}, measurements z_{k+1}, gate G := (H_k P_k H_k^T + R_k)^{−1}<br />

2: Output: assignment set S_{k+1}<br />

3: Gating<br />

4: for all tracks i do<br />

5:   for all measurements j do<br />

6:     residual ← z_j − H x_i<br />

7:     diff ← residual^T · G · residual<br />

8:     if diff < ε then<br />

9:       a_ij ← 1/diff<br />

10:    else<br />

11:      a_ij ← −∞<br />

12: repeat<br />

13:   for all tracks i do<br />

14:     Bidding Phase<br />

15:     j_i ← argmax_{j∈A(i)} {a_ij − p_j}  {measurement with max net value}<br />

16:     v_i ← max_{j∈A(i)} {a_ij − p_j}<br />

17:     w_i ← max_{j∈A(i), j≠j_i} {a_ij − p_j}  {second-best measurement}<br />

18:     b_{i j_i} ← p_{j_i} + v_i − w_i + ε  {bid for the chosen measurement}<br />

19:   Assignment Phase<br />

20:   for all measurements j do<br />

21:     p_j ← max_{i∈P(j)} b_ij<br />

22:   for all pairs (i, j) whose measurement j received a higher bid do<br />

23:     if (i, j) ∈ S then<br />

24:       S ← S \ {(i, j)}<br />

25:     S ← S ∪ {(i_j, j)}<br />

26: until each track is assigned to one measurement<br />
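The bidding and assignment phases above can be sketched compactly in Python for the symmetric case. Gating is omitted here: the value matrix a[i][j] is assumed to be precomputed (e.g. as 1/d²_ij), and the numbers below are invented for illustration.<br />

```python
def auction(a, eps=0.01, max_rounds=1000):
    """Symmetric auction: assign each track i to one measurement j,
    maximizing the summed values a[i][j] (bidding + assignment phases)."""
    n = len(a)
    prices = [0.0] * n
    owner = [None] * n          # owner[j] = track currently holding measurement j
    unassigned = list(range(n))
    rounds = 0
    while unassigned and rounds < max_rounds:
        i = unassigned.pop(0)
        # Bidding phase: best and second-best net value for track i
        net = [a[i][j] - prices[j] for j in range(n)]
        ji = max(range(n), key=lambda j: net[j])
        vi = net[ji]
        wi = max(net[j] for j in range(n) if j != ji) if n > 1 else vi - eps
        # Assignment phase: raise the price, displace the previous owner
        prices[ji] += vi - wi + eps
        if owner[ji] is not None:
            unassigned.append(owner[ji])
        owner[ji] = i
        rounds += 1
    return owner

values = [[10.0, 2.0], [8.0, 3.0]]  # a[i][j], e.g. 1/d_ij^2
print(auction(values))
```

The eps term is the bid increment of equation (8); a smaller eps yields an assignment closer to the optimum at the cost of more bidding rounds.<br />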

TABLE IV<br />

COMPLEXITY<br />

Multiplication H x_{k+1}: O(mn)<br />

Subtraction z_{k+1} − H x_{k+1}: O(m)<br />

Multiplication residual^T · G · residual: O(m²)<br />

Subtraction a_ij − p_j: O(1)<br />

Addition p_{j_i} + v_i − w_i + ε: O(1)<br />

cores, 2 Denver cores and one Pascal GPU. The two Denver superscalar processors, which support the ARMv8 instruction set, are connected to the 4 ARM Cortex-A57 cores. For the evaluation on the CPU the two Denver cores were deactivated to achieve consistent results on the ARM Cortex-A57 processor. The four Cortex-A57 cores were run at approximately 2 GHz; in theory they deliver a total performance of 64 GFLOPS (single precision). Each core has 32 kB of L1 data cache and shares an L2 cache of 2 MB. The GPU is based on the Pascal architecture with 256 CUDA cores running at 1275 MHz and has a theoretical peak performance of 653 GFLOPS.<br />

For benchmarking, four different implementations were compared. The reference implementation was single-threaded, without using the advanced SIMD and floating-point unit of the<br />



TABLE V<br />

TESTING HARDWARE<br />

CPU: instruction set ARMv8, 4 x A57 cores, frequency 1996 MHz, L1 cache (I/D) 48 kB / 32 kB, L2 cache 2 MB<br />

GPU: Pascal architecture, 256 CUDA cores, clock rate 1275 MHz, global memory 7686 MB, shared memory 48 kB, constant memory 64 kB, 32768 block registers<br />

RAM: LPDDR4 protocol, memory size 8 GB<br />

Cortex-A57 cores. In the second one, the advanced SIMD and floating-point unit was used explicitly for all matrix and vector operations. In the third implementation, an additional level of parallelism was added by parallelizing the application object-based using OpenMP. In this case, object-based means that the state update and prediction for different targets were distributed over the available cores; for the AA, the for loops in Algorithm 2 were parallelized. Finally, the same object-based parallelization was implemented using the CUDA 8.0 framework. The results for the KF are presented in Subfig. 3 a) and for the AA in Subfig. 3 b), for different numbers of targets.<br />
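The object-based decomposition can be sketched independently of OpenMP: each target's filter update is an independent work item that a pool of workers can process. The Python threads below only illustrate the decomposition (the paper's implementation used OpenMP on the Cortex-A57 cores); the model matrix and target states are placeholders.<br />

```python
from concurrent.futures import ThreadPoolExecutor

import numpy as np

def update_target(state):
    """Per-target work item: one constant-velocity prediction step."""
    A = np.array([[1, 0, 0.1, 0],
                  [0, 1, 0, 0.1],
                  [0, 0, 1, 0],
                  [0, 0, 0, 1]])
    return A @ state

# Object-based parallelization: targets are distributed over workers,
# mirroring the OpenMP loop over targets described in the text.
targets = [np.array([float(i), 0.0, 1.0, 0.0]) for i in range(512)]
with ThreadPoolExecutor(max_workers=4) as pool:
    predicted = list(pool.map(update_target, targets))

print(predicted[0])
```

Because the per-target updates share no state, the same decomposition maps directly to OpenMP threads or CUDA thread blocks.<br />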

As already discussed in section II-B1, the KF has a cubic runtime. For the target tracking, the KF was computed for a growing number of targets. As the runtime for each single KF stays the same, a linear increase of the total runtime was expected, as can be seen in the runtime diagram. Using the SIMD accelerator, a total speed-up of 1.2 was achieved for 512 targets. Since there are not that many matrix and vector operations in the data association algorithm, no significant difference between those two implementations could be observed. By distributing the computation over all four cores and using the SIMD device, a speed-up of approximately 4.1 was observed for the KF computation and 3.7 for the data association.<br />

For the CUDA implementation, a batched processing function of the NVIDIA cuBLAS library was used. Since the overhead for creating and starting the streams was already much higher than the overall processing time on the CPU, the runtime was not considered for the KF computation. For the data association, a speedup of 3.4 was achieved. As a result, it can be seen that the parallelization on the GPU cannot exploit its capability to process multiple data simultaneously. The reason for this behavior is the slow memory access of the GPU: the distance matrix computed in the AA is about 1 MB, so it does not fit into the fast shared memory, while it completely fits into the shared L2 cache of the A57 cores. Certainly the CUDA implementation could be optimized further, but at most a limited advantage over a CPU implementation can be expected. Given these memory constraints, it seems best to equip such a multi-target embedded system with sufficient cache sizes and fast memory access. Furthermore, an object-based parallelization appears to be the most promising general solution approach.<br />
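The memory argument can be checked with a quick calculation: a single-precision distance matrix for 512 targets and 512 measurements occupies exactly 1 MiB, far beyond the 48 kB of shared memory per GPU block, yet well within the 2 MB L2 cache shared by the A57 cores.

```python
targets, measurements = 512, 512
matrix_bytes = targets * measurements * 4   # float32 distance matrix
print(matrix_bytes / 2**20, "MiB")          # 1.0 MiB

shared_mem = 48 * 1024                      # GPU shared memory per block
l2_cache = 2 * 1024 * 1024                  # A57 L2 cache
assert matrix_bytes > shared_mem            # does not fit in the GPU's fast memory
assert matrix_bytes < l2_cache              # fits in the CPU's L2 cache
```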

IV. CONCLUSION<br />

Sensor fusion is a key technology for future autonomous driving. Automation level 2 is already available in premium cars and will certainly be established in the broad market in the near future. Today's automotive ECUs are very limited in memory and computational resources and are therefore used only for safety-critical functions, while sensor fusion is usually performed on more powerful systems. Since the computational power of an embedded system is directly limited by its energy consumption, parallel architectures can be used to overcome those limits.<br />

In this paper, the suitability of parallel embedded architectures was investigated using a multi-target tracking implementation based on the KF and AA. The multi-target tracking was performed on up to four ARM Cortex-A57 cores and an embedded Pascal GPU. It was shown that parallel architectures offer an efficient way to speed up object-based sensor fusion on a microprocessor architecture that is especially optimized for low energy consumption.<br />

It was possible to speed up the computationally critical data association by a factor of 3.7 by distributing the load over four cores and using a SIMD accelerator at the same time. With this implementation, 512 targets could be associated in less than 20 ms, which is well within the update period of a camera or radar system. A comparable speedup was achieved using a GPU developed by NVIDIA for autonomous driving tasks. In the given scenario, however, the GPU implementation mainly hits the so-called "memory wall", which prevents an efficient use of the parallel architecture.<br />



Fig. 3. Results for different levels of parallelism (single core, SIMD, OpenMP, CUDA) for the Kalman filter and the auction algorithm. The runtime was measured for a growing number of targets; the assignment phase was executed 30 times in each case to get comparable results. Subfig. a) shows the measured runtime for the Kalman filter and Subfig. b) the runtime for the auction algorithm.<br />



DeepAPI – bringing deep learning to the edge device with a use case in food recognition<br />

Spiros Oikonomou, Nikos Fragoulis, Vassilis Tsagaris, Christos Theoharatos<br />

Irida Labs S.A<br />

Patras, Greece<br />

tsagaris@iridalabs.gr<br />

Abstract—In this paper, we present an innovative approach for real-time food product identification, based on Artificial Intelligence (AI) and deep learning methods, that provides high accuracy without depending on the cloud or high-end processing systems.<br />

Index Terms—DeepAPI, convolutional neural networks, CNN,<br />

SqueezeNet, food, image classification, deep learning.<br />

I. INTRODUCTION<br />

During the past years, convolutional neural networks (CNNs) have become established as the dominant technology for real-world visual understanding tasks. A significant research effort has been put into the design of very deep architectures able to construct high-order representations of visual information. The accuracy obtained by deep architectures such as GoogLeNet [1] and the more recent ResNet [2] on image classification and object detection tasks proved that depth of representation is indeed key to a successful implementation.<br />

The main focus up to now has been on implementations for mainstream PC-like or cloud-based computing systems, in order to deploy deep learning approaches in diverse technological areas like automotive, transportation, the Internet of Things (IoT), medical and more. However, meeting particular performance requirements on embedded platforms is, in general, difficult and complex. A possible workaround to this problem is heterogeneous computing, which exploits every computing resource present on an embedded system (CPU, GPU, DSP) by off-loading part of the load to each, thereby increasing the overall computational capacity and thus processing speed. There are, however, cases where a multicore CPU is the only available resource on an embedded system. A reasonable question then arises: is it possible to achieve fast inference speed?<br />

We specialize in solving embedded vision problems of this kind. To this end, we took a step forward and evaluated the performance of a SqueezeNet CNN [3] model on a multi-core, multi-cluster CPU.<br />

II. DEEPAPI AND SQUEEZENET CNN ARCHITECTURE<br />

In this section we introduce DeepAPI, a software library, and explain why it is useful for developers. Moreover, we describe the architecture of SqueezeNet and analyze the reasons why this architecture is suitable for embedded applications. Finally, we briefly refer to the Food-101 [5] database. Fig. 1 shows which processing steps are performed off-device and which are performed on-device.<br />

Fig. 1. The processing steps which are performed off-device and the<br />

processing steps which are performed on-device.<br />

A. DeepAPI<br />

The future brings a wave of embedded devices able to respond to their environment through embedded intelligence. Deep learning is a proven technology able to achieve this intelligence, but it requires extremely complex processing to be performed at high speed and, at the same time, within a low energy budget.<br />

To achieve those features, a holistic approach should be followed:<br />

Tweak the models: use special models and special compression techniques, mimicking the human brain, which result in significantly more economical models.<br />



Tweak the code: use heterogeneous programming technologies, in which every available computing unit – multi-cluster CPUs, the GPU and the DSP – is exploited synergistically to carry out the complex tasks of deep learning inference in reasonable time and within a limited power budget.<br />

To help developers embed deep learning technology into their own systems and applications, we developed DeepAPI. DeepAPI is a software library consisting of high-performance deep learning models, highly optimized for embedded computing systems and implemented using a variety of approaches and techniques so as to suit a wide range of applications.<br />

DeepAPI also includes the necessary software tools to allow users to train the model of interest themselves and then use the training results to build the final application. DeepAPI supports platforms based on various ARM/Mali and Snapdragon family processors, and the list keeps growing.<br />

B. Tweaking the algorithms: The SqueezeNet architecture<br />

A basic downside of deep learning architectures is that they require hundreds of megabytes of coefficients for the convolutional kernels to operate. Such requirements can render the embedded implementation of such networks rather prohibitive. Imagine a scenario where a CNN has to operate on a video stream captured by a smartphone, in order to produce real-time video annotation. The allocation and data transfers needed to load, e.g., 600 MB of coefficients into an embedded device's memory is a rather intense workload, particularly when it has to be completed within a limited time, starting when the user opens the camera app and ending when the video recording starts.<br />

In order to address such issues, research effort has very recently shifted towards architectures that produce significantly fewer coefficients. In particular, the recently presented SqueezeNet [3] architecture is able to achieve levels of classification accuracy on ImageNet similar to the baseline AlexNet [4] architecture while using 50 times fewer coefficients. The smart combination of small convolutional kernels and an architecture that lets information flow through different paths facilitates the construction of sufficiently high-order image representations suitable for a large variety of applications. A coefficient size of 3 MB, easily reduced further by a factor of 5 via model-compression techniques, makes SqueezeNet a very appealing architecture for embedded implementations.<br />

Fig. 2 shows two SqueezeNet CNN architectures for classifying the 101 food categories of the Food-101 database [5]. The original SqueezeNet architecture is shown in Fig. 2a and our implementation in Fig. 2b. The main differences between the architectures are the number of outputs of the conv10 layer and the existence of the fc11 layer, which is a fully connected layer. Our SqueezeNet architecture begins with a standalone convolution layer (conv1), followed by 8 Fire modules (fire2–fire9) and one convolution layer (conv10), ending with a final fully connected layer (fc11). A Fire module is comprised of a squeeze convolution layer, which has only 1x1 filters, feeding into an expand layer that has a mix of 1x1 and 3x3 convolution filters. During network training, we noticed that with these changes the network converged more easily and the classification accuracy was higher. At this point, we have to mention that the network was trained on the Food-101 database by fine-tuning the SqueezeNet model pre-trained on ImageNet [6].<br />

The number of filters per Fire module is gradually increased from the beginning to the end of the network. SqueezeNet performs max-pooling with a stride of 2 after conv1, fire4, fire8 and conv10.<br />

Fig. 2. SqueezeNet Food-101 architecture. (a) The original SqueezeNet architecture; (b) our SqueezeNet architecture.<br />



C. Food-101 Database<br />

In the food recognition use case, the SqueezeNet architecture has been trained to perform image tagging and is able to discriminate between the 101 food categories tagged in the Food-101 database [5], comprising some 101,000 images. The database is balanced, as each food category consists of 1,000 images.<br />

The database was augmented by cropping each image at 5 different frames and vertically mirroring each of them. The training was done using the Caffe [7] deep learning framework.<br />
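The ten-fold augmentation described above (five crops, each mirrored) can be sketched with NumPy. The crop positions (four corners plus center) are our choice for illustration; the paper does not specify them.

```python
import numpy as np

def augment(img, size):
    """Return 5 crops of an image plus a mirrored copy of each (10 samples)."""
    h, w = img.shape[:2]
    offsets = [(0, 0), (0, w - size), (h - size, 0), (h - size, w - size),
               ((h - size) // 2, (w - size) // 2)]       # corners + center
    crops = [img[y:y + size, x:x + size] for y, x in offsets]
    # mirror each crop about the vertical axis
    return crops + [np.flip(c, axis=1) for c in crops]
```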

III. RESULTS<br />

As mentioned in the previous section, the training was done using the Caffe deep learning framework; the accuracy achieved in terms of average recognition rate is 72% for Rank 1, 85% for Rank 3 and >90% for Rank 5, as shown in Table 1.<br />

TABLE I. CLASSIFICATION ACCURACY OF SQUEEZENET<br />
Rank      Accuracy (%)<br />
Rank 1    72<br />
Rank 3    85<br />
Rank 5    91<br />

The validity of the proposed approach has been verified for the SqueezeNet architecture on different platforms. For each implementation, the inference time was measured; the results are shown in Table 2.<br />

TABLE II. INFERENCE SPEED (IS) OF SQUEEZENET FOR A CPU-ONLY (MULTI-CORE) IMPLEMENTATION ON DIFFERENT PLATFORMS.<br />
Processor                                   Mean IS (msec)   Min IS (msec)<br />
1. Snapdragon 820 @ MDP820                  43.7             37.3<br />
2. Snapdragon 808 @ LG G4                   99.4             78.3<br />
3. Snapdragon 801 @ LG G3                   133.0            123.3<br />
4. Mediatek MT6797 @ Redmi Note 4           38.9             33.9<br />
5. HiSilicon Kirin 935 @ Huawei CRR-L09     75.9             64.2<br />
6. Exynos 8890 Octa @ S7-Edge               33.5             28.1<br />
7. AllWinner A80                            94.1             84.0<br />

As seen in the above table, inference speed varies with the process technology, clock speed and number of cores. However, all platforms prove efficient enough to support a real-time recognition task like the food recognition problem.<br />

On the Exynos 8890 @ S7-Edge platform, DeepAPI achieves a mean inference speed of 33.5 msec and a minimum of 28.1 msec. The second-fastest platform is the Mediatek MT6797 @ Redmi Note 4, which achieves a mean inference speed of 38.9 msec and a minimum of 33.9 msec. DeepAPI achieves a mean of 43.7 msec and a minimum of 37.3 msec on the Snapdragon 820 @ MDP820, which is fast enough. Based on Table 2, we also observe that DeepAPI achieves decent inference speed with the SqueezeNet architecture for food recognition even on an older processor platform like the Snapdragon 801 @ LG G3.<br />
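A mean inference time of 33.5 msec corresponds to roughly 30 classifications per second, which is what makes the CPU-only deployment viable for live camera input. The conversion uses the Table 2 figures:

```python
mean_is_msec = {"Exynos 8890": 33.5, "MT6797": 38.9, "Snapdragon 820": 43.7}
fps = {name: 1000.0 / t for name, t in mean_is_msec.items()}
print(fps)   # roughly 29.9, 25.7 and 22.9 frames per second
```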

IV. CONCLUSION<br />

Based on the experimental results in Table 2, we can answer the question raised earlier: is it possible to achieve fast inference speed if a multicore CPU is the only available resource on an embedded system? The answer is that it is possible to achieve fast inference speed together with good classification results.<br />

This is due to the proper design, development and implementation of DeepAPI, as well as to the architecture of SqueezeNet, a convolutional neural network that fits embedded systems offering only a multicore CPU.<br />

The fastest inference speed we achieved is 28.1 msec, on the Exynos 8890 Octa @ S7-Edge platform. Based on the results in Table 2, DeepAPI also achieves sufficiently fast inference speed on older platforms like the Snapdragon 801 @ LG G3.<br />

As a final conclusion, we can say that with DeepAPI we can achieve real-time food recognition at the edge device.<br />

REFERENCES<br />

[1] Szegedy, Christian, et al. “Going deeper with convolutions.”<br />

Proceedings of the IEEE Conference on Computer Vision and<br />

Pattern Recognition. 2015.<br />

[2] He, Kaiming, et al. “Deep residual learning for image<br />

recognition.” arXiv preprint arXiv:1512.03385 (2015).<br />

[3] Iandola, Forrest N., et al. “SqueezeNet: AlexNet-level accuracy with 50x fewer parameters and <0.5MB model size.” arXiv preprint arXiv:1602.07360 (2016).<br />


Deep Learning Requirements for Autonomous<br />

Vehicles<br />

Gordon Cooper<br />

Synopsys, Solutions Group<br />

Mountain View, CA, USA<br />

gordon.cooper@synopsys.com<br />

Abstract—Deep-learning techniques for embedded vision are<br />

enabling cars to 'see' their surroundings and have become a<br />

critical component in the push toward fully autonomous vehicles.<br />

The early use of deep learning for object detection, e.g., pedestrian<br />

detection and collision avoidance, is evolving toward scene<br />

segmentation where every pixel of a high-resolution video stream<br />

must be identified. Embedded vision solutions will be a key enabler<br />

for making automobiles fully autonomous. Giving an automobile<br />

a set of eyes – in the form of multiple cameras and image sensors<br />

– is a first step, but it also will be critical for the automobile to<br />

interpret content from those images and react accordingly. To<br />

accomplish this, embedded vision processors must be hardware-optimized<br />

for performance while achieving low power and small<br />

area, have tools to program the hardware efficiently, and have<br />

algorithms to run on these processors. This presentation will<br />

discuss the current and next-generation requirements for ADAS<br />

vision applications, including the need for deep-learning<br />

accelerators. It will discuss how coming changes in deep learning<br />

will improve ADAS performance, and discuss how to evaluate the<br />

hardware and software tools needed to quickly deploy ADAS<br />

applications with high-definition resolutions.<br />

II. DEEP LEARNING VS MACHINE LEARNING VS ARTIFICIAL<br />

INTELLIGENCE<br />

Artificial intelligence is a broad category (Fig. 1). Until very<br />

recently, AI has been associated more with science fiction than<br />

automotive reality. AI conjures up images of self-aware<br />

androids or rogue robots taking over the world. In the simplest<br />

definition, however, artificial intelligence is human levels of<br />

intelligence exhibited by machines. An automobile exhibiting<br />

human levels of driving would certainly be classified as an<br />

example of artificial intelligence. Machine learning is an<br />

application of artificial intelligence that uses algorithms to<br />

analyze large amounts of data and then infers some information<br />

about the real world from the data.<br />

Keywords—embedded vision; deep learning; IP; CNN;<br />

convolutional neural network; automotive; advanced driver<br />

assistance system; ADAS; SoC design<br />

I. INTRODUCTION<br />

There is an arms race between the major automotive<br />

manufacturers – and some of the biggest tech companies – to be<br />

the first to bring autonomous driving vehicles to the masses.<br />

With about 94% of all accidents attributed to human error, the<br />

rise of autonomous vehicles will save thousands of lives daily<br />

and billions in dollars lost to road crashes. To hand over control<br />

from a person to a machine requires a high confidence in the<br />

machine’s decision making process. Deep learning techniques<br />

provide the building blocks to reach the level of artificial<br />

intelligence needed for machines to make the decisions<br />

necessary to replace human drivers. An understanding of deep<br />

learning requirements is important to best implement this new<br />

technology.<br />

Fig. 1: Hierarchy showing how deep learning relates to artificial intelligence<br />

Neural networks are a class of machine learning algorithms – modeled after the human brain – in which a neuron represents the computational unit and the network describes how these units are connected to each other. Until recently, neural networks were limited to only a couple of layers. But with algorithmic advances combined with the acceleration of computing power brought on by GPUs, more layers have been added to neural networks, improving their performance. Any neural network with more than just input and output layers – that is, with intermediate 'hidden' layers – is considered a deep neural network (Fig. 2). A deep neural network could have one or hundreds of hidden layers. These deep neural networks provide the state-of-the-art implementations for deep learning. A practical example of<br />



deep-learning techniques is enabling cars to 'see' their<br />

surroundings using computer vision hardware and software.<br />
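The computation described above – weights multiplied with inputs at each node, layer by layer – reduces to a few lines of NumPy. This is a generic illustration, not any particular production network:

```python
import numpy as np

def relu(x):
    return np.maximum(0.0, x)

def forward(x, layers):
    """Run x through a stack of (weights, bias) pairs. The layers between
    input and output are the 'hidden' layers that make the network deep."""
    for W, b in layers[:-1]:
        x = relu(W @ x + b)     # each node: weighted sum of inputs, then nonlinearity
    W, b = layers[-1]
    return W @ x + b            # linear output layer
```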

Fig. 2. An example of a deep neural network. This network is 'deep' because of the hidden layers between the input and output. Each node represents a computational unit, with weights multiplied with inputs to form the output.<br />

III. DEEP LEARNING TECHNIQUES<br />

Computer vision represents a good starting point for understanding deep learning techniques as they apply to automobiles. Most pattern recognition tasks, like detecting a pedestrian in front of your car, are part of a broad class of "object detection" techniques. Traditionally, a computer vision algorithm was hand-crafted for each object to be detected. Examples of algorithms used for detection include Viola-Jones and, more recently, Histogram of Oriented Gradients (HOG). The HOG algorithm looks at the edge directions within an image to try to describe objects (Fig. 3). HOG was considered state of the art for pedestrian detection as late as 2014. It had a reasonable level of accuracy, but a significant restriction was the amount of work required to convert detecting a pedestrian into detecting, for example, a dog.<br />

Fig. 3. Example of Histogram of Oriented Gradients (HOG) applied to pedestrian detection.<br />

The important breakthrough of deep neural networks is that object detection no longer has to be a hand-crafted coding exercise. Deep neural networks allow features to be learned automatically from training examples. Although the concept of deep neural networks has been around for a long time, only recently have semiconductors achieved the processor performance to make them a practical reality. In 2012, a convolutional neural network (CNN)-based entry into the annual ImageNet competition showed a significant improvement in accuracy on the task of image classification over traditional computer vision algorithms (Fig. 4). Research in neural networks accelerated, as did the improvements in accuracy. By 2015, for the ImageNet task of classifying a thousand objects, neural networks had not only far surpassed traditional computer vision techniques, they were beating human detection.<br />

Fig. 4. Error rates for ImageNet Large Scale Visual Recognition Challenge (ILSVRC) winners have dropped dramatically since 2012, when deep learning was introduced.<br />

The neural networks used to win the ImageNet Large Scale Visual Recognition Challenge were convolutional neural networks (CNNs), which are the current state of the art for efficiently implementing deep neural networks for vision. CNNs are more efficient because they reuse a lot of weights across the image. CNN-based pedestrian detection solutions have been shown to have better accuracy than algorithms like HOG and, perhaps more importantly, it is easier to retrain a CNN to look for a bicycle than it is to write a new hand-crafted algorithm to detect a bicycle instead of a pedestrian.<br />

IV. DEEP LEARNING APPLIED TO AUTOMOTIVE OBJECT<br />

DETECTION<br />

Auto manufacturers are including more cameras in their cars,<br />

as shown in Fig. 5. A front facing camera can detect pedestrians<br />

or other obstacles and, with the right algorithms, assist the<br />

driver in braking. A rear-facing camera – mandatory in the<br />

United States for most new vehicles starting in 2018 – can save<br />

lives by alerting the driver to objects behind the car, out of the<br />

driver’s field of view. A camera in the cars cockpit facing the<br />

driver can identify and alert for distracted driving. And most<br />

recently, adding four to six additional cameras can provide a<br />

360-degree view around the car. Giving an automobile a set of<br />

eyes – in the form of multiple cameras and image sensors – is a<br />

first step, but it also will be critical for the automobile to<br />

interpret content from those images and react accordingly.<br />



Fig. 5. Cameras, enabled by high-performance vision processors, can "see" if<br />

objects are not in the expected place.<br />

To replace human decision making, a front facing camera,<br />

for example, has to be consistently faster than the driver in<br />

detecting and alerting for obstacles. While an ADAS system can<br />

physically react faster than a human driver, it needs embedded<br />

vision to provide real-time analysis of the streaming video and<br />

know what to react to.<br />

Vision processing solutions will need to scale as future<br />

demands call for more processing performance. A 1MP image<br />

is a reasonable resolution for existing cameras in automobiles.<br />

However, more cameras are being added to the car and the<br />

demand is growing from 1MP to 3MP or even 8MP cameras.<br />

The greater a camera’s resolution, the farther away an object can<br />

be detected. There are simply more bits to analyze to determine<br />

if an object, such as a pedestrian, is ahead. The camera frame rate (FPS) is also important: the higher the frame rate, the lower the latency and the more stopping distance remains available. For a 1MP<br />

RGB camera running at 15 FPS, that would be 1280x1024<br />

pixels/frame times 15 frames/second times three colors or about<br />

59M bytes/second to process. An 8MP image at 30fps will<br />

require 3264x2448 pixels/frame times 30 frames/second times<br />

three colors or about 720M bytes/second.<br />
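The two data rates above follow directly from resolution, frame rate and color depth:

```python
def bytes_per_second(width, height, fps, channels=3):
    # pixels per frame x frames per second x color channels (1 byte each)
    return width * height * fps * channels

rate_1mp = bytes_per_second(1280, 1024, 15)   # ~59 MB/s
rate_8mp = bytes_per_second(3264, 2448, 30)   # ~720 MB/s
print(rate_1mp / 1e6, rate_8mp / 1e6)
```

The jump from 1MP@15fps to 8MP@30fps is roughly a 12x increase in raw pixel bandwidth, which is why vision processing solutions must scale.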

This extra processing performance can’t come with a disproportionate spike in power or die area. Automobiles are consumer items under constant price pressure, so low power is very important. Vision processor architectures have to be optimized for power and area and yet still retain programmability.<br />

V. CHIP OPTIONS FOR DEEP LEARNING IMPLEMENTATIONS<br />

Implementing deep learning in embedded applications requires a lot of processing power with the lowest possible power<br />

consumption. Processing power is needed to execute<br />

convolutional neural networks – the current state-of-the-art for<br />

embedded vision applications – while low power consumption<br />

will extend battery life, improving user experience and<br />

competitive differentiation. To achieve the lowest power with<br />

the best CNN graph performance in an ASIC or SoC, designers<br />

are turning to dedicated CNN engines.<br />

GPUs helped usher in the era of deep learning computing.<br />

The performance improvements gained by shrinking die<br />

geometries combined with the computational power of GPUs<br />

provide the horsepower needed to execute deep learning<br />

algorithms. However, the larger die sizes and higher power<br />

consumed by GPUs, which were originally built for graphics and<br />

repurposed for deep learning, limit their applicability in power-sensitive<br />

embedded applications.<br />

Vector DSPs–very large instruction word SIMD processors–<br />

were designed as general purpose engines to execute<br />

conventionally programmed computer vision algorithms. A<br />

vector DSP’s ability to perform simultaneous multiplyaccumulate<br />

(MAC) operations help it execute the twodimensional<br />

convolutions needed to execute a CNN graph more<br />

efficiently than a GPU. Adding more MACs to a vector DSP will<br />

allow it to process more CNN computations per cycle, improving the frame<br />

rate. More power and area efficiency can be gained by adding<br />

dedicated CNN accelerators to a vector DSP.<br />

The best efficiency, however, can be achieved by pairing a<br />

dedicated yet flexible CNN engine with a vector DSP (Fig. 6).<br />

A dedicated CNN engine can support all common CNN<br />

operations (convolutions, pooling, elementwise) rather than just<br />

accelerating convolutions and will offer the smallest area and<br />

power consumption because it is custom designed for these<br />

parameters. The vector DSP is still needed for pre- and post-processing<br />

of the video images.<br />

Fig. 6. Adding a CNN engine to an embedded vision processor enables the<br />

system to learn through training.<br />

A dedicated CNN engine is also optimized for memory and<br />

register reuse. This is just as important as the number of MAC<br />

operations that the CNN engine can perform each second,<br />

because if the processor doesn’t have the bandwidth and<br />

memory architecture to feed those MACs, the system will not<br />

achieve the optimal performance. A dedicated CNN engine can<br />

be tuned for optimal memory and register re-use in state-of-the-art<br />

networks like ResNet, Inception, Yolo, and MobileNet.<br />

Even lower power can be achieved with a hardwired ASIC<br />

design. This can be the desired solution when the industry agrees<br />

on a standard. For example, video compression using H.264 was<br />

implemented on programmable devices before the standard was<br />

settled on, and implemented on ASICs afterwards. While CNN<br />

has emerged as the state-of-the-art standard for embedded vision<br />

implementation, how the CNN is implemented is evolving and<br />

remains a moving target, requiring designers to implement<br />

flexible and future-proof solutions.<br />

VI. TRAINING AND DEPLOYING DEEP LEARNING CNNS<br />

As mentioned earlier, a CNN is not programmed. It is<br />

trained. A deep learning framework, like Caffe or TensorFlow,<br />

will use large data sets of images to train the CNN graph –<br />

refining coefficients over multiple iterations – to detect specific<br />

features in the image. Fig. 7 shows the key components for CNN<br />

graph training, where the training phase uses banks of GPUs in<br />

the cloud for the significant amount of processing required.<br />

www.embedded-world.eu<br />

634


Fig. 7. Components required for graph training<br />

The deployment – or “inference” – phase is executed on the<br />

embedded system. Development tools, such as Synopsys’s<br />

MetaWare EV Toolkit, take the 32-bit floating point weights or<br />

coefficients output from the training phase and scale them to a<br />

fixed-point format. The goal is to use the smallest bit resolution<br />

that still produces equivalent accuracy compared to the 32-bit<br />

floating point output. Fewer bits in a multiply-accumulator<br />

means less power required to calculate the CNN and smaller die<br />

area (leading to lower cost) for the embedded solution. Most<br />

object detection or classification tasks need 8 bits of resolution<br />
to achieve the same accuracy as the 32-bit Caffe output.<br />

Advanced tools take the weights and the graph topology (the<br />

structure of the convolutional, non-linearity, pooling, and fully<br />

connected layers that exist in a CNN graph) and map them into<br />

the hardware for the dedicated CNN engine. Assuming there are<br />

no special graph layers, the CNN is now “programmed” to detect<br />

the objects that it’s been trained to detect.<br />

Fig. 8 shows the inputs and outputs of an embedded vision<br />

processor. The streaming images from the car’s camera are fed<br />

into the CNN engine that is preconfigured with the graph and<br />

weights. The output of the CNN is a classification of the contents<br />

of the image.<br />

Fig. 8. Inputs and outputs of embedded vision processor<br />

VII. DEEP LEARNING ALGORITHMS<br />

Some of the earliest implementations using CNNs were<br />

based on the neural networks or graphs used by the ImageNet<br />

winners. AlexNet was popular as a benchmark initially,<br />
although it has since fallen out of favor because some of its layers<br />
are inefficient or obsolete. VGG and versions of GoogleNet and ResNet<br />

are still popular as classification graphs. These graphs will take<br />

a two-dimensional image and return a probability that the<br />
image includes one of the objects that the graph was trained to<br />
recognize (Fig. 8). There is also an evolving class of<br />

localization graphs – CNNs that will not only identify what is in<br />

the picture, but will identify where the object is. RCNN (regional<br />

CNN), Faster RCNN, SSD and Yolo (Fig. 9) are examples of<br />

these graphs.<br />

Fig. 9. A TinyYolo CNN graph running on Synopsys DesignWare EV61<br />

processor provides an example of object detection and localization for<br />

automotive and surveillance applications<br />



We’ve discussed object classification of pedestrians (or<br />

bicycles or cars or trucks) that can be used for collision<br />

avoidance – an ADAS example. CNN engines with high enough<br />

performance can also be used for scene segmentation –<br />

identifying all the pixels in an image. The goal for scene<br />

segmentation is less about identifying specific pixels than it is to<br />

identify the boundaries between types of objects in the scene.<br />

Knowing where the road is compared to other objects in the<br />

scene provides a great benefit to a car’s navigation.<br />

Fig. 10. Scene segmentation identifies the boundaries between types of objects<br />

Much of the research has been aimed at improving the accuracy of<br />
object detection or recognition. As the accuracy improves, the<br />
focus has begun to shift to achieving high accuracy with fewer<br />
computations. Fewer computations will both lower bandwidth<br />
and improve power consumption of the implementation. In<br />
addition to new graphs, a lot of research has been focused on<br />
optimizing the existing CNN graphs by pruning coefficients or<br />

compressing features – the intermediate outputs of each layer of<br />

CNN computations. An important requirement is to make sure<br />

the CNN hardware and software supports the latest techniques<br />

of compression and pruning.<br />

VIII. POWER CONSIDERATIONS FOR DEEP LEARNING<br />

IMPLEMENTATIONS<br />

Deep learning algorithms like CNN have to process a lot of<br />

pixels in a short amount of time. This requires a significant<br />

amount of computations and lots of data transferred across an<br />

internal AXI bus. There is no question that power – or energy<br />

consumed – is high on the concern list for SoC designers, even<br />

in automotive designs.<br />

For a given process node, the easiest way to lower power is<br />

to start by lowering the frequency of the design. Other low<br />

power techniques include near-threshold logic where the logic<br />

runs at a lower voltage, greatly reducing the power required to<br />

switch the transistor. Minimizing external bus bandwidth also<br />

helps cut power. The less external bus activity, the less power is<br />

consumed. For an embedded vision application, increasing the<br />

size of internal memory will decrease bandwidth and thereby<br />

lower power, even though it will increase the overall area of the<br />

design. Another way to minimize bandwidth – and cut power – is<br />

to use compression techniques on CNN graphs to reduce the<br />

computations and memory usage.<br />

For the most power sensitive embedded vision applications,<br />

a vision processor with a dedicated CNN engine could be the<br />

difference between meeting the design’s power budget or<br />

missing it. Choosing a dedicated CNN engine seems intuitive,<br />

but how do you measure the power before silicon is available?<br />

Consider an application having to meet a performance<br />

threshold within a tight power and thermal budget such as a<br />

battery powered camera in the cockpit of the car to identify<br />

driver alertness. Facial recognition – depending on desired<br />

frame size, frame rate and other parameters – might require a<br />

few hundred GMAC/s of embedded vision processing power.<br />

An ASIC or SoC design must now find an embedded vision<br />

solution that can execute that network within the design’s power<br />

budget – let’s say several hundred mW.<br />

Unfortunately, comparing vision processor IP is not simple.<br />

Bleeding edge IP solutions often haven’t reached silicon yet, and<br />

every implementation is different, making it difficult to calculate<br />

and compare power or performance between IP options. No<br />

benchmark standards exist for comparing CNN solutions. An<br />

FPGA prototyping platform might provide accurate benchmarks<br />

but not accurate power estimates.<br />

One way to calculate power consumption is to run a RTL or<br />

Netlist based simulation to capture the toggling of all the logic.<br />

This information, using the layout of the design, can provide a<br />

good power estimate. For smaller designs, the simulation could<br />

complete in hours, e.g., running CoreMark or Dhrystone on an<br />

embedded RISC core. For large designs, however, the simulation<br />
runs slowly: larger CNN graphs with high frame rate requirements<br />
could take days or even weeks to reach a steady state at which<br />
power can be measured. There is a real risk of IP vendors skipping such<br />

arduous power measurements in favor of estimating power<br />

through shortcuts with smaller simulation models, pushing the<br />
problem downstream to the SoC vendors, who must then sign off<br />
on the IP vendor’s power analysis claims.<br />

Low power requirements aren’t limited to designs using<br />

small CNN graphs. An autonomous vehicle, for example, might<br />



require significant embedded vision performance – one or more<br />

8MP cameras running at 60 fps could require 20 to 30 TMAC/s<br />

of computational power – all within the lowest possible power<br />

budget. Note that these TMAC/s requirements might also be<br />

listed as tera-operations per second (TOP/s). Since a MAC cycle<br />

includes two operations (one multiply and one accumulate),<br />

MAC/s are converted to Ops/s by multiplying by two.<br />
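In code, the conversion is a single factor of two (a minimal helper for illustration):

```c
/* A MAC counts as two operations (one multiply, one accumulate),
 * so MAC/s convert to operations/s by multiplying by two. */
static double tmacs_to_tops(double tmacs)
{
    return 2.0 * tmacs;
}
/* e.g. the 20 to 30 TMAC/s quoted above corresponds to 40 to 60 TOP/s */
```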

For this application, having a dedicated CNN for the lowest<br />

power is only helpful if it can scale to higher levels of<br />

performance needed. Embedded vision processors such as<br />

Synopsys’ EV6x family address this challenge in two ways – by<br />

scaling the number of MACs within each CNN engine, and then<br />

by scaling multiple instances of the CNN engine on the bus<br />

fabric, e.g., a tailored NoC or standard AXI.<br />

IX. DEEP LEARNING AND FUNCTIONAL SAFETY<br />

Automotive manufacturers must ensure systems function<br />

correctly to avoid hazardous situations. The highest<br />

performance, lowest power CNN engine is of little use in an<br />

automotive design if it cannot meet critical safety requirements<br />

– like the ISO 26262 standard and Automotive Safety Integrity<br />

Levels (ASIL) – without significant loss of functionality. Safety-certified<br />

products must be able to detect and manage faults. As<br />

deep learning moves from an ADAS system whose only job is to<br />
alert the driver (e.g., lane departure warnings) to the primary<br />

decision maker driving the vehicle, fault detection becomes<br />

more critical. ASIL D (Fig. 11) is required for the most safety-critical<br />

components (


FOC SoC - Field Oriented Control Servo on Chip<br />

Dr. Lars Larsson<br />

Research & Development<br />

TRINAMIC Motion Control GmbH & Co. KG<br />

Hamburg, Germany<br />

Abstract— Field-Oriented Control (FOC), or Vector Control<br />

(VC), has been a well-known method for the energy-efficient<br />
commutation of electromagnetic motors for almost half a<br />
century. So far, FOC has been implemented completely in software.<br />

There are processors available with integrated hardware<br />

supporting base transformations (Clarke, Park, iPark, iClarke)<br />

that are required for FOC. In addition to these base<br />

transformations, the realization of FOC requires PI controllers,<br />

and different peripheral function blocks such as pulse width<br />

modulation (PWM), analog digital converter (ADC), and<br />

interfaces for encoder and for Hall signals (analog or digital), to<br />

form a complete FOC servo control system.<br />

This article discusses the advantages of a full implementation<br />

of the FOC in hardware on a single chip together with easy-to-use<br />
peripheral units for rotor position determination by an encoder<br />
and for current measurement with integrated scaling for signal<br />
conditioning as required for FOC. The implementation encompasses<br />
the most real-time critical inner FOC current regulation loop,<br />

together with the less time critical velocity control loop, and the<br />

less time critical position control loop. Altogether, integrated<br />

analog and digital building blocks form a Field-Oriented Control<br />

Servo-on-Chip – a FOC SoC.<br />

Keywords—Field Oriented Control; Vector Control; Servo<br />

Control; Hardware Implementation of FOC on a Chip; SoC<br />

I. INTRODUCTION<br />

The initial setup of the FOC is usually very time<br />

consuming and complex, although source code is freely<br />

available for various processors. This is because the FOC has<br />

many degrees of freedom that all need to fit together in a chain<br />

in order to work.<br />

Currently available processor architectures used for FOC<br />
limit the inner current regulation loop to PWM frequencies<br />
within a typical range of 20 kHz to 30 kHz – just outside the<br />
audible frequency range – which is sufficient when the<br />
PWM frequency is limited by the switches of the power stage.<br />

In contrast to software solutions, a hardware solution enables<br />

current regulation update rates up to 100 kHz and beyond. In<br />

addition, a hardware solution allows permanent monitoring of<br />

critical limits without eating up performance, unlike software<br />

solutions. On the other hand, software is more suitable for<br />

implementation of communication protocol handling and for<br />

flexible adaptation of application specific requirements.<br />

The integration of the FOC as a SoC (System-on-Chip)<br />

drastically reduces the number of required components and<br />

reduces the required printed circuit board (PCB) space. The high<br />

integration of FOC, together with velocity controller and<br />

position controller as a SoC, enables the FOC as a standard<br />

peripheral component that transforms digital information into<br />

physical motion.<br />

Compact size together with high performance and energy<br />

efficiency especially for battery powered mobile systems are<br />

enabling factors when embedded goes autonomous.<br />

Fig. 1. Illustration of FOC Basic Principle by Cartoon [8].<br />

II.<br />

FOC<br />

The base functions of the FOC are straightforward<br />

mathematics. Implementation of these base functions using<br />

floating-point arithmetic is possible in a simple way. A PC<br />

equipped with a processor with an integrated hardware floating-point<br />
unit can achieve a performance within the range of one<br />

million FOC calculations per second (Table 1). However, in<br />

terms of cost and energy a PC is not really an option for<br />

current control of a single motor with FOC. Different<br />

additional components are required to build a full FOC system<br />

for motor control.<br />




Fig. 2. FOC is an efficient method to turn a wheel [8].<br />

III.<br />

WHY FOC?<br />

It is a method for turning an electric motor smoothly with<br />

low torque ripple in the most energy efficient way. FOC is<br />

suitable for both motorized and regenerative operation. The<br />

method is proven over many years by many applications.<br />

IV.<br />

WHAT IS FOC?<br />

The Field Oriented Control was independently developed<br />

by K. Hasse, TU Darmstadt, 1968 [1], and by Felix Blaschke,<br />

TU Braunschweig, 1973 [2]. Theory of motor control [3] and<br />

control technology in general [4, 5] are fundamental for FOC,<br />

while the implementation of FOC brings more technical<br />

constraints into it [6, 7].<br />

The FOC is a current regulation scheme for electric motors<br />

that takes the orientation of the magnetic field of the stator of<br />

the motor and angle of the rotor of the motor with its magnetic<br />

axis into account. The FOC controls the torque so that<br />
the motor delivers the amount of torque that is requested as the target<br />
torque. The FOC maximizes active power and minimizes idle<br />
power – which finally results in the lowest power dissipation – by<br />
intelligent closed-loop control, illustrated by the cartoon (Fig.<br />

1). FOC is an efficient method to turn a wheel applying<br />

tangential force (represented by I Q) only while zeroing<br />

radial force (represented by I D) as the result of field<br />

oriented closed-loop control (Fig. 2).<br />


V. WHY FOC AS A PURE HARDWARE SOLUTION?<br />

The basic implementation of the inner FOC loop in<br />

software with C using double precision arithmetic is relatively<br />

easy. At first sight, one can achieve a performance that is<br />

more than sufficient by executing the code on a PC with a CPU<br />

with floating-point unit (FPU). Nevertheless, the initial setup of<br />

the FOC is usually very time consuming and complex,<br />

although source code is freely available for various processors.<br />

This is because the FOC has many degrees of freedom that all<br />

need to fit together in a chain in order to work.<br />

The hardware FOC as an existing standard building block<br />

drastically reduces the effort in system setup. With an off-the-shelf<br />

building block, the starting point of FOC is no longer the<br />

setup and implementation of the FOC itself and the creation<br />

and programming required for the interface blocks. Instead,<br />

only the parameters for the FOC have to be set up. Real parallel<br />
processing of hardware blocks decouples the higher-level<br />

application software from high-speed real-time tasks and<br />

simplifies the development of application software. With a<br />

field oriented control servo-on-chip realized as system-on-chip<br />

as a building block, users are free to use their qualified CPU<br />
together with their qualified tool chain. A field oriented control<br />

servo-on-chip as a hardware building block frees the user from<br />

fighting with processor-specific challenges concerning interrupt<br />

handling and direct memory access. The TMC4671 is such a<br />

FOC SoC [8]. There is no need for a dedicated tool chain to<br />

access the TMC4671 registers and to operate it. Only SPI (or UART)<br />

communication needs to be implemented for a given user CPU.<br />

FOC as a SoC (System-on-Chip) is in contrast to the<br />

classical FOC servo controller formed by a motor block and a<br />

separate controller box wired with motor cable and encoder<br />

cable. The high integration of FOC available as a standard<br />

peripheral component enables FOC for embedded applications<br />

where turning a motor is just part of an embedded application<br />

and not the primary application itself. A typical software FOC<br />

system architecture is outlined by Fig. 3 with the FOC as part<br />

of the application software. The challenge of this architecture is<br />

the software emulation of parallel processing of different tasks<br />

that might cause disturbance to the FOC itself. The pure<br />

hardware based FOC architecture is outlined by Fig. 4. In<br />

hardware, parallel processing of different tasks can be realized<br />

in a natural way. The challenge of hardware design is the<br />

higher effort in realizing the basic arithmetic function<br />

compared to software. The FOC as a standard SoC hardware<br />

component can fully encapsulate all real-time tasks from the<br />

software side.<br />


Fig. 3. Typical Software FOC System Architecture.<br />

Fig. 4. Hardware Based FOC System Architecture (TMC4671)<br />



VI.<br />

HOW DOES FOC WORK?<br />

Two force components generated by two current<br />

components act on the rotor of an electric motor. One<br />

component just pulls in the radial direction (I D), while the<br />
other component, pulling tangentially (I Q), applies torque.<br />

The ideal FOC - apart from field weakening - performs a<br />

closed-loop current regulation that results in a pure torque<br />

generating current I Q without direct current I D.<br />

From a top level perspective, FOC for three-phase motors<br />

uses three phase currents of the stator interpreted as a current<br />

vector (Iu; Iv; Iw) and calculates three voltages interpreted as a<br />

voltage vector (Uu; Uv; Uw) taking the orientation of the rotor<br />

into account in a way that only the torque generating current I Q<br />

results. As for two-phase motors, the FOC uses two phase<br />

currents of the stator interpreted as a current vector (Ix; Iy) and<br />

calculates two voltages interpreted as a voltage vector (Ux; Uy)<br />

taking the orientation of the rotor into account in a way that<br />

only a torque generating current I Q results. To do so, the<br />

knowledge of some static parameters (number of pole pairs of<br />

the motor, number of pulses per revolution of a used encoder,<br />

orientation of encoder relative to magnetic axis of the rotor,<br />

count direction of the encoder) is required together with some<br />

dynamic parameters (phase currents, orientation of the rotor).<br />

VII. WHAT IS REQUIRED FOR FOC?<br />

The FOC is based on a couple of transformations that need to<br />

be implemented. It takes the actual current vector together with<br />

the actual electrical angle from the motor and calculates a<br />

voltage vector that is applied to the motor.<br />

The FOC for three-phase (FOC3) permanent magnet<br />

synchronous motors (PMSM [9]) maps a three-dimensional<br />

current vector (Iu; Iv; Iw) together with an angle φ to a three-dimensional<br />
voltage vector (Uu; Uv; Uw):<br />

(Uu; Uv; Uw) = FOC3((Iu; Iv; Iw); φ)   (1)<br />

The FOC for two-phase (FOC2) permanent magnet<br />
synchronous motors (stepping motors [10]) maps the actual<br />
two-dimensional current vector (Ix; Iy) together with<br />
the actual electrical angle φ to a two-dimensional voltage vector<br />
(Ux; Uy) that is applied to the motor:<br />
(Ux; Uy) = FOC2((Ix; Iy); φ)   (2)<br />

Fig. 5. Inner FOC Loop Architecture for Three-Phase Motors (FOC3)<br />
Fig. 6. Inner FOC Loop Data Flow for Three-Phase Motors (FOC3)<br />
Fig. 7. Inner FOC Loop Architecture for Two-Phase Motors (FOC2)<br />
Fig. 8. Inner FOC Loop Data Flow for Two-Phase Motors (FOC2)<br />



The voltage vectors (Uu; Uv; Uw) and (Ux; Uy), respectively, drive<br />
the currents (Iu; Iv; Iw) and (Ix; Iy) toward the desired target current<br />
vectors satisfying the condition (I Q = I TARGET; I D = 0). To do so,<br />

the three currents need to be known together with the electrical<br />

angle of the magnetic axis of the rotor of the motor.<br />

The angle can be measured with different kinds of sensors<br />

(incremental encoder, analog encoder, digital Hall sensors, and<br />

analog Hall sensors). Alternatively, the angle can be estimated<br />

sensorlessly by measuring voltages and currents together with a<br />

mathematical model of the motor.<br />

A. Current Measurement, Offset Cleaning and Scaling for FOC<br />

The phase currents are essential state parameters for the<br />

FOC. The currents can be measured using sense resistors with<br />

sense amplifiers giving analog voltages that represent the<br />

measured currents.<br />

Alternatively, one can use isolated sense amplifiers with<br />

integrated delta-sigma modulators [11, 12, 13] that give digital<br />

delta-sigma signal streams representing the measured currents.<br />

Currents can also be measured based on Hall sensors.<br />

Whatever type of current measurement is selected, before the<br />

measured current values are available for processing within the<br />

FOC loop, they must be freed from offsets and scaled to the<br />

value range of the FOC loop.<br />
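This offset-and-scale step can be sketched in C as follows; the structure fields and the Q15 fixed-point format are illustrative assumptions, not actual TMC4671 registers:

```c
#include <stdint.h>

/* Minimal sketch of offset cleaning and scaling of a raw ADC current
 * sample into the FOC value range. Names and the Q15 format are
 * illustrative assumptions only. */
typedef struct {
    int32_t offset;  /* ADC reading at zero current, from calibration */
    int32_t scale;   /* fixed-point gain, Q15: 32768 represents 1.0   */
} current_cal_t;

static int16_t adc_to_foc_current(int32_t raw, const current_cal_t *cal)
{
    int32_t i = (raw - cal->offset) * cal->scale / 32768;
    if (i >  32767) i =  32767;  /* saturate to the 16-bit FOC range */
    if (i < -32768) i = -32768;
    return (int16_t)i;
}
/* with offset 2048 and unity scale (32768), raw 2148 maps to +100 */
```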

B. Electrical Rotor Angle, Orientation and Direction<br />

The electrical angle of the rotor is an essential state variable<br />

required for the FOC. An encoder measures the mechanical<br />

angle of the rotor in terms of its resolution, the number of<br />

positions per revolution (PPR). Some encoder vendors call this<br />

counts per revolution (CPR) or pulses per revolution (PPR),<br />

others give lines per revolution (LPR) or line counts (LC)<br />

where PPR might mean either CPR or a quarter of LPR.<br />

Nevertheless, the FOC needs to know the electrical angle<br />
normalized to its internal numerical representation. To map a<br />

mechanical angle measured by an encoder, one needs to take<br />

the number of pole pairs (NPP) of the used motor into account.<br />

The direction of rotation is an additional parameter as a<br />

degree of freedom. A possible phase shift between measured<br />

encoder angle and rotor angle needs to be taken into account by<br />

initialization. Hall signals that give absolute positions within<br />
each electrical period might have a different direction of<br />
revolution – a different sign of φ – than the rotor of the motor has.<br />

(Iq; Id) = PARK(φ) ∗ CLARKE ∗ (Iu; Iv; Iw)   (3)<br />
with<br />
Iv = −(Iu + Iw),   (4)<br />
(Uq; Ud) = PID((Eq; Ed)) with (Eq; Ed) = (Iq; Id) − (Iq_target; Id_target),   (5)<br />
PID(E) = P ∗ E(t) + I ∗ ∫ E(t) dt + D ∗ (d/dt) E(t),   (6)<br />
(Uu; Uv; Uw) = iCLARKE ∗ iPARK(φ) ∗ (Uq; Ud)   (7)<br />
with (matrices written row by row, rows separated by semicolons)<br />
CLARKE = (2/3) ∗ [ 1, −1/2, −1/2 ; 0, √3/2, −√3/2 ],   (8)<br />
PARK(φ) = [ cos(φ), sin(φ) ; −sin(φ), cos(φ) ],   (9)<br />
iCLARKE = [ 1, 0 ; −1/2, √3/2 ; −1/2, −√3/2 ],   (10)<br />
iPARK(φ) = [ cos(φ), −sin(φ) ; sin(φ), cos(φ) ].   (11)<br />

C. Basic Transformations and Functions of FOC<br />

The basic operations of FOC are Clarke Transformation,<br />

Park Transformation (PARK), PI control, inverse Park<br />

Transformation (iPARK), and inverse Clarke Transformation<br />

(iCLARKE).<br />

Fig. 9. Pure Hardware FOC Servo Controller on Chip as Engineering<br />
Sample in compact QFN Package (11.5 x 6.5 mm, 76 pins, 0.4 mm pitch).<br />



foc3(iu, iv, iw, phi, iq_tg, id_tg, &uu, &uv, &uw)<br />
{<br />
clarke(iu, iv, iw, &ia, &ib);<br />
park(ia, ib, phi, &id, &iq);<br />
pid(id, id_tg, Pd, Id, Dd, &ud, dt);<br />
pid(iq, iq_tg, Pq, Iq, Dq, &uq, dt);<br />
ipark(ud, uq, phi, &ua, &ub);<br />
iclarke(ua, ub, uu, uv, uw);<br />
}<br />

Fig. 10. C Code Structure of Three-Phase FOC3.<br />

foc2(ix, iy, phi, iq_tg, id_tg, &ux, &uy)<br />
{<br />
park(ix, iy, phi, &id, &iq);<br />
pid(id, id_tg, Pd, Id, Dd, &ud, dt);<br />
pid(iq, iq_tg, Pq, Iq, Dq, &uq, dt);<br />
ipark(ud, uq, phi, ux, uy);<br />
}<br />

Fig. 11. C Code Structure of Two-Phase FOC2.<br />

With these basic transformations of the FOC implemented<br />
together with a PI controller as software functions, the inner<br />
FOC loop needs to be calculated periodically by calling either<br />
foc3() or foc2() once per PWM cycle. An overview of reported<br />
FOC loop performance achieved by different processors<br />
compared to the FOC SoC is given in Table I.<br />

TABLE I. INNER FOC LOOP PERFORMANCE OUTLINE<br />
Platform | FOC loops / s [Citation]<br />
Intel i7-3777, 3.4 GHz, MinGW gcc 4.4.0, PC | 1.1 M [14] (a)<br />
AMD Ryzen Threadripper 1950X, 3.4 GHz, MinGW gcc 4.4.0, PC | 1.1 M [14] (a)<br />
Intel i7-3250M, 2.9 GHz, MinGW gcc 4.4.0, X230 Laptop | 1.1 M [14] (a)<br />
Intel Atom x7-Z8700, 1.6 GHz, MinGW gcc 4.4.0, Surface 3 | 450 k [14] (a)<br />
TMC4671, 25 MHz, pure hardware FOC SoC | 250 k [8]<br />
Texas Instruments AM437x, ARM Cortex-A9, 1 GHz | 47 k [15]<br />
STMicroelectronics STM32F103ZE, 72 MHz, 85% FOC | 2 × 20 k [16]<br />
Microchip Technology dsPIC33FJ32MC204, 21 MIPS, 66% FOC | 20 k [17]<br />
XC886/XC888, 96 MHz, 58% FOC | 20 k [18]<br />
Atmel (Microchip) AT32UC3B0256, 42 MHz, 35% FOC | 10 k [19]<br />
(a) Performance estimation of foc3() on a PC.<br />

Normally, an interrupt timer of highest priority triggers the periodic call of the inner FOC loop. This might become an issue for a software solution when another interrupt needs to be executed for protocol handling or monitoring tasks.<br />

VIII. DELTA-SIGMA CONVERTERS AND INTERFACE<br />

The phase currents need to be measured as input values for the FOC. Analog signal conditioning is done close to the power stage, with the measured currents available either as analog voltages from sense amplifiers like [20] or as digital delta-sigma data streams. Analog-to-digital converters (ADCs) based on the delta-sigma conversion principle [21] are also widely used in audio signal processing [22]. For control, delta-sigma conversion has the advantage that resolution can be traded against speed digitally. Additionally, delta-sigma sampling can measure throughout the entire PWM period, making it insensitive to noise and spikes due to switching events. Digitization of analog encoder signals or analog Hall signals can also be realized with delta-sigma ADCs, with the advantage that the bandwidth and resolution required for the encoder are digitally adjustable.<br />

Delta-Sigma ADCs integrated as part of the FOC SoC can internally operate with a 100 MHz delta-sigma sampling rate, giving good performance. External delta-sigma modulators for current measurement typically operate within a delta-sigma oversampling frequency range of 10 MHz to 20 MHz, which is sufficient for digitizing analog current signals in the typical frequency range from 0 Hz to a few kHz. The advantage of external isolated delta-sigma modulators for current measurement is that the sense amplifier is galvanically isolated. Additionally, digital delta-sigma data streams are quite insensitive to spikes and noise compared to analog sense amplifier voltages.<br />

For the FOC SoC with an integrated ADC engine designed to process either internal or external delta-sigma data streams, their usage reduces to selecting the type of delta-sigma source (internal or external, delta-sigma clock input or delta-sigma clock output) together with the delta-sigma clock frequency and the decimation rate, which determine the resolution and speed of the ADC channel.<br />
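The resolution-versus-speed adjustment via the decimation rate can be illustrated with a third-order sinc (sinc3) decimation filter, the filter commonly used on delta-sigma bitstreams. This is a generic sketch, not the FOC SoC's actual ADC engine: doubling the decimation rate R roughly adds three bits of resolution for a sinc3 filter while halving the output sample rate.

```c
#include <stdint.h>

/* sinc3 (CIC) decimator: three integrators run at the modulator bit rate,
   three differentiators run at the decimated output rate. */
typedef struct {
    int32_t  i1, i2, i3;   /* integrator stages          */
    int32_t  d1, d2, d3;   /* differentiator delay cells */
    uint32_t r, count;     /* decimation rate R, bit counter */
} sinc3_t;

/* Feed one modulator bit (0 or 1); returns 1 when *out holds a new sample.
   Full scale of the output is R^3. */
int sinc3_step(sinc3_t *f, int bit, int32_t *out)
{
    f->i1 += bit ? 1 : -1;
    f->i2 += f->i1;
    f->i3 += f->i2;
    if (++f->count < f->r)
        return 0;                       /* still accumulating */
    f->count = 0;
    int32_t c1 = f->i3 - f->d1;  f->d1 = f->i3;
    int32_t c2 = c1 - f->d2;     f->d2 = c1;
    *out = c2 - f->d3;           f->d3 = c2;
    return 1;
}
```

For a constant all-ones bitstream the output settles at the full-scale value R^3 after the filter's three-sample transient.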

IX. BENEFITS OF DELTA-SIGMA CONVERTERS FOR FOC<br />

Delta-Sigma ADCs integrated as part of a FOC SoC offer several advantages from the application point of view. With a delta-sigma signal-processing engine, external delta-sigma modulators are supported as well. Digital processing of high-frequency delta-sigma data streams requires dedicated hardware.<br />

A. Current Sensing<br />

From the current regulation point of view, the continuous oversampling of delta-sigma ADCs gives an advantage for the closed-loop current regulation itself, because the ADC values represent the mean current over a sample period or decimation period.<br />

www.embedded-world.eu<br />



B. Minimized Phase Shift by Simultaneous Sampling<br />
With delta-sigma ADCs integrated as part of a FOC SoC sampling all channels in parallel, the challenge of phase shift between ADC channels with multiplexed analog inputs disappears. The advantage of simultaneous sampling becomes especially relevant when using high-resolution analog sine-cosine encoders as position sensors.<br />

C. Adjustable ADC Resolution vs. ADC Bandwidth<br />

From a system-building point of view, delta-sigma ADCs integrated as part of a FOC SoC with digitally processed delta-sigma data streams enable flexible adjustment of resolution versus bandwidth, covering a wide range of applications with a single ADC hardware, in contrast to the application-specific selection and interfacing of different types of external ADCs with different speeds, resolutions, and interface types.<br />

D. Support of External Delta-Sigma Modulators<br />

From a system-building point of view, support of external isolated delta-sigma modulators enables applications where measurement of the phase currents is challenging due to high potential differences.<br />

X. UNIFIED POSITION SENSOR INTERFACE<br />

For a FOC SoC as a standard peripheral component, different types of position sensor interfaces (digital Hall sensors, digital incremental encoders, analog Hall sensors, and analog sine-cosine encoders) need to be supported in a unified way, taking motors with different numbers of pole pairs into account. In hardware, this can be realized by a set of registers that hold the relevant parameters of the available position sensors, mapping them to a normed electrical period for commutation. This decouples the used position sensor from the FOC itself.<br />

XI. PWM<br />

Pulse width modulation (PWM) is essential for energy-efficient power conversion. Where processors provide generic PWM peripherals, our integrated FOC SoC solution provides an integrated PWM unit dedicated to closed-loop control of brushed DC motors, two-phase stepper motors with FOC2, and three-phase permanent magnet synchronous motors (PMSM) with FOC3. The application just needs to select the type of motor. Optionally, the application can select pulse width modulation (PWM) or space vector pulse width modulation (SVPWM) for more efficient voltage usage with one control bit, whereas the PWM units of processors require more or less complex configuration that, in the worst case, might require re-compilation of the processor firmware. Additionally, the PWM frequency of the FOC SoC can be changed at any time during motion by setting a single parameter, in contrast to processors that require changes of several parameters of the PWM unit and might require adjustment or re-compilation of the FOC for another PWM frequency. Changing the PWM frequency is especially useful for low-inductance motors like [23] to reduce the phase current ripple, the associated torque ripple, and the supply current ripple. High PWM frequencies can reduce the power dissipation within the motor itself [24] due to lower current ripple at higher frequencies. On the other hand, higher PWM frequencies cause higher power dissipation within MOS-FET or IGBT power stages and within their gate drivers. MOS-FETs like [25, 26] with low gate charge are able to switch fast in the 100 kHz PWM frequency range with relatively low power dissipation using 1 A gate drivers like [27]. It makes sense to increase the PWM frequency dynamically when a motor runs at high speed and to decrease it when the motor runs at low speed or is at rest.<br />
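The speed-dependent PWM frequency policy described above can be sketched as a simple lookup; the speed thresholds and frequency steps here are invented for illustration, not values from the FOC SoC:

```c
#include <stdint.h>
#include <math.h>

/* Pick a PWM frequency from the motor velocity: high frequency at high
   speed to cut current and torque ripple, low frequency near standstill
   to cut switching losses. Thresholds are illustrative assumptions. */
uint32_t select_pwm_freq_hz(float velocity_rpm)
{
    float v = fabsf(velocity_rpm);
    if (v > 3000.0f)
        return 100000u;   /* high speed: 100 kHz, low-ripple region      */
    if (v > 300.0f)
        return 50000u;    /* mid speed: 50 kHz                           */
    return 25000u;        /* low speed / standstill: 25 kHz, low losses  */
}
```

On the FOC SoC this amounts to writing a single frequency parameter during motion; a hysteresis band around each threshold would avoid toggling between frequencies in a real design.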

Fig. 12. Raw Delta-Sigma ADC Values of Phase Currents (left) and offset-free, amplitude-scaled ADC Values (right) as Input for the FOC Engine.<br />




Fig. 13. FOC SoC Architecture with multi-ported Register Bank, primary Application Interface (SPI) and Real-Time Monitoring Interfaces (DBGSPI, UART)<br />

With the PWM frequency, one can control whether the power dissipation takes place more within the power stage or more within the motor. In contrast to MOS-FET stages, GaN-FET power stages can operate with low power dissipation even at PWM frequencies up to the MHz range, which could be processed by the FOC hardware where beneficial for applications.<br />

An additional parameter that affects the power dissipation is the so-called break-before-make (BBM) time. For the FOC SoC, this time is programmable, even during motion, separately for high-side switches and low-side switches in steps of 10 ns for fine-tuning. For gate drivers that handle their own BBM time, the BBM time handling can be disabled. Programmability of BBM times is essential for a FOC SoC solution as well as for the PWM units of processors.<br />

With FOC as a SoC, as a standard peripheral solution, one can focus on setting up and parameterizing the FOC itself instead of first implementing the FOC and then setting it up as a second step. With a FOC SoC, one can use any qualified processor with its qualified tool chain, in contrast to running the FOC in software on a processor. FOC as hardware takes care of all real-time critical tasks and keeps the processor free from the FOC, so that it can process the user application and protocol handling, which fit better to software. A hardware FOC that processes different types of position sensors and current sensors in a uniform way decouples application software development from those sensors. This decoupling of real-time tasks by dedicated hardware simplifies application software development, where turning a motor with FOC is just one component of an embedded system.<br />

XII. MULTI-PORTED COMMUNICATION INTERFACES<br />
Multi-ported user interfaces enable real-time monitoring of internal parameters while the FOC is running (fig. 13). Realizing this in hardware does not disturb the execution of the FOC operations. Multi-ported access implemented in software might disturb execution of the FOC when it consumes too much processing power.<br />

XIII. CONCLUSION<br />

The FOC has many degrees of freedom, due to a chain of parameters that all need to fit together for a successful setup of the FOC. With a hardware FOC available as a building block providing all necessary functionality, one can focus on parameterizing the FOC for the given application itself. The ability to look at internal registers in real time, in parallel to the running application, enables monitoring and initial setup with external tools without re-compiling software.<br />

REFERENCES<br />
[1] K. Hasse, Zur Dynamik drehzahlgeregelter Antriebe mit stromrichtergespeisten Asynchron-Kurzschlußläufermaschinen, Dissertation, TH Darmstadt, 1969.<br />

[2] Felix Blaschke, Das Verfahren der Feldorientierung zur Regelung der<br />

Drehfeldmaschine, Dissertation, TU Braunschweig. 1974.<br />

[3] W. Leonhard, Control of Electrical Drives, 3rd Edition, Springer, 2003.<br />
[4] W. Leonhard, Einführung in die Regelungstechnik: Lineare und nichtlineare Regelvorgänge für Elektrotechniker, Vieweg, 1992.<br />

[5] Michael A. Johnson, PID Control: New Identification and Design<br />

Methods, Springer, 2005.<br />

[6] Nguyen Phung Quang, Praxis der feldorientierten<br />

Drehstromantriebsregelungen, expert Verlag, 1993.<br />

[7] Nguyen Phung Quang, Jörg-Andreas Dittrich, Vector Control of Three<br />

Phase AC Machines, System Development in the Practice, Second<br />

Edition, Springer-Verlag, 2015.<br />

[8] TMC4671 Preliminary Datasheet 0v90, TRINAMIC Motion Control<br />

GmbH & Co. KG, September 29, 2017, www.trinamic.com<br />

[9] T. J. E. Miller, J. R. Hendershot, Design of Brushless Permanent-Magnet Machines, Motor Design Books LLC, 2010.<br />



[10] P. Acarnley, Stepping Motors: A Guide to Theory and Practice, Institution of Engineering and Technology, 4th edition, 2002.<br />

[11] AD7400 Isolated Sigma-Delta Modulator, Analog Devices, 2013.<br />

[12] AD7401 Isolated Sigma-Delta Modulator, Analog Devices, 2015.<br />

[13] AD7403 16-Bit Isolated Sigma-Delta Modulator, Analog Devices, 2015.<br />

[14] L. Larsson, TRINAMIC, Performance Estimation of FOC Loop with Double Precision Arithmetic in C with MinGW, unpublished, 2017.<br />

[15] TIDU701–December 2014 AM437x Single Chip Motor Control<br />

Benchmark, Texas Instruments, 2014.<br />

[16] AN3165 Application Note Digital PFC and dual FOC MC integration,<br />

STMicroelectronics, 2010, p. 16.<br />

[17] Jorge Zambada, Debraj Deb, Application Note AN 1078, Sensorless<br />

Field Oriented Control of a PMSM, Microchip Technology Inc., 2010.<br />

[18] Field Oriented Control Using XC886/888 MCU, Application Brief, Infineon Technologies, 2007; XC886/888 CM/CLM 8-Bit Flash Microcontroller, Sensorless Field Oriented Control for PMSM Motors, AP08059 Application Note V1.0, Infineon Technologies, 2007.<br />

[19] AVR32723: Sensor Field Oriented Control for Brushless DC motors<br />

with AT32UC3B0256, Atmel, 2009.<br />

[20] AD8418 Bidirectional Current Sense Amplifier, Analog Devices, 2013.<br />

[21] Shanthi Pavan, Richard Schreier, Gabor C. Temes, Understanding Delta-<br />

Sigma Data Converters, IEEE Press Series on Microelectronic Systems,<br />

Wiley, Second Edition, 2017.<br />

[22] Udo Zoelzer, Digital Audio Signal Processing, Wiley, 2008.<br />

[23] PMSM 3274G024BP4 3692 Datasheet, Faulhaber 2018.<br />

[24] Shunsuke Amano, Kan Akatsu, Study on High Frequency Inverter with<br />

100kHz Current Feedback Control by Using FPGA, 2014 17th<br />

International Conference on Electrical Machines and Systems (ICEMS),<br />

Oct. 22-25, 2014, Hangzhou, China.<br />

[25] BSZ068N06NS OptiMOS 60V Power-Transistor Data Sheet, Rev. 2.0<br />

2013-10-17, Infineon Technologies, 2013.<br />

[26] BSC030N08NS5, OptiMOS, 80V Power-Transistor Data Sheet,<br />

Rev.2.2 2014-11-10, Infineon Technologies, 2014.<br />

[27] LM5109B High Voltage 1A Peak Half-Bridge Gate Driver, Data Sheet,<br />

Texas Instruments, 2016.<br />

[28] L. Larsson, Hardware FOC-Servo-Regler mit integrierten Schnittstellen<br />

für autonomen Betrieb, Forum Elektromagnetismus 2017, Technische<br />

Akademie Esslingen (TAE) & Hochschule Heilbronn Campus<br />

Künzelsau (HHN) - Reinhold-Würth-Hochschule, February 16-17, 2017,<br />

Tagungshandbuch 2017, pp. 131-141<br />



High Speed Interfaces in Cost Optimized FPGAs<br />

Ted Marena<br />

Director of SoC FPGA Marketing<br />

Marketing Chair RISC-V Foundation<br />

Microsemi Corporation<br />

3870 N 1st Street, San Jose, CA 95134<br />

ted.marena@microsemi.com<br />

https://www.linkedin.com/in/tedmarena/<br />

Abstract—This document explores high speed interfaces that<br />

are now available in cost optimized FPGAs. The presentation will<br />

explain interfaces such as 1Gb Ethernet, JESD204B, PCIe,<br />

HDMI and DDR4 memory interfaces and how they can be<br />

utilized in cost optimized, mid-range density FPGAs. In the<br />

presentation, design examples showing these interfaces and steps<br />

necessary to implement these functions will be shown. For each<br />

design, the typical power consumption will be provided so<br />

engineers can judge for themselves the benefits of the new class of<br />

cost optimized, mid-range FPGAs. We will go into detail on the<br />

FPGA densities offered as well as the package sizes that could be<br />

leveraged for embedded designs.<br />

I. INTRODUCTION<br />

Industrial designs are increasingly requiring higher<br />

performance interfaces. Protocols such as DDR4 memory,<br />

10Gigabit Ethernet, JESD204B, Gigabit Ethernet, PCIe and<br />

more are becoming commonplace. These higher speed<br />

interfaces are often found on high end FPGAs which are often<br />

overkill and cost prohibitive for most embedded designs. Now<br />

there exists a new class of mid-range density FPGAs which are<br />

cost optimized, consume lower power and offer smaller form<br />

factors with generous high speed interfaces.<br />

II. MARKET DYNAMICS<br />
Although the industrial segment is unique, it shares several characteristics with other vertical markets. The requirement for better value and lower cost is a growing driver for industrial designs. In addition, faster and more numerous networking interfaces are becoming commonplace. Finally, faster processing performance in many embedded designs is now the new norm. These factors result in architectures which require interfaces such as Gigabit Ethernet, transceivers up to 12.7 Gbps for 10 Gb Ethernet, JESD204B ADC/DAC, PCIe interfaces, HDMI 2.0b and, lastly, DDR4 memory buses. Now that these types of interfaces are available in cost-optimized mid-range FPGAs, system architects can address the latest market dynamics for their products.<br />
<br />
III. KEY HIGH SPEED INTERFACES<br />
The most common interface being leveraged in many industrial designs is Gigabit Ethernet. Most commonly, an FPGA interfaces to a PHY via a serial SGMII interface. In the past, an SGMII interface required using high-speed transceivers in FPGAs, but with new cost-optimized mid-range FPGAs, SGMII interfaces are now available on generic GPIO pins.<br />
<br />
A. SGMII on GPIO<br />
Power-efficient Gigabit Ethernet interfaces are often required in industrial system architectures. Many embedded product developers are using Gigabit Ethernet for an increasing number of connections. No longer only for data payloads, these links are becoming ubiquitous for control, management, status, and more. Traditional mid-range FPGAs can support these 1 Gbps speeds, but they require transceivers to implement 1G SGMII interfaces (as well as other high-speed interfaces). Ideally, a device would have generic I/O pins that could support SGMII, as the following illustration shows.<br />

Low-end FPGAs and traditional mid-range FPGAs do not have this feature, so they must rely on transceivers. These transceiver interfaces are precious and frequently scarce unless very expensive, higher-density FPGA fabrics are used. A very large FPGA fabric is often not required in industrial<br />



designs, but designers are forced to choose these devices because they require additional transceivers. In addition, these larger devices dictate larger package form factors. These existing solutions increase both power consumption and cost, in opposition to the lower-cost demands of the industrial market.<br />

The new PolarFire FPGAs offer cost-optimized mid-range densities and address the requirement for numerous GigE links via SGMII on GPIOs. What differentiates this family is that it incorporates a clock and data recovery (CDR) circuit into high-speed LVDS I/Os that can support 1.25 Gbps. This allows the device to support SGMII interfaces on several select GPIO pins. Using this architecture, designers can reduce the cost, size and power of their designs versus traditional high-end FPGAs.<br />

B. Transceiver to support 10Gb Ethernet, JESD204B, PCIe,<br />

HDMI 2.0b and more<br />

Although industrial and embedded designs are not typically<br />

very high performance, the processing needs are increasing and<br />

the interfaces are also getting faster. These factors necessitate<br />

that FPGAs can support serial interfaces up to 12.5Gbps, so<br />

that these common interfaces can be supported:<br />

• PCIe Gen2 requires 5Gbps<br />

• HDMI 2.0b needs 6Gbps<br />

• 10Gb Ethernet requires 10Gbps<br />

• JESD204B can run up to 12.5Gbps<br />

These high-speed serial interfaces require transceivers able to operate at the speeds listed above. The performance for these rates is trivial for high-end FPGAs or mid-range FPGAs that are built off of high-end architectures. The issue with these devices is that they are costly and often beyond the budget of many embedded designs. Low-density FPGAs often do not have transceivers, and those that incorporate them do not support the performance rates listed. Fortunately, cost-optimized, mid-range density FPGAs with the right mix of LEs (logic elements) and transceivers can support the required data rates. Below is a table of one such FPGA family: the PolarFire FPGAs support a range of densities from 100k to 500k LEs, and each device offers up to 12.7 Gbps transceivers.<br />

Package balls / spacing       | 325 / 0.5 mm | 484 / 0.8 mm | 484 / 1 mm | 536 / 0.5 mm | 784 / 1 mm | 1152 / 1 mm<br />
Max LEs                       | 192K | 300K | 300K | 300K | 481K | 481K<br />
Max CDR GPIOs (SGMII)         | 8    | 14   | 13   | 15   | 20   | 24<br />
Transceivers up to 12.7 Gbps  | 4    | 4    | 8    | 4    | 16   | 24<br />
Max possible SGMII interfaces | 12   | 18   | 21   | 19   | 36   | 48<br />

These devices enable industrial architects not only to support the latest high-speed serial interfaces but also to implement the necessary board functions with adequate LEs on chip. In addition, because this family offers both SGMII on GPIO and transceivers, designers can often select smaller package sizes and densities, thus lowering the system cost and reducing the power needed for their FPGA functionality.<br />

C. DDR4 Interfaces<br />

Many embedded designs are required to interface to other parts of the overall system; stand-alone boards are, by and large, the exception and not the norm. Because designs often communicate and network with other system components, data is being transmitted to and from most industrial designs. As the data rates increase, so does the need to store the data so it can be processed and acted upon. Hence the need for memory on many industrial designs.<br />

The most common memory that engineers tend to connect to an FPGA is DDR DRAM. There are several generations to choose from, and generally speaking the best choice is to use memory which has been shipping for some time, but not the absolute newest standard. For DRAMs, the best cost per bit and the architecture that will be supported for numerous years is DDR4. Although DDR3 is still a viable choice for designs, the majority of new designs are choosing DDR4 because it will offer reduced pricing in the future, faster performance, and wider single-chip data buses.<br />

Today there are no low-density FPGAs that support DDR4 memory interfaces; one must go to mid-range density FPGAs for DDR4 interfaces. Previously, mid-range FPGAs built off of high-end architectures were the only choice. The issue with these devices is that they are costly, consume high power, and come in very large form factors. One should look to new cost-optimized mid-range FPGAs to offer the required DDR4 performance in smaller packages and at lower cost to meet the new demands on industrial designs. Below are a few cost-optimized mid-range FPGA devices offered in smaller package sizes that support DDR4 interfaces.<br />

D. Conclusion<br />

With the growing demands of higher performance<br />

interfaces, more connectivity and lower costs for industrial<br />

designs, system architects and engineers need to look for new<br />

solutions. Today’s cost optimized, mid-range density FPGAs<br />

solve these design challenges. These devices offer great value and lower power consumption while still providing the capabilities<br />

demanded by modern industrial designs.<br />



Lucky Seven<br />

Taking Advantage of COM Express Type 7<br />

Ansgar Hein<br />

Marketing<br />

iesy GmbH & Co. KG<br />

Meinerzhagen, Germany<br />

sales@iesy.com<br />

Abstract—This talk delivers facts and features of the latest<br />

PICMG development as well as give insights on custom solutions<br />

based on this Server on Module standard.<br />

Keywords—COM Express; Type 7; PICMG; Server on Module;<br />

custom; solution; speed; memory; connectivity; flexibility;<br />

virtualization; size; power; performance; 10GbE; customization;<br />

Micro-Server; ATX; 19”; QuadServer<br />

I. INTRODUCTION<br />

COM Express Type 7 is a brand-new standard introduced by PICMG. However, Type 7 is not a replacement for the existing and well-established Type 6 pin-out. In place of all audio and video interfaces of Type 6, it offers four 10GbE ports and a total of 32 PCI Express lanes in order to support high computing performance and high-speed communication while reducing power consumption at the same time.<br />

II. DIFFERENCE BETWEEN TYPE 6 AND TYPE 7<br />

There are several reasons why Type 7 was introduced. One<br />

is the availability of server-class SOC processors. Another is<br />

the support of 10 Gigabit Ethernet and NC-SI signals, as well as the definition of a larger number of PCIe lanes for high-speed data transfer. The differences are as follows:<br />

A. Added in Type 7<br />

4 x 10GBaseKR Ethernet<br />

NC-SI<br />

32 x PCI express lanes<br />

2 x SATA<br />

4 x USB 3.0 / 2.0<br />

B. Removed in comparison to Type 6<br />

DDI [0:2]<br />

SATA [2:3]<br />

AC97 / HDA Audio<br />

VGA<br />

LVDS/eDP<br />

USB 2.0 [4:7]<br />

III. ADVANTAGE NO. 1: SPEED<br />

When it comes to speed, COM Express Type 7 is unparalleled in the market for embedded computing modules. First, because of its 32 PCI Express 3.0 lanes in place of the 16 PCI Express lanes of COM Express Type 6. Compared to Type 6 modules, Type 7 modules offer a 40-fold increase in bandwidth when it comes to network connectivity. Second, because of its support for the M.2 socket and thus a wide range of expansions, ranging from storage to connectivity. Third, the COM Express Type 7 standard comes with server-grade processors, making it a true server-on-module approach with Intel® Xeon® class processors. A further plus is its headless design which, in combination with a Baseboard Management Controller, makes COM Express Type 7 modules the perfect match for any server application you can think of.<br />

IV. ADVANTAGE NO. 2: STORAGE<br />

Looking at the difference between Type 6 and Type 7, one might wonder why and how the removal of two SATA ports can lead to an advantage for storage, since server applications always have a high demand for storage capabilities. The use of fast SSDs in place of SATA disks makes the SATA interface a bottleneck. This is where NVMe (non-volatile memory express) comes in: a new specification for connecting mass storage via PCI Express, and Type 7 supports this development through its increased number of PCI Express lanes.<br />
<br />
M.2 NVMe is probably the most advanced specification for internal extension cards currently available for embedded systems. COM Express Type 7 now makes full use of this form factor, for example to connect SSDs at a higher speed compared to mSATA. Since NVMe reduces I/O overhead and latencies (see Table I), a paradigm change is about to take place in mass storage solutions for server environments, leading to an increased use of M.2 SSDs. Further to this, the slim design of M.2 allows for smaller footprints in storage solutions. However, one downside at present is the limited capacity of max. 8 TB for M.2 storage modules and a higher price compared to mSATA solutions, but both will change for the better in the near future. Talking about the future: in 2016, Intel announced Optane SSD products based on the brand-new 3D XPoint technology, offering a thousand times more performance and durability than NAND flash technology. NVMe already supports Optane.<br />



TABLE I. COMPARISON OF SATA AND NVME<br />
<br />
                                | SATA                     | NVMe<br />
Maximum Queue Length            | 1 queue, 32 commands     | 65535 queues, 65536 commands per queue<br />
Interrupts                      | 1 single interrupt       | 2048 MSI-X<br />
Parallelism and Multi-threading | needs synchronization locking for execution | no locking<br />

V. ADVANTAGE NO. 3: CONNECTIVITY<br />

The COM Express Type 7 standard has been released<br />

because of the bandwidth bottleneck in connected devices that<br />

need to interact in (industrial) applications, where many<br />

devices exchange massive data streams and need to be<br />

synchronized in real time (i.e. IoT, telemedicine). Further to<br />

this the new PICMG standard provides for the addition of up to<br />

four 10 Gigabit Ethernet (10GbE) interfaces on the baseboard.<br />

An increased number of PCI Express lanes (32 instead of 16)<br />

provides a wealth of connectivity and interface options.<br />

While all 10-GbE interfaces on the module are defined as<br />

10GBASE-KR single backplane lanes to keep them from being<br />

bound to predefined physical interfaces, the PHY is not places<br />

on the module itself, but on the baseboard. This allows for even<br />

greater flexibility, since the interfaces can be implemented as<br />

interchangeable SFP+ modules. Further to this it is also<br />

possible to combine the performance of several 10-GbE signals<br />

into in a PHY for 40GBASE-KR4 for example.<br />

VI. ADVANTAGE NO. 4: FLEXIBILITY<br />

Talking about flexibility, there are many use-cases for<br />

COM Express Type 7 across various markets because of its<br />

versatile high-speed approach. Especially in Industry 4.0<br />

environments there is a huge need for server-like appliances<br />

with high-speed connections, fast memory and computing<br />

performance. Due to its compact size, COM Express Type 7<br />

allows for small-scale housing. At iesy we have developed four<br />

different types of baseboards as platforms for customization:<br />

A. embedded 5x5 – for micro-servers on the shopfloor<br />

Micro-Server based on the Mini STX form-factor (formerly<br />

known as Intel 5x5), making it suitable for standard housings:<br />

1 × COM Express Type 7 basic module<br />

Intel® Atom C3xxx with up to 16 cores<br />

up to 4 × SFP+ for 4 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

3 × DDR4 SO-DIMM socket with up to 48 GB<br />

2 × USB 3.0<br />

2 × M.2 M-Key slot (1× PCIe x4, 1× SATA multiplex)<br />

1 × M.2 A-Key slot (2× PCIe x1, 1× USB 2.0, I²C)<br />

1 × Baseboard Management Controller (BMC)<br />

Dimensions: 140mm × 147mm × 55mm (incl. cooling<br />

solution)<br />

B. Flex ATX – for legacy form-factors<br />

Makes use of the advantages of the popular Flex ATX<br />

form-factor and can thus easily be implemented in existing<br />

infrastructures and Flex ATX housings:<br />


1 × COM Express Type 7 basic module<br />

Intel® Atom & Intel® Xeon® with up to 16 cores<br />

up to 6 × SFP+ for 6 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

2 × USB 3.0<br />


2 × M.2 M-Key slot, i.e. for NVMe SSD<br />

1 × RS232<br />

Dimensions: 229mm × 191mm × 46mm<br />

C. Basic Size – for minimal footprint<br />

Sized exactly to meet the dimensions of a COM Express<br />

basic module, this minimalistic approach delivers the most<br />

compact COM Express Type 7 experience possible while at the<br />

same time providing a maximum of high-speed interfaces:<br />


1 × COM Express Type 7 basic module<br />

Intel® Atom & Intel® Xeon® with up to 16 cores<br />

4 × SFP+ for 4 × 10 GBit Ethernet<br />

2 × 1000/100/10 MBit Ethernet<br />

2 × USB 3.0<br />


1 × M.2 M-Key slot, i.e. for NVMe SSD<br />

1 × mini PCIe slot<br />

Dimensions: 125mm × 90mm × 58mm<br />

D. 19” QuadServer – for datacenters on the shopfloor<br />

Designed with industrial applications and datacenters in<br />

mind, but without the need for datacenter cooling solutions,<br />

this high-end rackmount server appliance offers extensive<br />

processing power as well as switching and storage capabilities:<br />


4 × COM Express Type 7 basic module<br />

Intel® Xeon® with up to 64 cores per HU<br />

16 × 10 GBit Ethernet (thereof 4 × external)<br />

8 × M.2 M-Key slot, for NVMe SSD (up to 64 TB)<br />

8 × USB 3.0<br />


1 × integrated Switch with 48 hosts & 48 clients and<br />

up to 128 GBit max. transfer rate<br />

4 × boot devices via SSD<br />

Dimensions: 483mm × 44mm × 530mm (incl. fans)<br />



Fig. 1. embedded 5x5 baseboard with cooling solution / © www.iesy.com<br />

Fig. 2. Flex ATX baseboard with BMC and 6 × 10 GbE / © www.iesy.com<br />

Fig. 3. Basic Size with 95mm × 125mm footprint / © www.iesy.com<br />

Fig. 4. QuadServer with 4 × COM Express Type 7 / © www.iesy.com<br />

VII. ADVANTAGE NO. 5: VIRTUALIZATION<br />

A few years ago, there were only few use-cases for<br />

virtualization in server-environments. Today there are several<br />

trends, for example Software-Defined Networking and<br />

Network Functions Virtualization (SDN/NFV) in carrier-grade<br />

business, or the demand to separate real-time systems in<br />

industrial applications from IoT connectivity. These two trends<br />

lead to an increasing demand in virtualized server technologies.<br />

Most vendors of COM Express Type 7 modules support<br />

virtualization technologies, like the RTS hypervisor, which is well-accepted<br />

in industrial and medical real-time applications.<br />

Especially Industry 4.0 applications require redundant edge<br />

or fog servers right on the shop-floor, consisting of dedicated<br />

infrastructure components, such as firewalls, routers, load-balancers<br />

and storage servers. All of these can be virtualized<br />

with software-based solutions while all configurations are<br />

interconnected through redundant 10 GbE interfaces. The<br />

virtualized environments are hardware independent and allow<br />

for the development of multi-tenant nodes for faster<br />

implementation of heterogeneous machines, systems or sensor<br />

networks. This results in more agile, flexible and scalable<br />

installations which are well-suited to meet the requirements of<br />

all kinds of Industry 4.0, M2M and IoT services.<br />

Extensive remote monitoring and management systems are<br />

required when using virtualization at the above-mentioned<br />

level. Using a Baseboard Management Controller (BMC) helps<br />

you with out-of-band management tasks, such as rebooting a<br />

system, mounting virtual media, accessing the console from<br />

remote, managing firmware or tracking physical conditions as<br />

well as checking event logs. COM Express Type 7 modules do<br />

not have a BMC onboard and not all iesy baseboards are<br />

equipped with one. However, COM Express Type 7 supports<br />

the Network Controller Sideband Interface (NC-SI) and thus<br />

offers the possibility to run OpenBMC which allows for a wide<br />

range of system administration features and helps in remotely<br />

managing (virtualized) servers at large scale.<br />
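As an illustration of such out-of-band management, the sketch below (Python) flags sensors approaching their critical thresholds. The field names follow the Redfish-style schema that OpenBMC exposes over REST, but the sensor names and values here are invented for the sketch; a real client would fetch the JSON from the BMC instead of using a canned payload.<br />

```python
import json

# Illustrative Redfish-style thermal payload; sensor names and values
# are invented for this sketch, not read from a real BMC.
SAMPLE_THERMAL = json.dumps({
    "Temperatures": [
        {"Name": "CPU Temp", "ReadingCelsius": 62, "UpperThresholdCritical": 95},
        {"Name": "Inlet Temp", "ReadingCelsius": 34, "UpperThresholdCritical": 55},
    ]
})

def critical_sensors(thermal_json, margin=10):
    """Return names of sensors within `margin` degrees of their critical limit."""
    data = json.loads(thermal_json)
    flagged = []
    for t in data.get("Temperatures", []):
        reading = t.get("ReadingCelsius")
        limit = t.get("UpperThresholdCritical")
        if reading is not None and limit is not None and limit - reading <= margin:
            flagged.append(t["Name"])
    return flagged
```

With the sample payload above, no sensor is within the default 10-degree margin; widening the margin flags them one by one.<br />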

VIII. ADVANTAGE NO. 6: SIZE<br />

Size matters. The modular approach within the COM<br />

Express specification strikes a careful balance between cost<br />

and performance and results in a variety of COM Express form<br />

factors and board sizes defined in the standard. Right now there<br />

are seven different versions that rely on a set of commonly<br />

defined connectors and mounting holes as well as common<br />

signaling where appropriate. Though currently only available<br />

in basic size (95mm × 125mm), COM Express Type 7 modules<br />

deliver a server-on-module approach at high-density level,<br />

allowing for compact casing or mounting of several modules<br />

into one 19” 1U system. As shown in Section VI, iesy already<br />

provides several blueprints ranging from legacy to microservers.<br />

Heat dissipation becomes critical if you want or need<br />

to harvest the full potential of existing COM Express Type 7<br />

modules with high-performance processors using up to 65W<br />

TDP. However, there is a wide range of efficient cooling<br />

solutions along with extended temperature range modules that<br />

help conquer this embedded-systems challenge.<br />

A server-like application at the size of the COM Express<br />

Type 7 basic module – even with BMC – becomes possible.<br />



This makes the new PICMG standard a versatile form factor<br />

for leveraging Industry 4.0 applications and bringing<br />

datacenters to the shop floor or empowering innovative high-bandwidth<br />

applications in other fields, such as healthcare or<br />

autonomous driving.<br />

IX. ADVANTAGE NO. 7: POWER<br />

Compared to off-the-shelf servers, COM Express Type 7<br />

powered solutions require much less power. M.2 SSD storage,<br />

low-power CPUs and less heat dissipation are the main reasons<br />

for this consistent low-power approach with server-grade<br />

processors and a thermal profile below 65 W TDP. While<br />

several application scenarios require low power, the ever-rising<br />

price of energy and 24/7 operation of server applications have<br />

an impact on future product pricing as well as sustainability.<br />
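To put the 65 W figure into perspective, a rough back-of-the-envelope calculation (Python) compares annual energy costs for continuous operation. The electricity price of 0.30 per kWh and the 250 W figure for an off-the-shelf server are assumptions for illustration, not figures from this paper.<br />

```python
def annual_energy_cost(avg_power_w, price_per_kwh=0.30, hours_per_year=24 * 365):
    """Annual energy cost for a device drawing avg_power_w continuously."""
    kwh = avg_power_w / 1000.0 * hours_per_year
    return kwh * price_per_kwh

# Assumed comparison: 65 W COM Express Type 7 module vs. a 250 W rack server.
module_cost = annual_energy_cost(65)    # ~170.82 per year at the assumed price
server_cost = annual_energy_cost(250)   # ~657.00 per year
savings = server_cost - module_cost
```

Even under these rough assumptions, each replaced server saves several hundred currency units per year, which compounds quickly across a shop floor running 24/7.<br />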

X. CONCLUSION<br />

COM Express Type 7 is the first true server-on-module<br />

approach, and not to be confused with Type 6. Its<br />

headless design along with the seven advantages highlighted<br />

above makes it the ultimate choice for Small Form Factor<br />

(SFF) applications. Further to this it is cost-efficient, easy to<br />

upgrade as newer COM Express modules are developed and<br />

suitable for a wide range of applications, from commercial to<br />

rugged environments.<br />

ACKNOWLEDGMENT<br />

iesy would like to thank its partners congatec and Kontron for their continued<br />

support in developing customized solutions based on COM Express Type 7<br />

solutions as well as in delivering facts and figures and valuable feedback for<br />

this talk.<br />



PCB Design Problems and Solutions for Embedded Supercomputing<br />

Dr. Andreas Döring<br />

IBM Research Laboratory<br />

Rüschlikon, Switzerland<br />

Rainer Asfalg<br />

Altium Europe GmbH<br />

Global Head of Technical Sales & Support<br />

Munich, Germany<br />

rainer.asfalg@altium.com<br />

Abstract—The DOME microserver achieves a high density for<br />

highly-efficient computing by using commodity components.<br />

Modularity allows adaptation to a specific environment. Water<br />

cooling allows ruggedized packaging. A new IO card is presented<br />

that allows the direct attachment of sensors and actuators.<br />


I. INTRODUCTION<br />

Improvements in performance and energy efficiency in<br />

computing nowadays result only to a small degree from CMOS technology<br />

advances, if at all. Hence, architectural contributions<br />

such as higher degrees of parallelism and the integration of<br />

specialized accelerators are used to meet the growing demands in<br />

supercomputing and high-end embedded systems. Building a<br />

system from commercial components makes use of a<br />

development ecosystem, provides shorter economic upgrade<br />

cycles, and offers a wider heterogeneity. The DOME<br />

microserver was developed in a cooperation of IBM Research<br />

Zurich, IBM Netherlands, and ASTRON as technology<br />

preparation of the exascale computing requirements of the<br />

Square Kilometer Array (SKA) radiotelescope. While the<br />

project focus lay on a large supercomputer installation, the<br />

properties fit the high-end embedded space very well. The main<br />

features are modularity, low volume/high density, water<br />

cooling (allowing closed packaging), cost and energy<br />

efficiency, and system management including temperature and<br />

power supervision.<br />

Important enablers for these characteristics are the<br />

printed circuit boards (PCBs) employed for the modules and the<br />

backplane. While the use of HDI and advanced materials such as<br />

Megtron 4 by Panasonic pushes the boards to the edge of conventional<br />

technology, they can still be produced by several suppliers in<br />

reasonable time.<br />

II. SYSTEM OVERVIEW<br />

A passive backplane provides slots for four different module<br />

types: compute/storage, power conversion, network switch, and<br />

USB hub. Furthermore, the backplane carries the SFP and<br />

SFP+ cages/connectors for the external network. For each slot a<br />

different connector is used: 3M SPD08 for compute and hub<br />

slots, Molex Impact for the switch, and Molex EXTreme<br />

Poweredge for the power converter. While the main rail for<br />

power distribution at 12 V is part of the cooling system, the<br />

backplane connects two regulated power converter slots with 7<br />

different rails for DRAM supply and I/O voltages. Using shared<br />

power converters for the minor regulated supply voltages of<br />

modern SoCs is more cost-, volume-, and energy-efficient than<br />

using dedicated converters on each module. The compute<br />

modules are cooled by a passive copper plate with an integrated<br />

heat pipe. The pitch of 7.6 mm from one compute node to the next<br />

is partitioned into 0.8 mm for backside components, 1.8 mm<br />

PCB thickness, 2 mm heat spreader, and 3 mm for the top-side<br />

components. For higher components, such as BGA packages or<br />

inductors, recesses or cut-outs are milled into the heat spreader.<br />

For the main SoC the package cap is removed such that the<br />

silicon die is directly cooled from its backside.<br />

The connectors for the switch and the compute modules allow<br />

data rates beyond 10 Gbps on a differential pair, which allows<br />

operation of the switch with 64 ports and the modules with up<br />

to 6 ports of 10 Gbit Ethernet (Base10G), in addition to PCI<br />

Express 2.0 and Serial ATA 2. Further modules of the system<br />

support mSATA solid state disks and M.2 form factor flash memory.<br />

A. Compute Modules<br />

Currently, three compute modules have been developed, based<br />

on the NXP QorIQ processor T4240, on the NXP QorIQ<br />

processor LS2088, and on a Xilinx Kintex UltraScale FPGA,<br />

respectively. These cards measure 139 × 62.5 mm and have power<br />

converters for the core voltage and switches for all other supply<br />

rails. They are managed over USB by a Cypress PSoC 5LP<br />

controller. The main features are summarized in Table 1.<br />
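The 7.6 mm node pitch is simply the sum of the four stack-up contributions given for each compute node. The short Python check below verifies that partitioning and the node count it permits; the 400 mm of usable mounting width is an invented example value, not a figure from the paper.<br />

```python
# Per-node stack-up from the text, in millimetres.
STACKUP_MM = {
    "backside components": 0.8,
    "PCB thickness": 1.8,
    "heat spreader": 2.0,
    "top-side components": 3.0,
}

def node_pitch_mm(stackup=STACKUP_MM):
    """The card-to-card pitch is the sum of the stack-up layers."""
    return sum(stackup.values())

def max_nodes(available_width_mm, pitch_mm):
    """How many compute nodes fit side by side in a given width."""
    return int(available_width_mm // pitch_mm)
```

The tight pitch is what makes the density argument work: at 7.6 mm per node, dozens of water-cooled nodes fit in a width where conventional air-cooled boards would fit only a handful.<br />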


Table 1: Compute nodes<br />

Node: NXP T4240 (28 nm bulk, 43 W)<br />
ISA/Logic: PPC64, 24 e6500 cores @ 1.8 GHz<br />
Memory: 24 GB, 3 channels DDR3L, 72 bit ECC<br />
I/O: 4 × 10 GbE, PCIe x8, 2 × SATA, USB, SDHC<br />

Node: NXP LS2088 (28 nm bulk, 35 W)<br />
ISA/Logic: ARMv8, 8 A72 cores @ 2 GHz<br />
Memory: 32 GB, 2 channels DDR4, 72 bit ECC<br />
I/O: 6 × 10 GbE, PCIe x4/x2/x1/x1, 2 × SATA<br />

Node: Xilinx Kintex UltraScale (20 nm FinFET)<br />
ISA/Logic: 726K logic cells, 2760 DSP slices, 38 Mb RAM<br />
Memory: 16 GB, 2 channels DDR4, 72 bit ECC<br />
I/O: 32 transceivers, GbE/SATA/PCIe as needed<br />

T4240 compute module<br />

B. Power Converter<br />

The Power Converter Module generates regulated supply<br />

voltages for DRAM and I/O. The output voltages are<br />

programmable such that several DRAM types and IO standards<br />

can be used in the compute modules. Furthermore, a fixed<br />

voltage rail is needed for internal purposes, but extra current is<br />

fed to the backplane as well. The design is optimized for<br />

maximum power per backplane area at the given width/height.<br />

The main limiters are the passive components (power inductors<br />

and capacitors) and the backplane connector.<br />

C. Ethernet Switch<br />

In order to connect a high number of modules with each other<br />

and the outside, a 64-port Ethernet switch supporting 10 Gbps<br />

and 40 Gbps is integrated in the same form factor (139 mm ×<br />

55 mm) but with a greater depth due to the high-speed, high-pinout<br />

connector and the cooling requirements (195 W TDP).<br />

Two stacked PCBs integrate the Intel FM6364 ASIC, the<br />

connector, the core power converters, clock generators, and<br />

configuration memory. The combination of 128 differential<br />

pairs at 10 Gbps and the high supply currents (120 A on one of<br />

the rails) results in a 3.6 mm thick 28-layer PCB. Since the same<br />

ASIC is used in conventional datacenter switches, compatibility<br />

with many standard network protocols can be achieved. The<br />

management of the switch is implemented in a T4240 compute<br />

module.<br />

III. INDUSTRY INPUT/OUTPUT<br />

The combination of features of the microserver triggered early<br />

interest in its application in high-end embedded systems, for<br />

vehicles or image processing in production, for instance. In fact,<br />

the liquid cooling allows the use of tight enclosures for dusty or<br />

otherwise dirty environments. Furthermore, the microserver has<br />

no moving parts – except the pump for the cooling liquid, but<br />

robust pumps are widespread. In addition, the modular design<br />

allows a customized heterogeneous design tuned for a particular<br />

application. However, the datacenter-oriented design of the<br />

microserver provides 10 Gb Ethernet as its primary interface,<br />

which does not match most industry applications. Therefore, a<br />

modified version of the FPGA module was developed, which<br />

can be equipped with a daughter card. This daughter card<br />

supports the physical interfaces needed in industrial and<br />

embedded applications, including the Internet of Things,<br />

Services and People (Table 2).<br />

Table 2: Interfaces of Industry-IO card<br />

USB 2.0 host: 2<br />
Optocoupler in: 4<br />
Optocoupler out: 4<br />
LVDS: 7 pairs<br />
CAN: 2<br />
Output level shift: 18<br />
Input level shift: 12<br />
Isolated USB: 1<br />
MIPI PHY: 1+1<br />
Serial (RS232, RS485, etc.): 2-4<br />

IV. SUMMARY<br />

The DOME microserver design targets a balance between<br />

commercially available components and production processes<br />

on one side and aggressive density and performance on the<br />

other. As a research project, risk and design cost had to<br />

be considered as well; for a commercial product additional<br />

features could be added to the modules by embedding passive<br />

components (e.g. decoupling capacitors) into the PCB or using<br />

chip-on-board technology for selected components.<br />

QorIQ and Layerscape are trademarks of NXP, PSoC is a<br />

trademark of Cypress Semiconductor. EXTreme Poweredge<br />

and Impact are trademarks of Molex.<br />



Camera Standards for Embedded Vision Systems<br />

Dr. Fritz Dierks<br />

Basler AG<br />

Ahrensburg, Germany<br />

friedrich.dierks@baslerweb.com<br />

Abstract—The market for PC-based machine vision has bred<br />

a vibrant ecosystem of camera, frame grabber, and image<br />

processing library vendors whose products work quite seamlessly<br />

together. The industrial embedded vision market in contrast is<br />

still in need of such an ecosystem which would allow customers to<br />

pick suitable camera modules for their embedded system from<br />

multiple vendors especially for low and medium unit volumes<br />

and combine them equally seamlessly with software libraries of<br />

other parties. This article describes how the co-evolution of<br />

interface standards and component vendors’ portfolios has made<br />

the PC-based camera ecosystem possible and explains where the<br />

difficulties in the embedded vision market come from and how to<br />

overcome them.<br />

Keywords—camera; sensor module; embedded system;<br />

interface standard; standardization; ecosystem<br />

I. INTRODUCTION<br />

Interface standards make products from different vendors<br />

work seamlessly together. Thus by looking at the history of<br />

interface standards one can see how the ecosystem (or food<br />

chain) of a certain market has evolved over time and<br />

understand the underlying market mechanisms. The market for<br />

digital machine vision cameras is approximately 20 years old<br />

and quite mature already while the market for embedded<br />

camera modules for industrial purposes is still in its early<br />

stages and its food chain is not yet fully established.<br />

This article first looks back at how things evolved in the<br />

digital machine vision camera market in order to explain the<br />

underlying market mechanisms. Then it shows how these<br />

mechanisms apply to the embedded camera module market,<br />

sketches how a camera ecosystem for this market could look,<br />

and derives what that would mean for the corresponding<br />

interface standards.<br />

II. THE MACHINE VISION CAMERA MARKET<br />

A. Structure of Camera Interface Standards<br />

For connecting a camera to a PC in a plug&play manner<br />

two layers need to be standardized:<br />

The transport layer deals with enumerating devices,<br />

accessing the camera’s low-level registers, retrieving<br />

stream data from the device, and delivering events. The<br />

transport layer is governed by the hardware interface.<br />


Depending on the interface type, the transport layer for<br />

PC-based systems requires a dedicated frame grabber<br />

(e.g. Camera Link, Camera Link HS, CoaXPress) or a<br />

bus adapter (e.g. IEEE 1394, GigE Vision, USB3<br />

Vision). For embedded systems the typical hardware<br />

interface is MIPI CSI-2 which is normally already<br />

integrated into the embedded processor.<br />

The feature layer describes the functional properties<br />

of a camera such as Exposure Time or Gain as well as<br />

the format of the video, chunk, and event data. Two<br />

major approaches exist for standardization: one can<br />

describe a fixed register layout (e.g. IIDC for machine<br />

vision cameras, or CCS for embedded systems<br />

respectively) or provide a machine readable selfdescription<br />

of the cameras’ features (see the GenICam<br />

approach which works for both worlds).<br />

A transport layer description should be agnostic to camera<br />

functionality while a feature layer description should be valid<br />

for different transport layers in order to maximize reusability.<br />

In most cases, the hardware interfaces on which industrial camera<br />

transport layers are based originate from the consumer market.<br />

Examples are GigE Vision based on Gigabit Ethernet, USB3<br />

Vision based on USB 3.0, or MIPI CSI-2 originating from the<br />

mobile phone market. Every time a new suitable interface is<br />

introduced in the consumer market the industrial camera<br />

market tends to adopt it, which is why new<br />

interface standards constantly have to be created. The following<br />

sections describe some important steps in the evolution of<br />

camera standards.<br />

Benefits of Standards<br />

For customers:<br />
Being assured of “betting on the right horse”<br />
Getting cheaper and better products due to competition in the market<br />

For vendors:<br />
Faster market growth, since standards attract customers<br />
Re-using know-how and IP<br />
Looking strong in the eyes of customers when visibly spending effort on standardization (“handicap principle”)<br />

B. Transport Layer without Standardized Feature Layer<br />

The first digital cameras required dedicated frame grabbers<br />

resulting in a market food chain of camera vendors and frame<br />

grabber vendors (Fig. 1). A typical transport layer standard<br />

from this time is Camera Link [1] which provides video data<br />

transfer via LVDS and camera configuration via a serial port.<br />

Camera Link has no standardized feature layer so camera and<br />

frame grabber vendors have to team up and manually create<br />

configuration files to make the connections work.<br />

The customer has to design his system using the frame<br />

grabber vendor’s proprietary SDK which makes changing<br />

frame grabbers hard to do. Many frame grabber vendors also<br />

developed image processing libraries which were closely tied<br />

to the vendors’ respective SDKs.<br />

Fig. 1<br />
Camera Link – no standardized feature layer<br />
green: camera vendor, blue: frame grabber vendor<br />

C. Fixed Register Layout (IIDC)<br />

When Apple started promoting the IEEE 1394 bus<br />

interface (aka FireWire) [2], it was soon adopted as a transport<br />

layer by the machine vision industry because it allowed<br />

building systems without expensive frame grabbers,<br />

using cheap bus adapter cards instead (Fig. 2 top).<br />

Since the adapter card’s hardware interface was standardized<br />

and supported by Windows as well as Linux, developing one<br />

driver for each operating system was sufficient to support a<br />

wide range of systems. The corresponding drivers were<br />

provided by the camera vendors for free, which changed the<br />

food chain of the market considerably, since the frame grabber<br />

vendors no longer dominated the SDK and thus the<br />

programming interface to the customer.<br />

The frame grabber vendors, however, still had software<br />

libraries to sell, so they created their own drivers to stay in<br />

business (Fig. 2 bottom). This was possible since the feature<br />

layer for IEEE 1394 cameras had been standardized by<br />

defining a fixed register layout called IIDC [3], which was very<br />

much inspired by a design from Sony, who created one of the<br />

first industrial 1394 cameras at that time.<br />

Fig. 2<br />
IEEE 1394 – fixed register layout (IIDC)<br />
green: camera vendor, blue: frame grabber vendor, grey: consumer market<br />

The key problem with a fixed register layout, however, is<br />

that supporting custom features is very tricky, especially when<br />

the customer is using a 3rd-party imaging library. While for the<br />

camera vendor adding a custom feature to their products often<br />

is a good business case, imaging library vendors in many cases<br />

don’t make enough money on their SW licenses to justify<br />

supporting those custom features through their SDKs, which<br />

makes the overall business case difficult to realize.<br />

GenICam Standard<br />
Established in 2006<br />
>180 member companies<br />
Hosted by the European Machine Vision Association (EMVA)<br />
Free membership<br />
Foundation of all modern machine vision camera standards such as GigE Vision, USB3 Vision, CoaXPress, Camera Link HS, Camera Link 3.0<br />
Maintains an open source reference implementation<br />
Two international meetings per year, taking place in turn in Asia, Europe, and America<br />

D. Self-Describing Camera Features (GenICam)<br />

When Intel announced it would add Gigabit Ethernet support to<br />

their chipsets, the industry teamed up in order to standardize<br />

this new interface for machine vision cameras. The transport<br />

layer was named GigE Vision [4] and defined on top of UDP<br />

packets. While it was quite easy to agree on the transport layer<br />

standard it was next to impossible to agree on a fixed register<br />

layout though the group tried for over one year: too many<br />

companies had developed their own proprietary custom<br />

features they wanted to see in the standard and discussing all<br />

that with full implementation details simply took too long. In<br />

addition the companies forming the standard committee were<br />

of similar strength so no company was strong enough to<br />

impose its solution on the others.<br />

In order to overcome these problems the GenICam<br />

standard [5] was created as unified feature layer standard for<br />

GigE Vision and all future transport layers in the machine<br />

vision industry (Fig. 3). The key idea is to describe the features<br />

of a camera in an abstract way and provide an XML feature<br />

description file defining how to map the features to the<br />

camera’s control registers. In GenICam each feature is defined<br />

by a Name, a Type, and a Meaning. The feature describing the<br />

amplification of a camera’s video signal (the meaning) is for<br />

example named ‘Gain’ and is of type Float. Associated with each<br />

type is a software interface that allows querying the<br />

implementation details of the features, such as minimum,<br />

maximum etc. The most important types supported are integer,<br />

float, enumeration, string, command, and bool. GenICam<br />

exposes the features of a camera through a feature tree so the<br />

camera is fully self-describing.<br />

Fig. 3<br />

GigE Vision – self describing camera (GenICam)<br />

green: camera vendor, grey: consumer market<br />

The GenICam standard has several modules: The GenApi<br />

module defines the syntax of the XML description language<br />

while the SFNC (= standard feature naming convention)<br />

defines a list of over 600 camera features each by Name, Type,<br />

and Meaning. This architecture overcomes the problems of the<br />

fixed register layout standards.<br />


Since the standard does not define the implementation<br />

details of each feature, agreeing on the feature layer<br />

became quite easy. In addition this scheme opened up<br />

room for competition based on different quality of<br />

implementation.<br />

The XML description language is the same for<br />

standard and custom features; the difference is just a<br />

flag. This allowed the camera vendors to deliver any<br />

kind of custom features to their customers even if these<br />

were using 3rd-party image processing libraries. In<br />

addition it made growing the standard feature list easy:<br />

typically a vendor would implement something new as<br />

custom feature and then submit it for standardization<br />

having already provided a proof of concept.<br />
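The Name/Type/Meaning idea can be illustrated with a toy description file. The sketch below (Python) parses two features the way a GenICam consumer conceptually does; note that the XML schema here is heavily simplified and invented for illustration only — real GenApi description files are far richer, with register mappings, converters, and inter-node dependencies.<br />

```python
import xml.etree.ElementTree as ET

# Invented, heavily simplified description file -- NOT the real GenApi schema.
FEATURE_XML = """
<RegisterDescription>
  <Float Name="Gain">
    <Min>0.0</Min>
    <Max>24.0</Max>
    <Address>0x0814</Address>
  </Float>
  <Enumeration Name="PixelFormat">
    <EnumEntry Name="Mono8"/>
    <EnumEntry Name="RGB8"/>
    <Address>0x0820</Address>
  </Enumeration>
</RegisterDescription>
"""

def load_features(xml_text):
    """Build a Name -> {type, limits, register address} map from the XML."""
    root = ET.fromstring(xml_text)
    features = {}
    for node in root:
        info = {"type": node.tag, "address": node.findtext("Address")}
        if node.tag == "Float":
            info["min"] = float(node.findtext("Min"))
            info["max"] = float(node.findtext("Max"))
        elif node.tag == "Enumeration":
            info["entries"] = [e.get("Name") for e in node.findall("EnumEntry")]
        features[node.get("Name")] = info
    return features
```

A consumer only ever deals with the abstract feature (‘Gain’, Float, 0.0–24.0); the register address is an implementation detail the description file hides, which is exactly what makes custom features transparent to 3rd-party software.<br />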

The on-the-fly interpretation of the XML file is done using<br />

an open source reference implementation which is maintained<br />

by the standard committee and available for many operating<br />

systems including Linux on ARM which makes it applicable<br />

also for embedded systems. The code has been optimized for<br />

performance and quality and is used as the core of most SDKs<br />

in the industry. As a result different cameras behave very<br />

consistently in different software environments since camera<br />

vendors can use the reference implementation for testing their<br />

products.<br />

It turned out that vendors kept their products’ proprietary<br />

SDKs and used GenICam under-the-hood only. This is partly<br />

for historical reasons and partly because the SDK is an<br />

important tool for differentiation in the market. A corollary<br />

from that observation is that having a standardized SDK is not<br />

crucial for the forming of an ecosystem. This is an important<br />

lesson for the embedded systems market.<br />

GigE Vision and GenICam were a huge success so it was<br />

decided to build all new transport layers on top of GenICam in<br />

order to re-use as much existing IP as possible. Newer<br />

GenICam based standards are USB3 Vision [6], CoaXPress<br />

[7] and Camera Link HS [8]. Even existing standards like<br />

IEEE 1394 [2] defined with a fixed register layout are covered<br />

by GenICam through creating a corresponding XML file, and<br />

also Camera Link [1] is about to create a new release, v3.0,<br />

which will make GenApi support mandatory, finally making it<br />

support plug&play.<br />

E. Standardized Transport Layer API (GenTL)<br />

It turned out that standardizing the feature layer was not<br />

enough in the long run. For image processing library vendors it<br />

became more and more unattractive to create transport layer<br />

drivers and frame grabbers for the different interface<br />

technologies. Therefore a transport layer API was standardized<br />

within the GenICam standard named GenTL (Fig. 4). It<br />

defines an abstract C interface which allows using transport<br />

layer drivers from different vendors in a generic way.<br />

CoaXPress was the first transport layer standard which made<br />

the support for GenTL mandatory and several large image<br />

processing library vendors who have no history in building<br />

frame grabbers rely now on GenTL to access cameras which<br />

made most camera vendors support it. As a result a customer<br />

can now freely combine cameras, frame grabbers, drivers, and<br />

image processing libraries of different vendors even in mixed<br />

interface systems.<br />
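As a small illustration of how a GenTL consumer gets started, the sketch below (Python) performs only the discovery step: GenTL producers are shipped as shared libraries with a .cti suffix and are located via the GENICAM_GENTL64_PATH environment variable. Actually loading a producer and calling its abstract C interface (e.g. via ctypes) is omitted here.<br />

```python
import os

def find_gentl_producers(search_path=None):
    """Scan the GenTL producer search path for *.cti libraries.

    GenTL producers are shared libraries with a .cti suffix; consumers
    locate them via the GENICAM_GENTL64_PATH environment variable,
    whose entries are separated like PATH.
    """
    if search_path is None:
        search_path = os.environ.get("GENICAM_GENTL64_PATH", "")
    producers = []
    for directory in search_path.split(os.pathsep):
        if not directory or not os.path.isdir(directory):
            continue
        for name in sorted(os.listdir(directory)):
            if name.lower().endswith(".cti"):
                producers.append(os.path.join(directory, name))
    return producers
```

This discovery convention is what lets an image processing library pick up transport layer drivers from any vendor without being compiled against them.<br />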

Fig. 4<br />

CoaXPress – standardized transport layer API (GenTL)<br />

green: camera vendor, blue: frame grabber vendor, grey: image<br />

processing library vendor<br />

The development sketched in this section – the forming of an ecosystem and the creation of multiple GenICam based standards – is one of the key reasons why the machine vision market is highly competitive and offers so many different camera products. This development is very much in the interest of customers, but also of vendors, since it has driven the considerable growth of the market. The next section will use these insights to analyze the situation in the embedded camera module market.<br />

III.<br />

EMBEDDED CAMERA MODULE MARKET<br />

In recent years a variety of embedded processors has become available whose computing power now makes them very attractive for vision applications. For projects with very large unit volumes, such as mobile phones or consumer devices, it is possible to get camera modules with Android support quite easily. However, when it comes to industrial applications, which tend to have only low or medium unit volumes and often require Linux support, things are different: due to the current market structure and missing interface standards, only few offerings exist, and in many cases they are restricted to certain sensors or processor types.<br />

A. Technical Peculiarities of Embedded Vision Systems<br />

Currently the physical interface of choice for embedded<br />

camera modules is the MIPI CSI-2 interface [9] using the D-PHY. It has most traits of a transport layer interface, meaning that it can transfer video data and allows configuration of the camera module via I2C, without dealing with camera features such as gain. The MIPI CSI-2 interface was originally<br />

designed to connect raw sensors directly to processors inside<br />

mobile phones but it can also be used to connect camera<br />

modules to a processor provided the customer can live with the<br />

short cable length of around 30 cm dictated by the D-PHY. If not, there are technologies which allow extending the cable<br />

length to 15 m, e.g. by bridging the CSI-2 interface over coax<br />

cable using technologies such as GMSL [10], FPD-Link [11],<br />

or V-by-One HS [12].<br />

Before the raw image from a sensor is delivered to the<br />

customer it must be pre-processed, e.g. de-bayered, de-noised,<br />

and defect pixel corrected. This is done in the ISP (image<br />

signal processor) which in most machine vision cameras is<br />

implemented using an FPGA inside the camera. However, since low-cost FPGAs do not support the electrical interface required by CSI-2/D-PHY, this is not an option for embedded camera modules. The two basic alternatives are<br />

using a separate ISP on the camera module or using the ISP<br />

many embedded processors have already built in.<br />
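To make the pre-processing step concrete, the following sketch shows a deliberately simple nearest-neighbour demosaic of an RGGB Bayer pattern. Real ISPs use far more sophisticated interpolation, plus de-noising and defect-pixel correction; this is only an illustration of what "de-bayering" means:<br />

```python
def debayer_rggb_nearest(raw, width, height):
    """Nearest-neighbour demosaic of an RGGB Bayer pattern.

    raw: flat list of pixel values, row-major, len == width * height.
    Returns a flat list of (r, g, b) tuples. Width and height are
    assumed to be even.
    """
    def px(x, y):
        return raw[y * width + x]

    out = []
    for y in range(height):
        for x in range(width):
            # Snap to the top-left corner of the containing 2x2 RGGB cell.
            cx, cy = x - (x % 2), y - (y % 2)
            r = px(cx, cy)
            g = px(cx + 1, cy)      # one of the two green samples
            b = px(cx + 1, cy + 1)
            out.append((r, g, b))
    return out
```
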

B. Using an External ISP (Pass-Through Mode)<br />

If the ISP resides in the camera module the CSI-2 video<br />

port of the embedded processor runs in so called pass-through<br />

mode meaning any internal ISP is bypassed. The external ISP<br />

can be implemented in different manners (Fig. 5):<br />

- Some sensors have an ISP on-chip, which typically has limited flexibility and performance. Since the customer does not want to deal with the raw register interface of the ISP and do the necessary configuration/calibration himself, a camera firmware is required which needs to reside on the processor.<br />
- Some sensor vendors provide companion chips for their sensors, which tend to be more powerful but of course do not come for free. The firmware topic is the same as with on-chip ISPs.<br />
- There are dedicated ISP chips which can be programmed to behave like a stand-alone camera module. In this case the camera firmware can reside on the camera module, but a driver is still required on the processor side to allow customers access to the camera.<br />

Since the programming interface of the CSI-2 port is different for every processor family, a special driver currently needs to be implemented for every embedded processor to be used with a camera module. This is one of the difficulties preventing the formation of an ecosystem for embedded vision, since it adds a lot of cost on the camera maker's side if they want to support the multitude of embedded processor types available in the market. Compare the situation to PC-based systems: network interface cards and bus adapter cards have a standardized bus interface, so implementing one driver for Windows and one for Linux is sufficient to cover the majority of the market, which helped a lot in creating the existing ecosystem.<br />

Fig. 5<br />

Pass-through mode<br />

blue: processor vendor, grey: other parties<br />

Another problem with external ISPs is cost and/or quality: sensors with an on-chip ISP tend to implement only simple ISP algorithms, and dedicated ISP chips cost money, which is economically feasible only for high-end and thus expensive sensors.<br />

C. Using the Internal ISP<br />

From the customer's point of view the optimum solution would be to pay only for the raw sensor in the camera module and use the ISP inside the processor, which is typically much more powerful than a reasonably priced external ISP (Fig. 6). The challenge here is to get access to the ISP's programming interface, which tends to be very complex and is mostly designed with high unit volume use cases in mind, where developers from the processor and the sensor vendor meet and co-design the vision system. This is an even larger obstacle for creating an ecosystem, since the processor vendors are very reluctant to allow access to their ISPs: they (rightfully) fear that the amount of support required from their side is not justified by the revenue to be expected from industrial applications.<br />

Fig. 6<br />

Using the internal ISP<br />

blue: processor vendor, grey: other parties<br />



D. Existing SDK Standards<br />

In the embedded world, several standardization attempts exist with respect to cameras. A first group of standards concentrates on the SDK the customer uses to access the camera module.<br />

- For Android, the Camera HAL3 [13] provides a very sophisticated API which contains many features that are also useful for industrial applications. The interface, however, is focused on Android, and most industrial applications today require Linux support.<br />
- The standard video interface for Linux is GStreamer [14], based on the low-level interface Video4Linux [15]. GStreamer focuses on supplying video streams and lacks the support and flexibility required for more advanced industrial imaging use cases.<br />
- For quite a while the Khronos group attempted to create a camera access standard dubbed OpenKCam [16]. However, during an Embedded Vision Alliance meeting in Dec 2017 it was reported that the initiative has not yet found enough contributors.<br />
- Last but not least, there are some proprietary SDKs provided by embedded processor vendors, such as LibArgus from NVIDIA [17], which can be seen as a de-facto standard for that particular processor family.<br />

None of these standards, however, supports a food chain in which camera module vendors can act independently from the embedded processor vendors, who control the proprietary transport layer API and the internal ISP. As the example of the machine vision camera market shows, a standardized SDK for customers is not even necessary for the creation of an ecosystem; the important interface to standardize is the one between the domain of the camera vendor and that of the processor vendor, which is the transport layer.<br />

E. Fixed Register Layout Revisited (CCS)<br />

A quite recent development is the release of the MIPI CCS<br />

(camera command set) standard [18] which provides a fixed<br />

register layout for MIPI CSI-2 sensors like the IIDC standard<br />

provided for IEEE 1394 cameras (see section II.C). The idea is<br />

that one generic driver would be sufficient for a multitude of<br />

sensors and that this driver could be supplied by the processor<br />

vendors which control the interface and on-board ISP. This<br />

would take away much of the value currently generated by<br />

camera makers and reduce the ecosystem players to mostly the<br />

sensor and processor vendors (Fig. 7).<br />

Fig. 7<br />

CCS – fixed register layout<br />

purple: sensor vendor, blue: processor vendor<br />

The problem with this approach is its inflexibility when it comes to custom extensions. Industrial applications tend to have much more complex use cases than consumer applications like mobile phones, so the feature set defined in CCS will need custom extensions. Any extension, however, would have to be created by the processor vendors governing the generic CCS driver, which would result in the same situation as today: customizations are only accepted for very high unit volumes. The CCS standard will therefore most likely not help in creating a vibrant ecosystem for the industrial embedded market. Nevertheless, CCS is a good starting point for new sensors, since it allows a lot of the base functionality to be set up with standard IP on the driver side, and it can be supported and extended by GenICam (see next section).<br />
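The appeal of a fixed register layout is that a single generic driver can probe and configure any compliant sensor. The sketch below illustrates the principle with hypothetical register offsets; the actual CCS register map is defined in the MIPI specification [18]:<br />

```python
# Hypothetical register offsets for illustration only; the real layout
# is defined by the MIPI CCS specification.
REG_MODEL_ID = 0x0000
REG_FRAME_RATE = 0x0100

class FixedLayoutSensor:
    """Simulated sensor exposing a fixed register map over an I2C-like bus."""

    def __init__(self, model_id):
        self.regs = {REG_MODEL_ID: model_id, REG_FRAME_RATE: 30}

    def read(self, addr):
        return self.regs[addr]

    def write(self, addr, value):
        self.regs[addr] = value

def generic_probe(sensor):
    """One generic driver works for every sensor honouring the layout."""
    return {"model": sensor.read(REG_MODEL_ID),
            "fps": sensor.read(REG_FRAME_RATE)}
```
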

F. Using GenICam in Pass-Through Mode<br />

This section explains how GenICam can be applied to camera modules that come with an external ISP, or a sensor with integrated ISP, using the processor in pass-through mode. The key challenge is to overcome the lack of a standard driver for the CSI-2 ports on the different systems. This can be solved by implementing a GenTL interface adapter for each processor family, which can then be used as the connection point for the camera SDK driver (Fig. 8). The GenICam reference<br />

implementation contains a GenTL producer framework which<br />

makes it quite easy to provide such an adapter which would be<br />

open source and could even be made part of the processor’s<br />

board support package. The XML file describing the camera’s<br />

features would be installed together with the camera firmware<br />

on the processor’s file system.<br />

Fig. 8<br />

Using GenICam in pass-through mode<br />

green: camera vendor, blue: processor vendor, grey: open source<br />

With this scheme, camera vendors would be able to provide their products independently of the processor vendors, which would most probably start the desired ecosystem. To get this going, a combined effort of the camera and processor makers would be required to create the necessary GenTL adapters for the most important processor families.<br />

G. Using GenICam in Combination with the Internal ISP<br />

Starting an ecosystem for this case is far more difficult than for the pass-through case, since it requires an open and standardized API for the internal ISP. This is not an easy task due to the complexity of the ISP's functionality and its many parameters for tuning the image quality. Currently, a semi-open ecosystem seems to be the best achievable outcome, where each processor vendor teams up with a small number of camera module vendors and grants them access to its ISP. This at least relieves the processor vendors of the burden of dealing with the industrial imaging market, with which they are normally not familiar.<br />



In order to keep the look and feel of the camera interfaces similar to the external ISP case, GenApi can also be used here to expose the camera's feature layer (Fig. 9). The sensor driver just needs to implement a layer of pseudo registers described by an XML file, so from the user's point of view there is no difference to camera modules based on the pass-through mode.<br />
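The pseudo-register idea can be sketched as follows: register reads and writes arriving from GenApi are dispatched to the processor vendor's ISP entry points. The address and the ISP callbacks below are hypothetical; in practice the mapping is published to the SDK through the camera's GenICam XML file:<br />

```python
class IspPseudoRegisters:
    """Pseudo-register space mapping register accesses to ISP API calls.

    GAIN_ADDR is a made-up address for illustration; the real addresses
    would be whatever the camera's XML file declares.
    """

    GAIN_ADDR = 0x1000

    def __init__(self, isp_set_gain, isp_get_gain):
        # Callbacks standing in for the processor vendor's ISP entry points.
        self._set_gain = isp_set_gain
        self._get_gain = isp_get_gain

    def write(self, addr, value):
        if addr == self.GAIN_ADDR:
            self._set_gain(value)
        else:
            raise ValueError(f"unmapped pseudo register 0x{addr:04x}")

    def read(self, addr):
        if addr == self.GAIN_ADDR:
            return self._get_gain()
        raise ValueError(f"unmapped pseudo register 0x{addr:04x}")
```

A GenApi "Gain" feature pointing at `GAIN_ADDR` would then behave exactly like a hardware register, even though the value actually lives inside the processor's ISP.<br />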

Fig. 9<br />

Using GenICam in combination with the internal ISP<br />

green: camera vendor, blue: processor vendor<br />

IV.<br />

CONCLUSION<br />

The market for industrial embedded vision systems would benefit from an ecosystem of independent camera vendors like the one the much more mature PC-based machine vision camera market has developed over the last 20 years, which has resulted in a vibrant market of cameras even for projects with low to medium unit volumes. What is missing to get this going is an interface standard that separates the product arena of the camera vendor from that of the processor vendor. The existing standards either do not address this interface, since they focus on the programming interface to the customer, or they are not flexible enough, since they are based on a fixed register layout. Using the well-established GenICam standard for the industrial embedded vision market would solve that problem and most probably trigger an ecosystem in this emerging market as well.<br />

REFERENCES<br />

[1] Camera Link standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[2] IEEE 1394 standard<br />

http://www.1394ta.org/developers/specifications/StandardsOrientation<br />

V5.0.pdf<br />

[3] IIDC 1394-based digital camera specification<br />

http://1394ta.org/wp-content/uploads/2015/07/2003017.pdf<br />

[4] GigE Vision standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[5] GenICam standard.<br />

http://www.genicam.org<br />

[6] USB3 Vision standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[7] CoaXPress standard.<br />

http://www.coaxpress.com/<br />

[8] Camera Link HS standard.<br />

https://www.visiononline.org/vision-standards.cfm<br />

[9] MIPI CSI-2 standard.<br />

https://www.mipi.org/specifications/csi-2<br />

[10] GMSL MIPI CSI-2 bridge over coax<br />

https://www.maximintegrated.com/en/products/interface/high-speedsignaling/gmsl.html<br />

[11] FPD-Link<br />

http://www.ti.com/lsds/ti_de/interface/fpd-link/camera-serdesoverview.page<br />

[12] V-by-One HS<br />

http://www.thine.co.jp/en/products/pr_details/V-by-OneHS.html<br />

[13] Android Camera HAL3<br />

https://source.android.com/devices/camera/camera3<br />

[14] GStreamer multimedia framework<br />

https://gstreamer.freedesktop.org/<br />

[15] Video4Linux framework<br />

https://www.linuxtv.org/<br />

[16] OpenKCam call for participation<br />

https://www.khronos.org/openkcam<br />

[17] NVIDIA LibArgus camera API<br />

https://developer.nvidia.com/embedded/jetpack<br />

[18] MIPI CCS standard<br />

https://www.mipi.org/specifications/camera-command-set<br />



High-Resolution Multi-Camera Methodology for<br />

Autonomous Vision System Solution Development<br />

Michaël Uyttersprot<br />

Avnet Silica<br />

Merelbeke, Belgium<br />

michael.uyttersprot@avnet.eu<br />

Mario Bergeron, Luc Langlois<br />

Avnet<br />

Quebec, Canada<br />

mario.bergeron@avnet.com<br />

luc.langlois@avnet.com<br />

Abstract— Recent advances in embedded vision have evolved<br />

from passive video capturing devices to fully autonomous vision<br />

systems. Self-driving cars, drones, and autonomous guided robots require real-time parallel processing, low latency, and in some cases low power consumption. Multiple camera modules provide<br />

surround view and sensor fusion improves the overall vision<br />

system, while artificial intelligence and machine learning herald<br />

tremendous improvements for recognition and learning tasks in<br />

autonomous vision systems. This paper describes a multi-camera<br />

development platform for autonomous vision systems supporting<br />

six camera modules with up to 4K UHD resolution. The core of<br />

the solution is a Xilinx Zynq UltraScale+ MPSoC combining a<br />

64-bit processing system and programmable logic. By leveraging<br />

the processing system’s quad-core ARM Cortex-A53 to run<br />

traditional software tasks coupled with hardware-accelerated<br />

functions executing in programmable logic, system designers can<br />

achieve performance gains orders of magnitude higher than<br />

traditional software-based computer vision systems. A design<br />

methodology based on the Xilinx reVISION stack is presented,<br />

with hardware-accelerated OpenCV algorithms commonly used<br />

in ADAS and other autonomous vision systems.<br />

Keywords— High-Resolution Multi-Camera, Autonomous<br />

Vision System, Embedded Vision, SoC, FPGA<br />

I. INTRODUCTION<br />

Building autonomous vision systems grows increasingly<br />

challenging, requiring multidisciplinary expertise in optics,<br />

image sensors, computer vision and deep learning. Selecting<br />

the right development platform and design methodology is<br />

crucial for a successful implementation. A multi-camera<br />

system is part of an embedded vision design and several design<br />

recommendations need to be taken into account.<br />

The amount of data involved in a multi-camera approach can be enormous. This is especially the case with high-resolution, real-time video, and may require parallel processing<br />

or dedicated vision processing devices. Autonomous vision<br />

applications are real-world systems with continuously changing<br />

conditions for light, motion, or orientation, which creates<br />

uncontrolled and changeable variables in the system. Relying<br />

on simulations only will not work and real-world experiments<br />

are necessary, but can be very time consuming. Appropriate<br />

computer vision and machine learning algorithms are required<br />

to manipulate and analyze the video data and deal with real-world conditions and unexpected circumstances.<br />

It is clear that fine-tuning between hardware, software,<br />

computer vision and machine learning, under real-world<br />

conditions, is a difficult task. It is important for a developer to<br />

use the right tools to reduce development time and risk. In<br />

order to ease this task, this paper proposes a full solution, close<br />

to an end application, for multi-camera designs. The solution<br />

includes the design methodology with hardware, software<br />

environment, drivers, and computer vision and machine<br />

learning.<br />

In the following sections, we first discuss multi-camera applications and system requirements (section II), then detail the different hardware platform building blocks (section III), continue with the design methodology and example details (section IV), and finish with a conclusion (section V).<br />

II.<br />

MULTI-CAMERA APPLICATIONS AND REQUIREMENTS<br />

Cameras are used in a wide range of applications today.<br />

New applications demand higher resolution image sensors at<br />

faster frame rates, enabling sharper images and better object<br />

detection at longer distances. Faster frame rates reduce latency<br />

and improve reaction time of an autonomous system at the<br />

expense of higher-performance signal processing combined<br />

with more complex computer vision and machine learning<br />

algorithms. Sensor fusion integrates various sensor<br />

technologies such as image sensors paired with a thermal<br />

sensor, LIDAR or radar. The goal of sensor fusion is to (a)<br />

improve application or system performance, (b) correct<br />

deficiencies of the individual sensors, (c) or provide better<br />

accuracy such as for position and orientation. Additionally, for<br />

time-critical reaction such as obstacle avoidance, various<br />

sensor sources must be synchronized and processed with low<br />

latency.<br />

Machine learning, and in particular deep learning, has<br />

enabled rapid progress towards recognition and classification<br />

tasks in autonomous systems. For most complex recognition tasks, deep learning is more efficient, and in many cases more accurate, than traditional computer vision, but it requires significantly more computing power. The<br />

implementation of deep learning inference, which is the<br />

deployment of a pre-trained deep learning neural network,<br />

requires GPUs or FPGAs. FPGAs have the advantage of high<br />

flexibility and low power consumption. Companies like DeePhi<br />

Tech recognize a very large market value for deep learning and<br />

provide FPGA deep learning platforms for autonomous<br />

systems [1]. Studies prove that even binarized neural network<br />

inference can run efficiently on an FPGA for very fast<br />

classification of objects, with high accuracy and low power<br />

consumption [2]. FPGAs can run small, compressed machine<br />

learning accelerators, and can be dynamically reconfigured, to<br />

adapt the accelerator on the fly for the required acceleration<br />

task or for any deep learning topology [3]. Typically, computer<br />

vision and machine learning are combined as a hybrid model;<br />

computer vision for initial detection, deep learning for<br />

verification, classification and recognition tasks. In some cases,<br />

autonomous systems even run multiple deep learning networks,<br />

each with their own specific tasks.<br />
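The hybrid model described above can be sketched in a few lines: a cheap computer vision stage proposes candidate regions, and a (here mocked) learned classifier verifies each one. Both stages and their thresholds are toy stand-ins, not a real detector or network:<br />

```python
def detect_regions(frame):
    """Toy computer vision stage: keep regions whose mean brightness
    exceeds a threshold (a stand-in for a real detector)."""
    return [r for r in frame if sum(r["pixels"]) / len(r["pixels"]) > 100]

def classify(region):
    """Toy deep learning stage: a stand-in for a neural network classifier."""
    return "pedestrian" if max(region["pixels"]) > 200 else "background"

def hybrid_pipeline(frame):
    """Computer vision for initial detection, a classifier for verification."""
    return [(r["id"], classify(r)) for r in detect_regions(frame)]
```
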

The multi-camera development platform can target<br />

different applications including:<br />

- Automotive and advanced driver assistance systems (ADAS) – cars have multiple image cameras, radar and LIDAR for situational awareness with a 360° field of view, and to support safety outside and inside the car. The number of high-resolution cameras will increase to meet the impending challenges of fully autonomous driving. Camera resolution will increase from 1–2 Mp today up to 8 Mp in the longer term, and the frame rate will increase from 10–30 fps today to 60 fps in the future [4]. ADAS include functionality for lane departure warning, traffic sign recognition, park assist, pedestrian detection, adaptive cruise control, passenger monitoring with drowsy driver detection, and blind spot detection.<br />
- Unmanned aerial vehicles (UAVs) and drones – UAVs and drones are equipped with multiple cameras to provide a flight view and to perform analytics of the surrounding area. UAVs are used to inspect agricultural fields, power lines, wind turbines or buildings. Additional applications are security and surveillance for police or fire brigades, search and rescue, and professional photography.<br />
- Autonomous guided robots – autonomous robots deploy various sensors with multiple cameras, with strong coordination of sensing, motion and decision-making for fast reaction to environmental situations.<br />
- Virtual and augmented reality (VR/AR) – multiple cameras capture 360° video and the individual video streams are stitched together to create a virtual or augmented environment.<br />

Key requirements of the different components of the<br />

solution include:<br />

- Reliability – reliability and safety are a top priority for autonomous systems, and functional safety is critical to avoid hazardous situations<br />
- Real-time execution – fast reaction time is required, with low latency in hardware and software<br />
- Flexibility – hardware and software need to be highly flexible and reconfigurable to be future-proof and to meet different configurations for multi-camera designs<br />
- Power consumption – power consumption must be minimized, especially for battery-powered applications like drones<br />
- Computer vision and machine learning – extensive capabilities to implement computer vision and machine learning with deep learning inference are crucial for object detection, recognition, verification and classification<br />
- Automotive qualified – this paper proposes a solution valid for autonomous systems including automotive applications, meaning that all components need to be available in automotive grade with long lifetime availability.<br />

III.<br />

MULTI-CAMERA PLATFORM<br />

The high-resolution multi-camera solution consists of four core hardware building blocks: (A) camera modules, (B) a multi-sensor FMC module, (C) a multi-processor SoC or SOM, and (D) a carrier board.<br />

Fig. 1. Overall system with camera modules, multi-sensor FMC module,<br />

multi-processor SoC or SOM, and a carrier board<br />

A. Camera modules<br />

An image pipeline starts at the camera module combining<br />

image sensor, lens, control electronics, and an interface. Poor<br />

lens quality, or mismatch in image sensor specifications, will<br />

affect the whole image pipeline, and cannot always be<br />

improved or recovered by the processing system. This paper<br />

describes modules combining an image camera and a serializer from<br />

MARS, a modular automotive reference system developed by<br />

ON Semiconductor [5]. With the modular approach, developers<br />

can build different combinations of image sensors, coprocessors<br />

or ISPs, and communication standards. The<br />

modules are component boards with consistent signal/power interconnect definitions that enable swapping of individual boards,<br />

creating a wide range of options for experimenting, while<br />

eliminating the need for constructing custom boards. The result<br />



is a highly flexible solution where the various modules with<br />

different image sensors and lenses, and different field of view<br />

(FOV), are interchangeable. The modules are miniaturized with<br />

a form factor of 25mm by 25mm and can be mounted on a<br />

vehicle for real-world testing. MARS camera modules are<br />

equipped with M12 lenses with different FOV options. The<br />

camera module supports high bandwidth and image resolution,<br />

and data can be transferred over several meters to the multi-processor SoC with the serializer. The serializer is part of a serializer/deserializer (SerDes) implementation and will be described in detail below with the multi-sensor FMC module.<br />

Both the MARS camera modules and the serializers are ideal<br />

components for the high-resolution multi-camera development.<br />

Fig. 2. ON Semiconductor MARS camera module with serializer board<br />

B. Multi-sensor FMC module<br />

The multi-sensor FMC module is an interconnection board<br />

between the camera modules and the host carrier board. It is<br />

not a stand-alone module, but rather a plug-in module designed<br />

to interface with FMC-compatible carrier boards. The FMC<br />

module was developed by Avnet [6] to support multiple<br />

cameras for Xilinx Zynq UltraScale+ embedded vision<br />

applications in automotive ADAS, augmented reality, and<br />

UAV or drones. The board has an FMC connector and 6<br />

FAKRA connectors supporting up to 4 (four) 2Mpixel and 2<br />

(two) 8Mpixel camera modules through low cost 50Ω coax<br />

cables. The FMC module uses the Maxim Integrated quad-channel GMSL deserializer for the 2 Mpixel cameras, and a dual GMSL2 deserializer for the 8 Mpixel camera modules [7]. Communication with the Xilinx Zynq UltraScale+ is<br />

delivered by MIPI CSI-2. The FMC-LPC connector<br />

specification, MIPI CSI-2 interface, and SerDes are explained<br />

below.<br />

Fig. 3. Multi-sensor FMC module<br />

FMC-LPC. The FPGA mezzanine card (FMC) is an ANSI standard daughter card for carrier boards containing an FPGA; it provides a standard form factor, connectors, and a modular interface to the FPGA located on the carrier board. FMC supports data throughput up to 10 Gb/s for<br />

individual signaling and 40 Gb/s overall bandwidth between<br />

mezzanine and carrier card [8]. The multi-sensor FMC module<br />

has a Low-Pin Count (LPC) connector with 160 pins.<br />

MIPI CSI-2. MIPI CSI-2 is a camera serial interface (CSI)<br />

and is a specification of the mobile industry processor interface<br />

(MIPI). It defines the interface between cameras and host<br />

processors. It is scalable, robust, low-power, high-speed, cost-effective, and has low electromagnetic interference (EMI). It is<br />

a widely used camera interface for single- or multi-camera<br />

implementations. It is typically used for high performance<br />

video and still image applications, including interconnection<br />

between 4K resolution cameras, head-mounted virtual reality<br />

devices, automotive applications, and UAV or drones. MIPI<br />

CSI-2 is a lane-scalable specification and the data stream is<br />

distributed between the lanes. Applications that require extra<br />

bandwidth beyond that provided by one data lane, or those<br />

trying to avoid high clock rates, can expand the data path to<br />

two, three, or four lanes, and obtain approximately linear<br />

increases in the peak bus bandwidth.<br />
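The lane-scaling rule of thumb can be put into numbers. The helpers below assume an illustrative 2.5 Gbps per D-PHY lane (actual rates depend on the D-PHY version and clocking) and ignore protocol overhead and blanking:<br />

```python
def csi2_peak_gbps(lanes, gbps_per_lane=2.5):
    """Approximate peak CSI-2 bandwidth: scales roughly linearly with lanes.

    The 2.5 Gbps/lane default is illustrative only.
    """
    return lanes * gbps_per_lane

def required_gbps(width, height, fps, bits_per_pixel):
    """Raw video payload rate in Gbps, ignoring overhead and blanking."""
    return width * height * fps * bits_per_pixel / 1e9
```

Under these assumptions a 1080p60 stream of 12-bit raw pixels needs about 1.5 Gbps and would fit comfortably even on a two-lane link, while higher resolutions or frame rates push designs toward four lanes.<br />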

Serializer and deserializer. SerDes is used for high-speed<br />

data transmission over extended distances. The serializer takes multiple data line inputs, condenses them into a smaller number of outputs at a higher data rate, and transmits the condensed data over a cable. The deserializer captures the serialized data and outputs the recovered original data, usually to a host processor. SerDes reduces the cost of connectors and cables, reduces noise and EMI, and delivers high-speed data<br />

transmission over long distances. Gigabit multimedia serial<br />

link SerDes (GMSL) are Maxim Integrated proprietary<br />

transceivers and provide a compression-free alternative to<br />

Ethernet with a 10x increase in data rates, 50 percent lower<br />

cabling costs, and better EMC compared to Ethernet. GMSL<br />

chipsets can drive 15 meters of coax or shielded twisted pair<br />

(STP) cabling, with margin required for robust and versatile<br />

designs. Spread-spectrum capability is built into each serializer<br />

and deserializer to improve the EMI performance of the link,<br />

without the need for an external spread-spectrum clock.<br />

GMSL2 is an improved version of GMSL supporting 6–12 Gbps serial data rates, multi-streaming functionality, and<br />

advanced diagnostics (like polling of remote registers to ensure<br />

link/system integrity). GMSL and GMSL2 are ideal for data<br />

transmission for megapixel multi-camera systems.<br />
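A quick link-budget check illustrates why: the sketch below compares a raw stream's payload rate against a serial link's line rate. Protocol overhead, line coding, and blanking are ignored here, so real margins are smaller than this estimate suggests:<br />

```python
def stream_gbps(width, height, fps, bits_per_pixel):
    """Raw pixel payload rate in Gbps, ignoring protocol overhead."""
    return width * height * fps * bits_per_pixel / 1e9

def fits_link(width, height, fps, bits_per_pixel, link_gbps):
    """True if the raw stream's payload fits the given serial link rate."""
    return stream_gbps(width, height, fps, bits_per_pixel) <= link_gbps
```

For example, a 3840×2160 sensor at 30 fps with 12-bit raw pixels produces roughly 3 Gbps of payload and fits a 6 Gbps GMSL2 link, while the same sensor at 60 fps with 24-bit pixels would not.<br />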

C. Multi-processor SoC and SOM<br />

A multi-processing system-on-chip (MPSoC) integrates<br />

several processing devices, including additional hardware<br />

functions, into a single silicon chip. Along with a processing<br />

unit, a SoC can contain a GPU, FPGA, memory, peripheral<br />

controllers, power management circuits, and may even contain<br />

wireless radios, or other integrated circuits. A SoC reduces the<br />

overall hardware complexity of a system, the number of<br />

external components, and power consumption because of the<br />

optimized hardware implementation.<br />

An efficient alternative to custom chip-down designs is a<br />

system-on-module (SOM), a small form-factor, ready-to-use<br />

computing module. A SOM combines all core hardware<br />

components, including SoC, external memory, and power regulation, on a pre-engineered compact board, significantly

reducing development time and risk. Both approaches are<br />

described below.<br />

www.embedded-world.eu<br />

663


1) Zynq UltraScale+ MPSoC<br />

Zynq UltraScale+ MPSoC devices are multiprocessor<br />

system-on-chips (MPSoC). Zynq UltraScale+ MPSoC is a<br />

family with three distinct variants: (a) Zynq UltraScale+ CG with dual ARM Cortex-A53 application processors and dual Cortex-R5 real-time processors, (b) Zynq UltraScale+ EG with quad ARM Cortex-A53, dual Cortex-R5 processors and a GPU, and (c) Zynq

UltraScale+ EV, similar to the EG version, but with an<br />

additional video codec [9]. The Zynq UltraScale+ EV devices<br />

with H.264/H.265 video codec can simultaneously encode and decode up to 4Kx2K (60fps).

Zynq UltraScale+ MPSoCs have two main hardware parts:<br />

(1) a processing system – PS, and (2) a programmable logic –<br />

PL:<br />

• The PS consists of an application processor unit (APU), a real-time processing unit (RPU), a graphics processing unit (GPU) in the case of the EG and EV devices, a configuration security unit (CSU), a variety of peripherals, integrated memory, and high-speed communications interfaces.

• The PL is equivalent to an FPGA and contains programmable logic and interconnections. The advantage of the FPGA part is its re-programmability: it acts as a large logic circuit that can be configured according to a design and, if changes are required, re-programmed with an update. Additional Xilinx or third-party IP blocks can be implemented in the PL.

Fig. 4. Zynq UltraScale+ EV MPSoC block diagram with the details of the<br />

processing system (PS) and programmable logic (PL)<br />

The interconnection between the PS and PL is supported by<br />

the advanced extensible interface (AXI), a burst-oriented, open standard protocol that allows the connection and management of many controllers and peripherals in a multi-master design, and communication between IP cores. Each AXI port contains independent read and write

channels.<br />

Zynq UltraScale+ MPSoC devices can operate in different<br />

modes: (1) the PS can work in a standalone mode without<br />

attaching any additional fabric IP, or (2) IP cores can be<br />

instantiated in fabric and attached to the Zynq UltraScale+ PS<br />

as a PS+PL combination. The second option is usually the most appropriate, as it takes advantage of software-hardware co-design, with high-bandwidth building blocks in fabric and lower-bandwidth parts running on the processors.

The different core units of the MPSoC have different<br />

functions and specifications, which are described below:<br />

Application processor unit (APU). The APU includes several ARM microprocessor units (MPUs) and is ideal for

high-level vision processing. MPUs are easy to program,<br />

because the tools, libraries and programming structure are<br />

similar to those for PC applications and leverage existing<br />

standard APIs like OpenCV or V4L. This reduces the learning<br />

curve required to program new hardware and applications. An<br />

MPU can run very complex algorithms, but at high image resolutions and frame rates it quickly runs out of horsepower and cannot achieve real-time performance, so additional hardware acceleration is required.

Real-time processing unit (RPU). RPUs are ideal for high-performance, real-time applications. RPUs are similar to MPUs, but allow higher-performance interrupt handling and quicker response to real-time events. The high performance and high determinism of RPUs make them ideal for functional safety and security applications.

Graphics processing unit (GPU). GPUs were initially designed for displaying video on computer monitors, but for several years they have also been used for general-purpose computing.

GPUs are also used for training and inference for deep<br />

learning. A GPU has a massively parallel architecture<br />

consisting of thousands of small, efficient cores designed for<br />

handling multiple tasks simultaneously. Compared with a<br />

MPU, which has only a few cores optimized for sequential<br />

serial processing, a GPU can perform tasks much faster than an<br />

MPU, due to its parallel processing. Memory latency is hidden by very fast context switching, but GPUs typically have higher power consumption and heat generation than other processing devices.

Field-programmable gate array (FPGA). The FPGA forms the PL and is an integrated circuit designed to be configured after manufacturing. This differs from application-specific integrated circuits (ASICs) and application-specific standard products (ASSPs). ASICs and ASSPs have the advantage of high performance and low power consumption, but due to their higher initial engineering cost they are only economical for high-volume manufacturing and long production runs. They are not suitable for rapid prototyping and they are not reconfigurable: once manufactured, they cannot be reprogrammed. This lack of flexibility has led to the use of FPGAs to implement all the desired functionality directly in the programmable logic, including the required custom hardware accelerators, or even the architecture of a microprocessor (soft processor core). FPGAs have low power consumption and are better suited for low-level processing than general-purpose hardware.

Digital Signal Processors (DSP) can be part of an FPGA and<br />

offer single cycle multiply and accumulation operations, in<br />

addition to parallel processing capabilities and integrated<br />

memory blocks. DSPs deliver excellent overall performance<br />

across low-, mid- and high-level vision processing.<br />
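The single-cycle multiply-accumulate (MAC) operation mentioned above is the building block of filters such as the finite impulse response (FIR) filter. The plain C++ below only illustrates the arithmetic; in an FPGA, the inner loop would be unrolled across parallel DSP slices.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// A 1-D FIR filter: y[n] = sum_k h[k] * x[n-k]. Each term of the sum is
// one multiply-accumulate (MAC), the operation a DSP slice performs in a
// single cycle; hardware runs the taps in parallel, this code runs them
// sequentially.
std::vector<int> fir(const std::vector<int>& x, const std::vector<int>& h) {
    std::vector<int> y(x.size(), 0);
    for (size_t n = 0; n < x.size(); ++n)
        for (size_t k = 0; k < h.size() && k <= n; ++k)
            y[n] += h[k] * x[n - k];  // one MAC per tap
    return y;
}
```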



Configuration security unit (CSU). The security unit with<br />

cryptographic capabilities can be used for hardware<br />

acceleration of cryptographic functions. The secure boot<br />

functionality in Zynq UltraScale+ MPSoC allows support for<br />

confidentiality, integrity, and authentication of partitions.<br />

Secure boot is accomplished by combining the hardware root<br />

of trust (HROT) capabilities of the Zynq UltraScale+ device<br />

with the option of encrypting all boot partitions. The HROT is<br />

based on the RSA-4096 asymmetric algorithm in conjunction with SHA-3/384, which is hardware accelerated, or SHA-2/256, implemented in software. Confidentiality is provided using the 256-bit advanced encryption standard in Galois/counter mode (AES-GCM).

2) UltraZed-EV SOM<br />

An efficient alternative to custom chip-down designs is a<br />

system-on-module (SOM). Designers can use a SOM as a<br />

reference to create their own vision system, or can drop the<br />

SOM into their final product with their custom-designed carrier<br />

board. The UltraZed-EV SOM, developed by Avnet [10],<br />

enables designers to build multi-camera systems for<br />

automotive ADAS, surveillance, and other embedded vision<br />

applications, and because of the H.264/H.265 video codec unit integrated into the MPSoC EV, it is possible to simultaneously

encode and decode up to 4Kx2K (60fps).<br />

The Avnet UltraZed-EV SOM is a production-ready, high<br />

performance, full-featured SOM with the Zynq UltraScale+<br />

EV. The SOM includes all the necessary functions, such as onboard<br />

dual system memory, high-speed transceivers, Ethernet,<br />

USB, and configuration memory. The UltraZed-EV provides<br />

access to 152 user I/O pins, 26 PS multiplexed I/O pins (MIO)<br />

and gigabit transceivers GTR/GTH. GTR and GTH are gigabit<br />

transceivers to support the most common serial high speed<br />

interconnects. GTR supports a maximum data rate of 6.0 Gb/s,<br />

while GTH supports a maximum data rate of 16.3 Gb/s. The<br />

UltraZed-EV SOM has 4 high-speed PS GTR transceivers along with 4 GTR reference clock inputs, and 16 high-speed PL GTH transceivers along with 8 GTH reference clock

inputs through three I/O connectors. Avnet also provides Linux<br />

board support packages (BSP), to reduce the time required to<br />

bring up an operating system on the SOM, allowing developers<br />

to immediately start developing their differentiating algorithms<br />

and applications.<br />

Fig. 5. UltraZed-EV SOM<br />

D. Carrier board<br />

A carrier board, carrier card, or evaluation board provides the mounting option for the multi-sensor FMC module and additional modules or cards. It also includes the Zynq

UltraScale+ MPSoC device or connector for the Avnet<br />

UltraZed-EV SOM. It has an Ethernet connector, SD Card<br />

interface, I/O interfaces, peripherals, video output, and power<br />

supplies. Several configurations are available to build the<br />

multi-camera solution. Although the multi-sensor FMC module has only 6 connectors for camera modules, it is possible to extend the number of cameras if the carrier board has more than one FMC connector. The description

below explains the type of boards with compatible MPSoC or<br />

SOM, the possible configuration of the FMC module, and the<br />

number of camera modules that can be connected:<br />

• Xilinx ZCU102 evaluation board [11]: this board is equipped with the Zynq UltraScale+ EG (ZU9EG). The board has 2 FMC connectors and can support up to 10 camera modules with the following configuration: 4x 2Mp modules and 2x 8Mp modules on the first FMC connector, and 4x 2Mp modules on the second FMC connector.

• Xilinx ZCU104 evaluation board [12]: this board is equipped with the Zynq UltraScale+ EV (ZU7EV) and has an additional video codec. The board has 1 FMC connector and can support up to 6 camera modules with the following configuration: 4x 2Mp modules and 2x 8Mp modules.

• Avnet UltraZed EV carrier card [13]: this carrier card has a socket for the Avnet UltraZed-EV SOM. The carrier card has 1 FMC connector and can support up to 5 camera modules with the following configuration: 4x 2Mp modules and 1x 8Mp module. Both the carrier card and SOM can be part of the UltraZed EV starter kit, bundled to provide a complete system for prototyping and evaluation.

IV. DESIGN METHODOLOGY

The design methodology is based on reVISION, a stack<br />

including a broad range of development resources for (1)<br />

algorithm development, (2) application development and (3)<br />

platform development from Xilinx or from third parties.<br />

reVISION is more responsive than typical SoCs & embedded<br />

GPUs, and delivers up to 6x better images/sec/Watt in machine<br />

learning, 42x higher frames/sec/Watt for computer vision<br />

processing, and 1/5th the latency [14]. reVISION works in<br />

synergy with existing development tools for hardware and<br />

software application design, including Xilinx SDx<br />

environment, PetaLinux, and additional reVISION libraries for<br />

computer vision and machine learning applications. SDx consists of the SDAccel and SDSoC development

environments, and the Vivado Design Suite. SDAccel is<br />

typically used for applications in data centers and for PCIe<br />

based accelerator systems, and is beyond the scope of this<br />

paper. SDSoC, Vivado Design Suite, PetaLinux and reVISION<br />

libraries will be explained in detail in this section and are the<br />

core components for the high-resolution multi-camera solution.<br />

Additionally, we will explain the vision capture pipeline and a<br />

workflow example.<br />



A. SDSoC<br />

The software-defined system on chip environment<br />

(SDSoC) is an Eclipse-based integrated development<br />

environment (IDE) for implementing embedded systems using<br />

Zynq devices. SDSoC includes a full-system optimizing C/C++<br />

compiler, providing an intuitive programming model for<br />

software engineers to write applications in C/C++. The SDSoC<br />

system compiler creates a complete embedded system on the<br />

device by compiling the application into hardware and<br />

software, including a complete boot image with firmware,<br />

operating system, and application executable. SDSoC performs program analysis, task scheduling, and binding onto programmable logic and the embedded MPUs, as well as hardware and software code generation that automatically implements communication between hardware and software

components. SDSoC is built on the general Xilinx SDK<br />

(XSDK), inheriting many of its tools including debuggers,<br />

performance analysis, command-line tools, GNU toolchain<br />

such as GNU C library (glibc), and standard libraries like<br />

OpenCV. SDSoC also supports open computing language<br />

(OpenCL), with OpenCL kernels that target the programmable logic of Zynq devices. OpenCL is a framework for writing

programs that execute across heterogeneous platforms<br />

consisting of CPUs, GPUs, DSPs, FPGAs and other processors<br />

or hardware accelerators.<br />

B. Vivado Design Suite<br />

Vivado Design Suite is a software suite produced by Xilinx<br />

for synthesis and analysis of hardware description language (HDL) designs, with features for SoC development and high-level synthesis (Vivado HLS). Vivado performs timing

analysis, examines register-transfer level (RTL) diagrams,<br />

simulates a design's reaction to different stimuli, and<br />

configures the target device with the programmer.<br />

Vivado HLS accelerates IP creation by enabling C, C++, and SystemC specifications to be directly targeted into Xilinx

devices without the need to manually create RTL. The Vivado<br />

IP integrator makes it easy to add hardware IPs to existing<br />

design source and create connections for ports, such as clock<br />

and reset. Vivado HLS and IP integrator are used for hardware<br />

system development. Configuration of the embedded<br />

processors, peripherals, and the interconnection of these<br />

components, also takes place in Vivado.<br />

C. PetaLinux<br />

PetaLinux is a brand name used by Xilinx to provide a full<br />

embedded Linux system specifically targeting FPGA-based<br />

SoC designs. It includes the Linux OS as well as a complete<br />

configuration, build, and deploy environment for Xilinx<br />

silicon. Because it is a standard Linux OS with Linux drivers,<br />

and with standard application programming interfaces (APIs), a<br />

developer has the advantage of incorporating existing<br />

functional software blocks and facilitating porting of<br />

applications from other processors. Software ecosystems like OpenCV or GStreamer can be used within PetaLinux. Real-time video capturing on Linux can be achieved with Video4Linux (V4L), which includes a collection of device

drivers and an API. V4L has the advantage that programmers<br />

can easily add video support to applications without a lot of<br />

development effort, because it supports USB webcams and<br />

image sensors from many image sensor vendors, and is<br />

supported by many existing libraries and applications.<br />

PetaLinux refers to an individual software package, but it is<br />

not a standalone embedded Linux development solution. The<br />

workflow for PetaLinux consists of multiple layers in which it<br />

relies on other Xilinx software like Vivado and SDSoC. It is based on Yocto and adds the hardware/software interface (HSI) from Vivado, and special tools for boot image creation.

PetaLinux consists of three key elements: (a) pre-configured<br />

binary bootable images, (b) fully customizable Linux for<br />

Xilinx SoC devices, and (c) PetaLinux SDK, including tools<br />

and utilities to automate complex tasks across configuration,<br />

build, and deployment.<br />

PetaLinux reference board support packages (BSPs) are<br />

available as reference designs that help the developer start working with a fully optimized design, and they can be customized afterwards for the developer's own projects. A BSP

includes all the necessary design and configuration files, pre-built and tested hardware and software images, ready to

download on a carrier board, or for booting in the system<br />

emulator (Quick EMUlator). Developers can customize the<br />

boot loader, Linux kernel, or Linux applications. They can add<br />

new kernels, device drivers, applications, libraries, and boot<br />

and test software stacks on the QEMU or on physical hardware<br />

via a network connection or JTAG.<br />

D. reVISION libraries<br />

The reVISION stack includes xfOpenCV and xfDNN,<br />

libraries for algorithm development and execution. xfOpenCV<br />

has a broad set of acceleration-ready OpenCV functions for<br />

computer vision processing. xfDNN is a library intended for<br />

machine learning inference implementation. For application<br />

level development on top of the algorithm development, Xilinx<br />

supports industry-standard frameworks including OpenVX for<br />

computer vision and Caffe for machine learning. SDSoC is<br />

used to enable algorithm and/or application development in C,<br />

C++ and/or OpenCL, by using the reVISION resources. The<br />

SDSoC Environment can also be used to expand the reVISION<br />

resources with new acceleration-ready software libraries.<br />

1) xfOpenCV library<br />

The open source computer vision library (OpenCV) is a software library aimed at real-time computer vision on still images or video. The library includes a large

number of algorithms for filtering and image optimization,<br />

tracking of moving objects, image stitching, recognition and<br />

classification. Advanced computer vision algorithms used for<br />

image and video processing in 2D and 3D are part of the<br />

library. The standard OpenCV library, with several thousand functions, can be used within the Xilinx C/C++ embedded

software environments on the PS, but it is even more<br />

interesting to use hardware-accelerated functions running in<br />

PL. Xilinx provides the xfOpenCV library, an FPGA device<br />

optimized and hardware accelerated OpenCV library, intended<br />

for application developers using Zynq devices. xfOpenCV<br />

library functions are similar in functionality to their OpenCV<br />

equivalent. xfOpenCV provides a software interface for<br />

computer vision functions accelerated on an FPGA device. The<br />

xfOpenCV library is designed to work in the SDSoC<br />

development environment. SDSoC performs two steps<br />



automatically, which drastically increases productivity when<br />

accelerating functions:<br />

• Hardware accelerators are created for each of the xfOpenCV function instantiations in the C/C++ code.

• Data movers are instantiated when needed to access image data from/to external memory.

The following figure shows the reVISION platform with an<br />

xfOpenCV function implemented in the hardware [15]:<br />

2) xfDNN library<br />

Xilinx deep neural network library (xfDNN) is optimized<br />

for machine learning and deep learning inference applications.<br />

The reVISION Stack with xfDNN enables deployment of<br />

trained networks on a Zynq UltraScale+ MPSoC for inference.<br />

xfDNN is designed for maximum compute efficiency at 16-bit<br />

and 8-bit integer data types. xfDNN includes support for the<br />

most popular neural networks including AlexNet, GoogLeNet,<br />

SqueezeNet, SSD, and FCN. Additionally, the stack provides<br />

library elements including pre-defined and optimized implementations of convolutional neural network (CNN) layers, required to build custom neural networks. Zynq UltraScale+

MPSoC devices are ideal for deep learning inference, achieving better results in images/sec/Watt compared to embedded GPUs.
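The 8-bit integer data type relies on quantizing trained floating-point weights. The sketch below shows plain symmetric linear quantization for illustration; xfDNN's actual quantization scheme is more elaborate, and the helper shown is not part of the library API.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>
#include <vector>

// Symmetric linear quantization: map float weights in [-max|w|, +max|w|]
// onto the int8 range [-127, 127]. Illustrative only; this helper is not
// part of the xfDNN API.
struct Quantized {
    std::vector<int8_t> values;
    float scale;  // dequantize a value with: value * scale
};

Quantized quantize_int8(const std::vector<float>& w) {
    float max_abs = 0.0f;
    for (float v : w) max_abs = std::max(max_abs, std::fabs(v));
    const float scale = (max_abs > 0.0f) ? max_abs / 127.0f : 1.0f;
    Quantized q{{}, scale};
    for (float v : w)
        q.values.push_back(static_cast<int8_t>(std::lround(v / scale)));
    return q;
}
```

Trading float multiplies for int8 MACs is what lets the DSP slices in the PL process many more operations per watt.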

E. Vision capture pipeline<br />

The MIPI CSI-2 capture pipeline uses the V4L Linux<br />

framework and is implemented in the PL. It consists of the<br />

“MIPI CSI-2 sub-system”, the “AXI-Stream switcher” (for the<br />

case of the quad GMSL deserializer), the “Image Pipeline”,<br />

and the “Frame Buffer Write”. The figures below illustrate the<br />

pipeline for the quad GMSL and dual GMSL2:<br />

Fig. 6. xfOpenCV kernel on the reVISION platform<br />

The main challenge of running OpenCV-based applications<br />

on embedded hardware is the fact that all functions access<br />

image data from external memory. As more and more OpenCV functions are called, more and more accesses to external memory are required, which increases the latency and power

consumption of the entire system. Xilinx resolved this<br />

challenge by providing xfOpenCV functions which can infer a<br />

streaming pipeline implementation. The following image<br />

illustrates this using a stereo vision algorithm, calculating depth<br />

from two (stereo) cameras. xfOpenCV directly infers<br />

pipelining functions from one to the next, avoiding frame<br />

buffers and external memory.<br />

Fig. 8. MIPI capture pipelines for quad GMSL (A) and dual GMSL2 (B)<br />

Fig. 7. Comparison of traditional computer vision memory access (A) with<br />

xfOpenCV (B)<br />

The following steps describe the MIPI CSI-2 capture<br />

pipeline:<br />

1. The image sensors provide raw image data via<br />

SerDes over MIPI CSI-2 link<br />

2. MIPI CSI-2 sub-system receives and decodes the<br />

incoming data stream to AXI4-Stream<br />

3. Image Pipeline manipulates the data for each<br />

image stream, including:<br />

o Demosaic IP converts the raw image<br />

format to RGB<br />



o Gamma IP provides per-channel gamma<br />

correction functionality<br />

o Color correction<br />

o Color space conversion converts the<br />

RGB image to YUV422<br />

o Video scaling resizes the images<br />

4. Frame Buffer Write IP writes the output of the<br />

Image Pipeline to memory<br />
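Of the pipeline stages listed above, per-channel gamma correction is commonly realized as a small lookup table that is precomputed once and then applied to every pixel. The sketch below is a hedged software model of that idea, not the implementation of the Xilinx Gamma IP.

```cpp
#include <array>
#include <cassert>
#include <cmath>
#include <cstdint>

// Software model of a per-channel gamma stage: precompute a 256-entry
// lookup table once, then apply it to every pixel of that channel.
// Illustrative only; not the implementation of the Xilinx Gamma IP.
std::array<uint8_t, 256> make_gamma_lut(double gamma) {
    std::array<uint8_t, 256> lut{};
    for (int i = 0; i < 256; ++i)
        lut[i] = static_cast<uint8_t>(
            std::lround(255.0 * std::pow(i / 255.0, gamma)));
    return lut;
}
```

Applying the stage then reduces to one table lookup per pixel, which is why it maps so cheaply to hardware.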

The Image Pipeline is flexible and is determined by the developer. It can include Xilinx IP cores and/or third-party IP cores. logicBRICKS/Xylon is one of the partners that provide optimized IP cores for ADAS applications [16].

F. Workflow example<br />

With the reVISION stack, developers can start with an<br />

existing platform with all the interfaces and a reference BSP in<br />

place, and can concentrate on developing their own<br />

differentiating algorithms and applications. The typical<br />

reVISION workflow involves the following steps:<br />

• Select the hardware platform including carrier card, FMC module and cameras

• Build a reference BSP and run it on the hardware

• Start application development in C/C++ in SDSoC, with the OpenCV library

• Cross-compile OpenCV applications to the PS of the Zynq UltraScale+

• Profile and identify bottleneck functions in the code and select the potential candidates for hardware acceleration

• Make minimal modifications to call the Xilinx optimized xfOpenCV functions, instead of the standard OpenCV functions

• Recompile the xfOpenCV functions for hardware acceleration using SDSoC. SDSoC not only builds hardware accelerators from the xfOpenCV functions, but also instantiates the data movers, such as DMA engines, needed to transfer the data to/from external DRAM memory

• Execute the final optimized application on the embedded hardware

In the near future, a complete reVISION compliant<br />

development solution will be proposed. The solution will<br />

include the Avnet UltraZed EV carrier card with Avnet<br />

UltraZed-EV SOM and the multi-sensor FMC module, which<br />

leverages the ON Semiconductor MARS cameras and Maxim Integrated GMSL serial links. The solution will include

PetaLinux board support package with (1) V4L2 compliant<br />

Linux drivers to access the cameras on the Multi-Camera FMC<br />

module, and (2) pre-compiled OpenCV library for computer<br />

vision functions. The solution will be fully integrated into the<br />

reVISION stack.<br />

V. CONCLUSION<br />

This paper describes a multi-camera methodology for<br />

autonomous vision systems based on the Xilinx reVISION<br />

stack. It includes development resources for algorithm<br />

development, application development and platform<br />

development. The stack is highly optimized for computer<br />

vision and machine learning tasks used in automotive ADAS, UAVs/drones, autonomous guided robots and VR/AR

applications. The core of the solution is a Xilinx Zynq<br />

UltraScale+ MPSoC combining multiple processing units, and<br />

programmable logic. The architecture of the MPSoC is ideal<br />

for a software-hardware co-design with high-bandwidth building blocks in the programmable logic, and lower-bandwidth parts and the operating system running on the processors. The combination of the carrier board with the multi-sensor FMC module and optimized MARS camera modules provides a robust solution for high-resolution multi-camera

development. The platform with reference BSP simplifies the<br />

design and reduces development time and effort significantly.<br />

REFERENCES<br />

[1] DeePhi Deep Learning Platforms. https://www.xilinx.com/video/corporate/deephi-deep-learning-platforms.html. Last visited 19.1.2018.
[2] Y. Umuroglu, N. J. Fraser, G. Gambardella, M. Blott, P. Leong, M. Jahre, and K. Vissers. FINN: A Framework for Fast, Scalable Binarized Neural Network Inference. ArXiv e-prints, 2016.
[3] Kortiq Small and Efficient CNN Accelerator. https://www.xilinx.com/video/corporate/kortiq-small-efficient-cnn-accelerator.html. Last visited 19.1.2018.
[4] Synopsys - The impact of AI on autonomous vehicles. Technology webinar, 14.12.2017.
[5] ON Semiconductor Modular Automotive Reference System product page. http://www.onsemi.com/PowerSolutions/content.do?id=18780. Last visited 19.1.2018.
[6] Avnet multi-sensor FMC. Not announced, January 2018. Will be public on http://ultrazed.org.
[7] Maxim Integrated gigabit multimedia serial link (GMSL) serializer and deserializer product page. https://www.maximintegrated.com/en/products/interface/high-speed-signaling/gmsl.html. Last visited 19.1.2018.
[8] FMC card product page. https://www.xilinx.com/products/boards-and-kits/fmc-cards.html. Last visited 19.1.2018.
[9] Xilinx Zynq UltraScale+ MPSoC product overview. https://www.xilinx.com/products/silicon-devices/soc/zynq-ultrascale-mpsoc.html. Last visited 19.1.2018.
[10] Avnet UltraZed-EV SOM product page. http://ultrazed.org/product/ultrazed-ev%E2%84%A2-som. Last visited 19.1.2018.
[11] Xilinx ZCU102 evaluation board user guide. https://www.xilinx.com/support/documentation/boards_and_kits/zcu102/ug1182-zcu102-eval-bd.pdf. Last visited 19.1.2018.
[12] Xilinx ZCU104 evaluation board. Not announced, January 2018. Will be public on the Xilinx website.
[13] Avnet UltraZed EV carrier card. Not announced, January 2018. Will be public on http://ultrazed.org.
[14] Xilinx reVISION developer zone. https://www.xilinx.com/products/design-tools/embedded-vision-zone.html. Last visited 19.1.2018.
[15] Xilinx OpenCV User Guide. https://www.xilinx.com/support/documentation/sw_manuals/xilinx2017_1/ug1233-xilinx-opencv-user-guide.pdf. Last visited 19.1.2018.
[16] logicBRICKS / Xylon IP Cores. https://www.logicbricks.com/Products/logiADAK-VDF.aspx. Last visited 19.1.2018.



Embedded Vision Systems<br />

used as Sensors in IoT Applications<br />

How vision sensors can be used in different industrial areas<br />

Marcus-Michael Müller<br />

Strategic Business Initiatives<br />

Basler AG<br />

Ahrensburg, Germany<br />

Marcus.Mueller@baslerweb.com<br />

Abstract— To understand how embedded vision systems can<br />

be transformed into sensors and used in IoT applications you<br />

have to comprehend what these systems do and how they are<br />

integrated into the IoT environment. An embedded vision system<br />

consists of a camera module, a processing unit and a kind of<br />

operating system software. With special software running on the embedded vision system that can analyze the pictures or image streams taken by the camera, the embedded vision system can be turned into a vision sensor. This vision sensor delivers metadata derived from the image, and this data can then be used in IoT applications for analytics, interpretation or prediction.

the human vision system for decades and the result we named<br />

computer vision.<br />

In the beginning, the systems were huge and quite<br />

expensive and the processing power was very low. But during<br />

the last 10 years the processing units became smaller, more<br />

powerful and more energy efficient. So the system was made<br />

much smaller and named smart camera or IoT camera system,<br />

if it is connected to a cloud. If you put all this together and use<br />

it with software for a computer vision or machine vision<br />

application then you can call it an embedded vision system.<br />

The combination of cameras and processing boards can be<br />

utilized in several professional B2B applications to collect data derived from images captured by the camera.

In several areas, like retail, factory automation or smart cities<br />

embedded vision systems already help to deliver the required<br />

data for an IoT application. In smart buildings, for example, the current reading of old electric meters can be transmitted to the utility company. All these systems have to be trained, for example

with neural networks to improve the results or to reduce the<br />

setup time of a system.<br />

In none of these cases is the embedded vision system used for capturing or streaming video; a system used that way is simply an IoT or IP camera. The application software on the embedded vision

system is the most relevant tool to derive the data out of images<br />

that are taken by a camera. This data can be used for analytics<br />

and even to start actions or stop machines.<br />

I. FROM HUMAN VISION TO AN EMBEDDED VISION SYSTEM

What do we understand when we talk about a computer<br />

vision system and what is the difference between a vision<br />

system and an embedded vision system?<br />

There is a vision system all of us are familiar with: Our<br />

own eye with its optic nerve in connection with our brain.<br />

A part of this system is used for image recognition, processing<br />

and interpretation – 24h a day. We have tried to rebuild<br />

Fig. 1: From Human Vision to Embedded Vision<br />

www.embedded-world.eu<br />



II. THE DEFINITION OF THE INTERNET OF THINGS<br />

The Internet of Things (IoT) can be described as the intelligent connectivity of physical devices driving massive gains in efficiency, business growth, and quality of life [1]. The IoT can be separated into the Industrial IoT and the Consumer IoT:<br />

• In the Industrial IoT, critical machines and sensors in high-stakes industries (e.g. energy, aerospace, defense or healthcare) are connected, and failures can result in life-threatening or other emergency situations. Special sensors for industrial applications and data aggregation are used, and deep learning leads to more efficiency.<br />

• In the Consumer IoT you will find more consumer-level devices, such as wearable fitness tools, smart home automation, glasses or automatic pet feeders. They are convenient, and breakdowns do not immediately create emergency situations. Networked home appliances can be classified as IoT gadgets, and sometimes they also lead to lower power consumption.<br />

In recent years the Internet of Things has become something of a buzzword, although it just means that devices are connected to the internet. All smartphones and fitness trackers are “things” that deliver data into a cloud, regardless of whether the data is used for any analytics or not.<br />

At the IoT World Forum in Chicago in 2014, the 28 members of the IoT World Forum’s Architecture, Management and Analytics Working Group (made up of Cisco, IBM, Rockwell Automation, Oracle, Intel and a variety of others) presented an IoT Reference Model [2]. This model was proof that the major industry players were working closely together to move the Internet of Things from the realm of hype to something real, with the necessity of an open, standards-based approach.<br />

Fig. 2: IoT World Forum Reference Model [2]<br />

The reference model has seven tiers. Starting at the lowest tier there are physical devices and controllers (the things), then there is connectivity and, above that, edge computing where, for example, you might want to do some initial aggregation, de-duplication and analysis. These lower three levels can be considered operational technology (OT), whereas the remaining four levels are IT. The lowest level in the IT part of the stack is storage, succeeded in turn by data abstraction, applications, and collaboration and (business) processes [3]. According to this model, an embedded vision system is capable of handling the first three levels if the video analytics and processing is done at the edge, in the device.<br />

In the IoT environment it is normally not useful to stream images and analyze them in the cloud, because of the huge amount of data that is generated and the high bandwidth resources that are required. Sometimes real-time interactions are needed, and this means that the data processing has to be done at the edge.<br />

III. WHAT TURNS AN EMBEDDED VISION SYSTEM INTO A VISION SENSOR<br />

Today, many different sensors are used in IoT devices or machines to imitate the different human senses, such as touch, hearing, smell, taste or balance. These sensors help to analyze the current status of a machine, or can measure a heart rate. But the most complex sense is the sense of vision. If an embedded vision system is to be used as a vision sensor, it has to be defined what the vision sensor should be able to “see”, because no existing system is capable of knowing what a human brain knows.<br />

A vision sensor can detect many things in an image or image stream: products, colors, empty or full, correctly or incorrectly assembled, defects, motion, light, dark, barcodes, people, faces or animals.<br />

Fig. 3: Examples of what a trained embedded vision system can see<br />

It is a question of training, analytics and output. Machine vision solutions have been on the market for years, in which high-end computer systems with machine vision cameras extract the information from a picture using libraries and algorithms. These systems are not able to learn and have to be preconfigured. In factory automation such systems are very efficient and optimized for the things they should classify or detect.<br />



Modern embedded processing boards and new algorithms have led to new possibilities and to smaller, more energy-efficient systems. These are more flexible, because they can be trained and improved by using machine learning with neural networks. Training a neural network with machine learning can turn an embedded vision system into a special vision sensor for several different IoT applications. Many companies offer frameworks or special software in the cloud that can reduce the time needed to train a system and improve the results.<br />

A typical vision sensor classifies the images at the edge of the cloud, which means that the pictures are not transferred into the cloud and classified there. A typical classification software would be a neural network. This neural network runs on the embedded vision system itself, transforming it into a vision sensor that delivers only data derived from the classified images. It is also possible to do the analytics in the cloud, but then the system acts like an IP camera that streams images into the cloud, not like a vision sensor.<br />

IV. EXAMPLE: HOW TO SET UP AND TRAIN AN EMBEDDED VISION SYSTEM<br />

For the initial setup and to train the system, one needs an<br />

industrial camera, a processing board and a connection to the<br />

web. Then you have to decide what the system should detect or<br />

classify.<br />

To build a vision sensor that is able to classify images of a defined product, for example, a neural network and a deep learning framework can be used. Such a vision sensor can detect whether a product was assembled correctly or whether it has other defects.<br />

For this, the embedded vision system must be connected to the web to send the pictures into the cloud. Images of different states, such as correctly assembled parts as well as different defects of the product, have to be taken and sent to the cloud as a foundation for training the deep learning model.<br />

In a controlled setting, 100 to 200 images of the same product from different angles are sufficient to train a neural network. These images are uploaded and then classified by human inspectors: the products in the pictures are labeled as correctly or incorrectly assembled, or as defective or good parts.<br />

With these qualified images a data scientist can train a neural network. A deep learning framework such as MXNet, Caffe, TensorFlow or CNTK helps to train the models; with such a framework a neural network can be built much faster. The speed of the training process can be increased by using the power of more processors, which is why it is useful to do the training in the cloud. However, the resulting trained model runs on a simple processing board and can be distributed to the embedded vision system.<br />

Fig. 4: How to train an embedded vision system<br />
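The train-in-the-cloud, run-at-the-edge flow described above can be sketched without any particular deep learning framework. The toy below uses a plain numpy softmax classifier on feature vectors as a stand-in for a real neural network; all data, names and shapes are illustrative and not from the paper. The weight matrix produced by `train` plays the role of the trained model that would be distributed to the board.

```python
import numpy as np

rng = np.random.default_rng(0)

def train(features, labels, classes=2, lr=0.1, epochs=200):
    """'Cloud' side: train a linear softmax classifier (the 'model')."""
    n, d = features.shape
    W = np.zeros((d, classes))
    onehot = np.eye(classes)[labels]
    for _ in range(epochs):
        logits = features @ W
        p = np.exp(logits - logits.max(axis=1, keepdims=True))
        p /= p.sum(axis=1, keepdims=True)
        W -= lr * features.T @ (p - onehot) / n   # cross-entropy gradient step
    return W

def classify(W, feature):
    """'Edge' side: only the class label leaves the device."""
    return int(np.argmax(feature @ W))

# Synthetic "good part" (class 0) vs "defective part" (class 1) features.
good = rng.normal(loc=+1.0, size=(100, 8))
bad = rng.normal(loc=-1.0, size=(100, 8))
X = np.vstack([good, bad])
y = np.array([0] * 100 + [1] * 100)

W = train(X, y)                                   # done in the cloud
label = classify(W, rng.normal(loc=+1.0, size=8)) # done on the board
```

The point of the split is that training needs many processors and labeled data, while inference is a single matrix product that fits on a small processing board.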

Once the system is trained and the neural network runs on it, products can be classified. If the system is not able to classify a product or status, a picture of this product is uploaded to a special area where a human user can label and classify the pictures manually. As soon as enough new material has been collected, a new model can be trained and deployed on the embedded vision system, so the results can be improved continuously. There are already many pre-trained neural networks available that can be adapted to specific use cases.<br />
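The classify-or-escalate loop just described can be sketched as below. Note that `predict_proba`, the 0.8 confidence threshold and the in-memory labeling queue are all hypothetical stand-ins for the deployed network and the cloud upload area.

```python
import numpy as np

labeling_queue = []  # stand-in for the upload area where humans label images

def predict_proba(image):
    # Placeholder for the deployed model: mean brightness decides between
    # "good" and "defective" with a made-up confidence value.
    score = float(image.mean())
    return {"good": score, "defective": 1.0 - score}

def classify_or_flag(image, threshold=0.8):
    probs = predict_proba(image)
    label, conf = max(probs.items(), key=lambda kv: kv[1])
    if conf >= threshold:
        return label                  # confident: only the label is reported
    labeling_queue.append(image)      # uncertain: queue for manual labeling
    return None

bright = np.full((4, 4), 0.95)        # clearly "good"
murky = np.full((4, 4), 0.5)          # ambiguous

print(classify_or_flag(bright))       # classified on the device
print(classify_or_flag(murky))        # flagged for human labeling
```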

V. B2B APPLICATIONS WHERE EMBEDDED VISION SYSTEMS ARE ALREADY USED AS SENSORS<br />

The following three examples show in which B2B applications embedded vision systems are used, or will be used, as sensors, and how neural networks help to improve the results of these sensors.<br />

In retail, for example, companies want to know who is in the shop or who reacts to an advertisement. They need only the age, gender or attention time of the customer. Special algorithms on the processing board create this data; no image has to be transmitted or stored. The “IoT camera” works as a sensor and sends only data to the cloud.<br />

In factory automation, optical inspection can be done by embedded vision systems, and these systems can be trained with neural networks to improve the results or to reduce the setup time of a system. Think of a case in which the visual inspection system has to detect defective parts or inspect assembled components. For this, the system is trained with pictures that are transmitted to a cloud, where humans analyze and label them. After the system has been trained, it can autonomously improve its results by using machine learning algorithms.<br />

Embedded vision systems in a traffic environment can reduce traffic jams or help drivers to find free parking spaces. The system can be trained to analyze the traffic flow by extracting the relevant information from a stream of images. Traffic data from different vision sensors is sent to the cloud, where it is collected and analyzed in real time. This current traffic information can be sent to mobile devices, and traffic lights can be controlled to reduce traffic jams. The systems can also be trained to detect free parking spaces or number plates, so that free parking spaces are shown on mobile devices and the parking fee is automatically charged to the driver’s account [4].<br />

In all cases the right combination of hardware and software delivers the best results for a specific application: the hardware provides the image quality, and the software is the tool that transforms the images into data.<br />

CONCLUSION<br />

Today embedded vision systems can already be used as vision sensors in several IoT applications, but not as a standard out-of-the-box product or solution. Many companies offer complete IoT platforms for training systems, using deep learning frameworks and artificial intelligence to build neural networks. A neural network on an embedded vision system transforms it into a special vision sensor that is able to classify something in an image.<br />

These platforms can also be used to analyze the combined data of different sensors, for example to improve workflows or the results of a production process.<br />

In 2009 the ImageNet [5] database was published. This database consists of over 10 million images that have been hand-annotated to indicate which objects are pictured. ImageNet is used in visual object recognition research and for the training of neural networks, which also means that a lot of pre-trained models are already available.<br />

A lot of new and innovative solutions utilizing embedded vision systems to imitate the human sense of vision will appear on the market in the future. Internet of Things applications will use the data of these sensors for analytics, predictions or interpretations, improving the results continuously.<br />

REFERENCES<br />

[1] The Internet of Things, Cisco Connect 2015, https://www.cisco.com/web/offer/emear/38586/images/Presentations/P11.pdf<br />

[2] IoT Reference Model, IoT World Forum, June 2014, Author: Jim Green<br />

[3] The Internet of Things Reference Model, January 2015, Author: Philip Howard, https://www.bloorresearch.com/2015/01/the-internet-of-things-reference-model/<br />

[4] IoT Applications in the Smart City, https://www.baslerweb.com/en/vision-campus/markets-and-applications/iot-applications-in-the-smart-city/<br />

[5] ImageNet, https://en.wikipedia.org/wiki/ImageNet<br />



Closing the loop in Additive Manufacturing – An<br />

embedded solution for real-time melt pool monitoring<br />

Christos Theoharatos, Vangelis Vassalos, Dimitrios Besyris and Vassilis Tsagaris<br />

Computer Vision Systems, IRIDA Labs S.A.<br />

26504 Patras, Greece<br />

{htheohar, vassalos, dbes, tsagaris}@iridalabs.gr<br />

Abstract— Direct Metal Deposition (DMD), as part of the rapidly expanding Additive Manufacturing (AM) market, is a delicate process, and any variation in the machine’s condition and process parameters can result in the production of defective parts in terms of surface quality. For this reason, real-time monitoring of the DMD process is required to control part manufacturing. In this work, a novel vision-based solution for closing the loop in AM is presented. Our work is based on the real-time monitoring of the DMD process with a comprehensive vision sensing system that interacts with the machine process algorithms in order to detect and correct deposition errors, leading to an optimal shape of the manufactured part and optimal material properties, and contributing towards zero-defect AM. The interoperable vision system is designed to monitor the size, shape and intensity of the melt pool. The solution performs on-camera image processing directly on the hardware subsystem’s FPGA for closed-loop AM melting process monitoring.<br />

Keywords—Additive manufacturing; melt pool monitoring; laser<br />

metal deposition; FPGA processing;<br />

I. INTRODUCTION<br />

Additive Manufacturing (AM) brought great innovation in<br />

the field of complex shape part manufacturing, since shape<br />

complexity does not imply additional costs and, moreover, the<br />

use of material is globally reduced compared to traditional<br />

technologies [1]. However, despite the evident advantages,<br />

current AM technologies have important drawbacks that<br />

severely challenge their wide adoption. For instance, AM is still<br />

limited to the production of small parts only, the final surface<br />

quality still requires machining as further processing, the scrap<br />

rate is high and the quality certification is an issue. In short, AM<br />

processes are not industrially robust yet.<br />

As part of the AM technology, Direct Metal Deposition<br />

(DMD) is a powder-based process for building up 3D metallic<br />

parts layer-by-layer, in which a laser is used to melt metal<br />

powder onto a substrate [2]. Due to the increased quality<br />

requirement of 3D parts produced through a DMD process,<br />

knowledge of correlations between the main process parameters<br />

such as laser power, laser head velocity, feed rate and powder<br />

mass stream, and the melt pool behavior, is needed [3]. For this<br />

reason, a necessity has nowadays emerged for being able to<br />

monitor the part quality, detect defect formations and make<br />

corrections or repairs in situ, as a part is being built. Therefore,<br />

the real-time monitoring of the melt pool within a laser<br />

deposition process is an essential part of AM, to better<br />

understand and control the thermal behavior of the process, as<br />

well as to detect any unpredicted fault and continuously control<br />

the interior quality of the machine [4, 5].<br />

In this work, a novel, vision-based, solution for closing the<br />

loop in the AM market is presented. Our work is conducted as<br />

part of the BOREALIS and SYMBIONICA H2020 projects and<br />

is based on the real-time monitoring of the DMD process with a<br />

comprehensive vision sensing system that interacts with the<br />

machine process algorithms in order to detect and correct<br />

deposition errors, leading to an optimal shape of the manufactured part and optimal material properties, and contributing towards zero-defect AM. The interoperable vision system is designed to<br />

monitor the size, shape and intensity of the melting pool thanks<br />

to a sophisticated sensing system (camera and spectroscopy<br />

integrated system) and process parameters corrections<br />

elaborated and implemented at Numerical Controller (NC) level.<br />

The solution implements complex image processing algorithms<br />

directly on the hardware subsystem’s FPGA for closed-loop AM<br />

melting process monitoring. Using the DMD monitoring system<br />

that will be described in the sequel, results of the melt pool<br />

monitoring procedure and parameter estimation will be<br />

presented on data coming from different DMD processes,<br />

followed by distinctive metrics representing the resources<br />

occupied within the FPGA programming unit such as Look-Up<br />

Tables (LUTs), Flip-Flop pairs (FFs), Block-RAMs (BRAMs)<br />

and embedded Arithmetic and Logic Units (e-ALUs). In<br />

addition, the algorithmic methodology for effectively<br />

segmenting the melt pool distribution, as well as the entire DMD<br />

distribution (that is, the melt pool plus the tail that is depicted<br />

due to the movement of the laser head and the associated melting<br />

process) will be shortly presented.<br />

II. HARDWARE METHODOLOGY AND DATA<br />

A. Hardware Set-up<br />

The associated hardware subsystem comprises an Optronis CXP6 high-speed camera with 540 fps at 1696×1708<br />

resolution, equipped with a specific optical system, which is<br />

positioned inside the machine tool head to capture images in-axis with the laser beam. The camera is connected through a<br />

CoaXPress interface to a frame grabber which utilizes a Xilinx<br />

FPGA for performing the melt pool monitoring pipeline in real time.<br />

The frame grabber focuses on high-speed image<br />

acquisition with up to four CoaXPress cameras, giving access to<br />

advanced machine vision applications. The design process followed consists of three main phases. In the first phase, the<br />

algorithm was developed and simulated using artificial data.<br />

This phase ended with the development of bit-accurate models<br />

for each hardware component in the datapath. The second phase<br />

includes the implementation of the hardware components and<br />

the simulation of the whole design. After successful completion<br />

of these two steps, we used the design tools provided by the<br />

FPGA vendor for synthesis and place-and-route procedures so<br />

as to produce the final bitstream file to program the FPGA,<br />

included in the frame grabber. In the third phase, a software<br />

(C++) application was developed in order to manipulate the<br />

results from the FPGA processing, upload them to the Host DDR<br />

memory, package them and send them to the Numerical<br />

Controller.<br />
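A back-of-the-envelope calculation shows why the frames are processed on the frame grabber's FPGA rather than streamed to a host. The 8-bit monochrome pixel assumption below is illustrative; the text does not state the camera's bit depth.

```python
# Raw data rate of the camera described above, assuming 8-bit mono pixels.
width, height, fps = 1696, 1708, 540
bytes_per_pixel = 1  # assumption, not stated in the paper

frame_bytes = width * height * bytes_per_pixel
rate_bytes_per_s = frame_bytes * fps

print(f"{frame_bytes / 1e6:.1f} MB per frame")          # about 2.9 MB
print(f"{rate_bytes_per_s / 1e9:.2f} GB/s raw stream")  # about 1.56 GB/s
```

At roughly 1.5 GB/s of raw pixels, reducing each frame to a handful of statistics on the FPGA is far cheaper than moving the images anywhere.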

B. Equipment and Data Description<br />

In order to carry out the experimental investigation, the DMD process is provided by a Laserdyne small-scale machine equipped with a 1 kW CW fiber laser coupled with a two-axis scan<br />

head. An innovative deposition process with high speed scanner<br />

has been developed, based on the high frequency movement of<br />

the laser beam for high energy efficiency controlling melt pool<br />

size and dimensions. The ablation process makes use of a 100W<br />

infrared pulsed laser source in the ns range. For the initial<br />

experimental analysis of the melt pool geometry and intensity,<br />

single tracks and circular movements were built up, using the<br />

nominal values of the machine’s physical parameters.<br />

III. IMAGE PROCESSING AND PARAMETER ESTIMATION<br />

In this section, the methodology for monitoring the DMD<br />

process and extracting the necessary statistical parameters is<br />

presented. In order to do that, the main objective is to robustly<br />

segment each frame into two overlapping regions R1 and R2, where R1 is the melt pool area and R2 is the entire distribution, that is,<br />

the melt pool along with the tail. Following that, the extraction<br />

of various geometrical shape features / properties of each region<br />

that assist in quantifying their size and shape, as well as intensity<br />

features of the two regions, is carried out. In short, our algorithm<br />

can be summarized in the flowchart of Fig. 1 and is briefly<br />

explained in the following subsections.<br />

A. Frame Recording<br />

Initially, video recording is adjusted in order to provide<br />

focused and well-contrasted images of the melt pool. This is<br />

done by setting suitable values to the camera parameters such as<br />

exposure time and frame rate (the latter value is limited by the<br />

maximum operational frequency achieved by the FPGA design),<br />

given all possible values within the operational range of energy<br />

intensities. Figs. 2 (a) and (b) illustrate two typical frames from<br />

different set-ups of the DMD process, related to a laser power of<br />

300W and a deposition speed of 500 mm/min. In both frames,<br />

the two distributions, that is the melt pool (i.e. the brightest<br />

region) and the melt pool plus the tail (i.e. the entire distribution<br />

that is separated from the background), that need to be<br />

segmented from each frame are clearly distinguished.<br />

B. Image Binarization<br />

Next, image binarization is performed for initially<br />

segmenting the images into the two overlapping regions R1 and R2 defined previously. The segmentation is performed by thresholding the original image at two different levels, t1 and t2.<br />

Therefore, thresholding is used to segment the current frame, by<br />

setting all pixels whose intensity values are above a threshold to<br />

a foreground value and all the remaining pixels to a background<br />

value. Formally,<br />

R1 = {(x, y) ∈ ROI | I(x, y) > t1}  (1)<br />

R2 = {(x, y) ∈ ROI | I(x, y) > t2}  (2)<br />

The simplest way to use image binarization is to choose a<br />

global threshold value, and classify all pixels with values above<br />

this threshold as white and all others as black. This approach is<br />

problematic since there is no standard way of effectively<br />

selecting the correct thresholds, given the intensity variations of<br />

the melt pool images when altering the physical parameters of<br />

the system (i.e. energy power, deposition rate etc.). Therefore,<br />

an adaptive multi-thresholding technique was implemented for<br />

changing the threshold dynamically over the images and<br />

segmenting the two regions R1 and R2 from the background.<br />

Local adaptive thresholding selects an individual threshold<br />

for each pixel, based on the range of intensity values in its local<br />

neighborhood. This allows for thresholding an image whose<br />

global intensity histogram doesn't contain distinctive peaks. In<br />

our case, adaptive thresholding can provide better estimation in<br />

case the internal physical parameters of the laser deposition<br />

process change over time. The most well-known and commonly<br />

utilized adaptive thresholding technique is Otsu’s method [6],<br />

which assumes the image contains two classes of pixels -<br />

foreground and background, and has a bi-modal histogram. In<br />

our case, a two-step procedure was adopted; after separating the<br />

background pixels, foreground was further segmented into two<br />

distributions, the melt pool and the overall distribution<br />

containing the tail. The effectiveness of the method is due to its<br />

attempt to minimize their combined spread (intra-class<br />

variance). Figs. 2 (c) and (d) present the melt pool segmentation<br />

results of the images in Figs. 2 (a) and (b), while Figs. 2 (e) and<br />

(f) present the same results for the entire distribution.<br />
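The two-step Otsu procedure described above can be sketched in numpy as follows. The synthetic three-level image (background, tail, pool) and the thresholds it produces are illustrative only, not data from the paper.

```python
import numpy as np

def otsu_threshold(values, bins=256):
    """Classic Otsu: return the cut maximizing inter-class variance."""
    hist, edges = np.histogram(values, bins=bins, range=(0.0, 1.0))
    p = hist.astype(float) / hist.sum()
    centers = (edges[:-1] + edges[1:]) / 2.0
    w0 = np.cumsum(p)                 # probability of the class below the cut
    mu = np.cumsum(p * centers)       # cumulative mean
    mu_t = mu[-1]
    w1 = 1.0 - w0
    valid = (w0 > 0) & (w1 > 0)
    sigma_b = np.zeros_like(w0)       # between-class variance per candidate cut
    sigma_b[valid] = (mu_t * w0[valid] - mu[valid]) ** 2 / (w0[valid] * w1[valid])
    return edges[np.argmax(sigma_b) + 1]   # upper edge of the best bin

# Synthetic frame: dark background, warmer "tail", hot "melt pool".
img = np.full((64, 64), 0.05)
img[20:50, 20:50] = 0.5        # tail region (900 px, including the pool area)
img[30:40, 30:40] = 0.9        # melt pool (100 px)

t2 = otsu_threshold(img.ravel())       # step 1: background vs. foreground
t1 = otsu_threshold(img[img > t2])     # step 2: tail vs. melt pool
R2 = img > t2                          # entire distribution
R1 = img > t1                          # melt pool only
```

Because the second threshold is computed only on foreground pixels, it adapts automatically when laser power or deposition rate shift the overall intensity level.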

C. Morphological Filtering<br />

In order to refine the binarization results and filter out sparks<br />

that are apparent due to powder and particle reflections,<br />

morphological filtering is applied for smoothing both regions.<br />

This is a non-linear operation related to the shape or morphology of features in an image; it relies only on the relative ordering of pixel values, not on their numerical values, and is therefore especially suited to the processing of binary images [7].<br />

Morphological filters may act as filters of shape, filtering out any<br />

details that are smaller in size than the structuring element. A<br />

combination of “opening” and “closing” operations using<br />

different structural elements is utilized here for refining the<br />

segmentation result, which is illustrated in Figs. 2 (g), (h) and<br />

(i), (j) for the melt pool and the entire distribution respectively.<br />

As it is clearly demonstrated, the sparks are effectively filtered<br />

out, and therefore the resulting melt pool distribution is segmented<br />

correctly in order to extract the necessary geometry parameters.<br />
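A minimal numpy sketch of the "opening" operation (erosion followed by dilation) with a 3×3 square structuring element shows how isolated sparks vanish while a large melt pool blob survives; the binary image below is synthetic.

```python
import numpy as np

def shift_stack(mask):
    """Stack the 3x3 neighborhood of every pixel (zero-padded borders)."""
    padded = np.pad(mask, 1)
    return np.stack([padded[i:i + mask.shape[0], j:j + mask.shape[1]]
                     for i in range(3) for j in range(3)])

def erode(mask):
    return shift_stack(mask).all(axis=0)   # pixel kept only if all 9 neighbors set

def dilate(mask):
    return shift_stack(mask).any(axis=0)   # pixel set if any of 9 neighbors set

def opening(mask):
    return dilate(erode(mask))

mask = np.zeros((32, 32), dtype=bool)
mask[8:24, 8:24] = True    # large blob: the melt pool
mask[2, 2] = True          # isolated "spark" pixel

cleaned = opening(mask)
# The single-pixel spark is eroded away and never comes back; the 16x16
# blob is shrunk by one pixel and then restored by the dilation.
```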



Read Frame In → Image Binarization → Morphological Filtering → Shape Statistics / Intensity Statistics → Average Filtering → Output Pn<br />

Fig. 1. Flowchart of the algorithmic pipeline.<br />


Fig. 2. (a), (b) Melt pool images of the DMD process; Binarized images of (c), (d) the melt pool and (e), (f) the melt pool plus the tail; Finally segmented images<br />

following the morphological operators of (g), (h) the melt pool and (i), (j) the melt pool plus the tail respectively.<br />

D. Extraction of Statistical Parameters<br />

In the next step, a variety of shape and intensity statistics are<br />

extracted at the back end of the image pipeline system from both<br />

regions in order to quantify their properties and be fed to a<br />

Numerical Controller, so as to be used and transformed to<br />

physical parameters of the DMD process. The set of features /<br />

parameters that are extracted for both regions R1 and R2 includes the<br />

Major and Minor Axis Lengths, the Area, the Center of Gravity<br />

(or Mass) and the Average Intensity. Apart from the above<br />

statistical parameters, other quite interesting and important<br />

features can be extracted from the two distributions like the<br />

perimeter, which is the path that surrounds the two-dimensional<br />

shape, the aspect ratio, which is the ratio of its sizes in different<br />

dimensions and describes the proportional relationship between<br />

its width (major axis) and its height (minor axis), the<br />

eccentricity, which can be thought of as a measure of how much<br />

the conic section deviates from being circular, or other intensity<br />

features such as the maximum, minimum and median intensities.<br />
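The listed region statistics can be sketched in numpy as follows. Deriving the major and minor axis lengths from the eigenvalues of the pixel-coordinate covariance matrix is the standard moments-based approach; it is assumed here, since the paper does not spell out its exact formulas.

```python
import numpy as np

def region_stats(mask, image):
    """Area, center of gravity, mean intensity and axis lengths of a region."""
    ys, xs = np.nonzero(mask)
    area = len(xs)
    cx, cy = xs.mean(), ys.mean()                 # center of gravity
    mean_intensity = image[mask].mean()
    cov = np.cov(np.stack([xs, ys]))              # 2x2 coordinate covariance
    evals = np.sort(np.linalg.eigvalsh(cov))[::-1]
    # Axis lengths of the ellipse with the same second moments.
    major, minor = 4.0 * np.sqrt(evals)
    return {"area": area, "centroid": (cx, cy),
            "mean_intensity": mean_intensity,
            "major_axis": major, "minor_axis": minor}

img = np.zeros((40, 40))
mask = np.zeros((40, 40), dtype=bool)
mask[10:20, 5:35] = True       # elongated 10 x 30 "melt pool" stand-in
img[mask] = 0.8

stats = region_stats(mask, img)
```

Running the same function on R1 and R2 yields the per-frame parameter set that is packed and sent to the Numerical Controller.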

E. Average Filtering<br />

As a last step, a temporal moving average filter is applied on<br />

a sequence of statistical values in order to smooth out the<br />

temporal variation, like a rapid variation or movement of the<br />

melt pool image or even a large spark that is not entirely filtered<br />

out by the morphological operator. For example, after<br />

calculating the area at frame i, its value is set to be the average<br />

of our estimation and the k previous values. The value of k<br />

depends on the amount of regularization we need to perform.<br />

Typically a value of k = 3 can smooth the results without losing<br />

much temporal information.<br />
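The temporal averaging step can be sketched as below; the area sequence with a spark-induced outlier is made up for illustration.

```python
def smooth(values, k=3):
    """Replace each value by the mean of itself and up to k previous values."""
    out = []
    for i, v in enumerate(values):
        window = values[max(0, i - k):i + 1]   # current + up to k previous
        out.append(sum(window) / len(window))
    return out

areas = [100, 102, 98, 400, 101, 99]   # frame 3 carries a spark artifact
print(smooth(areas))                   # the 400 outlier is damped to 175
```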

IV. RESULTS AND DISCUSSION<br />

Figure 3 illustrates a snapshot of the process parameters<br />

estimation for a single DMD frame, in which intensity and<br />

geometrical features are shown in graphical mode. Regarding<br />

the hardware design, the following operations were<br />

implemented on the FPGA, as described in Section III, that is:<br />

• Frame acquisition, with the parallelism level set to x8 (instead of x20, the hardware operator default) to minimize HW resource utilization. It must be noted that a parallelism reduction saves FPGA resources but also lowers the overall bandwidth (BW) of the link, since BW = ClockFrequency × Parallelism.<br />

• Full multi-adaptive threshold-based segmentation to produce the two binarized distributions.<br />

• Morphological filters on each distribution.<br />

• Computation of shape statistics for each distribution, along with the melt pool centroid distances from the back end and the front end of the tail distribution.<br />

• Average intensity computation for both distributions.<br />

• Packing of the results into 2×8 vectors for transfer to the host DDR memory.<br />
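The parallelism/bandwidth trade-off noted above, put into numbers. Only the proportionality BW = ClockFrequency × Parallelism comes from the text; the pixel clock value is an assumption for illustration.

```python
# Illustrative pixel clock; CoaXPress link clocks vary by configuration.
clock_hz = 62.5e6

bw_x20 = clock_hz * 20   # default parallelism of the hardware operator
bw_x8 = clock_hz * 8     # reduced parallelism chosen in the design

print(f"x20: {bw_x20 / 1e9:.2f} Gpixel/s, x8: {bw_x8 / 1e9:.2f} Gpixel/s")
print(f"bandwidth retained at x8: {bw_x8 / bw_x20:.0%}")
```

Dropping from x20 to x8 keeps only 40 % of the link bandwidth, which is the price paid for the FPGA resource savings in Table I.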

Table 1 provides the percentage of the FPGA resources’ fill<br />

levels of each processing module separately, as well as the total<br />

amount of the complete design, after successful Synthesis and<br />

Place-and-Route procedures using XILINX ISE v14.7.<br />



Fig. 3. Snapshot of the process parameter estimation.<br />

TABLE I. FPGA RESOURCE UTILIZATION<br />

Hardware Operators                     | Resources Utilization (%)<br />
                                       | LUTs | FFs | BRAMs | e-ALUs<br />
Frame Acquisition                      |  19  |  9  |  12   |  10<br />
Full Adaptive Thresholding             |  12  |  7  |  27   |   3<br />
Morphological Filters (both distr.)    |  10  |  2  |   5   |   0<br />
Shape Statistics Tail                  |   5  |  3  |   0   |   0<br />
Shape Statistics Melt Pool             |   2  |  2  |   0   |   0<br />
Intensity Statistics (for both distr.) |   6  |  4  |   3   |   1<br />
Merging Statistics & DmaToPC           |   6  |  7  |   0   |   0<br />
TOTAL                                  |  60  | 34  |  47   |  14<br />

V. CONCLUSIONS<br />

A real-time software and hardware implementation is<br />

presented here for monitoring the DMD process, based on a<br />

comprehensive vision sensing system that interacts with the<br />

machine process algorithms in order to detect and correct<br />

deposition errors. The vision system is targeted to monitor the<br />

size, shape and intensity characteristics of the melting pool,<br />

performing on-camera image processing directly on the<br />

hardware subsystem’s FPGA for closed-loop AM monitoring.<br />

Live monitoring of the melt pool geometry on the working<br />

surface during the deposition makes it possible to optimize the overall<br />

process since the camera’s optical information can provide, after<br />

processing, measurements of the melt pool intensity and<br />

geometry and tune, in real time, the process parameters. Results<br />

of the melt pool monitoring procedure and parameter estimation<br />

are presented on data coming from a specific DMD process,<br />

followed by distinctive metrics of the resources occupied within<br />

the FPGA device such as LUTs, FFs, BRAMs and e-ALUs. As<br />

a next step, the association and correlation between the melt pool<br />

behavior and process parameters like laser power, laser head<br />

velocity, feed rate and powder mass stream will be presented.<br />

ACKNOWLEDGMENT

This work has been financed by the European Union's Horizon 2020 research and innovation programme under grant agreement No. 678144, the "SYMBIONICA" project.




Addressing the Challenges of Creating Infra-Red Vision Systems for the IIoT and IoT

Adam P Taylor
Director
Adiuvo Engineering & Training Ltd
Harlow, United Kingdom
Adam@adiuvoengineering.com

Abstract—Embedded vision is ubiquitous, used across a range of applications, for example ADAS, vision-guided robotics and, of course, the IoT and IIoT. Along with the visible spectrum, embedded vision systems increasingly rely upon the wider electromagnetic spectrum, such as the infra-red spectrum. This spectrum is used in IIoT applications for monitoring the temperature of key equipment to provide prognostics, and in IoT applications where it enables users to see in low-light conditions or to measure the temperature of a sleeping baby remotely.

Keywords—IIoT, IoT, FPGA, SoC, Infra-Red, Embedded Vision.

I. INTRODUCTION

One of the advantages of embedded vision systems is their ability to observe wavelengths outside those visible to humans. This enables an embedded vision system to provide superior performance across a range of applications and deployments.

Two common deployments of embedded vision systems are within the Internet of Things (IoT) and its industrial counterpart, the Industrial Internet of Things (IIoT). Indeed, IoT and IIoT deployments continue the trend of ubiquity of embedded vision. IoT and IIoT applications are diverse: IoT deployments include monitoring, security and surveillance, while IIoT applications are dominated by Industry 4.0 solutions, including positioning, guidance, identification and inspection.

Many IoT and IIoT applications benefit from imaging outside the visible spectrum, utilizing the infra-red element of the electromagnetic spectrum. Using infra-red enables the embedded vision system to sense background thermal radiation. As the imager works with the background thermal radiation, no scene illumination is required, making IR solutions ideal for imaging in total darkness or poor visibility, and hence well suited to industrial, automotive and security applications. The use of IR sensors also allows the creation of thermographic applications which accurately measure the temperature of the scene contents. One example application is in renewable energy, where IR imaging can be combined with drones to monitor the performance of solar arrays and detect early failures from the increasing temperature of failing elements.

Working outside the visible range requires the correct selection of imaging device technology. If the system operates within the near-IR spectrum or below, we can use devices such as Charge Coupled Devices (CCDs) or CMOS 1 (Complementary Metal Oxide Semiconductor) Image Sensors (CIS); however, as we move further into the infra-red spectrum we need to use specialized IR detectors.

The need for specialized sensors in the IR domain is in one part due to the excitation energy required by silicon-based imagers such as CCDs or CIS. These typically require a photon energy of about 1 eV to excite an electron, whereas at IR wavelengths photon energies range from 1.7 eV down to 1.24 meV. As such, IR imagers tend to be based upon HgCdTe or InSb, which have lower excitation energies and are often combined with a CMOS readout IC (ROIC) to control and read out the sensor.
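As a quick sanity check on these numbers, the photon energy follows directly from E = hc/λ. The sketch below is purely illustrative (it is not part of any sensor driver) and uses the constant hc ≈ 1.2398 eV·µm:

```c
#include <math.h>

/* Photon energy in electron-volts for a wavelength given in
 * micrometres: E = h*c / lambda, with h*c ~= 1.2398 eV*um. */
static double photon_energy_ev(double wavelength_um)
{
    const double hc_ev_um = 1.2398; /* Planck constant x speed of light */
    return hc_ev_um / wavelength_um;
}
```

At 0.7 µm (the red edge of the visible band) this gives roughly 1.77 eV, which silicon can detect; at 1 mm (the far-IR limit) it gives roughly 1.24 meV, far below the excitation energy of silicon, matching the range quoted above.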

II. COOLED OR UNCOOLED

IR systems fall into two categories: cooled and uncooled. Cooled thermal imagers use image sensor technology based upon HgCdTe or InSb semiconductors. To provide useful images, a cooled thermal imager requires a cooling system which brings the sensor down to a temperature of 70 to 100 Kelvin. This is required to reduce the thermal noise generated by the sensor to below that generated by the scene contents. Using a cooled sensor therefore brings increased complexity, cost and weight for the cooling system, and the system also takes time (several minutes) to reach operating temperature and generate a usable picture.
temperature and generate a useable picture.<br />

Uncooled IR sensors can operate at room temperature and use microbolometers in place of an HgCdTe or InSb sensor. In a microbolometer, each pixel changes resistance when IR radiation strikes it, and this resistance change defines the temperatures in the scene. Typically, microbolometer-based thermal imagers have much-reduced resolution compared to a cooled imager. They do, however, make thermal imaging systems simpler, lighter and less costly to create. For this reason, many IoT and IIoT applications will use uncooled image sensors like the FLIR Lepton.

1 We can use different coatings upon the imaging device to affect its wavelength performance.

The radio module provides WiFi and Bluetooth communications for wireless connectivity, while the programmable logic is used to receive VoSPI, perform direct memory access with the DDR, and output video for a local display. The high-level architecture of the solution is shown in figure 2.

One additional concern is export compliance: cooled thermal imagers offer higher performance and resolution than their uncooled counterparts, and as such cooled thermal imaging solutions are often subject to stricter export compliance regimes than uncooled solutions, restricting the available markets.

Creating an uncooled thermal imager presents a range of challenges for the embedded vision designer. It requires flexible interfacing to connect to the selected device and display, along with the processing capability to implement any additional image processing on the video stream. Of course, as many of these devices are hand-held or power constrained, power efficiency also becomes a significant driver. The solution must also be secure, both remotely via its internet connection and physically.

III. ARCHITECTURE

The FLIR Lepton is a thermal imager which operates in the long-wave IR spectrum. It is a self-contained camera module with a resolution of 80 by 60 pixels (Lepton 2) or 160 by 120 pixels (Lepton 3). Configuration of the Lepton is performed over an I2C bus, while the video is output over SPI using a Video over SPI (VoSPI) protocol. These interfaces make it ideal for use in many embedded systems which require the ability to image in the IR region.
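For reference, a minimal sketch of VoSPI packet handling is shown below. It assumes the Lepton 2 packet layout from FLIR's VoSPI documentation (164-byte packets: a 2-byte ID, a 2-byte CRC and 160 payload bytes, with discard packets marked by 0xF in the low nibble of the first ID byte); the function names are illustrative, not from any vendor driver:

```c
#include <stdint.h>
#include <stdbool.h>

#define VOSPI_PACKET_BYTES 164u /* 2-byte ID + 2-byte CRC + 160-byte payload */

/* A VoSPI packet is a "discard" packet when the low nibble of the
 * first ID byte is 0xF; otherwise the ID's low 12 bits carry the
 * packet (line) number within the frame. */
static bool vospi_is_discard(const uint8_t *pkt)
{
    return (pkt[0] & 0x0F) == 0x0F;
}

static uint16_t vospi_packet_number(const uint8_t *pkt)
{
    return (uint16_t)(((pkt[0] & 0x0F) << 8) | pkt[1]);
}

/* Frame sync: skip discard packets until packet number 0 arrives. */
static bool vospi_is_frame_start(const uint8_t *pkt)
{
    return !vospi_is_discard(pkt) && vospi_packet_number(pkt) == 0;
}
```

In the design described here, the packets themselves arrive through the Quad SPI core; this logic only decides which ones begin and belong to a valid frame.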

One example combines the Lepton with a Xilinx Zynq Z7007S device mounted on a MiniZed development board. As the MiniZed board supports WiFi and Bluetooth, it is possible to create both IIoT/IoT applications and traditional imaging solutions with a local display, in this case a 10-inch touch display.

Figure 2 High Level Architecture

Within the image processing pipeline, we can instantiate custom image processing functions generated using High Level Synthesis, or use pre-existing IP blocks such as the Image Enhancement core, which provides noise filtering, edge enhancement and halo suppression.

This high-level architecture requires translation into a detailed design within Vivado; the following IP blocks are used to create the hardware solution:

• Quad SPI Core – configured for single-mode operation; receives the VoSPI from the Lepton
• Video Timing Controller – generates the video timing signals for the output display
• VDMA – reads an image from the PS DDR into a PL AXI Stream
• AXI Stream to Video Out – converts the AXI streamed video data to parallel video, with timing syncs provided by the Video Timing Controller
• Zed_ALI3_Controller – display controller for the 7-inch touch screen display
• Wireless Manager – provides interfaces to the radio module for Bluetooth and WiFi. While not used in this example, including this module within the hardware design means that adding wireless communications later requires only additional software development.

When these IP blocks are combined with the Zynq processing system and the necessary AXI interconnect IP, we obtain a detailed hardware design as shown in figure 3.

Figure 1 MiniZed & FLIR Lepton

To create a tightly integrated solution, we can use the processing system (PS) of the Zynq to configure the Lepton using the I2C bus. The PS also provides an interface to the radio module.


This limits access to hardware functions within the Zynq, including programmable logic (PL) peripherals.

The XADC provided within the Zynq device provides the ability to monitor both device temperature and voltages, raising alarms should user-specified limits be breached, along with the ability to monitor external anti-tamper features. These features, provided by the underlying architecture of the Zynq, create a sound base upon which higher-level, software-based security solutions can be implemented.
software based security solutions can be implemented.<br />

Figure 3 Detailed hardware design in Vivado

IV. SECURITY SOLUTION

When developing an IoT or IIoT solution, we need to ensure the solution is secure from malicious hackers, unauthorized access and modification. A secure solution for the IoT or IIoT should, as a minimum, provide:

• Secure Boot – the ability to decrypt an encrypted boot image. Secure boot should also provide cryptographic authentication of the image.
• Authentication – only authorized users should be able to connect to the IoT/IIoT system. Strong passwords and authentication protocols should be used.
• Secure Communication – communication to and from the IoT/IIoT device should be encrypted.
• Secure Data – data stored within the system should be secure; encryption standards such as AES, Simon or Speck can be used to secure data.
• Anti-Tamper – able to detect unauthorized access attempts to the system. This may include monitoring the presence of enclosure lids, device voltages and temperatures.
voltages and temperatures.<br />

The Zynq device provided on the MiniZed enables the implementation of a secure solution. The Zynq is capable of securely booting both the PS and the PL with a three-stage process comprising Hashed Message Authentication Code (HMAC), Advanced Encryption Standard (AES) decryption and RSA authentication. Both the AES and HMAC use 256-bit private keys, while the RSA uses 2048-bit keys; the security architecture of the Zynq also allows JTAG access to be enabled or disabled.

These security features are enabled when generating the boot file and the configuration partitions for the non-volatile boot media. It is also possible to define a fall-back partition such that, should the first-stage boot loader fail to load its application, it will fall back to another copy of the application stored at a different memory location.
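The fall-back behaviour can be sketched as follows. The addresses and the load function here are hypothetical stand-ins for the real first-stage boot loader logic, which would perform the authentication and decryption steps described above before accepting an image:

```c
#include <stdbool.h>

/* Hypothetical image-load routine: returns true if the application
 * image at 'addr' loads and passes its integrity/authenticity checks. */
typedef bool (*load_fn)(unsigned long addr);

#define PRIMARY_ADDR  0x00100000ul /* illustrative partition addresses */
#define FALLBACK_ADDR 0x00200000ul

/* Try the primary partition first and, should it fail, fall back to
 * the copy at a different location. Returns the address that was
 * booted, or 0 if both fail (enter a secure locked-down state). */
static unsigned long boot_with_fallback(load_fn load)
{
    if (load(PRIMARY_ADDR))
        return PRIMARY_ADDR;
    if (load(FALLBACK_ADDR))
        return FALLBACK_ADDR;
    return 0;
}

/* Test doubles for the two scenarios. */
static bool all_good(unsigned long addr)    { (void)addr; return true; }
static bool primary_bad(unsigned long addr) { return addr != PRIMARY_ADDR; }
```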

Once the device is successfully up and running, further security can be implemented using the Arm TrustZone architecture within the PS to implement orthogonal worlds.

V. SW DEFINITION

Most of the IP blocks included within the Vivado design require configuration by application software developed within SDK. This provides the flexibility to change operational parameters as the product evolves, for example accommodating a larger display or changing the sensor from the Lepton 2 to the Lepton 3. The application software configures the video timing via the Video Timing Controller, and configures the Video Direct Memory Access controller to read frames from the memory-mapped DDR and convert them into an AXI Stream compatible with the image processing pipeline.

Following the initialization of the IP blocks, the application software performs the following:

• Configures the FLIR Lepton to perform Automatic Gain Control (AGC)
• Synchronises with the VoSPI data to detect the start of a valid frame
• Applies a digital zoom to scale up the image to use the 800-pixel by 480-line display efficiently. This can be achieved by outputting each pixel either 8 or 4 times, depending upon the sensor selected.
• Transfers the frame to the DDR memory. As the FLIR Lepton only outputs 8-bit data when AGC is enabled, this is mapped to the green channel of the RGB display.
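The digital zoom and green-channel mapping in the steps above can be sketched as a nearest-neighbour replication. This is an illustrative implementation, not the project's actual SDK code:

```c
#include <stdint.h>

/* Replicate each 8-bit AGC pixel 'factor' times horizontally and
 * vertically (nearest-neighbour zoom), packing the value into the
 * green channel of a 24-bit 0x00RRGGBB output word. The factor
 * would be 8 for the Lepton 2 (80x60) or 4 for the Lepton 3
 * (160x120) on an 800x480 display. */
static void zoom_to_green(const uint8_t *src, int w, int h,
                          uint32_t *dst, int factor)
{
    int dw = w * factor; /* destination width in pixels */
    for (int y = 0; y < h; y++)
        for (int x = 0; x < w; x++) {
            uint32_t rgb = (uint32_t)src[y * w + x] << 8; /* green only */
            for (int dy = 0; dy < factor; dy++)
                for (int dx = 0; dx < factor; dx++)
                    dst[(y * factor + dy) * dw + (x * factor + dx)] = rgb;
        }
}
```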

When the completed program is executed on the MiniZed with the FLIR Lepton connected and outputting to a 10-inch touch-sensitive display, the output of the FLIR can be seen very clearly, as demonstrated in figure 4.

The application above, however, only addresses the lower-level software controlling the imager. To communicate over the internet, the MiniZed's WiFi capabilities need to be used. To do this we need an operating system which provides not only the appropriate WiFi stack but also allows the implementation of the authentication, secure communication and overall security solution. This is provided by updating the PetaLinux operating system running on the MiniZed. PetaLinux is a Linux distribution provided by Xilinx for the Zynq and Zynq UltraScale+ MPSoC devices. With the PetaLinux OS updated, the MiniZed's WiFi capabilities can be used to communicate images captured from the FLIR Lepton.

Figure 4 Final system, connected to display

VI. CONCLUSION

Imaging within the IR domain provides a very significant benefit in many IoT and IIoT applications. The creation of an imaging system based upon an uncooled thermal imager presents a number of challenges in interfacing, security, power efficiency and performance. Heterogeneous SoCs allow us to create a solution which is flexible, secure and power efficient.

VII. AUTHOR BIOGRAPHY

Adam Taylor is a world-recognized expert in the design and development of embedded systems and FPGAs for several end applications. Throughout his career, Adam has used FPGAs to implement a wide variety of solutions from RADAR to safety-critical control systems, with interesting stops in image processing and cryptography along the way. He currently holds an executive position within a major European defence company. Prior to that he was most recently the Chief Engineer of a space imaging company, responsible for several game-changing projects. Adam is the author of numerous articles on electronic design and FPGA design, including over 230 blogs on how to use the Zynq & Zynq MPSoC for Xilinx. Adam is a Chartered Engineer and Fellow of the Institution of Engineering and Technology; he is also the owner of the engineering and consultancy company Adiuvo Engineering and Training (www.adiuvoengineering.com).



Securing tomorrow's IoT devices: the new potential for integrating sophisticated security functions into the microcontroller

Jack Ogawa
Senior Director of Marketing, MCU Business Unit
Cypress Semiconductor Corp.
San Jose, California

Abstract—Internet of Things (IoT) devices, which transmit and receive data and commands over the world's universal network, are exposed to a far greater variety and number of threats than earlier products that supported older machine-to-machine (M2M) communication, typically over a closed, private network. The security functions and resources required to protect an IoT device against these security threats are today available in specialized, discrete ICs such as:

• a secure element – a system-on-chip combining a microcontroller with on-board cryptographic capabilities, secure memory and interfaces
• secure non-volatile memory ICs, which typically feature a cryptographic engine for pairing the memory securely to authorized devices

However, the use of such discrete ICs in IoT devices has the effect of increasing their component count, complexity and bill-of-materials cost compared to designs that use the integrated security capabilities of the host MCU (or in some cases an applications processor). The crucial question for IoT device designers, then, is whether the capabilities of the host MCU are sufficient to counter the threats of spoofing, tampering, repudiation, information disclosure, denial of service and elevation of privilege.

Keywords—IoT; IoT Security; MCU Security; Microcontroller Security; Data Integrity; Trusted Firmware; Encryption; Firmware Authenticity; Malware; Inter-Processor Communication

I. INTRODUCTION

By all accounts, IoT (Internet of Things) devices are forecast to become ubiquitous. IoT devices, powered by semiconductors, will make every imaginable process smart. From simply turning on a light to more complex processes such as outpatient care or factory control, IoT devices utilizing sensing, processing, and cloud connectivity will dramatically improve their effectiveness. IoT device applications are diverse, and their promise and impact are quite literally unbounded.

The ubiquitous application of IoT devices introduces security challenges. For example, traditional lighting control is relatively primitive: it's a power circuit with a physical switch. Operating the switch requires physical proximity. Securing this process against unauthorized use simply requires physical protection of the switch. Now consider lighting control in its smart incarnation as an IoT device. The physical switch is replaced by light and proximity sensors, logic (typically implemented in a microcontroller, or MCU), and wireless connectivity to a Cloud-based application. In becoming smart (enlightened!), a light switch is transformed into an embedded client that works with an application server through a network. Securing the smart light switch has become much more complicated. The good news is that secure microcontrollers can greatly enhance the security of the IoT device and accelerate the design cycle.

This paper examines a method for determining the security requirements of an IoT device and presents Cypress' PSoC 6 secure MCU as a solution that meets these requirements.

II. IOT DEVICE SECURITY ANALYSIS

The idea of securing IoT devices can be daunting. A bit of research immediately reveals large bodies of knowledge regarding cryptography, threats, security objectives, and myriad other subjects. Faced with this overwhelming information, often the first question IoT device designers ask is "how do I judge security?", closely followed by "where do I start?".

As shown in figure 1, the first step in the analysis process is to identify the data assets handled by the IoT device and their secure properties. The next steps are to identify threats that target these assets, define security objectives to resist these threats, and finally derive the requirements to satisfy the security objectives. By meeting these requirements, a microcontroller-based design supports the security objectives and ultimately preserves the secure properties of the assets. Finally, the design should be evaluated to determine whether it achieves the objectives; typically, this evaluation applies threat models to the design to assess the attack resistance of the device.

Fig. 1. Analysis process for designing secure IoT devices: Identify Data Assets, Identify Threats, Define Security Objectives, Requirements.

III. DATA ASSETS

The value of every IoT device is built upon data, and how that data is managed. Data assets take various forms in an embedded system, such as a unique ID, firmware, a password, or an encryption key. Each data asset has secure properties: inherent characteristics of the data asset that the system relies upon as the basis of trusting it. There are three secure properties: confidentiality, integrity, and authenticity.

Confidentiality: Encryption is the process of encoding data in such a way that only trusted actors can read it, thus maintaining confidentiality. Correspondingly, if an actor can read encrypted data, they are assumed to be trusted. Encryption algorithms utilize keys for encryption and decryption; therefore, secure handling and storage of keys is a critical requirement for secure IoT devices. There are generally two types of encryption algorithms: symmetric (shared-key) and asymmetric (public-key) encryption. In shared-key schemes, the encryption and decryption keys are the same, and communicating parties must both have the same key to achieve secure communication. In public-key schemes, the encryption key (public key) is published for anyone to use for encrypting messages, but only the receiving actor has the decryption key (private key). Public-key schemes are useful for securing many-to-one communications.

Integrity: Data integrity assessment is required for data assets that are immutable; examples are boot firmware and configuration data. Assessing data integrity involves applying a cryptographic hash function to the data asset. A hash function maps data of arbitrary size to a fixed-size bit string called a hash. The probability of the same hash being generated for two data sets is made very small by the choice of hash bit length; therefore, for a given application and a properly chosen hash length, hashes can be considered unique to a data set. If a data set is changed, its hash will also change. Data integrity can therefore be determined by comparing a provided hash representing the original data set to a calculated hash of the data set as received.

Authenticity: Authenticity, when combined with integrity, establishes trust, and it is therefore a critical cornerstone of a secure IoT device. Typically, a Public Key Infrastructure (PKI) is used for this purpose. In a PKI scheme, a digital signature (simplistically, the hash of a data set encrypted by the signing actor using a private key) is embedded in the data set. Separately, the verifying actor receives a certificate issued by a Certificate Authority (CA). The certificate contains the corresponding public key, along with the identity of the signing actor. The verifying actor uses the public key to decrypt the hash that was embedded in the data set and compares it to the calculated hash. If they match, the verifying actor is assured that the data has not changed since it was signed, and that it was provided by the signing actor as attested by the CA.
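The integrity check described above reduces to recomputing and comparing a hash. The sketch below uses FNV-1a purely to keep the example self-contained; FNV-1a is NOT a cryptographic hash, and a real device must use one such as SHA-256:

```c
#include <stdint.h>
#include <stddef.h>
#include <stdbool.h>

/* FNV-1a, a non-cryptographic hash, stands in here for SHA-256 so
 * the flow stays self-contained and runnable. */
static uint32_t fnv1a(const uint8_t *data, size_t len)
{
    uint32_t h = 2166136261u;        /* FNV offset basis */
    for (size_t i = 0; i < len; i++) {
        h ^= data[i];
        h *= 16777619u;              /* FNV prime */
    }
    return h;
}

/* Integrity check: recompute the hash of the received data set and
 * compare it against the hash provided with the original. */
static bool integrity_ok(const uint8_t *data, size_t len, uint32_t provided)
{
    return fnv1a(data, len) == provided;
}
```

Any single-bit change in the data set changes the computed hash, so the comparison fails and the asset is rejected.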

Fig. 2. Overview of Digital Signatures. Source: www.docusign.com.

It is critical to comprehensively identify the data assets in the IoT device, since each subsequent step relies on this step. Some examples of data assets are:

• Hardware ID – a unique identifier for the device
• Trusted Firmware – implements Trusted Applications (TAs) that support security objectives
• User Data – data used by the application
• Configuration – data used to configure the device, including network information
• Keys – data used for crypto operations

Each data asset will have secure properties. For the example data assets:

Data Asset          Secure Properties
Hardware ID         Integrity
Trusted Firmware    Integrity, Authenticity
User Data           Confidentiality, Integrity
Configuration       Confidentiality
Keys                Integrity, Confidentiality

IV. THREATS

Threats target data assets. The goal of threat identification is to expose vulnerabilities in the device's ability to maintain the secure properties of its data assets when attacked. For design purposes, threats that do not target data assets (and similarly, vulnerabilities or data assets without a particular attack method) cannot by definition be evaluated, and must therefore be treated with extra scrutiny.



The previous data asset examples may face the following threats:

Threat              Targeted Data Asset
Spoofing            Configuration
Man in the Middle   User Data, Keys
Malware             Trusted Firmware
Tamper              All

V. SECURITY OBJECTIVES

With the threats identified, security objectives can now be defined. Security objectives are defined at an application level, in essence providing implementation requirements. Some security objectives can be implemented as Trusted Applications (TAs) that execute in an isolated execution environment provided by the secure MCU. The isolated execution environment comprehensively protects the TAs and the data that they use and process. The IoT device application itself operates in an unsecure execution environment and communicates with TAs in the isolated execution environment through an API that uses an inter-processor communication (IPC) channel. The TAs in turn utilize the resources available in the hardware (such as crypto accelerators and secure memory) to support the objective.
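The API-over-IPC pattern can be sketched as a mailbox shared between the two environments. The mailbox layout, opcodes and dispatcher below are invented for illustration; on a real secure MCU the "ring" would go through a hardware IPC block rather than a direct call:

```c
#include <stdint.h>

/* Hypothetical IPC mailbox shared between the unsecure core and the
 * secure core running the Trusted Applications. */
typedef struct {
    uint32_t opcode;
    uint8_t  payload[16];
    uint32_t result;
} ipc_mailbox;

enum { TA_OP_SIGN = 1 };

/* Secure-side dispatcher: only whitelisted opcodes reach a TA, so
 * the unsecure application never touches keys or secure memory. */
static void ta_dispatch(ipc_mailbox *mb)
{
    if (mb->opcode == TA_OP_SIGN) {
        mb->result = 0;              /* success */
        mb->payload[0] ^= 0xA5;      /* stand-in for a real signing TA */
    } else {
        mb->result = 0xFFFFFFFFu;    /* rejected: unknown operation */
    }
}

/* Unsecure-side API wrapper: fill the mailbox and "ring" the secure
 * core (modelled here as a direct call). */
static uint32_t ta_call(ipc_mailbox *mb, uint32_t opcode)
{
    mb->opcode = opcode;
    ta_dispatch(mb);
    return mb->result;
}
```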

Continuing with the example, the threats identified previously can be countered by the following security objectives:

• Secure State: ensures that the device maintains a secure state even in case of failure of verification of firmware integrity and authenticity. Counters Malware and Tamper threats.

VI. REQUIREMENTS

At this point, the analysis provides a logically connected model of data assets, threats, and security objectives. From this picture, a list of required capabilities or features for a secure MCU can be compiled. This list can of course then be used as solution implementation criteria for the particular IoT device application.

Note that the requirements for a security objective may change according to the life cycle stage (design, manufacturing, inventory, end use, termination) of the IoT device, and this should be considered as well.

The analysis of the example can now be presented.

Notes:
1. Ideally implemented as a TA in an isolated execution environment
2. C = Confidentiality, I = Integrity, A = Authenticity
3. SEF = Secure Element Functionality
4. Dead = In a non-operational state

Fig. 3. Summary of security objectives and countered threats.

• Access Control: the IoT device authenticates all actors (human or machine) attempting to access data assets, preventing unauthorized access. Counters Spoofing and Malware threats where the attacker modifies firmware or installs an outdated, flawed version.
• Secure Storage: the IoT device maintains confidentiality (as required) and integrity of data assets. Counters Tamper threats.
• Firmware Authenticity: the IoT device verifies firmware authenticity prior to boot and prior to upgrade. Counters Malware threats.
• Communication: the IoT device authenticates remote servers and provides confidentiality (as required) and maintains integrity of exchanged data. Counters Man in the Middle (MitM) threats.

VII. CONCLUSION

This paper presents an analysis method for determining the requirements of a secure IoT device. By creating a logically connected model of data assets, threats against these assets, and security objectives to counter the threats, a list of requirements can be derived that can be used as criteria for implementation solutions.

The vast majority of IoT devices will be built upon MCUbased<br />

embedded systems. This growth opportunity is<br />

attracting a new breed of MCUs that offer security features and<br />

capabilities to maintain the secure properties of data assets.<br />

Cypress’ PSoC 6 secure MCUs are among the first of these new<br />

MCUs. The PSoC 6 MCU architecture was designed for IoT<br />

device applications, offering ultra-low power for extended<br />

battery life, efficient processing capacity, and hardware-based<br />

security features that support security objectives:<br />

Isolated Execution Environment: PSoC 6 secure MCUs<br />

isolate secure operations from unsecure operations through the<br />

use of hardware isolation technology:<br />

www.embedded-world.eu<br />

683


Configurable protection units are used to isolate<br />

memory, cryptography, and peripherals<br />

Inter-Processor Communication (IPC) channels<br />

between the Arm Cortex-M4 and Cortex-M0+ cores<br />

are provided to support isolated API-based interaction.<br />


Ideal for implementing Trusted Applications that<br />

support the security objective of an IoT device<br />

Integrated Secure Element functionality: The hardware<br />

isolation technology in PSoC 6 supports isolated key storage<br />

and crypto operations, delivering secure element functionality<br />

in addition to the isolated execution environment.<br />


Ideal for secure key storage<br />

Optional pre-installed root of trust to support secure<br />

boot with a chain of trust<br />

Isolated, hardware-accelerated cryptographic operations:<br />

Includes AES, 3DES, RSA, ECC, SHA-256 and SHA-512, and<br />

True Random Number Generator (TRNG).<br />

Life cycle management: eFuse-based life cycle<br />

management capability ensures secure behavior in the event of<br />

security errors such as firmware hash check failures.<br />

The forecasted explosion of IoT devices will be driven by<br />

the availability of cost effective, easy to design, and easy to use<br />

wireless connectivity to the Cloud. The ability for an<br />

embedded system to send and receive data is a fundamental<br />

enabler for smartness. Unfortunately, this ability is also an<br />

enabler for threats against the very data that makes an IoT<br />

device valuable. The more valuable the data, the more critical<br />

that IoT devices implement security capabilities that protect<br />

this data. Secure MCUs such as Cypress’ PSoC 6 MCUs<br />

address the needs of secure IoT devices.<br />

Fig. 4. PSoC 6 Secure MCU’s isolated execution environment enabled<br />

through hardware isolation technology.<br />

REFERENCES<br />

[1] Cypress PSoC 6 MCU Community:<br />

https://community.cypress.com/community/psoc-6; 2017<br />



Delivering high-mix, high-volume secure<br />

manufacturing in the distribution channel<br />

Steve Pancoast<br />

Secure Thingz, Inc.<br />

IoT and Embedded Systems Security<br />

San Jose, California, USA<br />

Rajeev Gulati<br />

Data I/O Corp.<br />

Software, Semiconductor, and Systems Technology<br />

Redmond, Washington, USA<br />

Abstract— This paper examines the cryptographic foundational<br />

elements required to establish roots-of-trust in silicon to design,<br />

manufacture and deliver secure devices. Recent advancements in<br />

security and programming technology, when designed in, streamline<br />

the manufacturing process, scale and deliver trusted devices to<br />

partners and OEMs cost-effectively. Other topics for discussion<br />

include impacts on manufacturing and downstream provisioning<br />

processes, as well as new technology in security provisioning and<br />

data programming that OEMs of any size can implement.<br />

Keywords— security, secure manufacturing, OEM, Internet of<br />

Things (IoT); root of trust; microcontroller (MCU); embedded security;<br />

device provisioning; supply chain of trust; IP protection; IoT<br />

device; certificate signing; cryptographic key pairing,<br />

authentication, and decryption; hardware security module (HSM);<br />

public key infrastructure<br />

I. INTRODUCTION<br />

As billions of new IoT products come online every year, the<br />

opportunity to co-opt these products for nefarious purposes<br />

grows exponentially. In addition, the supply chain for these<br />

IoT products may be susceptible to threats such as cloning and<br />

intellectual property (IP) theft. The effects should not be<br />

underestimated: unauthorized attacks can significantly impact<br />

an OEM’s revenue, profits, brand and reputation. Because of<br />

this, pressure is building for each and every IoT device to<br />

include security features that prevent the device from being<br />

used by an unauthorized agent. Zero security is no longer an<br />

option – security is now a must-have.<br />

Techniques used to combat the above threats include<br />

secure provisioning and programming of the products, along<br />

with operational security measures such as establishing trusted<br />

mutual authentication between the IoT device and a remote<br />

server, securing communication to and from the IoT device<br />

and securing the firmware running on the device itself. These<br />

capabilities can be enabled by features in secure<br />

semiconductor devices such as Secure Elements (SEs) and<br />

Secure Microcontrollers (Secure MCUs) as long as they are<br />

properly and securely provisioned.<br />

This paper describes common security issues facing OEMs when<br />

developing and manufacturing IoT devices, including establishing a<br />

supply chain of trust, creating a root of trust, and a solution for the<br />

secure provisioning and programming of Secure Elements and<br />

Secure MCUs. The paper also details<br />

the component architecture of the Data I/O SentriX system<br />

used for secure provisioning of Secure Elements and Secure<br />

MCUs, and it provides an example of how devices can be<br />

provisioned for mutual authentication during manufacturing.<br />

II. CHAIN OF TRUST<br />

A. Supply Chain of Trust<br />

In creating secure products, the product developer (the<br />

OEM) should adopt a “zero / low trust” approach across the<br />

supply chain to minimize vulnerabilities and IP loss or theft.<br />

The OEM should continually authenticate and individualize<br />

deliverables across the supply chain as far as possible. This<br />

involves establishing the chain of trust across the entire<br />

product lifecycle, including the end customer, who will need a<br />

way to securely apply software updates to the product.<br />

Figure 1 – Supply Chain of Trust: Silicon vendor (SEs, MCUs) → Programming Facility → Contract Manufacturing → OEM → Customer (SW Updates)<br />

A typical supply chain of trust is shown in Figure 1. The<br />

chain of trust should start with silicon vendors (for the Secure<br />

Element or Secure MCU) and continue with programming<br />

solution providers and contract manufacturers all the way<br />

through to the OEM, who develops the end products, and even<br />

the end customer who needs to securely update the product in<br />

the field. Think of the chain of trust as a process flow - any<br />

step in the process builds upon the security of the previous<br />

step. Once the context of the overall supply chain is<br />

understood, focus can be placed on the Programming Center<br />

and the Provisioning System that resides there.<br />





B. Root of Trust<br />

The security of an IoT product starts with a secure<br />

“root of trust” (RoT) that must be securely provisioned into<br />

the product, usually within an SE or a Secure<br />

MCU itself. The root of trust typically consists of four key<br />

items, three of which are shown in Figure 2.<br />

Figure 2 – Components of a Root of Trust<br />

1. A unique product asymmetric key pair that is<br />

provisioned into the product and is secure and<br />

immutable. The private part of the key pair must be<br />

protected and provisioned/programmed into the SE or<br />

secure MCU so that it is never exposed, but can be<br />

used for authentication purposes (see #3).<br />

2. A unique identity that is secure and can be<br />

validated (typically a product certificate). Every<br />

connected product should have a unique identity<br />

certificate. The most common implementation of this<br />

uses signed product certificates that can be verified<br />

by a certificate authority (CA). Unlike web browsers<br />

that connect to multiple sites, most IoT devices<br />

connect back to just the OEM’s own site, so the CA<br />

can be the OEM itself (i.e. a self-signed CA) or a<br />

third-party CA can also be used. The key principle is<br />

that there needs to be a verifiable certificate chain<br />

from each product back to a trusted CA.<br />

3. A secure way to authenticate the identity of the<br />

product (i.e. tie the product to the certificate). The<br />

SE or MCU provides a cryptographic method to<br />

authenticate that the public key (from the product<br />

certificate that was previously validated) matches the<br />

corresponding private key in the SE or MCU.<br />

4. A secure and immutable boot path (for MCU<br />

solutions). In addition to the items above, a secure<br />

MCU must also provide a secure boot mechanism<br />

where the integrity of the initial boot software is<br />

cryptographically verified before executing it. This<br />

process continues successively where the boot<br />

software verifies the integrity (signature) of the<br />

subsequent software before it is executed, and so on.<br />
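The verify-then-execute chain in item 4 can be sketched as follows. This is an illustrative Python toy, not production code: the HMAC tag stands in for a real digital signature, and the root key (a hypothetical constant here) would live in immutable, protected hardware.<br />

```python
import hashlib
import hmac

# Toy model of a verified boot chain. The HMAC tag is a stand-in for a
# real signature; ROT_KEY is a hypothetical key anchored in the RoT.
ROT_KEY = b"immutable-root-of-trust-key"

def sign(image: bytes) -> bytes:
    return hmac.new(ROT_KEY, image, hashlib.sha256).digest()

def verify_and_boot(stages):
    """Verify each stage's tag before 'executing' it; halt on failure."""
    booted = []
    for name, image, tag in stages:
        if not hmac.compare_digest(sign(image), tag):
            raise RuntimeError("integrity check failed at stage " + name)
        booted.append(name)  # stand-in for transferring control to the stage
    return booted

stages = [(name, image, sign(image)) for name, image in
          [("bootloader", b"stage1"), ("application", b"stage2")]]
booted = verify_and_boot(stages)  # both stages verify and "run"
```

Any stage whose image no longer matches its tag stops the chain, which is the behavior the secure boot requirement calls for.<br />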

The RoT represents the base level of security information<br />

that must be protected in the secure device against readout and<br />

tampering, etc. This is usually done with a variety of<br />

hardware and software protection methods in the device.<br />

Beyond the RoT, many layers of operational security software<br />

are required for different classes of IoT products, but each of<br />

these solutions typically relies on some initial security starting<br />

point that is implicitly trusted. So it is critical that the RoT<br />

be secure and properly protected at manufacturing.<br />

C. Secure Provisioning and Programming<br />

To implement the chain of trust process, the OEM should<br />

define roles for its own ecosystem partners and suppliers in<br />

the product’s lifecycle, including partners in the development<br />

and manufacturing of the product. The OEM should take<br />

ultimate ownership for the security of its own products and<br />

protect its own Intellectual Property. One of the key areas<br />

often overlooked by OEMs is the secure provisioning and<br />

programming of the RoT and product software either at a<br />

Programming Center or contract manufacturer. In addition, it<br />

is important for the OEM to address how software updates can<br />

be securely deployed for its own products.<br />

The challenges of a secure manufacturing solution should<br />

not be understated. Secure devices (SEs and MCUs) must be<br />

produced securely anywhere in the world with an OEM’s keys<br />

(RoT) and product software protected. OEMs must restrict<br />

access to secrets, but most OEMs need to trust third parties<br />

for high volume production. With an automated security<br />

provisioning and data programming solution, Programming<br />

Centers and contract manufacturers are able to handle more<br />

customers with less audit overhead because the provisioning<br />

and programming are managed cryptographically. The<br />

OEM’s secrets are protected inside a hardware security<br />

module (HSM), and over-production is eliminated.<br />

For Secure MCUs, the security problem is more<br />

complicated. Besides the need to securely provision the RoT<br />

into the MCU, there is a need to securely program the OEM’s<br />

application software / firmware into the MCU to protect<br />

against IP theft. The OEM should also provide a solution to<br />

securely update the software in its products after production.<br />

Secure Thingz has created a solution where the software can<br />

be securely programmed / updated as it is “mastered” with a<br />

secure system at the OEM and sent to the programming<br />

facility. Mastered software images are encrypted and<br />

protected against any modifications and can only be decrypted<br />

and installed by the targeted device or family of devices.<br />

III. PROVISIONING SYSTEM ARCHITECTURE<br />

A provisioning system component architecture is a turnkey<br />

solution that enables an OEM to securely provision<br />

component devices like Secure Elements and Secure MCUs.<br />

A common usage model for the provisioning system, shown in<br />

Figure 3, involves its setup at a Secure Programming Center.<br />

Secure Programming Centers provide important provisioning<br />

services to OEMs at various volume levels ranging from first<br />

article to hundreds of thousands of units. Automated security<br />



provisioning and data programming systems may also be<br />

similarly set up at OEM-controlled factories or factories owned<br />

by contract manufacturers.<br />

IV. MUTUAL AUTHENTICATION USE CASE<br />

Now that the component architecture of the system has<br />

been reviewed, consider a real world use case of mutual<br />

authentication to see how the provisioning system is used.<br />

Mutual authentication requires both parties to have a<br />

provisioned root of trust as outlined in Section II. Two devices<br />

attempting to communicate with each other will use this root of<br />

trust to cryptographically authenticate before exchanging data.<br />

A brief study of how mutual authentication works will<br />

establish the device provisioning requirements. One of the<br />

devices could be an update server and the other device an IoT<br />

product, but many variants are common.<br />

Figure 3 – Secure Provisioning System Architecture<br />

While the secure provisioning of devices may be<br />

outsourced by an OEM to a Secure Programming Center,<br />

OEMs need to provide both public and secret information that<br />

is necessary to provision their devices. As an example, an<br />

OEM may need to provide a Secure Programming Center with<br />

private signing keys, certificates, certificate templates,<br />

production counts and other important information. For the<br />

purpose of this paper, and as shown in Figure 3, such material<br />

will be referred to as “OEM Public and Private Information”.<br />

OEMs create and manage the OEM Public and<br />

Private Information on their premises in a highly secure<br />

environment. Such information is company secret and of very<br />

high value to an OEM, yet this information needs to be<br />

transmitted to a Secure Programming Center so that devices<br />

can be provisioned. The OEM Secret Wrapping Tool is a<br />

subsystem that enables OEM Public and Private Information<br />

to be cryptographically signed and encrypted, or “wrapped”.<br />

This allows the information to be protected and be securely<br />

transmitted to a specific provisioning system at a<br />

Programming Center over an unsecure Internet connection.<br />

The unwrapping of the OEM’s secret information only occurs<br />

inside a secure, tamper-resistant HSM where it is stored and<br />

protected. This wrapping process takes place at the OEM’s<br />

premises, and the wrapped file is then sent to a<br />

Secure Programming Center.<br />
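As a rough illustration of sign-then-encrypt wrapping, the sketch below uses only Python's standard library. The HMAC "signature", the SHA-256-based keystream cipher, and both keys are toy stand-ins for the asymmetric, HSM-backed operations a real wrapping tool would use.<br />

```python
import hashlib
import hmac
import secrets

SIGNING_KEY = b"oem-signing-key"      # hypothetical stand-in keys; in practice
TRANSPORT_KEY = b"hsm-transport-key"  # these are asymmetric and HSM-protected

def _keystream(key, nonce, length):
    # Derive a pseudo-random keystream from SHA-256 (toy stream cipher).
    out = b""
    counter = 0
    while len(out) < length:
        out += hashlib.sha256(key + nonce + counter.to_bytes(4, "big")).digest()
        counter += 1
    return out[:length]

def wrap(secret):
    tag = hmac.new(SIGNING_KEY, secret, hashlib.sha256).digest()   # "sign"
    nonce = secrets.token_bytes(16)
    keystream = _keystream(TRANSPORT_KEY, nonce, len(secret))
    ciphertext = bytes(a ^ b for a, b in zip(secret, keystream))   # "encrypt"
    return nonce, ciphertext, tag

def unwrap(nonce, ciphertext, tag):
    keystream = _keystream(TRANSPORT_KEY, nonce, len(ciphertext))
    secret = bytes(a ^ b for a, b in zip(ciphertext, keystream))
    if not hmac.compare_digest(
            hmac.new(SIGNING_KEY, secret, hashlib.sha256).digest(), tag):
        raise ValueError("signature check failed: file modified in transit")
    return secret

nonce, ct, tag = wrap(b"OEM device certificate signature key")
assert unwrap(nonce, ct, tag) == b"OEM device certificate signature key"
```

The point of the structure is that the wrapped file is both unreadable and unmodifiable in transit; only the holder of the transport key (the target HSM) can recover it, and any tampering is detected at unwrap time.<br />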

The Secure Programming Center is responsible for<br />

provisioning devices for the OEM customer at its premises.<br />

This process starts with creation of a Security Product using<br />

the specific system programming software. During product<br />

creation, the wrapped OEM Public and Private Information is<br />

imported into the programming system and cryptographically<br />

bound to a Unique Product ID and a Unique OEM ID within<br />

the HSM. Thus a unique product representation (OEM ID and<br />

Product ID) is created.<br />

Figure 4 – Mutual Authentication Between Two Devices<br />

The application and devices are developed by OEM A, who<br />

is thus the owner of Device 1, Device 2 and the application.<br />

OEM A is therefore responsible for provisioning both devices<br />

for mutual authentication.<br />

This process generally requires creating a Public Key<br />

Infrastructure (PKI) system among the various devices in the<br />

application. Identities for Device 1 and Device 2 are created<br />

by associating an Identity Key Pair with each device. As<br />

shown in Figure 4, the private part of the Identity Key Pair for<br />

each device is stored in read and write protected storage on the<br />

device (and thus never exposed to the outside world). The<br />

public part of the key pair is stored in a Device Certificate in<br />

write-protected storage and represents the Public Identity of the<br />

device (and is available to share with the outside world).<br />

Assigning ownership of Device 1 and Device 2 to OEM A<br />

requires that an Identity Key Pair for OEM A be created, as<br />

shown in Figure 4. Ownership is then assigned by signing the<br />

Device Certificate for each device with the Private Key from<br />

the OEM Identity Key Pair.<br />

The Public Identity of the OEM is represented by the OEM<br />

Root CA certificate, which contains the Public Key of the<br />

OEM. In this example, we are assuming that the OEM is also<br />

the Root Certificate Authority (where the certificate signature<br />

chain of trust needs to terminate), thus the OEM Root CA<br />

certificate is signed by the Private Key of the OEM. Note that<br />

in most cases, the Root CA is used to sign intermediate CA<br />

certificates and these are, in turn, used to sign device<br />

certificates. However, for this example, the Root CA will be<br />

used directly for simplicity.<br />



The OEM Root CA certificate is also associated with<br />

Device 1 and Device 2 so that during the device authentication<br />

process, the chain of trust starting with the Device Certificate<br />

for Device 1 (or Device 2) can be dynamically verified to<br />

terminate at the same Root CA certificate as the OEM Root CA<br />

certificate stored on the device.<br />
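The chain walk itself can be sketched structurally. This is a toy Python model with a hypothetical issuer table; real verification also checks each link's signature, validity period, and extensions, which are omitted here for brevity.<br />

```python
# Toy walk up a certificate chain: each certificate names its issuer,
# and the walk must end at the root CA certificate the device already
# stores and trusts. Per-link signature checks are omitted.
TRUSTED_ROOT = "OEM Root CA"

ISSUER_OF = {                       # hypothetical chain
    "Device 2": "Intermediate CA",
    "Intermediate CA": "OEM Root CA",
    "OEM Root CA": "OEM Root CA",   # the root is self-signed
}

def terminates_at_trusted_root(cert, max_depth=5):
    for _ in range(max_depth):      # bound the walk to avoid loops
        issuer = ISSUER_OF.get(cert)
        if issuer is None:          # unknown issuer: chain cannot be built
            return False
        if issuer == cert:          # self-signed: end of the chain
            return cert == TRUSTED_ROOT
        cert = issuer
    return False

assert terminates_at_trusted_root("Device 2")
assert not terminates_at_trusted_root("Rogue Device")
```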

Once the above PKI system is set up, the mutual<br />

authentication algorithm works as follows:<br />

1. Device 1 requests Device 2 to send its Certificate.<br />

2. Device 2 sends its Device Certificate to Device 1.<br />

3. Device 1 authenticates Device 2 by validating Device<br />

2’s Device Certificate. This is done in two steps.<br />

a. Cryptographic validation that Device 2 certificate<br />

is from OEM A – this involves ensuring that the<br />

Device Certificate is signed by the Private Key of the<br />

OEM Identity Key Pair and by verifying the<br />

signature chain of trust starting with the Device<br />

Certificate of Device 2 and terminating at the<br />

OEM Root CA Certificate. Each device has a<br />

local copy of the OEM certificate that it trusts.<br />

b. Authentication of the identity of Device 2 using<br />

the Device 2 Public Key from the Device 2<br />

certificate and performing a challenge response<br />

algorithm with Device 2.<br />

Since Device 2 is the only entity that knows the<br />

Device 2 private key, it is the only device that can<br />

successfully respond to the challenge, thus<br />

proving that the Public Key belongs to Device 2.<br />

The details of the challenge response mechanism<br />

are not covered in this paper.<br />

4. If validation of the Device Certificate by Device 1 is<br />

successful, Device 1 sends an affirmative authentication<br />

response to Device 2.<br />

5. Device 2 executes a complementary sequence to<br />

authenticate Device 1.<br />
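A minimal sketch of the challenge-response in step 3b, assuming textbook RSA with toy parameters (the primes below are Mersenne primes chosen only for illustration; real deployments use vetted libraries, padding schemes, and much larger keys):<br />

```python
import hashlib
import secrets

# Toy textbook-RSA challenge-response; NOT secure, illustration only.
p, q = 2147483647, 2305843009213693951
n, e = p * q, 65537                      # Device 2's public key
d = pow(e, -1, (p - 1) * (q - 1))        # private key, never leaves the device

def respond(challenge):
    """Device 2 'signs' the challenge with its private key."""
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n
    return pow(h, d, n)

def verify(challenge, response):
    """Device 1 checks the response using only the public key (n, e)."""
    h = int.from_bytes(hashlib.sha256(challenge).digest(), "big") % n
    return pow(response, e, n) == h

challenge = secrets.token_bytes(16)          # fresh random challenge
assert verify(challenge, respond(challenge))             # genuine Device 2
assert not verify(challenge, respond(challenge) + 1)     # tampered response
```

Because only Device 2 holds the private exponent, a correct response over a fresh challenge ties the public key in the (already validated) certificate to the device presenting it.<br />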

V. PROVISIONING IMPLEMENTATION<br />

Now that mutual authentication requirements have been<br />

discussed, the provisioning of Secure Element devices using the<br />

provisioning system can be outlined. From the discussion of<br />

the provisioning architecture and the mutual<br />

authentication use case, the following requirements<br />

must be supported when provisioning devices:<br />

1. The OEM creates the following at the OEM Premise:<br />

a. An OEM Identity Key Pair<br />

b. An OEM Root CA Certificate<br />

c. A Device Certificate Template<br />

d. Production Count for number of devices to be<br />

produced at the Secure Programming Center<br />

2. The OEM securely transmits the following OEM<br />

information to the Secure Programming Center:<br />

a. OEM Device Certificate Signature Key (which is<br />

the Private Key from the OEM Identity Key pair)<br />

b. OEM Root CA Certificate<br />

c. A Device Certificate Template<br />

d. A Production Count<br />

e. Unique serial numbers to be programmed into<br />

devices (optional and not shown)<br />

The above information flow is shown in Figure 5.<br />

Figure 5 – Secure Transfer of OEM Information to Provisioning System<br />

The OEM Identity Key is secret, must be protected and is<br />

wrapped before transfer to the specific provisioning system.<br />

The OEM Secret Wrapping Tool targets a specific HSM<br />

system at the Secure Programming Center, so the identity<br />

certificate of the specific Guardian HSM at the target<br />

programming system is required. OEM Public Information is not<br />

secret and technically does not need to be wrapped; however,<br />

this is often a convenient method of transfer.<br />

Once securely stored inside the provisioning system, the<br />

OEM Information is used to create a Job Package, which<br />

initiates the provisioning cycle for a batch of Devices. The<br />

simplified provisioning flow for a single device is as follows:<br />

1. Generate a Device Identity Key Pair for each device.<br />

2. Create a Device certificate using the Public Key of the<br />

Identity Key Pair.<br />

3. Sign the Device certificate with OEM Device<br />

Certificate Signature Key.<br />

4. Program the Device Certificate into the Device Write<br />

Protected storage.<br />

5. Program the Root CA Cert into the Device Write<br />

Protected storage.<br />

6. Lock the Device.<br />
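The six steps above map naturally onto a small data model. The sketch below is illustrative only: the HMAC tag stands in for the OEM's certificate signature, and the "key generation" is simulated with a hash.<br />

```python
import hashlib
import hmac
from dataclasses import dataclass

OEM_SIGNATURE_KEY = b"oem-device-cert-signing-key"  # hypothetical; held in HSM

@dataclass
class Device:
    private_key: bytes = b""   # read- and write-protected storage
    certificate: bytes = b""   # write-protected storage
    root_ca_cert: bytes = b""  # write-protected storage
    locked: bool = False

def provision(device, serial, root_ca_cert):
    device.private_key = hashlib.sha256(b"keygen" + serial).digest()  # 1. key pair
    public_key = hashlib.sha256(device.private_key).digest()          #    (toy derivation)
    cert_body = serial + public_key                                   # 2. device certificate
    signature = hmac.new(OEM_SIGNATURE_KEY, cert_body,
                         hashlib.sha256).digest()                     # 3. sign certificate
    device.certificate = cert_body + signature                        # 4. program certificate
    device.root_ca_cert = root_ca_cert                                # 5. program root CA cert
    device.locked = True                                              # 6. lock the device

dev = Device()
provision(dev, b"SN0001", b"toy-oem-root-ca-certificate")
assert dev.locked and dev.private_key != b""
```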

Once this provisioning flow has been executed, devices are<br />

protected from modification and prepared for mutual<br />

authentication in the field.<br />



VI. SUMMARY<br />

This paper has described common security issues facing<br />

IoT OEMs. In order to secure IoT devices, OEMs must<br />

establish a supply chain of trust, create a root of trust in each<br />

device and ensure the secure provisioning of secure elements<br />

and secure MCUs for those devices. A component architecture<br />

of a secure provisioning system was discussed and an example<br />

of mutual authentication was used to demonstrate the necessary<br />

device provisioning steps. This fundamental provisioning<br />

architecture can also be used for more advanced usage models,<br />

including secure boot of an MCU and the authentication and<br />

encryption of firmware. These advanced usage models can be<br />

implemented with a Secure Element or a Secure MCU.<br />

VII. TRADEMARKS<br />

Secure Deploy is a registered trademark of Secure Thingz, Inc.<br />

SentriX is a registered trademark of Data I/O Corporation.<br />

VIII. REFERENCES<br />

[1] Microsoft, “Securing Public Key Infrastructure (PKI),” May 2014.<br />

https://docs.microsoft.com/en-us/previous-versions/windows/itpro/windows-server-2012-R2-and-2012/dn786443(v=ws.11)<br />

[2] Jones, Scott, “Secure Authenticators answer the call to solve IoT device<br />

embedded security needs”, Embedded World 2017, unpublished.<br />



Providing Cryptography for Your System<br />

How to Port Transport Layer Security (TLS)<br />

Ron Eldor<br />

IoT Services Group<br />

Arm<br />

Netanya, Israel<br />

Janos Follath<br />

IoT Services Group<br />

Arm<br />

Cambridge, United Kingdom<br />

Abstract— It’s unimaginable for a device designed to be<br />

secure in a modern connected environment to not use<br />

cryptography. Authentication, secure communication, secure<br />

boot and firmware update services, all heavily rely on<br />

cryptographic protocols and primitives. However, such features<br />

come at a cost, including enlarged code size and decreased<br />

performance, which can be challenging on constrained devices.<br />

This can be made easier with hardware-accelerated<br />

cryptographic functions. Performing cryptography in hardware<br />

removes the workload from the microcontroller, decreasing<br />

power consumption and reducing overall code size. In this paper,<br />

we will review the most common cryptographic primitives,<br />

provide an overview of the porting process and demonstrate its<br />

implementation through a case study of integrating Arm Mbed<br />

TLS into an Mbed OS target platform. There are several<br />

challenges that may arise when integrating TLS technology into<br />

an operating system instead of directly into the application. For<br />

example, it may be non-trivial to expose configuration options to<br />

the application developer and/or to the silicon manufacturer.<br />

This presentation will outline Arm’s approach to implementing<br />

TLS within an OS.<br />

Keywords—cryptography; TLS; Mbed TLS; Mbed OS;<br />

CryptoCell; TrustZone; hardware accelerator<br />

I. INTRODUCTION<br />

With the Internet of Things growing by leaps and bounds,<br />

security related issues are also growing, and need to be<br />

effectively addressed now. As the ecosystem increases in size,<br />

securing the embedded platforms is more important than ever.<br />

A basic component of securing platforms is adding<br />

cryptography with some security protocol, such as Transport<br />

Layer Security (TLS). The downside of adding cryptography is<br />

the decrease in performance on a Microcontroller Unit (MCU)<br />

with already low performance, and enlarging code size on a<br />

memory limited platform at the same time. These problems can<br />

be mitigated by using a hardware-accelerated cryptography<br />

engine which offloads operations from the MCU and reduces<br />

code size.<br />

Integrating a hardware accelerator into a product takes time<br />

and effort during both development and maintenance, even<br />

more so because this integration is a security critical task in<br />

itself. This can be mitigated by using an embedded operating<br />

system that integrates an accelerator, where the code is<br />

maintained by the developers of the driver and/or the operating<br />

system.<br />

In this paper we discuss how porting Mbed TLS and Mbed<br />

OS to a platform with Arm TrustZone Cryptocell-310 resulted<br />

in a 15-53x performance increase, and we describe the main<br />

integration tasks this involved. Section II<br />

shares some of the experimental data from this task. Section III<br />

gives a short background in cryptography, discussing the terms<br />

used throughout the paper. Section IV describes how Mbed<br />

TLS and Mbed OS support the process of porting and in<br />

Section V some representative tasks are discussed in detail.<br />

Section VI closes the paper with a summary.<br />

II. RESULTS<br />

Cryptography is a necessary but expensive operation, both<br />

in performance and in code size. In a constrained environment,<br />

where both these setbacks can be a major limitation, hardware<br />

accelerated cryptography comes into use. There are several<br />

potential benefits of using hardware acceleration for<br />

cryptography:<br />

1) Throughput increase<br />

2) Code size (therefore area) reduction (as most of the<br />

cryptography is done in hardware)<br />

3) Power reduction<br />

4) Isolation of cryptographic keys from potentially flawed<br />

or even malicious software<br />

We will study the potential increase in performance and<br />

throughput in this paper.<br />

A. Performance<br />

As mentioned, offloading the cryptography to a hardware<br />

acceleration core improves the performance. It allows the TLS<br />

handshake to finish faster and application data to be encrypted<br />

more quickly - providing an overall smoother user experience.<br />

Cryptographic operations are performed using dedicated<br />

hardware highly optimized for the target algorithms, which<br />

leads to significant improvement in the computation time.<br />

Table I presents the comparison of the Mbed OS<br />

Benchmark [9] application output, when running Mbed TLS<br />



with a pure software implementation and with hardware<br />

acceleration via Arm TrustZone CryptoCell-310. The<br />

measurements were done on a platform embedding Arm<br />

Cortex-M4F and TrustZone CryptoCell-310, both supplied<br />

with a 64MHz clock. The benchmark application was compiled<br />

with the Arm Compiler 5 toolchain.<br />

TABLE I. COMPARISON OF MBED TLS BENCHMARK WITH AND WITHOUT ARM TRUSTZONE CRYPTOCELL-310<br />

Algorithm | Improvement Ratio<br />

SHA-256 1:15.81<br />

AES-CBC-128 1:24.06<br />

AES-CCM-128 1:25.49<br />

ECDSA-secp384r1 (sign) 1:19.32<br />

ECDSA-secp256r1 (sign) 1:35.76<br />

ECDSA-secp384r1 (verify) 1:28.21<br />

ECDSA-secp256r1 (verify) 1:53.59<br />

ECDHE-secp384r1 (handshake) 1:20.89<br />

ECDHE-secp256r1 (handshake) 1:38.79<br />

ECDHE-Curve25519 (handshake) 1:43.65<br />

ECDH-secp384r1 (handshake) 1:20.50<br />

ECDH-secp256r1 (handshake) 1:40.29<br />

ECDH-Curve25519 (handshake) 1:46.73<br />

III. CRYPTOGRAPHIC BACKGROUND<br />

Authentication, secure communication, secure boot and<br />

firmware update services all heavily rely on cryptographic<br />

protocols and primitives. There is a wide range of crypto<br />

primitives and protocols in use for various purposes around the<br />

industry. The TLS protocol is one of the most widespread<br />

conventional protocols – it provides confidentiality, integrity<br />

and authenticity over an untrusted channel. It is a very versatile<br />

protocol, supporting numerous algorithms for authentication,<br />

key exchange and encryption. When starting a TLS connection,<br />

a handshake takes place. The peers negotiate algorithms to use,<br />

authenticate each other and share temporary key material to use<br />

during the session. Technically, TLS optionally supports<br />

mutual authentication. In the HTTPS case the server usually<br />

authenticates, and the client does not, but there are many other<br />

use cases other than HTTPS, where both sides do want to<br />

authenticate.<br />
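Python's standard ssl module illustrates how mutual authentication is opted into: a server context only demands a client certificate when its verify mode says so. The certificate file paths in the comments are hypothetical.<br />

```python
import ssl

# One-way TLS (the HTTPS default): the server presents a certificate,
# the client verifies it, and the client itself stays anonymous.
server_ctx = ssl.SSLContext(ssl.PROTOCOL_TLS_SERVER)
assert server_ctx.verify_mode == ssl.CERT_NONE   # client certs not requested

# Mutual TLS: the server additionally requires and verifies a client cert.
server_ctx.verify_mode = ssl.CERT_REQUIRED
# server_ctx.load_cert_chain("server.crt", "server.key")   # server identity
# server_ctx.load_verify_locations("client-ca.crt")        # trusted client CA
```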

One way of authenticating the peer is with the help of<br />

digital signatures. Only someone in the possession of the<br />

private key can produce valid signatures, which anybody can<br />

verify with the help of the public key. The Elliptic Curve<br />

Digital Signature Algorithm (ECDSA) is well suited for<br />

embedded applications, as it has good performance and small<br />

key size, while still providing strong cryptographic security.<br />

The security of these cryptographic primitives is always<br />

conditional on the computational power the adversary has<br />

available. If there is an 80-bit long key and the adversary can<br />

try all 2^80 keys fast enough, then he can break the scheme<br />

(although it is worth noting that on average he only needs to try<br />

2^79). A primitive has x-bit security if the best attack to break it<br />

requires computational power equivalent to trying 2^x<br />

symmetric keys.<br />

Keys, secrets and random numbers are generated with the<br />

help of random number generators (RNG). The unpredictability<br />

of their output is crucial to the security of the system. For<br />

example, if a 256-bit long output of the RNG is used as a key<br />

in a 256-bit strong scheme and the RNG output can be<br />

predicted with a chance of 0.00001%, then the security is<br />

reduced to 23 bits. Secure systems must have some way of<br />

generating random numbers in a secure way. Ideally, the<br />

generation of randomness is based on a physical phenomenon,<br />

providing a high level of entropy (also known as “True<br />

Random Number Generator”).<br />

IV. MBED TLS AND MBED OS<br />

In this case study we ported Mbed OS and Mbed TLS to a<br />

platform equipped with Arm TrustZone CryptoCell-310. The<br />

Mbed TLS library is not just a free, open-source, highly<br />

configurable TLS stack designed with use in embedded<br />

systems in mind – it is also a cryptographic library. It provides<br />

compile time configuration options enabling the use of<br />

cryptographic hardware accelerators [1]. Its portable C code, minimal dependencies, modular structure and built-in thin abstraction layer make it easy to port to new platforms [2].<br />

Because of this, porting Mbed TLS for a single, specific<br />

application can be simple and easily achieved, but if integration<br />

into a multipurpose, generic platform is required, then there are<br />

several actors and viewpoints that have to be taken into account<br />

to preserve Mbed TLS’s high configurability.<br />

When integrating Mbed TLS into operating systems,<br />

instead of directly into the application, care should be taken for<br />

the following reasons:<br />

• It is non-trivial to expose configuration options to the<br />

application developer and/or to the driver developer,<br />

due to the fact that Mbed TLS has a lot of compile-time<br />

options that enable the user to fine-tune the memory<br />

footprint, performance and functionality.<br />

• The crypto engines have to be integrated and available<br />

as both an operating system service and as part of Mbed<br />

TLS.<br />

• The above two considerations have to be addressed in a<br />

way that enables a straightforward and usable<br />

integration process for the driver developer.<br />

Mbed OS is a free, open-source embedded operating<br />

system, which is pre-integrated with Mbed TLS. The<br />

integration with Mbed OS addressed the above considerations,<br />

by dividing the options into three groups, and then providing<br />

three different mechanisms to access them. The first, most integrated, set of options is part of the Mbed Hardware<br />

Abstraction Layer (HAL) and can be activated through Mbed<br />

OS target configuration [3]. The second set provides a more<br />

flexible way for the driver developer to provide the driver code<br />

[4]. The third set of options has a similarly flexible but<br />

different configuration method [5]. The integration process is<br />



still ongoing and options can move from the second group into<br />

the first, when they have stood the test of time.<br />

Since the nature of this case study is integration and not<br />

application development, the first two mechanisms are used.<br />

The TRNG is being integrated into the Mbed OS HAL and the<br />

other Arm TrustZone CryptoCell crypto acceleration services<br />

are being made available by the second mechanism.<br />

V. INTEGRATION<br />

When porting a hardware cryptography engine, the<br />

signature of the public API functions and the types of the<br />

context structures in the driver are unlikely to coincide with the<br />

ones in Mbed TLS. Furthermore, in some cases the byte order<br />

used by Mbed TLS might differ from the one preferred by the<br />

engine. Both differences arise in the case of Arm TrustZone<br />

CryptoCell-310 and need adjustment to achieve full<br />

integration. In the first case this adjustment means converting<br />

the input from the Mbed TLS type to the types prescribed by<br />

the hardware, passing it to the driver API function and<br />

converting back the output from the hardware driver type to the<br />

Mbed TLS type. In the second case a simple change in byte<br />

order is necessary. These translations have a downside of<br />

adding code and decreasing performance. However both of<br />

these effects are negligible and well justified by the gain<br />

provided by the cryptography engine. In the rest of this chapter, we present examples of the integration challenges encountered when porting Arm TrustZone CryptoCell-310 to Mbed TLS on Mbed OS and the ways to overcome them.<br />

A. Type mismatch in output<br />

The Mbed TLS function for ECDSA signature,<br />

mbedtls_ecdsa_sign(), receives two output parameters of type<br />

mbedtls_mpi and follows the standard SEC1 [6]. However, the<br />

hardware driver's signature function outputs a single byte<br />

buffer representing the signature. Fortunately, translating the<br />

output byte buffer to mbedtls_mpi can be done with the<br />

mbedtls_mpi_read_binary() function (Fig. 1).<br />
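Fig. 1 is not reproduced in this text-only version. The translation it shows parses a big-endian byte buffer into a multi-precision integer; in real code this is a call to mbedtls_mpi_read_binary(&r, buf, len). The stand-in below does the same parse into a plain uint64_t so the sketch stays self-contained (actual ECDSA signature components are of course far larger than 64 bits).

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Simplified stand-in for mbedtls_mpi_read_binary(): interpret a
 * big-endian byte buffer, as produced by the hardware driver, as an
 * integer. Real code fills an mbedtls_mpi instead of a uint64_t. */
static uint64_t read_binary_be(const uint8_t *buf, size_t len)
{
    uint64_t x = 0;
    for (size_t i = 0; i < len; i++)
        x = (x << 8) | buf[i];
    return x;
}
```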

B. Type mismatch in input<br />

The Mbed TLS random function signature is int<br />

(*f_rng)(void * context, unsigned char * output, size_t length)<br />

and the Arm TrustZone CryptoCell-310 random function<br />

signature is int (*f_rng)(void * context, uint16_t length, uint8_t<br />

* output). This means that the random function callback given<br />

as a parameter to a function, such as mbedtls_ecdsa_sign(),<br />

cannot simply send the function pointer to the hardware<br />

accelerator driver. To overcome this, a wrapper function and<br />

context have been created.<br />

First, a structure called mbedtls_rand_func_container is<br />

defined, which will contain the context and the Mbed TLS<br />

random function pointer (Fig. 2).<br />

After that, the function mbedtls_to_cc_rand_func() is<br />

created with the hardware driver's signature, which calls the<br />

Mbed TLS callback function (Fig. 3).<br />

Last, an mbedtls_rand_func_container is initialized with the Mbed TLS RNG parameters and passed to the hardware accelerator driver, along with mbedtls_to_cc_rand_func() (Fig. 4).<br />
Fig. 2. Definition of mbedtls_rand_func_container<br />
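Figs. 2-4 are not reproduced here. Based on the two callback signatures quoted above, the wrapper pattern might look like the following sketch; the struct and function names follow the text, while the test RNG at the bottom is a stand-in of ours for the real Mbed TLS f_rng.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Container pairing the Mbed TLS RNG callback with its context (Fig. 2). */
typedef struct {
    int (*f_rng)(void *ctx, unsigned char *output, size_t len); /* Mbed TLS style */
    void *ctx;
} mbedtls_rand_func_container;

/* Wrapper with the CryptoCell-style signature, length before buffer
 * (Fig. 3): the driver calls this, and it forwards to the stored
 * Mbed TLS callback with the argument order Mbed TLS expects. */
static int mbedtls_to_cc_rand_func(void *mbedtls_rng, uint16_t len,
                                   uint8_t *output)
{
    mbedtls_rand_func_container *c = mbedtls_rng;
    return c->f_rng(c->ctx, output, len);
}

/* Stand-in Mbed TLS RNG for testing: fills the buffer with one byte. */
static int fake_rng(void *ctx, unsigned char *output, size_t len)
{
    memset(output, *(unsigned char *)ctx, len);
    return 0;
}
```

In use, the container is filled with the Mbed TLS f_rng and its context, and the pair (container, mbedtls_to_cc_rand_func) is handed to the driver in place of a single incompatible function pointer.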

C. Difference in standards<br />

Mbed TLS and Arm TrustZone CryptoCell-310 comply<br />

with different standards. In the majority of the cases, the<br />

standards are either the same or very similar, however<br />

sometimes there are minor differences, which eventually led to<br />

some small complications during the porting process. In our<br />

case Mbed TLS follows SEC1 [6] and this standard leaves the<br />

details of transforming random bits to key candidates to the<br />

implementation. Mbed TLS implements a key generation<br />

method to be suitable for generating ephemeral keys for<br />

deterministic signing too. Namely it complies with RFC 6979<br />

[7] and implements key generation as described in section 3.3.<br />

Arm TrustZone CryptoCell-310 on the other hand follows the<br />

FIPS 186-4 [8] standard and implements key generation as<br />

described in section B.5.2. To be precise both of these describe<br />

the ephemeral key generation, as SEC1 calls it, or the per-message secret number generation, in the terms of FIPS 186-4. This is intentional, because this mode of key generation is the one relevant to this case.<br />

The major difference between the two standards is that FIPS 186-4 ensures that the key k is in the interval [1, n-1] by checking whether the candidate is greater than n-2: if so, a new candidate is generated; otherwise one is added and the result is accepted. RFC 6979, on the other hand, checks whether k > n-1 or k = 0, generating a new k if so and accepting it unchanged otherwise. Both algorithms are correct, but given the same random input their outputs differ (k+1 and k, respectively). The only case in which this poses a problem is testing: the difference makes predefined test vectors fail.<br />
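The difference can be illustrated with toy numbers. The sketch below uses uint64_t stand-ins for the big integers involved, and the function names are ours, not from either standard's reference code; a return of -1 means "reject, draw a new candidate".

```c
#include <assert.h>
#include <stdint.h>

/* FIPS 186-4 B.5.2 style: reject candidates above n-2, otherwise
 * shift into [1, n-1] by adding one. */
static int fips_k(uint64_t c, uint64_t n, uint64_t *k)
{
    if (c > n - 2)
        return -1;          /* reject: caller draws a new candidate */
    *k = c + 1;             /* accept, shifted into [1, n-1] */
    return 0;
}

/* RFC 6979 section 3.3 style: reject candidates outside [1, n-1],
 * otherwise accept unchanged. */
static int rfc6979_k(uint64_t c, uint64_t n, uint64_t *k)
{
    if (c > n - 1 || c == 0)
        return -1;          /* reject */
    *k = c;                 /* accept as-is */
    return 0;
}
```

For any candidate both routines accept, the FIPS result exceeds the RFC 6979 result by exactly one, which is why shared test vectors diverge.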

To overcome this difference, the<br />

mbedtls_to_cc_rand_func() function defined previously, can be<br />

modified to decrease the output of f_rng by one, before this<br />

callback is returned to the hardware driver (Fig. 5).<br />

D. Difference in byte order<br />

Arm TrustZone CryptoCell-310 and Mbed TLS use a<br />

different byte ordering, which needs to be adjusted when<br />

passing raw data between components. For example, the<br />

hardware handles keys in byte buffers in little endian byte<br />

order. However, converting to the mbedtls_mpi structure when<br />

using mbedtls_mpi_read_binary() function, the buffer has to be<br />

in big endian byte order. In order for the values of the<br />

Fig. 1. Conversion of byte array to mbedtls_mpi<br />

Fig. 3. Definition of mbedtls_to_cc_rand_func<br />

www.embedded-world.eu<br />



Fig. 4. Combination of the components<br />

generated keys to be the same, the output of the random bit<br />

generator needs to be translated to match the byte order of<br />

Mbed TLS. A straightforward way to do this is to extend mbedtls_to_cc_rand_func() with this translation functionality (Fig. 6).<br />

Like in the previous subsection this is only an issue when<br />

using predefined test vectors and it will not affect the<br />

correctness of operation in production. These modifications<br />

come with a slight increase in code size and penalty in<br />

performance, but these are negligible in most use cases and can<br />

be mitigated by turning them off in production if absolutely<br />

necessary.<br />
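Fig. 6 itself is not reproduced here; the core of the translation it adds is a plain in-place byte-order reversal over the RNG output buffer, so that the driver's little-endian layout matches the big-endian layout mbedtls_mpi_read_binary() expects. A minimal sketch (helper name ours):

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Reverse the byte order of a buffer in place: converts between the
 * little-endian layout used by the hardware and the big-endian layout
 * expected by Mbed TLS. */
static void reverse_bytes(uint8_t *buf, size_t len)
{
    for (size_t i = 0; i < len / 2; i++) {
        uint8_t t = buf[i];
        buf[i] = buf[len - 1 - i];
        buf[len - 1 - i] = t;
    }
}
```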

Fig. 6. Change of the byte order<br />
VI. SUMMARY<br />
Adding cryptography to IoT products is a requirement for security and therefore for their overall success. Unfortunately, software implementation of cryptography comes with performance and memory costs on constrained platforms. Adding cryptographic hardware accelerators to the system can potentially solve both problems.<br />
Porting Mbed TLS and Mbed OS to a platform with Arm TrustZone CryptoCell-310 can be done with minimal integration engineering and can result in significant performance improvement: we measured a 15-53x improvement on our target platform. Integrating the accelerator in an operating system instead of directly into the application reduces overall development and maintenance cost. The typical integration tasks performed during the porting were addressing mismatches in function signatures, byte order and followed standards between Mbed TLS and Arm TrustZone CryptoCell-310, all of which can be solved with negligible overhead.<br />

REFERENCES<br />

[1] S. Butcher, “Alternative cryptography engines implementation,”<br />

https://tls.mbed.org/kb/development/hw_acc_guidelines<br />

[2] M. Pégourié-Gonnard, “Porting Mbed TLS to a new environment or<br />

OS,” https://tls.mbed.org/kb/how-to/how-do-i-port-mbed-tls-to-a-newenvironment-OS<br />

[3] “Arm Mbed Reference,” https://os.mbed.com/docs/v5.7/mbed-os-apidoxy/group__hal__trng.html<br />

[4] “Mbed Handbook – Mbed TLS Hardware Acceleration,”<br />

https://docs.mbed.com/docs/mbed-os-handbook/en/latest/advanced/tls_hardware_acceleration/<br />

[5] “Mbed OS Reference – Security/TLS,”<br />

https://os.mbed.com/docs/v5.7/reference/tls.html<br />

[6] D. R. L. Brown, “SEC 1: Elliptic Curve Cryptography,”<br />

http://www.secg.org/sec1-v2.pdf<br />

[7] T. Pornin, “Deterministic Usage of the Digital Signature Algorithm<br />

(DSA) and Elliptic Curve Digital Signature Algorithm (ECDSA),”<br />

https://tools.ietf.org/html/rfc6979<br />

[8] “Digital Signature Standard (DSS) - FIPS PUB 186-4,”<br />

http://nvlpubs.nist.gov/nistpubs/FIPS/NIST.FIPS.186-4.pdf<br />

[9] “Mbed TLS Benchmark Application on Mbed OS Platform,”<br />

https://github.com/ARMmbed/mbed-os-exampletls/tree/master/benchmar<br />

Fig. 5. Adapting mbedtls_to_cc_rand_func<br />



How Next-Generation Security ICs Deliver a<br />

Stronger Level of Protection<br />

Scott Jones<br />

Maxim Integrated<br />

Micros, Security & Software Business Unit<br />

Dallas, TX, USA<br />

Abstract—An IC-based physically unclonable function (PUF)<br />

has desirable properties that can be utilized by chips that<br />

implement cryptographic functionality. A new PUF<br />

semiconductor solution takes advantage of the random analog<br />

characteristics of MOSFET transistors, the fundamental building<br />

block of CMOS ICs. The PUF is constructed from an analog<br />

circuit element with inherent randomness in I-V characteristics.<br />

At the chip level, the PUF solution is constructed from an array of<br />

these elements sized according to the number of bits required to<br />

achieve the cryptographic requirements of the chip. When needed,<br />

the PUF is operated to derive a per-chip random, unique, and<br />

repeatable binary value that is only accessible by chip crypto<br />

blocks. Thereafter the PUF-derived key value is instantaneously<br />

erased and does not exist in digital form. For a PUF output to be<br />

used as a cryptographic key value, it must be highly reliable and<br />

have appropriate crypto quality. This new PUF solution has<br />

demonstrated the ability to satisfy both requirements.<br />

Keywords—PUF, physically unclonable function, security IC,<br />

security chip, secure authenticator, crypto, cryptography, ChipDNA<br />

I. INTRODUCTION<br />

In a world where embedded electronic systems continue to<br />

come under attack, cryptography provides flexible and effective<br />

tools to address a myriad of potential security threats.<br />

Accordingly, a variety of options exist to implement crypto<br />

solutions with both hardware and software approaches. Given<br />

the dedicated and optimized implementations, it is understood<br />

that a hardware-based solution, i.e. a dedicated security IC, is the<br />

most effective formulation for the root of trust and the way to<br />

provide the countermeasures and protection that prevent<br />

numerous types of common attacks.<br />

The reality is that there are valuable assets associated with<br />

embedded systems that face relentless threats. For example,<br />

such systems encounter intrusions such as theft of intellectual<br />

property, introduction of malware to disrupt or destroy<br />

equipment, unauthorized access to sensitive communication,<br />

tampering with data produced from IoT endpoints, etc. Security<br />

ICs and cryptographic solutions are available today to<br />

address these threats. However, the security ICs themselves can<br />

become the target of attack by an adversary attempting to<br />

circumvent or break the security.<br />

II. ATTACKS ON SECURITY ICS<br />

With an assumption of a security IC-based protection<br />

solution, there are two general categories of attack scenarios:<br />

non-invasive[1] and invasive. Non-invasive attacks consist of<br />

operational measurements, sometimes combined with other<br />

externally applied stimuli, in an effort to obtain cryptographic<br />

keys or other sensitive data. Examples of such efforts include<br />

differential or simple power/electromagnetic analysis<br />

(DPA/SPA/DEMA/SEMA) or the inducing of fault states<br />

through voltage glitching, extreme thermal conditions, or laser<br />

and timing attacks. While the non-invasive attack vectors are<br />

technically complex to address, there are established circuits and<br />

algorithmic countermeasures that are proven effective in<br />

protecting the security IC and sensitive stored data from being<br />

compromised.<br />

Invasive attacks on a security IC consist of direct die-level<br />

circuit probing, modification, deprocessing and reverse<br />

engineering, again with the objective of compromising the<br />

solution by obtaining keys, disabling functionality, or<br />

completely reverse engineering the design to a netlist for<br />

reproduction. The skill set and required tools are more complex<br />

than in the non-invasive scenarios, but they do exist and are<br />

commonly used to attack the security ICs that protect high-value<br />

assets. For example, Fig. 1 and Fig. 2 are examples of the output<br />

from tools that may be used with an invasive attack to first image<br />

a portion of an IC and then extract the netlist and schematics<br />

from the imaging. An attacker would repeat this process for the<br />

entire IC with the ultimate goal of gaining some insight to launch<br />

a sub-circuit attack, or producing a database to replicate the IC.<br />




Fig. 1. Imaged security IC area for schematic/netlist extraction<br />

III. PUF – DECISIVE INVASIVE ATTACK COUNTERMEASURE<br />

A decisive technology that has emerged to provide strong<br />

protection against the invasive threat is the physically<br />

unclonable function (PUF)[4]. A PUF is a function derived<br />

from the complex and variable physical/electrical properties of<br />

ICs. Because PUF is dependent on random physical factors<br />

(unpredictable and uncontrollable) that exist natively and/or are<br />

incidentally introduced during a manufacturing process, it is<br />

virtually impossible to duplicate or clone. PUF technology<br />

natively generates a digital fingerprint for its associated security<br />

IC, which can be utilized as a unique key/secret to support<br />

cryptographic algorithms and services including<br />

encryption/decryption, authentication, and digital signature.<br />

A PUF implementation from Maxim Integrated operates on<br />

the naturally occurring random variation and mismatch of the<br />

analog characteristics of fundamental semiconductor MOSFET<br />

devices. This randomness originates from factors such as oxide<br />

variation, device-to-device mismatch in threshold voltage, and<br />

interconnect impedances. Similarly, the wafer manufacturing<br />

process introduces randomness through imperfect or non-uniform deposition and etching steps. Paradoxically,<br />

semiconductor device parameter variation is normally a<br />

challenge that IC designers face during development. For Maxim’s PUF design, in contrast, it is the fundamental basis and is deliberately exploited.<br />

Fig. 3 provides a simplified block diagram of the Maxim<br />

PUF architecture showing an example key size of 128 bits.<br />

Shown within the PUF core block is a 16x16 array of 256 PUF<br />

elements each of which is an analog structure. Through factory<br />

conditioning these 256 elements are combined into 128 pairs.<br />

From structure to structure, random I/V characteristics due to the previously described parameters exist, and these are utilized to generate binary 1/0 values through precision circuit-level comparison of each element within a pair. For example,<br />

elements {2,1} and {14,16} could constitute a pair, and I/V<br />

characteristics of each would be compared to derive a bit value.<br />

This is repeated with each of the 128 pairs to produce a 128-bit<br />

PUF key output (for this key size example).<br />
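The pairing scheme just described can be sketched as follows. This is an illustration only: the integer "readings" stand in for analog I/V measurements, the pairing table stands in for the factory conditioning, and none of the names come from Maxim's actual design.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

#define KEYBITS 128  /* example key size from the text */

/* Derive KEYBITS key bits from 2*KEYBITS element readings: each
 * factory-assigned pair of elements is compared, and the comparison
 * outcome becomes one bit of the PUF key. */
static void derive_key_bits(const int elem[2 * KEYBITS],
                            const int pair[2 * KEYBITS], /* 2 indices per bit */
                            uint8_t bits[KEYBITS])
{
    for (size_t i = 0; i < KEYBITS; i++)
        bits[i] = elem[pair[2 * i]] > elem[pair[2 * i + 1]] ? 1 : 0;
}
```

Because the key exists only as these comparisons, nothing digital needs to be stored; re-running the comparisons regenerates the same bits as long as each pair's ordering margin is preserved over the device lifetime.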

Fig. 2. Schematic output from a tool that imaged the area<br />

Like in the non-invasive situation, there are circuit solutions<br />

available to combat invasive attacks. One example consists of<br />

top-level die shields that are actively monitored for a tamper<br />

event and combined with detection circuitry that takes defensive<br />

counteraction. However, the skills and equipment of attackers<br />

employing invasive techniques quickly evolve and have<br />

historically been a challenge to decisively defeat.<br />

Fig. 3. Block diagram of Maxim Integrated’s PUF architecture<br />



From an invasive attack perspective, any probing or<br />

attempted analog measurement of a PUF element causes the<br />

analog electrical characteristic to change due to factors including<br />

capacitive/inductive/resistive loading. As a result, it is not<br />

possible to extract any key data through invasive measurements.<br />

Also, due to the statistical nature of imperfect manufacturing<br />

techniques, there is no known method to discern any key<br />

information from inspection methods. Similarly, even<br />

knowledge of PUF element pairing does not reveal any<br />

information about the key value that would ultimately be derived<br />

from the analog characteristics of the PUF element structures.<br />

Finally, the PUF key value only exists digitally when a<br />

cryptographic operation is performed; thereafter, it is<br />

instantaneously erased. Combined, these attributes of this PUF<br />

design result in a solution that is highly immune to invasive<br />

attacks.<br />

IV. PUF RELIABILITY AND CRYPTO QUALITY<br />

From a cryptographic perspective, reliability and<br />

randomness are critical characteristics that a PUF solution must<br />

exhibit. For use as a cryptographic key, or root thereof, the PUF<br />

output must have 100% reliability, meaning PUF-derived key bit<br />

values must be repeatable over time and all operating conditions.<br />

For semiconductor devices, this evaluation is performed using<br />

JEDEC[5]-defined, industry-proven methods of reliability study.<br />

This includes selecting and subjecting a statistically significant<br />

sample set of devices to environmental and operational stress<br />

conditions that enable evaluation of lifetime reliability<br />

performance. These stresses include high-temperature operating<br />

life (HTOL), temperature cycling, packaging and solder reflow<br />

influences, voltage and temperature drift, and highly accelerated<br />

temperature/humidity stress testing (HAST). Performing a<br />

reliability qualification study using these proven methods results<br />

in a statistical assessment of how a design will perform over the<br />

life of its use in a system. For example, consider that a system end product could have a design life of 10 years and operate within −40°C to +85°C environments with power sources that can fluctuate by ±10%.<br />

Equally critical with a PUF solution is the requirement for<br />

high-performance cryptographic quality, with a key property<br />

being randomness. Low-quality randomness can create a<br />

cryptographic attack vulnerability through predictability<br />

weakness. Statistical test suites, including NIST[2] SP 800-<br />

22[3], provide an industry-proven means to measure<br />

randomness of PUF output. Evaluation against the test suite<br />

provides several metrics which determine whether the PUF<br />

output is consistent with a random sequence. To be statistically<br />

significant, these tools require large data sets for the analysis,<br />

e.g. 20kbit sequences. Therefore, the output from a large set of<br />

PUF instances is required and used for the assessment.<br />

V. RELIABILITY STUDIES ON PUF<br />

The reliability of Maxim’s PUF was proven from results<br />

obtained via a lifetime reliability analysis as described<br />

previously. Fundamentally, the reliability study produced data<br />

to understand the shift from aging, temperature/voltage drift, IC<br />

packaging, PCB assembly, etc., of the PUF elements. Relative<br />

to the time-zero characteristics of two PUF paired elements, the<br />

post-reliability study paired elements have been shown to<br />

consume ~7% of the total margin available to maintain the<br />

stability of the output binary value. The final output from the<br />

analysis is a PUF key error rate (KER) of ≤5ppb, where KER is<br />

defined as the probability that 1 bit within the total key produced by the PUF, e.g. 256 bits, would flip over the life of the<br />

product.<br />

A randomness assessment of the PUF relied on performance<br />

to NIST standard SP 800-22 monobit, poker, runs test, and long<br />

run test. These are test suites that evaluate whether output data<br />

is consistent with a random sequence. Assessment results for<br />

each of the four tests validate excellent performance with respect<br />

to randomness.<br />

To evaluate immunity to invasive attack and reverse<br />

engineering, the Maxim PUF solution was evaluated by a<br />

leading US-based company[6] that specializes in die level<br />

security assessments and IC reverse engineering expertise. Within the given assessment time frame, there was no compromise of PUF operation, and the assessment reached the qualitative conclusion that the solution is “highly effective and resistant against physical Reverse Engineering attacks”.<br />

VI. PUF USE CASES<br />

Numerous use cases exist for a PUF solution. Three are<br />

shown in Fig. 4, Fig. 5, and Fig. 6. In Fig. 4, to secure all stored<br />

data on a security IC, the PUF derived key is used to<br />

encrypt/decrypt data as needed using an algorithm such as AES.<br />

Any NVM data extracted from an invasive attack is useless<br />

given its encrypted state and inability to obtain the PUF-based<br />

decryption key. Fig. 5 shows the use of PUF as the unique<br />

private key for ECDSA signing operations. For this case the<br />

device would compute its own public key from the PUF private<br />

key and a certificate would be installed in NVM by a certificate<br />

authority prior to end-use deployment. In Fig. 6, the PUF private<br />

key is the root private key for the security IC and is used in<br />

conjunction with the end system to establish a “root of trust”<br />

with the security IC for subsequent services.<br />

Fig. 4. Encrypting IC NVM with the PUF secret key<br />

Fig. 5. ECDSA signing with PUF as the private key<br />




Fig. 6. PUF as trust anchor private key<br />

VII. MAXIM’S COMMERCIAL PUF-BASED SECURITY IC<br />

Maxim introduced its first PUF-based security IC, the<br />

DS28E38[7], in November 2017. The DS28E38 is an ECDSA<br />

authenticator that utilizes the company’s ChipDNA PUF<br />

output as key content to cryptographically secure all device-stored data. Optionally, under user control, ChipDNA is used as<br />

the private key for ECDSA signing operations. The device<br />

provides a core set of cryptographic tools derived from<br />

integrated blocks including asymmetric (ECC-P256) and<br />

symmetric (SHA-256) hardware engines, a FIPS/NIST-compliant true random number generator (TRNG), 2Kb of<br />

secured EEPROM, a decrement-only counter, and a unique 64-<br />

bit ROM identification number (ROM ID). The ECC<br />

public/private key capabilities operate from the NIST-defined P-<br />

256 curve to provide a FIPS 186-compliant ECDSA signature-generation function. A block diagram of the DS28E38 is shown<br />

in Fig. 7.<br />

VIII. SUMMARY<br />

Embedded systems have electronic assets that can be<br />

protected by cryptography. Security ICs with cryptographic<br />

functions provide optimal protection, but, ultimately, become<br />

the attack point by those attempting to compromise the assets.<br />

Furthermore, attackers are becoming increasingly sophisticated<br />

in their techniques. A decisive countermeasure to the invasive<br />

attack is the PUF, which, due to its inherent qualities, can be<br />

highly immune to reverse-engineering methods.<br />

IX. TRADEMARKS<br />

ChipDNA is a trademark of Maxim Integrated Products, Inc.<br />

X. REFERENCES<br />

[1] https://en.wikipedia.org/wiki/Side-channel_attack<br />

[2] National Institute of Standards and Technology, NIST, Current Federal<br />

Information Processing Standards (FIPS)<br />

https://www.nist.gov/itl/current-fips.<br />

[3] https://csrc.nist.gov/Projects/Random-Bit-Generation/Documentation-and-Software<br />

[4] https://en.wikipedia.org/wiki/Physical_unclonable_function<br />

[5] JEDEC standards for microelectronics https://www.jedec.org/<br />

[6] MicroNet Solutions, Inc. http://micronetsol.net/<br />

[7] https://www.maximintegrated.com/en/products/digital/memory-products/DS28E38.html<br />

Fig. 7. Block diagram of Maxim’s PUF-based secure authenticator<br />



Timon, Rex and Tux<br />

How TPMs and On-Chip Security Modules improve Trust and Security in GNU/Linux<br />

Dipl.-Ing. Michael Roeder<br />

Technology Engineering and Services CE<br />

Avnet Silica<br />

Poing, Germany<br />

Michael.Roeder@avnet.eu<br />

Dipl.-Inf. Martin Hecht<br />

Technology Engineering and Services CE<br />

Avnet Silica<br />

Berlin, Germany<br />

Martin.Hecht@avnet.eu<br />

Abstract—Although Hardware Security Modules (HSM) to<br />

accelerate cryptographic operations and to perform authenticated<br />

or encrypted boot have been integrated into numerous SoC for<br />

years, they are rarely used in today’s applications. Implications of<br />

using them (both positive and negative) are mostly unknown to the<br />

majority of designers.<br />

At the same time, Trusted Platform Modules (TPM) are<br />

established more and more in embedded and industrial<br />

applications and support for TPM 2.0 in the Linux kernel has<br />

arrived. This prompts the question of to what extent TPMs can take<br />

over some of these functionalities.<br />

This paper gives an introduction into both technologies and<br />

their advantages and disadvantages for certain use-cases.<br />

We look into scenarios like encrypted, authenticated and<br />

measured boot over the various boot stages and the use of<br />

hardware security in the Linux Kernel and in applications such as<br />

OpenSSL, StrongSwan, along with the respective stacks involved.<br />

We show ways to combine hardware security technologies and<br />

software algorithms to create best-in-class solutions but also<br />

explore which hardware functionalities are currently supported in<br />

software and what is missing to create a complete, trusted solution.<br />

Keywords— Security, Trust, trusted boot, authentication, TPM,<br />

TPM2.0, HSM, trust architecture, measurement, attestation<br />

I. INTRODUCTION<br />

Over the past weeks, while the authors were finishing this<br />

paper, the Meltdown and Spectre attacks against modern<br />

processors and SoCs were intensively discussed in professional<br />

and popular press. One of the positive outcomes of such<br />

immense public interest for security concerns is that lots of<br />

people start re-thinking their concepts (and side-effects) of<br />

security and data protection. This is especially important in the<br />

consumer/IOT space, where for devices such as IP cameras or<br />

garage door openers, security used to be an afterthought. Now,<br />

that these products enter the market in big quantities and are sold<br />

even in discounter supermarkets, both public and government<br />

are alerted to potential misuse and the dangers posed by cracking attempts against these devices. Federal agencies have started<br />

to look into criteria to be met for devices transmitting personal<br />

data over open communication channels and how to ensure the<br />

integrity of such devices.<br />

SoCs have been equipped with cryptographic accelerators<br />

for years and some like NXP’s Layerscape families, the<br />

i.MX6UL3 or i.MX8 offer amazing hardware security features<br />

by combining crypto accelerators (hardware security engines,<br />

HSM) with tamper detection features. However, these features<br />

are vendor- and part-specific, and it is hard to base a common<br />

security strategy on proprietary features.<br />

At the same time, Trusted Platform Modules (TPM) are<br />

established more and more in embedded and industrial<br />

applications and support for TPM 2.0 in the Linux kernel has<br />

arrived for some devices. This prompts the question to what<br />

extent TPMs can take over some of these functionalities.<br />

This paper gives an introduction into both technologies and<br />

their advantages and disadvantages for certain use-cases.<br />

However, this paper can only give an introduction and<br />

completely leaves out implementation specifics and some<br />

details. Feel free to contact the authors for more details on the<br />

topics.<br />

1 HARDWARE-ACCELERATED SECURITY<br />

In this chapter we take a closer look at HSMs and TPMs as<br />

hardware implementations to provide security in embedded<br />

systems. We also discuss some generic advantages hardware<br />

security implementations provide over software.<br />

1.1 Motivation<br />

In 1995, former NSA Chief Scientist Robert Morris said:<br />

“Systems built without requirements cannot fail;<br />

they merely offer surprises. Usually unpleasant!”<br />

Some of the basic security requirements, when talking about<br />

SoC-based systems (as usually mentioned in threat analysis<br />

documents and security requirement sheets) are:<br />

• Access Control: access and remote access to the device has to be denied to unauthorized users<br />

• Anti-Cloning: measures against overbuilding and counterfeiting of devices<br />

• IP Protection: the manufacturer’s intellectual property (e.g. software, FPGA netlists) is protected against theft<br />

• Confidentiality: data is encrypted, especially in communication to the outside world or when written to memories<br />

• Resilience: the device can detect attacks and initiate measures to protect data<br />

• Data Integrity: the data generated by or exchanged with the system is protected against modification<br />

• Non-Repudiation: the device can prove that data was generated by it and check that data arriving at it has the correct origin.<br />

In the following chapters, we will give a short overview<br />

of how and why these requirements are addressed in recent<br />

SoC hardware security modules (HSMs) and the limitations<br />

imposed by them. We will also show which functionality Trusted<br />

Platform Modules (TPMs) provide that can be added as a peripheral to existing<br />

systems to enhance security. We will discuss how TPMs can be<br />

used to complement or replace the integrated SoC functionality,<br />

along with the advantages and disadvantages of doing this.<br />

1.2 HSMs (Hardware Security Modules)<br />

First, we take a look at the basic functionalities provided<br />

by HSMs in SoCs. These are as follows:<br />

• Trust: measures taken and functionality provided to ensure that the system can be trusted and is untampered after boot and during operation. This includes secure and encrypted boot functionality, certified true random number generators, secure storage and use of individual keys, and protection against access from the non-secure world from either software (malware) or hardware (JTAG, debugging pins). ARM TrustZone offers some basic support to isolate trusted from untrusted software parts and to restrict system access of untrusted software. Some SoCs offer far more advanced hardware features to assist software (e.g. hypervisors or secure operating systems) in separating software and restricting access to specific hardware resources using a rule-based system. Trust mechanisms are crucial to provide Access Control, Anti-Cloning, IP Protection and Non-Repudiation capabilities to the product.<br />

• Hardware Crypto Engines: acceleration of crypto algorithms to offload the CPU and add additional security and key protection to crypto processes. This unit is usually closely linked to the units providing Trust and Tamper Resistance, but can also be leveraged from user applications to provide Confidentiality, Data Integrity and Non-Repudiation at user level.<br />

• Tamper Resistance: provides protection against attacks that move the SoC out of its regular specifications. This includes units providing a secure RTC and active tamper pin monitoring along with temperature, clock and voltage monitors. The tamper unit is closely connected to all key storages in the crypto accelerators to ensure deletion of critical memories upon a tamper attempt. These units are required to provide Resilience and Anti-Cloning protection.<br />

HSMs provide the best performance, power efficiency<br />

and hardware cost and are ideally integrated into the SoC<br />

functionality. Therefore they can provide a comprehensive<br />

“security package” to the SoC user to leverage and are most<br />

easily used to achieve common security targets. For example,<br />

the integrated secure key memory may have a dedicated<br />

connection to the crypto unit to provide the encryption key to it<br />

without being snooped over system busses, or the tamper<br />

detection unit automatically erases the secure key storage and<br />

other critical memory areas, if an attack is detected.<br />

However, HSMs also have some disadvantages in the<br />

following areas.<br />

• Standardization and Reusability: Most SoC vendors<br />

either develop their security modules as internal IP by<br />

themselves or buy third party IP which is then integrated<br />

into the SoC. There are no standardized ways in which these<br />

modules are designed, integrated and used in the system<br />

scope. Therefore, a security concept and software written<br />

for one system can’t be easily migrated to a different one.<br />

HSMs, drivers and usage concepts will most likely differ<br />

even among members of the same family of one vendor.<br />

This gets worse if a common security concept has to be<br />

developed and maintained among several different<br />

platforms in a company.<br />

• Certification and Trust in Implementation<br />

Correctness: this problem arises if security certifications<br />

such as Common Criteria (EALn) or NIST are desired for<br />

the end product. Achieving such a certification usually<br />

requires providing a lot of material, sometimes including<br />

(semi-)formal verification reports or enabling source<br />

code/HDL review to the auditor. The SoC end user is totally<br />

dependent on the SoC manufacturer to assist in providing<br />

(and disclosing) this data, which in most cases will not<br />

happen. Even access to functional documentation is<br />

sometimes restricted to users or only available under NDA,<br />

which further decreases the trust level in these solutions.<br />

“Security through Obscurity” comes to mind.<br />

Unfortunately, this is not only a theoretical point: security<br />

problems in SoC hardware implementations that, with an<br />

open and public implementation, would probably have been<br />

detected within months are exposed on a regular basis in<br />

such obscure implementations (the last one known to the<br />

authors: [1]).<br />

• Ecosystem: the complete ecosystem (software stacks,<br />

drivers, manufacturing utilities) is provided by the SoC<br />

vendor and therefore single-source and supported only by<br />

one company. Sometimes SoC vendors are hesitant to<br />

integrate complete support for their solutions into u-boot or<br />

Linux mainline to avoid exposing too much knowledge<br />

about the actual implementation so that users are still<br />

required to stick with proprietary versions.<br />



1.3 TPMs (Trusted Platform Modules)<br />

In contrast to HSMs, which are internal to the SoC, Trusted<br />

Platform Modules (TPM) are external low-cost cryptographic<br />

modules. Trusted computing platforms may use a TPM to<br />

enhance privacy and security scenarios that software alone<br />

cannot achieve. A TPM offers the four primary capabilities<br />

Authentication, Platform Integrity, Secure Communication,<br />

and IP Protection. Depending on the version of the TPM,<br />

different cryptographic algorithms are implemented in the<br />

module. Additionally, TPMs include a small, secured nonvolatile<br />

memory that can be used by user space applications to<br />

store confidential information. Other units of the TPM can be<br />

used to implement policies to manage access to this memory.<br />

The hardware specification of TPMs is maintained by the<br />

Trusted Computing Group (TCG) [2] as a non-profit<br />

organization. The TCG also drove the specification to be<br />

accepted as international standard ISO/IEC 11889/15 which<br />

corresponds to TPM 2.0. All specifications as well as the<br />

according API are open to allow wide adoption and integration<br />

into any operating system and application software.<br />

[Figure: TPM Family 2.0 Block Diagram — I/O (I2C, LPC, SPI), Execution Engine, Key Generation, RNG, Power Detection, asymmetric/symmetric/hash engines, management and authorisation units, non-volatile memory (Platform, Endorsement and Storage Seeds, monotonic counters) and volatile memory (PCR banks, keys in use, sessions)]<br />

1.3.1 TPM Hardware and internal Firmware<br />

As of today, TPMs are mostly separate small hardware<br />

modules that are hardened against several forms of electrical,<br />

environmental and physical attacks to ensure that neither keys<br />

can be stolen nor the implemented cryptographic algorithms<br />

influenced to break the system security. Other<br />

implementations such as firmware TPMs are possible.<br />

In general, TPMs are passive components that neither<br />

measure nor monitor nor control anything directly on the system.<br />

They are physically connected by using standardized bus<br />

systems such as I2C, LPC or SPI to receive commands and<br />

send responses. So they cannot influence the host system actively, e.g.<br />

by stopping the execution of some kind of code on the host CPU<br />

or generating a system reset. The owner of the system even has<br />

the responsibility to manage the TPM by turning it on or off or<br />

to reset and initialize it. Unlike HSMs, TPMs are relatively slow<br />

and cannot be used as cryptographic accelerators for<br />

encryption. The bus system is usually speed limited and<br />

depending on the particular TPM implementation some<br />

commands have execution times of several seconds. The most<br />

relevant use cases are key generation, key encryption to store<br />

keys externally, key signature and certification, and last but not<br />

least improved random number generation.<br />

In this paper we will focus on TPM version 2.0 and skip<br />

version 1.2 for the simple reason that TPM 1.2 only implements<br />

SHA-1 as its cryptographic hash algorithm. As of today, there exist<br />

several serious attacks against SHA-1 [3] which leads to the<br />

conclusion that TPM 1.2 cannot be used to enhance the security<br />

of a system. Instead, in TPM 1.2 based security concepts, the<br />

TPM itself is now usually considered to be the weak point of that<br />

system. TPM 2.0 comprises all features of TPM 1.2 but with<br />

significant enhancements like an offering of several mandatory<br />

and optional algorithms instead of just a few isolated ones.<br />

As shown in the block diagram above, a TPM comprises<br />

several hardware blocks. An important block is the non-volatile<br />

memory. In the production process the TPM vendor programs<br />

four individual and unique primary seeds. Three of these are<br />

permanent ones which only change when the TPM2 is cleared:<br />

Endorsement (EPS), Platform (PPS) and Storage (SPS).<br />

Additionally, the TPM firmware implements a Key Derivation<br />

Function (KDF). A seed (which is simply a long random<br />

number) is hereby used as input to the KDF along with the key<br />

parameters and the algorithm to produce a key based on this<br />

seed. The KDF is deterministic, so if you input the same<br />

algorithm and the same parameters you will get the same key<br />

again. There’s also a Null seed, which is used for ephemeral keys<br />

and changes with every reboot, reset or power-on. Seeds<br />

are never exposed by the TPM. The simple, unprotected physical<br />

connection between the TPM and CPU invites the idea to snoop<br />

on that connection to explore exported and imported keys.<br />

However, key import or export is always handled as encrypted<br />

key blobs (e.g. using AES) to ensure that keys generated in the<br />

TPM are protected. Using the clear command destroys all keys<br />

generated based on EPS, PPS and SPS along with these keys<br />

themselves.<br />
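The deterministic, seed-based key derivation described above can be illustrated with a short sketch. This is not the actual TPM 2.0 KDFa (which follows NIST SP 800-108 and runs entirely inside the TPM); it is a simplified counter-mode HMAC construction with hypothetical labels and values, included only to show why derivation from a seed is repeatable:

```python
import hashlib
import hmac

def kdf(seed: bytes, label: bytes, context: bytes, n_bytes: int) -> bytes:
    """Simplified counter-mode HMAC KDF in the spirit of TPM 2.0's KDFa
    (NIST SP 800-108). The real TPM never exposes the seed; this sketch
    only illustrates why derivation is deterministic."""
    out = b""
    counter = 1
    while len(out) < n_bytes:
        msg = counter.to_bytes(4, "big") + label + b"\x00" + context
        out += hmac.new(seed, msg, hashlib.sha256).digest()
        counter += 1
    return out[:n_bytes]

seed = b"\x11" * 32          # stands in for e.g. the Storage Primary Seed
k1 = kdf(seed, b"STORAGE", b"key-params-A", 32)
k2 = kdf(seed, b"STORAGE", b"key-params-A", 32)
k3 = kdf(seed, b"STORAGE", b"key-params-B", 32)
assert k1 == k2      # same seed + same parameters -> same key
assert k1 != k3      # different parameters -> different key
```

Because the seed never leaves the TPM, an attacker who knows the key parameters still cannot reproduce a derived key.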

TPMs also contain several monotonic, un-resettable<br />

counters which can be used for instance to count the number of<br />

firmware updates or other events.<br />

There also exists a special volatile memory block on each<br />

TPM. This block comprises memory locations to store keys and<br />

session information. For TPM 2.0 this block also contains two<br />

banks of 24 Platform Configuration Registers (PCR) that can<br />

be used for measurement, which works as follows:<br />

• The PCRs cannot be written directly.<br />

• PCR 0 to 15 can only be extended by a value after an initial reset directly after power-on.<br />

• An individual reset of PCR 16 to 23 can be triggered by the user.<br />


The extend formula as calculated internally in the TPM is the<br />

following:<br />

PCR[i]_{n+1} := hash( PCR[i]_n || extend_value )<br />

The index i specifies which PCR register will be extended and n<br />

is the current state of the PCR register. PCR registers are usually<br />

extended multiple times with data like sets of code,<br />

configuration data or policies to calculate a measure of this data.<br />

In other words, the measurement value of a PCR after some<br />

extensions is a measure of the code which was used as extend<br />

values. Due to the nature of a strong hash function, the PCR<br />

values change significantly even for minor changes in the extend<br />

values. At the same time, it is close to impossible to calculate<br />

PCR extend values to achieve a desired PCR value. Comparing<br />

PCR values in certain system states (e.g. after boot up) versus<br />

saved reference values is therefore a way to assess the<br />

trust state of a system.<br />
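The extend-and-compare mechanism can be sketched in a few lines. This is a plain SHA-256 simulation of the formula above, not real TPM I/O, and the component names are made up:

```python
import hashlib

def pcr_extend(pcr: bytes, extend_value: bytes) -> bytes:
    # PCR[i]_{n+1} := hash(PCR[i]_n || extend_value)
    return hashlib.sha256(pcr + extend_value).digest()

pcr = b"\x00" * 32  # PCRs start zeroed after reset
for component in (b"first-stage bootloader", b"u-boot", b"kernel"):
    pcr = pcr_extend(pcr, hashlib.sha256(component).digest())
reference = pcr  # golden value saved for later comparison

# Any change in any stage yields a completely different final value:
tampered = b"\x00" * 32
for component in (b"first-stage bootloader", b"evil-u-boot", b"kernel"):
    tampered = pcr_extend(tampered, hashlib.sha256(component).digest())
assert tampered != reference
```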

Another important block of a TPM is the Execution Engine,<br />

a small internal MCU which executes the protocol stack for host<br />

communication and controls the asymmetric and symmetric<br />

engines, key generation, key entropy checking, module self-test<br />

and other operations. An additional power detection block<br />

monitors external events like power on and is an essential part<br />

of the tamper detection.<br />

Depending on the particular purpose of the system, the TCG<br />

publishes so-called Platform Profile specifications to define a<br />

mandatory set of capabilities of the TPM as well as optional<br />

extensions for certain use cases. One example is the PC Client<br />

Platform TPM Profile (PTP) Specification for TPM family 2.0<br />

(which can be used for Embedded Systems).<br />

The picture below shows mandatory and optional algorithms<br />

and curves for elliptic curve cryptography as defined in the<br />

PTP. Other profiles, e.g. for Automotive and Automotive Thin<br />

Clients applications exist as well.<br />

1.4 Why not simply use a software implementation?<br />

The following chapters will show that for many use cases,<br />

speed and security can be greatly improved when using hardware-<br />

assisted cryptographic modules. However, in recent ARMv8,<br />

Intel and AMD CPUs specific instructions have been<br />

implemented that help to speed up algorithms such as AES<br />

significantly [4]. So for mere speed reasons, or if the key is<br />

exposed to parts of the software or operating system anyway,<br />

using such optimized implementations is a real option. In this<br />

paper’s chapter about HSMs and TPMs on application level<br />

(chapter 4) we will also have a look at these software<br />

implementations and compare them with the ones in HSMs.<br />

However, the topic of the next chapter, establishing trust on an<br />

embedded system is a good example of a use case that highly<br />

profits from hardware implementations.<br />

2 TRUSTED SYSTEMS<br />

In this chapter, we take a closer look at possibilities of adding<br />

trust to embedded systems. This point is especially important for<br />

dynamically adaptable and network-connected systems (that<br />

might be attacked through this network), but it is also a sensible<br />

measure to prevent system crashes or malfunctioning<br />

software from harming the system. Examples of<br />

dynamically adaptable systems are systems that support remote<br />

updates, user software installation or user customization.<br />

When looking at trusted systems, the terms root of trust and<br />

chain of trust are important to understand so we look at these<br />

first.<br />

2.1 Root of Trust, Chain of Trust<br />

Similar to real life, trust in the embedded systems world has<br />

to be “earned”. However, unlike humans who have the luxury of<br />

taking time to develop their own assessment whom to trust, in<br />

the embedded world trust has to be immediate in most cases.<br />

Therefore, a concept called “chain of trust” is used, which is<br />

based on inheritance. It starts with an implicitly trusted<br />

component, the so called root of trust. This component then<br />

evaluates other components and decides if they can be trusted as<br />

well. If so, these components are added to the trust base, can be<br />

executed and can act themselves as new assessors of trust for<br />

other components. This way, a chain of trust is constructed<br />

which (ideally) results in a completely assessed and trusted<br />

system. The picture below illustrates a chain of trust based on<br />

the boot process of an embedded system.<br />



In this case, the ROM boot loader acts as the root of trust.<br />

Authentication of components (e.g. u-boot) is usually done<br />

based on hash values of the binary code of these components<br />

using either the complete component or selected (security<br />

relevant) parts of it. A root of trust is established by an<br />

unchangeable, implicitly trusted piece of code. Naturally, this<br />

code needs to be reviewed extensively for potential<br />

vulnerabilities and functional correctness and should therefore<br />

be kept as small as possible. In most SoC implementations<br />

supporting hardware authentication, the root of trust is generated<br />

in the ROM bootloader (or BIOS code, if applicable). Since the<br />

complete trust of a system is inherited from the root of trust, it<br />

stands and falls with its correctness. Therefore special diligence<br />

needs to be exercised when selecting and evaluating the root of<br />

trust. In the next chapter, we will look at specific hardware<br />

implementation concepts to authenticate components and their<br />

advantages and disadvantages.<br />

2.2 Authenticated and Encrypted Boot using HSMs<br />

The basic task of authenticating software is very simple:<br />

generate a hash value and compare it to a reference value of this<br />

hash. If they match, the software is authenticated. However, this<br />

prompts some questions with conflicting answers:<br />

• Where should the reference hash be stored? Since the reference hash is the foundation of the decision whether an image is valid, it should be stored in a secure, unmodifiable location, such as in fuse arrays.<br />

• What can I do if my software changes (updates) and the hash needs to be updated? To update the hash, it needs to be stored in a modifiable memory, such as flash.<br />

• How can I dynamically store multiple hashes to validate multiple images, and how can I assign a hash to an image? Fuse memory is usually limited and the assignment should be fixed for security reasons.<br />

The solution to these conflicting answers is using<br />

private/public key cryptography, which works as follows. To<br />

sign an image for a target system, a combination of private and<br />

public key is generated during production on a development PC.<br />

This private key is then used to encrypt the hash value, and<br />

the resulting signature is attached directly to the image along with the matching public<br />

key used to verify this encrypted hash. Due to the nature of<br />

private/public key cryptography, it is very simple to verify the<br />

encrypted hash using the public key, but practically impossible<br />

to guess the private key to generate such a signature (e.g. with<br />

the updated hash value after modifying an image). To ensure that<br />

attackers can’t just switch to a completely new pair of<br />

private/public key and authenticate their own images, a hash<br />

value of the public key is saved to an immutable internal<br />

memory location and checked against before using the public<br />

key. This methodology ensures that multiple images can be<br />

signed using the same private/public key with a fixed<br />

consumption of one-time programmable memory, as long as the<br />

signatures of the hashes are attached to their respective images.<br />

The scheme to the right shows the process of signing an image<br />

on the host PC in a secure environment. The blue and green<br />

blocks identify the cryptographic functions used, while yellow<br />

blocks represent actual keys that are generated and used in the<br />

process. After generating a pair of (secret) private and (publicly<br />

available) public key, the private key is used to sign the hash<br />

value of the software image to be authenticated. Usually,<br />

SHA-256 is used as the hash function and RSA for private/public<br />

key cryptography. The generated signature is then attached to<br />

the image along with the public key and the complete binary is<br />

written to flash memory. Using the attached public key, the<br />

image signature can then be authenticated by the SoC. To prevent<br />

attackers from simply replacing the pair of private and public key<br />

with their own and using these to sign a new (modified)<br />

image, a hash of the public key to be used is generated and<br />

burned to the SoC fuses (one-time programmable memory).<br />

On the SoC side, this process is simply inverted as shown in<br />

the image on the next page. After extracting signature and public<br />

key from flash, the signature is checked against the calculated<br />

hash of the image. If this succeeds and the hash of the public key<br />

matches the one saved in the fuses, the authentication succeeds<br />

and the image is booted.<br />
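The sign-and-verify flow can be illustrated with a deliberately tiny textbook-RSA sketch. The toy key sizes and the absence of padding exist purely to show the roles of the private key, the public key and the fused public-key hash; a real implementation uses 2048+ bit RSA with proper padding (e.g. PKCS#1):

```python
import hashlib

# Toy textbook RSA with tiny primes -- for illustration only.
p, q = 61, 53
n, e = p * q, 17
d = pow(e, -1, (p - 1) * (q - 1))      # private exponent

def sign(image: bytes) -> list:
    digest = hashlib.sha256(image).digest()
    return [pow(b, d, n) for b in digest]   # "encrypt" hash with private key

def verify(image: bytes, signature: list, pubkey=(n, e)) -> bool:
    digest = hashlib.sha256(image).digest()
    return [pow(s, pubkey[1], pubkey[0]) for s in signature] == list(digest)

# A hash of the public key is burned to fuses at production time:
fused_pubkey_hash = hashlib.sha256(f"{n},{e}".encode()).digest()

image = b"u-boot image"
sig = sign(image)
# Boot-time checks: pin the public key, then verify the signature.
assert hashlib.sha256(f"{n},{e}".encode()).digest() == fused_pubkey_hash
assert verify(image, sig)                  # authentic image boots
assert not verify(b"modified image", sig)  # tampered image is rejected
```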

To make this process of authenticating an image as attack-safe<br />

as possible, it is usually implemented in hardware, using<br />

Finite State Machines (FSMs) or small, completely isolated<br />

controllers. NXP’s CAAM engine is an example of such<br />

hardware, which can perform authentication in a completely<br />

automated way. Highly simplified, after being told where the<br />

respective image to authenticate is located, it performs the<br />

authentications and then updates the chip “secure” state. If the<br />

images has been authenticated successfully, it keeps the state in<br />

“secure”, otherwise it changes it to “unsecure”. Once the state<br />

has been changed to “unsecure” it can’t be changed back to<br />

“secure” without a hard reset. This way, a chain of trust is<br />

established. The authentication can be invoked from software<br />

(e.g. ROM bootloader or u-boot) by jumping into functions<br />



located in the SoC’s ROM with some registers informing about<br />

the current trust state.<br />

If encryption for the images is added to the boot process, an<br />

additional step is required which involves encrypting the image<br />

key with an integrated device key, therefore generating a<br />

cryptographic blob which can also be attached to this image.<br />

This process allows users to select their individual key, but still<br />

keeps security at a high level by encrypting this key with unique,<br />

random keys not known to anybody and specific to the device.<br />

In other words, if two devices boot images which are encrypted<br />

with the same key, the cryptographic blobs of these keys are<br />

different, while the encrypted image itself is the same.<br />
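The device-specific blobbing of a shared image key can be sketched as follows. A toy XOR "wrap" stands in for the real AES-based blob mechanism, and the device keys here are random stand-ins for the fused, device-unique keys:

```python
import hashlib
import os

def toy_wrap(image_key: bytes, device_key: bytes) -> bytes:
    """Toy key-blob wrap: XOR with a hash-derived keystream. Real SoCs
    use an AES-based blob mechanism keyed by a fused device-unique key."""
    stream = hashlib.sha256(device_key + b"blob").digest()
    return bytes(a ^ b for a, b in zip(image_key, stream))

image_key = os.urandom(32)                       # OEM-chosen image encryption key
dev_a, dev_b = os.urandom(32), os.urandom(32)    # device-unique fused keys

blob_a = toy_wrap(image_key, dev_a)
blob_b = toy_wrap(image_key, dev_b)
assert blob_a != blob_b                      # same image key, device-specific blobs
assert toy_wrap(blob_a, dev_a) == image_key  # XOR wrap is its own inverse
```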

This process greatly simplifies remote updates of images,<br />

because it allows OEMs to use the same private key and<br />

encryption key on multiple images, but still minimizes the attack<br />

surface for parallelized brute-force attacks.<br />

This high level of security is achieved by using hardware<br />

security units. In pure software, it would be impossible to<br />

realize, starting with the inability of processor cores to boot<br />

encrypted code, which would imply keeping some code<br />

unencrypted and unsigned. As a consequence, this code and the<br />

keys used for accessing and verifying flash program code would<br />

have to be stored on protected, one-time programmable<br />

memories, making it expensive and hard to update and adapt to<br />

exposed security leaks. Another advantage of using hardware<br />

security for authentication is the possibility to integrate these<br />

units with other security units on the SoC. For example, the<br />

symmetric units can also be used to export/import user specific<br />

keys or user data into cryptographic blobs. Upon detected<br />

attacks or images not being authenticated, internal key memories<br />

can be cleared and access to critical IP modules can be<br />

prevented.<br />

If this method is implemented in a functionally correct way, it<br />

is the most practical and secure way of establishing a root of<br />

trust. However, it also has some disadvantages:<br />

• License Restrictions: Some open source licenses (e.g.<br />

GPLv3) impose the requirement on a system that users<br />

should be allowed to completely exchange the operating<br />

system on these devices. [5] for example states:<br />

“When it comes to security measures governing a<br />

computer’s boot process, GPLv3’s terms lead to one<br />

simple requirement: Provide clear instructions and<br />

functionality for users to disable or fully modify any boot<br />

restrictions, so that they will be able to install and run a<br />

modified version of any GPLv3-covered software on the<br />

system.”<br />

Because HSM-assisted trusted boot cannot be disabled in<br />

hardware once enabled, measures (e.g. using an<br />

intermediate boot loader) need to be implemented to ensure<br />

that this is possible. Contact the authors for more<br />

information on this topic.<br />

• Trust / Certification issues: as mentioned in chapter 1.2.<br />

• Little flexibility for bug fixing: Since the implementation<br />

is completely in hardware with no software interaction, it<br />

offers little flexibility of later improvement or bug fixing if<br />

errors are detected or changes have to be made due to<br />

exposed problems or security holes in later system updates.<br />

2.3 Measured Boot with TPM<br />

As mentioned in the last chapter, depending on the profile<br />

specification the PCR extend functionality can be used to<br />

measure dedicated parts of code, configuration data and policies.<br />

This functionality can for example be used to measure the<br />

components involved in the boot of a system. The table below<br />

shows how PCRs are defined to be used on UEFI-enabled x86<br />

systems and a recommendation of how to adapt this usage on ARM<br />

systems with GNU/Linux.<br />

PCR    PCR usage (UEFI + x86)                                                       PCR usage (ex. for ARM)<br />
0      SRTM, BIOS, Host Platform Extensions, Embedded Option ROMs and PI Drivers    Boot ROM, if accessible<br />
1      Host Platform Configuration                                                  First Stage Boot Loader<br />
2      UEFI driver and application Code                                             u-boot (including SPL)<br />
3      UEFI driver and application Configuration and Data                           u-boot Environment<br />
4      UEFI Boot Manager Code (usually the MBR) and Boot Attempts                   GPT/Partition Table<br />
5      Boot Manager Code Configuration and Data (for use by the Boot Manager Code)<br />
6      Host Platform Manufacturer Specific<br />
7      Secure Boot Policy                                                           Linux IMA<br />
8-15   Defined for use by the Static OS                                             Linux IMA<br />
16     Debug<br />

It is important to note, that all software code performing<br />

measuring operations (with or without using TPMs) has to be<br />

authenticated to be part in the chain of trust, otherwise the<br />

measurements performed by it can’t be trusted. TPMs by<br />

themselves are not able to establish a root of trust, since they<br />



(unlike HSMs) need to be controlled by software code running<br />

on the SoC (chicken/egg problem). So replacing HSM<br />

authentication with TPM measurement functionality can only<br />

start with the first code that allows the TPM to be accessed and<br />

can be authenticated by outside means. We will look into ways<br />

how to mitigate this fact for more complex systems in chapter<br />

2.4. Another important thing to know is that PCR extends of big<br />

chunks of code usually imply a large time penalty due to the<br />

limited operating speed of TPMs. Therefore, most of the time,<br />

the preferred approach would be to authenticate a software<br />

algorithm (utilizing trusted HSMs) to calculate a hash of the<br />

chunk in question and then extend the TPM PCR with the result<br />

of this calculation.<br />
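That hash-then-extend approach can be sketched like this, with plain SHA-256 standing in for both the host/HSM-side hashing and the actual TPM extend command:

```python
import hashlib

def pcr_extend(pcr: bytes, value: bytes) -> bytes:
    # Simulated TPM PCR extend: PCR' = hash(PCR || value)
    return hashlib.sha256(pcr + value).digest()

# A large boot payload (e.g. a kernel image) is hashed by the fast,
# already-authenticated host-side code (or HSM) ...
payload = b"\xAB" * (8 * 1024 * 1024)          # stand-in for an 8 MiB kernel
digest = hashlib.sha256(payload).digest()

# ... and only the 32-byte digest crosses the slow TPM bus:
pcr = pcr_extend(b"\x00" * 32, digest)
assert len(digest) == 32 and len(pcr) == 32
```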

No complete, adaptable and re-usable implementation to<br />

support measured boot across all boot stages exists to the<br />

knowledge of the authors. So in the remainder of this subchapter,<br />

we will outline the current state of TPM support in the most<br />

relevant components of the boot chain.<br />

Naturally, the ideal place to start TPM measurement would<br />

be the ROM bootloader implemented by SoC vendors which<br />

would come close to having an (immutable) root of trust, as long<br />

as the vendor in question is trusted or the boot loader code is<br />

publicly available to check and certify. However, at the time of<br />

writing this, no SoC known to the author offers TPM support in<br />

the ROM bootloader. However, some SoC vendors offer<br />

support to modify and extend their first stage bootloader (FSBL)<br />

implementation, so some basic routines for TPM initialization<br />

and PCR-based measurement can be added there. If this is not<br />

possible, u-boot (or u-boot’s SPL) would be the first stage in<br />

which measurement can be started for components following in<br />

the chain of trust. This implies, that HSM-based methods have<br />

to be used before to authenticate u-boot itself and previous<br />

components in the chain of trust.<br />

Mainline U-Boot currently only offers support for self-test, provisioning, un-provisioning and PCR extension for TPM 1.2. Advanced features such as integrated measurement of payload data in U-Boot and support for TPM 2.0 are missing. The authors have implemented this support and are evaluating the solution in internal and customer projects. The solution currently supports automated measurement of boot payloads such as the Linux kernel and of configuration data such as device tree binaries, initial RAM disks or complete FIT images. This makes it possible to measure and boot an initial GNU/Linux system. Please get in contact if you are interested in using or contributing.<br />

To support selective measurement and remote attestation in a running GNU/Linux system, the kernel's Integrity Subsystem was introduced in Linux 2.6.30 (with extensions in 3.3 and 3.7); it can be used to implement measured boot of the kernel. Essentially, it allows detecting whether files have been accidentally or maliciously modified, both locally and remotely. File measurements can be appraised against a known "good" value stored as an extended attribute in the filesystem, or via a server over a secure IPsec connection ("remote attestation"). IMA uses the Extended Verification Module (EVM) to guarantee the integrity of the binding between a file and its extended attributes. IMA currently offers the following integrity functions:<br />

Collect: measure a file before it is accessed.<br />
Store: add the measurement to a kernel-resident list and, if a hardware Trusted Platform Module (TPM) is present, extend the IMA PCR.<br />
Attest: if present, use the TPM to sign the IMA PCR value, to allow remote validation of the measurement list.<br />
Appraise: enforce local validation of a measurement against a "good" value, stored as a hash or signature in the 'security.ima' extended attribute of the file, protected by EVM.<br />
Protect: protect a file's security extended attributes (including the appraisal hash) against off-line attack.<br />

Hash values are extended into PCR 10, so the final aggregate hash in PCR 10 is the record of the state of the measured files and directories, for example after booting. IMA also offers some built-in policies that can be enabled on the kernel command line. The kernel log contains all information about which files have been appraised and which executables have been started. The ima-evm-utils package provides the tool evmctl, which can be used to produce and verify digital signatures and to store them into the extended attributes used by IMA. For further information on the configuration of IMA see the official documentation [6] or contact the authors.<br />
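The Collect/Appraise pair can be illustrated with a toy model (a plain dict stands in for the 'security.ima' extended attribute, and hashing happens explicitly; real IMA hooks file access inside the kernel):

```python
import hashlib
import os
import tempfile

def measure(path: str) -> str:
    """Collect: hash the file content before it would be used."""
    with open(path, "rb") as f:
        return hashlib.sha256(f.read()).hexdigest()

# stand-in for the known-good values IMA keeps in security.ima
good_values = {}

fd, path = tempfile.mkstemp()
with os.fdopen(fd, "wb") as f:
    f.write(b"trusted payload")
good_values[path] = measure(path)

appraise_ok = measure(path) == good_values[path]   # unmodified file passes

with open(path, "ab") as f:                        # simulate tampering
    f.write(b"!")
appraise_after_tamper = measure(path) == good_values[path]
os.unlink(path)
```

Any modification of the file changes its digest, so the appraisal against the stored value fails.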

2.4 Local and Remote Attestation<br />

Between the two extremes of purely local and purely remote attestation, other ways of verifying system correctness are possible that combine some of the advantages of both worlds. Remote attestation provides the best security and is highly immune to local attacks, because the actual attestation is done on a remote system. However, it relies on a working communication channel to the attestation server and, even more fundamentally, on the attestation server itself. Not all application use cases allow remote connections, and highly available servers are costly and require significant maintenance while still posing a considerable security risk themselves.<br />

Local attestation, on the other hand, e.g. as performed by IMA's Appraise function, implies trusting the local self-protection capabilities of the attestation software itself and its functional correctness. This means that it needs to be authenticated by a component in the chain of trust (e.g. a minimal system contained in an initial ramdisk). This approach is inflexible, time-consuming and complicated for components as dynamic as an embedded Linux system, and it also provides no run-time protection of the authenticating system after boot.<br />

An intermediate path between these approaches is to use a secure, trusted and isolated component on the local system to authenticate the remainder of it. Two ways of implementing this come to mind: using ARM TrustZone or using an embedded hypervisor. Both technologies rely on hardware features in modern processors to isolate trusted from untrusted software and to restrict the access of untrusted components to trusted ones. Embedded hypervisors have been described in detail in [7], more information about TrustZone can be found in [8], and a trusted OS implementation is described in [9]. Hypervisors offer far more advanced methods to implement attestation, but the basic concepts are the same and rely on such a trusted component with a minimal Trusted Computing Base (TCB) [7], which is security-certified and provides attestation services to the rest of the system. This component is installed on the system and, after authentication by the root of trust, acts as a new, flexibly expandable measurement and authentication agent that can remain unchanged even when functionality or updates are added to the system, thereby also allowing less flexible roots of trust to be used for the initial authentication.<br />
be used for initial authentication.<br />

Different concepts for performing the measurement are imaginable, with increasing security but also increasing complexity.<br />

a) Triggered<br />

In this concept, the trusted component provides cryptographic measurement and attestation services to the untrusted system. Instead of performing the measurement itself, the untrusted system requests it as a service from the trusted component. The advantage is that a common, simple API can be used which requires no knowledge of the underlying hardware (e.g. the TPM) on the untrusted side and is highly portable and self-contained. All information about expected measurement values and other secrets such as keys is kept securely within the trusted component. For example, after loading a payload (such as the Linux kernel) into memory, the bootloader could request an attestation of this payload from the trusted component (providing the memory location, size and potentially also the start address). The trusted component then measures the payload and starts it if the measurement matches the expected value.<br />

To increase security, the payload can be encrypted with a key that is sealed to this PCR measurement value in the TPM. The trusted component would then measure the payload and afterwards try to retrieve the key. If the measurement was correct, the key is available to decrypt the payload in memory and to initiate booting it on the untrusted side. If the measurement was not correct, the payload cannot be decrypted and booting it is therefore impossible.<br />

The biggest advantage of this concept is its flexibility when system updates are performed. In this case, the trusted component retrieves the new images via a secure connection, places them into (publicly accessible) memory and performs a reference measurement to obtain the new PCR values to which the keys are sealed.<br />

Alternatively, if the channel through which the update image is retrieved is insecure, it is sufficient to receive the new reference measurement value through a secure (e.g. mutually authenticated) connection. If the image is compromised while being fetched from the public update source, it will not match the measurement, the TPM will not unseal the key, and decryption is therefore impossible. In this case, a new update will be triggered by the secure component.<br />

The secure component uses the TPM to securely store keys and measurement reference values, so that secrets cannot be extracted from mass storage. Only after the secure component itself has been authenticated and measured will the TPM reveal secrets, and only if the correct measurements and PCR extends are performed will further secrets be made available by the TPM to continue the boot process into subsequent stages.<br />
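The seal/unseal idea can be modelled as follows (a deliberately toy construction: the key is masked with a pad derived from the PCR value; a real TPM keeps the sealing material inside the chip and enforces the PCR policy itself):

```python
import hashlib
import os

def _pad(pcr: bytes) -> bytes:
    # derive a 32-byte mask from the PCR value
    return hashlib.sha256(b"seal-pad" + pcr).digest()

def seal(key: bytes, pcr: bytes) -> bytes:
    """Bind a 32-byte key to a PCR value (toy model)."""
    return bytes(a ^ b for a, b in zip(key, _pad(pcr)))

def unseal(blob: bytes, pcr: bytes) -> bytes:
    """Only the PCR value used at seal time recovers the key."""
    return bytes(a ^ b for a, b in zip(blob, _pad(pcr)))

good_pcr = hashlib.sha256(b"expected payload").digest()
key = os.urandom(32)
blob = seal(key, good_pcr)

recovered = unseal(blob, good_pcr)                          # correct measurement
wrong = unseal(blob, hashlib.sha256(b"tampered").digest())  # wrong measurement
```

With the correct measurement the original key comes back; with any other PCR value the result is garbage and the payload stays encrypted.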

b) Active<br />

While in the triggered measurement scenario the measurement is initiated from the insecure side, active measurement allows the secure component to initiate measurements on the insecure side. This works similarly to remote attestation, using quote requests and expecting quotes from the attested system. The only difference is that the secure component acts as the (implicitly trusted) attestation server; all concepts known from remote attestation (such as including nonces in the quote request) can be used here.<br />

c) Observing<br />

Observing attestation takes the active attestation concept one step further and is an interesting option for embedded hypervisors. Instead of communicating with the payload virtual machine, it measures the VM's contents from the outside. This happens entirely without payload VM interaction and leverages the access rights of the VM's virtual machine monitor (VMM) [7]. The main advantage is that no modifications (hooks, functions, channels) have to be made to the payload virtual machine itself, and the software running in the virtual machine cannot even know that it is being measured. The main drawbacks are the implementation complexity and a potential impact on the performance of the virtual machine during measurement, so the measurement times need to be carefully selected; alternatively, time-critical tasks the virtual machine is performing need to be shifted to another VM while the measurement runs. A further major advantage is that the actually executed memory content is measured instead of an image before execution. Some SoCs offer hardware support for performing the actual measurement (programmable hash engines with DMA support), which, if these engines are trusted, can be leveraged to accelerate it. Furthermore, the hypervisor can use a TPM to store measurements, reference values and the encryption keys that might be necessary to access the VM.<br />

2.5 Software Implementations for File System Security<br />

Besides the very hardware-centric (HSM and TPM) techniques of authentication and measurement mentioned above, there are various methods in a GNU/Linux system to enhance security through software-based authentication and encryption. Some of these include support for hardware crypto acceleration (explicitly or implicitly through the kernel crypto API) and are therefore mentioned briefly in this chapter.<br />

The Linux kernel supports various methods to implement block-level integrity protection, such as DM-Verity [10] and DM-Integrity [11].<br />

DM-Verity uses a cryptographic hash tree to authenticate block devices. The hashes are computed using kernel crypto services and thereby leverage HSMs through the kernel's crypto API. DM-Verity can only verify read-only partitions; updates or changes to such a partition therefore require a new integrity setup. It is thus suited to system partitions that are supposed to remain unchanged.<br />
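The hash-tree idea behind DM-Verity can be sketched as a minimal Merkle tree over fixed-size blocks (the real on-disk format, salt and superblock are omitted):

```python
import hashlib

BLOCK_SIZE = 4096

def hash_tree_root(device: bytes) -> bytes:
    """Hash every block, then hash pairs of digests until one root remains."""
    level = [hashlib.sha256(device[i:i + BLOCK_SIZE]).digest()
             for i in range(0, max(len(device), 1), BLOCK_SIZE)]
    while len(level) > 1:
        level = [hashlib.sha256(b"".join(level[i:i + 2])).digest()
                 for i in range(0, len(level), 2)]
    return level[0]

image = bytes(3 * BLOCK_SIZE)        # stand-in for a read-only partition
root = hash_tree_root(image)         # stored and signed out of band

tampered = bytearray(image)
tampered[5000] ^= 1                  # flip one bit in the second block
tamper_detected = hash_tree_root(bytes(tampered)) != root
```

Flipping a single bit anywhere in the device changes the root digest, which is why a signed root hash is enough to authenticate the whole partition.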

DM-Integrity, which was recently merged into mainline Linux with kernel 4.12, supports the authentication of read/write partitions and supports journaling. It also leverages the kernel crypto API and therefore profits from HSM acceleration.<br />



DM-Crypt [12] provides support for encrypting block devices on GNU/Linux and works with both DM-Verity and DM-Integrity. LUKS (Linux Unified Key Setup) [13] provides some extensions to it, mostly to simplify key handling. DM-Crypt uses the Linux kernel cryptographic API and is therefore able to leverage HSMs. The extension tpm2_luks [14] can be used to store keys in TPM 2.0 modules and to seal access to a key to specific PCR values, so that keys can only be accessed in safe, measured states.<br />

2.6 Conclusion: secure / authenticated / measured boot<br />

In the past subchapters we have looked at both authenticated boot using on-chip HSMs and measured boot using TPMs to complement purely software-based approaches. But what is the "ideal" method to implement trusted systems? The following table summarizes the advantages and disadvantages of both methods:<br />

Aspect | On-chip HSM | TPM | Pure software<br />
boot speed (also see next chapter) | ++ | -- | + (ARMv7) to ++ (ARMv8)<br />
can establish root of trust | + | - | -<br />
ease of use | + | + | ++<br />
open-source support | - | + | ++<br />
standardization / vendor independence | -- | + | -<br />
pre-certification | -- | ++ | -<br />
financial costs | + | -- | ++<br />
integration into SoC security system | ++ | - | --<br />
algorithm flexibility | - | + (TPM 2.0) | ++<br />
updates / security fixes possible | -- | + | ++<br />

For most of today's applications, the preference of system architects is to leverage TPMs as far as possible. The major problems with using TPMs for a complete authentication flow are their limited speed and their inability to establish a root of trust without a remote connection for attestation. This, however, means that HSM-based methods have to be used anyway to complement TPMs. A common approach is therefore to use HSMs where they are absolutely required (root of trust) or show significant advantages (integration into the SoC security system) and otherwise to compensate for the TPM's disadvantages using (TPM-authenticated) software routines, which allows for higher flexibility and easier certification of the algorithms. Some recent security exposures suggest that this approach is the right one: NXP's High Assurance Boot had security flaws that are hard to mitigate in deployed systems. At about the same time, a problem with RSA key generation in Infineon's SLB9670 TPM was exposed. That problem could be fixed with a simple firmware update (because the TPM runs software on a specially protected microprocessor) and took much less time to be detected due to the wide use of this TPM.<br />

The table below shows an example of how an authenticated boot flow could be implemented, leveraging both HSMs and TPMs.<br />

Step | Actions | Payload<br />
ROM bootloader (no changes possible) | authenticate u-boot (including hash algorithms for TPM) using integrated HSM hardware support | u-boot (incl. TPM support)<br />
u-boot | initialize TPM; measure u-boot binary in memory and extend PCR (alternatively: remote attestation); load FIT image to RAM; authenticate FIT image using certified hash algorithm and extend PCR | FIT image (kernel, device tree, initial RAM disk)<br />
Linux kernel | initialize HSMs and TPM support; load initial RAMdisk; enable IMA kernel measurement | initial RAMdisk<br />
Initial RAMdisk | set up dm-integrity and dm-crypt with keys from TPM (protected by PCR state); mount encrypted, authenticated root filesystem partition | root filesystem partition (block device)<br />
Application level | set up HSMs/SW engines using CryptoDev; set up access to key storage and crypto functions in TPM (protected by IMA / PCR sealing) | see next chapter<br />

Using hypervisors to provide authentication mechanisms to virtual machines offers advanced possibilities at the cost of increased complexity. However, for dynamic systems that can also leverage other advantages of hypervisors, this concept should be taken into consideration.<br />

3 HSMS AND TPMS ON APPLICATION LEVEL<br />

In the last chapter we evaluated how hardware security modules are used to authenticate a system during boot and to assure that it can be trusted. However, cryptography is also a common requirement at application level. Some common tasks include:<br />
<br />
- Encrypting/decrypting sensitive data<br />
- Key storage<br />
- IPsec tunnelling<br />



We start with an overview of how HSMs and TPMs are used at user-space level in GNU/Linux. Then we look into the specific support for some of these scenarios.<br />

3.1 HSM on Application Level<br />

In this chapter, we describe HSM support and benchmarking<br />

in GNU/Linux for kernel and userspace.<br />

3.1.1 Hardware<br />

The HSMs of recent SoCs can be leveraged as cryptographic accelerators by user applications. We have analysed two specific implementations: NXP's CAAM and the HSM used in recent Marvell SoCs, the SafeXcel EIP-197. These units support acceleration of all major cryptographic algorithms, including AES, DES/3DES, RC4, MD5 and SHA-256, as well as advanced message authentication and authenticated encryption algorithms. They include a secure, NIST-certified random number generator, support the import and export of cryptographic blobs to DDR or flash memory, and provide both DMA support and memory-mapped slave interfaces. Internally, the units are accessed through memory-mapped portals (job rings), which are essentially FIFOs that can be loaded with cryptographic job descriptors. Highly simplified, a job descriptor is a structure describing the cryptographic task to be performed ("encrypt using AES-256"), the key to be used ("key #2 from internal key storage") and the payload to perform the task on.<br />
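A highly simplified model of a job ring and its descriptors (the field names are illustrative, not the actual CAAM or EIP-197 descriptor layout):

```python
from collections import deque
from dataclasses import dataclass

@dataclass
class JobDescriptor:
    operation: str   # task to perform, e.g. "encrypt-aes256"
    key_ref: int     # e.g. slot number in the internal key storage
    payload: bytes   # data to process (in hardware: a DMA address and length)

# the job ring behaves like a FIFO the CPU fills and the HSM drains
job_ring = deque()
job_ring.append(JobDescriptor("encrypt-aes256", key_ref=2, payload=b"plaintext"))
job_ring.append(JobDescriptor("hash-sha256", key_ref=0, payload=b"image"))

first = job_ring.popleft()   # the HSM processes jobs in submission order
```

The per-job descriptor handling is also where the fixed overhead comes from that the benchmarks in section 3.1.3 make visible for small block sizes.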

3.1.2 GNU/Linux support<br />

For both units, mainline Linux driver support is available. Enabling them at kernel level is a matter of adding the unit to the device tree and enabling and loading the drivers. In the case of the SafeXcel EIP-197, a binary firmware also has to be provided to the driver and loaded into the unit. Successful enablement of the HSM adds it to the kernel crypto API, so that the cryptographic services can be used from kernel space, e.g. to provide IPsec support. This can be checked by analysing the output of /proc/crypto, which lists all available algorithms and services along with their priority of use. An example is shown below with two different SHA-256 implementations: first the implementation provided by the SafeXcel EIP-197, then one provided by the kernel itself, realized in software using the ARM 64-bit NEON crypto extensions. Note the different priorities.<br />

HSM implementation:<br />
name : sha256<br />
driver : safexcel-sha256<br />
module : crypto_safexcel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : ahash<br />
async : yes<br />
blocksize : 64<br />
digestsize : 32<br />
<br />
SW implementation:<br />
name : sha256<br />
driver : sha256-arm64-neon<br />
module : kernel<br />
priority : 150<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : shash<br />
blocksize : 64<br />
digestsize : 32<br />

So in this example, if the "sha256" algorithm is requested, the implementation provided by the SafeXcel HSM will be used because it has the higher priority.<br />
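The priority-based selection can be reproduced by parsing /proc/crypto-style output (a small sketch; here the input is an inline sample rather than the live file):

```python
def best_driver(proc_crypto: str, algo: str):
    """Return the driver with the highest priority for a given algorithm name."""
    best, best_prio = None, -1
    entry = {}
    for line in proc_crypto.splitlines() + [""]:
        if ":" in line:
            key, value = (part.strip() for part in line.split(":", 1))
            entry[key] = value
        else:  # a blank line ends an entry
            if entry.get("name") == algo and int(entry.get("priority", "0")) > best_prio:
                best, best_prio = entry["driver"], int(entry["priority"])
            entry = {}
    return best

sample = """name : sha256
driver : safexcel-sha256
priority : 300

name : sha256
driver : sha256-arm64-neon
priority : 150
"""
```

Running `best_driver(sample, "sha256")` selects the HSM entry, mirroring what the kernel crypto API does internally when an algorithm is requested by name.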

In the second example below, for AES-CBC, two implementations have the same priority. The easiest way to select which one is used is to unload the module providing the undesired implementation or to deselect it in the kernel build configuration. However, there are more sophisticated ways to select specific implementations of an algorithm at a finer granularity (contact the authors for details).<br />

HSM implementation:<br />
name : cbc(aes)<br />
driver : safexcel-cbc-aes<br />
module : crypto_safexcel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : skcipher<br />
async : yes<br />
blocksize : 16<br />
min keysize : 16<br />
max keysize : 32<br />
ivsize : 16<br />
chunksize : 16<br />
walksize : 16<br />
<br />
SW implementation:<br />
name : cbc(aes)<br />
driver : cbc-aes-ce<br />
module : kernel<br />
priority : 300<br />
refcnt : 1<br />
selftest : passed<br />
internal : no<br />
type : skcipher<br />
async : yes<br />
blocksize : 16<br />
min keysize : 16<br />
max keysize : 32<br />
ivsize : 16<br />
chunksize : 16<br />
walksize : 16<br />

But how can kernel cryptographic services be accessed from user space? Several ways exist, e.g. AF_ALG [15], which provides services through sockets, or CryptoDev [16], which implements an OpenBSD-compatible /dev/crypto device. Both methods have their advantages for specific applications; in this paper we take a closer look at CryptoDev.<br />

CryptoDev is provided as an out-of-tree kernel module; support to build and include it is provided, for example, through the Yocto Project, along with patches to work with recent Linux kernels. When loaded, it provides a new character device, /dev/crypto, which can be used to access cryptographic services from user space. Plenty of examples illustrate how to use CryptoDev in applications [17], and it is also supported by well-known crypto libraries such as GnuTLS [18] and OpenSSL [19].<br />

3.1.3 Performance / Benchmarking<br />

To get a first impression of the speed an HSM can provide, it should be benchmarked from user space. OpenSSL is a great way to perform speed tests and comparisons between various engines and implementations using HSMs, CPU crypto-acceleration functions or plain software algorithms. To do this, it should be compiled with HAVE_CRYPTODEV and USE_CRYPTODEV_DIGESTS defined, which Yocto does automatically if the CryptoDev recipe is included in IMAGE_INSTALL. The command<br />
<br />
openssl speed -engine cryptodev -elapsed -evp [cipher]<br />
<br />
starts the benchmark of a specific cipher algorithm implementation accelerated by an HSM supported by CryptoDev. The parameter -elapsed is important to get realistic results: it tells OpenSSL to measure the time actually required to get the answer back from the engine (as opposed to the internal time spent just loading the engine and processing the results).<br />

Diagram 1 in the appendix shows the performance of NXP's CAAM engine on an ARMv7 Layerscape LS1021 device for AES-128 and AES-256, compared to running in software on an ARM Cortex-A7 at 1.2 GHz.<br />

Several observations are interesting to note and generally true for most ARMv7 SoCs:<br />
<br />
- The HSM shows a significant speed advantage over the software implementation for larger block sizes (up to 10 times).<br />
- For smaller block sizes (below 512 bytes), the overhead of loading the engine with job descriptors and the memory transfer latencies involved is so high that the achieved throughput is lower than in software.<br />
- Even for small block sizes, using the HSM might still be preferable to save CPU cycles and because of its better energy efficiency.<br />
- The HSM only achieves optimal results when used by multiple threads (OpenSSL option: -multi n). In other words, a single thread cannot exploit the engine's maximum performance.<br />
- CPU load (for providing data to the HSM, interface handling and housekeeping) is about 40% to 50% of the load compared to actually executing the algorithm in software (setup: CryptoDev, OpenSSL). By using the HSM, CPUs can be relieved of crypto load and better total system energy efficiency can be achieved.<br />

Diagram 2 in the appendix shows the same benchmark performed on two 64-bit SoCs (Marvell Armada 8040 with SafeXcel EIP-197 HSM and NXP LS1046 with CAAM HSM). The core frequencies were limited to 1 GHz to achieve comparable results.<br />

The following observations are interesting to note and generally true for most ARMv8 SoCs:<br />
<br />
- The ARMv8 crypto extensions show a significant speed advantage over the HSM implementation for all block sizes.<br />
- For larger block sizes, the HSM implementation can reach speeds similar to the single-core software implementation when multiple threads are used to push data into the HSM; additional threads do not increase the crypto speed further.<br />
- The HSM implementation shows little speed degradation compared to software implementations for increasing key lengths on AES and other algorithms, profiting from parallel hardware implementations.<br />
- CPU load (for providing data to the HSM, interface handling and housekeeping) is about 40-50% of the load compared to actually executing the algorithm (setup: CryptoDev, OpenSSL). From a performance point of view, using the HSM from user space is therefore not recommended. Example: on the NXP LS1046, for 8 KiB block sizes, two CPUs have to be loaded to 70% each to achieve an AES throughput through the CAAM engine equivalent to running the algorithm on one single core (90% load). For smaller block sizes, the throughput achieved in software cannot be reached on the CAAM at all.<br />
- The use of HSMs might still be desired because of their better energy efficiency, or if performance is not relevant.<br />
- The main advantage of HSMs is realized when they are used by other hardware engines, such as the Ethernet offloading system (DPAA/DPAA2 on NXP SoCs).<br />

As the diagrams and further experimental results (available through the authors) show, for most algorithms there is a break-even block size at which the HSM implementation starts to become more efficient (in terms of energy, speed or total system load) than a software implementation. This break-even point has to be determined from experimental results. From an optimization point of view, it is then possible (e.g. in CryptoDev) to distribute tasks to either the software or the hardware engine, depending on these break-even points. In a more advanced design (implemented either in CryptoDev or directly in the kernel crypto framework), a concept can be implemented that distributes crypto operations across both HSMs and software implementations (e.g. pinned to crypto-designated cores) to build a speed- and energy-optimized high-end crypto system. Please contact the authors for more information on this.<br />
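The break-even dispatch can be sketched with a simple throughput model (the constants below are illustrative assumptions, not measured values for any specific SoC):

```python
def throughput(size_bytes: float, setup_us: float, rate_bytes_per_us: float) -> float:
    """Effective throughput when every job pays a fixed setup cost."""
    return size_bytes / (setup_us + size_bytes / rate_bytes_per_us)

# illustrative engines: the HSM is faster in steady state but pays a
# large per-job setup cost (descriptor handling, DMA latency)
def hsm_tp(size: int) -> float:
    return throughput(size, setup_us=50.0, rate_bytes_per_us=800.0)

def sw_tp(size: int) -> float:
    return throughput(size, setup_us=1.0, rate_bytes_per_us=100.0)

def choose_engine(size: int) -> str:
    """Dispatch a job to whichever engine yields higher effective throughput."""
    return "hsm" if hsm_tp(size) > sw_tp(size) else "software"
```

With these model parameters, small jobs are routed to software while large jobs go to the HSM, which is exactly the dispatch behaviour described above.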

3.2 TPMs on Application Level<br />

In this chapter we describe the support for TPM 2.0 in GNU/Linux at kernel and user-space level, together with some application examples.<br />

3.2.1 GNU/Linux Support<br />

Driver support implementing TIS and TCTI for TPM 2.0 was mainlined with kernel 4.8 for several vendors. To avoid the limitation of one user-space application blocking the TPM for exclusive use, an access broker and resource management daemon was added with kernel 4.12. On top of the driver, however, some more infrastructure is required to actually use a TPM.<br />

In addition to the module specification itself, the TCG also defines the TPM Software Stack (TSS), which specifies the System API (SAPI) and the TPM Command Transmission Interface (TCTI). There is also an additional Enhanced SAPI to simplify user-space access to TPM functions from several programming languages.<br />

At the moment several TSS implementations for GNU/Linux exist, e.g. from Intel [20] and IBM [21]. The figure illustrates the basic concept on a GNU/Linux system: a kernel driver is used to communicate with the TPM, e.g. using the TPM Interface Specification (TIS). This driver creates the /dev/tpm0 device to tunnel TCTI communication to the TPM. Optionally (depending on CONFIG_HW_RANDOM_TPM in the kernel configuration), another device for reading random numbers can be created. One layer above resides the TCTI TPM device provider service as part of the TSS, which is mostly used by the SAPI layer. The SAPI is implemented as a user-space library to be linked dynamically or statically. Applications may also connect to the TPM device provider directly. Possible modifications and extensions of this layered stack to support more complex systems (virtualization, remote TPM access) are explained in more detail in the TSS System Level API and TCTI Specification.<br />

Before a TPM can be used on a specific system, it needs to be provisioned. During provisioning, the TPM is enabled and activated. In a next step, the endorsement primary key pair is created using TPM2_CreatePrimary. Because the calculation of the key pair is based on the Endorsement Seed in conjunction with the KDF, it can be recreated at any time as long as the TPM has not been cleared, and it therefore does not need to be stored in the NVM. Thanks to the several asymmetric algorithms and the added template entropy of TPM 2.0, there can be more than one primary endorsement key pair or storage key pair, as well as other keys. On GNU/Linux this is a manual process, which can of course be automated by a script. A useful feature is the possibility to set up policies on the TPM that control access to keys or NVRAM data depending on, e.g., specific PCR values (“sealing”).<br />
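To make the sealing idea concrete, the following is a minimal Python sketch of the PCR extend operation that such policies evaluate; it illustrates the hash-chaining principle only and is not a real TSS call (a SHA-256 PCR bank is assumed):<br />

```python
import hashlib

def pcr_extend(pcr: bytes, measurement: bytes) -> bytes:
    # Extend step: new PCR = H(old PCR || H(measured data)),
    # mirroring a measurement event followed by the extend operation.
    digest = hashlib.sha256(measurement).digest()
    return hashlib.sha256(pcr + digest).digest()

# PCRs start as all zeroes after a reset.
pcr = bytes(32)
for stage in (b"bootloader", b"kernel", b"initramfs"):
    pcr = pcr_extend(pcr, stage)

# "Sealing": a key or NVRAM area is released only if the PCR equals the
# value recorded at provisioning time, i.e. the same software booted.
expected = pcr
tampered = pcr_extend(pcr_extend(bytes(32), b"bootloader"), b"evil kernel")
assert tampered != expected
```

Because every boot stage is folded into the chained hash, any change in any measured component yields a different final PCR value, and the policy check fails.<br />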

3.2.2 TPM application use cases in GNU/Linux<br />
The following lists some interesting tools complementing the TSS in userspace.<br />

[Figure: TSS software stack on GNU/Linux — a GNU/Linux application uses the System API (SAPI) implementation instance, which communicates over the TCTI with the TCTI TPM Device Provider; the provider reaches the local TPM through /dev/tpm0 and the kernel TPM driver.]<br />

tpm2-tools: This project implements most of the TSS SAPI functions as command-line tools to access TPM 2.0 functionality. Available from https://github.com/01org/tpm2-tools<br />
tpm2-abrmd: An implementation of a TPM2 access broker and resource management daemon. Available from https://github.com/01org/tpm2-abrmd<br />
eltt2: The Embedded Linux TPM toolbox, a Swiss army knife for accessing basic TPM 2.0 functionality from the Linux command line. It communicates directly over the TCTI. Available from https://github.com/Infineon/eltt2<br />

The SAPI is the standardized interface to these functions; alternatively, the tpm2-tools are available from the command line. For example, the tpm2_getrandom function reads a random number from the TPM’s hardware random number generator. Other functions may be used to create and manage keys for the TPM’s various symmetric and asymmetric encryption algorithms, to encrypt small portions of data, or for key management and the distribution of encrypted keys for symmetric cryptography. Users can export and import their keys, and the key itself is encrypted before it leaves the TPM, so that it never appears outside the TPM in plaintext.<br />

As mentioned before, PCRs 17 to 24 are not used for boot measurement and can therefore be used by userspace applications; this includes the ability to reset these registers. For example, these PCRs could be used for application-specific measurements to monitor the health of specific software products and configurations, or for software license management. In the same way, the TPM’s NVRAM can be used by user applications. However, users need to consider the specified maximum number of write cycles and the data retention limitations of the specific TPM used.<br />

3.3 Advanced Use Cases<br />
In this chapter we briefly look into some advanced use cases at application level, combining HSMs and TPMs.<br />

3.3.1 Encrypting / decrypting sensitive data<br />

Encrypting and decrypting sensitive data is a very common requirement at application level. For small payloads, this task can easily be done using a TPM, which offers the following advantages:<br />
• Key generation and protection without exposing keys to the SoC world<br />
• Key access rights can be restricted or granted based on the trust state of the system<br />
• Encryption is performed with maximum security against snooping<br />

However, in the following situations, combining the TPM with other methods might be preferred:<br />
• Huge payloads that would take too long to encrypt/decrypt on the TPM, or high throughput demands<br />
• The SoC’s tamper detection system should be leveraged to protect and destroy keys and sensitive data upon attack<br />

In these cases, there are two possibilities:<br />
Exclusive use of the HSM: This is the preferred way if the SoC’s tamper detection system shall be leveraged. Most HSMs support advanced methods to import and export cryptographic blobs into memory, as well as advanced key generation methods. However, these methods are highly proprietary and require access to the detailed documentation of the HSM (e.g. the “Security Reference Manual”).<br />
Combining HSM and TPM: This method offers the best of both worlds by using the TPM as a highly secure and certified key generator and key storage, and the HSM to actually encrypt the data using this key. This task can be performed either with proprietary software that directly accesses the HSM or with the methods described in chapter 4.1.2, e.g. through OpenSSL. There are ongoing efforts to integrate a “tpm2” engine into OpenSSL [22], and some manual approaches [23] to achieve this goal.<br />
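The split of roles can be sketched as follows. Everything here is illustrative: the function and variable names are our invention, and a toy SHA-256 counter-mode stream cipher stands in for the HSM’s AES engine, whose real interface would be a proprietary driver or an OpenSSL engine:<br />

```python
import hashlib, os

def hsm_stream_cipher(key: bytes, nonce: bytes, data: bytes) -> bytes:
    # Stand-in for the HSM's bulk cipher: XOR with a SHA-256
    # counter-mode keystream (illustrative only, not for production).
    out = bytearray()
    for i in range(0, len(data), 32):
        pad = hashlib.sha256(key + nonce + i.to_bytes(8, "big")).digest()
        out += bytes(a ^ b for a, b in zip(data[i:i + 32], pad))
    return bytes(out)

# 1. The TPM generates and guards the symmetric key; here it is a plain
#    variable, whereas a real TPM would only release it under policy.
data_key = os.urandom(32)

# 2. The HSM encrypts the bulk payload with that key.
nonce = os.urandom(16)
payload = b"large sensor log ..." * 1000
ciphertext = hsm_stream_cipher(data_key, nonce, payload)

# 3. Applying the same keystream again restores the payload.
assert hsm_stream_cipher(data_key, nonce, ciphertext) == payload
```

The design point is the division of labor: key generation and storage stay in the certified TPM, while the throughput-critical cipher work runs in the HSM.<br />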

3.3.2 IPSEC<br />

If a kernel driver for an HSM is available and the required algorithms are supported, the HSM can be used to accelerate IPsec traffic. TPMs can be integrated into this solution as key generators and key storage, and to perform low-performance cryptographic operations such as verifying signatures. Solutions such as strongSwan support the use of TPM 2.0 through a plugin. [24] and [25] describe the possibilities offered by this, including remote attestation of IMA results.<br />

3.3.3 Network Acceleration<br />

Typically, the most powerful HSMs are found on networking processors, such as NXP’s Layerscape or Marvell’s Armada 8040 family, to accelerate the encryption of network traffic. To achieve this, these modules need to be deeply embedded into the hardware network acceleration modules of these SoCs, such as the queue, buffer and frame managers in NXP’s DPAA or DPAA2. If such an acceleration system is used in conjunction with optimized network stacks that leverage all components, Ethernet traffic can be encrypted at line speed on such platforms. Please contact the authors for more information on this subject.<br />

4 WHAT IS IT ABOUT THE TITLE?<br />

In a way, TPMs and HSMs, the two companions to Tux (the Linux kernel mascot), resemble “The Lion King” character Timon [26] and the canine character Rex from an Austrian/Italian TV show [27]. Timon is small in size, smart and very good at self-marketing, similar to how TPMs are positioned by some parties as the solution to every existing security concern; Rex is a well-minded, persistent and usually underestimated K9, ready to jump in to help and save lives whenever needed, quite similar to what HSMs do in an SoC, especially when no one else is there to help.<br />

5 CONCLUSION<br />

Hardware-accelerated cryptography support is an extremely valuable addition to enhance GNU/Linux security. However, while some applications are well enabled (userspace cryptography, IMA, image authentication), support for others is missing completely (TPM 2.0 and measurement support in mainline U-Boot, TPM integration as a key storage into dm-crypt). Due to rising security demands, this will most likely improve in the future and help to replace proprietary HSM-enabled solutions with TPMs. The ARMv8 crypto extensions show great speed for most algorithms, so that the predominant use of HSMs (crypto speed) will most likely also be complemented and replaced by pure software implementations in the future. The maturity and adoption of such solutions for productive use also depends highly on community evaluation and implementation. So feel free to contact the authors to discuss collaboration on an implementation, or just to give some feedback about your use cases and experiences.<br />

REFERENCES<br />

[1] https://community.nxp.com/docs/DOC-334996<br />

[2] https://trustedcomputinggroup.org/<br />

[3] https://shattered.io/<br />

[4] https://github.com/torvalds/linux/blob/master/arch/arm64/crypto/aes-ce-cipher.c; https://www.linaro.org/blog/core-dump/accelerated-aes-for-the-arm64-linux-kernel/<br />

[5] https://www.fsf.org/campaigns/secure-boot-vs-restrictedboot/whitepaper.pdf<br />

[6] https://sourceforge.net/p/linux-ima/wiki/Home<br />

[7] Roeder et al., “Tux Airborne - Encapsulating Linux: real-time, safety and security with a trusted microhypervisor”, Embedded World Conference 2016<br />

[8] https://www.arm.com/products/security-on-arm/trustzone<br />

[9] https://github.com/OP-TEE/optee_os<br />

[10] https://www.kernel.org/doc/Documentation/device-mapper/verity.txt<br />

[11] https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/dm-integrity.txt<br />
[12] https://github.com/torvalds/linux/blob/master/Documentation/device-mapper/dm-crypt.txt<br />

[13] https://gitlab.com/cryptsetup/cryptsetup<br />

[14] https://github.com/rqou/tpm2-luks<br />

[15] http://www.chronox.de/libkcapi/html/ch01s02.html<br />

[16] http://cryptodev-linux.org/<br />

[17] https://github.com/cryptodev-linux/cryptodev-linux/blob/master/examples/<br />

[18] http://www.gnutls.org/<br />

[19] http://www.openssl.org/<br />

[20] https://github.com/tpm2-software/tpm2-tss<br />

[21] https://sourceforge.net/projects/ibmtpm20tss/<br />

[22] https://dguerriblog.wordpress.com/2016/03/03/tpm2-0-and-openssl-on-linux-2/<br />

[23] https://mta.openssl.org/pipermail/openssl-dev/2016-December/008924.html<br />

[24] https://wiki.strongswan.org/projects/strongswan/wiki/TPMPlugin<br />

[25] https://www.strongswan.org/docs/ConnectSecurityWorld_2016.pdf<br />

[26] https://en.wikipedia.org/wiki/The_Lion_King<br />

[27] https://en.wikipedia.org/wiki/Inspector_Rex<br />



APPENDIX<br />

[Chart: LS1021 CAAM vs. Cortex-A7 @ 1.2 GHz — OpenSSL aes-128-cbc and aes-256-cbc speed in KB/s over block sizes 16 B to 8 KB, comparing CAAM offload with 1, 2 and 3 threads against single-threaded software on one A7 core.]<br />

[Chart: Marvell A8040 SafeXcel vs. NXP LS1046 CAAM vs. ARMv8 Cortex-A72 crypto @ 1 GHz — OpenSSL aes-128-cbc and aes-256-cbc speed in KB/s over block sizes 16 B to 8 KB, comparing CAAM offload with 1 to 4 threads, SafeXcel offload, and single-threaded software on an A72 core.]<br />



TPM 2.0 for Enhanced Security in<br />

Software Updates of Industrial Systems<br />

Dr.-Ing. Florian Schreiner<br />

Embedded Security Solutions<br />

Infineon Technologies AG<br />

Munich, Germany<br />

Florian.Schreiner@infineon.com<br />

Abstract— Industry 4.0 enhances the communication and data exchange between devices in a smart factory. This requires enhancing the functionalities of the devices and also increases the complexity of their software. More software complexity also implies more potential security issues and bugs. This can be mitigated with frequent remote software updates, which address bugs and take the latest known threats into account. These updates in turn need a high level of protection in order to prevent misuse of and attacks on their deployment. The Trusted Platform Module (TPM) is a standardized technology that increases the security of the deployment and installation of software updates by acting as a trust anchor, because it protects keys and data with a high security level.<br />

Keywords— Software Update; Security; TPM; Trusted<br />

Computing; Standardization;<br />

I. INTRODUCTION<br />

Industrial automation and the Industry 4.0 movement are leading to substantially more connected devices in factories and production lines. As the number of connected devices increases, so do the opportunities for attacks on such devices, their communication channels and their stored data.<br />

The challenges are the enhanced functionalities and the complexity of the software in the devices, which also extend the possibilities for security issues and bugs. This can be mitigated with frequent remote software updates, which address bugs and take the latest known threats into account. These updates in turn need a high level of protection in order to prevent misuse of and attacks on their deployment.<br />

The problem for a system with a security bug is the protection of the cryptographic keys required for deploying an update. These keys need to be stored and managed in a secured environment that is separated from the main software of the device.<br />

Such a secured environment is the Trusted Platform<br />

Module (TPM), which is a standardized technology to increase<br />

the security in devices and to protect cryptographic keys and<br />

data with a high security level. The TPM 2.0 is the latest<br />

Trusted Computing technology, which provides modern<br />

algorithms, easier integration of cryptographic functions and<br />

the crypto-agility concept. Crypto-agility is important for<br />

industrial devices, as they have a long lifetime and therefore<br />

require a smooth transition to new upcoming algorithms in the<br />

future.<br />

This paper provides a short introduction to the new functionalities of the TPM 2.0 standard and their application in industrial devices. The focus is on the protection of a remote software update process that uses the TPM as key storage and uses policies to protect the key usage.<br />

II. APPLICATION SCENARIO<br />

A. Secured Software Update<br />

The software update of an industrial device is an innovative approach to enhancing such devices. It enables faster adaptation in smart factories by optimizing industrial devices in order to reduce potential risks or to enhance performance.<br />

The performance can be increased with optimized parameters, or the execution of new functionalities can be enabled. There are also negative aspects, because software updates can introduce risks like failures, bugs or errors in the software. Such failures can cause significant financial damage, because of the high cost of an interruption of the production line. If such a software error is detected, a software update can be developed and released in order to remove the potential risk.<br />

A software update can be executed locally at the device or as a remote software update via a network connection. The local update has the advantage that the environment of the update is more secure, because physical presence at the device is required to execute the update. However, the local update also incurs higher costs and more resources, because the operator needs to establish a direct connection to the device and the update needs to be planned into the production process.<br />



A remote software update reduces these costs, because the update is deployed over a network connection. However, this also increases the attack potential, as the device is reachable by a larger number of other devices or intruders, which could potentially misuse or intercept the normal operation of the update process. Such a threat could also be exploited more easily if there are bugs and errors in the software of industrial devices. Therefore, a secured software update process is required in order to validate the transmitted update package and to verify that the update was installed correctly.<br />

B. Threats for secured software updates<br />

The challenge for a secured software update mechanism is to design an architecture and cryptographic process of adequate quality to protect against the wide variety of known threats. The architecture of the software update is especially critical, because an optimal security concept is required to achieve a high security level. A software update sequence involves several aspects:<br />
• Authorization of the update<br />
• Verification of the authenticity, integrity and confidentiality of the update package<br />
• Verified installation of the update<br />

The authorization of the update is required to protect against misuse of the update process. Only authorized parties are allowed to start a firmware update process, and only when the device is in the right state. The authorization is typically done by the operator or the owner, who decides when the update can be started. The owner or operator is generally not known during the manufacturing of the industrial device; therefore, a flexible authorization mechanism is required so that the right party can be identified during the lifetime of the device.<br />

Another threat is the manipulation of the update package. Cryptographic mechanisms need to be integrated into the secured update architecture so that the authenticity, integrity and confidentiality of the update are protected. This can be addressed with encryption in combination with a signature of the update package.<br />
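The verification side of this scheme can be sketched as follows. This is only an illustration: an HMAC stands in for the package signature (a real deployment would use an asymmetric signature, e.g. RSA or ECC, verified with a key protected by the TPM), and all function names are our invention:<br />

```python
import hashlib, hmac

def sign_package(update: bytes, signing_key: bytes) -> bytes:
    # Authenticity and integrity: MAC over the package digest
    # (stand-in for an asymmetric signature created by the vendor).
    digest = hashlib.sha256(update).digest()
    return hmac.new(signing_key, digest, hashlib.sha256).digest()

def verify_and_install(update: bytes, tag: bytes, signing_key: bytes) -> bool:
    expected = sign_package(update, signing_key)
    if not hmac.compare_digest(tag, expected):
        return False   # reject the package before touching the flash
    # ... decrypt if needed, write to flash, re-verify after reboot ...
    return True

tag = sign_package(b"firmware v2", b"\x42" * 32)
assert verify_and_install(b"firmware v2", tag, b"\x42" * 32)
assert not verify_and_install(b"firmware v2 (tampered)", tag, b"\x42" * 32)
```

The constant-time comparison (`hmac.compare_digest`) and the verify-before-install ordering are the two properties a real implementation must preserve.<br />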

The last step in an update process is the installation of the new software on the device. This installation can be verified so that manipulations during the execution of the update can also be detected. Mechanisms like secured boot additionally verify the integrity of the installed software after a reboot of the device.<br />

A general problem for these cryptographic mechanisms is that the essential keys and authorization credentials need to be stored securely in the device. Storing this secret data in plain form in the memory of the host processor would not be optimal, because an attacker could gain access to the keys by, e.g., reading the flash memory or remotely exploiting a bug in the software. In such a case, the attacker can misuse the update mechanism or read, change or clone the software of the device.<br />

C. The Trusted Platform Module (TPM)<br />

The Trusted Platform Module (TPM) is a standardized technology that increases the security of software updates by acting as a trust anchor, because it protects keys and data with a high security level. The TPM 2.0 as specified in [1] is the latest Trusted Computing<br />

technology, which provides modern algorithms, easier<br />

integration of cryptographic functions and the crypto-agility<br />

concept. Crypto-agility is important for industrial devices, as<br />

they have a long lifetime and therefore require a smooth<br />

transition to new upcoming algorithms in the future.<br />

The TPM provides standard cryptographic functionalities<br />

and interfaces to protect the data in industrial systems and<br />

enhance communication security. It supports a wide variety of<br />

functionalities including basic device authentication, embedded<br />

system life cycle protection and system integrity in<br />

combination with secured boot. These functionalities offer<br />

flexibility in the integration of the security and enable dynamic<br />

security enhancements over the lifetime of the device. The<br />

generic TPM functionalities are shown in Fig. 1.<br />

Fig. 1. Generic TPM functionalities for industrial systems<br />

The TPM is a hardware security device in which all<br />

functions and the cryptographic operations are executed in a<br />

protected environment. Internally the TPM consists of several<br />

different blocks, which can be accessed via an external<br />

interface, e.g. I2C or SPI. This external interface provides a<br />

mechanism for the authorization of users and the execution of<br />

cryptographic operations with secret keys. Fig. 2 shows the<br />

components of the TPM 2.0 with the supported algorithms,<br />

protocols and functionalities.<br />

Fig. 2. Functional components of a TPM 2.0<br />

The TPM 2.0 has a variety of key management functionalities, which allow keys to be created internally in the TPM and stored under the TPM’s protection. Furthermore, the usage of the keys can be limited to authorized parties, in order to verify whether the requesting party is allowed to use the corresponding key.<br />

Furthermore, the TPM has a set of Platform Configuration Registers (PCRs), which store the measurement data collected during a boot process. These registers are used to verify the system integrity, and they are cryptographically protected with a hash algorithm.<br />

The TPM chip’s high level of resistance to attacks is achieved with several countermeasures in hardware. Examples of these countermeasures are analog sensors that supervise the input and output pins of the chip in order to detect whether the pins are being used to manipulate the chip’s operations; example threats are spikes or glitches applied to the pins. Furthermore, the TPM contains sophisticated internal memory encryption, so that data is stored securely even inside the chip. The TPM additionally has more than 50 other security features, so that it achieves a high level of security, which is also certified using the Common Criteria process. Approved TPM products, which fulfill the requirements of the standardization organization Trusted Computing Group (TCG), are listed in [2].<br />

D. Advantages of the TPM in Software Updates<br />

The enhanced security and trustworthiness offered by a TPM provide several benefits when used for software updates.<br />

Attacks on industrial devices can become widely known through publication at conferences and in social media, news and the press. This can damage the brand of products and even affect the reputation of the whole company regarding trustworthiness and reliability. Attacks like reading flash memory or exploiting software bugs (e.g. Heartbleed) can nowadays be mounted even with the knowledge of students. If a secret key is extracted without authorization, there are several methods to use the key in order to obtain confidential data.<br />

The TPM protects these keys, as they are only stored and used inside the chip. Therefore, software bugs in cryptographic libraries on the host processor (e.g. Heartbleed) are not a security problem for the keys in the TPM, and the TPM keys can be considered trustworthy even after the host software has been attacked or compromised. This enhances the capabilities of the software update mechanism, because it allows a system to be recovered after an attack has occurred.<br />

The large set of functionalities and the included security evaluation of the TPM mean a significant cost reduction compared to other security implementations, which can require high implementation effort and cost. Some technologies like virtualization can require the use of proprietary extensions, which lead to high implementation effort and reduce interoperability; furthermore, the threat resistance of these integrated technologies is often limited. The TPM is internationally standardized by the Trusted Computing Group (TCG), offering extensive interoperability with current IT systems, operating systems and network protocols like SSL/TLS. Additionally, it provides a high level of security based on smartcard technology, which offers strong protection against currently known threats.<br />

III. SYSTEM ARCHITECTURE WITH A TPM<br />

This section explains how the TPM is used to set up a secure software update of a device. The update is distributed by a cloud server; Fig. 3 shows an overview of the system and its components. The update package is signed and encrypted in the cloud server. The encrypted package is sent to the device, which uses the TPM to decrypt it. This can only be done if the operator or owner was authorized beforehand. After that, the signature of the package is verified. If all operations were successful, the update is installed on the device.<br />

The public/private key pair in the TPM can, for example, be generated during the manufacturing of the device. In this operation, the authorization process for the key is also defined; examples are a password or a signature validation, which can later be used to verify the operator authorizing the software update. After the key creation, the public key can be read from the TPM and stored in the cloud server.<br />

[Figure: client device containing host CPU, storage/flash and TPM; the software update is delivered by the cloud server.]<br />
Fig. 3. Overview of a secured update with a TPM<br />

The TPM also supports further enhanced protection mechanisms. It provides version management, which is controlled by the cloud server and can also be included in the authorization mechanism. The version management checks the current firmware version: only if the version matches the versions allowed by the backend does the TPM authorize the key to decrypt the firmware update package. This protects against unapproved firmware versions as well as rollback attacks to old firmware versions.<br />
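A minimal sketch of such a version-gated authorization policy follows; the function name, parameters and policy shape are our invention, and a real TPM would enforce this through policy sessions, e.g. against an NV counter:<br />

```python
def authorize_decrypt(current_version: int, candidate_version: int,
                      allowed_versions: set) -> bool:
    # The backend (cloud server) controls the set of allowed versions.
    if candidate_version not in allowed_versions:
        return False   # unapproved firmware version
    if candidate_version <= current_version:
        return False   # rollback (or re-install) attempt blocked
    return True        # the TPM may now release the decryption key

assert authorize_decrypt(3, 4, {4, 5})
assert not authorize_decrypt(3, 2, {2, 4})   # rollback blocked
assert not authorize_decrypt(3, 6, {4, 5})   # not on the allow-list
```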

IV. SUMMARY AND OUTLOOK<br />

New and increasingly sophisticated security threats are constantly developing as a result of the widespread adoption of Industry 4.0, as well as of the new application areas and types of devices being connected, all of which are attractive to potential attackers. With the TPM standard, engineers and developers of industrial devices have a versatile and highly efficient solution available that enhances the security of a broad variety of use cases and industrial systems.<br />



REFERENCES<br />

[1] Trusted Platform Module Library, Family 2.0, Revision 01.38,<br />

September 2016, Trusted Computing Group,<br />

https://trustedcomputinggroup.org/tpm-library-specification/<br />

[2] TPM Certified Products, January 2018, Trusted Computing Group, https://trustedcomputinggroup.org/membership/certification/tpm-certified-products/<br />
[3] TSS Feature API Specification, Family 2.0, Revision 00.12, November 2014, Trusted Computing Group, https://trustedcomputinggroup.org/tss-feature-api-specification/<br />



Secure Updates of Artificial Intelligence Applications<br />

Used in Autonomous Driving<br />

Antonino Mondello<br />

Principal Design Engineer<br />

Micron Technology Inc.<br />

amondell@micron.com<br />

Alberto Troia<br />

Memory System Architect<br />

Micron Technology Inc.<br />

atroia@micron.com<br />

Abstract— Many applications are developed by adopting intelligent hardware and algorithms that use the same methodology biological structures use to solve real-life problems; this approach involves strategies and implementations based upon neural networks, genetic algorithms, deep learning, and other forms of artificial intelligence. One of the main benefits of artificial intelligence is the inherent ability to arrive at a solution to a problem in a very short time compared to alternative implementations, while guaranteeing the robustness and integrity of the solution, including protection against unauthorized changes. Ensuring protection against unauthorized changes requires that updates of contents (i.e. datasets) are accepted only if the trust between the sender and the receiver has been verified. The application of artificial intelligence in conjunction with system-level security is a cornerstone of the enabling elements needed to realize autonomous driving.<br />

Keywords—Artificial intelligence; Machine Learning; Deep Learning; Neural Network; Genetic Algorithm; Neuron; Gene; HASH; HMAC; SHA2; Digest; Weight Matrix; Secure Storage; Memory; Automotive; Autonomous Driving<br />
I. ARTIFICIAL INTELLIGENCE OVERVIEW<br />
One definition of artificial intelligence (AI) is the capability of a machine to imitate intelligent human behavior. The main purpose of introducing AI is to be flexible in adapting to new circumstances that were not foreseen when the hardware was planned and designed.<br />
A. Neural Network overview<br />
In recent years, most AI systems have been based on neural networks that emulate the functionality of animal brains. An Artificial Neural Network (ANN) is a set of cells called neurons, which can be interconnected or independent via a set of connections, as depicted in Fig. 1.<br />
The neural signal processed by a generic neuron i is sent to neuron j after multiplication by certain numerical constants called synaptic weights (w_mn). For conceptual simplicity, the ANN is organized in consecutive layers of neurons (see Fig. 1). The output of an internal ANN layer of neurons, after receiving the signals from the previous layer, is provided to the next layer of neurons until it reaches the output of the ANN; this kind of ANN is called a feedforward ANN. The first layer is the input layer; it interfaces the network with the external world using R inputs. Generally, the primary purpose of this layer is to precondition the magnitude of the inputs.<br />
Fig. 1. Generic artificial neural network structure<br />

Some ANNs might have neurons that directly influence the input via the synaptic weight w_mn (see Fig. 1); in general, the output of any specific neuron might be the input of a neuron in one of the previous layers. These types of networks are called Recurrent Artificial Neural Networks (RANNs); the recurrence can affect the stability of the entire structure. It is customary to call the feedforward ANN a non-recurrent ANN.<br />

To briefly introduce the ANN functionality, we may start from the description of a single neuron. Consider a generic neuron m of the network, represented below in Fig. 2; the relationship between its inputs and its output is given by the formula:

a_m = f_m[b_m + Σ_{k=1}^{R} (w_mk · p_k)]    (1)



where f_m(n) is a real function called the activation function.
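As a concrete illustration, formula (1) for a single neuron can be sketched in a few lines of Python; the weights, bias and log-sigmoid activation below are hypothetical example values:

```python
import math

def logsigmoid(t):
    # Log-sigmoid activation: f(t) = 1 / (1 + e^-t)
    return 1.0 / (1.0 + math.exp(-t))

def neuron_output(weights, bias, inputs, activation):
    # a_m = f_m(b_m + sum_k w_mk * p_k), as in formula (1)
    n = bias + sum(w * p for w, p in zip(weights, inputs))
    return activation(n)

# Example: a two-input neuron with hypothetical weights and bias
a = neuron_output(weights=[0.5, -0.25], bias=0.1,
                  inputs=[1.0, 2.0], activation=logsigmoid)
```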

Finally, an implementation of a feedforward ANN requires the management of two sets of N matrices, W_m and B_m, and N functions f_m(n). Such matrices must be stored in nonvolatile memory and updated when needed, according to the update policy implemented.

Fig. 2. Neuron I/O relationship<br />

In an ANN, the activation functions can be, in principle, different for each neuron; in practice, only a few different kinds of functions are used. The most common activation functions are described in TABLE I.

TABLE I. DIFFERENT KINDS OF ACTIVATION FUNCTIONS

Hard limiter: f(t) = b if t ≥ 0; f(t) = a if t < 0
Linear: f(t) = a · t
Log-sigmoid: f(t) = 1 / (1 + e^(−t))
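The three activation functions of TABLE I can be written directly in Python; the limiter constants a and b default to 0 and 1 here, a hypothetical choice:

```python
import math

def hard_limiter(t, a=0.0, b=1.0):
    # Hard limiter: f(t) = b if t >= 0, a if t < 0
    return b if t >= 0 else a

def linear(t, a=1.0):
    # Linear activation: f(t) = a * t
    return a * t

def logsigmoid(t):
    # Log-sigmoid: f(t) = 1 / (1 + e^-t)
    return 1.0 / (1.0 + math.exp(-t))
```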

The constant b_m, which is not always present (b_m = 0), is called the bias of the neuron. The relationship between the inputs and outputs of the entire network depends on the choice of the activation function, the bias, and the synaptic weights w_mk. The process that permits us to define the bias and the synaptic weights is called the learning process. We will briefly describe the ANN learning strategies in the following paragraphs.

B. Synaptic weight matrix of a neural network<br />

Regardless of the algorithm used to define the numbers w_mn and b_m, or of the choice of the activation functions f_m(n), the mathematical description of an ANN can be represented using standard matrix formats. This permits a very compact notation in the theory formulation and equations organized in a form suitable for a software implementation of the network.

In a feedforward ANN, the synaptic weights of a generic m-th layer can be organized in a rectangular matrix W_m, where the element w_mn ∈ W_m indicates the weight of the connection (synapse) from neuron n to neuron m of the artificial network. The output of each layer can be written as:

A_0 = P
A_m = F_m(B_m + W_m · A_{m−1}),  ∀m ∈ [1, …, N]
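The layer recursion (A_0 = P, each A_m computed from A_{m-1}) can be sketched in plain Python with matrices as nested lists; the network shape and weight values below are hypothetical:

```python
import math

def logsigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def layer_forward(W, B, A_prev, f):
    # One layer: A_m = F_m(B_m + W_m . A_{m-1})
    return [f(b + sum(w * a for w, a in zip(row, A_prev)))
            for row, b in zip(W, B)]

def feedforward(P, layers):
    # layers: list of (W_m, B_m, f_m) tuples; A_0 = P
    A = P
    for W, B, f in layers:
        A = layer_forward(W, B, A, f)
    return A

# Hypothetical 2-input, 2-hidden-neuron, 1-output network
layers = [
    ([[0.5, -0.5], [0.25, 0.75]], [0.0, 0.1], logsigmoid),  # layer 1
    ([[1.0, -1.0]],               [0.0],      logsigmoid),  # layer 2
]
out = feedforward([1.0, 2.0], layers)
```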

Fig. 3. Feedforward neural network<br />

To implement a vehicle with autonomous driving capabilities, the different functions can be implemented using feedforward ANNs; sometimes it is necessary to use more powerful ANN architectures, such as the Recurrent Artificial Neural Network (RANN).

The implementation of RANNs poses several additional problems; in fact, they behave like dynamic systems, which means that the output depends not only on the current input of the network, but also on the previous values of the inputs, the outputs and the internal states. Because of this dynamic nature, RANNs possess a more complicated mathematical structure than feedforward networks and require very large matrices.

A general model of a RANN is depicted in Fig. 4. To guarantee correct functionality at each layer, a Tapped Delay Line (TDL) is used to avoid critical race conditions at the output of the neurons.

Fig. 4. Recurrent neural network model<br />

For a complete modeling of the network layers, it is necessary to introduce the matrices listed in TABLE II; it can be proven that these matrices suffice to describe the behavior of the entire RANN.
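Using the TABLE II notation, a single RANN layer step can be sketched in Python; this toy layer feeds its own one-step-delayed output back through LW (a one-tap delay line), and all weight values are hypothetical:

```python
import math

def logsigmoid(t):
    return 1.0 / (1.0 + math.exp(-t))

def rann_layer_step(IW, LW, B, p, a_prev, f):
    # a_m(t) = f(B_m + IW . p(t) + LW . a_m(t-1))
    # a_prev is the delayed output held in the Tapped Delay Line (TDL)
    out = []
    for row_in, row_rec, b in zip(IW, LW, B):
        n = b
        n += sum(w * x for w, x in zip(row_in, p))        # input contribution
        n += sum(w * a for w, a in zip(row_rec, a_prev))  # recurrent contribution
        out.append(f(n))
    return out

# Run the layer over a short input sequence (hypothetical weights)
IW = [[1.0]]
LW = [[0.5]]
B = [0.0]
a = [0.0]  # TDL initially holds zero
for p in ([1.0], [0.0], [0.0]):
    a = rann_layer_step(IW, LW, B, p, a, f=logsigmoid)
```

Note how the output keeps evolving after the input returns to zero: this is the dynamic behavior described above.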



TABLE II. MATRICES NEEDED TO DESCRIBE RANNS

P: Input vector of the network; in this network model, the inputs can be connected to all layers.
A_m: Output vector of layer m.
B_m: Vector of the neuron biases of layer m.
LW_m,l: Matrix of synaptic weights from layer l to layer m.
IW_m,l: Matrix of synaptic weights associated with the network input l related to layer m.

Fig. 5 represents a generic RANN layer; the matrices of TABLE II determine the contributions to the neuron inputs, while the outputs are calculated using formula (1) (see [1][2][5][8] for more details).

Fig. 5. Recurrent neural network m-th layer<br />

From this overview of ANNs, we can draw the conclusion<br />

that the implementation of a neural network requires the<br />

manipulation of large amounts of data stored in matrix form.<br />

Such data must be stored in a nonvolatile memory device and<br />

must be written, read, and updated in a secure manner. This<br />

challenge will be addressed in the next paragraphs where we<br />

describe a possible hardware implementation.<br />

C. Learning of a neural network overview<br />

The learning or training of the ANN [1][2][5] consists of setting the values of the weight matrices LW_m,l, IW_m,l and B_m. The learning strategies can be divided into two distinct categories, which define two different paradigms of learning:

• Supervised learning: Previously collected input vectors are presented to the ANN. The output produced by the network is observed and the deviation from the expected answer is measured. The weights are corrected according to the magnitude of the error, in the manner defined by the specific error function used. This learning strategy is also known as learning with a teacher.

A special class of supervised algorithms is given by the reinforcement learning strategy [5]; reinforcement algorithms do not compare the response at each input of the network with the theoretical one; instead, as a measure of the error, they use a score function that provides a measure of the global performance of the network. The synaptic weights are changed according to the score value. Another supervised class of algorithms is represented by learning with error correction: here, the magnitude of the error, together with the input vector, determines the magnitude of the corrections to the weights.

• Unsupervised learning [5] is used in all situations where,<br />

for a given input, the exact numerical output value is not<br />

known; in other words, the teacher is not available. A<br />

typical example of a problem without a teacher is the<br />

problem of classification. Suppose, for example, there<br />

are some points in a two-dimensional space to be<br />

classified into three clusters. For this task, we can use a<br />

classifier network with three output lines, one for each<br />

class. Each of the three computing units at the output<br />

must specialize by providing a non-zero value in correspondence with the input elements of its cluster. If

one unit is not zero, the others must be silent. In this case,<br />

we do not know a priori which unit is going to specialize<br />

on which cluster. Generally, we do not even know how<br />

many well-defined clusters are present; the network must<br />

organize itself to be able to associate clusters with units.<br />
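The "learning with error correction" strategy above can be illustrated with a minimal delta-rule sketch for a single linear neuron; the learning rate, epoch count and training data are hypothetical choices for this example:

```python
def train_delta_rule(samples, epochs=50, lr=0.1):
    # samples: list of (inputs, target); weights and bias start at zero.
    # Each step corrects the weights in proportion to the error and the input.
    w = [0.0] * len(samples[0][0])
    b = 0.0
    for _ in range(epochs):
        for p, target in samples:
            a = b + sum(wi * pi for wi, pi in zip(w, p))  # linear activation
            err = target - a                              # deviation from the teacher
            w = [wi + lr * err * pi for wi, pi in zip(w, p)]
            b += lr * err
    return w, b

# Hypothetical supervised task: learn target = 2 * x
samples = [([x], 2.0 * x) for x in (0.0, 1.0, 2.0, 3.0)]
w, b = train_delta_rule(samples)
```

After training, the weight approaches 2 and the bias approaches 0, i.e. the neuron has learned the teacher's mapping.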

Tens of training algorithms are typically employed for each<br />

learning paradigm. A detailed description of these can be found<br />

in [1][2][5][7].<br />

D. Reasons to update the ANN<br />

Once the ANN is designed, trained, implemented and<br />

installed within an autonomous vehicle, there are many reasons<br />

why it will need updating:<br />

• The system can have self-learning capabilities, and after an extended period of time on the road, a new set of synaptic weights may be available to improve the autonomous driving capabilities.

• The producer of the vehicle (i.e. the car maker and/or its hardware provider) propagates an update to the system over the air because a new version of the AI set of matrices is available.

• Malfunctioning equipment requires a local update by the<br />

auto dealership using the OTA/SOTA port.<br />

• The vehicle owner purchases new services.<br />

In the next paragraph, we will describe the problem and the methodology used for an ANN update.

II. WEIGHT MATRIX UPDATE<br />

Fig. 6 shows the flow from data gathering to the deployment of the ANN. The first step in training an ANN is to collect data from the field. This can be accomplished using a data center built into the vehicle that gathers information from the various sensors in the vehicle and uses that information to identify the detected object, using, for example, a supervised learning methodology. The developer of the ANN usually performs a pre-processing of the dataset to optimize the functionality of the ANN. The data collected to generate an ANN is divided into three categories:



• Training datasets<br />

• Validation datasets<br />

• Testing datasets<br />

The reason why the data is broken into three datasets falls outside the scope of this paper; in general, however, it is to evaluate whether the ANN demonstrates appropriate levels of accuracy and to ensure it does not overfit the training datasets.

A factory server is typically used by developers to train, validate and test the ANN functionality. The objective of this phase is to generate an ANN that is ready to be deployed in the field.

Fig. 6.<br />

Factory vehicle updates flow<br />

The ANN model, which is deployed in the field, requires<br />

constant updates. These updates must originate from a certified<br />

authority (i.e. the original developer server). The update of the<br />

ANN model is needed mainly because the autonomous vehicle<br />

can detect events where the local ANN does not have a unique<br />

output. While the local ANN has the capability to self-learn, the<br />

new combination of inputs and outputs must be sent to the<br />

factory server for validation. The factory server will store this<br />

new combination of inputs and outputs as an additional dataset<br />

to run a new learning process that updates all autonomous<br />

vehicles deployed in the field.<br />

The update process is the main reason why the storage device used to store the ANN cannot be a read-only memory device; it instead requires Flash media. The process of updating the system must be implemented in a secure environment, regardless of the communication channel (i.e. on-board diagnostic port, over-the-air update, etc.).

If the media used to store the artificial intelligence data is easily accessible, it could be changed in an unauthorized manner. The manipulation of this data can change the behavior of the system or, worse, threaten the safety of the driver.

As we will describe in the next paragraph, any attempt to protect data with legacy protection features is a risk, as these security systems can be easily hacked. Therefore, there is a need to implement a robust protection scheme for the system and the media where the data is stored, to ensure a high level of protection against malicious attack.

III. CRYPTOGRAPHIC METHODS FOR VEHICLE SAFETY

We are observing a real transformation of the world, where<br />

most electronic devices will be interconnected and capable of<br />

exchanging messages and communicating with each other. This<br />

can be considered a new evolutionary era, where machines are<br />

self-learning and act independently. In this new era, data storage<br />

devices can’t be considered simply containers of data, but rather<br />

integral elements of electronic devices that contain very<br />

sensitive data—data which is mandatory for the correct behavior<br />

of the system. As an example, the large amount of data that an ADAS system must process must contain as few errors as possible and, at a minimum, the application controller should be able to recognize their presence. This is especially true for data such as the weight matrix of an artificial intelligence algorithm.

Updating the weight matrix is a very critical operation and it<br />

must be done in a very secure fashion. The weight matrix is<br />

usually stored in a NAND, managed NAND, or NOR storage<br />

device.<br />

For the past decades, memory architects and designers have<br />

proposed various protection schemes against accidental or<br />

unintended modification of data. Some protection schemes were<br />

based on a simple command, protected by a password, which<br />

permitted the protection of selected portions of data in the<br />

memory array. This level of protection ultimately proved to be<br />

ineffective because hackers could readily detect the password<br />

simply by sniffing commands from board buses and reusing<br />

them later. Capturing and reusing transition on the bus is known<br />

as ‘replay attack.’<br />

Another practice used to break a weakly protected storage system is the use of a non-original memory component, whose content is exactly the same as that of the original component and which is able to emulate it; this kind of technique is called a component replacement attack.

Modern protection schemes must provide protection against<br />

these types of attacks and others, while at the same time enabling<br />

over-the-air firmware (and data) updates.<br />

International organizations like the Trusted Computing Group (TCG) have proposed new security paradigms [9], based on consolidated cryptographic concepts [6], [7] that have become standard in recent years; refer to [10], [11], [12], [13] by the National Institute of Standards and Technology, U.S. Department of Commerce (NIST).

The new security paradigms and implementations also<br />

involve the end-point components; these paradigms suggest that<br />

each edge component contributes to the security of the entire<br />

electronic system. The strength of this approach is based on the fact that each device has the embedded capability to prove its identity (to ensure that a component replacement attack fails) and to confirm, to the provider of the update, that the stored data has not been changed by a malicious entity. Guaranteeing the authenticity of the data is accomplished via cryptographic measurements.

A. System Secure Zone definition<br />

Most electronic systems are implementing mechanisms to<br />

make sure that data is checked and validated using cryptographic<br />

measurements. Some common definitions include:<br />



• Trusted Platform Module (TPM): A specialized chip, or<br />

part of a system, on an endpoint device that stores<br />

cryptographic keys specific to the host system for<br />

hardware authentication.<br />

• Trusted Execution Environment (TEE): A silicon area inside a controller, isolated from the other circuits and able to implement cryptographic calculations.

• Secure Element (SE): A secure (i.e. tamper-resistant) element able to store secrets and, possibly, perform cryptographic calculations.

For instance, a TPM implementation needs components able to support the assigned rules and perform the operations required to ensure that the data used and updated on a board is secure.

Initially, such concepts were introduced with the intent to protect PC BIOS integrity and to guarantee secure remote updates, but the techniques proved broadly applicable beyond the PC BIOS [16]. Due to their inherent strength, they have also been employed in the automotive field to address safety concerns.

B. Secure communications problem<br />

A TPM inside an electronic system is implemented using<br />

secure components. The components communicate with each<br />

other via a standard communication protocol.<br />

Due to the presence of secure components in the system, the<br />

first feature to be implemented is trusted communication<br />

between the components; refer to Fig. 7.

Fig. 7. TPM implementation<br />

The definition of trusted communication applies when there<br />

is a methodology (i.e. cryptographic signature) to ensure the<br />

integrity of the message and the non-repudiation of the message<br />

(i.e. the clear identification of the sender). This does not imply<br />

that the content of the message must be encrypted (i.e.<br />

maintained secret and/or not visible), but only requires<br />

maintaining a certain knowledge about the origination of the<br />

message. In fact, the main properties of a message sent to the<br />

communication bus must comply with the following:<br />

• Authenticity: Assure that the sender is who we think it is.<br />

• Integrity: The message is not altered intentionally by a<br />

hacker or randomly by noise.<br />

• Non-repudiation: Sender cannot deny that the message<br />

was sent.<br />

Networked TPMs whose messages possess these main properties can operate inside the same system, in parallel, exchanging messages.

Fig. 8. TPMs communication over an unsecure channel<br />

One possible implementation of a secure TPM inside a<br />

system can be the use of some cryptographic features and<br />

properties to create the message signature.<br />

Payload | Signature

Fig. 9. Simple message structure

A message is defined as signed when it contains a signature. The message content is usually referred to as the payload, while the piece of information appended to the message is called the signature. The signature of a message is calculated using a one-way function called a Message Authentication Code (MAC) function, as in (2):

Signature = MAC(key, payload)    (2)

The key contains secret information, known only to the<br />

entities connected to the TPM, to ensure the security of the<br />

system. The key is never shared in a clear way across the<br />

communication bus. The key may be written into each device in the factory, and/or can be shared by using protocols based on,

for example, a public key cryptographic scheme [9][10]. To<br />

further improve upon the security of the entire system, there can<br />

be more than one key for a given pair of TPMs connected in the<br />

communication bus.<br />
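Formula (2) maps directly onto the HMAC primitive available in standard cryptographic libraries. A minimal Python sketch using HMAC-SHA256 follows; the key and payload values are hypothetical:

```python
import hmac
import hashlib

def sign(key: bytes, payload: bytes) -> bytes:
    # Signature = MAC(key, payload), here instantiated with HMAC-SHA256
    return hmac.new(key, payload, hashlib.sha256).digest()

def verify(key: bytes, payload: bytes, signature: bytes) -> bool:
    # Recompute the signature locally and compare in constant time
    return hmac.compare_digest(sign(key, payload), signature)

key = b"shared-secret-known-to-both-TPMs"  # never sent in clear on the bus
msg = b"update weight matrix block 7"      # hypothetical payload
sig = sign(key, msg)
```

A tampered payload or a wrong key makes `verify` fail, which is exactly the receiver-side check described above.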

There are several functions that can be used to generate the signature of a message; however, these functions must satisfy some key requirements:

• “Easy” to calculate: Given a message X, the calculation of MAC(key, X) does not require sophisticated hardware or software resources.

• “Hard” to invert: Given MAC(key, X), it must be “impossible” to determine the message X or the key; in other words, the calculation of [MAC(key, X)]^(−1) must be infeasible.

• “Negligible” collision probability: Given two messages X ≠ Y, MAC(key, X) ≠ MAC(key, Y).

The cryptographic strength of the MAC, together with the secrecy of the key, guarantees the authenticity, the integrity and the non-repudiation of the messages exchanged inside the TPM system and between TPM regions. In fact, only a sender that possesses the (secret) key can generate a signature that the receiver can verify against the message field; otherwise, the message is discarded by the receiver. In summary, the receiver refuses the message any time its local calculation does not match the signature contained in the received message.

A powerful MAC function used in many cryptographic systems is HMAC-SHA256. This function is described in [14][13] and is very robust, in that it satisfies all the requirements defined for a MAC function.

HMAC-SHA256 can process a message of up to 2^64 bits in length and produces a 256-bit signature. As of today, there is no known way to invert this MAC, and no collision conditions have been identified.

C. Message replay problem<br />

One methodology to hack the data communication on a bus is based on recording the messages transmitted through the channel. A recorded message can be used later either to take control of the whole system and/or to generate a certain malfunction in the system under certain circumstances. This kind of attack, known as a replay attack, can be executed successfully without knowledge of the key (known only to the senders and the receivers). A system that can avoid this type of attack is called robust against replay attacks.

One of the methodologies to avoid this kind of attack is to insert a source of variability into the message; this source is called freshness. The purpose of the freshness field is to ensure that the same signature, and hence the same message, is not used twice at different times or in different applications. The content of a message supporting the anti-replay methodology can look like Fig. 10, where the freshness is an additional field after the payload.

Payload | Freshness | Signature

Fig. 10. Message structure with anti replay field<br />

Let us consider what might be used as the freshness field. The freshness is a special field whose value changes at every cycle: a random number, the value of a counter incremented by a clock, etc. The strategy is to change/generate the freshness content every time a message is sent; this value must be well known to both the sender and the receiver of a TPM system. None of the elements communicating on the TPM bus is required to encrypt or hide the content; in the communication protocol, the sender simply prepares the signed message, using the freshness, and sends it to the receiver.

Fig. 11. Communication between two components<br />

When the message arrives at the receiver, the first operation is to check the security of the message:

• The value of the freshness field must be the expected one.

• The signature calculated on the receiver side as Signature = MAC(key, payload | freshness) must match the value appended to the message itself.

Let us look at some of the methodologies used to generate a string that matches the rules for a freshness field. The simplest way to create a freshness field is to use a timestamp; this method needs synchronization between the clock domains, between different TPMs, and between the components inside the TPMs. Another way is to use a number generated only once, called a NONCE (Number used ONCE); this method is not easily implemented because it requires the existence and maintenance of a database of all the generated NONCEs to avoid repetition. One of the best methodologies, suitable for most applications and, above all, for the automotive market, is to use a Monotonic Counter (MC). The Monotonic Counter is a generic digital circuit that increments its value on a certain event; that event can be each message sent, a clock source, a power event, etc. The main property of the MC is that its value cannot be decremented in any way, and its operation must not be compromised by power loss.

Assuming now that the event triggering the increment of the MC is the sending of a message, two messages containing the same payload field are guaranteed to have different signatures, thanks to the freshness value. Introducing the MC value, a message can look like Fig. 12.

Payload | MC | Signature

Fig. 12. Generic message structure<br />

One possible implementation could be to use a unique<br />

freshness value (i.e. value of the monotonic counter) across the<br />

whole system.<br />
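The effect of the monotonic counter can be sketched as follows: two messages with identical payloads obtain different signatures because the MC value differs. The key value and the 8-byte MC encoding are hypothetical choices:

```python
import hmac
import hashlib

class Sender:
    def __init__(self, key: bytes):
        self.key = key
        self.mc = 0  # monotonic counter: increments only, never decrements

    def send(self, payload: bytes):
        self.mc += 1  # freshness: incremented on every message sent
        data = payload + self.mc.to_bytes(8, "big")
        sig = hmac.new(self.key, data, hashlib.sha256).digest()
        return payload, self.mc, sig  # Payload | MC | Signature

key = b"hypothetical-shared-key"
tx = Sender(key)
m1 = tx.send(b"unlock")
m2 = tx.send(b"unlock")  # same payload, different MC value
```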

D. Secure communication protocol<br />

Consider the subsystem shown in Fig. 11. To make sure that only authorized ECMs communicate with each other, the following protocol can be used:

The sender:<br />

• prepares the message including payload, destination of<br />

the message (optional field), message correction<br />

information (optional) and other relevant strings that<br />

will specify the content of the entire message;<br />

• generates the freshness, i.e. incrementing the MC value;<br />

• calculates the signature by using the formula (2);<br />

• packs the message as per Fig. 10;<br />

• sends the message through the unsecure channel.

The receiver:<br />

• recognizes that the message is addressed to it;

• reads the content of the message;<br />

• unpacks the message to retrieve the MC value;<br />

• checks the value of the freshness, i.e. monotonic counter<br />

value:<br />

• if the freshness does not correspond to the expected one<br />

the message is discarded;<br />



• if the freshness corresponds to the expected one, the<br />

message is accepted and the security check, as per next<br />

step, is executed;<br />

• calculates the signature using formula (2), with the content of the message and the known secret key;

• compares the signature of the message with the<br />

signature coming out from the calculation;<br />

• if the signatures are the same, the message is accepted<br />

and eventually executed; otherwise, the message is<br />

discarded because the content is considered<br />

corrupted/hacked.<br />
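The receiver-side steps above (check freshness first, then the signature, and discard on any mismatch) can be sketched in Python; the key and the 8-byte MC encoding are hypothetical choices:

```python
import hmac
import hashlib

def make_message(key: bytes, payload: bytes, mc: int):
    # Sender side: pack payload | MC | signature, as in Fig. 10
    sig = hmac.new(key, payload + mc.to_bytes(8, "big"), hashlib.sha256).digest()
    return (payload, mc, sig)

class Receiver:
    def __init__(self, key: bytes):
        self.key = key
        self.expected_mc = 1  # next freshness value the receiver will accept

    def accept(self, message) -> bool:
        payload, mc, sig = message
        if mc != self.expected_mc:
            return False  # replayed or out-of-order message: discard
        calc = hmac.new(self.key, payload + mc.to_bytes(8, "big"),
                        hashlib.sha256).digest()
        if not hmac.compare_digest(calc, sig):
            return False  # corrupted/hacked content: discard
        self.expected_mc += 1
        return True

key = b"hypothetical-shared-key"
rx = Receiver(key)
msg = make_message(key, b"read UID", 1)
```

Replaying a captured message fails the freshness check, and a message signed with the wrong key fails the signature check.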

For the remainder of this paper, we will refer to this protocol<br />

as secure communication protocol or secure communications<br />

criteria.<br />

The secure communication criteria can be used to build a secure command set usable over an unsecure channel. If the devices’ command set contains freshness and signature fields as shown in Fig. 13, then commands are accepted only if the message matches the secure communication criteria.

Command op-code | Command parameters | MC | Signature

Fig. 13. Secure command structure

The component that executes the command provides a command response. The response must likewise be organized as shown in Fig. 14 and is accepted by the receiver if, and only if, it matches the security criteria.

Command response | MC | Signature

Fig. 14. Secure command response

An example of a secure command is one that provides the<br />

unique identification code (UID) of a device (this is normally<br />

written into each device in the factory). In this case, the command op-code is the one related to the request-UID command, and the command response contains the requested UID in the command response field.

E. Device identification memory content measurement<br />

A storage device such as a NOR/NAND memory must guarantee the genuineness of the stored data and, if necessary, recover the data from a genuine copy to guarantee system functionality in case of a hacker attack or system malfunction. Changes to stored data can be detected through the use of cryptographic tools, including hash functions.

A hash function is a function that maps data of arbitrary size to a fixed size. A function is eligible as a hash function if it is:

• “Easy” to calculate: Given a message X, the calculation of HASH(X) has low complexity.

• “Hard” to invert: [HASH(X)]^(−1) must be impossible to calculate.

• “Negligible” collision probability: Given two messages X ≠ Y, HASH(X) ≠ HASH(Y).

Cryptographic literature proposes many hash functions [9][10]. A common one is SHA-256, described in [13] and already mentioned in this paper in conjunction with the HMAC.

Once a HASH is calculated on a genuine data pattern, the<br />

result is called a golden hash. This data is stored in an area that<br />

is not user accessible and is compared on demand with the<br />

current HASH calculated at the moment of the request. The<br />

result of this comparison enables the system to understand if the<br />

array content is genuine or was accidentally or intentionally<br />

modified.<br />
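The golden-hash comparison can be sketched with SHA-256 from a standard library; here a byte string stands in for the protected memory region:

```python
import hashlib

def compute_hash(data: bytes) -> bytes:
    # SHA-256 digest of the protected region
    return hashlib.sha256(data).digest()

def is_genuine(current_content: bytes, golden_hash: bytes) -> bool:
    # Compare the on-demand hash with the golden hash stored
    # in a non-user-accessible area
    return compute_hash(current_content) == golden_hash

# Hypothetical stand-in for the stored ANN weight data
weight_matrix = b"\x01\x02\x03\x04" * 256
golden = compute_hash(weight_matrix)  # taken while the content is genuine

# A single flipped byte is enough to change the digest completely
tampered = b"\xff" + weight_matrix[1:]
```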

If requested, the memory can provide the hash result as a<br />

command result.<br />

The memory can be configured to automatically hash the<br />

array or part of the array at each power cycle, and in case of data<br />

corruption, to restore a hidden version of the information.<br />

In this case, the restored information cannot be the latest<br />

data, but it might be just a version that permits safety to be<br />

maintained within the whole system. This is not a rule; it is up<br />

to the platform implementer to establish the frequency and the<br />

policies of the hidden information updates; the secure command<br />

set allows the designer to manage them accordingly.<br />

IV. SECURE UPDATE OF ARTIFICIAL INTELLIGENCE<br />

The methodology introduced in the above paragraphs can be used to update the artificial intelligence algorithms within the vehicle and, in general, to update all the features needed, while maintaining a high level of security.

ACKNOWLEDGMENT<br />

The authors wish to thank Robert Bielby for the interesting<br />

suggestions during the paper preparation; Barbara Kolbl for the<br />

great availability in coordinating the review sessions;<br />

Shelley Frost for help with editing the text; Lance Dover,<br />

Francesco Tomaiuolo and Tommaso Zerilli for the exchange of<br />

ideas on the implementation of the security features in silicon.<br />

REFERENCES<br />

[1] C. Bishop, Pattern Recognition And Machine Learning, Springer, 2006,<br />

ISBN 9780387310732.<br />

[2] I. Goodfellow et al., Deep Learning, MIT Press, 2016, ISBN 9780262035613.

[3] J. J. Hopfield, “Neural Networks and Physical Systems with Emergent Collective Computational Abilities,” PNAS, vol. 79, pp. 2554–2558, 1982.

[4] Y. LeCun et al., “Deep Learning,” Nature, 2015, doi:10.1038/nature14539.

[5] S. Marsland, Machine Learning: An Algorithmic Perspective, 2nd ed., CRC Press, 2014, ISBN 9781466583283.

[6] A. Mondello, Pianificazione frequenziale automatica nei sistemi radiomobili cellulari mediante reti neurali (Automatic frequency planning in cellular mobile radio systems using neural networks), Master’s thesis, Politecnico di Torino, 1988.

[7] R. Rojas, Neural Networks a systematic introduction, Springer, 1996,<br />

ISBN 9783540605058.<br />

[8] M. T. Hagan et al., Neural Network Design, 2nd ed., ISBN 978-0971732117.

[9] V. V. Yaschenko, Cryptography: An Introduction. AMS, 2002.<br />

[10] N. Ferguson et al. Cryptography Engineering. Wiley, 2010.<br />



[11] D. Challener et al., A Practical Guide to Trusted Computing, IBM Press, 2007.<br />

[12] TCG, "Trusted Platform Module Library, Part 1: Architecture," March 13, 2014. Available: http://www.trustedcomputinggroup.org/files/static_pagefiles/C2122862-1A4B-B2940289FD15408693D/TPM%20Rev%202.0%20Part%201%20-%20Architecture%2001.07-2014-03-13.pdf<br />

[13] FIPS 180-2, "SHA-256: Secure Hash Algorithm." Available: http://csrc.nist.gov/publications/fips/fips180-2/fips180-2withchangenotice.pdf<br />

[14] FIPS 198-1, "HMAC-SHA-256: Hash-Based Message Authentication Code." Available: http://csrc.nist.gov/publications/fips/fips198-1/FIPS-198-1_final.pdf<br />

[15] NIST SP 800-147, "BIOS Protection Guidelines." Available: http://csrc.nist.gov/publications/nistpubs/800-147/NIST-SP800-147-April2011.pdf<br />

[16] NIST SP 800-155, "BIOS Integrity Measurement Guidelines" (draft). Available: http://csrc.nist.gov/publications/drafts/800-155/draft-SP800-155_Dec2011.pdf<br />

Physical Unclonable Functions to the Rescue<br />

A New Way to Establish Trust in Silicon<br />

Geert-Jan Schrijen<br />

Intrinsic ID<br />

Eindhoven, The Netherlands<br />

geert.jan.schrijen@intrinsic-id.com<br />

Cesare Garlati<br />

prpl Foundation<br />

USA<br />

cesare@prplFoundation.org<br />

Abstract — As billions of devices connect to the Internet,<br />

security and trust become crucial. This paper proposes a new<br />

approach to provisioning a root of trust for every device, based<br />

on Physical Unclonable Functions (PUFs). PUFs rely on the<br />

unique differences of each silicon component introduced by<br />

minute and uncontrollable variations in the manufacturing<br />

process. These variations are virtually impossible to replicate. As<br />

such they provide an effective way to uniquely identify each<br />

device and to extract cryptographic keys used for strong device<br />

authentication. This paper describes cutting-edge real-world<br />

applications of SRAM PUF technology applied to a hardware<br />

security subsystem, as a mechanism to secure software on a<br />

microcontroller and as a basis for authenticating IoT devices to<br />

the cloud.<br />

Keywords — Security; Internet of Things; Physical Unclonable<br />

Function; Authentication<br />

I. INTRODUCTION<br />

The Internet of Things already connects billions of devices<br />

and this number is expected to grow into the tens of billions in<br />

the coming years [5]. To build a trustworthy Internet of Things,<br />

it is essential for these devices to have a secure and reliable<br />

method to connect to services in the cloud and to each other. A<br />

trustworthy authentication mechanism based on device-unique<br />

secret keys is needed such that devices can be uniquely<br />

identified and such that the source and authenticity of<br />

exchanged data can be verified.<br />

In a world of billions of interconnected devices, trust<br />

implies more than sound cryptography and resilient<br />

transmission protocols: it extends to the device itself, including<br />

its hardware and software. The main electronic components<br />

within a device must have a well-protected security boundary<br />

where cryptographic algorithms can be executed in a secure<br />

manner, protected from physical tampering, network attacks or<br />

malicious application code [18]. In addition, the cryptographic<br />

keys at the basis of the security subsystem must be securely<br />

stored and accessible only by the security subsystem itself. The<br />

actual hardware and software of the security subsystem must<br />

be trusted and free of known vulnerabilities. This can be<br />

achieved by reducing the size of the code to minimize the<br />

statistical probability of errors, by properly testing and<br />

verifying its functionality, by making it unmodifiable for<br />

regular users and applications (e.g. part of secure boot or in<br />

ROM) but updateable upon proper authentication (to mitigate<br />

potential vulnerabilities before they are exploited on a large<br />

scale). Ideally, an attestation mechanism is integrated with the<br />

authentication mechanism to assure code integrity at the<br />

moment of connecting to a cloud service [3].<br />

However, we are not there yet. We also need to be able to<br />

trust the actual generation and provisioning of the<br />

cryptographic keys into the security subsystem. Without trust<br />

in the key generation and injection process we cannot assure<br />

that keys are sufficiently random and that every device in fact<br />

obtains a unique key, which is the basic assumption for secure<br />

device identification. In addition, the provisioning must<br />

guarantee that private keys are not known outside the device,<br />

cannot be extracted or cloned, and that public keys are<br />

unmodifiable without proper authentication.<br />

A trustworthy Internet of Things requires a trust continuum<br />

from chip manufacturing through code development, device<br />

manufacturing, software and key provisioning, all the way to<br />

connecting to the actual cloud service. Central to the capability<br />

of a device to authenticate to the cloud is its digital identity,<br />

which is protected by the security subsystem. Devices that<br />

make up the Internet of Things use a broad variety of silicon<br />

components. It will therefore be a daunting challenge to roll out<br />

a universal security solution that works seamlessly for all<br />

possible microchip technologies in a consistent cost-effective<br />

way.<br />

The further outline of this paper is as follows. Section II<br />

articulates the importance of device root keys as a basis for a<br />

digital device identity and authentication. Section III introduces<br />

SRAM-based PUF as an innovative, flexible and cost-effective<br />

way to bootstrap and secure such root keys in a universal way<br />

on the widest possible variety of microchip technologies.<br />

Finally, section IV highlights some relevant real-world<br />

applications.<br />

II. DEVICE IDENTITY AND AUTHENTICATION<br />

To securely authenticate a device that is connecting to a<br />

cloud service or for unmanned machine-to-machine<br />

connectivity, every single device must provide a strong<br />

cryptographic identity. Such identity typically consists of an<br />

asymmetric key pair, composed of a public key and a private<br />

key. The private key must be kept secret in the device and<br />

ideally should never leave the device security boundary. The<br />

public key on the other hand can be output and communicated<br />

to external entities. According to the current PKI model, before<br />

the key pair can be used for device authentication a trusted<br />

entity needs to assert that the public key in fact belongs to a<br />

specific device (e.g. specific brand, model, serial number). This<br />

assertion is created in the form of a digital certificate. The<br />

trusted entity is typically the OEM that manufactures the<br />

device, although many variations in the supply chain setup are<br />

possible.<br />

Devices are authenticated by sending their digital<br />

certificate, which includes the public key, to the verifying<br />

entity, e.g. the cloud service or another device. The verifying<br />

party checks the contents of the certificate and verifies by the<br />

known public key that it is correctly signed by a party it trusts.<br />

The device’s public key that is in the certificate can then be<br />

used to verify the authenticity of the device by means of<br />

established authentication protocols. For example, a challenge-response<br />

protocol can be used in which the verifying party<br />

generates a random number and sends it to the device. The<br />

device generates a response value using its private key to<br />

compute a digital signature on the received challenge. The<br />

verifying party receives the response and verifies that the<br />

signature is correct using the device’s public key. Alternative<br />

authentication schemes based on asymmetric keys are possible.<br />

For example, when the device sets up a secure HTTP<br />

connection to the cloud service using the TLS protocol, the<br />

client authentication check is done as part of the TLS<br />

handshake. This use case is described in section IV.C.<br />
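The challenge-response exchange described above can be sketched in a few<br />
lines. The textbook-RSA parameters below are deliberately tiny and insecure,<br />
chosen only to make the sign/verify arithmetic visible; a real device would use<br />
a vetted implementation (e.g. ECDSA) with keys generated inside the security<br />
boundary:<br />

```python
import hashlib

# Toy, insecure textbook-RSA parameters for illustration only.
p, q = 61, 53
n = p * q                          # public modulus
e = 17                             # public exponent
d = pow(e, -1, (p - 1) * (q - 1))  # private exponent (kept on-device)

def h(msg: bytes) -> int:
    # Hash the challenge and reduce it into the RSA domain.
    return int.from_bytes(hashlib.sha256(msg).digest(), "big") % n

def device_sign(challenge: bytes) -> int:
    # The device signs the received challenge with its private key d.
    return pow(h(challenge), d, n)

def verifier_check(challenge: bytes, signature: int) -> bool:
    # The verifying party checks the signature with the public key (n, e).
    return pow(signature, e, n) == h(challenge)

challenge = b"random-nonce-1234"      # generated by the verifying party
sig = device_sign(challenge)          # computed inside the device
assert verifier_check(challenge, sig)              # authentication succeeds
assert not verifier_check(challenge, (sig + 1) % n)  # tampered signature fails
```

A fresh random challenge per session prevents an attacker from replaying a previously observed response.<br />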

The asymmetric key pair that forms the device identity<br />

needs to be securely stored inside the security subsystem. This<br />

can be achieved via key wrapping, a process that involves<br />

encrypting the private key within the security boundary before<br />

storing it in non-volatile memory (NVM). The root key, used to<br />

encrypt the other secrets, must be device-unique and securely<br />

stored inside the security boundary: see the use case in section<br />

IV.A. Besides encrypting additional secrets for permanent<br />

storage, the root key can also be used to derive additional<br />

private/public key pairs directly via a cryptographic key<br />

derivation mechanism. Such keys can be used to authenticate<br />

and establish secure channels with multiple devices.<br />
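The key derivation mechanism mentioned above can be illustrated with an<br />
HKDF-style construction (per RFC 5869) built from the standard library; the<br />
root key value and the purpose labels below are placeholders, not values from<br />
any real system:<br />

```python
import hmac
import hashlib

def hkdf(root_key: bytes, info: bytes, length: int = 32,
         salt: bytes = b"") -> bytes:
    """HKDF (RFC 5869) sketch: derive a purpose-specific key from the
    PUF-reconstructed root key. The `info` label separates key purposes."""
    # Extract: concentrate the root key material into a pseudorandom key.
    prk = hmac.new(salt or b"\x00" * 32, root_key, hashlib.sha256).digest()
    # Expand: stretch the PRK into `length` output bytes bound to `info`.
    okm, block = b"", b""
    for i in range((length + 31) // 32):
        block = hmac.new(prk, block + info + bytes([i + 1]),
                         hashlib.sha256).digest()
        okm += block
    return okm[:length]

root = bytes(32)  # stand-in for the key reconstructed from the SRAM PUF
k_wrap = hkdf(root, b"nvm-key-wrapping")
k_tls = hkdf(root, b"tls-client-identity")
assert k_wrap != k_tls                            # distinct key per purpose
assert k_wrap == hkdf(root, b"nvm-key-wrapping")  # reproducible on-device
```

Because derivation is deterministic, each purpose-specific key can be recomputed on demand from the root key instead of being stored.<br />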

Provisioning root keys into a chip is an essential step in<br />

establishing a root of trust anchored in hardware. Traditional<br />

key storage methods require the root keys to be injected at an<br />

early stage in the production chain. This process implies that<br />

secret keys are handed over from device manufacturer to<br />

silicon manufacturer, and hence are revealed to different parties<br />

in the production chain. This creates undesired liabilities for<br />

both parties as the root keys are known outside the device’s<br />

security boundary. In the IoT this problem is enormously<br />

amplified by the sheer number of devices. In this emerging<br />

scenario, distribution and potential leakage of root keys<br />

becomes the single most important problem [9].<br />

To overcome these limitations, a flexible new key<br />

provisioning method is needed that enables secure<br />

programming of device root keys at any stage in the production<br />

process, allowing a device maker to reduce its dependency on<br />

other trusted parties. Physical Unclonable Functions (PUFs)<br />

based on SRAM memory are an ideal candidate for providing a<br />

universal cost-effective solution to this root key programming<br />

and storage problem.<br />

III. PHYSICAL UNCLONABLE FUNCTIONS<br />

Physical Unclonable Functions (PUFs) are known as<br />

electronic design components that derive device-unique silicon<br />

properties, or silicon fingerprints, from integrated circuits<br />

(ICs). The tiny and uncontrollable variations in feature<br />

dimensions and doping concentrations lead to a unique<br />

threshold voltage for each transistor on a chip. Since even the<br />

manufacturer cannot control these exact variations for a<br />

specific device, the physical properties are de facto unclonable.<br />

These minute variations do not influence the intended<br />

operation of the integrated circuit. However, they can be<br />

detected with specific on-chip circuitry to form a device-unique<br />

silicon fingerprint. The implementation of such a measurement<br />

circuit is called a PUF circuit. There are several<br />

alternatives for implementing PUF circuits in an IC. They<br />

vary from comparing path delays and frequencies of free<br />

running oscillators to measuring startup data from memory<br />

components [10]. A particularly promising PUF technology is<br />

based on SRAM memory. The SRAM PUF has excellent<br />

stability over time, temperature and supply voltage variations<br />

and it provides the highest amount of entropy. Furthermore, it<br />

is available as a standard component in almost every IC. The<br />

latter aspect has important advantages in terms of deployment,<br />

testability and time to market. SRAM PUFs can be used in<br />

standard chips by software access to uninitialized SRAM<br />

memory at an early stage of the boot process. Hence, it is not<br />

required to integrate special PUF circuitry into the hardware of<br />

the chip when using SRAM PUF technology.<br />

A. SRAM PUF<br />

SRAM PUFs are based on the power-up values of SRAM<br />

cells. Every SRAM cell consists of two cross-coupled<br />

inverters. In a typical SRAM cell design, the inverters are<br />

designed to be nominally identical. However, due to the minute<br />

process variations during manufacturing, the electrical<br />

properties of the cross-coupled inverters will be slightly out of<br />

balance. In particular, the threshold voltages of the transistors<br />

in the inverters will show some random variation. This minor<br />

mismatch gives each SRAM cell an inclination to power-up<br />

with either a logical 0 or a logical 1 on its output, which is<br />

determined by the stronger of the two inverters. Since this<br />

variation is random, on average 50% of the SRAM cells have 0<br />

as their preferred startup state and 50% have 1. Note that<br />

SRAM memory is normally used by writing data values into<br />

the memory and reading back the written values at a later point<br />

in time. To use SRAM as a PUF, one simply reads out the<br />

memory contents of the SRAM before any data has been<br />

written into it.<br />

One can evaluate the behavior of this SRAM PUF based on<br />

two main properties for PUFs: reliability and uniqueness. Over<br />

the past years thorough analysis of SRAM PUF data has been<br />

performed. Startup patterns have been measured under various<br />

conditions, from SRAM implemented in several technology<br />

nodes (180nm down to 14nm) by several foundries with<br />

different processes.<br />

Fig. 1: A 6-transistor SRAM cell; two cross-coupled inverters are formed by<br />

left inverter consisting of PMOS transistor PL and NMOS transistor NL and<br />

right inverter consisting of PMOS transistor PR and NMOS transistor NR.<br />

Left and right access transistors are indicated as AXL and AXR respectively.<br />

Extensive tests performed by leading PUF vendors and<br />

universities (e.g. in [10],[17]) have yielded the following<br />

results:<br />

• Reliability: Most of the bit cells in an SRAM array have a<br />

strongly preferred startup value which remains static over<br />

time and under varying operational conditions. A minority<br />

of cells consist of inverters that are coincidentally well<br />

balanced and result in bit cells that will sometimes start up<br />

as a 0 and sometimes as a 1. This causes limited “noise”<br />

(or, deviation from the initial reference measurement) in<br />

consecutive SRAM startup measurements. Tests<br />

demonstrate that the noise level of the SRAM PUF under<br />

extensive environmental conditions (e.g. temperatures<br />

ranging from -55˚C to 125˚C) and over years of lifetime<br />

(see also [12]) is sufficiently low to extract cryptographic<br />

keys with overwhelming reliability when using appropriate<br />

post-processing techniques.<br />

• Uniqueness: Extensive testing demonstrates that the startup<br />

pattern of an SRAM array is unique for every IC and even<br />

for a specific memory (region) within every IC. It is highly<br />

unpredictable from chip to chip and hence provides a large<br />

amount of entropy. The amount of entropy is sufficiently<br />

high to efficiently generate secure and unique cryptographic<br />

keys suitable for a broad range of applications.<br />
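The reliability and uniqueness properties above can be illustrated with a toy<br />
statistical model of SRAM startup behavior; the cell count and the 6% flip<br />
probability below are illustrative assumptions, not measured values:<br />

```python
import random

def sram_startup(preferred, flip_prob, rng):
    # One power-up readout: each cell yields its preferred value,
    # except that unstable cells occasionally flip.
    return [b ^ (rng.random() < flip_prob) for b in preferred]

def frac_hd(a, b):
    # Fractional Hamming distance between two readouts.
    return sum(x != y for x, y in zip(a, b)) / len(a)

rng = random.Random(42)  # fixed seed: this is only a model
n_cells = 4096
# Manufacturing variation fixes a random preferred value per cell, per device.
dev_a = [rng.randrange(2) for _ in range(n_cells)]
dev_b = [rng.randrange(2) for _ in range(n_cells)]

# Reliability: two readouts of the SAME device differ only by noise.
intra = frac_hd(sram_startup(dev_a, 0.06, rng),
                sram_startup(dev_a, 0.06, rng))
# Uniqueness: readouts of DIFFERENT devices disagree on ~50% of cells.
inter = frac_hd(sram_startup(dev_a, 0.06, rng),
                sram_startup(dev_b, 0.06, rng))
print(f"intra-device HD ~ {intra:.3f}, inter-device HD ~ {inter:.3f}")
assert intra < 0.25 < inter
```

The large gap between intra-device and inter-device distance is exactly what lets a Fuzzy Extractor correct the noise while keeping devices distinguishable.<br />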

B. Root Key Storage with PUFs<br />

PUFs can be used to reconstruct a device-unique<br />

cryptographic root key on the fly, without storing secret data<br />

in non-volatile memory. Since PUF responses are noisy, they<br />

cannot be used directly as a cryptographic key. To remove the<br />

noise and to extract sufficient entropy, a so-called Fuzzy<br />

Extractor is needed. A Fuzzy Extractor or Helperdata<br />

Algorithm is a cryptographic primitive that turns PUF<br />

response data into a reliable cryptographic root key.<br />

The Fuzzy Extractor (see Fig. 2) has two modes of<br />

operation: Enrollment and Key Reconstruction.<br />

In Enrollment mode, which is typically executed once over<br />

the lifetime of the chip, the Fuzzy Extractor reads out an<br />

SRAM PUF response and computes the so-called Helperdata<br />

that is then stored in (non-volatile) memory accessible to the<br />

chip [11].<br />

Whenever the cryptographic root key is needed by the chip,<br />

the Fuzzy Extractor is used in the Key Reconstruction mode. In<br />

this mode a new SRAM PUF response is read out and<br />

Helperdata is applied to correct the noise. A hash function is<br />

subsequently applied to reconstruct the cryptographic root key.<br />

In this way the same key can be reconstructed under varying<br />

external conditions such as temperature and supply voltage.<br />

Important: by design the Helperdata does not contain any<br />

information on the cryptographic key itself and it can therefore<br />

be safely stored in any kind of unprotected Non-Volatile<br />

Memory (NVM) on- or off-chip. At rest, when the device is<br />

powered down, no secret is ever present in memory making<br />

traditional expensive anti-tamper requirements obsolete.<br />

Fig. 2: A Fuzzy Extractor operates in two basic modes: i) In Enrollment mode<br />

(steps 1-2) Helperdata is generated based on a measured SRAM PUF<br />

response, ii) In the Key Reconstruction mode (steps 3-5) the Helperdata is<br />

combined with a fresh SRAM PUF response for reconstructing the device-unique<br />

cryptographic root key.<br />
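A minimal code-offset Fuzzy Extractor can be sketched as follows, using a<br />
simple repetition code for noise correction. This is a didactic sketch that<br />
assumes uniformly random PUF bits; commercial implementations use far stronger<br />
error-correcting codes and additional countermeasures:<br />

```python
import hashlib
import random

REP = 15  # repetition-code length: each secret bit is stored 15 times

def enroll(puf_bits, secret_bits):
    # Enrollment: Helperdata = PUF response XOR repetition-encoded secret.
    code = [b for b in secret_bits for _ in range(REP)]
    return [p ^ c for p, c in zip(puf_bits, code)]

def reconstruct(noisy_puf, helper):
    # Reconstruction: XOR a fresh (noisy) readout with the Helperdata,
    # majority-decode each block to remove the noise, then hash into a key.
    noisy_code = [p ^ h for p, h in zip(noisy_puf, helper)]
    bits = [int(sum(noisy_code[i:i + REP]) > REP // 2)
            for i in range(0, len(noisy_code), REP)]
    return hashlib.sha256(bytes(bits)).digest()

rng = random.Random(7)
secret = [rng.randrange(2) for _ in range(64)]     # chosen at enrollment
puf = [rng.randrange(2) for _ in range(64 * REP)]  # reference PUF readout
helper = enroll(puf, secret)                       # stored in plain NVM
noisy = [p ^ (rng.random() < 0.05) for p in puf]   # fresh readout, 5% noise
key = reconstruct(noisy, helper)
assert key == hashlib.sha256(bytes(secret)).digest()  # same key despite noise
```

With uniform PUF bits the Helperdata reveals nothing about the key by itself, which is why it can sit in unprotected memory.<br />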

C. Fuzzy Extractor implementations<br />

A Fuzzy Extractor is typically implemented inside a chip in<br />

one of the following basic forms:<br />

• Hardware IP: A hardware IP module that is connected to a<br />

dedicated SRAM memory. The Fuzzy Extractor hardware<br />

IP block directly controls the SRAM memory interface to<br />

read out the PUF values. The cryptographic key can be<br />

output via a dedicated interface to a cryptographic<br />

accelerator. The security advantages of such an<br />

implementation are discussed in the next subsection.<br />

Besides security advantages, a Fuzzy Extractor<br />

implemented in hardware is typically faster and more<br />

power efficient than the equivalent software<br />

implementation.<br />

• Software IP: A software library that can access a dedicated<br />

portion of the overall SRAM memory. It is preferable that<br />

the SRAM portion used by the PUF algorithm is not shared<br />

with other software. Memory management units, silicon<br />

firewalls and trusted execution environments (TEEs) should be<br />

used if available. The Fuzzy Extractor does not<br />

contain any secrets, so it does not need to be encrypted.<br />

However, it is important to guarantee the integrity of the<br />

software itself. This can be achieved with a secure boot<br />

setup or by locking down the software on the chip with<br />

alternative mechanisms provided by the chip itself.<br />

Advantages of the software variant include flexible<br />

deployment options, e.g. retrofitting existing devices in the<br />

field and integration with other security components, with<br />

minimal or no hardware changes.<br />

D. Security level provided by PUFs<br />

Using the PUF to reconstruct a cryptographic root key has the<br />

following security advantages:<br />

• Keys are reconstructed on the fly when needed and are<br />

present only temporarily within the security boundary of<br />

the chip. This greatly reduces the attack surface and time<br />

window for exploiting potential vulnerabilities.<br />

• When the chip is powered down, no physical traces of the<br />

key are present in the chip.<br />

• Guaranteed randomness from the physics of the silicon<br />

results in full entropy keys.<br />

• Root keys are generated within the security boundary of<br />

the chip rather than being injected from the outside,<br />

resulting in a safer and more flexible provisioning process<br />

throughout the supply chain.<br />

It is important to observe that the Fuzzy Extractor must be<br />

implemented and integrated in a secure manner to minimize<br />

the exposure to various attack vectors including software<br />

vulnerabilities, side-channel and invasive attacks. Various<br />

countermeasures are possible and this is an area where<br />

established PUF vendors have developed considerable<br />

proprietary IP.<br />

The actual security level achieved depends largely on the<br />

integration of the Fuzzy Extractor with the security subsystem.<br />

One of the design goals is to make sure that only the Fuzzy<br />

Extractor can access the SRAM PUF. In case of a hardware<br />

integration this is assured by connecting a dedicated SRAM<br />

memory directly to the Fuzzy Extractor and making sure that<br />

there are no software interfaces to it. To this end, it is for<br />

example preferable to use a Built-In Self Test instead of a scan<br />

chain [2]. In case of a software implementation one needs to<br />

make sure that access control settings of the chip are set up<br />

correctly. For example, this is done by using a memory<br />

management unit to reserve access to the SRAM PUF region<br />

of the memory dedicated to the Fuzzy Extractor software, by<br />

locking down the software image using firmware lock bits, by<br />

applying secure boot or by integrating into a TEE.<br />

Additionally, the in-circuit debug facilities need to be<br />

disabled.<br />

Another design goal is to make sure that the cryptographic<br />

key that is output by the Fuzzy Extractor is transported<br />

securely to the cryptographic software that requires it. In case<br />

of a hardware implementation, this can be arranged by<br />

connecting the output bus of the Fuzzy Extractor hardware via<br />

a direct internal connection to a cryptographic coprocessor. In<br />

case of a software implementation, one needs to make sure that<br />

any registers used to store the key are cleared as soon as<br />

possible and cannot be accessed by untrusted processes.<br />

Similar measures as described in the previous paragraph can be<br />

taken to lock down the security boundary of the chip.<br />

E. Known attacks to PUFs<br />

Delay-based PUFs such as Arbiter PUFs and Ring-Oscillator<br />

PUFs promise a large space of independent challenge-response<br />

pairs that can be used for special authentication<br />

schemes [6][7]. In practice, however, it turns out that<br />

implementations of such PUFs are broken by modelling<br />

attacks, showing that responses are predictable given a limited<br />

subset of challenge-response pairs [15][16].<br />

Memory-based PUFs such as the SRAM PUF are not<br />

susceptible to such attacks. The attacks that have been<br />

demonstrated on SRAM PUFs have been conducted only in<br />

non-realistic laboratory setups and do not form a threat to<br />

practical implementations. For example, with highly<br />

specialized equipment such as laser scanners it seems possible<br />

to read out SRAM memory contents by observing photo<br />

emissions during repeated read cycles [14]. This method is,<br />

however, feasible only in antiquated large technology nodes<br />

(e.g. 300 nm) and does not scale down to smaller modern<br />

technology nodes. In addition, the documented attacks require<br />

a situation where many consecutive SRAM read operations are<br />

executed sequentially on the same SRAM address range; a<br />

situation that does not occur in a good Fuzzy Extractor<br />

implementation. The work presented in [8] uses such a readout<br />

method in combination with a Focused Ion Beam to “clone” a<br />

PUF response from a first to a second SRAM memory. It<br />

should be noted that this is feasible only in obsolete large<br />

technology nodes (demonstrated on 600nm technology) and<br />

that it is only practical to clone a very limited number of bits<br />

with significant effort. In addition, commercial<br />

implementations include various proprietary countermeasures<br />

that make these kinds of attack simply infeasible. As of today<br />

there are no documented successful attacks on commercial-grade<br />

SRAM PUF implementations.<br />

IV. USE CASES<br />

This section offers some real-world examples of successful<br />

SRAM PUF applications.<br />

A. Secure key vault<br />

The SRAM PUF can be used to provide a cryptographic<br />

root key for a hardware security subsystem. The Fuzzy<br />

Extractor IP block is integrated with the security system IP.<br />

The chip-unique cryptographic root key that is reconstructed<br />

from the SRAM PUF feeds directly into the cryptographic<br />

module, for example an AES core. Fig. 3 shows a typical<br />

security subsystem architecture.<br />

To initialize the system, the PUF must be enrolled: a first<br />

readout of the SRAM startup values is used by the Fuzzy<br />

Extractor to compute the Helperdata (steps 0 and 1 in Fig. 3).<br />

Once the Helperdata is stored in the chip’s non-volatile<br />

memory (NVM), the enrollment step is completed.<br />

Fig. 3: Secure key vault based on SRAM PUF depicting Enrollment steps 0<br />

and 1 (dotted lines); Key reconstruction steps 2,3,4, and Encryption of data<br />

generated on processor in steps 5 and 6.<br />

The enrollment step establishes the device-unique<br />

cryptographic root key in the security subsystem. To<br />

reconstruct this key for use, the Helperdata is read from NVM<br />

and combined with a readout of the SRAM startup values in<br />

the Fuzzy Extractor (steps 2 and 3). The reconstructed key is<br />

fed into the AES core (step 4). Data that is being processed by<br />

the CPU can be securely stored by feeding it to the security<br />

subsystem, where it is encrypted using the AES module and<br />

stored in NVM (steps 5 and 6). Note that besides just<br />

encrypting the data, the AES core can also be used to protect<br />

the integrity of the data by computing additional<br />

authentication tags or by using an authenticated encryption<br />

mode such as AES-GCM.<br />

When the processor requires the secure data, steps 2, 3 and<br />

4 are repeated to reconstruct the cryptographic root key and<br />

load it into the AES core. Steps 6 and 5 are then reversed in<br />

direction to feed the encrypted data to the AES block, have the<br />

AES block decrypt it and feed it back to the CPU.<br />

This mechanism makes it possible to keep secrets in<br />

otherwise unprotected non-volatile memory. Note that only<br />

encrypted data and non-sensitive Helperdata is ever stored in<br />

NVM. No secret is ever stored in permanent memory. The<br />

cryptographic root key that is reconstructed from the SRAM<br />

PUF is not known anywhere outside the security boundary.<br />

Therefore, the data that is securely stored in the chip’s NVM<br />

can be decrypted only on the same chip on which it has been<br />

generated. Transferring it to any other target device is not a<br />

concern, even if the Helperdata is copied along with it. The<br />

Helperdata can be used only with the specific SRAM<br />

fingerprint of the chip that generated it in the first place.<br />

B. Software protection in microcontroller<br />

This section describes a use case where the SRAM PUF is<br />

used to protect software IP on a microcontroller. We assume<br />

the microcontroller has an internal flash memory where its<br />

program code can be stored. Before code is executed it is<br />

loaded into an internal SRAM memory. A small part of the<br />

SRAM memory is reserved to be used as PUF. This can be<br />

achieved by instructing the compiler to exclude a certain part<br />

of the SRAM from the memory map, assuring that it will not<br />

be “visible” by other software.<br />

We furthermore assume that the microcontroller has some<br />

access control mechanisms to:<br />

1. Lock down the software in the flash memory to prevent any<br />

modification<br />

2. Disable in-circuit debug facility<br />

Except for a few low-end microcontrollers, these access<br />

control mechanisms are quite common.<br />

1) Setup phase<br />

To securely set up the system, we use a provisioning PC in<br />

a trusted environment to load the code in the flash memory of<br />

the microcontroller, as depicted in step 1 of Fig. 4. This is the<br />

software that will be executed at runtime (see next section).<br />

The software consists of:<br />

• A boot image containing the Fuzzy Extractor algorithm and<br />

the cryptographic cipher algorithms used to decrypt the<br />

software image<br />

• A software image encrypted with key S. Initially the<br />

software has an empty header. At the end of the setup phase<br />

the header will be overwritten with a uniquely encrypted<br />

header per device.<br />

After storing the software code in flash memory, the<br />

provisioning PC loads a temporary enrollment image in the<br />

executable SRAM of the device. This is depicted in step 2.<br />

The enrollment image contains the Fuzzy Extractor algorithm,<br />

as well as a cryptographic cipher that can be used to encrypt a<br />

header for the software image in flash. Furthermore, it<br />

contains the software image encryption key S.<br />

When execution of the enrollment image is triggered (step<br />

3), the SRAM PUF is read out (step 4) and Helperdata is<br />

created by the Fuzzy Extractor algorithm. The Helperdata is<br />

stored in the flash memory (step 5). Based on the Helperdata<br />

and the SRAM PUF readout, the cryptographic root key of the<br />

device K is reconstructed by the Fuzzy Extractor. Using the<br />

cryptographic cipher in the enrollment image, the software<br />

image encryption key S is encrypted with the device-unique<br />

key K. The resulting value, denoted as E[K](S), is written in<br />

the header of the encrypted software image (step 6). The flash<br />

memory now contains an encrypted software image, with a<br />

header that is specifically encrypted for the device it is stored<br />

on.<br />

At the end of the setup phase the enrollment image is removed from SRAM. The provisioning PC triggers the necessary mechanisms in the microcontroller to lock the software images in flash and to disable the debug port.<br />

Fig. 4: SRAM PUF-based software protection mechanism, setup phase.<br />

2) Runtime operation<br />

Fig. 5: SRAM PUF-based software protection mechanism, runtime operation.<br />

The runtime flow is depicted in Fig. 5. First the microcontroller boot loader copies the first boot image into the SRAM of the microcontroller (step 1) and triggers execution (step 2). The boot stage code reads the SRAM PUF values (step 3) as well as the Helperdata (step 4). The Fuzzy Extractor algorithm in the boot image uses these values to reconstruct the device-unique root key K. The key K is used to decrypt the header of the software image (step 5). Decrypting the software image header results in the software image key S, which is then used to decrypt the software image in flash (step 6) as it is being copied to execution SRAM (step 7). When the full software image is decrypted and available in the SRAM, execution of the image is triggered (step 8).<br />

The PUF plays an essential role in providing the microcontroller with a device-unique cryptographic root key that is used to bind the software image to the specific device. The root key is only temporarily reconstructed in working memory to decrypt the header of the software image. Likewise, the decrypted software image key is only temporarily present in working memory to decrypt the software image. When the device is powered off, the plain software disappears from the execution SRAM memory. Only encrypted values are left in the flash memory.<br />

The software protection method described in this section can be retrofitted to existing devices as it is completely software based. Still, the root of trust originates from the SRAM PUF in hardware. The core component that enables this mechanism is the Fuzzy Extractor, which enables key reconstruction from a standard SRAM memory available in the microcontroller. An open source reference implementation of such a Fuzzy Extractor is available as part of the prpl Security Framework, see [19].<br />
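The wrap-and-unwrap of the image key S (writing E[K](S) into the header during setup, recovering S again at runtime) can be condensed into a short sketch. SHA-256 in counter mode is used here purely as a stand-in for the cryptographic cipher in the enrollment and boot images, and all key material is placeholder data:

```python
import hashlib

def keystream(key: bytes, n: int) -> bytes:
    # SHA-256 in counter mode as a stand-in for the real cipher.
    out = b""
    ctr = 0
    while len(out) < n:
        out += hashlib.sha256(key + ctr.to_bytes(4, "big")).digest()
        ctr += 1
    return out[:n]

def xor(a: bytes, b: bytes) -> bytes:
    return bytes(x ^ y for x, y in zip(a, b))

# Setup phase: the image key S is wrapped under the PUF-derived root key K
# and written into the image header as E[K](S).
K = hashlib.sha256(b"puf-derived-root-key").digest()  # from Fuzzy Extractor
S = hashlib.sha256(b"software-image-key").digest()
header = xor(S, keystream(K, len(S)))                 # E[K](S)
firmware = b"example firmware image"
encrypted_image = xor(firmware, keystream(S, len(firmware)))

# Runtime: reconstruct K via the Fuzzy Extractor, unwrap S from the
# header, then decrypt the image while copying it to execution SRAM.
S_unwrapped = xor(header, keystream(K, len(header)))
decrypted = xor(encrypted_image, keystream(S_unwrapped, len(encrypted_image)))
```

Because only `header` and `encrypted_image` ever reach flash, a cloned flash dump is useless on another chip: without that chip's SRAM PUF, K cannot be reconstructed and S stays wrapped.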

C. Device authentication to the cloud<br />

In this use case scenario, we describe how the SRAM PUF<br />

is used as a basis to connect IoT end nodes securely to a cloud<br />

service such as Amazon Web Services or Microsoft Azure<br />

cloud. We assume that the IoT device employs an off-the-shelf<br />

microcontroller as its main processing unit. An OEM (Original<br />

Equipment Manufacturer) owns both the devices and the<br />

service that is running in the cloud. The situation is depicted in<br />

Fig. 6.<br />

1) Installation phase<br />

In the installation phase (step 1) the OEM installs its IoT<br />

Service on the cloud platform of choice. The cloud service has<br />

its own private/public key pair denoted d S/Q S. This key pair is<br />

used to authenticate the service toward its clients. Furthermore,<br />

the cloud service knows the public key Q CA of a trusted<br />

Certificate Authority. This public key is used to verify device<br />

identity certificates of the end nodes that connect to the cloud<br />

service.<br />

The OEM also provides a software image to the Contract<br />

Manufacturer for installation on the IoT devices (step 2).<br />

Embedded in this software image is the URL of the cloud<br />

service, as well as the public key Q S of the cloud service. This<br />

key is used to authenticate the OEM IoT service toward the<br />

device. The software image contains the following<br />

submodules:<br />

• Fuzzy Extractor: The software library that reads out the<br />

uninitialized SRAM values from a reserved part of the<br />



SRAM of the IoT device in order to reconstruct a device-unique cryptographic key K.<br />

• TLS & crypto library: A software library that contains<br />

cryptographic functionality for securing a network<br />

connection using the Transport Layer Security protocol<br />

[20].<br />

• Connectivity library: A network stack running on the IoT<br />

device, which enables the device to connect to Internet<br />

services. It will typically set up a TCP/IP stack over a<br />

physical network connection such as ethernet or Wi-Fi.<br />

Furthermore, it will support a connectivity protocol such as<br />

MQTT (Message Queuing Telemetry Transport) to run on<br />

top of the TCP/IP stack [21].<br />

• OEM Application: The actual application software that<br />

provides the device with the intended functionality.<br />

2) Setup phase<br />

Every device will go through a setup phase in the production environment of the Contract Manufacturer, which operates on behalf of the OEM. As part of this enrollment step, the Fuzzy Extractor reads out the SRAM PUF values (step 3) and generates Helperdata (step 4), which is stored in non-volatile memory. The device-unique cryptographic key K is output by the Fuzzy Extractor and used with a Key Derivation Function in the TLS crypto library to derive an asymmetric elliptic curve device key pair d D/Q D. The private key of this key pair is never stored in any non-volatile memory; it is reconstructed on the fly only when needed. The public device key Q D is sent via the contract manufacturer PC or Automated Test Equipment to the Certificate Authority service (step 6). The CA generates a device certificate, which includes the device public key Q D as well as a signature created with the CA private key d CA. Optionally the certificate may include other chip or device IDs. The device certificate, denoted as $[d CA](Q D), is stored in non-volatile memory on the device (step 7). After this step the device has an “identity” in the form of a public-key certificate.<br />

Note that this phase implements a one-time-trust event where the contract manufacturer assures that the device public key Q D is valid for the specific device and triggers the generation of a certificate at the CA. The contract manufacturer is trusted for correctly requesting certificates for public keys of the devices. It does not have to be trusted to handle any sensitive private keys.<br />

Fig. 6: Cloud authentication mechanism based on SRAM PUF.<br />

3) Runtime operation<br />

Once the IoT device is in the field, it can autonomously set up secure connections to the OEM IoT Service. First, the Fuzzy Extractor is used to reconstruct the device-unique cryptographic key K from a readout of the SRAM PUF (step 8) and the Helperdata (step 9). The cryptographic key K is then used by the crypto library to derive the asymmetric key pair d D/Q D (step 10) and prepare for cryptographic support of the secure network connection.<br />
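Step 10 can be illustrated with a minimal sketch. HMAC-SHA256 stands in for the Key Derivation Function, and exponentiation in a multiplicative group modulo a prime stands in for elliptic curve point multiplication; the group parameters and the derivation label are illustrative only, not those of any real deployment.

```python
import hashlib
import hmac

# Toy group parameters standing in for the elliptic curve.
P_MOD = 2**127 - 1   # a Mersenne prime (illustrative modulus)
G = 3                # generator, stand-in for the curve base point P

def derive_keypair(K: bytes):
    # KDF: HMAC-SHA256(K, label) -> private scalar d_D.
    # pow(G, d, P_MOD) stands in for the point multiplication Q_D = d_D * P.
    d = int.from_bytes(
        hmac.new(K, b"device-key-v1", hashlib.sha256).digest(), "big"
    ) % (P_MOD - 1)
    return d, pow(G, d, P_MOD)

K = hashlib.sha256(b"sram-puf-root-key").digest()  # from the Fuzzy Extractor
d_D, Q_D = derive_keypair(K)

# The derivation is deterministic: after a reboot, re-running the Fuzzy
# Extractor yields the same K and therefore the same key pair, so neither
# d_D nor Q_D ever needs to be stored in non-volatile memory.
d_again, Q_again = derive_keypair(K)
```

This determinism is exactly what makes the scheme work without key storage: the "stored" private key is really a recomputation recipe rooted in the silicon.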



The connectivity library contacts the Internet service via the URL that is fixed in the OEM software image (step 11). A TLS connection is then set up where the server is authenticated toward the device based on the public key Q S that is stored in the OEM SW image (fetched via step 11). The Device Certificate (obtained via step 12) is used to authenticate the client IoT device toward the OEM IoT cloud service. Setting up the TLS connection (step 13) uses support from the crypto algorithms in the TLS layer (step 14) and on a high level proceeds as follows [20], see also Fig. 7:<br />

a. Client and Server exchange initial messages where the client sends to the server a list of ciphers that it supports. The server compares this list with the ciphers that it supports and selects its preferred cipher that both sides support. In this case we assume that TLS_ECDHE_ECDSA is supported by the client and selected for setting up the secure connection. This cipher combination uses elliptic curve Diffie-Hellman key exchange to set up a shared session key, and the elliptic curve digital signature algorithm for authentication (i.e. message signing).<br />

b. The server determines the elliptic curve parameters, including the elliptic curve base point P. The server randomly generates an ephemeral elliptic curve key pair d SR/Q SR, where Q SR = d SR∙P, and signs the ephemeral public key Q SR with its private key d S using the ECDSA signature algorithm. Note that the operator “∙” denotes point multiplication over the elliptic curve. The signature value is denoted as $[d S](Q SR).<br />

c. Then the server sends the signed ephemeral public key $[d S](Q SR) to the client, together with the elliptic curve parameters.<br />

d. The client uses the server’s public key Q S to verify that Q SR was signed correctly.<br />

e. The client sends its public key certificate to the server. The server uses the CA public key Q CA to verify the certificate and to be assured of the correct device’s public key Q D.<br />

f. The client also randomly generates an ephemeral elliptic curve key pair d DR/Q DR, where Q DR = d DR∙P. The public ephemeral key Q DR is sent back to the server.<br />

g. The client uses its private key d D to sign the TLS transcript (messages exchanged in steps a-f) and sends the signature to the server.<br />

h. The server verifies the signature using the previously verified device public key Q D.<br />

i. The client computes a shared secret as S = d DR∙Q SR = d DR∙d SR∙P over the elliptic curve group.<br />

j. The server computes the same shared secret as S = d SR∙Q DR = d SR∙d DR∙P.<br />

Now that both client and server side have the same shared key S, symmetric session keys are derived from it to encrypt and authenticate further messages that are exchanged between both sides. Note that authentication of the client IoT device toward the server is done through steps e, g and h. The private device key d D that is used for this authentication step is derived from the PUF key K. When the IoT device is powered off, no private keys are present. No sensitive data is ever stored in any NVM memory.<br />

Fig. 7: Simplified overview of TLS key agreement steps based on ECDH protocol.<br />
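Steps b, f, i and j can be sketched with a common simplification: ordinary Diffie-Hellman modulo a prime replaces elliptic curve point multiplication, so `pow(G, d, P_MOD)` plays the role of d∙P. The parameters are illustrative only, and the signing steps are omitted to keep the sketch focused on the key agreement itself.

```python
import secrets

P_MOD = 2**127 - 1   # toy prime modulus, stand-in for the curve group
G = 3                # stand-in for the curve base point P

# Step b: the server generates an ephemeral key pair d_SR / Q_SR.
d_SR = secrets.randbelow(P_MOD - 2) + 1
Q_SR = pow(G, d_SR, P_MOD)          # stands in for Q_SR = d_SR * P

# Step f: the client generates an ephemeral key pair d_DR / Q_DR.
d_DR = secrets.randbelow(P_MOD - 2) + 1
Q_DR = pow(G, d_DR, P_MOD)          # stands in for Q_DR = d_DR * P

# Steps i and j: each side combines its own private value with the other
# side's public value and arrives at the same shared secret S.
S_client = pow(Q_SR, d_DR, P_MOD)   # (G^d_SR)^d_DR = G^(d_SR*d_DR)
S_server = pow(Q_DR, d_SR, P_MOD)   # (G^d_DR)^d_SR = G^(d_DR*d_SR)
```

Because both exponent orders yield the same group element, client and server agree on S without it ever crossing the wire, which is the property the session keys are then derived from.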

The SRAM PUF provides the flexibility to instantiate a<br />

device-unique key in the device and form the basis of a device<br />

identity (through the device certificate). No IDs or keys have to<br />

be injected by the silicon manufacturer. The OEM can decide<br />

to run the enrollment step at any semi-trusted time and place in<br />

the production chain. This has the advantage that the OEM can<br />

take device security into its own hands, without having to rely on<br />

key injection by the silicon manufacturer and secure handover<br />

of installed keys. This reduces key provisioning costs in the<br />

production chain considerably.<br />

V. CONCLUSIONS<br />

SRAM-based Physical Unclonable Functions form a<br />

universal method to securely store cryptographic keys in the<br />

chips of IoT devices. SRAM PUF provides hardware-rooted<br />

security that is enabled via software. When the device is<br />

powered down, no secrets are stored in memory, making<br />

cryptographic keys impossible to extract. In addition, SRAM PUF provides a high degree of flexibility throughout the device supply chain. Every device can generate its own keys at any desired point in the production chain. The entropy of these<br />

keys is determined by randomness in the physics originating<br />

from minute and uncontrollable process variations in the<br />

silicon production process. This makes PUF-based<br />

implementations much more resilient than traditional key<br />

injection options. The flexibility of the SRAM PUF process<br />

results in cost reductions as external key management<br />

infrastructure is kept to a minimum. SRAM PUF technology<br />

works reliably on any device that has silicon SRAM onboard: it<br />

will become the option of choice to establish trust in silicon for<br />

billions of devices that make the future Internet of Things.<br />



REFERENCES<br />

[1] M. Bhargava, C. Cakir, and K. Mai, “Comparison of bi-stable and delay-based Physical Unclonable Functions from measurements in 65nm bulk<br />

CMOS,” in Custom Integrated Circuits Conference (CICC), 2012 IEEE,<br />

2012, pp. 1–4.<br />

[2] M. Cortez, G. Roelofs, S. Hamdioui, G. Di Natale, “Testing PUF-Based<br />

Secure Key Storage Circuits”, DATE conference 2014,<br />

https://www.dateconference.com/files/proceedings/2014/pdffiles/07.7_2.pdf<br />

.<br />

[3] Trusted Computing Group, Device Identity Composition Engine<br />

workgroup, https://trustedcomputinggroup.org/work-groups/dicearchitectures/<br />

.<br />

[4] Y. Dodis, L. Reyzin, and A. Smith, “Fuzzy extractors: How to generate<br />

strong keys from biometrics and other noisy data,” in Advances in<br />

Cryptology - EUROCRYPT 2004, ser. Lecture Notes in Computer<br />

Science, Springer Berlin Heidelberg, 2004, vol. 3027, pp. 523–540.<br />

[5] Gartner newsroom, “Gartner Says 6.4 Billion Connected Things Will Be<br />

in Use in 2016, Up 30 Percent From 2015”,<br />

https://www.gartner.com/newsroom/id/3165317 .<br />

[6] B. Gassend, D. Clarke, M. van Dijk, S. Devadas, “Silicon physical<br />

random functions” In: ACM Conference on Computer and<br />

Communications Security (ACM CCS). pp. 148–160. ACM, New York,<br />

NY, USA (2002).<br />

[7] B. Gassend, D. Clarke, M. van Dijk, S. Devadas, “Silicon physical<br />

random functions” In: ACM Conference on Computer and<br />

Communications Security (ACM CCS). pp. 148–160. ACM, New York,<br />

NY, USA (2002).<br />

[8] C. Helfmeier, C. Boit, D. Nedospasov, and J.-P. Seifert, “Cloning<br />

physically unclonable functions,” in Hardware-Oriented Security and<br />

Trust (HOST), 2013 IEEE International Symposium on, 2013, pp. 1–6.<br />

[9] Intrinsic ID whitepaper, “Flexible Key Provisioning with SRAM PUF”,<br />

https://www.intrinsic-id.com/resources/white-papers/white-paperflexible-key-provisioning-sram-puf/<br />

.<br />

[10] S. Katzenbeisser, U. Kocabas¸, V. Rozic, A.-R. Sadeghi, I.<br />

Verbauwhede, and C. Wachsmann, “PUFs: Myth, Fact or Busted? A<br />

Security Evaluation of Physically Unclonable Functions (PUFs) Cast in<br />

Silicon,” in Cryptographic Hardware and Embedded Systems (CHES)<br />

2012, ser. Lecture Notes in Computer Science, Springer Berlin<br />

Heidelberg, 2012, vol. 7428, pp. 283–301.<br />

[11] J.-P. Linnartz and P. Tuyls, “New shielding functions to enhance privacy<br />

and prevent misuse of biometric templates,” in Audio- and Video- Based<br />

Biometric Person Authentication, ser. Lecture Notes in Computer<br />

Science, Springer Berlin Heidelberg, 2003, vol. 2688, pp. 393–402.<br />

[12] R. Maes, V. van der Leest, “Countering the effects of silicon ageing on<br />

SRAM PUFs”, HOST 2014.<br />

[13] D. Merli, F. Stumpf, G. Sigl, “Protecting PUF Error Correction by<br />

Codeword Masking”, Cryptology ePrint Archive,<br />

https://eprint.iacr.org/2013/334.pdf .<br />

[14] D. Nedospasov, J.-P. Seifert, C. Helfmeier, and C. Boit, “Invasive PUF<br />

analysis,” in Fault Diagnosis and Tolerance in Cryptography (FDTC),<br />

2013 Workshop on, 2013, pp. 30–38.<br />

[15] U. Rührmair, J. Sölter, F. Sehnke, X. Xu, A. Mahmoud, V. Stoyanova,<br />

G. Dror, J. Schmidhuber, Wayne Burleson, S. Devadas “PUF Modeling<br />

Attacks on Simulated and Silicon Data”, IACR Eprint archive 2013,<br />

http://sharps.org/wp-content/uploads/RUHRMAIR-IACR.pdf .<br />

[16] U. Rührmair, J. Sölter, “PUF Modeling Attacks: An Introduction and<br />

Overview”, DATE 2014,<br />

https://pdfs.semanticscholar.org/a023/dd6069b664b0e53dfa5366d3c881<br />

a6876583.pdf .<br />

[17] G.-J. Schrijen and V. van der Leest, “Comparative analysis of SRAM<br />

memories used as PUF primitives,” in Design, Automation Test in<br />

Europe Conference Exhibition (DATE) 2012, March 2012, pp. 1319 –<br />

1324.<br />

[18] Synopsys whitepaper, “Securing the Internet of Things – An Architect’s<br />

Guide to Securing IoT Devices Using Hardware Rooted Processor<br />

Security”, https://hosteddocs.emediausa.com/arc_security_iot_wp.pdf .<br />

[19] PRPL Foundation, security working group:<br />

https://prpl.works/category/prpl-security/ , PRPL PUF-API:<br />

https://github.com/prplfoundation/prpl-puf-api/tree/December-2017 ,<br />

Security Framework application note: https://prpl.works/applicationnote-july-2016/<br />

.<br />

[20] Wikipedia, “Transport Layer Security”,<br />

https://en.wikipedia.org/wiki/Transport_Layer_Security#Clientauthenticated_TLS_handshake<br />

[21] Wikipedia, “MQTT”, https://en.wikipedia.org/wiki/MQTT .<br />



How to Incorporate Low-Resource Cryptography<br />

Into a Highly Constrained Real-World Product<br />

Derek Atkins<br />

SecureRF Corporation<br />

Shelton, CT, USA<br />

datkins@SecureRF.com<br />

Drake Smith<br />

SecureRF Corporation<br />

Shelton, CT, USA<br />

dsmith@SecureRF.com<br />

Abstract—The Internet of Things (IoT) has a problem: the<br />

small devices that power the IoT are insecure because these<br />

devices have few, if any, options for providing authentication and<br />

data integrity. These embedded devices lack the computing,<br />

memory, and/or energy resources needed to implement today’s<br />

standard security methods. This leaves most IoT systems<br />

vulnerable to attack.<br />

Before revealing an alternative that enables security on<br />

devices as small as the ubiquitous 8051 8-bit microcontroller, we<br />

will first show you how to identify security threats and how to<br />

determine security requirements. We will provide some<br />

techniques for evaluating your products and deployment<br />

scenarios for susceptibility to spoofing and impersonation,<br />

message tampering, and eavesdropping. We will introduce some<br />

effective countermeasures to protect against these threats<br />

together with their suggested security strengths. As an example<br />

of good security protocol design, we will consider a typical IoT<br />

use case where a base station must communicate with a remote<br />

sensor in a secure manner. We will discuss some potential<br />

exploits and attacks and then outline a security protocol that<br />

mitigates those threats.<br />

Next, we will introduce Group Theoretic Cryptography<br />

(GTC) as an alternative to resource-intensive RSA and ECC. We<br />

will explain why RSA, ECC, and Diffie-Hellman are a poor fit for<br />

highly-constrained devices such as battery-less sensors with<br />

microcontrollers (MCUs) having low clock rates and low bit-width architectures. We will present a GTC-based suite of<br />

quantum-resistant cryptographic methods that have been<br />

designed specifically for constrained environments.<br />

We will conclude with a discussion on how to incorporate<br />

GTC-based security into real-world products. You will learn<br />

about the availability of cryptographic libraries that you can<br />

incorporate into your own code that implement the low-resource<br />

methods discussed in this presentation. Using these libraries, you<br />

will see typical run-times plus ROM and RAM utilization for a<br />

range of microcontrollers and processor cores.<br />

Keywords—Internet of Things; IoT; Public Key Cryptography;<br />

Group Theoretic Cryptography; Ironwood Key Agreement<br />

Protocol; Walnut Digital Signature Algorithm; WalnutDSA<br />

I. INTRODUCTION<br />

As the Internet of Things (IoT) grows larger, the devices<br />

attaching to networks continually get smaller. Finding devices<br />

with low clock speeds, limited RAM and ROM, and<br />

microcontrollers with 16 or even 8 bits is not uncommon.<br />

While this does not reduce or eliminate the requirement for<br />

cryptographic authentication, it does reduce the usability or<br />

practicality of currently established methods. On some of the<br />

smaller devices where you can make it fit, an Elliptic Curve<br />

Cryptography (ECC) authentication still takes 10-60 seconds to<br />

complete.<br />

Unfortunately, security solutions are still required for<br />

authentication and data protection in networked devices.<br />

Without security, communications can be compromised,<br />

risking data or, worse, safety. Networked vehicles have already<br />

been hacked, enabling the attacker to control a vehicle, shut<br />

down the engine, enable the brakes, or even drive it remotely.<br />

All this is possible because there is no security on these<br />

devices.<br />

This paper will describe good security design and practice<br />

by leveraging the Intel DE10-Nano development kit, which<br />

utilizes a next-generation Group Theoretic cryptosystem (GTC)<br />

for quantum-resistant key agreement and digital signature<br />

evaluation. The board is co-developed using Intel FPGA<br />

technology and is delivered with GTC technology<br />

demonstrations for everyday use. Specifically, code and<br />

documentation is provided to enable the DE10-Nano to<br />

authenticate to small sensor nodes, and to run a speed test<br />

showing the performance of the technology.<br />

II. THE DE10-NANO<br />

The DE10-Nano is a development kit built around an Intel<br />

Cyclone V System-on-Chip FPGA, which combines a dual-core ARM Cortex-A9 with space for programmable logic,<br />

which enables design flexibility between hardware/software<br />

interfaces. Users can reconfigure the hardware and link with<br />

software to create high-performance, low-power systems.<br />

Leveraging the Cyclone V, the DE10-Nano enables developers<br />

to rapidly develop embedded applications and test different<br />



configurations of hardware and software in an easy-to-access<br />

platform.<br />

The DE10-Nano is meant to be the medium-area device,<br />

meaning it talks to larger devices but also talks to smaller<br />

devices, like an Arduino or even smaller sensors or nodes. So,<br />

while the DE10-Nano does contain dual Cortex-A9<br />

processors—virtual super-computers in the realm of IoT—it is<br />

expected to communicate with devices with much lower-caliber capabilities, perhaps even as low as an 8-bit 8051<br />

microcontroller. This implies that any security solution must<br />

be capable of running on those tiny devices.<br />

The DE10-Nano comes with one security solution [1] that<br />

leverages SecureRF’s Ironwood Key Agreement Protocol™ (Ironwood KAP™) and Walnut Digital Signature Algorithm™ (WalnutDSA™). These methods not only work effectively on<br />

the DE10-Nano, but can be used on those tiny devices as well.<br />

III. SECURITY RISK AND THREAT ANALYSIS<br />

Before making any security choices it is best to understand<br />

the risks, threats, and possible mitigation techniques available.<br />

Specifically, it is important to look at the potential<br />

vulnerabilities, the attack surfaces, the likelihood of attack, the<br />

cost of an attack, and the cost of protecting against the attack.<br />

A threat analysis is invaluable to determine what needs to<br />

be protected and how. The analysis provides a list of concerns,<br />

ranks them, and considers different ways a system can be<br />

attacked. Next it evaluates the risk of those attacks: how likely they are, how much damage they would cause, and what it would cost to correct.<br />

When analyzing risk, one approach is to look at the asset<br />

value versus the attack cost. For example, a bank protecting<br />

millions of dollars in assets is not going to protect it with a<br />

system that can be broken with only a few dollars of effort<br />

(under-protected risk). On the other side, protecting a $1 asset<br />

using a system that would cost a million dollars to break is,<br />

most likely, completely overkill.<br />

Threats can stem from direct physical access, where an<br />

attacker can touch, push, probe, twist, or otherwise manipulate<br />

the target. In addition to direct attacks, this can include side<br />

channels like differential power analysis (DPA), glitching<br />

attacks, timing attacks, or even listening to sounds emanating<br />

from the device.<br />

Another source of threats is the network. Open services,<br />

built-in accounts with hard-coded passwords, non-patched<br />

systems containing software bugs; the sources of network-based attacks are endless.<br />

Threat analysis is best done by a professional (or at least<br />

someone who considers themselves extremely paranoid).<br />

However, the thought process of analyzing, enumerating, and<br />

ordering the threats is an important step before proceeding to<br />

countermeasures.<br />

IV. EARLY COUNTERMEASURES<br />

Once the threats are enumerated and the risks are<br />

understood, the next step to protect a system is applying<br />

countermeasures. The goal is to mitigate the risks and defend<br />

against the threats. These countermeasures can take many<br />

forms.<br />

The first form of countermeasure is physical protection. For<br />

example, encasing a circuit board in a block of epoxy will<br />

prevent any access to individual items on the board. Only<br />

wires/cables that explicitly protrude from the epoxy are<br />

accessible, limiting what an attacker can do. This would<br />

prevent targeted attacks between chips on the board,<br />

monitoring the transmission bus, watching memory, changing<br />

out the CPU, etc.<br />

However, even encapsulation cannot protect against certain<br />

types of side channel attacks. Most likely, a box still has a<br />

power adapter (unless there’s a battery inside the epoxy, which<br />

would limit the lifetime of the device). An attacker could<br />

attempt DPA using that power input. DPA-specific protections<br />

require specific hardware and software mitigations, but those<br />

are specific to DPA.<br />

Other countermeasures include processes in place to control<br />

people’s actions and behaviors, software mitigations such as<br />

better security on the system, keeping systems patched with<br />

fixes, and cryptographic protections.<br />

Another countermeasure is a self-contained secure boot<br />

solution, with an integrated secure update infrastructure. A<br />

secure boot solution enables a device to cryptographically<br />

verify the authenticity and integrity of firmware before it is<br />

loaded and run on the system. This prevents an attacker from<br />

making changes to the underlying code. At startup the system<br />

will verify the firmware, usually by checking a digital<br />

signature, and only if the signature is valid will it continue. All<br />

that is required is that the public key of the signer be available<br />

and non-writable (meaning an attacker cannot replace it).<br />

A secure update solution enables cryptographic validation<br />

of firmware updates before they are stored in place. It protects<br />

the device from unwanted or invalid updates. Together with the<br />

secure boot solution the device is assured of correct code.<br />
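The boot-time check described above can be sketched as follows. A real implementation verifies an asymmetric signature (e.g. ECDSA or WalnutDSA) against a non-writable public key; HMAC-SHA256 stands in here only so the sketch stays within the Python standard library, and all names and key material are illustrative.

```python
import hashlib
import hmac

# In a real device this key material is fixed at manufacture and
# non-writable; an HMAC key stands in for the signer's public key.
VERIFY_KEY = b"burned-into-rom"

def sign_firmware(image: bytes) -> bytes:
    # Performed by the vendor at build time (stand-in for a real
    # digital signature over the firmware image).
    return hmac.new(VERIFY_KEY, image, hashlib.sha256).digest()

def secure_boot(image: bytes, signature: bytes) -> bool:
    # Verify authenticity and integrity before the image may run.
    expected = hmac.new(VERIFY_KEY, image, hashlib.sha256).digest()
    return hmac.compare_digest(expected, signature)

firmware = b"application image v1.2"
sig = sign_firmware(firmware)

boots_ok = secure_boot(firmware, sig)               # genuine image runs
boots_tampered = secure_boot(firmware + b"!", sig)  # modified image refused
```

The same check, applied to an incoming update image before it is written to flash, gives the secure-update half of the mechanism.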

V. WHY EXISTING SOLUTIONS FAIL<br />

Security is often considered only as an afterthought. It is<br />

not usually a feature. It is rarely customer-visible (except when<br />

it is not working), it adds cost, and it reduces performance<br />

(compared to a completely insecure system). And until there is<br />

a major break, customers rarely ask for it. What this means is<br />

that in the rapid pace of development, an actual customer-facing feature is more likely to get implemented than a<br />

strategic security solution.<br />

Moreover, because security is only added later, the<br />

manufacturer attempts to bolt it onto the side of the working<br />

product. Of course, this trick never works.<br />

Security requires a holistic approach. A good security<br />

architecture is necessarily going to touch every part of the<br />

system, from the hardware up to the user interface, and<br />

everywhere in between. If security is not considered at the<br />

onset, then adding security can become a daunting task.<br />

Adding it piecemeal often does not work, or if it can work, it<br />

works insufficiently.<br />



Next, cryptography is a requirement for good security. A<br />

security solution without cryptography is just an attack waiting<br />

to happen. Yet not just any cryptography will do; one must apply the correct methods.<br />

Symmetric methods like AES are perfect for data<br />

encryption. AES is efficient and generally available on most<br />

systems. However, keys must be managed for AES to work<br />

properly. To manage those keys properly in a large scale,<br />

distributed system, the best approach is to use an Asymmetric<br />

system.<br />

However, on small, IoT devices the constraints of the<br />

system might restrict the ability to add legacy asymmetric<br />

cryptographic systems. Specifically, many of the systems in<br />

use today either will not fit or, if they can be made to fit, will<br />

not perform adequately in the low-resource environments. For<br />

example, fitting ECC in an 8-bit processor like an 8051 is<br />

nearly impossible, or if it can be fit (possibly with extremely<br />

low security), it will still take minutes to calculate its answer.<br />

Using RSA could take even longer.<br />

Imagine having to wait several minutes for a device to power up because its secure boot solution takes that long to validate the firmware.<br />

VI. GROUP THEORETIC CRYPTOGRAPHY<br />

In 2005, [2] introduced the world to E-Multiplication, a<br />

lightweight, quantum-resistant, one-way function rooted in<br />

infinite group theory, matrices, permutations, and arithmetic<br />

over small finite fields. Implementations of E-Multiplication<br />

are small and extremely efficient, even on 16- or 8-bit<br />

microcontrollers.<br />

E-Multiplication is the basis for several Group Theoretic cryptographic methods, of which the most interesting are Ironwood KAP [3] and WalnutDSA [4].<br />

Ironwood KAP is a Diffie-Hellman-like key agreement<br />

scheme that enables two devices, which may never have met<br />

before, to exchange public keys and, using those and their own<br />

private keys, generate a shared secret. That shared secret can<br />

then be used to authenticate the devices or by a method like<br />

AES to encrypt data between the devices.<br />

Due to E-Multiplication being so efficient, Ironwood can<br />

compute a shared secret on even an 8-bit 8051 in about 200 ms.<br />

This would enable even the smallest of devices to compute a<br />

shared secret and authenticate itself.<br />

Ironwood is also interesting because the two sides need to<br />

perform different amounts of work. In other words, the method<br />

itself has asymmetric implementation requirements. The lighter<br />

side of the method is often 50 times faster, meaning an 8051<br />

could execute the other side of the method in about 4 ms.<br />

WalnutDSA, on the other hand, is a quantum-resistant<br />

digital signature scheme that enables one party to create a<br />

signature on a message that can be verified by a second party,<br />

that ensures that the message has not been modified, and<br />

proves the message came from the first party. Digital<br />

signatures are used for certificates, to prove identity, and in<br />

some challenge-response authentication systems.<br />
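Challenge-response authentication, one of the uses just named, follows a simple pattern: the verifier sends a fresh random challenge, the device signs it, and the verifier checks the result. The sketch below uses an HMAC shared secret as a stand-in for the WalnutDSA sign/verify pair; with a real signature scheme the verifier would hold only the device's public key. All names are illustrative.

```python
import hashlib
import hmac
import secrets

DEVICE_KEY = b"device-secret"  # stand-in for the device signing key

def device_respond(challenge: bytes) -> bytes:
    # The device proves key possession by "signing" the fresh challenge.
    return hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()

def verifier_check(challenge: bytes, response: bytes) -> bool:
    # The verifier checks the response against the challenge it issued.
    expected = hmac.new(DEVICE_KEY, challenge, hashlib.sha256).digest()
    return hmac.compare_digest(expected, response)

challenge = secrets.token_bytes(16)  # fresh nonce defeats replay attacks
ok = verifier_check(challenge, device_respond(challenge))

# A captured response to an old challenge fails against a new one.
replayed = verifier_check(secrets.token_bytes(16),
                          device_respond(challenge))
```

Because every round uses a fresh nonce, recording one exchange gains an attacker nothing, which is exactly the property spoofing and impersonation countermeasures need.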

WalnutDSA signature verification is extremely fast, even<br />

on IoT edge devices. For example, on an 8051 a WalnutDSA<br />

signature can be verified in 35 ms, and on an ARM Cortex-M3,<br />

WalnutDSA verifies the signature in 5.7 ms in software! This<br />

works out to about 40 times faster than an ECDSA signature<br />

validation, using half the code size, half the RAM, and also<br />

providing the future-proof characteristics of quantum<br />

resistance.<br />

Because WalnutDSA is so fast and lightweight it means<br />

that even the lowest-end IoT device can benefit from its use.<br />

For example, there is no way that you could get a near-real-time PKI working on an 8-bit 8051 using legacy methods, but<br />

leveraging WalnutDSA enables that. An 8051 could validate a<br />

certificate quickly, in tens of milliseconds, enabling a whole<br />

class of new applications. Combined with Ironwood, these low-end<br />
devices can perform full end-to-end authentication,<br />

validation, and connection security.<br />

WalnutDSA is currently under review as part of the<br />

National Institute of Standards and Technology (NIST) Post-<br />

Quantum Standardization Process, and has the fastest<br />

verification times of all accepted methods as reported by NIST.<br />

VII. INCORPORATING GTC IN YOUR PRODUCTS<br />

Ironwood KAP and WalnutDSA are available in SDKs and<br />

IP Cores for integration into various levels of devices, from<br />

Linux and Windows systems down to the ARM Cortex-M0, Texas<br />
Instruments MSP430, the 8-bit 8051, and<br />
Atmel AVR platforms, as well as FPGAs and custom ASIC<br />

designs.<br />

Integrating Ironwood KAP is as simple as making one<br />

function call to generate the shared secret from the private data<br />

that gets provisioned onto the device and the public key sent<br />

from the other side. All the hard work is abstracted away,<br />

making it simple to use. Additional APIs are available to<br />

leverage that shared secret into an authentication protocol or<br />

data encryption module.<br />
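As a concrete picture of that one-call integration surface, consider the sketch below. It is an illustration under stated assumptions: the function names are hypothetical, and the toy math is classic Diffie-Hellman over a tiny prime, standing in for Ironwood's proprietary E-Multiplication; it is neither Ironwood nor quantum-resistant.<br />

```c
#include <stdint.h>

/* Hypothetical one-call KAP surface, modeled on the integration flow
 * described in the text: provisioned private data plus the peer's
 * public key in, shared secret out. The math is a toy Diffie-Hellman
 * stand-in, NOT Ironwood's E-Multiplication. */

#define TOY_P 2147483647u   /* 2^31 - 1, a Mersenne prime */
#define TOY_G 7u

static uint32_t modexp(uint32_t base, uint32_t exp)
{
    uint64_t r = 1, b = base % TOY_P;
    while (exp) {
        if (exp & 1u) r = (r * b) % TOY_P;
        b = (b * b) % TOY_P;
        exp >>= 1;
    }
    return (uint32_t)r;
}

/* Public key derived from the device's provisioned private data. */
uint32_t kap_public_key(uint32_t private_key)
{
    return modexp(TOY_G, private_key);
}

/* The single call: returns a shared secret ready to feed an AES key
 * schedule or an authentication exchange. */
uint32_t kap_shared_secret(uint32_t private_key, uint32_t peer_public)
{
    return modexp(peer_public, private_key);
}
```

Both sides call kap_shared_secret with their own private data and the other side's public key and arrive at the same value; a real deployment would then derive the AES key from that secret with a KDF.<br />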

WalnutDSA is just as simple to integrate. A single API<br />

takes a hashed message, signature, and public key, and returns<br />

a response of valid or not-valid. API calls for hashing are<br />

available for convenience, or the developer could use their<br />

own. Some hardware has embedded hash functions that can be<br />

leveraged for improved performance.<br />
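The verify call described above can be pictured with the following sketch. The API shape (hashed message, signature, and public key in; valid or not-valid out) follows the text, but the names and the keyed-checksum "scheme" standing in for WalnutDSA's group-theoretic math are illustrative assumptions only.<br />

```c
#include <stddef.h>
#include <stdint.h>

typedef enum { SIG_INVALID = 0, SIG_VALID = 1 } sig_result;

/* Stand-in "signature": a keyed checksum over the message hash. A real
 * WalnutDSA signature is created with a private key and checked with
 * the public key; this toy uses one key for both, purely to show the
 * single-call verification surface. */
static uint32_t toy_sign(const uint8_t *hash, size_t len, uint32_t key)
{
    uint32_t s = key;
    size_t i;
    for (i = 0; i < len; i++)
        s = (s * 31u) ^ (uint32_t)hash[i];
    return s;
}

/* The single verify call: hashed message, signature, and public key in,
 * valid / not-valid out. */
sig_result toy_verify(const uint8_t *hash, size_t len,
                      uint32_t signature, uint32_t public_key)
{
    return (toy_sign(hash, len, public_key) == signature)
               ? SIG_VALID : SIG_INVALID;
}
```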

Together, WalnutDSA and Ironwood KAP can provide a<br />

full suite of authentication, integrity, and secure<br />

communication technologies, which can easily integrate into<br />

existing protocols or, better yet, become the basis for a secure<br />

platform, including a secure boot and secure update solution.<br />

Moreover, adding these features requires as little as 3000-7000<br />

bytes of code.<br />

VIII. CONCLUSIONS<br />

Group Theoretic Cryptography has provided quantum-resistant<br />
public-key methods that are small, efficient, and<br />

practical even on the smallest of today’s IoT devices.<br />

Leveraging the Ironwood KAP and WalnutDSA, developers<br />

can integrate modern PKI concepts on tiny devices while<br />



adding very little code and reducing the performance impact<br />

compared to legacy cryptographic methods.<br />

After performing a threat and risk analysis to discover the<br />
biggest threats and the best places to add protections, developers<br />
can apply GTC technologies to mitigate many problems with better<br />
efficiency and performance than legacy cryptographic<br />
methods.<br />

With WalnutDSA under consideration by NIST, the future<br />

is quantum-resistant.<br />

REFERENCES<br />

[1] Intel Corporation and SecureRF Corporation, “How to Authenticate<br />

Remote Devices with the DE10-Nano Kit,” August 2017,<br />

https://software.intel.com/en-us/articles/how-to-authenticate-remote-devices-with-the-de10-nano-kit.<br />

[2] I. Anshel, M. Anshel, D. Goldfeld, S. Lemieux, Key Agreement, the<br />

Algebraic Eraser™, and Lightweight Cryptography, Algebraic methods<br />

in cryptography, Contemp. Math., vol. 418, Amer. Math. Soc.,<br />

Providence, RI, 2006, pp. 1–34.<br />

[3] I. Anshel, D. Atkins, D. Goldfeld, P. E. Gunnells, Ironwood Meta Key<br />

Agreement and Authentication Protocol, to appear.<br />

[4] I. Anshel, D. Atkins, D. Goldfeld, P. E. Gunnells, WalnutDSA™: A<br />

Quantum-Resistant Digital Signature Algorithm, to appear.<br />



Practical Use of MISRA C and C++<br />

By Greg Davis<br />

Director of Engineering, Compiler Development<br />

Session 27: 1 Mar 2018, 09:30-11:00<br />

Copyright 2013-2018 by Greg Davis<br />

Introduction<br />

No software engineering process can guarantee secure code, but following the right<br />

coding guidelines can dramatically increase the security and reliability of your code.<br />

Many embedded systems live in a world where a security breach can be catastrophic.<br />

Embedded systems control much of the world’s critical infrastructure, such as dams,<br />

traffic signals, and air traffic control. These systems are increasingly communicating<br />

together using COTS networking and in many cases using the internet itself. Keeping<br />

yourself out of the courtroom, if not common decency, demands that all such systems<br />

be developed to be secure.<br />

There are many factors that determine the security of an embedded system. A well-conceived<br />
design is crucial to the success of a project. Also, a team needs to pay<br />

attention to its development process. There are many different models of how software<br />

development ought to be done, and it is prudent to choose one that makes sense. Finally,<br />

the choice of operating system can mean the difference between a project that works well<br />

in the lab and one that works reliably for years in the real world.<br />

Even the most well thought-out design is vulnerable to flaws when the implementation<br />

falls short of the design. This paper focuses on how one can use a set of coding<br />

guidelines, called MISRA C and MISRA C++, to help root out bugs introduced during<br />

the coding stage.<br />

MISRA C and C++<br />

MISRA stands for Motor Industry Software Reliability Association. It originally<br />

published Guidelines For the Use of the C Language In Critical Systems, known<br />

informally as MISRA C, in 1998. A second edition of MISRA C was introduced in 2004,<br />

and MISRA C++ followed in 2008. The most recent edition of MISRA, a third<br />
edition of MISRA C, also known as MISRA C3, was released in 2012/2013. More<br />

information on MISRA and the standards themselves can be obtained from the MISRA<br />

web site at http://www.misra.org.uk.<br />

The purpose of the MISRA C and MISRA C++ guidelines is not to promote the use of C or<br />

C++ in critical systems. Rather, the guidelines accept that these languages are being used<br />

for an increasing number of projects. The guidelines discuss general problems in<br />



software engineering and note that C and C++ do not have as much error checking as<br />

other languages do. Thus the guidelines hope to make C and C++ safer to use, although<br />

they do not endorse C or C++ over other languages.<br />

MISRA C is a subset of the C language. In particular, it is based on the ISO/IEC<br />

9899:1990 C standard, which is identical to the ANSI X3.159-1989 standard, often called<br />

C ’89. Thus every MISRA C program is a valid C program. The MISRA C subset is<br />

defined by 143 rules and 16 directives that constrain the C language and the software<br />

development process. Correspondingly, MISRA C++ is a subset of the ISO/IEC<br />

14882:2003 C++ standard. MISRA C++ is based on 228 rules, many of which are<br />

refinements of the MISRA C rules to deal with the additional realities of C++.<br />

For notational convenience, we will use the terms “MISRA”, “MISRA C” or “MISRA<br />

C++” loosely in the remainder of the document to refer to either the defining documents<br />

or the language subsets.<br />

What is MISRA?<br />

MISRA is written for safety-critical systems, and it is intended to be used within a<br />

rigorous software development process. The standard briefly discusses issues of software<br />

engineering, such as proper training, coding styles, tool selection, testing methodology,<br />

and verification procedures.<br />

MISRA also talks about the ways to ensure compliance with all of the rules. Some of the<br />

rules can be verified by a static checking tool or a compiler. Many of the rules are<br />

straightforward, but others may not be or may require whole-program analysis to verify.<br />

Management needs to determine whether any of the available tools can automatically<br />

verify that a given rule is being followed. If not, this rule must be checked by some kind<br />

of manual code review process. Where it is necessary to deviate from the rules, project<br />

management must give some form of consent by following a documented deviation<br />

procedure. Other non-mandatory “advisory” rules do not need to be followed so strictly,<br />

but cannot just be ignored altogether.<br />

The MISRA rules are not meant to define a precise language. In fact, most of the rules<br />

are stated informally. Furthermore, it is not always clear if a static checking tool should<br />

warn too much or too little when enforcing some of the rules. The project management<br />

must decide how far to go in cases like this. Perhaps a less strict form of checking that<br />

warns too little will be used throughout most of the development, until later when a<br />

stricter checking tool will be applied. At that point, somebody could manually determine<br />

which instances of the diagnostic are potential problems.<br />

Most of the rules have some amount of supporting text that justifies the rules or perhaps<br />

gives an example of how the rule could be violated. Many of the rules reference a<br />

source, such as parts of the C or C++ standards that state that such behavior is undefined<br />

or unspecified.<br />



Before exploring how one could use MISRA, let’s familiarize ourselves with the<br />

concepts and some examples of the rules of MISRA.<br />

Taxonomy of the Rules<br />

The MISRA rules are classified according to the C or C++ constructs that they restrict.<br />

For example, some of the categories are Environment, Control Flow, Expressions,<br />

Declarations, etc. However, I find that most of the rules also fall into a couple of groups<br />

according to the errors that they prevent.<br />

The first group of rules consists of those that intend to make the language more portable.<br />

For example, the language does not specify the exact size of the built-in data types or how<br />

conversions between pointer and integer are handled. So, an example of a rule is one that<br />

says:<br />

C Directive 4.6/C++ Rule 3-9-2 (advisory):<br />

Typedefs that indicate size and signedness should be used in place of the<br />

basic numerical types.<br />

This rule effectively tries to avoid portability problems caused by the implementation-defined<br />
sizes of the basic types. We will return to this rule in the next section.<br />

Another source of portability problems is undefined behavior. A program with an<br />

undefined behavior might behave logically, or it could abort unexpectedly. For example,<br />

using one compiler, a divide by 0 might always return 0. However, another compiler<br />

may generate code that will cause hardware to throw an exception in this case. Many of<br />

the MISRA C rules are there to forbid behaviors that produce undefined results because a<br />

program that depends on undefined behaviors behaving predictably may not run at all if<br />

recompiled with another compiler.<br />
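One way to see the point: rather than relying on whichever behavior a particular compiler gives a division by zero, portable code makes that case explicit. A small sketch (the helper name is ours, not MISRA's):<br />

```c
#include <stdint.h>

/* Division with the zero-divisor case handled explicitly, so the result
 * does not depend on undefined behavior that varies between compilers
 * and targets. */
uint32_t safe_div(uint32_t num, uint32_t den, uint32_t fallback)
{
    return (den != 0u) ? (num / den) : fallback;
}
```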

Unlike this first group of rules that guard against portability problems, the second group<br />

of rules intends to avoid errors due to programmer confusion. While such rules don’t<br />

make the code any more portable, they can make the code a lot easier to understand and<br />

much less error prone. Here’s an example:<br />

C Rule 7.1/C++ Rule 2-13-2 (required):<br />

Octal constants (other than zero) and octal escape sequences (other than<br />

“\0”) shall not be used.<br />

By definition, every compiler should do octal constants the same way, but as I will<br />

explain later, octal constants almost always cause confusion and are rarely useful.<br />



A few other rules are geared toward making code safe for the embedded world. These<br />

rules are more controversial, but adherence to them can avoid problems that many<br />

programmers would rather sweep under the carpet.<br />

Examples of the Rules<br />

We will start by reviewing the rules mentioned above.<br />

Octal constants (other than zero) and octal escape sequences (other than “\0”)<br />

shall not be used. (C Rule 7.1/C++ Rule 2-13-2/Required)<br />

To see why this rule is helpful, consider:<br />

line_a |= 256;<br />

line_b |= 128;<br />

line_c |= 064;<br />

The first statement sets bit 8 of the variable line_a. The second statement sets bit 7<br />

of line_b. You might think that the third statement sets bit 6 of line_c. It<br />

doesn’t. It sets bits 2, 4, and 5. The reason is that in C any numeric constant that<br />

begins with 0 is interpreted as an octal constant. Octal 64 is the same as decimal<br />

52, or 0x34.<br />

Unlike hexadecimal constants that begin with 0x, octal constants look like<br />

decimal numbers. Also, since octal only has 8 digits, it never has extra digits that<br />

would give it away as non-decimal, the way that hexadecimal has a, b, c, d, e, and<br />

f.<br />

Once upon a time, octal constants were useful for machines with odd-word sizes.<br />

These days, they create more problems than they’re worth. MISRA C prevents<br />

programmer error by forcing people to write constants in either decimal or<br />

hexadecimal.<br />
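The arithmetic in the example above can be checked directly (a small sketch; the wrapper function is ours):<br />

```c
/* 064 reads like decimal sixty-four but is an octal constant: it equals
 * decimal 52 (0x34), i.e. bits 2, 4, and 5 -- and bit 6 is NOT set. */
int octal_example(void)
{
    int line_c = 0;
    line_c |= 064;
    return line_c;
}
```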

<br />

Typedefs that indicate size and signedness should be used in place of the basic<br />

types. (C Directive 4.6/C++ Rule 3-9-2/Advisory)<br />

This is a portability requirement. Code that works correctly with one compiler or<br />

target might do something completely different on another. For example:<br />

int j;<br />

for (j = 0; j < 64; j++) {<br />

if (arr[j] > j*1024) {<br />

arr[j] = 0;<br />

}<br />

}<br />



On a target where an int is a 16-bit quantity, j*1024 will overflow and become a<br />

negative number when j >= 32. MISRA C suggests defining a type in a header<br />
file that is always 32 bits. For example, one could define a header file called<br />
misra.h that does this. It could define a 32-bit type as follows:<br />
<br />
#include <limits.h><br />

#if (INT_MAX == 0x7fffffff)<br />

typedef int SI_32;<br />

typedef unsigned int UI_32;<br />

#elif (LONG_MAX == 0x7fffffff)<br />

typedef long SI_32;<br />

typedef unsigned long UI_32;<br />

#else<br />

#error No 32-bit type<br />

#endif<br />

Then the original code could be written as:<br />

SI_32 j;<br />

for (j = 0; j < 64; j++) {<br />

if (arr[j] > j*1024) {<br />

arr[j] = 0;<br />

}<br />

}<br />

Strict adherence to this rule will not eliminate all portability problems based on<br />

the sizes of various types 1 , but it will eliminate most of them. Other MISRA rules<br />

(notably 10.1 and 10.3) are meant to fill in these gaps.<br />

The potential drawback to such a rule is that programmers understand the concept<br />

of an “int”, but badly-named types may disguise what the type represents.<br />

Consider a “generic_pointer” type. Is this a void * or some integral type that is<br />

large enough to hold the value of a pointer without losing data? Problems like<br />

this can be avoided by sticking to a common naming convention. Although there<br />

1<br />

The “integral promotion” rule states that before chars and shorts are operated on, they are cast up to an<br />

integer if an integer can represent all the values of the original type. Otherwise, they are cast up to an<br />

unsigned integer. The following code will behave differently on a target with a 16-bit integer (where it will<br />

return 0) than it will on a target with a 32-bit integer (where it will return 65536).<br />

UI_32 a()<br />

{<br />

UI_16 x = 65535;<br />

UI_16 y = 1;<br />

return x+y;<br />

}<br />



will be a slight learning curve for these names, it will pay off over time.<br />

Another problem is that using a type like UI_16 may be less efficient than using<br />

an “int” on a 32-bit machine. While it would be unsafe to use an int in place of a<br />

UI_16 if the code depends on the value of the variable being truncated after each<br />

assignment, in many cases the code does not depend on this. In some cases, an<br />

optimizing compiler can remove the extra truncations; in the rest, the extra cycles<br />

can be considered the price of safety.<br />

This next rule is specific to MISRA C.<br />

<br />

Function types shall be in prototype form with named parameters, and the<br />
prototype shall be visible at both the function<br />
definition and call. (C Rule 8.2/Required)<br />

Consider the following code:<br />

File1.c:<br />
<br />
static F_64 maxtemp;<br />
<br />
F_64 GetMaxTemp(void)<br />
{<br />
    return maxtemp;<br />
}<br />
<br />
void SetMaxTemp(F_64 x)<br />
{<br />
    maxtemp = x;<br />
}<br />
<br />
File2.c:<br />
<br />
void IncrementMaxTemp(void)<br />
{<br />
    SetMaxTemp(GetMaxTemp() + 1);<br />
}<br />

This code may look OK, but it will not work as expected with most compilers. C<br />

has some rather dangerous rules that assume the type of a function when the<br />
function has not been declared. In File2.c, GetMaxTemp is called but never<br />
declared. A conforming ANSI/ISO C compiler will assume that GetMaxTemp()<br />

returns an int. In reality, GetMaxTemp will return a double. Depending on the<br />

architecture and compiler different things will happen, but this code will rarely<br />

work the right way.<br />

MISRA C avoids this problem by forcing the user to declare functions before they<br />

are used. This rule is absent from MISRA C++ since the C++ language has long<br />

required this.<br />

Of course, the requirement that a global<br />

function be declared before it is used helps ensure that the declaration of a<br />

function matches the definition.<br />
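A sketch of the fix the rule demands for the GetMaxTemp example: put the prototypes in one shared header so the compiler checks every call against the real signature. The header is shown inline here for brevity, and F_64 is assumed to be a typedef for double, as the paper's examples imply.<br />

```c
/* temps.h -- shared prototypes, included by both File1.c and File2.c */
typedef double F_64;

F_64 GetMaxTemp(void);
void SetMaxTemp(F_64 x);
void IncrementMaxTemp(void);

/* File1.c */
static F_64 maxtemp;

F_64 GetMaxTemp(void) { return maxtemp; }
void SetMaxTemp(F_64 x) { maxtemp = x; }

/* File2.c */
void IncrementMaxTemp(void)
{
    /* With the prototype visible, the compiler knows GetMaxTemp
     * returns a double instead of assuming it returns an int. */
    SetMaxTemp(GetMaxTemp() + 1);
}
```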

In fact, another rule states:<br />

742


An external object or function shall be declared once in one and only one file. (C<br />

Rule 8.5/C++ Rule 3-2-2/Required)<br />

This rule works along with the previous rule to ensure that objects and functions<br />
will be compiled consistently.<br />

<br />

The value of an object with automatic storage duration shall not be read before it<br />

has been set. (C Rule 9.1/Required)<br />

In C and C++, automatic variables have an undefined value before they are<br />

written to. Unlike in Java, they are not implicitly given a value like 0. This<br />

sounds like good programming practice, so few people would disagree with this<br />

rule in most cases. But, how about the following case:<br />

extern void error(void);<br />

UI_32 foo(UI_8 arr[4])<br />

{<br />

UI_32 acc, j;<br />

UI_32 err = 0;<br />

for (j = 0; j < 4; j++)<br />

acc = (acc << 8) | arr[j];<br />
return acc;<br />
}<br />


they are used.” The description of the rule goes on to discuss dubious embedded<br />

environments that do not initialize static variables to zero before further requiring:<br />

“Each class constructor shall initialize all non-static members of its class.”<br />

<br />

The right hand operand of a logical && or || operator shall not contain side<br />

effects. (C Rule 13.5/C++ Rule 5-14-1/Required)<br />

A side-effect is defined as an expression that accesses a volatile object, modifies<br />

any object, writes to a file, or calls off to a function that does any of these things,<br />

possibly through its own function calls.<br />

The nomenclature “side-effect” may sound ominous and undesirable, but after<br />

some reflection, it becomes clear that a program cannot do much of anything<br />

useful without side-effects.<br />

An example of where this rule is helpful is as follows:<br />

file_handle *ptr;<br />

success = packet_waiting(ptr) &&<br />

process_packet(ptr);<br />

This may work fine in a lot of cases. But, even if it is safe, it can easily become a<br />

hazard later. For example, a programmer might think that process_packet() is<br />

always called. Therefore, he reasons, it should be safe to close a file or free some<br />

memory in process_packet().<br />

A safer way to write this would be:<br />

file_handle *ptr;<br />
success_1 = packet_waiting(ptr);<br />
success_2 = process_packet(ptr);<br />
success = success_1 && success_2;<br />
<br />
or:<br />
<br />
file_handle *ptr;<br />
success = 0;<br />
if (packet_waiting(ptr)) {<br />
    if (process_packet(ptr)) {<br />
        success = 1;<br />
    }<br />
}<br />

depending on the true intent of the code.<br />

This rule is not a portability or safety issue, per se, because the behavior of the ||<br />



and && operators is well defined. But the rule is intended to eliminate a<br />

common source of programming errors.<br />

The final two rules that I will survey are perhaps the most controversial.<br />

<br />

<br />

The memory allocation and deallocation functions of stdlib.h shall not be used.<br />

(C Rule 21.3/C++ Rule 18-4-1/Required)<br />

Functions shall not call themselves, either directly or indirectly. (C Rule<br />

17.2/C++ Rule 7-5-4/Required under MISRA C, Advisory under MISRA C++)<br />

One problem with dynamic memory is that it needs to be used carefully in order<br />

to avoid memory leaks that could cause a system to run out of memory. Also,<br />

since implementations of malloc() may vary, heap fragmentation may not be the<br />

same between different toolchains.<br />

Likewise, recursion needs to be used carefully or otherwise a system could easily<br />

exceed the amount of available stack space.<br />
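A common embedded alternative that satisfies the allocation rule (a sketch, not taken from the paper): a fixed-size static pool whose worst-case footprint is known at link time and which cannot fragment.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Fixed-block allocator backed by static storage: no heap, no
 * fragmentation, and the total memory cost is visible in the map file. */

#define POOL_BLOCKS  8
#define BLOCK_BYTES 64

static uint8_t pool[POOL_BLOCKS][BLOCK_BYTES];
static uint8_t in_use[POOL_BLOCKS];

void *pool_alloc(void)
{
    size_t i;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (!in_use[i]) {
            in_use[i] = 1;
            return pool[i];
        }
    }
    return NULL;   /* caller must handle exhaustion explicitly */
}

void pool_free(void *p)
{
    size_t i;
    for (i = 0; i < POOL_BLOCKS; i++) {
        if (p == pool[i]) {
            in_use[i] = 0;
            return;
        }
    }
}
```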

Applying MISRA<br />

MISRA C and MISRA C++, in their entirety, are obviously not for everyone. MISRA<br />

was designed for the automotive market where reliability is of the utmost importance, but<br />

manufacturers in other markets, such as game machines, may be able to tolerate less<br />

reliability in order to cram more features into the product. But, in terms of security, a<br />

simple and well-conceived design usually wins. It’s hard to imagine an extremely secure<br />

design that doesn’t lend itself to quite a number of the MISRA rules.<br />

As discussed earlier, some of the rules in the standard are advisory. One need not always<br />

follow them, although they are not supposed to just be ignored. Even the mandatory<br />

rules do not need to be observed everywhere. But, a manufacturer wishing to claim that<br />

his product is MISRA compliant must have a list of where it was necessary to deviate<br />

from the rules, along with other documentation mentioned in the standard.<br />

A looser approach might suffice in many cases where total compliance is not necessary.<br />

For example, let’s consider dynamic memory allocation. Some projects might only use<br />

dynamic memory in rare circumstances. It might be wise for an embedded development<br />

team to look through their uses of dynamic memory to verify that each use<br />
is truly safe.<br />

Consider the following example:<br />

#include <stdlib.h><br />
<br />
typedef unsigned int UI_32;<br />
<br />
extern UI_32 receive_sample(void);<br />
extern UI_32 checksum_data(UI_32 length, UI_32 *data);<br />
void send_reply(UI_32 reply);<br />
extern void panic(void);<br />
<br />
/* This thread loops endlessly, receiving a<br />
 * packet of variable length and replying<br />
 * with the checksum of the packet.<br />
 */<br />
void checksum_thread(void)<br />
{<br />
    while (1) {<br />
        /* Get the length of the next packet */<br />
        UI_32 length = receive_sample();<br />
        if (length != 0) {<br />
            UI_32 count, reply;<br />
            /* Allocate memory for the next packet */<br />
#ifdef __cplusplus<br />
            UI_32 *data = new UI_32[length];<br />
#else<br />
            UI_32 *data = (UI_32 *)malloc(sizeof(UI_32) * length);<br />
#endif<br />
            for (count = 0; count < length; count++) {<br />
                data[count] = receive_sample();<br />
            }<br />
            reply = checksum_data(length, data);<br />
            send_reply(reply);<br />
        }<br />
    }<br />
}<br />


There are a couple of programming errors in the example:<br />

1. The C code does not check that malloc returns a non-NULL pointer. In C++, a<br />

call to new that cannot be fulfilled will result in a throw, but the surrounding code<br />

would need to be analyzed to see whether it could correctly handle the exception.<br />

A secure embedded system will probably need to restart the thread in a way that is<br />

consistent with its design.<br />

2. The memory allocated is never freed.<br />
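One way the allocation path could address both errors is sketched below. The receive/checksum/reply/panic stubs are stand-ins added so the fragment is self-contained; they are not part of the original example.<br />

```c
#include <stdlib.h>

typedef unsigned int UI_32;

/* Stand-in stubs so the sketch compiles on its own; the real functions
 * come from the surrounding example. */
static UI_32 next_sample = 1;
static UI_32 last_reply;
static int panicked;

static UI_32 receive_sample(void) { return next_sample++; }
static void send_reply(UI_32 reply) { last_reply = reply; }
static void panic(void) { panicked = 1; }

static UI_32 checksum_data(UI_32 length, const UI_32 *data)
{
    UI_32 i, sum = 0;
    for (i = 0; i < length; i++)
        sum += data[i];
    return sum;
}

void process_one_packet(UI_32 length)
{
    UI_32 count, reply;
    UI_32 *data = (UI_32 *)malloc(sizeof(UI_32) * length);

    if (data == NULL) {       /* error 1: check the allocation */
        panic();
        return;
    }
    for (count = 0; count < length; count++)
        data[count] = receive_sample();
    reply = checksum_data(length, data);
    free(data);               /* error 2: release the packet buffer */
    send_reply(reply);
}
```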

This kind of analysis might lead to other insights. For example, there is often an upper<br />

bound on the size of most inputs. If that is true in this case, then the programmer could<br />
just as well have used a static or automatic array of fixed size<br />
instead of malloc. Even if these sorts of transformations are not possible, it can still<br />
be instructive to look at the places where memory allocation is used. This requirement<br />
will tend to discourage unnecessary uses of malloc.<br />

Of course, a development team could use most of MISRA, but totally disregard rules that<br />

do not seem practical for their application given the amount of development time that<br />

they have. For example, a team could follow all of the required MISRA rules, except for<br />

the rule that prohibits dynamic memory allocation. They could also decide to follow<br />

many of the useful advisory guidelines, such as Directive 4.6 (which says to use length-specific<br />

types instead of the built-in types). Later on, perhaps after completing the next<br />

milestone, the team could reconsider any rules that they chose to disregard in the last<br />

pass.<br />

It might also be necessary to add additional rules beyond what MISRA calls for. For<br />

example, MISRA C++ allows exception handling, but a given system may not be able<br />
to accommodate the ROM cost of the exception-handling tables, especially if exceptions are rarely used. If the compiler<br />

offers an option that excludes exception handling in order to generate better code, this<br />

might be the right thing to do. Others claim that exception handling makes a program<br />

difficult to analyze.<br />

One thing that makes MISRA particularly attractive is that a number of embedded tools<br />

vendors are already checking these rules in their compilers and code checkers. This off-the-shelf<br />
support makes MISRA easier to adopt than other alternatives that specify rules but have<br />

little infrastructure to back them up.<br />

Conclusion<br />

MISRA is a valuable tool for programming teams trying to write highly secure and<br />

reliable code. The rules are well thought out and provide many insights into likely errors<br />

and constructs that may cause security problems. Almost anyone who writes C or C++<br />

code will find MISRA’s coding guidelines useful. Consistent use of MISRA will increase<br />

the security of your software.<br />



Write Safe AND Secure Application Code with<br />

MISRA C:2012<br />

Mark W. Richardson<br />

Lead Field Application Engineer<br />

LDRA<br />

Wirral, UK<br />

mark.richardson@ldra.com<br />

I. INTRODUCTION<br />

When examined with a critical eye, the commonly held belief<br />

that security-critical and safety-critical code are hugely different from<br />
each other is a conundrum. Why would that be?<br />

Within the safety domain, the aim for software developers is<br />

to produce code that performs as required, whilst ensuring that<br />

erroneous behaviour does not result in an accident.<br />

Within the security domain, the aim is to produce software<br />

that performs as required whilst ensuring that manipulation of<br />

input data does not result in denial of service or the leaking of<br />

sensitive data.<br />

Best practice for the development of either safety or security<br />

critical code is to apply a formalised software development<br />

process, starting with a set of requirements and tracing those<br />

requirements through to executable code. Undefined,<br />

unspecified and implementation-defined behaviours within the<br />

C language can lead to safety or security failures. And data<br />

handling errors such as invalid values, domain violations,<br />

tainted data, and leaking of confidential information can<br />

prevent both safety and security objectives from being<br />

realised.<br />

With so much commonality between perceived optimal<br />

practices for safety and security critical code, it is a puzzle as<br />

to why there is a common misconception that MISRA i is just<br />

for safety-related not for security-related projects. In response<br />

to that misconception, in April 2016, MISRA released<br />

“MISRA C:2012 – Addendum 2 ii ” which highlights which of<br />

the 46 C Secure iii rules are covered by the MISRA C:2012 iv<br />

guidelines.<br />

Even though MISRA C:2012 Amendment 1 v was written to<br />

further ensure complete coverage of the C Secure rules in the<br />

MISRA C:2012 standard, to a large extent it does so by<br />

enhancing the language of existing checks. For the most part,<br />

these enhancements explain why those checks are important<br />

from a security perspective with reference to the ISO C Secure<br />

Guidelines, particularly with regards to the use of<br />

"untrustworthy data.“<br />

In other words, the original MISRA C:2012 document has<br />

always targeted concerns such as buffer overruns and memory<br />

errors, and they have always been important for both safety<br />

and security. It has always promoted the detection of<br />

inconsistent data use, pertinent for all critical code. More<br />

generally, it has always aimed to ensure that defects are not<br />

introduced into the code, rather than adopting a set of checks<br />

to try and identify them after the fact.<br />

II. THE IMPORTANCE OF PROCESS STANDARDS AND<br />

GUIDELINES<br />

Safety-critical industries such as aerospace, automotive, rail,<br />

and medical, use process standards that address the rigour in<br />

which activities need to be performed during the development<br />

life cycle stages with respect to the functional safety of the<br />

system being developed. Coding standards and guidelines,<br />

such as MISRA C, are a critical part of this process. MISRA C<br />

defines a subset of the C language suitable for developing any<br />

application with high-integrity or high-reliability<br />

requirements. Although MISRA guidelines were originally<br />

designed to promote the use of the C language in safetycritical<br />

embedded applications within the motor industry, they<br />

have gained widespread acceptance in many other industries<br />

as well.<br />

The illustration in Figure 1 is a typical example of a table<br />

from ISO 26262-6:2011 vi , which mirrors similar tables both<br />

in IEC 61508 vii , and in other derivatives such as IEC 62304 viii<br />

(used in the development of medical devices). It shows the<br />

coding and modelling guidelines to be enforced during<br />

implementation, superimposed with an indication of where<br />

compliance can be confirmed using automated tools.<br />

These guidelines combine to make the resulting code more<br />

reliable, less prone to error, easier to test, and/or easier to<br />

maintain.<br />

www.embedded-world.eu<br />



IV. MISRA C SECURITY AMENDMENTS<br />

After the publication of MISRA C:2012, the WG14 ix<br />

committee responsible for maintaining the C standard<br />

published the ISO/IEC 17961:2013 C Language Security<br />

Guidelines x , designed to limit the use of the C language to a<br />

subset excluding the more vulnerable features of the language.<br />

The intention was for all rules to be enforceable using static<br />

analysis such that their detection could be automated without<br />

generating excessive false positives.<br />

Figure 1 - Mapping the capabilities of the LDRA tool suite to<br />

“Table 6: Methods for the verification of the software<br />

architectural design” specified by ISO 26262-6:2011<br />

III.<br />

THE SAFE AND SECURE SYSTEM<br />

The enterprise computing community has traditionally taken a<br />

“fail-first and patch-later” approach to secure system<br />

development. This development life-cycle consists of a largely<br />

laissez-faire attitude to development, and the subsequent<br />

application of penetration tests, fuzz tests and fault injection to<br />

expose and correct any unwanted behaviour. Such a reactive<br />

approach is not adequate when safety critical applications are<br />

involved, where functional safety standards already demand a<br />

much more proactive development approach (Figure 2) – and<br />

that proactive attitude is equally essential where a connected<br />

system must be dependable, trustworthy and resilient in order<br />

to protect critical data.<br />

Developers of functionally safe systems in accordance with<br />

such as DO-178, ISO 26262 and IEC 61508 are required to<br />

perform a functional safety risk assessment as part of the<br />

development lifecycle. Not only does it make sense to mirror<br />

that approach to perform a functional security risk assessment,<br />

but it is obligatory if those security risks represent a potential<br />

safety risk too. The identification of security risks involved in<br />

developing and deploying the product should be assessed and<br />

mitigation activities reflected in the security requirements. The<br />

design and coding stages can then also reflect the aspects of<br />

security requirements along with functional and non-functional<br />

requirements.<br />

It was in response to ISO/IEC 17961 that the MISRA<br />

committee developed “MISRA C:2012 – Addendum 2”,<br />

highlighting which of the 46 C Secure rules are covered by the<br />

original MISRA C: 2012 guidelines. MISRA C:2012<br />

Amendment 1 was written to further ensure complete<br />

coverage of the C Secure rules. The amendment is an<br />

extension MISRA C:2012.<br />

It establishes 14 new guidelines for secure C coding to<br />

improve the coverage of the concerns highlighted by the ISO<br />

C Secure Guidelines including, for example, issues pertaining<br />

to the use of “untrustworthy” data—a well-known security<br />

vulnerability. By following the additional guidelines,<br />

developers can more thoroughly analyse their code and can<br />

assure regulatory authorities that they have adopted best<br />

practice. This is becoming critical in many fields of endeavour<br />

including the automotive industry, the Industrial Internet of<br />

Things (IIoT), and the medical device sector – in short,<br />

wherever security threats have led to OEM demands for<br />

developers to prove that their software meets the highest<br />

standards for security as well as safety.<br />

V. INSECURE CODING EXAMPLES AND RELATED RULES

To put the amendment into context, it is useful to review examples of where the additional rules apply.

Example 1: Rule 12.5

Rule 12.5 states, "The sizeof operator shall not have an operand which is a function parameter declared as 'array of type'."

Many developers use the sizeof operator to calculate the size of an array. In a normal scenario that works fine. But when that approach is applied to an array passed as a function parameter, the parameter is passed as a "pointer to type". Consequently, an attempt to calculate the number of elements usually returns an incorrect value, as illustrated in Figure 2 – and in this case, results in an array bound being exceeded.

Figure 2 - The traditional V software development life cycle model incorporates security activities from the early stages



void f1 (void)
{
    char ch;
    ch = (char)getchar();
    if (EOF != (int)ch) /* Non-compliant - getchar returns an int
                           which is cast to a narrower type */
    {
    }
}

Figure 2: Source code example

Automatic Detection of Rule Violation at an Early Stage

A static analysis tool can be used to check for the use of such syntax (Figure 3).

Peer reviews represent a traditional approach to enforcing adherence to such guidelines, and whilst they still have an important part to play, automating the more tedious checks using tools is far more efficient, less prone to error, repeatable, and demonstrable (Figure 4).

Figure 3 - The LDRA TBvision tool detects the MISRA C:2012 rule violation for the "sizeof" operator example

Example 2: Rule 22.7

Rule 22.7 states, "The macro EOF shall only be compared with the unmodified return value from any Standard Library function capable of returning EOF."

An EOF (End Of File) return value from standard library functions is used to indicate that the relevant stream has either reached the end of the file, or that an error has occurred in reading from or writing to that file. The macro EOF is defined as an "int" with a negative value.

If the EOF value is captured in a variable of incorrect type, then it may become indistinguishable from a valid character code. It is therefore important to use an "int" to store the return code from functions such as "getchar()" or "fgetc()", and to avoid the common practice of storing the result in a char.

Figure 4 - LDRA TBvision reports the Rule 22.7 (EOF comparison with char) violation in the source code.

VI. CHOOSING A LANGUAGE SUBSET

Although there are several language subsets (or, less formally, "coding standards") to choose from, these have traditionally been focused primarily on safety rather than security. More recently, with the advent of the Industrial Internet of Things, connected cars, and connected heart pacemakers, that focus has shifted towards security to reflect the fact that systems such as these, once naturally secure through isolation, are now increasingly accessible to aggressors.

There are, however, subtle differences between the various subsets, perhaps a reflection of the development dichotomy between designing for security and appending some measure of security to a developed system. To illustrate this, it is useful to compare and contrast the approaches taken by the authors of MISRA C and CERT C with respect to security.



A. Retrospective adoption

MISRA C:2012 categorically states that "MISRA C should be adopted from the outset of a project. If a project is building on existing code that has a proven track record then the benefits of compliance with MISRA C may be outweighed by the risks of introducing a defect when making the code compliant."

This contrasts in emphasis with the assertion of the CERT C xi authors that although "the priority of this standard is to support new code development…. A close-second priority is supporting remediation of old code".

Of course, as with the system as a whole, the level of risk involved with the compromise of the system will influence the approaches to be adopted. Certainly, the retrospective application of any language subset is better than nothing, but late adoption does not represent best practice.

B. Relevance to safety, high integrity and high reliability systems

MISRA C:2012 "define[s] a subset of the C language in which the opportunity to make mistakes is either removed or reduced. Many standards for the development of safety-related software require, or recommend, the use of a language subset, and this can also be used to develop any application with high integrity or high reliability requirements". The accurate implication of that statement is that MISRA C was always appropriate for security-critical applications, even before the security enhancements introduced by MISRA C:2012 Amendment 1.

CERT C attempts to be more all-encompassing, as reflected in its introductory suggestion that "safety-critical systems typically have stricter requirements than are imposed by this standard … However, the application of this coding standard will result in high-quality systems that are reliable, robust, and resistant to attack".

C. Decidability

The primary purpose of a requirements-driven software development process as exemplified by ISO 26262 is to control the development process as tightly as possible, to minimize the possibility of error or inconsistency of any kind. Although that is theoretically possible by manual means, it will generally be far more effective if software tools are used to automate the process as appropriate.

In the case of static analysis tools, that requires that the rules can be checked algorithmically. Compare, for example, the excerpts shown in Figure 5, both of which address the same issue. The approach taken by MISRA is to prevent the issue by disallowing the inclusion of the pertinent construct. CERT C instead asserts that the developer should "be aware" of it.

Of course, there are advantages in each case. The CERT C approach is clearly more flexible; something of particular value if rules are applied retrospectively. MISRA C:2012 is more draconian, and yet by avoiding the side effects altogether the resulting code is certain to be more portable, and it can be automatically checked by a static analysis tool. It is simply not possible for a tool to check whether a developer is "aware" of side effects – and less possible still to ascertain whether "awareness" equates to "understanding".

The net effect is that a static analysis tool can make the same checks, but the detection of an issue has different implications. For MISRA – "You have a violation that either needs to be removed or a deviation introduced". For CERT – "Did you mean to do this?" The former is clearly easier to police.

Figure 5: Contrasting approaches to the definition of coding rules

D. Precision of rule definitions

The stricter, more precisely defined approach of MISRA not only lends itself to a standard more suitable for automated checking. It also addresses the issue of language misunderstanding more convincingly than CERT C.

Evidence suggests that there are particular characteristics of the C language which are responsible for most of the defects found in C source code xii, such that around 80% of software defects are caused by the incorrect usage of about 20% of the available C or C++ language constructs. By restricting use of the language to avoid the parts that are known to be problematic, it becomes possible to avoid writing the associated defects into the code and, as a result, software quality greatly increases.

This approach also addresses a more subtle issue surrounding the personalities and capabilities of individual developers. Simple statistics tell us that of all the C developers in the world, 50% of them have below-average capabilities – and yet it is very rare indeed to find a development team manager who would acknowledge that they recruit any such individuals. In any software development team there will be some who are more able than others, and it is human nature for people not to highlight the fact if there are things they don't understand. Furthermore, it is common for less experienced programmers to be writing code, especially in large teams; typically the most experienced members will be involved in management and requirements definition, with the new intake being used to code from the decomposed requirements.

Figure 6 uses the handling of variadic functions to illustrate how this approach differs from that of CERT C. CERT C calls for developers to "understand" the associated type issues, but doesn't suggest how a situation might be handled where a developer is, despite the best of intentions, harbouring a misunderstanding.

A counter argument might be that there will be developers who are very aware of the type issues associated with variadic functions, who make very good use of them, and who may feel affronted by the tighter restrictions on their use. However, for highly safety- or security-critical systems, MISRA would assert that because the "opportunity to make mistakes is either removed or reduced", that is a price well worth paying.

Figure 6: Comparing differing precision of rule definition

VII. CONCLUSIONS

Best practice for the development of either safety- or security-critical code is to apply a formalised software development process, starting with a set of requirements and tracing those requirements through to executable code. Even so, undefined, unspecified and implementation-defined behaviours within the C language can lead to safety or security failures in the resulting code base. And data handling errors such as invalid values, domain violations, tainted data, and leaking of confidential information can prevent both safety and security objectives from being realised.

MISRA C:2012 is not the only coding standard option for those with a need to develop secure code. For example, the correct application of either CERT C or MISRA C:2012 will certainly result in more secure code than if neither were to be applied. However, for safety- or security-critical applications, MISRA C is considerably less error prone, both because it is specifically designed for such systems and as a result of its stricter, more decidable rules. Conversely, there is an argument for using the CERT C standard if the application is not critical but is to be connected to the internet for the first time. The retrospective application of CERT C might then be a pragmatic choice to make, though it would likely be accompanied by a list of issues where confirmation of intent is required.

COMPANY DETAILS

LDRA
Portside
Monks Ferry
Wirral
CH41 5LH
United Kingdom
Tel: +44 (0)151 649 9300
Fax: +44 (0)151 649 9666
E-mail: info@ldra.com

CONTACT DETAILS

Presentation Co-ordination
Mark James
Marketing Manager
E-mail: mark.james@ldra.com

Presenter
Mark Richardson
Lead Field Applications Engineer
E-mail: mark.richardson@ldra.com


i. MISRA – The Motor Industry Software Reliability Association. https://www.misra.org.uk/Publications/tabid/57/Default.aspx

ii. MISRA C:2012 – Addendum 2: Coverage of MISRA C:2012 against ISO/IEC TS 17961:2013 "C Secure". ISBN 978-906400-15-6 (PDF), April 2016.

iii. ISO/IEC TS 17961:2013 Information technology – Programming languages, their environments and system software interfaces – C secure coding rules.

iv. MISRA C:2012 – Guidelines for the Use of the C Language in Critical Systems. ISBN 978-1-906400-10-1 (paperback), ISBN 978-1-906400-11-8 (PDF), March 2013.

v. MISRA C:2012 – Amendment 1: Additional security guidelines for MISRA C:2012. ISBN 978-906400-16-3 (PDF), April 2016.

vi. ISO 26262-6:2011 Road vehicles – Functional safety – Part 6: Product development at the software level.

vii. IEC 61508-1:2010 Functional safety of electrical/electronic/programmable electronic safety-related systems – Part 1: General requirements.

viii. IEC 62304 Medical device software – Software life cycle processes. Consolidated Version, Edition 1.1, 2015-06.

ix. International standardization working group for the programming language C, JTC1/SC22/WG14. http://www.open-std.org/jtc1/sc22/wg14/

x. ISO/IEC TS 17961:2013 Information technology – Programming languages, their environments and system software interfaces – C secure coding rules.

xi. SEI CERT C Coding Standard. https://wiki.sei.cmu.edu/confluence/display/c/SEI+CERT+C+Coding+Standard

xii. Jim Bird, "Applying the 80:20 Rule in Software Development", Nov 15, 2013. https://dzone.com/articles/applying-8020-rule-software



Hypervisors in Embedded Systems
Applications and Architectures

Jack Greenbaum
Green Hills Software, Inc.
Santa Barbara, California, USA
jackg@ghs.com

Cesare Garlati
prpl Foundation
Santa Clara, California, USA
cesare@prplFoundation.org

Abstract — As microprocessor architectures have evolved with direct hardware support for virtualization, hypervisor software has become not just practical in embedded systems, but present in many commercial applications. This paper discusses embedded systems use cases for hypervisors, including their use in workload consolidation and security applications.

Keywords — hypervisor; virtualization; virtual machine; guest OS; embedded systems; security; IoT; Internet of Things.

I. INTRODUCTION

Hypervisors are a type of operating system software that allows multiple traditional operating systems to run on the same microprocessor [1]. They were originally introduced in traditional IT data centers to solve workload balancing and system utilization challenges. Initial hypervisors required changes to the guest OS to compensate for a lack of hardware support for the isolation required between guest operating systems. As microprocessor architectures have evolved with direct hardware support for virtualization, hypervisors have become not just practical in embedded systems, but are present in deployed applications [2]. Hypervisors are here to stay in embedded systems. This paper discusses embedded systems use cases for hypervisors, including their use in workload consolidation and security applications.

Hardware support for virtualization in modern microprocessors has been the necessary enabler for virtualization to move from the data center to embedded systems. All of the major processor architectures have evolved with virtualization extensions; notable examples include Intel VT-x, the ARM Virtualization Extensions, and the MIPS VZ extensions. This support includes a distinct hypervisor execution mode at a higher privilege level than the traditional supervisor mode, and IOMMUs to isolate the peripheral devices used by different guest operating systems from each other. Without an IOMMU the unique IO requirements of embedded systems cannot be properly separated. The Intel version is called VT-d, and most ARM processors have a "System MMU". In the data center the IOMMU is often associated with Single Root I/O Virtualization, or SR-IOV.

The rest of this paper focuses on the use cases for hypervisors in embedded systems, and introduces the capabilities that hypervisors provide to implement these use cases.

II. USE CASES

A. Consolidation

The most common use of hypervisors is to consolidate multiple workloads onto a single platform in order to reduce size, weight, power, or cost. This is the same use case that has driven broad adoption of hypervisors in the IT server space. As servers have grown to have more capacity than any single application requires, virtualization lets one server combine multiple applications. But integrating multiple applications from different customers onto one operating system puts too many constraints on what functions can be combined. Virtualization instead runs multiple operating systems on the same hardware, allowing complete applications to run on the same hardware with very little interaction.

Consolidation use cases are becoming common in automotive systems. One example is combining the instrument cluster and in-vehicle infotainment (IVI) systems into a single electronic control unit (ECU). The instrument cluster is typically built on a real-time operating system (RTOS), while IVI is often built on Linux or another general-purpose OS (GPOS). The real-time and safety requirements of the instrument cluster cannot be met by a GPOS, and the media libraries required for IVI are expensive to port to an RTOS. Therefore, integrating these two functions into one OS is not feasible. But a hypervisor with real-time and safety guarantees can run both the RTOS and the GPOS on the same processor within a single ECU. This saves not only cost (by having only one processor and circuit board), but also space in a vehicle that is increasingly full of ECUs for modern safety features.

B. Legacy Operating Systems

As systems evolve over time, it often becomes necessary to make a shift to a new operating environment to enable new features. Preserving the existing features of the system would then require porting already tested and field-proven software to the new platform. Virtualization, on the other hand, allows running the existing operating environment alongside the new software on the same processor. One example is a software defined radio. Over time the product requirements may evolve to require a transition from a simple LCD user interface to a graphical user interface (GUI). The radio software may have a high cost to recertify. By using virtualization, the radio protocol software can be maintained while a second OS with a modern GUI library runs alongside. By running the radio protocol software unchanged (or with minimal change), a GUI can be added while minimizing or eliminating recertification costs for the radio protocol software. This use case is most common in deeply embedded and very cost sensitive applications.

C. Multiple Independent Levels of Security (MILS)

A third example is a combination of the consolidation and legacy cases. The application in this case is to provide security isolation between two different workloads that have different security postures. Two examples are running Trusted Execution Environments and dual-persona smart phones.

Trusted Execution Environments (TEEs) provide security-critical processing in an environment isolated from the rest of the system. Use cases include secure boot, cryptographic services, and security-critical device feature management, including the IOMMU. This is similar to consolidation in that cryptographic services have traditionally been offloaded to a separate, smaller core. The advantage of running cryptographic services in a TEE on the main processor cores is typically higher performance than the traditional approach.

A dual-persona phone acts like two separate smart phones. In the common application, one partition is an operationally secure partition, while the other partition can be updated or reconfigured by the user. Typically, the secure partition is controlled by a business IT department or a government entity that manages compartmentalized information. Such a partition may have access to restricted networks and therefore contains high-value encryption keys and information. The software load on the secure partition is often locked down and verified at boot time. The second partition is often called a user partition; it has access to the public internet, and can install apps from an app store and access other unsecured content. The underlying assumption is that the hypervisor provides a higher level of isolation than the individual operating systems being virtualized.

III. HYPERVISOR CAPABILITIES

All hypervisors provide isolated sharing of System on a Chip (SoC) resources, but differ in the scope and depth of support for sharing the different hardware elements.

A. Memory Sharing

The most basic hypervisor runs on a multi-core SoC and provides only for sharing of memory. Each CPU core runs a separate software load. The hypervisor configures the SoC's virtualized memory management – see the section below – to restrict each CPU to a portion of the memory address space of the SoC, including both RAM and peripheral registers. This allows multiple operating systems to run on a single SoC with disjoint peripherals and secure shared access to RAM.

B. CPU Sharing

A more capable hypervisor also allows sharing of individual CPU cores via time slicing. This allows different workloads to have access to all CPU resources during times of heavy demand, and to partition that access based on priority. For example, when consolidating RTOS and GPOS workloads, the RTOS is typically given priority on the CPUs, while the GPOS gets a guaranteed minimum amount of execution time. The GPOS has full access to the CPUs when the RTOS is idle. Note that a hypervisor that supports CPU sharing in this way typically must be written with real-time behavior in mind, and is often based on an RTOS.

C. Peripheral Sharing

Another set of hypervisor features revolves around sharing of peripherals, such as mass storage, communication links, and GPUs. Embedded systems are often cost sensitive, so the ability to share devices such as eMMC mass storage is required. There are several different techniques for implementing peripheral sharing, including mediated pass-through, device emulation, and paravirtualization. Each approach has its strengths and weaknesses. A full discussion of these concepts is beyond the scope of this paper, but when considering the use of a hypervisor the sharing of devices is as important to consider as the sharing of the CPU.

IV. CONCLUSION

Hypervisors have moved from the data center to embedded systems, enabled by hardware support in modern microprocessors. We have outlined the common use cases for virtualization, and considerations for device sharing.

REFERENCES

[1] Security Guidance for Critical Areas of Embedded Computing, prpl Foundation, January 2016 – https://prpl.works/security-guidance/
[2] prplSecurity Framework Application Note, prpl Foundation, July 2017 – https://prpl.works/application-note-july-2016/



Digging Into Embedded Virtualized Systems
Overcoming the Barriers to Debugging Hardware and Software

Khaled Jmal
Lauterbach GmbH
Höhenkirchen-Siegertsbrunn, Germany
khaled.jmal@lauterbach.com

Rudolf Dienstbeck
Lauterbach GmbH
Höhenkirchen-Siegertsbrunn, Germany
rudolf.dienstbeck@lauterbach.com

Abstract — In order to save money, the functions of several electronic devices are consolidated on a common hardware unit. A hypervisor separates the functions on the software side. This makes debugging more challenging, but by no means impossible.

Keywords — hypervisor; debugging; awareness

I. INTRODUCTION

Hypervisor – embedded software developers are currently faced with this term all the time. There is almost a hype around this technology (pun intended). For instance, it seems to be a focal point of discussion at the moment in the automotive, aviation and aerospace segments, as well as in the field of medical technology. However, what impact does this have on the development cycle and, in particular, on debugging? Debugging tools, particularly those that access the hardware (e.g. JTAG debuggers), need to take a great deal into consideration when a hypervisor is utilized on the target system. Naturally, the developer wants a tool at their disposal that shows them the complete status of the embedded system, including all components such as the hypervisor, guest operating systems and guest processes.

II. SEVERAL MACHINES ON A SINGLE PIECE OF HARDWARE

A hypervisor allows different virtual machines (VMs), also called guest machines, to run on a single piece of hardware. This permits, for example, several operating systems to run on the same host. The hypervisor is responsible for allowing these operating systems to run on a single computer, either by dividing the CPU across the operating systems in a time-slice technique, or by dynamically assigning the individual cores to different guests in a multi-core environment. Everybody is aware of hypervisors on desktop computers, such as VMware or VirtualBox, which can be used, for example, to run one (or several) complete Linux distribution(s) on Windows. Other examples that are also utilized in embedded systems include Xen, KVM, Jailhouse and QEMU.

A concrete application from the embedded systems segment may be structured as follows: the objective is for a car dashboard to work with an industrial Linux distribution, for the infotainment system to operate using Android, for the air conditioning to utilize FreeRTOS, and for the engine control to work with an AUTOSAR stack. In the past, four (and even more) different hardware platforms were actually required for this purpose. However, all of these functions are now integrated into a single system and, where possible, even on a single CPU.

Why? The first reason can be attributed to costs.<br />

Nowadays, embedded systems are so powerful that a single<br />

system is able to complete all of these tasks. Furthermore, it is<br />

also cheaper to produce and install an integrated hardware<br />

module rather than four different systems. This is the primary<br />

motivation as every penny counts, especially in the automotive<br />

industry. As an "add-on", a hypervisor provides an extra layer<br />

of security and protection. The hypervisor is able to monitor all<br />

guests and act accordingly in the event of issues, e.g. by<br />

restarting a guest. It is also essential to protect the guests from<br />

unwanted interaction. A technical prerequisite for this is to<br />

ensure that all guests are kept separate from each other in terms of hardware via an independent Memory Management Unit (MMU) (Figure 1).<br />
Fig. 1. A hypervisor coordinates the operation of several virtual machines on a real machine.<br />

In terms of hardware, the individual guests can be separated<br />

from each other if the CPU provides a complete hardware<br />

abstraction. In order to do so, three things must be virtualized<br />

in principle: The memory, the peripheral equipment and the<br />

CPU itself. The guest's operating system should not even know<br />

that it is running in a virtualized machine. This requires that the<br />

MMU supports two stages of address translation. The first<br />

stage translates the guest virtual address to a guest physical<br />

address also called intermediate address. The intermediate<br />

address is then translated in a second MMU stage of the<br />

hypervisor to the real physical address. The peripheral<br />

equipment is also virtualized ("virtual I/O") in order to ensure<br />

that each guest is able to interact with the environment. In<br />

doing so, the hypervisor decides which guest may access which<br />

piece of peripheral equipment and which guest responds to interrupts. Finally, each guest receives one or several virtual<br />

CPUs that are mapped on the actual cores via a scheduler. In<br />

doing so, the number of virtual CPUs of a specific guest can be<br />

lower or greater than the number of real cores.<br />
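The two translation stages described above can be illustrated with a toy Python sketch (page size, page numbers and table contents are invented; real MMU tables are multi-level hardware structures):<br />

```python
PAGE = 4096  # toy 4 KiB page size

# Stage 1: guest virtual page -> guest physical ("intermediate") page,
# maintained by the guest operating system.
stage1 = {0x400: 0x100, 0x401: 0x101}

# Stage 2: intermediate page -> real physical page,
# maintained by the hypervisor.
stage2 = {0x100: 0x8000, 0x101: 0x9F00}

def translate(guest_va):
    """Full two-stage walk: guest virtual -> intermediate -> physical."""
    page, offset = divmod(guest_va, PAGE)
    intermediate = stage1[page]      # stage 1 (guest MMU tables)
    physical = stage2[intermediate]  # stage 2 (hypervisor tables)
    return physical * PAGE + offset

# The guest only ever sees stage 1; the hypervisor adds stage 2 underneath.
assert translate(0x400 * PAGE + 0x42) == 0x8000 * PAGE + 0x42
```

The point of the sketch is that the guest OS can manage stage 1 as if it owned the hardware, while the hypervisor transparently relocates the guest in physical memory via stage 2.<br />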

III. HYPERVISOR IMPACT ON DEBUGGERS<br />

There are in principle two debugging methods: software-controlled run mode debugging and hardware-controlled stop mode debugging.<br />

A. Run Mode Debugging<br />

The run mode debugging method involves loading additional debug software (also called a "debug agent") onto the target platform, which performs the actual debugging. Single-step mode, breakpoints, etc. are all managed by this piece of<br />

software. A typical example is the use of a gdbserver to<br />

remotely debug a Linux process. The debugger user interface<br />

on the development computer then communicates with the<br />

debug agent e.g. via a serial interface or Ethernet. On a<br />

breakpoint hit, only the component to be debugged, e.g. the<br />

Linux process, is stopped. The rest of the system will continue<br />

to run. This is the reason why this method is called “run<br />

mode”. Such a debug session only requires an appropriate<br />

communication channel. If an underlying hypervisor is present,<br />

the channel is simply routed through it (Figure 2). Once this<br />

route has been established, neither the debugger nor the agent<br />

is aware that a hypervisor is present in-between them, i.e. the<br />

debugging is "hypervisor agnostic". This method is perfect if<br />

the system needs to continue during the debugging, e.g.<br />

because protocols need to be served. Run mode debugging is<br />

completely sufficient to debug a single component, for instance<br />

a process within a single machine. However, this method reaches its limits as soon as the guest operating system or the hypervisor is involved. In this case a different debugging approach that allows a system-wide view is required.<br />
Fig. 2. Run mode debugging with a gdbserver<br />

B. Stop Mode Debugging<br />

When debugging, developers generally want to see everything: the hypervisor, all guests and all guest processes, all at the same time! This is, in<br />

principle, not possible in run mode for the aforementioned<br />

reasons. But it is possible in stop mode, which is the main<br />

strength of this option. In hardware-controlled stop mode<br />

debugging, the debugger is connected directly to the processor<br />

via a dedicated interface which is typically JTAG. The<br />

debugger uses this interface to control the CPU itself, e.g. stop<br />

it, trigger individual program steps, read the registers or<br />

memory. This also means that the entire system, including all<br />

processes, guests and – of course – the hypervisor, is stopped in<br />

the event of a breakpoint. In such a case, no more interrupts are<br />

operated, no communication protocols run and no VM, process<br />

or task changes take place. The CPU is effectively "frozen",<br />

which is why it is called "stop mode". Since a hardware<br />

debugger accesses the system via the CPU, it can initially only “see” the components that are exposed by the MMU in this<br />

state, i.e. only the guest currently running on the CPU and only<br />

the currently active process. The debugger is however able to<br />

do slightly more than that: thanks to a temporary, minimal manipulation of the MMU registers, it can also directly read the physical address space and the current "intermediate" (= "guest physical") address space. However, all debug symbols<br />

belonging to the processes and guests are stored on virtual<br />

addresses, meaning that this additional view is not particularly<br />

useful to begin with. Therefore, the debugger needs to translate<br />

the virtual address to the corresponding physical address, i.e.<br />

perform the MMU table walk, before accessing the physical<br />

address space. This can be done for the current context by<br />

reading the page table pointers from the MMU registers.<br />

However, for the debugger to be able to see everything beyond<br />

the current status, the information about the MMU tables of the<br />

single tasks, virtual machines and the hypervisor needs to be<br />

extracted from the guest operating systems and from the<br />

hypervisor. The debugger needs also to be “aware” of the<br />

hypervisor as well as of the single guest operating systems.<br />

This requires a "hypervisor awareness", an "OS awareness" for<br />

each guest and an "MMU awareness" for both the hypervisor<br />

as well as for each guest.<br />

IV. DEBUGGER NEEDS TO HAVE "AWARENESS"<br />

A hypervisor awareness is used to determine the list of the<br />

virtual machines, their IDs, virtual CPUs and the MMU<br />

settings. The awareness uses the hypervisor debug symbol<br />

information (ELF/DWARF) in order to read the necessary<br />

information from the system. The hypervisor awareness is also<br />

responsible for managing the layout of the stage 2 MMU<br />

translation so that the debugger has access to all VMs. An "OS<br />

awareness" is additionally required for each guest in order to<br />

analyze the content of a guest operating system. The awareness<br />

is also developed specifically for each OS in use. This<br />

awareness then determines the processes of the operating<br />

system and the MMU settings within the VM as well as the<br />

MMU table layout (stage 1 MMU translation). For this<br />

purpose, the awareness then uses the debug symbol information belonging to the respective operating system. As a result, the debugger is able to illustrate a hierarchical tree of the entire system. Processes, threads and other resources can be illustrated (Figure 3).<br />
Fig. 3. A tree structure illustrates the target system layout<br />
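The hierarchical system view assembled from the hypervisor and per-guest OS awareness can be pictured with a toy Python sketch (the hypervisor, VM and process names below are invented for illustration):<br />

```python
# Hypothetical target layout, as a debugger could assemble it from
# hypervisor awareness (list of VMs) and per-guest OS awareness
# (processes/tasks). All names are invented.
system = {
    "Xen hypervisor": {
        "VM 1 (Linux)": ["init", "dashboard_ui"],
        "VM 2 (FreeRTOS)": ["vControlLoop"],
    }
}

def dump(tree, indent=0):
    """Render the nested layout as an indented tree, one line per node."""
    lines = []
    for name, children in tree.items():
        lines.append("  " * indent + name)
        if isinstance(children, dict):
            lines.extend(dump(children, indent + 1))
        else:
            lines.extend("  " * (indent + 1) + leaf for leaf in children)
    return lines

print("\n".join(dump(system)))
```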

With this awareness of the system layout, the debugger can<br />

read the list of guests and processes as well as their MMU<br />

tables from the system. Equipped with this knowledge, the<br />

debugger can now perform the MMU table walk for each virtual address of a guest or process itself, i.e. bypassing the hardware MMU, and read the respective data directly from physical memory. Using this method, the<br />

debugger accesses all addresses belonging to all guests and all<br />

processes, irrespective of whether they are virtual, intermediate<br />

or physical. And all of this is done at the same time!<br />

Various commands and windows can be applied specifically to a certain machine or certain process. For instance, the process<br />

taking place on a Linux machine and the task being performed<br />

by a FreeRTOS device can be shown at the same time. The<br />

loaded debug symbols can be assigned to a certain machine or<br />

certain process. Using a machine ID and a process ID, each<br />

virtual address is unambiguous.<br />
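The disambiguation by machine ID and process ID can be sketched in a few lines of Python (the IDs, addresses and symbol names are invented for illustration):<br />

```python
# A virtual address alone is ambiguous: the same address can exist in
# every guest and every process. Keyed by machine ID and process ID,
# it becomes unique.
symbols = {
    # (machine_id, process_id, virtual_address) -> symbol
    (1, 17, 0x80000000): "linux_app.main",
    (2,  3, 0x80000000): "freertos.vControlLoop",
}

def resolve(machine_id, process_id, vaddr):
    """Look up the symbol for an address within one machine/process context."""
    return symbols.get((machine_id, process_id, vaddr), "<unknown>")

# One and the same virtual address maps to different symbols per context.
assert resolve(1, 17, 0x80000000) == "linux_app.main"
assert resolve(2, 3, 0x80000000) == "freertos.vControlLoop"
```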

If the software hits a breakpoint, the entire system will<br />

be stopped as described above. The debugger then<br />

automatically switches to the (real) core that stopped at the<br />

breakpoint and displays the current machine and process on<br />

this core. This allows the user to immediately see the<br />

conditions that led to this break. Naturally, it is possible to<br />

manually switch to other cores and their "current machines".<br />

Moreover, the user is not only able to switch the view to other<br />

hardware cores; he can also switch to other, currently inactive<br />

guest systems. As a result, a symbolic access to all of the<br />

functions and variables of other machines is possible at all<br />

times. If the registers are not loaded in a real core at this moment in time, the debugger reads the values from the hypervisor or guest system memory. Using these values, the debugger determines the current stack frame in order to, for instance, display the current call hierarchy of a task's functions.<br />

Straightaway, the developer sees the current progress of the<br />

task and why it may be potentially waiting.<br />

Lauterbach has created a reference implementation with the<br />

Xen hypervisor and the Linux and FreeRTOS guests on a<br />

Hikey board that demonstrates the functionality. The MMU<br />

support implemented in the TRACE32 debugger and an<br />

expansion of the address management to virtualized systems<br />

permit access to all components at all times. This enables a<br />

debugging of the hypervisor, all guest operating systems and<br />

all guest processes. Consequently, even a retrospective analysis of a memory dump is possible without any problems.<br />



Autonomous Driving needs Safety and Security<br />

Dr. Ciwan Gouma<br />

SYSGO AG<br />

Manager Business Development Automotive<br />

Klein-Winternheim, Germany<br />

ciwan.gouma@sysgo.com<br />

Abstract— Internet in cars, vehicles communicating with<br />

each other and with the infrastructure - many new and very<br />

important functions for driver assistance and autonomous<br />

driving will become reality in the future.<br />

What ideas and concepts from the IT security and the<br />

avionics industry can we use? How can synergies be derived<br />

from the joint implementation of safety and security<br />

requirements, which also increase efficiency for developers,<br />

SW architects and testers? What requirements should a<br />

MILS Operating System (Multiple Independent Levels of<br />

Safety/Security) meet to minimize risks, reduce<br />

development times and reduce development costs?<br />

Keywords — Automotive CyberSecurity Overview & recommendations,<br />

from Safety to Security, mixed Criticalities,<br />

Adaptive AUTOSAR & Security<br />

I. GAMECHANGER – CONNECTED CAR<br />

Self-driving vehicles will soon hit the road – the automotive industry is facing rapid changes with countless challenges: handling vast data collections while managing uncompromised security, and real-time decision making combined with new mobility services. OEMs and Tier 1 suppliers are facing shorter design cycles and have to handle requests for a more personalised experience.<br />

By 2021 we expect about 200 million connected cars; about 90% of all cars will have internet access [1]. By 2025, research institutes expect about 470 million connected cars and about 7 million autonomously driving cars (autonomous driving level 4/5, see Figure 1) [2], [3].<br />

Automated driving systems will monitor the driving environment. Autonomous cars above driving automation level 3 (see Figure 1) require high safety certifications.<br />
The current situation, with more than 100 Electronic Control Units (ECUs) in a car, has increased the cyberattack surface tremendously because of their connectivity.<br />

It is obvious that we have to rethink cybersecurity and vehicle safety. By 2020, almost every new car will be connected, putting OEMs' current structures at risk, because the current communication and energy on-board network topology as well as the software architecture are not able to handle future requirements: complexity, security, costs [4].<br />
Figure 1: Levels of Autonomous Driving [13] [14]<br />

Last but not least, the incredibly fast-evolving Artificial Intelligence (AI) is also a compelling reason to think about new approaches, some of which will be presented below.<br />

II. OTHER PERSPECTIVES – LEARNING FROM AVIONICS AND IT SECURITY<br />

The many years of experience in IT security should also be<br />

taken into account in the automotive industry. Thus, the<br />

established and proven technology of firewalls and<br />

cryptography can be used.<br />

But consider “Crypto won’t save you either” by Peter Gutmann [5]:<br />
• It lists a lot of prominent hacks; for none of them was cracking crypto necessary<br />
• All of them targeted the integration<br />

Thus, what we may learn from IT security:<br />

• Security is an integral system property<br />
o Establish end-to-end security<br />




• Security is a process<br />
o Establish easy-to-use, verifiable and secure update procedures<br />

We may observe a lot of similarities between the avionics industry's challenges of recent years and current trends in the automotive industry, such as:<br />

• Tremendous changes for the network-based infrastructure<br />
o Aircraft today are network-based (AFDX & IP)<br />
• Increasing usage of common computing resources<br />
o Integrated Modular Architecture (IMA), Open World<br />
• Open World domain with COTS software<br />
o Wi-Fi products, Linux<br />
• New IT services<br />
o Pilots (tablets), passengers, crew, maintenance<br />
• Increasing integration and information flow between systems<br />
• Aircraft is heavily connected to other IT services, integration of several domains<br />
o Airlines, ATC<br />
• Aircraft is connected to the INTERNET<br />

Concepts and solutions already in use and accepted for aircraft are:<br />

1. Security by design<br />

a. Proper separation and control of<br />

functionalities (freedom of interference, no<br />

error propagation, minimizing the attack<br />

surface)<br />

b. Proper separation and control of information<br />

flows.<br />

c. Proper compositional certification approach.<br />

2. Introduction of “Multiple Independent Levels of<br />

Safety/Security” (MILS) systems [6].<br />

Figure 2: MILS Architectural Approach<br />

MILS is a high-assurance security architecture that supports<br />

the coexistence of untrusted and trusted components, based on<br />

verifiable separation mechanisms and controlled information<br />

flow [6].<br />

More findings, learnings from avionic industry regarding<br />

safety and security certification are discussed in [14].<br />

III. OTHER BENEFITS – SAFETY & SECURITY STANDARDS<br />

ISO 26262 2nd Edition:<br />
a) Potential interaction between safety and security<br />
b) Cybersecurity threats to be analyzed as hazards<br />
c) Monitoring activities for cybersecurity, including incident response tracking<br />
d) Refer also to SAE J3061, ISO/IEC 27001 and ISO/IEC 15480<br />
Common Safety and Security Base:<br />
SAE J3101 – Hardware-Protected Security for Ground Vehicles Applications:<br />
a) Secure boot<br />
b) Secure storage<br />
c) Secure execution environment<br />
d) Other hardware capabilities ...<br />
e) OTA, authentication, detection, recovery mechanisms ...<br />
SAE J3061 – Cybersecurity Guidebook for Cyber-Physical Vehicle Systems:<br />
a) Enumerate all attack surfaces, conduct threat analysis<br />
b) Reduce attack surface<br />
c) Harden hardware and software<br />
d) Perform security testing (penetration, fuzzing, etc.)<br />

The ISO 26262 is already well established as the safety<br />

standard for certification and confirmation purposes within the<br />

automotive industry. This safety standard already refers to<br />

several security standards as shown in the table above.<br />

The SAE J3101 standardization document provides recommendations for security-relevant functions and procedures; the table lists the most important ones.<br />
The SAE J3061 document [7] is a guidebook with guidance on how to secure the software part of an automotive system.<br />

There are organizational and procedural similarities between<br />

Safety – Software Life-Cycle and Security - Software Life-<br />

Cycle. Taking these common efforts into account and taking<br />

Automotive Cybersecurity as part of the vehicle development<br />

life cycle from the very outset, may reduce the effort<br />

enormously.<br />

A MILS Operating System (MILS OS), in other words the MILS approach, is the architectural principle addressing requirements from MILS standards such as development processes, risk modelling, verification & validation, and automotive domain specifics.<br />
For usage as a MILS OS, it is recommended to use a multi-core hypervisor in order to realize the benefits of modern multi-core hardware systems. A further example of a successful multi-core hypervisor implementation can be found in [8].<br />



IV. SUMMARY – BENEFITS FROM THE MILS APPROACH: AUTOMOTIVE EXAMPLES<br />

Driver assistance systems and autonomous driving are currently, alongside electromobility, the most important topics in automotive development. Many "autonomous" or "partially autonomous" systems are already on the road or in final test phases. Complete autonomy, as described by Level 5, is still a big step away [9]. Besides the vehicle-to-vehicle infrastructure, there are on the one hand the technical systems in the car, the focus of this article; on the other hand, legal challenges of responsibility in accidents as well as privacy issues have to be clarified before the series introduction of autonomous systems.<br />

This article presents technical approaches that can<br />

successfully address both safety and automotive cybersecurity<br />

requirements. For questions on data protection as well as legal<br />

and other important ethical questions please refer to [10], [11].<br />

A MILS OS is a cost-efficient and practical base for OEMs and/or Tier 1 suppliers to provide a powerful and modern multi-domain safe & secure automotive platform, which may integrate big data handling, sensor fusion and artificial intelligence algorithms while at the same time minimizing security risks and reducing development efforts.<br />

Figure 3: Control the network traffic by using a security monitor<br />

application and firewall for communication between 3 VMs<br />

Figure 3 presents a real example with separated domains (VMs) as a secure-by-design system. This example shows how the complexity can be managed, as well as the safe & secure integration of other operating systems or 3rd-party components. Besides that, important security functions like secure boot, secure update and over-the-air update of features and firmware can easily be integrated.<br />
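The idea of a security monitor that only forwards explicitly allowed inter-domain traffic can be reduced to a toy Python sketch (domain names and policy rules are invented; a real MILS system enforces this in the separation kernel, not in application code):<br />

```python
# MILS-style information-flow control: traffic between separated domains
# (VMs) passes only if an explicit policy rule allows the flow.
ALLOWED_FLOWS = {
    ("infotainment", "gateway"),
    ("gateway", "dashboard"),
}

def monitor(src, dst, message):
    """Security monitor: forward only explicitly allowed flows."""
    if (src, dst) not in ALLOWED_FLOWS:
        raise PermissionError(f"flow {src} -> {dst} denied")
    return (dst, message)

# An allowed flow passes; an unlisted direct path is blocked.
assert monitor("infotainment", "gateway", "ping") == ("gateway", "ping")
try:
    monitor("infotainment", "dashboard", "bypass")  # no direct path allowed
    blocked = False
except PermissionError:
    blocked = True
assert blocked
```

The design point is a default-deny whitelist: any flow not explicitly permitted never reaches the other domain.<br />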

Furthermore, we may see AUTOSAR Adaptive, the next automotive platform evolution, as an ideal candidate for a MILS operating system. Figure 4 shows an example architecture which combines AUTOSAR Adaptive and other operating systems. A MILS OS may provide the base for an ASIL D AUTOSAR Adaptive system by simply providing a ‘SafePOSIX’ API.<br />

Figure 4: Hypervisor combines Safety AUTOSAR ADAPTIVE and<br />

Linux (Source: Vector)<br />

Take away:<br />

• One multi-domain platform, integrated AI, sensor fusion<br />

and big data handling to create symbiosis between<br />

humans, cars and surroundings.<br />

• Enabling new mobility services<br />

• Secure the car with strict separated and secure domains,<br />

providing safe & secure inter-domain communication<br />

• Maximize data privacy and effective usage and minimize<br />

cyber risks<br />

• Reduce development costs and time to market with<br />

configurable platforms and easy and safe integration of 3rd<br />

party components.<br />

How to create a MILS platform:<br />
• Understand and follow the standards and recommendations<br />
• First, secure the hardware<br />
o Securing the HW is not part of this paper. For more information on how to provide a higher safety level on non-safe HW, please refer to [12]<br />
• Then secure the software<br />
o The system integration concept, i.e. the architecture, is the most important security measure<br />
o Check the following features of your platform: secure boot, secure update over the air, monitoring, assessment, notifications, remediation, a safe & secure SW life-cycle; establish end-to-end security<br />

V. REFERENCES<br />

[1] VDC Research Group, Inc., "Hypervisor & Secure Operating Systems: Safety, Security, and Virtualization Converge in the IoT," 2015.<br />
[2] PwC, "The 2017 PwC Strategy& Digital Auto Report," https://www.strategyand.pwc.com/media/file/2017-Strategyand-Digital-Auto-Report.pdf, September 2017.<br />
[3] VDC Research Group, Inc., "The Global Market for IoT & Embedded Operating Systems; Automotive Drives Revenue, ECUs Drive Developer Mindshare," 2017.<br />
[4] Ernst & Young, "Automotive Cybersecurity," http://www.ey.com/gl/en/industries/automotive, 2016.<br />
[5] P. Gutmann, "Crypto won't save you either," talk at Linux.conf.au 2015, Auckland, New Zealand, 2015. [Online]. Available: https://www.youtube.com/watch?v=_ahcUuNO4so.<br />
[6] H. Blasum, S. Tverdyshev, B. Langenstein, J. Maebe, B. De Sutter, B. Leconte, B. Triquet, K. Müller, A. Söding-Freiherr von Blomberg and A. Tillequin, "MILS Architecture, Whitepaper," in EURO-MILS: Secure European Virtualisation for Trustworthy Applications in Critical Domains, www.euromils.eu, 2014.<br />
[7] SAE International, SAE J3061 – Cybersecurity Guidebook for Cyber-Physical Vehicle Systems, January 2016.<br />
[8] S. Nordhoff, "How hypervisor operating systems can cope with multi-core certification challenges," in Aviation Electronics Europe, Munich, 2016.<br />
[9] F. Walkembach and C. Berg, "Eine für alle; Einheitliche Plattform für alle Autofunktionen," Automobile Elektronik, pp. 22-24, 11-12/2016.<br />
[10] Bundesregierung, "Strassenverkehrsgesetz, Automatisiertes Fahren auf dem Weg," 2017. [Online]. Available: https://www.bundesregierung.de/Content/DE/Artikel/2017/01/2017-01-25-automatisiertes-fahren.html.<br />
[11] The National Academies Press, "A Look at the Legal Environment for Driverless Vehicles," https://www.nap.edu/download/23453, 2016.<br />
[12] M. Özer, "Safety-Architektur für Plattformen mit komplexer Hardware; SIL-4 trotz unsichere Hardware," in Tagungsband Embedded Software Engineering Kongress 2017, Sindelfingen, www.ese-kongress.de, 2017.<br />
[13] SAE International, "Automated Driving – Levels of Driving Automation Are Defined in New SAE International Standard J3016," www.sae.org/misc/pdfs/automated_driving.pdf.<br />
[14] S. Le Merdy, SYSGO AG, "Avionics Application: Security for Safety in PikeOS," https://www.sysgo.com/services/knowledgecenter/whitepapers, 2017.<br />



Building Modern Industrial Applications with Open<br />

Standards and Open-source Software<br />

Frank Meerkötter (Author)<br />

Development Lead<br />

basysKom GmbH<br />

Darmstadt, Germany<br />

This paper offers arguments for building industrial HMIs with<br />

open standards and open-source software by showcasing a<br />

solution built on Qt, Linux and OPC-UA.<br />

OPC UA, Qt OpcUa, HMI, Qt, Qt Quick, Embedded Linux,<br />

Yocto, Open Source, FOSS<br />

I. INTRODUCTION<br />

Traditionally, HMIs for industrial automation are built<br />

using proprietary tools, components and interfaces. In the worst<br />

case, a solution of this kind is created with a proprietary tool,<br />

requiring a proprietary runtime and a proprietary<br />

communication interface, both often only available on<br />

Windows.<br />

This paper offers arguments for building industrial HMIs<br />

with open standards and open-source software by showcasing a<br />

solution built on Qt, Linux and OPC-UA. It will compare such<br />

a solution with a traditional approach. It will also discuss the<br />

advantages and disadvantages of both, taking into account<br />

different kinds of scenarios and applications, as well as our<br />

experience in the field. The showcase reflects what we found in<br />

our customer projects.<br />

A. Target Scope<br />

There are two kinds of cases that one needs to differentiate<br />

when talking about industrial applications or HMIs.<br />

Case one are plant manufacturers or industrial<br />

integrators that need to provide an HMI for the<br />

machinery inside a specific production line or even a<br />

complete plant. The given combination of machines and<br />

their setup is individual for most installations. The use<br />

cases such an HMI needs to fulfill are typically well<br />

defined and properly addressed by traditional industrial<br />

HMI software. The amount of budget that can be spent<br />

on HMI customization or application development is<br />

typically limited, as the resulting software is a one-off<br />

solution. This type of HMIs/industrial applications is<br />

well served by the "configuration, not programming"<br />

approach of traditional HMI software.<br />

Case two are machine manufacturers with machines<br />

produced in medium to large series. HMIs for these<br />

kinds of machines are also often done with traditional<br />

HMI software (at least as long as the application falls<br />

into the "comfort zone" of such tools). HMI software of<br />

this kind is not a one-off development and also an<br />

important point of differentiation for the manufacturer.<br />

This means more effort can be spent and it can make<br />

sense to look outside the world of traditional HMI<br />

software.<br />

This paper will focus on the second case.<br />

II. TRADITIONAL INDUSTRIAL HMI SOFTWARE<br />

What is "traditional HMI Software"? There is a large<br />

number of products, so we can answer this question only for<br />

the typical case which looks like this:<br />

Industrial HMI software consists of a graphical editor and a<br />

run time. The editor is used on a development machine to<br />

create the screens of the HMI inside a graphical composer and<br />

to implement the UI logic. It provides a library equipped with<br />

often needed graphical widgets. In addition, it often provides<br />

wizards that guide the creation of frequently needed<br />

components. It also contains pre-built blocks of typical functionality such as alarm management, recipe management, access to historical data and reporting. Most of the time it has<br />

a way to discover and import machine interfaces (symbols,<br />

variables, addresses). While it is possible to customize the UI logic with simple scripts, the focus is on configuration, not on<br />

software development. The runtime is used to execute the HMI<br />

that has been created with the editor. A product might be able<br />

to produce output for several different runtimes.<br />

A. Advantages of Industrial HMI Software<br />
• No deep software development skills are needed.<br />
• Many prepackaged components and existing application-specific functionality.<br />
• Ability to import machine interfaces to work with, either from a live machine or through several file-based exchange mechanisms.<br />
• Support through the tool vendor.<br />
• Ability to get results quickly.<br />

B. Disadvantages of Industrial HMI Software<br />
• It can be hard to create high-quality HMIs.<br />
• It can be hard to extend an HMI as soon as one leaves the "comfort zone" of a given tool or the application surpasses a certain size.<br />
• Availability of runtimes. Older solutions often only provide a Windows runtime, while more modern solutions have become more flexible, also providing runtimes for Android/iOS or the web browser. Still, the cooperation of the given vendor is needed to get a runtime for a specific hardware/OS combination.<br />
• Vendor lock-in. The given HMI is developed with a specific product of a given vendor. The resulting implementation cannot be ported easily to another vendor.<br />
• License fees (Windows, communication driver, HMI software and runtimes).<br />
• Version control is often lacking. Examples include binary project files or XML-based formats which are often also hard to handle reasonably in version control.<br />

III. MODERN HMI SOFTWARE DEVELOPMENT<br />

The following section describes an approach based on open<br />

standards and open source software to build a machine HMI. It<br />

is most suitable for scenarios where the HMI is not a one-off<br />

solution, there are high demands for HMI quality and/or the<br />

application will become complex/large. One of the strengths of<br />

this approach is its flexibility and openness - it becomes<br />

possible to switch out the hardware, the OS and other<br />

components.<br />

The HMI this essay refers to is built with Qt & Qt Quick,<br />

running on an ARM SBC with an OpenEmbedded/Yocto-based Linux as operating system. It is using OPC-UA via Qt<br />

OpcUa and open62541 to communicate with its PLC. A<br />

slightly modified version of this stack could also be used with<br />

an X86 industrial PC running Windows or an Android tablet.<br />

A. Qt and Qt Quick<br />

<br />

Qt is an open source C++ framework delivering the<br />

building blocks for cross-platform HMI and<br />

application development. Within Qt there is Qt Quick,<br />

a technology geared towards rapidly building modern,<br />

animated, smartphone-like HMIs. A Qt Quick<br />

application is typically structured in two parts: an<br />

application backend written in C++, which contains the<br />

business logic and a frontend which is a pure UI<br />

written in QML. QML is a JSON-like language used to describe the HMI declaratively (as opposed to<br />

programming it imperatively). Qt ships numerous cross<br />

platform modules for tasks such as network<br />

communication, database access, printing or XML and<br />

JSON processing. Specifically for industrial applications, it supports e.g. CAN adapters, Modbus, serial ports and OPC-UA. Qt is available<br />

under a dual licensing scheme, either as a commercial<br />

product from The Qt Company or under the (L)GPL. Qt takes<br />

API and ABI stability very seriously. Strict<br />

compatibility is kept within a major release series.<br />

Historically this means ~7-8 years.<br />

Qt is accompanied by its own integrated development environment, Qt Creator. HMIs based on QML are either programmed by hand or created via the Qt Quick Designer, which provides a graphical editor.<br />
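The backend/frontend split described above can be illustrated with a small, Qt-free sketch. In real Qt code this role is played by `QObject` with `Q_PROPERTY` and signals, and the frontend is QML; the `Property<T>` template and `MachineBackend` below are hypothetical names used only to show the pattern of a C++ backend notifying a declaratively bound UI.<br />

```cpp
#include <functional>
#include <utility>
#include <vector>

// Hypothetical, Qt-free sketch of the backend/frontend split: the C++
// backend owns the state and emits change notifications; the (QML)
// frontend merely binds to it and reacts.
template <typename T>
class Property {
public:
    explicit Property(T initial) : value_(std::move(initial)) {}

    const T& get() const { return value_; }

    void set(T v) {
        if (v == value_) return;                   // no-op writes emit nothing
        value_ = std::move(v);
        for (auto& cb : subscribers_) cb(value_);  // notify bound views
    }

    // A QML binding would register itself here.
    void onChanged(std::function<void(const T&)> cb) {
        subscribers_.push_back(std::move(cb));
    }

private:
    T value_;
    std::vector<std::function<void(const T&)>> subscribers_;
};

// Example backend object: a spindle speed exposed to the HMI.
struct MachineBackend {
    Property<int> spindleRpm{0};
};
```

A frontend label subscribed via `onChanged` updates automatically whenever the business logic writes a new value, which is exactly the division of labour the QML/C++ structure aims for.<br />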

B. OPC-UA<br />

OPC-UA is a communication standard for industrial<br />

applications. It is standardized by the OPC Foundation and also published as an international standard by the IEC as IEC 62541.<br />

OPC-UA is the successor to the old OPC standard (now<br />

dubbed OPC-classic). OPC-UA is, unlike OPC-classic,<br />

platform independent.<br />

Qt OpcUa is developed by basysKom. It will be a standard<br />

module of Qt, starting with Qt 5.11 which will be available<br />

mid-2018. It provides an easy to use, Qt-ish API for OPC-UA<br />

clients. It does not implement its own stack, but rather wraps<br />

existing stacks - one of these is open62541.<br />

open62541 is an open source project which implements a<br />

portable OPC-UA stack in C. The source is licensed under<br />

MPL2. It provides functionality for server and client-side<br />

development.<br />

C. Embedded-Linux and Yocto<br />

OpenEmbedded and Yocto have emerged as the standard tooling for creating custom Linux firmware images. They allow the creation of a range of systems, from desktop-like to single-purpose. Their modular approach separates BSP-specific parts from application-specific parts, making it easy to switch out the underlying hardware.<br />
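The BSP/application separation can be sketched with a hypothetical `local.conf` fragment; the machine and layer names below are illustrative placeholders, not taken from the paper:<br />

```conf
# build/conf/local.conf (illustrative sketch)
# Switching the underlying hardware is essentially a one-line change:
# the BSP layer supplies the machine definition, while the
# application-specific layers stay untouched.
MACHINE = "raspberrypi3"        # or e.g. "imx6ullevk", "qemux86-64"

# Application-specific parts live in their own layer (e.g. meta-myhmi),
# which adds Qt and the HMI application to the image.
IMAGE_INSTALL_append = " myhmi qtbase qtdeclarative"
```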

D. Advantages of this approach<br />

• Allows building high-quality HMIs (animated, fluid, smartphone-like).<br />
• Scalable across machine variants as well as application size/complexity.<br />
• Flexibility and freedom to implement individual requirements.<br />
• Cross-platform.<br />
• No vendor lock-in, as components can be replaced throughout the stack. Examples include choosing a different PLC vendor, replacing the open62541 stack with a commercial offering, or replacing the QML UI with a web-based solution by placing a REST/WebSocket server on top of the existing application backend.<br />
• Opportunity to significantly reduce license fees.<br />

764


• Enables the use of cheap ARM SBCs (as opposed to full industrial PCs).<br />

E. Disadvantages of this approach<br />

• Requires the skill (and will) to perform actual software development.<br />
• Does not scale for one-off scenarios.<br />
• Less guidance from a tool.<br />
• Less pre-packaged and pre-arranged industry-specific functionality.<br />

IV. CONCLUSION<br />
Our experience shows that it is beneficial to work with open standards and open source software to build HMIs and applications for machines. This approach really shines when the application is individual or complex and has high requirements on HMI quality. It becomes possible to add new features and functionality without being restricted by a given HMI tool. It also becomes easier to scale an application, either across machine variants or in delivered features.<br />
The open nature of this approach allows an evolution of the application stack, or the integration of new machine interfaces, without being strictly tied to the product lifecycle of a specific tool vendor. The cross-platform nature of the presented application stack gives an example of how to future-proof an investment against changes in hardware or OS availability, and opens opportunities to reduce software license costs.<br />

www.embedded-world.eu<br />

765


User Experience as an<br />

Industry 4.0 Innovation Driver<br />

David C. Thömmes, B.Sc, CEO Shapefield<br />

Senior Software & UX Engineer<br />

Microsoft MVP "Windows Development"<br />

Shapefield UG<br />

D-66115 Saarbrücken, Germany<br />

www.shapefield.de<br />

Abstract—Apparently everyone has heard of user experience<br />

design and usability, but only very few manufacturers seem to<br />

develop software that is focused on the user. If you look at the<br />

user interfaces of current products, it's obvious that there's a<br />

massive catching up to do. Today, a positive user experience (UX)<br />

is becoming more and more a success factor and a serious buying<br />

criterion for many companies. In the smartphone era, users are<br />

accustomed to user-friendly and well-usable interfaces. New<br />

technical achievements, such as HoloLens, are driving expectations. In his presentation, David C. Thömmes gives you an<br />

exciting insight into a user-centered design process, the current<br />

state of the market and current technology trends. The most<br />

important phases, terms and UX methods are presented with<br />

illustrative examples. Be inspired and get new impulses!<br />

Keywords— UX design, UI design, GUI design, user experience<br />

design, user interface design, graphical user interface, interaction<br />

design, UI development, UI engineering<br />

I. INTRODUCTION<br />

Fig. 1 shows a typical 2D user interface, perhaps reminding one manufacturer or another in an industrial context of their own creations. The shown user interface undoubtedly offers a<br />

button for every function and convinces with pleasant<br />

aesthetics. The user is immediately aware of the different<br />

functions and the learnability of the interface can only be<br />

referred to as good. Obviously, the previous statements were<br />

sarcasm. In this situation, brave manufacturers eventually tend<br />

to reprogram the existing user interface. Modern UI<br />

frameworks such as WPF with XAML, Qt with QML or<br />

HTML5 with AngularJS are often used for this. Taking a closer<br />

look at the reprogrammed interface, often no improvement can<br />

be recognized. Poor operating concepts are adopted without<br />

reflection, and so the opportunity for a proper redesign of the<br />

user interface is lost. New UI frameworks do not automatically<br />

lead to an attractive and well-usable interface and a positive<br />

user experience. A good user interface is the result of an<br />

interdisciplinary design process that consciously puts the user<br />

at the center of the design. But what exactly does user<br />

experience mean?<br />

II. USER EXPERIENCE<br />

User experience describes the sum of all the experiences a<br />

user collects with a digital product [1]. This includes the<br />

entirety of all possible points of contact such as advertising,<br />

websites, ordering processes, product design or installation.<br />

The user experience is not limited to the actual period of using the product; the time before and after the usage also gains in importance. Since user experience should be<br />

understood as a holistic approach, the term UX design reflects<br />

an interdisciplinary conglomeration of different disciplines.<br />

The pure UI design is only a partial discipline in addition to<br />

important core disciplines such as interaction design, product<br />

design and usability engineering. Optimally, the user<br />

experience should be stimulated on all different levels. Every<br />

point of contact with the product should be designed with the<br />

same quality and dedication for a positive user experience.<br />

Reduced to the aspects that are relevant during the usage of a<br />

product, the term user experience reveals new facets, such as<br />

usability.<br />

Fig. 1: Typical user interface from industrial sector<br />


766


III. USABILITY<br />

First of all, usability is a part of the standard EN ISO 9241,<br />

which describes guidelines for human-computer interaction. In<br />

the section 9241-11, usability is defined as: "the extent to<br />

which a product can be used by certain users in a particular<br />

context of use to achieve specific goals effectively, efficiently<br />

and satisfactorily" [2][3]. This means usability depends on which users use the product, in which work environment, and which tasks are solved. Here, the factors effectiveness,<br />

efficiency and satisfaction can be considered.<br />

Effectiveness<br />

Effectiveness describes how effectively a task can be<br />

handled. For example, is the user able to configure the machine<br />

to his needs?<br />

Efficiency<br />

The factor of efficiency expresses the temporal, economic<br />

and cognitive costs involved in achieving the goal. How long<br />

does it take the user to find an alarm in the system? How many<br />

clicks are needed for this process? How exhausting was the<br />

search for the user?<br />

Satisfaction<br />

Satisfaction is subjective and arises when the expectations<br />

of a product or system are exceeded. Positive emotions,<br />

feelings of happiness and aesthetics play a decisive role here.<br />

IV. DESIGN PROCESS<br />

One possible process for ensuring good usability is known<br />

as user-centered design (UCD). It describes a highly iterative<br />

design process that focuses on the user's needs as the<br />

foundation for design. The fundamental idea is that, at the<br />

beginning, as much information as possible is collected about<br />

the different user groups. Based on the information gathered, a<br />

design phase follows in which hypotheses are prepared as<br />

drafts, concepts, screens, etc. Subsequently, the products of the<br />

design phase are evaluated by various empirical or analytical<br />

methods. It is reviewed whether the designed hypotheses<br />

actually work for the user and what degree of usability has<br />

been achieved. Potential problems are detected by this<br />

procedure and, if necessary, corrected by returning to a<br />

previous phase. Through the iterative alternation of the<br />

different phases, the development of the products is an integral<br />

part of the process. Step by step, an approximation to the<br />

optimal result is achieved. Depending on the company and<br />

project different interpretations, phases and methods are<br />

applied. Fig.2 shows a possible UCD variant.<br />

V. USER ANALYSIS<br />

The ultimate goal of user analysis is to get to know the user<br />

and his needs and to prepare the results with corresponding<br />

documentation methods. The main focus is on work processes,<br />

working environment and contextual general conditions. Figure<br />

3 shows an engineer working with a complex CAD program.<br />

By carrying out a context analysis, for example, the working<br />

environment of the user can be understood almost unadulterated. A context analysis is a combination of<br />

observation and subsequent questioning. For one day, the UX<br />

designer becomes a shadow for the user and accompanies him<br />

at his everyday work.<br />

VI. DESIGN<br />

During the design phase, the information and results from<br />

the analysis phase are transformed into creative solutions. This<br />

phase is subdivided into the development of conceptual and<br />

visual design. The conceptual design of a user interface<br />

documents corresponding design decisions regarding the<br />

navigation structure, information architecture, interaction<br />

paradigms, controls and layouts. For this purpose, individual<br />

screens are often visualized as wireframes. Important areas of<br />

the user interface such as the alarm system, help or the<br />

displayed status are designed and arranged. In this step, the<br />

concrete visual design is less relevant, since with wireframes it<br />

is possible to collect reliable user feedback in an early state of<br />

the project. Completely formed screens with colors or effects<br />

could distract from the actual concept and distort the<br />

impression. Fig.4 shows a conceptual design of an engine<br />

control. The conceptual design is followed by the visual<br />

design. Shapes, colors, fonts, icons, effects, proportions and<br />

arrangements can have a significant influence on the perception<br />

and value of a user interface. As part of the visual design, these<br />

attributes are arranged in a well-defined composition and by<br />

this, the user interface gets its appearance. This is where the<br />

important first impression comes from, long before usability or<br />

functionality play a role.<br />

Fig. 2: User centered design<br />

Fig. 3: Engineer working with a CAD software<br />

767


Fig. 4: Conceptual design for motor control<br />

VII. EVALUATION<br />

Without appropriate evaluation, the results of the design<br />

phase are always just hypotheses. Interactive prototypes allow<br />

an evaluation of these results, for example as a part of a<br />

usability test. For this purpose, the existing static screens are<br />

implemented as interactive software fragments and real users<br />

are confronted with the prototype. Recruited users receive<br />

concrete tasks and are observed during the use of the product.<br />

Classically, a usability test is performed in a usability lab.<br />

While a user is in the so-called user's room, the UX designer watches the events from a second room. For support and documentation, video, screen and audio signals are transmitted from the user's room. As a cost-effective alternative to the<br />

classic usability test, the method of a focus group becomes<br />

more and more popular these days. A focus group is a<br />

moderated group discussion with relevant users. Usually in the<br />

course of one day several design hypotheses are discussed<br />

openly in the group and presented with the help of wireframes,<br />

screens and interactive prototypes. By this, users get the<br />

opportunity to try out new operating concepts live and to share<br />

their experiences directly with other users. The momentum of<br />

the group quickly creates user feedback and corresponding<br />

problems, concerns and comments can be discussed<br />

transparently.<br />

VIII. SPECIFICATION AND IMPLEMENTATION<br />

After the design has been evaluated by appropriate<br />

methods, the project has to be processed and documented for<br />

the development. Typically, a style guide is written for this<br />

purpose. It contains basic design resources such as colors,<br />

fonts, and control specifications as well as guidelines for using<br />

these controls and general usability information. Style guide<br />

documents easily become very extensive. Correspondingly, the production costs are high and, at the same time, the document is hard to consume. In order to make a faster leap into<br />

development, lightweight specifications are becoming more<br />

and more prevalent. They are called design manuals. Usually,<br />

they only contain the essential interfaces between design and<br />

development and are deliberately reduced to the essentials. Fig.<br />

5 shows a button with dimensions.<br />

The actual design process ends with the completion of the<br />

specification. But this only covers half of the project. After<br />

this, the technical implementation of the user interface is<br />

usually carried out simultaneously with the overarching<br />

development project. Every pixel and every distance is<br />

essential. Depending on the scope and complexity of the<br />

design, there are interesting challenges for the role of the UI<br />

Engineer. Especially with modern UI frameworks attractive<br />

and rich user interfaces can be realized with a reasonable effort<br />

these days. For example, WPF offers incredible possibilities<br />


768


Fig. 5: Button dimensions<br />

with styles, data templates and control templates [4]! Qt has<br />

also improved through QtQuick and QML [5], which opens up<br />

new perspectives for the technical realization.<br />

IX. CONCLUSION<br />

From the user's point of view, the user interface is the face of<br />

the application. It does not matter if the application is a<br />

machine control or a complex ERP system. The user likes it if<br />

it's easy to use and nice to look at. But a positive user<br />

experience is no coincidence, but the result of a solid design<br />

process and a skillful technical implementation. In addition, the<br />

world of users is now undergoing massive change.<br />

Digitalization and Industry 4.0 are the new triggers. Almost<br />

monthly, new devices are released with innovative interaction<br />

paradigms, such as the Leap Motion or the Apple Watch.<br />

Additionally, there is the trend of artificial intelligence paired<br />

with language interfaces such as Google Home or Amazon<br />

Alexa. Innovations already exist, but for many<br />

manufacturers, the development of a contemporary 2D user<br />

interface would be an advance. It's time for a change.<br />

AUTHOR<br />

David C. Thömmes (B.Sc.) studied media informatics at the<br />

University of Applied Sciences Kaiserslautern and discovered<br />

his passion for human-computer interaction and software<br />

engineering. David developed his first user interfaces with VBA and Delphi in 2004. As a Senior Software & UX Engineer<br />

as well as Managing Director of Shapefield, his passion is<br />

today the user-centered design and the technical development<br />

of impressive user interfaces. Prior to that, he was responsible<br />

for the development department of the renowned UX service<br />

provider Ergosign for almost 5 years in the role of Senior<br />

Software Engineer & Field Lead "Software Engineering<br />

Standards". At the beginning of 2015, David left Ergosign at<br />

his own request and laid the foundation for Shapefield a few<br />

months later. By working on various projects with different<br />

technologies, he has a profound knowledge in the development<br />

of desktop, web, embedded and mobile applications.<br />

Technically his heart beats for XAML, QML, C#, C ++ and<br />

PHP. For his achievements, David was honored with the<br />

Microsoft MVP Award 2016 and 2017.<br />

REFERENCES<br />

1. https://www.nngroup.com/articles/definition-user-experience<br />

2. https://de.wikipedia.org/wiki/Gebrauchstauglichkeit_(Produkt)<br />

3. https://de.wikipedia.org/wiki/EN_ISO_9241<br />

4. https://docs.microsoft.com/en-us/dotnet/framework/wpf/controls/styling-and-templating<br />

5. https://www.qt.io<br />

769


Real-Time Holographic Solution for True 3D-Display<br />

A. Kaczorowski, S.J. Senanayake, R. Pechhacker, T. Durrant, M. Kaminski and D. F. Milne<br />

VividQ Research and Development Division<br />

Cambridge, UK, CB3 0AX<br />

darran.milne@vivid-q.com<br />

Abstract— Holographic display technology has been a topic<br />

of intense academic research for some time but has only<br />

recently seen significant commercial interest. The uptake has<br />

been hindered by the complexity of computation and sheer<br />

volume of the resulting holographic data, meaning it takes up to several minutes to compute even a single frame of holographic video, rendering it largely useless for anything but<br />

static displays. These issues have slowed the development of<br />

true holographic displays. In response, several easier-to-achieve, yet incomplete, 3D-like technologies have arisen to<br />

fill the gap in the market. These alternatives, such as 3D<br />

glasses, head-mounted-displays or concert projections are<br />

partial solutions to the 3D problem, but are intrinsically<br />

limited in the content they can display and the level of realism<br />

they can achieve.<br />

Here we present VividQ's Holographic Solutions, a<br />

software package containing a set of proprietary state-of-the-art algorithms that compute holograms in milliseconds on<br />

standard computing hardware. This allows three-dimensional<br />

holographic images to be generated in real-time. Now users<br />

can view and interact with moving holograms, have<br />

holographic video calls and play fully immersive holographic mixed-reality games. VividQ's Solutions are a vital component for Industry 4.0, enabling IoT with 3D holographic imaging.<br />

The software architecture is built around interoperability with<br />

leading head-mounted-display and head-up-display<br />

manufacturers as well as universal APIs for CAD, 3D gaming<br />

engines and Windows based engineering software. In this<br />

way, VividQ software will become the new benchmark in 3D<br />

enhanced worker/system interaction with unrivalled 3D<br />

imaging, representation and interactivity.<br />

Keywords— Digital Holography, GPU, Augmented Reality,<br />

Mixed Reality, Optical Systems, Display Technology<br />

I. INTRODUCTION<br />

Owing to the recent increased interest in 3D display,<br />

multiple technologies have emerged to deliver a convincing<br />

3D experience [1-4]. These largely rely on multi-view or<br />

stereoscopic representations designed to "trick" the eye into<br />

providing correct depth cues to make the projections appear<br />

three-dimensional. However, these depth cues are often<br />

limited and in some cases can cause accommodation-vergence<br />

conflicts leading to nausea and headaches for the user.<br />

Holographic display, on the other hand, aims to precisely<br />

recreate the wave-front created from a 3D object or scene,<br />

creating a true 3D image of the input scene with all the correct<br />

depth cues intact. This makes holographic display an ideal<br />

candidate for augmented/mixed reality applications, as it<br />

provides 3D virtual objects that appear in focus with their<br />

surroundings.<br />

With advances in 3D sensors together with dramatic<br />

increases in computational power, Digital Holography (DH)<br />

has become a topic of particular interest. In DH, holograms<br />

are calculated from point cloud objects, extracted from 3D<br />

data sources such as 3D cameras, game engines or 3D design<br />

tools, by simulating optical wave propagation [5][6][7]. This<br />

simulation can take multiple forms depending on the desired<br />

quality of the recovered image and whether the holograms are<br />

chosen to be in the far or near field. The resulting hologram<br />

may then be loaded onto a suitable digital display device with<br />

associated optical set-up for viewing.<br />

A conceptually simple approach for hologram generation is<br />

the ray-tracing method [8][9][10], in which the paths from<br />

each point on the object to each hologram pixel are computed<br />

and aggregated to produce the hologram representation. While<br />

the ray-tracing method is physically intuitive, it is highly<br />

computationally expensive. To address this issue, many<br />

modifications [11-14] and alternative solutions such as the polygon [14-18] and image-based methods [19][20] have been proposed. In this paper, we describe a real-time holographic display system that uses a different algorithmic approach<br />

based on a Layered Fourier Transform (LFT) scheme [21][22].<br />

We demonstrate how data may be extracted from a 3D data source, in this case the Unity engine, and streamed to a holographic<br />

generation engine, containing the LFT algorithms. The LFT<br />

algorithms are highly parallelized and optimized to run via<br />

CUDA kernels on NVidia Graphics Processing Units (GPUs).<br />

The resulting holograms are then output to a suitable display<br />

device, in this case a Ferroelectric LCoS Spatial Light<br />

Modulator (FLCOS SLM).<br />


770


In the following we describe the various components of the<br />

real-time holographic display architecture. In section II, we<br />

discuss the streaming of game data from the Unity engine and<br />

the standardization of Unity data to a generic 3D format. In<br />

Section III, we present the real-time Hologram Generation<br />

Engine (HGE) before going into the display setup and driver<br />

in Section IV. We discuss results and future work in Section<br />

V.<br />

II. DATA CAPTURE AND STANDARDIZATION<br />

To stream data to the Hologram Generation Engine, we<br />

must first extract 3D data from a suitable source. In this case,<br />

we choose the Unity engine. Unity is a well-known gaming<br />

platform that may be used to create entire 3D scenes using<br />

pre-rendered assets.<br />

A. 3D Data Streaming from Unity<br />

Key to the performance of the real-time process is that 3D<br />

content rendered within Unity is passed to the holographic<br />

generation engine without having to copy memory from the<br />

CPU to GPU and back again. The process is summarized in<br />

Fig.1. Unity uses a concept of Shaders to create objects<br />

known as Textures that describe virtual scenes. The standard<br />

Unity Shaders create Textures as colour maps that specify RGB colours in a 2D grid. While this is suitable for rendering<br />

to a standard 2D display, such as a monitor or stereoscopic<br />

device, this is insufficient to capture depth information about a<br />

scene as required for 3D Holography. Instead, a custom<br />

Shader was implemented that renders a colour map with depth<br />

(Z) to create a four-channel Colour-Depth-Map (CDM) with<br />

channels RGBZ. Each CDM is rendered at the resolution of the Spatial Light Modulator (in this case the rendering is actually performed at half the SLM resolution, due to the binary nature of the device giving rise to twin images). Unity renders<br />

into the CDM texture within an OpenGL context. This allows<br />

it to pass the CDM texture object directly to the HGE. Within<br />

the HGE, the CUDA-OpenGL-Interop library is utilized to<br />

make the data available to the HGE’s custom kernel functions,<br />

contained in a set of C++ DLLs. This way, Unity is able to<br />

render the 3D scene and the information is passed straight to<br />

the hologram algorithms without multiple memory copies<br />

between the CPU and GPU. In this sense, the OpenGL context<br />

acts as a translator between the two steps, allowing us to pass<br />

a pointer to the texture directly to the DLLs holding the HGE<br />

algorithms. While this implementation is based on OpenGL,<br />

one could consider alternative approaches using Direct3D or<br />

Vulkan. Direct3D is widely used in the game industry and<br />

represents a natural next step in the evolution of the streaming<br />

solution as it contains libraries similar to the CUDA-OpenGL-<br />

Interop. For Vulkan there is currently no such support, but it is<br />

likely that there will be in the near future.<br />
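The four-channel Colour-Depth-Map layout described above can be sketched on the CPU. This is an illustrative sketch only (the real CDM is filled by a custom GPU shader at SLM resolution); the `CDM` struct and its members are hypothetical names, not VividQ's actual API.<br />

```cpp
#include <cstddef>
#include <vector>

// Illustrative sketch of a four-channel Colour-Depth-Map: interleaved
// R, G, B, Z floats per pixel, row-major. In the real pipeline a custom
// Unity shader writes this texture directly on the GPU.
struct CDM {
    std::size_t width  = 0;
    std::size_t height = 0;
    std::vector<float> data;  // width * height * 4 floats, R,G,B,Z order

    CDM(std::size_t w, std::size_t h)
        : width(w), height(h), data(w * h * 4, 0.0f) {}

    // Write one pixel: colour channels in [0,1], z = scene depth.
    void set(std::size_t x, std::size_t y,
             float r, float g, float b, float z) {
        float* p = &data[(y * width + x) * 4];
        p[0] = r; p[1] = g; p[2] = b; p[3] = z;
    }

    float depth(std::size_t x, std::size_t y) const {
        return data[(y * width + x) * 4 + 3];
    }
};
```

The depth channel is what distinguishes the CDM from an ordinary RGB render target: it is exactly the per-pixel z that the layer-based hologram algorithm later discretizes.<br />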

Fig. 1. Unity Streaming Process: Unity renders a 3D<br />

scene, the shader creates a custom texture which is passed to<br />

an array on the GPU (cudaArray) where the hologram will be<br />

calculated.<br />

B. Data Standardization<br />

While Unity is a useful tool for gaming applications and<br />

technology demonstrations, for Holography to be available for<br />

more general applications, one should define a process to<br />

stream and work on data from arbitrary 3D data sources.<br />

Three-dimensional data is present in many forms across<br />

multiple software and hardware platforms. To compute<br />

holograms, fundamentally we require data in a point-cloud<br />

format. A point cloud can be thought of simply as a list of 3D<br />

coordinates, specifying the geometry of the object, along with<br />

a set of attributes of the cloud e.g. colours (the CDM texture<br />

from Unity can be thought of as a flattened point cloud with each<br />

point of the grid specifying (x,y)-coordinates and the depth<br />

channel providing the z). While point clouds are a common<br />

and intuitive data type, so far no standard point cloud format<br />

has emerged that is compatible with the majority of 3D source<br />

data systems. To overcome this issue in holographic<br />

applications, and allow arbitrary sources to be streamed to the<br />

HGE, we present a new Point Cloud class structure that<br />

incorporates the essential features of 3D data required for<br />

holographic computation.<br />

C. Point Cloud Class<br />


The point cloud class, PointCloud, provides a common<br />

framework for data passing through the real-time holographic<br />

display system. This allows 3D data to be passed around in<br />

memory rather than in file format for fast processing.<br />

PointCloud is an abstract base class that allows derivative<br />

classes to specify specific point cloud representations. In the<br />

holographic generation case, we are interested in two<br />

771


particular types of point cloud representation: 3D and 2.5D<br />

point clouds. The 3D case refers to a PC that contains<br />

complete geometric information of a given object while the<br />

2.5D case occurs when using a PC viewed from a particular<br />

perspective. In this case (assuming the object is not<br />

transparent), one may neglect points that are occluded.<br />

The base class and inheritance structure of PointCloud is<br />

designed to be generic and easily extensible so one may define<br />

further derivative classes for higher dimensional PCs or PCs<br />

with attributes specific to the chosen application or data<br />

source. The base class contains generic file reading and<br />

writing methods but there is no embedded algorithmic<br />

functionality. Instead, all parts of the holographic system<br />

architecture may accept an instance of these types and run<br />

algorithms using the data contained in them.<br />
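The base-class-plus-derivatives structure described above can be sketched as follows. The actual VividQ interface is not public, so all class and member names here are hypothetical; the sketch only shows an abstract geometry interface with a full-3D derivative and a 2.5D derivative that stores one visible depth per (x, y) grid cell, matching the CDM layout.<br />

```cpp
#include <array>
#include <cstddef>
#include <vector>

// Hypothetical sketch of the PointCloud hierarchy: an abstract base
// class exposing geometry, with representation-specific derivatives.
class PointCloud {
public:
    virtual ~PointCloud() = default;
    virtual std::size_t size() const = 0;                      // point count
    virtual std::array<float, 3> point(std::size_t i) const = 0;
};

// Full 3D representation: complete geometric information of the object.
class PointCloud3D : public PointCloud {
public:
    void add(float x, float y, float z) { pts_.push_back({x, y, z}); }
    std::size_t size() const override { return pts_.size(); }
    std::array<float, 3> point(std::size_t i) const override { return pts_[i]; }
private:
    std::vector<std::array<float, 3>> pts_;
};

// 2.5D representation: a regular (x, y) grid with a single visible
// depth per cell, i.e. occluded points are already discarded.
class PointCloud25D : public PointCloud {
public:
    PointCloud25D(std::size_t w, std::size_t h)
        : w_(w), depth_(w * h, 0.0f) {}
    void setDepth(std::size_t x, std::size_t y, float z) { depth_[y * w_ + x] = z; }
    std::size_t size() const override { return depth_.size(); }
    std::array<float, 3> point(std::size_t i) const override {
        return {float(i % w_), float(i / w_), depth_[i]};
    }
private:
    std::size_t w_;
    std::vector<float> depth_;
};
```

Because every stage of the pipeline accepts the abstract `PointCloud`, a new data source only needs a new derivative class, not changes to the hologram algorithms.<br />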

With data streamed via the Unity process or through the<br />

generic point cloud class we may now compute the<br />

holographic representation of the data to be displayed on the<br />

SLM for viewing. In the next section we discuss the theory<br />

behind the hologram generation process, outline the algorithm<br />

in the HGE and describe the expected outputs.<br />

III. REAL-TIME HOLOGRAM GENERATION<br />

A physically intuitive generation method for the calculation<br />

of digital holograms is a direct simulation of the physical<br />

holographic recording process. In this model, objects are<br />

represented by a point cloud where points in the cloud are<br />

assumed to emit identical spherical light waves that propagate<br />

towards a fixed 2D "holo-plane" offset from the cloud. The<br />

resulting interference pattern is calculated on the surface of the<br />

holo-plane to yield the digital hologram. While this method is<br />

conceptually simple and can produce high quality holograms,<br />

it is computationally intensive and time consuming to<br />

implement. To reduce the computational load of hologram<br />

generation we make use of a layer-based Fourier algorithm.<br />

This method partitions the point cloud into parallel, two-dimensional<br>

layers by choosing a discretization along one axis<br>

of the object. Points that do not intersect one of the discrete<br />

layers are simply shifted along the axis of discretization to the<br />

closest layer. To construct the hologram a Discrete Fourier<br />

Transform (DFT) is applied to each of the layers. The DFT is<br />

implemented by the Fast Fourier Transform (FFT) algorithm.<br />

To account for the varying depths, a simulated effective lens<br />

correction is calculated and applied to each layer. The<br />

transformed and depth corrected layers are summed to yield<br />

the final hologram. So for a hologram, H, with holo-plane<br />

coordinates (α, β), the construction is described by:<br />

H(α, β) = Σᵢ exp(i·zᵢ(α² + β²)) · FT[Aᵢ(x, y)],<br>

where Aᵢ(x, y) is the i-th object layer and zᵢ is the depth<br>

parameter for the i-th layer. The sum is defined over all the<br>

layers in the discretization.<br>
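The layer summation described above can be sketched in code. The following is a minimal, illustrative C++ implementation: a naive 2D DFT stands in for the cuFFT-accelerated FFTs used in the real system, and the grid size, normalization and lens-correction constant are assumptions for illustration only.

```cpp
#include <cassert>
#include <cmath>
#include <complex>
#include <vector>

using Cplx = std::complex<double>;
using Field = std::vector<std::vector<Cplx>>;

// Naive 2D DFT, standing in for the cuFFT-based FFTs of the real system.
Field dft2(const Field& a) {
    const std::size_t N = a.size(), M = a[0].size();
    const double PI = std::acos(-1.0);
    Field out(N, std::vector<Cplx>(M));
    for (std::size_t u = 0; u < N; ++u)
        for (std::size_t v = 0; v < M; ++v) {
            Cplx sum{0.0, 0.0};
            for (std::size_t x = 0; x < N; ++x)
                for (std::size_t y = 0; y < M; ++y) {
                    double ang = -2.0 * PI * (double(u * x) / N + double(v * y) / M);
                    sum += a[x][y] * Cplx{std::cos(ang), std::sin(ang)};
                }
            out[u][v] = sum;
        }
    return out;
}

// H(α, β) = Σ_i exp(i·z_i(α² + β²)) · FT[A_i(x, y)]:
// transform each depth layer, apply its lens-correction phase, and sum.
Field layerHologram(const std::vector<Field>& layers, const std::vector<double>& z) {
    const std::size_t N = layers[0].size(), M = layers[0][0].size();
    Field H(N, std::vector<Cplx>(M, Cplx{0.0, 0.0}));
    for (std::size_t i = 0; i < layers.size(); ++i) {
        Field Fi = dft2(layers[i]);
        for (std::size_t a = 0; a < N; ++a)
            for (std::size_t b = 0; b < M; ++b) {
                double phase = z[i] * (double(a * a) + double(b * b));  // lens correction
                H[a][b] += Cplx{std::cos(phase), std::sin(phase)} * Fi[a][b];
            }
    }
    return H;
}
```

For a single layer containing one point emitter at the origin, the resulting hologram has unit magnitude everywhere: the DFT of a delta is flat, and the lens correction is a pure phase factor.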

The implementation of the LFT method in the HGE is<br />

complicated by two issues. First, three coloured holograms<br />

(RGB) must be created to achieve full colour holographic<br />

images. This is achieved in this case by including a loop over<br />

the colours and essentially running the algorithm three times.<br />

The resulting holographic images can then be overlaid in the<br />

hologram replay field to give the final full colour holographic<br />

image. Note that the three coloured holograms will not yield<br />

the same size of image in the replay field due to the different<br />

wavelengths, diffracting at different rates on the display. To<br />

account for this the input point cloud or CDM for each colour<br />

channel must be scaled to ensure the images overlap exactly.<br />
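The required scaling follows from the fact that the replay-field extent grows linearly with wavelength. One plausible scheme, sketched below, pre-scales each channel's input inversely with its wavelength relative to a chosen reference; the wavelengths and the choice of reference are illustrative, not taken from the HGE.

```cpp
#include <cassert>
#include <cmath>

// Replay-field size scales with wavelength λ, so to make the three colour
// images overlap, each channel's input is pre-scaled inversely with λ.
// The reference wavelength is an illustrative choice, not a system value.
double channelScale(double lambda, double lambdaRef) {
    return lambdaRef / lambda;  // longer wavelengths get shrunken inputs
}
```

With a green 520 nm reference, a red 640 nm channel would be scaled down (factor 0.8125) and a blue 450 nm channel scaled up, so all three replay images land at the same size.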

The second issue is that the output display element – in<br />

this case a FLCoS SLM – is a binary phase device. Hence, the<br />

hologram H(α, β), with which in general takes complex<br />

values, representing both amplitude and phase, must be<br />

quantized to just two phase values i.e. 1-bit per pixel. This<br />

causes a severe reduction in quality in the resulting hologram<br />

and noise reduction methods must be applied to discern a<br />

viewable image. In general, even non-FLCoS devices, such as<br />

Nematic SLMs, cannot represent full phase and must quantize<br />

to some finite number of phase levels (usually 256 levels, i.e.<br>

8 bits per pixel). While an individual binary hologram may<br>

give a very poor reconstruction quality, it is possible to use<br />

time-averaging to produce high quality images with low noise<br />

variance. Such a scheme is implemented in the HGE to<br />

account for the limitations of the output display device.<br />
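The benefit of time-averaging can be illustrated numerically: averaging N statistically independent noisy reconstructions reduces the noise variance by roughly a factor of N. The sketch below uses a synthetic Gaussian noise model and arbitrary frame counts as stand-ins; it is not the HGE's actual noise model.

```cpp
#include <cassert>
#include <random>
#include <vector>

// Average N noisy "frames" of a constant-intensity replay field; the variance
// of the averaged image falls roughly as 1/N, which is the basis of the
// time-averaging noise-reduction scheme. The noise model is illustrative.
std::vector<double> timeAverage(int frames, int pixels, std::mt19937& rng) {
    std::normal_distribution<double> noise(0.0, 1.0);  // unit-variance speckle stand-in
    std::vector<double> avg(pixels, 0.0);
    for (int f = 0; f < frames; ++f)
        for (int p = 0; p < pixels; ++p)
            avg[p] += (1.0 + noise(rng)) / frames;  // true intensity 1 plus noise
    return avg;
}

// Sample variance over all pixels of a frame.
double variance(const std::vector<double>& v) {
    double mean = 0.0;
    for (double x : v) mean += x;
    mean /= v.size();
    double var = 0.0;
    for (double x : v) var += (x - mean) * (x - mean);
    return var / v.size();
}
```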

A. Algorithm Structure<br />

Given that we require three-colour holograms<br>

composed of multiple object layers, and that noise reduction must<br>

be applied as part of the process, the HGE algorithm proceeds<br />

as follows:<br />

1. The CDM Texture is passed from Unity to the C++<br />

DLL that wraps the underlying CUDA kernels.<br />

2. For each colour the CDM is scaled in size to account<br />

for the variable rates of diffraction of the three laser<br />

fields.<br />

3. FFTs are applied to the CDM data using the cuFFT<br />

CUDA Library to give full-phase holograms.<br />

4. The full-phase holograms are quantized to give<br />

binary phase holograms.<br />

5. The time-averaging algorithm is applied to eliminate<br />

noise in the replay field image.<br />

6. The holograms are stored in memory as a 24-bit<br />

Bitmap.<br />

The output of this procedure is a 24-bit Bitmap that can be<br />

streamed directly to the FLCoS SLM.<br />
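Step 4 of the pipeline, the binary phase quantization, can be sketched as follows. Thresholding on the sign of the real part is one common binarization rule, used here for illustration rather than as the HGE's actual quantizer.

```cpp
#include <cassert>
#include <complex>
#include <vector>

// Quantize a full-phase (complex-valued) hologram to 1 bit per pixel for a
// binary phase SLM: 0 -> phase 0, 1 -> phase π. Sign-of-real-part is one
// simple binarization rule, shown here for illustration.
std::vector<int> binarizePhase(const std::vector<std::complex<double>>& holo) {
    std::vector<int> bits;
    bits.reserve(holo.size());
    for (const auto& h : holo)
        bits.push_back(h.real() >= 0.0 ? 0 : 1);
    return bits;
}
```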

The majority of the algorithmic work is handled by several<br />

custom CUDA kernels, which are responsible for handling the<br />

CDM object, creating layers, preparing them for FFT and<br />

www.embedded-world.eu<br />




Fig. 2. The holographic display apparatus: The FLCoS<br />

SLM is synchronized to three laser diodes (RGB) via a custom<br />

micro-controller. The holographic image created by the<br />

reflected light is enlarged by an optical setup and viewed via<br>

a beam-splitter eye-piece. (a) Micro-controller, (b) SLM<br />

Driver, (c) SLM, (d) Laser-diode array (RGB). (e) Image<br />

enlarging optics, (f) Eye-piece<br />

merging to create the holograms. These kernels are called via<br />

C++ wrapper functions that expose the functionality without<br />

the need to interact directly with the low level CUDA C code.<br />

IV. DEVICE DRIVERS AND HOLOGRAM DISPLAY<br>

With the holograms computed, all that remains is to display<br>

them on a suitable output device. Holographic images of this type<br>

cannot be viewed on a typical display such as an LED or OLED panel,<br>

as these do not allow for phase modulation of light. Instead we<br />

use a reflective phase-only Spatial Light Modulator. These<br />

devices allow us to modulate the phase of incoming coherent<br />

light to generate desired interference patterns in the reflected<br />

wave-front to create the holographic image.<br />

The device used here is a ForthDD Ferroelectric Liquid<br />

Crystal on Silicon (FLCoS) SLM with 2048 x 1536 pixels of<br />

pitch 8.2μm. The device comes equipped with a control unit<br />

and drivers to allow developers to alter settings such as refresh<br>

rate and colour sequencing. To create the holographic images,<br />

RGB lasers are collimated and directed to the SLM surface<br>

that is displaying the holograms. The prototype display uses<br />

off-the-shelf optical components from ThorLabs and is<br />

designed to replicate an augmented reality style display. In<br />

this scheme, the holograms are reflected back to a beam-splitter<br>

which acts as an eye-piece to achieve the augmented<br>

reality effect (Fig. 2).<br />

For this implementation, we create three colour holograms<br>

(R, G and B) which must be displayed sequentially in a time-multiplexed<br>

fashion. To achieve this, a custom Arduino<br>

microcontroller was developed that synchronizes the RGB<br />

frames with three laser diodes. These frames are shown at high<br />

frequency to ensure that the images are time-averaged with<br />

respect to the viewer’s eye to give a single full-colour image<br />

(Fig. 3).<br />

Fig. 3. Augmented Reality Holographic Elephant Image.<br />

Photographed directly through the eye-piece with DSLR<br />

camera. The image is constructed by overlaying RGB holographic<br>

elephants in the replay field to create single full-colour<br />

elephant.<br />

V. RESULTS AND DISCUSSION<br />

The real-time holographic generation and display process has<br />

been tested on an NVidia GTX 1070 – a mid-range gaming<br>

GPU. Running a 3D Unity game with a depth resolution of<br>

between 32 and 64 depth layers (the number of layers computed<br>

depends on the content in the scene and is determined at run-time),<br>

the GTX 1070 yields a framerate of 52-55 Hz. This<br>

creates a smooth gaming experience assuming a static display<br />

as tested here, but a framerate of 90-120 Hz would be required<br />

to achieve a seamless mixed reality display system. Increasing<br />

the memory available to the GPU would be a first step to<br />

allow more holograms to be computed in parallel. Indeed, the<br>

new-generation Volta architecture from NVidia, which makes use of<br>

half-float calculations, would give a significant speed-up to the<br>

HGE and is projected to allow for >90 Hz in the system.<br>

Moving to a dedicated ASIC would improve this further by<br />

running at lower power in a more compact package, suitable<br />

for portable, untethered devices. As 80% of the compute time<br />

in the HGE is spent performing FFT, a dedicated embedded<br />

solution would provide significant speed up over the current<br />

generic approach.<br />

In this optical setup, the image size and quality are<br>

constrained by several physical factors. The eye-box and field<br />

of view are very small due to the dimensions of the SLM and<br />

the optics used to expand the image size. The holographic<br />

images also pick up noise due to speckle effects from the laser<br />

diodes and there is also some residual noise in the replay field<br />

due to the quantization errors created by the binary SLM.<br />

These issues can be addressed primarily through higher<br />

quality SLMs with smaller pixel pitch and higher resolution.<br />

Nematic-type, 8-bit SLMs with 4K x 2K resolution and pixel<br>

pitch of 3.74 μm are currently available, with 8K devices likely<br>

to emerge within the near future. The higher resolution and<br />

smaller pitch of these devices allow for wider fields of view<br />

and finer detail in the holographic images. Additionally, one<br />

can consider waveguide solutions combined with eye-tracking<br />

for accurate eye-box mapping to ensure the viewer never loses<br />



sight of the holographic image. Such schemes are the subject<br />

of current research and development.<br />

VI. CONCLUSION<br>

Here we have presented an end-to-end holographic generation<br />

and display system that allows 3D data to be extracted directly<br />

from a Unity game, full 3D holograms to be computed and<br />

then streamed to an augmented reality holographic display.<br />

The hologram generation algorithms achieve a depth<br />

resolution of between 32 and 64 layers while maintaining a framerate<br>

>50 Hz on a 2k x 1.5k SLM. While the hardware required to<br />

view the holographic images is in a nascent state, such an<br />

advance in the algorithmic side will enable the development of<br />

high-quality, fully interactive holographic display systems that<br />

are suitable for mass adoption.<br />

ACKNOWLEDGMENT<br />

The authors would like to thank E. Bundyra, M. Robinson<br />

and M. Ippolito for providing funding and business development<br />

support in this project. We would also like to thank Prof T.<br />

Wilkinson and CAPE at the University of Cambridge for<br />

supporting the project in its early stages.<br />



Integrating Capacitive Touch Technology into<br />

Electronic Access Control Products<br />

Walter Schnoor<br />

System Applications, MSP Microcontrollers<br />

Texas Instruments, Inc.<br />

Dallas, TX U.S.A.<br />

Abstract—Capacitive touch brings appealing aesthetics,<br />

enhanced security possibilities, and improved reliability to<br />

electronic access control systems. These key benefits are often<br />

counteracted by higher average power consumption, touch<br />

detection reliability issues in exterior installations when the touch<br />

panel is exposed to moisture, and additional system cost and<br />

integration complexity. However, microcontrollers equipped<br />

with capacitive touch sensing technology can be designed with<br />

key features to address these system challenges. This paper<br />

discusses system design techniques for addressing the common<br />

challenges with capacitive touch in electronic access control<br />

products, including how to reduce the average current draw of a<br />

12-button capacitive touch keypad into the single digit<br />

microamperes and system integration techniques to reduce<br />

system cost and improve moisture tolerance.<br />

Keywords—capacitive touch; capacitive sensing;<br />

microcontroller; electronic lock; electronic access control; human<br>

machine interface<br />

I. INTRODUCTION<br />

Smart home products have seen their popularity amongst<br />

consumers rise significantly in recent years. Take 2016 for<br />

example - it’s estimated that 80 million smart home devices<br />

were delivered to customers, marking a 64% increase from<br />

2015 [1]. Smart home product manufacturers are now even<br />

more optimistic about the future of the industry. The smart<br />

home industry is expected to reach US $120 billion<br>

globally by 2022, “but not without consumer acceptance first”<br />

[2]. One of the most fascinating things about the onset of smart<br />

connected home products is the impact that they have had on<br />

mature industrial segments such as doorbells, thermostats, and<br />

access control products. In order to accelerate consumer<br />

acceptance of new smart products in these existing market<br />

segments, manufacturers have leaned heavily on not only the<br />

connectivity and functionality of their products, but also on the<br />

product’s aesthetics, its security features, and its long-term<br />

reliability. Aesthetics, security, and reliability have become<br />

key differentiators in the aforementioned market segments, and<br />

are looked at critically by residential and commercial<br />

consumers. It is this market need that has driven adoption of<br />

capacitive touch sensing technology in smart home products,<br />

specifically in electronic access control products.<br />

Electronic access control products, such as the electronic<br />

door lock, often use a short-range wireless connection such as<br />

Bluetooth Low Energy (BLE) or near field communication<br />

(NFC) to validate a user and unlock the controlled function.<br />

However, in the event that a user doesn’t have their mobile<br />

device or radio frequency identification (RFID) tag, a keypad<br />

may be used as a backup mechanism to allow access. While<br />

the keypad may not be the primary means of authentication, its<br />

inclusion in the electronic lock is often preferred by consumers<br />

because of the flexibility that it offers. For example, the owner<br />

of a home could enable temporary key codes for visiting guests<br />

so that they may come and go as they please for a period of<br />

time. Likewise, a business could issue a temporary security<br />

code to a contractor rather than issuing them a tag. The<br />

downside of including a mechanical keypad in an electronic<br />

lock is that the keypad itself takes up considerable space in the<br />

product and is often visually unappealing. Mechanical keypads<br />

also have the potential to become a security weakness, as<br />

fingerprints and dirt or grease smudges can leave a history of<br />

which keys were often pressed by users, allowing someone to<br>

extrapolate possible codes by observing the keys. Finally,<br />

mechanical keypads come with reliability concerns, as moving<br />

parts and electrical contacts experience fatigue over time.<br />

Capacitive touch sensing technology enables designers of<br />

electronic access control products to improve market<br />

acceptance of their products by offering premium aesthetics,<br />

enhanced security possibilities, and improved robustness with<br />

respect to a traditional mechanical keypad. However,<br />

capacitive touch is not without its own unique challenges. For<br />

example, capacitive touch is an “always-on” activity, meaning<br>

that the touch sensors must be actively scanned at a periodic<br />

interval to determine if a user has touched a key. When<br />

contrasted with a mechanical button, such as a membrane<br />

switch, capacitive touch carries a power consumption penalty,<br>

which is significant in an electronic access control application<br>

that may be required to run off of a set of “AA” batteries for 12<br />

months or more. In addition to the power consumption<br />

challenge, capacitive touch sensors are often susceptible to<br />

false touch detections when subjected to exterior environments<br />

where rain and snow are present.<br />



In this paper, the benefits of capacitive touch sensing for<br />

electronic access control products are presented. In addition,<br />

the unique challenges of capacitive touch in this application are<br />

analyzed from a technical standpoint and solutions to those<br />

challenges are proposed.<br />

II. BENEFITS OF CAPACITIVE TOUCH SENSING<br>

Capacitive touch technology arms product designers with<br />

the ability to create abstract human-machine interfaces (HMIs)<br />

using the same fundamental technology found in touchscreens.<br />

A capacitive touch sensor consists of a conductive structure, or<br />

set of structures, from which an electric field is projected out<br />

through an insulating dielectric overlay material to the free<br />

space just above the overlay. When a user comes into close<br />

proximity of the overlay, or touches the overlay, the electric<br />

field is changed due to the presence of the user. It is this<br />

change in electric field that is measured via some type of<br />

acquisition method, referred to as a capacitance-to-digital<br />

conversion. Capacitance to digital conversion usually involves<br />

translating the sensing electrode’s capacitance into some<br />

measurable quantity (usually a time, current, or voltage that<br />

varies proportionally with the capacitance of the electrode).<br />
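As an illustration of the capacitance-to-digital idea, consider a charge-time method: the electrode is charged through a known resistance while a timer counts until the voltage crosses a threshold, so the count is proportional to the electrode capacitance (t = RC·ln(Vdd/(Vdd − Vth))). All component values below are hypothetical, not taken from any specific device.

```cpp
#include <cassert>
#include <cmath>

// Charge-time capacitance-to-digital conversion: timer ticks counted until
// the RC-charged electrode voltage crosses Vth; the count is proportional
// to C. R, Vdd, Vth and the tick period are illustrative values only.
long countsForCapacitance(double farads) {
    const double R = 100e3;     // series resistance, ohms (hypothetical)
    const double Vdd = 3.3;     // supply voltage, volts (hypothetical)
    const double Vth = 2.2;     // comparator threshold, volts (hypothetical)
    const double tick = 10e-9;  // timer resolution, seconds (hypothetical)
    double t = R * farads * std::log(Vdd / (Vdd - Vth));  // charge time to Vth
    return std::lround(t / tick);
}
```

A touch that adds a few picofarads to the electrode then shows up directly as a proportional increase in the count.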

Microcontrollers with capacitive touch sensing technology are<br />

commonly available from several integrated circuit<br />

manufacturers. These devices allow for the creation of<br />

anything from a single capacitive touch button that replaces a<br />

mechanical button, to complex designs with many buttons,<br />

positional sensors, and short-range proximity sensors. In recent<br />

years capacitive touch capable microcontrollers have become<br />

quite cost optimized, with broad portfolios available from<br />

several manufacturers to allow product designers to select the<br />

best device that meets their needs based on the complexity and<br />

budget of a given design. In this section, the key benefits of<br />

capacitive touch in electronic access control products are<br />

presented, including aesthetics, enhanced security possibilities,<br />

and improved reliability.<br />

A. Appealing Aesthetic Design<br />

When compared with mechanical pushbuttons and<br />

membrane switches, capacitive touch sensors give mechanical<br />

designers considerable freedom to improve the aesthetics of a<br />

product. Because capacitive touch sensors have no moving<br />

parts, the keypad can be designed to fit into the mechanical<br />

enclosure by taking on a variety of shapes and configurations,<br />

rather than the mechanical keypad dictating the size and shape<br />

of the enclosure. A typical capacitive touch button mechanical<br />

stack-up is illustrated in Fig. 1. The stack-up consists of a label<br />

(decal, silkscreen, or other) on top of an overlay material<br />

(typically the product enclosure), bonded to the sensor<br />

implementation, which is generally a printed circuit board<br />

assembly [3].<br />

Fig. 1 shows a typical thickness FR-4 rigid core printed<br />

circuit board, but this is not a requirement. Flexible circuits<br />

may also be utilized to create sensors that curve to match the<br />

contour of a product enclosure. If transparent sensors are<br />

required, indium tin oxide (ITO) sputtered onto polyethylene<br />

terephthalate (PET) or glass may be used. This optically clear<br />

implementation is commonly used in capacitive touch screens<br />

[3].<br />

Fig. 1. Typical Capacitive Touch Button Stackup<br />

A defining feature of capacitive touch technology is the<br />

ability to create abstract sensors beyond just touch buttons that<br />

fit the form-factor of the product in which they reside.<br />

Common examples of this include touch slider sensors and<br />

scroll wheel sensors. Fig. 2 shows how a capacitive touch<br />

button or slider sensor may be constructed to wrap around a<br />

cylindrical product enclosure. Slider sensors are capable of<br />

reporting the position of a user touching the sensor with high<br />

accuracy and resolution; positional accuracy greater than 8 bits<br />

(256 points) is not unheard of with higher performance<br />

microcontrollers.<br />
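High-resolution position reporting from a slider is typically obtained by interpolating between the per-electrode signals, for example with a signal-weighted centroid; that interpolation is what yields better-than-8-bit resolution from only a handful of electrodes. The sketch below assumes hypothetical touch-delta counts and an electrode-index coordinate system.

```cpp
#include <cassert>
#include <cmath>
#include <vector>

// Estimate the touch position along a slider as the signal-weighted centroid
// of the per-electrode touch deltas (illustrative; real firmware adds
// normalization, clamping and filtering).
double sliderPosition(const std::vector<double>& deltas) {
    double num = 0.0, den = 0.0;
    for (std::size_t i = 0; i < deltas.size(); ++i) {
        num += deltas[i] * static_cast<double>(i);
        den += deltas[i];
    }
    return num / den;  // position in electrode-index units
}
```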

Fig. 2. Abstract Sensor Shape Examples<br />

Since the capacitive touch sensor itself is mounted inside<br />

the product, the exterior of the product can maintain a flush,<br />

seamless appearance. This allows smooth uninterrupted<br />

artwork to be used on the overlay material to identify key<br />

locations. Capacitive touch can also be used to “hide” the<br />

presence of keys when they are not in use. For example, a<br />

capacitive touch keypad could be implemented on an FR-4<br />

PCB and bonded to a plastic overlay material. The overlay<br />

stack may be designed with LED backlighting provisions such<br />

that the button locations and button identifying marks are not<br />

visible until the LED backlighting is illuminated. A short-range<br>

capacitive proximity sensor could be used to control<br>

whether the keypad is illuminated or not. In this way, only<br />

when a user approaches the keypad do the keys activate and<br />

become visible.<br />

As a single overlay material and PCB assembly is all that is<br />

needed mechanically to implement a complex capacitive touch<br />

interface, it is quite easy to create product variants with<br />

different colors and textures. There is no need to match the<br />

color of mechanical keys to the color of the product enclosure,<br />

for example. Likewise, the product will color-fade evenly with<br />

UV exposure because it is constructed of a single material,<br />

rather than a composite of multiple materials as would be the<br />

case with mechanical switches or a membrane switch overlay.<br />

B. Enhanced Security Possibilities<br />

Capacitive touch sensors offer the possibility of improving<br />

the security of electronic access control products by providing<br />

the ability to scramble or hide the history of previous<br />

keystrokes. A drawback to including a keypad in an access<br />

control product is that previous keystrokes can be visible if an<br />

authorized user leaves fingerprints, dirt, or grease on the keys.<br />

If a trace of the keys which are commonly pressed is visible to<br />

an intruder, it limits the possible passcode combinations that<br />

the intruder must try, making a brute force attack more<br />

feasible. There are several different ways in which capacitive<br />

touch sensing may be used to counteract this scenario.<br />

1) Use of a slider or wheel for number selection: A slider<br />

or wheel sensor may be implemented as the method for<br />

selecting a passcode character. This method is effective at<br />

hiding previous passcode character entries because the starting<br />

number or character in a list of valid characters may be<br />

randomized whenever the capacitive touch system is awoken.<br />

For example, a short-range proximity sensor could be used to<br>

wake up and activate a capacitive scroll wheel sensor. Upon<br />

wake-up, the product would randomly select a starting digit.<br />

From there, the user would use the capacitive scroll wheel to<br />

select the next digit in their passcode. This method requires<br />

the use of at least a single-digit display to give feedback to the<br />

user regarding which character they have currently selected.<br />

2) Use of buttons with scrambled values: A numeric<br />

keypad may be implemented using capacitive touch sensors<br />

with a single-digit display element present at each button, such<br />

that the numeric value corresponding with a given button is<br />

randomized for each code entry. As in method 1 above, a short-range<br>

proximity sensor may be used to wake up a hidden keypad<br />

when a user is close to the keypad. At that time, the number-to-button<br>

assignment is randomly selected and displayed for<br>

the user at that instance in time. Segmented LED displays<br />

may be used for indicating the current key mapping.<br />

Alternatively, a monochrome liquid crystal display could be<br />

utilized, with touch sensors installed over the display. In this<br />

configuration, the touch sensors would need to be optically<br />

clear, necessitating the use of ITO or another optically clear<br>

conductor for the sensors.<br />
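The scrambled number-to-button assignment described above amounts to drawing a fresh random permutation of the digits 0-9 on each wake-up. A minimal sketch follows; the RNG seeding and display update are outside the scope of this fragment, and a real product would seed from a hardware entropy source rather than a fixed value.

```cpp
#include <algorithm>
#include <array>
#include <cassert>
#include <random>

// On each wake-up, assign the digits 0-9 to the ten buttons in random order,
// so residue on the keys reveals which buttons are used but not which digits
// they carried. Illustrative only.
std::array<int, 10> scrambleKeypad(std::mt19937& rng) {
    std::array<int, 10> mapping{0, 1, 2, 3, 4, 5, 6, 7, 8, 9};
    std::shuffle(mapping.begin(), mapping.end(), rng);
    return mapping;  // mapping[button] = digit shown on that button's display
}
```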

C. Improved Reliability<br />

As there are no moving parts in a capacitive touch user<br />

interface, capacitive touch offers a nearly infinite lifetime in<br />

terms of number of presses. Mechanical switchgear is often<br />

rated to a certain lifetime of presses, after which performance is<br />

not guaranteed. With a capacitive solution there is no material<br />

fatigue and no electrical contacts that can corrode over time.<br />

Capacitive touch solutions can offer improved ESD<br />

immunity, as the mechanical enclosure can be sealed tightly<br />

with the keypad sensors fully contained inside of the enclosure.<br />

Typical polycarbonate (PC) and acrylonitrile butadiene styrene<br />

(ABS) plastics offer dielectric breakdown voltages of 15kV per<br />

mm of thickness or higher [4].<br />

III. CHALLENGES OF CAPACITIVE TOUCH SENSING<br>

While there are clear benefits to using capacitive touch<br />

technology to implement a keypad in an electronic access<br />

control product, there are still challenges that need to be<br />

addressed. In this section, the challenges of power<br />

consumption, environmental influence, and system integration<br />

will be addressed.<br />

A. Power Consumption<br />

Unlike mechanical switches, capacitive touch sensors<br />

require active scanning at a periodic rate by the controlling<br />

processor. The scan rate is a configurable parameter, with the<br />

tradeoff being between response time and power consumption.<br />

Scan rates for a typical capacitive touch keypad are generally<br />

in the range of 8 Hz to 100 Hz. The operational flow of a<br />

capacitive touch controller is a loop, in which the following<br />

tasks must be performed:<br />

1. All sensors in the system must be measured (an<br />

equivalent digital value for the external capacitance<br />

being measured is obtained)<br />

2. The digital values representing the current state of each<br />

sensor are post-processed. This involves the<br />

following, at a minimum:<br />

a. Application of noise filtering to the new raw<br />

samples<br />

b. A threshold comparison between the updated<br />

filtered samples and their historical, long term<br />

references is performed on an electrode by<br />

electrode basis to determine if there was<br />

enough deviation in any of the measurements<br />

with respect to their idle state to signify that a<br />

touch or proximity event has taken place<br />

c. If a touch was detected, it is de-bounced,<br />

validated, and reported<br />

d. If no touch was detected, the historical long-term<br>

reference value for each sensor is<br>

updated to reflect any temperature or<br />

environmental drift that may be occurring<br />

Fig. 3 shows this process visually.<br />
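The per-sensor post-processing loop above can be sketched in C. This is an illustrative sketch, not code from a particular touch controller: the fixed-point IIR shifts, the threshold handling, and the assumption that a touch reduces the measured count (as in a mutual-capacitance design) are all assumptions for the example, and de-bouncing (step 2c) is omitted for brevity.<br />

```c
#include <stdint.h>

/* Illustrative per-sensor state: the IIR-filtered sample and the
 * slowly tracking long-term (un-touched) reference. */
typedef struct {
    int32_t filtered;
    int32_t reference;
} sensor_state_t;

/* First-order fixed-point IIR low-pass: out += (in - out) / 2^shift. */
static int32_t iir_step(int32_t out, int32_t in, unsigned shift)
{
    return out + ((in - out) >> shift);
}

/* One pass of steps 2a-2d for a single sensor; returns 1 on touch.
 * Assumes the measured count decreases on touch. */
int process_sample(sensor_state_t *s, int32_t raw, int32_t threshold)
{
    s->filtered = iir_step(s->filtered, raw, 2);            /* 2a: noise filter   */
    int32_t delta = s->reference - s->filtered;             /* 2b: deviation      */
    if (delta > threshold)
        return 1;                                           /* 2c: report touch   */
    s->reference = iir_step(s->reference, s->filtered, 6);  /* 2d: drift tracking */
    return 0;
}
```

In a real controller, step 2c would additionally debounce the detection over several consecutive samples before reporting it.<br />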

www.embedded-world.eu<br />




Fig. 3. Capacitive Touch Application Loop<br />

In many solutions, the measurement is controlled by a<br />

processor running acquisition software, and the measurement<br />

results are also interpreted by the processor. This means that<br />

the power consumption of the capacitive touch solution can be<br />

quite high, because the CPU is needed to perform the<br />

measurements and interpret the results on a sample-by-sample<br />

basis. What is interesting about this problem is that the<br />

keypads in smart building applications are typically used for less<br />
<br />
than 1% of total runtime: you may access your<br />
<br />
door keypad once or twice daily, for 10 seconds at a time. The<br />
<br />
rest of the time, the keypad is not being actively used, but it is<br />
<br />
still necessary to actively scan it and post-process the<br />

measurement results.<br />

To address this issue, integrated circuit manufacturers are<br />

now beginning to automate their scanning and post-processing,<br />

so that a processor does not need to wake up and execute<br />

software at all! A digital state machine can be constructed to<br />

periodically scan a set of sensors, apply an IIR filter for AC<br />

noise rejection, perform proximity and/or touch threshold<br />

detection, and apply a second IIR filter to track for changing<br />

environmental conditions- all without any software execution<br />

needed. Touch sensing microcontrollers that implement this<br />

technique have been shown to reduce average current<br />

consumption by >30% for a basic proximity sensor, and >50%<br />

for an application with 4 capacitive touch buttons [3].<br />

In addition to IC techniques for reducing power<br />

consumption, system design techniques may be used to<br />

optimize a capacitive touch system for low power. Adding a<br />

short range capacitive proximity sensor around the buttons of<br />

an electronic access control keypad can reduce average current<br />

by using the proximity sensor to wake up the keypad. In this<br />

way, it is only necessary to scan the proximity sensor regularly,<br />

until proximity is detected. When proximity is detected, all<br />

sensors in the keypad are then activated and their status is made<br />

available to the system. Fig. 4 below illustrates how this<br />

concept may be implemented in a sensor design. This specific<br />

sensor design is used in the BOOSTXL-CAPKEYPAD<br />

evaluation module [4]. This sensor design uses the mutual capacitance measurement method, in which the change in capacitance between two sensing elements (two conductors) is measured to detect touch and proximity [3].<br />

Fig. 4. BOOSTXL-CAPKEYPAD Sensing Electrode Pattern<br />

In this approach, it is still necessary to infrequently measure<br />

the entire keypad (for example, every 5 minutes) to refresh the<br />

long-term, un-touched reference values for all of the keys in the<br />

keypad. This ensures that valid references are present when a<br />

proximity event is detected. This is needed because the<br />

reference, un-touched digital capacitance values will drift as a<br />

function of IC temperature. For this power-saving technique to<br />

be effective, the proximity sensing distance must be kept small<br />

(for example, 8 centimeters or less). Achieving long range<br />

proximity sensing (for example, >10cm) requires considerable<br />

measurement resolution, scan time, and sensor area. The<br />

average power required to detect proximity at >10cm will often<br />

be larger than the average power required to measure 12<br />

capacitive touch buttons.<br />
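The wake-on-proximity policy described above can be sketched as a small scan scheduler. The structure, the names, and the tick-based refresh policy below are illustrative assumptions, not taken from any specific device; a real implementation would drive this from the scan timer interrupt.<br />

```c
#include <stdint.h>

/* Illustrative wake-on-proximity scan scheduler: only the proximity
 * channel is scanned until it detects, with an occasional full-keypad
 * scan to keep every key's long-term reference valid. */
enum scan_mode { SCAN_PROX_ONLY, SCAN_FULL_KEYPAD };

typedef struct {
    enum scan_mode mode;
    uint32_t ticks_since_refresh; /* scan periods since last full refresh */
    uint32_t refresh_period;      /* e.g. the number of periods in 5 min  */
} scheduler_t;

/* Decide what to scan this period; returns the number of sensors. */
unsigned schedule_scan(scheduler_t *s, int prox_detected, unsigned num_keys)
{
    s->mode = prox_detected ? SCAN_FULL_KEYPAD : SCAN_PROX_ONLY;
    if (s->mode == SCAN_PROX_ONLY) {
        /* Infrequent reference refresh against temperature drift. */
        if (++s->ticks_since_refresh >= s->refresh_period) {
            s->ticks_since_refresh = 0;
            return 1 + num_keys;   /* proximity + full keypad refresh */
        }
        return 1;                  /* proximity sensor only */
    }
    s->ticks_since_refresh = 0;
    return 1 + num_keys;           /* proximity + all keys active */
}
```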

Fig. 5 shows the power profile of the BOOSTXL-<br />

CAPKEYPAD EVM, which uses the MSP430FR2522<br />

microcontroller to implement a 12-key numeric keypad with a<br />

proximity sensor. In the wake-on-proximity operating mode,<br />

the BOOSTXL-CAPKEYPAD reaches approximately 8µA of<br />

average current at 3.3V. Notice how the instantaneous current<br />

may be as high as 2.3mA; this is the current during the<br />

measurement of the proximity sensor. If all 12 electrodes in<br />

the keypad were measured continuously, rather than just the<br />

proximity sensor, the time that the capacitive touch controller<br />

would spend in that high current state would be larger, leading<br />

to higher average current.<br />
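The relationship between the 2.3 mA instantaneous peaks and the roughly 8 µA average is plain duty-cycle arithmetic, sketched below. The 1 µA sleep current and 0.3% active duty cycle used in the usage example are assumed values chosen only to show that such an average is plausible; they are not measured figures from the EVM.<br />

```c
/* Duty-cycle average: I_avg = I_sleep * (1 - d) + I_active * d.
 * All currents in microamps; 'duty' is the fraction of time spent
 * in the high-current measurement state. */
double average_current_ua(double sleep_ua, double active_ua, double duty)
{
    return sleep_ua * (1.0 - duty) + active_ua * duty;
}
```

With the assumed 1 µA sleep current and a 0.3% duty cycle at 2300 µA, this gives about 7.9 µA, in line with the figure quoted above.<br />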



Fig. 5. BOOSTXL-CAPKEYPAD Proximity Sensor Power Duty Cycle<br />

B. Environmental Influences<br />

Because capacitive touch sensing is based on<br />
<br />
analyzing changes in an electric field<br />

over time, it should not come as a surprise that capacitive touch<br />

sensors can be adversely affected by environmental influences<br />

such as moisture build-up on the keypad overlay. Moisture<br />

tolerance is important because electronic access control<br />

products are often designed to be installed outdoors where the<br />

keypad may be exposed to rain water. Even in the case of an<br />

indoor-only product, the keypad will need to tolerate being<br />

cleaned with water and cleaning solution. Despite these<br />

challenges, with proper system design it is possible to develop<br />

a capacitive touch keypad that is robust in the presence of<br />

moisture due to rainfall or cleaning of the touch surface.<br />

Moisture tolerance may be improved by the addition of a<br />

guard sensing channel near the affected touch sensors. In the<br />

case of the keypad design shown in Fig. 4 with the additional<br />

proximity sensor, the proximity sensor may also be repurposed<br />

as a guard sensing channel. A guard sensing channel<br />

simply acts as a mask. If moisture builds up on the sensing<br />

overlay, it will often be present across the guard channel as<br />

well as the capacitive touch buttons themselves. If the guard<br />

channel goes into detection, that detection may be used as a<br />

mask against the buttons. When the guard channel is in detect,<br />

the keypad becomes locked and touches are not allowed. This<br />

method also works well for the cleaning scenario. If cleaning<br />

solution is applied to the touch panel, the guard channel will<br />

detect this and that information can be used to lock out the<br />

keypad.<br />
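In firmware, the guard-channel lockout described above reduces to a simple mask; representing the button detections as a bitmask is an illustrative choice, not a requirement of any particular device.<br />

```c
#include <stdint.h>

/* Guard-channel lockout: while the guard sensor is in detect (moisture
 * spread across the overlay), all button detections are suppressed. */
uint16_t apply_guard_mask(uint16_t button_detects, int guard_in_detect)
{
    return guard_in_detect ? (uint16_t)0 : button_detects;
}
```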

The mutual capacitance measurement topology can also be<br />

used to improve moisture tolerance, and even enable touch<br />

detection when capacitive touch sensors are covered in running<br />

water. In a mutual capacitance measurement, the capacitance<br />

being measured is the capacitance between two sensing<br />

electrodes. This means that the designer of the electrodes gains<br />

considerable control over the electric field when compared<br />

with the self-capacitance measurement topology, in which the<br />

electric field comes out from the sensor in all directions. It is<br />

possible to design a mutual capacitance electrode geometry that<br />

limits nearby ground and contains the electric field between the<br />

two electrodes being measured in the mutual capacitance<br />

mode.<br />

Another key benefit of the mutual capacitance<br />

measurement topology is that when water is present on an<br />

overlay panel, it increases the mutual<br />
<br />
capacitance; a touch or proximity event, by<br />
<br />
contrast, decreases the mutual<br />

capacitance. This behavior enables the processor interpreting<br />

the measurement results to differentiate between water and a<br />

valid touch, because they create different changes in the<br />

system. By actively monitoring for changes due to moisture<br />

versus changes due to a touch, it is possible to actively control<br />

the touch thresholds in a system and enable accurate touch<br />

detection even with water flowing over the capacitive touch<br />

keypad area.<br />
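Because water shifts the mutual-capacitance measurement in the opposite direction from a touch, the discrimination reduces to a sign test on the deviation from the un-touched reference. The single shared threshold below is a simplification for the sketch; real firmware would use separate thresholds and re-calibrate on a moisture event, as described above.<br />

```c
/* Sign-based classification for a mutual-capacitance sensor: a touch
 * decreases the measured capacitance, water on the overlay increases it. */
typedef enum { EVT_NONE, EVT_TOUCH, EVT_MOISTURE } touch_event_t;

touch_event_t classify(long reference, long sample, long threshold)
{
    long delta = sample - reference;
    if (delta <= -threshold)
        return EVT_TOUCH;     /* capacitance decreased: finger */
    if (delta >= threshold)
        return EVT_MOISTURE;  /* capacitance increased: water  */
    return EVT_NONE;
}
```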

Texas Instruments (TI) has subjected a 12-button numeric<br />

keypad capacitive touch design to IPX5 moisture tolerance<br />

testing, and full touch detection was possible under all IPX5<br />

test conditions. These results were achievable due to the<br />

following key system parameters:<br />

- Use of the mutual capacitance measurement topology<br />

- Use of an electrode geometry that limited ground and<br />

other conductors on the sensor layer of the PCB,<br />

leaving just TX and RX patterns<br />

- Use of moisture-specific firmware that monitors for the<br />

presence of water by looking for a reverse-touch<br />

scenario, at which point sensors are re-calibrated to<br />

operate in the presence of moisture<br />

These techniques, when combined, significantly<br />

increase the reliability of capacitive touch solutions in<br />

electronic access control products that are installed outdoors.<br />

C. System Integration and Development<br />

At first glance, integrating capacitive touch sensors into a<br />

product that previously used mechanical buttons can seem like<br />

a challenging task. Admittedly, it is hard to beat the simplicity<br />

of a mechanical push button when it comes to hardware and<br />

firmware development. On the hardware development side,<br />

capacitive touch mandates that careful attention be paid to<br />

mechanical stack-up consistency, electrode geometry, and trace<br />

routing. On the firmware side, capacitive touch often involves<br />

adding a new microcontroller to an application, which means<br />

adding a new firmware development flow.<br />

Fortunately for product designers looking to integrate<br />

capacitive touch, there has never been a better time to get<br />

started than right now. Competitive pressure on IC<br />

manufacturers has led to the creation of a significant amount of<br />



high quality literature, tools, and devices to address common<br />

system integration challenges.<br />

1) Literature: The majority of the major IC manufacturers<br />

that offer capacitive touch technology now also offer high<br />

quality literature to educate the product designer that is new to<br />

capacitive touch on best practices that address not only<br />

firmware development, but also schematic capture, PCB<br />

layout, and mechanical design. System level challenges<br />

including noise immunity, moisture tolerance, and low power<br />

design are addressed.<br />

2) Tools: Just like literature, IC manufacturers also offer<br />

platform tools that enable designers to quickly start their<br />

designs without having to be an expert on a particular<br />

technology or device. Some of these tools will even generate<br />

code for you to run on the capacitive touch controller.<br />

3) Devices: Microcontrollers with integrated capacitive<br />

touch are now available in a variety of memory densities,<br />

package sizes, and peripheral configurations. In many cases,<br />

it’s possible to find a microcontroller for an electronic access<br />

control product that can integrate the capacitive touch control<br />

with some or all of the other application functions. When it is<br />

possible to use a single microcontroller for the application<br />

functions and the capacitive touch interface, the bill of<br />

materials (BOM) cost of adding capacitive touch to a product<br />

can become quite small. IC-based capacitive sensing<br />

measurement technology has also improved considerably in<br />

recent years. Features such as parasitic capacitance offset<br />

have been implemented, enabling longer shielded capacitive<br />

sensing trace runs on PCBs without significantly increasing<br />

power consumption and measurement time.<br />

The available literature, tools and devices on the market<br />

today make integrating capacitive touch sensors into electronic<br />

access control systems easier and faster than ever. The<br />

combination of benefits provided by capacitive touch sensing<br />

now outweighs the system integration challenges of the past.<br />

IV. CONCLUSIONS<br />

Capacitive touch sensing technology is increasingly being<br />

adopted into smart home and electronic access control products<br />

due to the clear advantages it provides product designers. The<br />

technology has opened up new ways to mechanically design<br />

enclosures that must contain keypads, and the potential security<br />

and robustness benefits are desired by end users. As the<br />

integrated circuit industry continues to remove challenges to<br />

adoption by reducing average power consumption, reducing the<br />

impacts of external environmental influence, and lowering the<br />

cost of system integration, the total cost of integrating<br />

capacitive touch from a bill of materials (BOM) standpoint as<br />

well as a time-to-market standpoint will continue to decrease<br />

and adoption of the technology will continue to increase.<br />

ACKNOWLEDGMENT<br />

W.S. thanks Yiding Luo of Texas Instruments for his<br />

significant contributions to capacitive touch sensing moisture<br />

tolerance research and development.<br />

REFERENCES<br />

[1] D. Olick, “Why 2017 will finally be the year of the smart home:<br />

consumers figure it out” CNBC, Jan 2017.<br />

[2] I. Berger, “Is it smart to have a smart home?”, The Institute, IEEE, May<br />

2017<br />

[3] CapTIvate Technology Guide, Design Guide Chapter, Texas<br />

Instruments, Inc., Revision 1.60.00.00, Dec 2017<br />

[4] Electrical properties of plastic materials, Professional Plastics<br />



Accelerating 3D Graphics Performance With EGL<br />

Image on Zynq UltraScale+ MPSoC<br />

Alok Gupta<br />
<br />
Platforms Processing Group<br />
<br />
Xilinx, San Jose, CA<br />

alok.gupta@xilinx.com<br />

Abstract—When texture content is updated very<br />
<br />
often, more or less every frame, classic functions like glTexImage2D<br />
<br />
and glTexSubImage2D are very inefficient. These functions are not<br />
<br />
suitable because data is copied and converted in the drivers<br />
<br />
from CPU to GPU memory in order to be compliant with the<br />
<br />
Khronos standard, which results in lower than expected graphics<br />
<br />
rendering frame rates for video textures. Fortunately, a lesser-<br />
<br />
known solution that is more efficient exists. As with some design<br />
<br />
choices, this increase in efficiency comes with some increase in<br />
<br />
effort. This paper describes different texturing techniques<br />
<br />
with the EGLImage extension, where CPU and GPU share the same<br />
<br />
physical memory and copying data is not required, and helps users<br />
<br />
choose the proper method to avoid performance issues in<br />
<br />
certain situations.<br />

Keywords—GPU,Graphics, OpenGLES, EGL Image, Zero-copy<br />

I. INTRODUCTION<br />

Each new generation of devices comes with an expectation<br />

of better performance and user experience. An indicator of<br />

performance, that is perhaps the most easily observed by the<br />

average consumer, is the performance of 2-D and 3-D graphics,<br />

and thus, graphics performance capabilities have become<br />

paramount to the success of a new device. Most modern mobile<br />

devices, such as smartphones and tablets, are powered by SoC<br />

application processors that contain dedicated graphics<br />

processing units (GPUs). These processors, in their entirety, are<br />

designed to efficiently accelerate graphics operations while<br />

maintaining a balance between power and performance.<br />

OpenGL is an open API, standardized by a not-for-profit<br />

technology consortium called The Khronos Group, which has<br />

been in use for some time to enable developers to draw 3-D<br />

graphics on a variety of devices, and OpenGL ES is a subset of<br />

OpenGL that is designed to accommodate the unique demands<br />

of mobile devices. In OpenGL ES, every object on the screen is<br />

represented as a series of triangles each of which is defined by a<br />

set of three vertices. Images, which are referred to as textures,<br />

are transposed over the surfaces of these triangles, as determined<br />

by the application. Hundreds or thousands of these textured<br />

triangles sum to form a scene that represents anything a<br />

developer or artist could imagine.<br />

II. PROBLEM STATEMENT<br />

A. Rapid Texture Updates<br />

While textures can originally exist in a variety of different<br />

formats, they ultimately exist as raw, uncompressed, color data<br />

in memory before they are applied to an object. With high<br />

quality display resolution, the need for higher resolution textures<br />

also increases. For example, the equation below shows that a 32-bit<br />
<br />
texture the size of a 1080p display requires almost eight<br />
<br />
megabytes of memory: 1920 (x) × 1080 (y) × 4 bytes per pixel ≈ 7.9 MB.<br />
<br />
On several occasions, customers have tried to use sequences<br />

of rapidly updating textures to create animation. For example,<br />

every frame of a YouTube video is actually just a series of new<br />

textures being rapidly displayed. In order for the GPU to draw a<br />

texture on an object, it must exist in a special area of system<br />

memory called video RAM (VRAM). VRAM is technically just<br />

regular system memory, but it exists within a predetermined<br />

address range. While there are several ways to upload textures<br />

into VRAM, customers have been observed using a method that<br />

involves uploading a fresh texture for every frame. This<br />

consumes significant memory bandwidth and various other<br />

system resources. Furthermore, this method was never intended<br />

to support animation.<br />
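The memory-size arithmetic from the equation above generalizes to any uncompressed texture:<br />

```c
#include <stdint.h>

/* Raw size of an uncompressed texture: width * height * bytes per pixel. */
uint64_t texture_bytes(uint32_t width, uint32_t height, uint32_t bits_per_pixel)
{
    return (uint64_t)width * height * (bits_per_pixel / 8);
}
```

For a 1080p 32 bpp texture this yields 8,294,400 bytes, roughly 7.9 MB.<br />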

B. Pros and cons of glTexImage2D API<br />

A common way to get texture data into VRAM is through<br />

the use of the glTexImage2D function. This function is designed<br />

to upload a texture to a memory region in VRAM where it can be<br />

reused throughout the program. The benefit of this is in scenarios<br />

like scrolling, where only the offset at which the texture is<br />
<br />
displayed is altered. However, it is not suitable for<br />

situations such as animation where it would require frequent<br />

modification of the content in the texture. It was not designed to<br />

be called on every frame with updated texture content, but<br />

unfortunately this has been observed in practice. When<br />

glTexImage2D is called, the GPU driver copies the texture data<br />

into a temporary buffer, and queues it to be uploaded to VRAM.<br />

If the handle to the new texture is currently in use, as would most<br />

likely be the case with animation, the existing texture must be<br />

ghosted. This means that the current on-screen texture and the<br />

newly uploaded one must exist in VRAM at the same time. This<br />

involves allocating another buffer in VRAM. Customers<br />



generally want their applications to achieve a frame rate of 60<br />

frames per second (FPS), meaning all this needs to be done in<br />

about 16 milliseconds. The memory flow, under these<br />

conditions, can be seen in Figure 1.<br />


Performing two extra memory transfers and allocating two<br />

additional buffers are time-consuming operations. It is<br />

important to understand the implication of the glTexImage2D<br />

call and use it only in cases where the content is not frequently<br />

updated.<br />

III. SOLUTION<br />

An alternative to calling glTexImage2D to place a texture<br />

into VRAM is to make use of EGL Images. EGL Images are an<br />

OpenGL ES extension that allows the sharing of image data<br />

between two processes. OpenGL ES is very lenient with what it<br />

allows developers to do with EGL Images, and thus, developers<br />

are burdened with slightly more responsibility in exchange for<br />

increased flexibility. EGLImages are an important building block<br />

when displaying Video content as OpenGL ES textures. The<br />

reason the Khronos group came up with the idea of EGLImage<br />

was to be able to share buffers across rendering APIs (OpenVG,<br />

OpenGL ES, and OpenMAX) without the need for extra copies.<br />

IV. ADVANTAGES OF EGL IMAGES<br />

EGL Images are designed to be shared between processes.<br />

One thread can produce content into an EGL Image, while<br />

another consumes the content. For example: Thread A decodes<br />

an H.264 video stream and places the next frame’s data in an<br />

EGL Image. Thread B then displays this EGL Image, as a<br />

texture, in the YouTube application. An additional texture<br />

upload is unnecessary as the client application is directly editing<br />

memory that already exists in VRAM address space. In addition,<br />

when a texture is uploaded using glTexImage2D, the bits are<br />

automatically rearranged by the driver in a process called<br />

twiddling. This can increase performance when the texture is<br />

read multiple times, but the actual twiddling process is time-<br />
<br />
consuming and of little benefit if the texture is going to be used for only one frame. EGL Images are not twiddled by the driver upon upload. All of this results in substantial memory bandwidth and CPU usage savings.<br />

V. COMPLEXITIES INTRODUCED BY EGL IMAGES<br />

When all rendering operations for all running applications<br />

are complete for a given frame, the resulting image is stored in<br />

a buffer called the framebuffer. The framebuffer, like other<br />

graphics buffers, is stored in VRAM. Most OpenGL programs<br />

use a technique called double buffering; a technique that makes<br />

use of two separate framebuffers. The first framebuffer, referred<br />

to as the front buffer, is the buffer that is currently being<br />

displayed on the screen of the device. The second buffer,<br />

referred to as the back buffer, is the buffer that the GPU is<br />

asynchronously rendering new content into, off-screen.<br />

Each time the display updates, the front buffer and back buffer<br />

switch places. This means that the GPU never renders content<br />

directly to the screen and as a result, the user is only presented<br />

with frames that have been completely rendered. Were the GPU<br />

to render directly to the screen, data from two different frames,<br />

at some point in time, would be present on screen<br />

simultaneously. This effect, known as tearing, would be very<br />

noticeable to the user. OpenGL usually handles double buffering<br />

from behind the scenes as is the case with glTexImage2D.<br />

However, a drawback of EGL Images is that the developer must<br />

independently implement this technique if they wish to avoid<br />

tearing. This is generally accomplished by allocating two EGL<br />

Images and alternating between the two as new content is<br />

produced. In most use cases, producers of content write into<br />

these two buffers asynchronously from the consumer application<br />
<br />
that is reading them. This means that content can be produced<br />
<br />
faster than it is consumed, and the consumer discards the extra<br />

information. Clearly, the production and consumption of this<br />

EGL Image pair needs to be thread-safe. Any synchronization<br />

method can be used to accomplish this, but it is the responsibility<br />

of the developer to implement.<br />
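A minimal sketch of such a hand-rolled swap chain is shown below, using C11 atomics and plain buffer indices in place of real EGL Images. It is deliberately simplified: it shows only the publish/acquire handoff, and a production implementation would also have to stop the producer from rewriting a buffer while the consumer is still reading it (for example with a third buffer or a handoff flag).<br />

```c
#include <stdatomic.h>

/* Two-buffer swap chain handoff, with plain indices (0/1) standing in
 * for the two EGL Images. The producer publishes the index of the
 * buffer it just filled; the consumer always reads the most recently
 * published index, silently skipping frames produced in between. */
typedef struct {
    _Atomic int latest;  /* index of the most recently published buffer */
    int writing;         /* index the producer is currently filling     */
} swapchain_t;

/* Producer side: publish the finished buffer, move on to the other. */
void producer_publish(swapchain_t *sc)
{
    atomic_store(&sc->latest, sc->writing);
    sc->writing ^= 1;
}

/* Consumer side: pick whichever buffer was published last. */
int consumer_acquire(swapchain_t *sc)
{
    return atomic_load(&sc->latest);
}
```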

VI. RESULTS<br />

While EGL Images have been used successfully in many<br />

consumer devices, data sets comparing the use cases improved<br />

by EGL Images were not readily available. A sample application<br />

was written to compare EGL Image performance versus<br />

glTexImage2D under ideal conditions. The reference application<br />
<br />
uses the default OpenGL texture upload API (glTexImage2D) and<br />
<br />
serves as the baseline for benchmarking; the<br />
<br />
other application uses a zero-copy EGL Image texture<br />
<br />
and is implemented with DRM/DMA_BUF_EXT. The results<br />
<br />
are 5X faster than the classic copy implementation, as shown in<br />

Table 1.<br />




VII. PROGRAM DESCRIPTION<br />

A sample program was created to measure the performance<br />

gains realized from using EGL Images versus glTexImage2D. The<br />
<br />
application was architected so that as many components as<br />
<br />
possible could be shared regardless of the texturing method.<br />
<br />
While the data was gathered on a Linux-powered device, it was<br />

written entirely in native code. These efforts were taken to<br />

ensure minimal overhead, fair data points, and consistently<br />

reproducible results.<br />

When the application is executed using glTexImage2D, a<br />

single thread, known as the render loop, is spawned. The render<br />

loop function is executed once per frame. As seen in Figure 2,<br />

the content is generated, uploaded, and rendered serially.<br />

VIII. FOR DEVELOPERS<br />

If you are looking to update a texture in real time, note that there<br />
<br />
are two types of textures: those natively supported<br />
<br />
in OpenGL ES, and image formats not<br />
<br />
supported in OpenGL ES natively, which can be supported via<br />
<br />
additional extensions, e.g. GL_OES_EGL_image_external.<br />
<br />
Textures specified in this way can be sampled as textures or<br />
<br />
used as framebuffer attachments as if they were native objects.<br />
<br />
You cannot use a normal texture to render a camera or video preview;<br />
<br />
you have to use the TEXTURE_EXTERNAL_OES target defined by this extension.<br />
<br />
This extension provides a mechanism for creating EGLImage<br />
<br />
texture targets from EGLImages, and it defines the new<br />
<br />
texture target TEXTURE_EXTERNAL_OES.<br />

A. Example code snippets EGL/OpenGL<br />

When the application is executed using EGL Images, it<br />
<br />
spawns two threads. The first thread generates content, and the second thread, which consists of the render loop, consumes the content. The program flow for the EGL Image case can be seen in Figure 3. Note that the buffers referred to in the figure are in reference to the developer-created swap chain. An EGL Image is simply a texture whose content can be updated without having to re-upload to VRAM (meaning no call to glTexImage2D). One of the only drawbacks, besides increased code complexity, is that the application developer has to handle synchronization themselves.<br />
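The code figures for this section did not survive extraction. The fragment below is an illustrative reconstruction based on the Khronos EGL_KHR_image_base, EGL_EXT_image_dma_buf_import, and GL_OES_EGL_image_external specifications, not the paper's original snippet; `dmabuf_fd`, `width`, `height`, `display`, and `tex` are assumed to come from the application's setup code, and error handling is omitted.<br />

```
/* Import a dma-buf as an EGLImage and bind it as an external texture. */
EGLint attribs[] = {
    EGL_WIDTH,                     width,
    EGL_HEIGHT,                    height,
    EGL_LINUX_DRM_FOURCC_EXT,      DRM_FORMAT_ARGB8888,
    EGL_DMA_BUF_PLANE0_FD_EXT,     dmabuf_fd,
    EGL_DMA_BUF_PLANE0_OFFSET_EXT, 0,
    EGL_DMA_BUF_PLANE0_PITCH_EXT,  width * 4,
    EGL_NONE
};

EGLImageKHR image = eglCreateImageKHR(display, EGL_NO_CONTEXT,
                                      EGL_LINUX_DMA_BUF_EXT,
                                      (EGLClientBuffer)NULL, attribs);

glBindTexture(GL_TEXTURE_EXTERNAL_OES, tex);
glEGLImageTargetTexture2DOES(GL_TEXTURE_EXTERNAL_OES, image);
/* From here on, writing into the dma-buf updates the texture with no
 * glTexImage2D call and no copy. */
```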



You also need to change your fragment shader like this,<br />

adding the #extension declaration and declaring your<br />

texture uniform as samplerExternalOES:<br />

B. Example code snippets GLSL<br />
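The shader listing itself was lost in extraction; a minimal OpenGL ES 2.0 fragment shader of the kind described above would look like the following (the varying and uniform names are illustrative):<br />

```
#extension GL_OES_EGL_image_external : require
precision mediump float;

varying vec2 vTexCoord;               // from the vertex shader
uniform samplerExternalOES uVideoTex; // the EGLImage-backed texture

void main()
{
    gl_FragColor = texture2D(uVideoTex, vTexCoord);
}
```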

CONCLUSION<br />

While EGLImages are known and understood by select<br />

graphics experts, there exists a lack of documentation that<br />

prevents many from comprehending their implementation<br />

implications and performance impact. As illustrated by the<br />

aforementioned findings, the use of appropriate texture upload<br />

methods results in a significant performance improvement. Now<br />

that the memory flow, implementation details, and performance<br />

gains have been explained in this paper, hopefully more developers<br />
<br />
will start using EGLImages; they are well supported by the Arm Mali-400<br />
<br />
GPU on the Xilinx Zynq UltraScale+ MPSoC.<br />

ACKNOWLEDGMENT<br />

The author of this paper would like to thank Yashu Gosain,<br />

Glenn Steiner and Louie Valena for providing technical counsel<br />

& feedback for completing the paper.<br />

REFERENCES<br />

[1] J. Leech, “EGL_KHR_image_base.txt,” Khronos API Registry,<br />
<br />
December 1, 2010. [Accessed September 24, 2012]<br />

[2] J. Neider, T. Davis, M. Woo, OpenGL Programming Guide, Addison<br />

Wesley, 1993<br />

[3] The Android Open Source Project, (2010) Android (Version 4.0.4)<br />

[Source Code]. Available at http://sourceandroid.frandroid.com/frameworks/base/opengl/tests/gl2_yuvtex/<br />

[Accessed September, 24, 2012]<br />

[4] EGL_KHR_image_base<br />

https://www.khronos.org/registry/EGL/extensions/KHR/EGL_KHR_im<br />

age_base.txt<br />



Multicore Approach on AUTOSAR Systems,<br />

Performance Impact Analysis<br />

Eng. Roberto Agnelli<br />

Teoresi Group S.p.A.<br />

Torino, Italy<br />

roberto.agnelli@teoresigroup.com<br />

PhD. Niki Regina<br />

Teoresi Group S.p.A.<br />

Turin, Italy<br />

niki.regina@teoresigroup.com<br />

Abstract—In multicore system applications with a high<br />
<br />
degree of complexity, the large amount of data communication<br />
<br />
and the resulting management affect performance. The<br />
<br />
complex device drivers, which do not share a core with the basic<br />
<br />
software, have to use the runtime environment, while this is not<br />
<br />
strictly necessary in a single-core approach. Moreover, in a single-<br />
<br />
core application the interaction between different software<br />
<br />
components can be realized by specific interfaces. The present<br />
<br />
article shows that the usage of a multicore approach<br />
<br />
needs to take these different problems into account, in<br />
<br />
particular by considering the software architecture. Moreover,<br />
<br />
all these perspectives will be illustrated by real examples that<br />
<br />
highlight the loss of effectiveness and performance of the<br />
<br />
multicore compared to the single-core approach.<br />

Keywords—multi-core; single-core; software architecture;<br />

software components.<br />

I. INTRODUCTION<br />

The usage of multicore architecture in the automotive<br />

sector, and in particular in the safety-oriented systems, is<br />

becoming widespread due to the increase in computing power.<br />
<br />
When facing heavy application processing and hardware<br />
<br />
management, parallel computing is considered<br />
<br />
the most efficient solution. Moreover, considering new automotive<br />
<br />
standard regulations such as ISO 26262 for safety-critical<br />
<br />
systems, the required computational redundancy forces the use of<br />
<br />
the multicore approach more often than in the past.<br />

However, while this approach leads to an<br />
<br />
increase in computational power, it is also necessary<br />
<br />
to consider the effort required for the effective co-operation of the<br />
<br />
systems used for a specific functionality. For example, the<br />

tasks and the routines of an application layer communicate and<br />

trigger several events, and consequently it is also necessary to<br />
<br />
manage the low-level hardware and the operating system.<br />

Hence, a significant effort must be considered, both for the<br />
<br />
synchronization of the events placed in the various cores and<br />
<br />
for the coordination and management of low-level access,<br />

such as CAN bus communication and peripheral devices.<br />

Nowadays, it is also necessary to consider that the<br />

automotive AUTOSAR standard defines the fundamental<br />

software architecture and it organizes and simplifies the<br />

approach to the application development. However, if the<br />

AUTOSAR standard is used in a multicore approach, it must<br />

be considered that the basic software will manage all the<br />

requests coming from several software components and the<br />

complex device drivers placed in the other cores.<br />

For example, in multicore system applications each basic<br />

software access coming from a different core engages a<br />

synchronization with the operating system. This operation is<br />

definitely heavier than the one completed with a simple task<br />

switch in a schedule table of a single core. Moreover, if the<br />

modules of the complex device driver allocated to the<br />

management of the functionalities are in different cores, they<br />

need to use the runtime environment for the co-operation. This<br />

is not strictly necessary in a single core approach. Likewise, the<br />

complex device drives have to use the runtime environment in<br />

a multicore approach context if they do not share the core with<br />

the basic software.<br />

This article highlights the problems that the application of a multicore approach must take into account, with particular attention to the software architecture. The performance losses are illustrated with real examples. In particular, the paper focuses on the degradation of effectiveness and performance by comparing the multicore software architecture with the better-known single-core one.<br />
The article is divided into seven sections. Section II analyzes the general problems of a multicore architecture; sections III to VI are dedicated to specific problems, and section VII concludes.<br />

II. MULTICORE ARCHITECTURE GENERAL PROBLEMS<br />

A. The Event-Triggered Approach<br />

In an AUTOSAR multicore software architecture, the operating system runs on each core with independent applications and is synchronized by hardware and software timing procedures; a hardware counter defines the “timing slice”.<br />

www.embedded-world.eu<br />

785


The scheduling of the tasks can follow two different approaches:<br />
• The schedule table: this approach is easier to implement and saves the operating system a great deal of effort.<br />
• The event-triggered approach: this enables specific functionalities to be triggered on request, using specific alarms.<br />

With the second approach, the operating system can also manage category 2 interrupts and start different tasks. In this way, a task executing on a specific core can be interrupted by bus messages or by high-priority requests coming from other cores.<br />
Moreover, the event-triggered approach can manage the scheduling of the different cores through specific events. It is also possible to divide a complex real-time process into several sections and split them across the different cores. These features explain the widespread usage of this approach compared to the schedule table.<br />
However, all the advantages just cited create an operating system overhead of around 20% compared to the first approach. In fact, account must be taken of:<br />
• the triggering and management of interrupts<br />
• the context switch<br />
• the determination of the request origin<br />
• the destination process<br />
• and so on.<br />
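The sources of overhead above can be made concrete with a minimal, host-runnable sketch (this is an illustrative model, not AUTOSAR OS code; the task and event identifiers are invented). A schedule table dispatches with a single precomputed lookup, while an event-triggered dispatcher must first scan pending events to determine the request origin before routing to the destination task:<br />

```c
#include <assert.h>
#include <stdint.h>

/* Illustrative model only, not AUTOSAR OS code; task/event ids invented. */

#define NUM_TASKS 4

/* Schedule table: the task to run in each time slice is precomputed,
 * so dispatch is a single array lookup. */
static const uint8_t schedule_table[8] = {0, 1, 0, 2, 0, 1, 0, 3};

static uint8_t table_dispatch(uint32_t tick) {
    return schedule_table[tick % 8];          /* O(1), no origin analysis */
}

/* Event-triggered: pending events must be scanned to determine the
 * request origin before the destination task is known. */
static uint32_t pending_events;               /* one bit per source */
static const uint8_t event_to_task[NUM_TASKS] = {0, 1, 2, 3};

static void raise_event(unsigned source) { pending_events |= 1u << source; }

static int event_dispatch(void) {             /* returns task id, -1 if idle */
    for (unsigned src = 0; src < NUM_TASKS; ++src) {  /* find origin */
        if (pending_events & (1u << src)) {
            pending_events &= ~(1u << src);   /* acknowledge the request */
            return event_to_task[src];        /* route to destination */
        }
    }
    return -1;
}

/* Helper: raise an event from a given source and dispatch it. */
static int demo_roundtrip(unsigned src) {
    raise_event(src);
    return event_dispatch();
}
```

The extra scan and acknowledgement in `event_dispatch` stand in for the origin determination, interrupt handling and context switch that the real operating system must perform on every cross-core request.<br />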

B. OS Application & Memory Constraint<br />

Another important aspect of the multicore software architecture that can limit its performance is memory management and the implementation of the software structures for memory access.<br />

A multicore system uses the OS Application concept: it identifies a functional unit of software, which is then assigned to a specific core. The OS Application defines the regions in which the memory operates and allocates the tasks, the alarms and the interrupts.<br />
However, this imposes some constraints on software execution: for example, all the runnables of a software component have to belong to the same OS Application. This is a remarkable difference between single-core and multicore applications. In the former, all the processes can belong to the same OS Application, and all the resources can be shared with a gain in performance. In the latter, processes that belong to different cores sit in different OS Applications, and the need for communication between these cores leads to a significant decrease in performance.<br />
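The constraint can be stated in a few lines of C (an illustrative model only; the OS Application descriptors and core assignments below are invented, not generated AUTOSAR configuration): a runnable's OS Application fixes its core, so a call between runnables whose OS Applications sit on different cores necessarily goes through the cross-core machinery.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Illustrative model only; OS Application and core assignments invented. */

typedef struct {
    int core;            /* core this OS Application is bound to   */
    int mem_region;      /* memory region its tasks/alarms live in */
} OsApplication;

static const OsApplication os_apps[] = {
    { .core = 0, .mem_region = 0 },   /* OsApp 0: AUTOSAR core   */
    { .core = 1, .mem_region = 1 },   /* OsApp 1: secondary core */
};

/* A call between runnables stays cheap only if both OS Applications are
 * bound to the same core; otherwise cross-core communication is needed. */
static bool needs_cross_core(int caller_app, int callee_app) {
    return os_apps[caller_app].core != os_apps[callee_app].core;
}
```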

C. Scheduling<br />

In a single-core application it is possible to design the software architecture so that complex procedures, for example those which require frequent interaction with each other, are in the same task of the OS Application. This design method allows RAM to be shared with a high level of performance, and several applications use this approach with proven results.<br />

On the contrary, in a multicore software architecture the complex procedures that need to interact across different cores necessarily sit in different OS Applications, and consequently in different software stacks.<br />
For example, if a runnable in a different OS Application is called through an alarm, a context switch is started, with non-negligible processing time and memory costs. Moreover, the AUTOSAR standard foresees that specific consistency mechanisms ensure communication between different OS Applications without data corruption; this adds a further increase in computational load.<br />
Comparing the single-core and multicore software architectures, the following important differences emerge:<br />

• Single core: it is possible both to predict the correct task sequences through the schedule table and to optimize system performance by minimizing context switches.<br />
• Multicore: if high interaction between different cores is needed, it is not easy to predict the scheduling of each core. Moreover, possible occurrences and worst cases must be taken into account in order to set up the deadlines for all the tasks correctly.<br />

D. Memory Boundary<br />

Sharing data between different cores creates problems of authorization and contention for memory accesses. The AUTOSAR safety architecture implements a memory protection mechanism that acts on a specific peripheral of the microprocessor, the Memory Protection Unit (MPU). The MPU is used when AUTOSAR requires an operating system scalability class of three or higher.<br />
The MPU is responsible for assigning exclusive access permissions to a specific memory area and for protecting it from accesses belonging to other OS Applications. Configuring it correctly can take a long time.<br />

As for the OS Application and memory constraints, task scheduling differs between single-core and multicore software architectures. In the former it is possible to work within a single OS Application and consequently within a single memory region. In the latter it is much more complicated: exclusive regions must be defined where the different processes can share their data. In this case there are specific buffer memory regions with shared access permissions (see Fig. 1). Here, support variables are widely used, protected by semaphores during write and read operations.<br />

As highlighted in the previous section, the multicore software architecture suffers from this problem. The MPU and its supporting machinery bring an overhead into the system due to the copies, the procedures and the other operations necessary to preserve the integrity of the data. Several tests make it clear that the advantage of the parallelism introduced by the multicore approach is considerably reduced by all the structures and operations needed for the interaction of the different processes.<br />

Fig. 1 shows the shared memory concept. If data in application core 1 must be used by application core 2, the following operations must be carried out:<br />
1. Copy of the data<br />
2. Sharing of the memory region<br />
3. Access to the data from the other core<br />
Fig. 1 Shared memory<br />
III. IOC<br />
In the previous chapter some of the principal problems have been highlighted. However, the AUTOSAR standard provides specific tools to manage the requests between different cores. The most important is the Inter OS Application Communicator (IOC).<br />
This tool is provided as part of the multicore operating system. It can be used through two different elements: the RunTime Environment (RTE) software component, and specific low-level sections of the software; these elements have to be part of the AUTOSAR stack or of a Complex Device Driver.<br />
An IOC transfer proceeds in three phases:<br />
• Protection with a spinlock for the integrity of the data. In this phase the source process takes the spinlock in order to avoid concurrent access.<br />
• Copy of the data into the shared buffer. Once the first phase concludes, the data is copied towards the destination process; at the end of this phase the spinlock is released.<br />
• Production of the trigger event for the destination process. At the end of the procedure, the source process sets a trigger that allows the destination process to schedule the reading of the data.<br />
It is important to underline that an RTE procedure does not require the IOC if both processes are in the same OS Application, or better still on the same core; in this case the RTE performs very well. Hence, as the number of interactions between elements or processes located on different cores grows, the communication overhead increases, reducing the advantage of multicore parallelism.<br />
Fig. 2 depicts the IOC concept. Processes on the same core can operate in the same OS Application with specific tools. On the contrary, the use of the IOC is mandatory if the processes communicate across different cores.<br />
Fig. 2 IOC Architecture<br />
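The three phases of an IOC transfer can be sketched on a desktop host with C11 atomics standing in for the OS spinlock and event services (a minimal sketch only; the channel layout and function names are invented, whereas a real IOC is generated from the AUTOSAR configuration):<br />

```c
#include <assert.h>
#include <stdatomic.h>

/* Host-runnable sketch of the three IOC phases; names are invented and a
 * real IOC is generated by the AUTOSAR tooling, not hand-written. */

typedef struct {
    atomic_flag lock;        /* phase 1: spinlock protecting the buffer */
    int         buffer;      /* shared buffer between the two cores     */
    atomic_int  data_ready;  /* phase 3: trigger event for destination  */
} IocChannel;

static IocChannel ch = { .lock = ATOMIC_FLAG_INIT };

/* Source core: take the lock, copy into the shared buffer, release the
 * lock, then raise the trigger event. */
static void ioc_send(int value) {
    while (atomic_flag_test_and_set(&ch.lock)) { /* spin */ }
    ch.buffer = value;                       /* phase 2: copy          */
    atomic_flag_clear(&ch.lock);             /* release the spinlock   */
    atomic_store(&ch.data_ready, 1);         /* phase 3: trigger event */
}

/* Destination core: once the trigger is seen, copy the data out under
 * the same lock. Returns 1 if data was read, 0 if nothing was pending. */
static int ioc_receive(int *out) {
    if (!atomic_exchange(&ch.data_ready, 0))
        return 0;
    while (atomic_flag_test_and_set(&ch.lock)) { /* spin */ }
    *out = ch.buffer;
    atomic_flag_clear(&ch.lock);
    return 1;
}

/* Helper: one full send/receive round trip. */
static int ioc_demo(void) {
    int v = -1;
    ioc_send(42);
    (void)ioc_receive(&v);
    return v;
}
```

Even in this toy form, every transfer costs a lock acquisition, a copy and a trigger, which is exactly the overhead the paper attributes to cross-core communication.<br />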

IV. INTER PROCESS COMMUNICATION<br />

In the AUTOSAR standard the RTE is responsible for the software management of the functionalities. All the software and hardware components below this “mask” can be considered to be at the same level from the system architecture point of view. In particular, the RTE provides interfaces that standardize the communication method, so the interactions between the different software components look similar in a single-core or multicore approach.<br />
However, it is important to highlight the main difference between the two methods: in the single core, resource management is usually focused on execution time, whereas the multicore approach gives prime importance to the parallelism of the processes.<br />



Moreover, a multicore software architecture needs the IOC, as described in the previous chapter. In a single core, software components that need to share information or functionality use direct client/server or sender/receiver ports, which saves execution time compared to the multicore approach. Hence, even though the RTE interfaces are identical in the multicore software architecture, all the core-to-core communication problems already cited remain.<br />

Fig. 3 Application-Application Communication<br />

Fig. 3 shows the multicore software architecture. If the software components are on the same core there is direct communication, always managed by the RTE. If cross-core communication between two software components on different cores is needed, the IOC must be used, with all the problems cited in section III.<br />

V. CDD-CDD COMMUNICATION<br />

The Complex Device Driver (CDD) is responsible for all the elements that cannot be managed through the basic software modules. Each feature not present in the AUTOSAR stack, such as access to sensors, actuators or peripheral devices, is implemented in a CDD. Moreover, a CDD contains several code sections that do not belong to a specific software component. The code and elements in the CDDs represent a significant portion of the ECU software. In a multicore approach they are distributed across the cores according to peripheral availability or load distribution: functionalities that require a high load go to secondary cores, while safety functionalities go to lockstep cores.<br />
For example, in a multicore approach the sensor functionalities can be split across cores: one core can be responsible for accessing and managing the peripheral device, parsing and reading the data, while a lockstep core can be used for the safety functionalities.<br />

In this scenario too, CDD-to-CDD communication suffers from the problems already cited. In a single core, global variables or specific function calls can be used for data communication. On the contrary, in a multicore software architecture the resources (processing time, memory variable allocation) are independent. As for inter-process communication, the IOC services of the operating system must be used: the CDD sections necessarily pass through the RTE and then use the core-to-core communication.<br />

Fig. 4 represents the CDD communication process. If the two CDDs are on the same core, the information exchange is faster and easier. On the contrary, if two CDDs on different cores need to communicate, the use of the IOC is essential, which is an additional problem in terms of performance.<br />

VI. BASIC SOFTWARE COMMUNICATION<br />

The functionalities of the AUTOSAR stack cover all the relevant aspects, such as communication, diagnosis, ECU management and memory access. All these aspects are made available via a vendor tool and are integrated with the RTE and the operating system.<br />
The basic software resides on a single specific core, and any access to the stack functionalities is accomplished inside it. Any request from a secondary core to a basic software functionality must pass through all the cross-core communication tools, and the problems cited previously arise (memory accesses, delayed task activation, changes of scheduling, and so on). On the contrary, in a single-core software architecture an API can activate some of the modules, at least for the CDDs, followed by the call to the RTE for the software.<br />

As an example, consider the Diagnostic Event Manager of AUTOSAR, the module responsible for diagnosis, which collects all the problems of a running execution. In a multicore software architecture, if a software component on a secondary core needs to raise a diagnostic trigger, the access must always follow the same procedure: a cross-core call through the RTE, with the possible overhead already discussed.<br />
Accesses from the basic software or the CDDs to the AUTOSAR core can use a specific diagnostic API module available to the low-level software. This option is definitely useful for managing the various events and improves performance by saving processing time. On the contrary, accesses from the AUTOSAR core to the secondary cores use the Diagnostic Communication Manager (DCM) module, which is responsible for the services activated by the UDS protocol. The DCM service also gives the user some utilities for diagnosis testing and debugging.<br />

In a multicore software architecture it can be necessary to activate, on the secondary cores, some functionalities managed by the UDS protocol. Two different cases can be considered:<br />
1. If the software routine runs on a secondary core, the activation follows the cross-core communication procedure already described.<br />
2. If the software routine is on the same core as the basic software, the messages are received and processed by immediately triggering an RTE call to the specific function.<br />
Fig. 4 CDD-CDD Communication<br />

Fig. 5 Basic Software Communication<br />

Fig. 5 describes the basic software communication in a multicore software architecture. In this picture, Core 1 is the AUTOSAR core. For a call from OS Application 2 to a functionality of CDD 2, it is possible to bypass the IOC and use a direct API call. On the contrary, a call from a software component in OS Application 1 must pass through the IOC, with all the problems already described.<br />
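The routing rule of the two cases above can be captured in a tiny dispatcher (a hedged sketch only: the function, enum and core names are invented, and the real DCM/RTE interfaces are generated from the AUTOSAR configuration):<br />

```c
#include <assert.h>

/* Illustrative dispatcher only; names are invented, the real DCM/RTE
 * interfaces are generated from the AUTOSAR configuration. */

#define BSW_CORE 0   /* core hosting the basic software */

typedef enum { VIA_DIRECT_RTE_CALL, VIA_CROSS_CORE_IOC } ActivationPath;

/* Case 1: routine on a secondary core -> cross-core procedure (IOC).
 * Case 2: routine co-located with the basic software -> immediate RTE
 * call, with no IOC overhead. */
static ActivationPath activate_uds_routine(int routine_core) {
    return (routine_core == BSW_CORE) ? VIA_DIRECT_RTE_CALL
                                      : VIA_CROSS_CORE_IOC;
}
```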

VII. CONCLUSION<br />

The choice between a single-core and a multicore software architecture needs particular attention. Even if the multicore approach is becoming more popular thanks to process parallelism and higher computing power, it does not always deliver better performance than the more classical single-core approach.<br />

All the problems highlighted in this article underline this last statement. The overhead created by communication, synchronization and memory management increases the load on the operating system. Before choosing between a single-core and a multicore architecture, it is important to consider the interactions between the cores and how the processes are placed on them.<br />

Given all the problems cited in this article, it is fundamental to have a clear view of the microprocessor and the software architecture in order to estimate the operating system load in the multicore case: if the processes are independent, the multicore solution is preferable. Otherwise, the advantages of the multicore shrink and the classical single-core architecture still represents an optimal choice.<br />

REFERENCES<br />
[1] M. Becker, D. Dasari, V. Nélis, M. Behnam, L. M. Pinho, T. Nolte, “Investigation on AUTOSAR-Compliant Solutions for Many-Core Architectures,” 2015 Euromicro Conference on Digital System Design, pp. 95-103.<br />
[2] AUTOSAR, standard 4.2, www.autosar.org.<br />
[3] ISO/DIS 26262-1 – Road Vehicles – Functional Safety, International Organization for Standardization / Technical Committee Std., 2009.<br />
[4] B. B. Brandenburg, J. H. Anderson, “On the Implementation of Global Real-Time Schedulers,” 2009 30th IEEE Real-Time Systems Symposium.<br />
[5] R. Nicole, “Comparison of Service Call Implementations in an AUTOSAR Multi-Core OS,” 9th IEEE International Symposium on Industrial Embedded Systems (SIES), June 2014, pp. 199-205.<br />
[6] AUTOSAR – Guide to Multi-Core Systems, AUTOSAR Std. V1.1.0, Rev. R4.1 Rev3, 2014.<br />



Applied Machine Learning on Low-energy<br />

Platforms<br />

Running Machine Learning Optimally on Heterogeneous, Low-energy Arm Platforms<br />

Robert Elliott<br />

Technical Director, Machine Learning<br />

Arm Ltd.<br />

Cambridge, England<br />

Mark O’Connor<br />

Director, Deep Learning<br />

Arm Ltd.<br />

Grasbrunn, Germany<br />

Neural network frameworks such as TensorFlow, PyTorch and<br />

Caffe have revolutionized machine learning and computer vision on<br />

desktop PCs and on servers in the cloud, and are poised to do the<br />

same at the edge. But running these frameworks optimally and<br />

within a low-power budget provides one of the biggest challenges yet<br />

for developers.<br />

To help, modern systems-on-chip (SoCs) offer a variety of<br />

processor core types – CPUs, GPUs, DSPs and other accelerators –<br />

each suited to different parts of typical machine learning pipelines.<br />

But mapping these frameworks to run seamlessly across these cores,<br />

whilst minimizing power-sapping operations such as memory copies,<br />

can be complex and time-consuming to implement. Optimizing for<br />

one platform is often challenge enough, but with such a huge variety<br />

of potential target platforms, the prospect of optimizing for each one<br />

limits the feasibility of write-once applications that run optimally<br />

across multiple devices.<br />

This paper looks at the development environments available on<br />

low-energy platforms and how middleware libraries can simplify the<br />

process of reaching high efficiency. It also explores the tools and<br />

techniques available for deploying a neural network on these<br />

platforms. This is illustrated with examples, highlighting some of the<br />

work Arm is doing to enable machine learning wherever compute<br />

happens.<br />

Finally, we look ahead to approaches being proposed for future<br />

heterogeneous systems and future network optimization techniques<br />

which could provide significant performance and efficiency<br />

improvements in future systems. What will the implication be for the<br />

world of machine learning once these tools and frameworks have<br />

evolved, enabling this exciting set of new use cases across embedded<br />

devices of all shapes and sizes?<br />

Machine Learning; TensorFlow; Caffe; Mobile; Android; Mali<br />

ARMv8.2-A; Arm Cortex-A; Arm Cortex-M; Neural Network;<br />

Artificial Intelligence; Arm Accelerator; Arm Compute Library;<br />

Arm inference engine; ACL<br />

I. MACHINE LEARNING’S ADOPTION AND IMPORTANCE<br />

Machine learning is the term of the moment; no matter which<br />

part of technology you work in it’s likely that you’re hearing a<br />

lot about it. You’d be forgiven for thinking it’s a fad, but there<br />

has been such a rapid adoption of machine learning algorithms<br />

for solving key problems that it’s proving to be much more than<br />

that. At Arm, we’re in the fortunate position of being able to<br />

observe this adoption over a wide array of markets, from mobile<br />

phones and smart homes to agriculture and servers. One thing is clear: machine learning is solving real problems – such as face recognition, object detection and scene segmentation – with amazing accuracy [1][2][3][4][5]. It has also become clear that the<br />

availability of large data sets, along with improving techniques<br />

and network complexity, is making it possible to deploy<br />

machine learning on embedded SoCs.<br />

Over the past few years it has become possible to beat human accuracy in a number of applications. For some key problems, such as identifying objects and understanding spoken words, the problem seems solved at the algorithmic level; a fantastic summary of the problems being solved, and why you should care, is maintained by the EFF [1], providing insight into why everyone should be paying more attention to the advances in this field. More recently there has also been significant work on reducing computational requirements [7][6] to make these solutions work within very limited processing budgets [8][10]. In practice, this means we have entered the age of machine learning on practically any SoC, and on many devices within that system.<br />

II. MODERN SYSTEM-ON-CHIP DESIGNS<br />

As most will know, modern SoCs comprise a number of key parts common to most designs, along with other elements chosen to solve the problems specific to the target market. The key components of a design are the memory subsystem and CPU, and often a GPU and display controller. There are then more market-specific functions, such as high-resolution video decode and image processing for camera sensor input.<br />

For machine learning this has meant two things: firstly, due to its pervasiveness, all devices in the embedded platform need good machine learning performance. Secondly, the stretch for performance density in the most demanding cases is, once again, pushing into the world of dedicated accelerators that can make the hard domain-specific decisions needed.<br />

Figure 1: A modern embedded or mobile System on Chip<br />

The recent trend, which we argue will not be short-lived, is the introduction of optimizations and dedicated accelerators for machine learning. Since the applications of machine learning are so wide-ranging, the expectation is that many future systems – whether they contain accelerators or not – will have a high requirement for machine learning performance. This leads to the need to achieve high performance across all compute devices present in the system, be they CPUs with dedicated instruction sets, GPUs with extended fixed-function units for ML algorithms, or bespoke hardware accelerating one or more classes of neural network.<br />

A. Fitting machine learning use cases to platforms<br />

You may be wondering what kinds of use case can map to the wide array of devices in an SoC. In practice there are quite a few promising areas, from the very small to the very big, where traditional algorithms can be improved, or new algorithms deployed, that exploit machine learning. The key benefit is replacing the complex, fragile, laborious and impenetrable sequences of conditional logic built up to manage the complex behaviors arising from seemingly simple inputs.<br />
1) Microcontrollers<br />

• Power management and scheduling – when deciding how and when to adjust operating points, reassign tasks to different cores and balance throughput against efficiency, schedulers can benefit from training on known good behaviors, followed by reinforcement learning to become better suited to the specific device they are running on. This approach can also use data for which it would be impossible to code rules, such as detailed cache state and the access patterns of programs.<br />
access patterns of programs.<br />

• Security auditing – where patterns of behavior of a<br />

system in normal use can be observed and abnormal<br />

behaviors can be caught quickly[16].<br />

• Object detection – to enable low power modes as part of<br />

complex actions, such as unlocking a mobile phone or<br />

taking a camera image, without using up a battery too<br />

quickly, or basic object detection to reduce false-positives<br />

when waking a camera (over and above a simple PIR or<br />

change detection).<br />

• Key-word or Key-noise spotting[20] – as part of a<br />

connected world, having simple detection devices spread<br />

out as part of an interactive environment or as part of a<br />

security system detecting atypical sounds like breaking<br />

glass.<br />
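The microcontroller use cases above typically run small quantized networks. As a flavor of the inner kernel involved, here is a minimal 8-bit integer fully-connected layer in plain C, of the sort a key-word spotter evaluates per audio frame (an illustrative sketch only: the power-of-two requantization scheme, the sizes and the sample weights are invented, not taken from any particular library):<br />

```c
#include <assert.h>
#include <stdint.h>

/* Minimal int8 fully-connected layer; quantization scheme (power-of-two
 * rescale) and all dimensions are illustrative assumptions. */

#define IN_DIM  4
#define OUT_DIM 2

static void fc_int8(const int8_t in[IN_DIM],
                    const int8_t weights[OUT_DIM][IN_DIM],
                    const int32_t bias[OUT_DIM],
                    unsigned shift,              /* requantization shift */
                    int8_t out[OUT_DIM]) {
    for (int o = 0; o < OUT_DIM; ++o) {
        int32_t acc = bias[o];                   /* 32-bit accumulator   */
        for (int i = 0; i < IN_DIM; ++i)
            acc += (int32_t)in[i] * weights[o][i];
        acc >>= shift;                           /* rescale back to int8 */
        if (acc > 127) acc = 127;                /* saturate             */
        if (acc < -128) acc = -128;
        out[o] = (int8_t)acc;
    }
}

/* Helper: run the layer on invented sample data; packs both outputs
 * into one int for easy checking. */
static int fc_demo(void) {
    const int8_t in[IN_DIM] = {1, 2, 3, 4};
    const int8_t w[OUT_DIM][IN_DIM] = {{1, 1, 1, 1}, {2, 0, -2, 0}};
    const int32_t b[OUT_DIM] = {0, 4};
    int8_t out[OUT_DIM];
    fc_int8(in, w, b, 1, out);
    return out[0] * 1000 + out[1];
}
```

Keeping the accumulator at 32 bits while storing weights and activations at 8 bits is what lets such kernels fit the memory and power budgets of microcontroller-class devices.<br />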

2) CPUs<br />

• Implementation of SLAM techniques for autonomous<br />

vehicles – navigating the world is possible on CPUs – not<br />

always for fast moving cars or drones, but for those with<br />

constrained movement, it’s often sufficient to work on<br />

CPUs.<br />

• Simple natural language processing for robotics and<br />

home devices – recognizing requests, particularly<br />

complex or compound statements, needs to work even<br />

when network connectivity is spotty or unavailable. This<br />

is possible on CPU or GPU today.<br />

• Face recognition/identification – allowing entry to<br />

shared areas or recording participants of a meeting while<br />

maintaining privacy by keeping data local.<br />

3) GPUs<br />

• Complex NLP – answering questions rather than acting<br />

on commands.<br />

• Secure face identification – the additional steps taken for<br />

anti-spoofing and low delay for the user when being used<br />

for activities like unlocking a mobile phone.<br />

• Image processing (such as style transfer networks[15])<br />

and face point registration – used extensively in the<br />

social media world to provide entertaining content for<br />

users<br />

4) Dedicated Accelerators<br />

• High resolution scene extraction – for safety during fast<br />

movement, automobiles require low latency to respond<br />

quickly. Moving this processing to dedicated accelerators<br />

can notably reduce cost.<br />

• Complex NLP for non-connected responses – the cost<br />

of processing large amounts of audio can be a notable<br />

TCO expense for busy services. Moving this to more<br />

efficient accelerators can be driven by cost.<br />

• Faster face identification (mobile phone face unlock)<br />

– premium experiences where fast response to the user is<br />

key.<br />

III. INTRODUCTION TO SOFTWARE APPROACHES<br />

To manage all of the aforementioned devices and provide a manageable experience, we also need a stable, performance-optimized software stack that does some of the heavy lifting of device selection and routine tuning.<br />

There are many ways to deploy machine learning on embedded platforms. Today the most common are bespoke frameworks, which deliver performance on specific platforms or processors, or running the full framework on the CPUs available in the platform. The first of these choices has issues of portability; the second, issues with the level of optimization of the software. To help with this problem, we are curating a list of frameworks and libraries with support for Arm hardware at:<br />
https://developer.arm.com/technologies/machine-learning-on-arm/frameworks<br />

For the widest deployment of a network architecture where the goal is functionality and reasonable performance, using a full machine learning framework deployed with a general backend is the best choice today. This provides the widest functional support, and the ability to modify and experiment with networks if needed.

However, this route does not provide the maximum performance possible. For that, it's necessary to focus on optimized libraries with a narrower applicability – something that requires a specifically optimized inference engine. This is the approach being taken [21] for Android deployment, and something which Arm is making available for other embedded platforms via the Arm software stack, which we will explore further here.

For ongoing optimizations of machine learning primitives and a stable inference engine, the approach we are supporting is:

Arm's inference engine – an optimized inference engine for 32-bit float and 8-bit integer:
https://developer.arm.com/technologies/arm-nn-sdk

Arm's Compute Library – optimized low-level routines for computer vision and machine learning, focusing on CNNs for 32-bit float and 8-bit integer today, across a wide array of Arm CPUs and GPUs:
https://developer.arm.com/technologies/compute-library
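To make concrete what such low-level routines compute, here is a naive reference 2D convolution for a single channel. This is purely an illustrative sketch of the arithmetic (our own code, not a Compute Library routine); the point of an optimized library is to produce the same result with vectorized, cache-blocked kernels rather than this simple loop nest.

```cpp
#include <cassert>
#include <vector>

// Reference single-channel 2D convolution (valid padding, stride 1).
// Input is inH x inW, kernel is kH x kW, both stored row-major.
std::vector<float> conv2d(const std::vector<float>& in, int inH, int inW,
                          const std::vector<float>& k, int kH, int kW) {
    const int outH = inH - kH + 1;
    const int outW = inW - kW + 1;
    std::vector<float> out(outH * outW, 0.0f);
    for (int y = 0; y < outH; ++y)
        for (int x = 0; x < outW; ++x)
            for (int ky = 0; ky < kH; ++ky)
                for (int kx = 0; kx < kW; ++kx)
                    out[y * outW + x] +=
                        in[(y + ky) * inW + (x + kx)] * k[ky * kW + kx];
    return out;
}
```

Optimized implementations often lower this loop nest to a GEMM (matrix multiply) so that the same highly tuned kernel serves many layer shapes.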

A. The different approaches available for machine learning deployment in an embedded system

1) Direct integration
The direct integration approach looks to embed libraries and routines directly into the codebase of the machine learning framework. This means either call-by-call operation on the layers of a neural network, or a runtime handover of a graph representing the network to be operated upon.

This might, at first, seem like a promising approach to keeping the development environment the same, but in practice we often find that overheads introduced by running the full framework are detrimental to overall performance or impractical for shipping in production. This approach also limits the time available for optimization, as there is no offline step where the full neural network graph can be observed and modified for better performance.

One area where this approach is particularly useful, however, is training, where full flexibility and access to the vast set of already implemented operators is key.

2) Importing from a graph description file
The file import approach takes the output graph and trained weights from a machine learning training framework and converts this to target a specifically designed inference engine. This conversion can happen as a compilation step or as a runtime step. The inference engine is used to run one or more graphs by implementing a subset of the functionality found in a full training and inference machine learning framework.

This allows for a much smaller runtime, where the capabilities are constrained to meet memory limitation targets, and also provides an opportunity to perform an offline optimization stage and further improve performance. Critically, this also allows a decoupling of the machine learning training framework from the production deployment, which tends to improve the design and efficiency of inference engines targeting different platforms.

Quite often this approach is the right practical choice for deploying machine learning solutions today.
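The decoupled structure just described can be reduced to a small 'parse, optimize, run' skeleton. The sketch below is entirely hypothetical (the types and names are ours, not any real engine's API, and it maps scalars rather than tensors); it shows only the shape of such an engine.

```cpp
#include <cassert>
#include <functional>
#include <string>
#include <vector>

// Hypothetical miniature inference engine illustrating the
// import -> optimize -> run split. Names are illustrative only.
struct Node {
    std::string op;
    std::function<float(float)> fn;  // toy: each op maps float -> float
};

struct Graph {
    std::vector<Node> nodes;
};

// "Import": build a graph from a (toy) trained-model description.
Graph importGraph() {
    return {{{"scale",    [](float x) { return x * 2.0f; }},
             {"identity", [](float x) { return x; }},
             {"bias",     [](float x) { return x + 1.0f; }}}};
}

// "Optimize": an offline pass over the whole graph, e.g. dropping no-ops.
Graph optimize(const Graph& g) {
    Graph out;
    for (const auto& n : g.nodes)
        if (n.op != "identity") out.nodes.push_back(n);
    return out;
}

// "Run": execute the optimized graph on an input (here, a scalar).
float run(const Graph& g, float x) {
    for (const auto& n : g.nodes) x = n.fn(x);
    return x;
}
```

The key property is that `optimize` sees the whole graph before any execution, which is exactly the offline step the direct-integration approach lacks.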

3) Compilation flows
Compilation flows, though long imagined, are just beginning to appear and represent a promising approach to higher performance when running neural networks. The two key advantages are exploitation of the existing corpus of compilation expertise in compiler frameworks and developers (the graph optimization problem is well known in these circles), and the ability to properly represent and optimize tensor processing and its mapping onto modern CPUs, GPUs and hardware accelerators.

The compilation of graphs can result in the fusion of operations, which reduces memory footprint (also exploited in other approaches, but potentially more thoroughly with a compiler flow). It can also reduce active memory footprint and bandwidth, by working on a smaller data set which fits in cache and makes better re-use of data loaded from memory.

Today, however, there are still hurdles to overcome in the optimization process, and it's often more practical to work with direct integration, as this allows full flexibility for training.
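A concrete instance of the fusion mentioned above is folding a batch-norm into the preceding convolution or fully-connected layer, so one fused operation replaces two passes over memory. The sketch below shows just the arithmetic of the fold for a single weight (our own illustration, not engine code).

```cpp
#include <cassert>
#include <cmath>

// Batch norm applied after a linear op:
//   y = gamma * (w*x + b - mu) / sqrt(var + eps) + beta
// folds into a single linear op y = w'*x + b' with:
//   s = gamma / sqrt(var + eps),  w' = w * s,  b' = (b - mu) * s + beta
struct Folded {
    float w;
    float b;
};

Folded foldBatchNorm(float w, float b, float gamma, float beta,
                     float mu, float var, float eps = 1e-5f) {
    const float s = gamma / std::sqrt(var + eps);
    return {w * s, (b - mu) * s + beta};
}
```

In a real network the same fold is applied per output channel of the convolution, and the batch-norm layer disappears from the graph entirely.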

B. System software design

Practically speaking, any of the aforementioned approaches is a reasonable way to get machine learning networks running on an embedded platform, and Arm's approach has been to develop a low-overhead inference engine with the ability to import from file. This allows the same framework to target both Cortex-A class cores, found in high-end mobile, and Cortex-M class cores, found in processing environments with just kilobytes of memory to play with.

Arm expects machine learning to become a natural part of programming environments, requiring support not just for large networks executing on accelerator hardware, but also for tiny embedded networks, situated as a natural part of the program execution [13]. Being in a position to specify the system design and software allows us to design and balance every element to ensure the most efficient and cost-effective designs that meet the rapidly evolving needs of a machine learning-based world.

Figure 2: Arm's machine learning software and platform stack

As mentioned in the introduction, machine learning is seen to have relevance to all classes of device, and so naturally the software seeks to enable and exploit this. To translate this view of the world into a useable software stack, we have developed the Arm inference engine to allow work to be distributed to devices and take advantage of the key optimizations of each.

Figure 3: Arm's inference engine software

IV. ACCELERATING NEURAL NETWORKS

To get the highest performance, it's often preferable to run a smaller codebase that is dedicated as an inference engine, rather than a full machine learning framework. This approach requires preparation of a model and weights in a training environment, using a machine learning framework, then taking this model and weight set and using a converter to prepare an optimal representation for the inference engine.

Figure 4: The Arm usage flow

The major stages used in this process are:

• Import or build the graph
  o Take the graph as input from a TensorFlow pb file or Caffe caffemodel/prototxt
  o Alternatively, build the graph 'manually' from within your application using the runtime graph-building API
  o This graph represents the network architecture and the weights to be used
• Run the optimization process to allow the engine to optimize the graph and operators
  o Optimize the graph, replacing suitable sequences with single operations, fusing stages
  o This emits the tuned graph object, which can be passed input and output tensors for an instance of processing
• Run the inference process on the optimized graph
  o This can be repeated as needed with additional input data

A framework that targets all devices in the system makes it easy to select and schedule between them. We're working on optimizing each of these paths to deliver maximum performance on every device. We have also open-sourced our Compute Library with optimized primitives under a permissive license, so you can take a look, compile it for other platforms and provide feedback.

A. Heterogeneous performance
Our experience with high-performance math libraries shows that – despite the promise of advanced compilation flows for multi-device targeting – well-tuned computational primitives and operators are required to unlock peak performance from hardware. Over time, these common libraries also result in a benchmark that allows new designs to be focused and optimized. In effect, this ensures that, over time, these libraries will provide continued good performance over multiple releases, new hardware designs and new versions of the software stack.

Arm's inference engine library, working on full network definitions, also allows us to target workloads to the right device, or parallelize across devices, where it makes sense. This allows for both more efficient execution when targeting the optimal device for a workload, and higher throughput if speed of work completion is key. In addition, the ability to easily select between optimized devices within a framework means that portability is notably easier.

Today, choosing the right device for offloading a network node is a manual operation. This allows a developer to profile the platform, choose the right device for processing that stage of the network and bake that choice into the graph description. A future step will be to allow networks to be automatically tuned for the platform, making this level of manual control optional rather than mandatory.
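That manual step can be as simple as profiling each candidate backend once and baking the fastest choice into the deployment. A minimal sketch of the selection (the device names and timing values here are hypothetical placeholders, not measured results):

```cpp
#include <cassert>
#include <limits>
#include <map>
#include <string>

// Pick the backend with the lowest profiled inference time.
// In practice the timings come from running the real network
// stage on each device of the target platform.
std::string pickDevice(const std::map<std::string, double>& msPerRun) {
    std::string best;
    double bestMs = std::numeric_limits<double>::infinity();
    for (const auto& [device, ms] : msPerRun) {
        if (ms < bestMs) {
            bestMs = ms;
            best = device;
        }
    }
    return best;
}
```

Automatic tuning would amount to running this profiling loop on-device at install or first-run time instead of at development time.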

B. Standards

The benefit of reducing the set of core operators is being seen in ongoing standardization efforts such as ONNX [11] and NNEF [12]. While it may be some time before these take off, they promise to open up an ecosystem of interoperating tools, frameworks and inference engines, making the development of neural networks easier and faster.

V. A PRACTICAL EXAMPLE OF DEPLOYMENT

1) Preparing a model in TensorFlow for deployment on an embedded platform
Preparing a model for TensorFlow deployment today involves removing unnecessary nodes and ensuring the operations used are available in the TensorFlow distributions on mobile devices, e.g. by removing training-specific operations in the model's computational graph. Optionally, it can also involve modifying the weights and operations to reduce file size and improve speed, at the expense of accuracy. This is accomplished through TensorFlow's graph_transforms tool, built from the TensorFlow source [8] with:

bazel build tensorflow/tools/graph_transforms:transform_graph

2) 32-bit floating-point model
To build a 32-bit floating-point version of the graph ready for mobile TensorFlow deployment:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnetv1_50_fp32.pb \
  --out_graph=optimized_resnetv1_50_fp32.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v1_50/predictions/Reshape_1' \
  --transforms='strip_unused_nodes(type=float, shape="1,224,224,3")
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms'

This has the largest file size and highest accuracy, but also has the highest computational requirements. For deployment in a mobile or embedded device, we can perform more preparation steps which make the model run more quickly and with similar accuracy.

3) 8-bit weights and operations
There are many techniques for retaining as much accuracy as possible, such as gradient thresholding and retraining, but these are beyond the scope of this paper. Applying naive quantization is straightforward and does not require additional passes through the training data:

bazel-bin/tensorflow/tools/graph_transforms/transform_graph \
  --in_graph=resnetv1_50_fp32.pb \
  --out_graph=optimized_resnetv1_50_int8.pb \
  --inputs='Placeholder' \
  --outputs='resnet_v1_50/predictions/Reshape_1' \
  --transforms='
    add_default_attributes
    strip_unused_nodes(type=float, shape="1,224,224,3")
    remove_nodes(op=Identity, op=CheckNumerics)
    fold_constants(ignore_errors=true)
    fold_batch_norms
    fold_old_batch_norms
    quantize_weights
    quantize_nodes
    strip_unused_nodes
    sort_by_execution_order'

This produces a file 25% of the size that uses 8-bit integer operations for faster inference, at the expense of accuracy.
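Under the hood, this style of 8-bit quantization is an affine mapping, real_value ≈ scale * (q - zero_point), with parameters chosen from the observed float range. The sketch below illustrates the arithmetic only; it is our own illustration, not TensorFlow's implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstdint>

// Affine (asymmetric) 8-bit quantization:
//   real_value ≈ scale * (quantized_value - zero_point)
struct QuantParams {
    float scale;
    int32_t zeroPoint;
};

// Derive parameters from the observed float range of a tensor.
QuantParams chooseParams(float minVal, float maxVal) {
    minVal = std::min(minVal, 0.0f);  // range must contain zero so that
    maxVal = std::max(maxVal, 0.0f);  // real 0.0 is exactly representable
    const float scale = (maxVal - minVal) / 255.0f;
    const int32_t zeroPoint =
        static_cast<int32_t>(std::round(-minVal / scale));
    return {scale, zeroPoint};
}

uint8_t quantize(float x, const QuantParams& q) {
    const int32_t v =
        q.zeroPoint + static_cast<int32_t>(std::round(x / q.scale));
    return static_cast<uint8_t>(std::clamp(v, 0, 255));
}

float dequantize(uint8_t x, const QuantParams& q) {
    return q.scale * (static_cast<int32_t>(x) - q.zeroPoint);
}
```

The roundtrip error of any value inside the chosen range is bounded by one quantization step (the scale), which is why accuracy loss grows with the dynamic range of the weights.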

4) Benchmarking optimized models
It's important to benchmark optimized models on real hardware. TensorFlow contains optimized 8-bit routines for Arm CPUs but not for x86, so 8-bit models will run much more slowly on an x86-based laptop than on a mobile Arm device. You can build the TensorFlow Android benchmark application with:

bazel build -c opt --cxxopt=-std=c++11 \
  --crosstool_top=//external:android/crosstool \
  --cpu=armeabi-v7a \
  --host_crosstool_top=@bazel_tools//tools/cpp:toolchain \
  tensorflow/tools/benchmark:benchmark_model

With the Android deployment device (in this case a HiKey 960) connected, run:

adb shell "mkdir -p /data/local/tmp"
adb push bazel-bin/tensorflow/tools/benchmark/benchmark_model /data/local/tmp
adb push optimized_resnetv1_50_fp32.pb /data/local/tmp
adb push optimized_resnetv1_50_int8.pb /data/local/tmp

The benchmarks are run with:

adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_fp32.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'

adb shell '/data/local/tmp/benchmark_model \
  --num_threads=1 \
  --graph=/data/local/tmp/optimized_resnetv1_50_int8.pb \
  --input_layer="Placeholder" \
  --input_layer_shape="1,224,224,3" \
  --input_layer_type="float" \
  --output_layer="resnet_v1_50/predictions/Reshape_1"'

5) Performance comparison
Accuracy should be evaluated using application-specific data, as the impact of quantization on accuracy can vary. In terms of compute performance, the above networks show the following performance on the HiKey 960 development platform with stock firmware, Android and CPU frequency settings:

Figure 5: Performance of Resnet50 (standard configuration) running on different inference implementations

As described previously, different approaches to the deployment of machine learning software can have a material impact on performance. In this example, we can see that deployment in Arm's inference engine, where the whole graph can be accelerated on device (even before fusion is possible), can produce much higher performance. Reduction of round trips to user-space software, removal of data transfers between layers, and specifically optimized inference routines all contribute to this performance difference.

Figure 6: Performance of Mobilenet v1 1.0_224 running on different inference implementations

The Arm inference engine and SYCL implementation are both running on Mali in this example, and differences are mostly attributed to the above overheads. For accelerated CPU routines, direct integration into TensorFlow is more straightforward, and so the performance of those devices is more easily achieved.

It should be noted, however, that continued efforts are being made in a number of these codebases, which will materially change performance over time.

B. Using Arm's inference engine
The general flow of using the inference engine follows the graph import software model, and this is further broken down into the 'Import, Optimize, Run' pattern, expecting that the input network weights come from an independent training process.

Figure 7: General Arm inference flow

The initial process is to create a network. In this example, we use the TensorFlow parser to take an input graph and convert it into our runtime graph representation, armnn::INetwork, which can then be used in the normal Arm inference flow. For this graph, we also need to connect the input and output tensors that are used when running inference to capture data. These are named based on choices of the model and so depend on the model you pass.

First, we create the parser:

// Create a network from a file on disk, using (in
// this case) the TensorFlow parser
std::unique_ptr<ITfParser> parser(ITfParser::Create());

Then we parse the network, in this case coming from a text input representing the mnist network, using inputTensorInfo to specify the inputs to the graph:

// Call the parser function with the input network,
// which can be binary or text
armnn::TensorInfo inputTensorInfo({ 1, 784, 1, 1 },
    armnn::DataType::Float32);

std::unique_ptr<armnn::INetwork> network =
    parser->CreateNetworkFromTextFile(
        armnn::DataType::Float32,
        "simple_mnist_tf.prototxt",
        inputTensorInfo);

We also get input and output bindings based on the textual name of the node in the graph:

// Get the input and output bindings based on node
// name in the graph
m_InputBindingInfo = parser->
    GetNetworkInputBindingInfo("input");
m_OutputBindingInfo = parser->
    GetNetworkOutputBindingInfo("output");

Once these steps have been completed, we can continue using the Arm inference stack, as we would for this or any other input path. From this point on, the code we use is common, regardless of the framework we started with. Very simply, our next steps are to take the graph and optimize it to make an immutable graph ready for running on the device we choose, then to run inference by enqueueing the workload and reading the result.

First, we run the optimization flow to produce our graph, optimized for the devices it will run on, with all nodes' functions created and internal memory objects for processing and data transfer between devices:

// The optimize step, which finalizes the graph
// ready for running inference
std::unique_ptr<armnn::IOptimizedNetwork> optNet =
    armnn::Optimize(*network,
        m_GraphContext->GetDeviceSpec());

Next, load the graph into the execution context:

// Load the network into the context.
armnn::Status ret = m_GraphContext->
    LoadNetwork(m_NetworkIdentifier,
        std::move(optNet));

The context is what records the devices we will execute on, typically one of the following:

armnn::Compute::CpuAcc;
armnn::Compute::GpuAcc;

The final step is to run the inference for the network on a given input, and capture the output:

armnn::Status ret = m_GraphContext->
    EnqueueWorkload(m_NetworkIdentifier,
        MakeInputTensors(&input.image[0]),
        MakeOutputTensors(&output[0]));

C. Deployment to GPU with TensorFlow + SYCL
TensorFlow models can also be executed on the Arm Mali GPU via OpenCL, targeting the SYCL compiler. This workflow is currently experimental and under active development, but initial results are encouraging – particularly as a way to run general or experimental TensorFlow graphs that are not yet heavily optimized by hand. A detailed walkthrough for installing TensorFlow SYCL with ComputeCpp for deployment on Arm Mali G71 devices can be found here:

https://developer.codeplay.com/computecppce/latest/tensorflow-arm-setup-guide

Once this is installed, 32-bit floating-point models will be deployed onto the GPU, allowing a wide array of pre-existing models to be used. This is useful for evaluating multiple networks and experimenting on the target platform before deployment.

VI. PERFORMANCE

As previously described, measuring performance on the target platform is key for getting accurate figures, particularly as there are a number of factors that can have an impact, and different network configurations running the same routines can produce notably different performance across implementations and devices; one device might be faster for some networks and another device faster for others.

Even if you are able to use a previously optimized network provided on our developer portal, there are still tradeoffs between power, performance and convenience, as can be seen below.

It's possible for this performance balance to change when looking at different models:

[Chart: Mobilenet v1 1.0_224 single-batch inference time in ms – 8x Mali G71 with Arm Compute Library (32-bit float), 4x and 1x Arm Cortex-A73 with TensorFlow Mobile (8-bit integer), and 4x and 1x Arm Cortex-A73 with TensorFlow Mobile (32-bit float); measured times of 38, 158, 155, 304 and 433 ms]

And of course, as software is further optimized, big leaps can be seen.



Figure 8: Alexnet speedup of new versions of Compute Library

Figure 9: Matrix multiply speedup in SYCL and Compute Library

It's also worth being mindful of the performance benefits, bandwidth savings and energy savings from using smaller datatypes, for example switching from 32-bit float to 8-bit integer. It has been shown that accuracy loss for a network is negligible when doing so, provided that retraining is also performed. In the case illustrated in this throughput graph, using 8-bit for matrix multiply results in a speedup of around 1.62x and, notably, a 2.46x reduction in bandwidth.

VII. THE FUTURE

There's so much more that can be done in this space: adding further optimizations such as fusion to inference engines, improving machine learning frameworks to enable accelerators, providing better development environments and tools for cloud-to-edge deployment, and exploiting compiler technology to further improve optimization. Of all the things that can be done, there are a few really interesting areas to look at:

• Seamless deployment from cloud to edge – making the training experience easier and providing better tooling for performance, adjusting model complexity, and accuracy tuning

• More advanced heterogeneous scheduling – better tools for static scheduling of workloads across devices in the SoC in the first instance, and hopefully improving dynamic scheduling in future

• Network compilers – taking advantage of the full network and operator code to produce interleaved scheduling to make maximum re-use of caches, to simplify arithmetic sequences, and to reduce memory accesses and bandwidth

This rapidly evolving field is continuing to solve more complex problems and improve performance at an amazing rate. Why not take a look at our developer community [18] and try [19] some of these techniques out for yourself?

REFERENCES

[1] https://www.eff.org/ai/metrics
[2] https://arxiv.org/pdf/1705.02498.pdf
[3] https://arxiv.org/pdf/1706.06969.pdf
[4] https://www.forbes.com/sites/michaelthomsen/2015/02/19/microsofts-deep-learning-project-outperforms-humans-in-image-recognition/#4381b2f9740b
[5] https://www.youtube.com/watch?v=k4ovpelG9vs
[6] https://arxiv.org/pdf/1712.05877.pdf
[7] https://arxiv.org/pdf/1707.01083.pdf
[8] https://arxiv.org/pdf/1711.07128.pdf
[9] https://community.arm.com/processors/b/blog/posts/high-accuracy-keyword-spotting-on-cortex-m-processors
[10] https://github.com/ARM-software/ML-KWS-for-MCU
[11] https://onnx.ai/
[12] https://www.khronos.org/nnef/
[13] https://github.com/tensorflow/tensorflow/tree/master/tensorflow/tools/graph_transforms
[14] https://arxiv.org/pdf/1712.01208.pdf
[15] https://arxiv.org/pdf/1508.06576.pdf
[16] https://pages.arm.com/iot-security-manifesto.html?utm_medium=Website&utm_source=Arm-HomepageHero&campaign=SecurityManifesto
[17] http://zhiyisun.github.io/2017/02/15/Running-Google-Machine-Learning-Library-Tensorflow-On-ARM-64-bit-Platform.html
[18] https://developer.arm.com/technologies/machine-learning-on-arm
[19] https://developer.arm.com/technologies/machine-learning-on-arm/developer-material/how-to-guides/teach-your-raspberry-pi-yeah-world
[20] http://developer.arm.com/-/media/Files/pdf/The%20Power%20of%20Speech%20Supporting%20Voice-Driven%20Commands%20in%20Small%20Low-Power%20Microcontrollers.pdf
[21] https://developer.android.com/ndk/guides/neuralnetworks/index.html



Triple Core ARM® Based MCU Architecture for Radiation Environments

Balaji, V. (ARM Holdings Ltd.), Bannatyne, R. (VORAGO Technologies), Iturbe, X. (ARM Holdings Ltd.)

Abstract— This technical paper proposes an ARM-based microcontroller that has been optimized for use in conditions of extreme radiation. Namely, several radiation-mitigating techniques are combined to address different types of failures that occur in CMOS devices when exposed to radiation. The proposed microcontroller integrates three Cortex-R5 CPUs in lock-step mode and implements a quick error recovery mechanism to cope with radiation-provoked soft errors. The microcontroller is proposed to be manufactured using VORAGO Technologies' HARDSIL® technology, which immunizes the device against radiation-induced latch-up. HARDSIL® also allows operation during exposure to a significant level of Total Ionizing Dose (TID), typically up to 300 krad(Si). Single Event Upsets (SEU) due to radiation particle strikes on memory are mitigated by Error Detection and Correction (EDAC) and Scrub Engine subsystems that operate on the program and data memories of the device.

Keywords—ARM; MCU; microcontroller; radiation; SEU; latchup; HARDSIL

I. ADDRESSING LATCH-UP IN RADIATION ENVIRONMENTS

A major problem that CMOS semiconductors face in extreme environments is 'latch-up'. Under conditions of high temperature and radiation, the CMOS device is exposed to conditions where parasitic transistors can be switched on by high-temperature silicon effects or by an ionizing radiation strike. All bulk CMOS wafers contain millions of parasitic structures (that resemble and behave like a thyristor) spread across the wafer. This is a byproduct of the CMOS wafer architecture and is usually not a problem if the device is operated within a limited specification, but at high temperature or when radiation is present, latch-up occurs when the parasitic structure is triggered. Fig. 1 illustrates a cross-section of a CMOS device structure, with the bipolar parasitic transistor structure shown in the well and substrate area of the wafer.

Latch-up occurs if the parasitic bipolar transistors become forward biased and switch on. The transistors will drive each other into saturation and create a short circuit from Vdd to Vss. When latch-up occurs, a high current will flow through the short circuit. Latch-up will be sustained if the combined gain of the NPN and PNP parasitic structure is greater than unity. This can result in permanent damage. To get out of a latch-up condition, the device must be reset.

At high temperature, junction leakage current increases as electron-hole pairs are generated in the silicon lattice. The forward bias voltage of parasitic transistors that reside on CMOS silicon structures is also reduced, leading to a reduced trigger current that is the onset of the latch-up condition. Similarly, a particle strike on the die can create charge that switches on the parasitic structure.

Immunizing against latch-up is not easy. Hardening electronic components for extreme environments has been achieved by using specialized semiconductor manufacturing processes such as Silicon-on-Insulator (SOI). This approach is effective but expensive, as it is a 'boutique' process that is not compatible with the sizeable CMOS infrastructure that is the standard in the industry.

Fig. 1. Cross Section of a Commercial Twin-Well CMOS Device Showing the Parasitic Bipolar Transistor Structure

Another approach that has been developed to address latch-up in extreme environments is to modify standard CMOS by adding a 'Buried Guard Ring' (BGR) to the existing CMOS substrate.

The HARDSIL® process [1] is a modification to standard CMOS designs that includes a vertical and horizontal implant in the die. This approach immunizes against latch-up by creating a highly conductive layer underneath the CMOS devices and wells, combined with a high-conductivity connection to well contacts. The HARDSIL® approach enables high-temperature and radiation-tolerant operation by reducing the parasitic resistance so that the parasitic NPN cannot turn on, and by reducing the gain of the parasitic transistors so that the bipolars cannot sustain latch-up. HARDSIL® has been implemented on space-grade semiconductors and has proved to be effective for latch-up immunization.

The HARDSIL® BGR is implemented by adding 1-2 mask steps and 3 implants during the wafer manufacturing stage. No special equipment is required to implement HARDSIL®, and standard CMOS design tools and manufacturing equipment are used. There are no negative effects in terms of transistor performance or power consumption. HARDSIL® can be implemented on any CMOS integrated circuit at any processing geometry node. This is very significant for designers of extreme-environment electronics systems, as it unlocks the door to using the latest state-of-the-art semiconductor products, rather than being limited to the small pool of tried-and-tested components that are never the best-fit products for a state-of-the-art design.

The BGR is shown under the transistor well areas in Fig. 2.

Fig. 2. Buried Guard Ring Structure in CMOS Device Using HARDSIL®

II. ARM-BASED MICROCONTROLLER ARCHITECTURE<br />

The proposed microcontroller is based on the recently<br />

announced ARM Triple Core Lock-Step (TCLS) architecture [2].<br />

This architecture is shown in Fig. 3 and includes three lock-stepped<br />

Cortex-R5 CPUs coordinated by a TCLS Assist Unit.<br />

At every clock cycle, the instructions to be executed by the<br />

microcontroller are read from a shared memory and distributed<br />

to the triplicated CPUs. The CPU outputs are majority-voted and<br />

forwarded to memories and I/O ports, preventing CPU errors<br />

from propagating to other parts of the system. Simultaneously,<br />

error detection logic in the TCLS Assist Unit checks whether there<br />

is any mismatch in the outputs delivered by the three CPUs. If<br />

there is a mismatch, this logic identifies whether it is a<br />

correctable error (i.e., only one of the CPUs delivers a different<br />

set of outputs) or an un-correctable one (i.e., all CPUs deliver<br />

different outputs). If the error is correctable, a resynchronization<br />

logic takes over the control to correct the architectural state of<br />

the erroneous CPU by resynchronizing all the CPUs. If the error<br />

is un-correctable, the entire system where the TCLS processor is<br />

integrated must be reset.<br />
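Conceptually, the majority voting and error classification performed by the TCLS Assist Unit can be sketched in C. This is an illustrative software model only; the real Assist Unit is hardware logic operating on the CPU output buses:

```c
#include <stdint.h>

/* Bitwise 2-of-3 majority vote over the outputs of the three CPUs:
   any single corrupted value is masked. */
uint32_t vote3(uint32_t a, uint32_t b, uint32_t c)
{
    return (a & b) | (a & c) | (b & c);
}

/* Mismatch classification, mirroring the text above: 0 = no error,
   1 = correctable (only one CPU disagrees), 2 = un-correctable
   (all three CPUs deliver different outputs). */
int classify(uint32_t a, uint32_t b, uint32_t c)
{
    if (a == b && b == c) return 0;
    if (a == b || b == c || a == c) return 1;
    return 2;
}
```

In the correctable case, the voted value is already correct and only the dissenting CPU's architectural state needs resynchronization; in the un-correctable case no majority exists and a reset is required.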


Fig. 3. Proposed ARM-based microcontroller architecture<br />

In TCLS, the CPU resynchronization process is automatic<br />

and transparent to the software. It consists of pushing out the<br />

architectural state of the three CPUs and then restoring the<br />

majority voted values back. Two remarks are important here.<br />

First, the error recovery process can be completed in less than<br />

2,500 clock cycles (less than 2.5 us @ 450 MHz) as there is no<br />

need to correct the memory state, whose integrity is protected<br />

using ECC. Secondly, the TCLS architecture is fail-functional, as<br />

it can continue working correctly in the event of a single CPU<br />

error using the two remaining functionally correct CPUs until all<br />

critical computations are completed and there is enough time to<br />

resynchronize the three CPUs.<br />

Unlike related space-qualified processors, the TCLS can<br />

deliver comparable performance (i.e., CPU clock frequency) to<br />

the COTS Cortex-R5 processor widely used in terrestrial<br />

automotive applications. Finally, note that the ARM TCLS<br />

architecture can be potentially used with any ARM CPUs,<br />

including performance-oriented A-class CPUs.<br />

III. MITIGATING AGAINST RADIATION EFFECTS THAT UPSET<br />

MEMORIES<br />

To mitigate against an SEU that could flip a memory bit, an<br />

Error Detection and Correction (EDAC) subsystem is proposed.<br />

Error Correcting Code (ECC) memories have the ability to<br />

detect a flipped memory bit and correct it. The VA10820<br />

ARM® Cortex®-M0 microcontroller Error Detection &<br />

Correction sub-system implements a Hamming Code based<br />

solution that detects two errors and corrects one PER BYTE.<br />

This means that there can be four flipped bits per 32-bit word<br />

and the microcontroller will still operate normally. As words are<br />

fetched by the CPU, the EDAC automatically performs<br />

detection and correction on these words in the course of normal<br />

CPU operation. There is still a risk however that particle strikes<br />

can flip bits on areas of the memory array that are not regularly<br />

being fetched by the CPU. This increases the likelihood that<br />

there will be more than a single bit error, creating an<br />

uncorrectable error. For this reason, a ‘Scrub Engine’ has also<br />

been integrated into the VA10820.<br />
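The per-byte correct-one/detect-two behaviour can be illustrated with a textbook Hamming SECDED code. The sketch below is a generic Hamming(12,8) code with an added overall parity bit — an assumption for illustration, not the VA10820's actual EDAC encoding:

```c
#include <stdint.h>

/* Data bits occupy the non-power-of-two positions 3..12;
   Hamming parity sits at positions 1,2,4,8; overall parity at bit 0. */
static const int data_pos[8] = {3, 5, 6, 7, 9, 10, 11, 12};

/* Encode one byte into a 13-bit codeword. */
uint16_t secded_encode(uint8_t d)
{
    uint16_t cw = 0;
    for (int i = 0; i < 8; i++)
        if (d & (1u << i))
            cw |= 1u << data_pos[i];
    for (int p = 1; p <= 8; p <<= 1) {        /* make each parity group even */
        int parity = 0;
        for (int j = 1; j <= 12; j++)
            if ((j & p) && (cw & (1u << j)))
                parity ^= 1;
        if (parity)
            cw |= 1u << p;
    }
    int all = 0;                              /* overall parity over bits 1..12 */
    for (int j = 1; j <= 12; j++)
        if (cw & (1u << j))
            all ^= 1;
    if (all)
        cw |= 1u;
    return cw;
}

/* Decode: 0 = clean, 1 = single error corrected, 2 = double error
   detected (uncorrectable). The decoded byte is written to *out. */
int secded_decode(uint16_t cw, uint8_t *out)
{
    int syn = 0;                              /* syndrome = error position */
    for (int p = 1; p <= 8; p <<= 1) {
        int parity = 0;
        for (int j = 1; j <= 12; j++)
            if ((j & p) && (cw & (1u << j)))
                parity ^= 1;
        if (parity)
            syn |= p;
    }
    int all = 0;                              /* parity over the whole word */
    for (int j = 0; j <= 12; j++)
        if (cw & (1u << j))
            all ^= 1;

    int status;
    if (syn == 0 && all == 0)
        status = 0;
    else if (all == 1) {                      /* odd number of flips: single error */
        if (syn >= 1 && syn <= 12)
            cw ^= 1u << syn;                  /* flip the erroneous bit back */
        status = 1;                           /* syn == 0: overall parity bit itself */
    } else
        status = 2;                           /* even flips, nonzero syndrome */

    uint8_t d = 0;
    for (int i = 0; i < 8; i++)
        if (cw & (1u << data_pos[i]))
            d |= 1u << i;
    *out = d;
    return status;
}
```

Because each byte is protected independently, up to four single-bit errors per 32-bit word — one in each byte — remain correctable, as noted above.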

The purpose of the Scrub Engine is to prevent accumulated<br />

errors and is an important part of the overall strategy to prevent<br />

uncorrectable bit flips due to radiation strikes. The Scrub Engine<br />

operates independently of the ECC system and will operate in<br />

the background of regular CPU activity to periodically examine<br />

the contents of each memory location and correct any bit-flip<br />

errors. This prevents the build-up of accumulated errors to<br />

reduce the possibility of a double-bit error that is uncorrectable.<br />



The Scrub Engine frequency can be adjusted so that a full<br />

memory scrub can be implemented regularly enough to be<br />

effective based on the radiation conditions of the environment at<br />

any time. A recommended approach is to measure the number<br />

of errors that the EDAC system encounters and use that<br />

information to adjust the scrub rate to a reasonable level.<br />
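That adjustment policy can be sketched as a simple mapping from the observed error count to the next scrub interval; the function name and thresholds below are illustrative assumptions, not VA10820 register semantics:

```c
/* Illustrative adaptive scrub policy: the number of EDAC corrections
   observed since the last full scrub selects the next scrub interval. */
unsigned next_scrub_interval_ms(unsigned errors_since_last_scrub)
{
    if (errors_since_last_scrub > 10)
        return 100;     /* harsh radiation environment: scrub often */
    if (errors_since_last_scrub > 2)
        return 500;     /* moderate upset activity */
    return 2000;        /* quiet environment: scrub rarely */
}
```

The supervisory software would read the EDAC error counter after each full scrub pass and reprogram the Scrub Engine with the returned interval.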

IV. ARM DEVELOPMENT ECOSYSTEM<br />

A development ecosystem for a microcontroller is the broad<br />

range of tools and support that is required to get the device up and running<br />

in the embedded system. The ecosystem includes<br />

hardware development tools that can be used to prototype<br />

systems, software packages that allow a designer to create<br />

high-level language code, and programming and debugging tools. An<br />

effective development ecosystem also includes code that can be<br />

used in the embedded system such as a Real-Time Operating<br />

System (RTOS) and communications stacks. Application notes<br />

and online support communities are also an important part of an<br />

effective development ecosystem.<br />

Embedded designers usually prefer to use devices based on<br />

the ARM Cortex architecture because the ARM ecosystem is<br />

large, mature and continues to evolve with the latest state-of-the-art<br />

tools.<br />

As explained in section II, the ARM TCLS architecture is<br />

transparent to the software programmer and hence, a user of the<br />

TCLS architecture is automatically granted access to the<br />

entire ARM ecosystem. Furthermore, the general error recovery<br />

process in TCLS does not require any user intervention. If<br />

required, in hard real-time applications, the user can keep track<br />

of the occurrence of errors in the TCLS architecture and start the<br />

CPU resynchronization process on demand. This is controlled<br />

by means of some flags in internal TCLS registers that can be<br />

read and written from the user application.<br />

V. CONCLUSION<br />

As more money is invested in commercial space, there is a<br />

demand for state-of-the-art products that can operate under<br />

extreme radiation while remaining affordable and offering<br />

leading-edge performance. The space industry has typically<br />

used legacy products that have been processed using specialized<br />

hardening techniques that are very expensive. The combination<br />

of leading edge ARM-based technology, the huge ecosystem of<br />

development tools around it and low-cost hardening technology<br />

is very attractive. This approach will simplify the job for<br />

designers, reduce costs and ultimately help enable low-cost<br />

reliable commercial space systems.<br />

REFERENCES<br />

[1] VORAGO Technologies, “Technology: An Overview of VORAGO’s<br />

HARDSIL ® Technology”, VORAGO Technologies,<br />

www.voragotech.com<br />

[2] X. Iturbe, B. Venu, E. Ozer and S. Das, “A Triple Core Lock-Step (TCLS)<br />

ARM Cortex-R5 Processor for Safety-Critical and Ultra-Reliable<br />

Applications”, Proc. of the IEEE/IFIP Intl. Conf. on Dependable Systems<br />

and Networks, 2016.<br />



Dynamic Memory Allocation & Fragmentation in C<br />

& C++<br />


Colin Walls<br />

Mentor, a Siemens business<br />

Newbury, UK<br />

colin_walls@mentor.com<br />

Abstract—In C and C++, it can be very convenient to allocate<br />

and de-allocate blocks of memory as and when needed. This is<br />

certainly standard practice in both languages and almost<br />

unavoidable in C++. However, the handling of such dynamic<br />

memory can be problematic and inefficient. For desktop<br />

applications, where memory is freely available, these difficulties<br />

can be ignored. For embedded – generally real time –<br />

applications, ignoring the issues is not an option. Dynamic<br />

memory allocation tends to be non-deterministic; the time taken<br />

to allocate memory may not be predictable and the memory pool<br />

may become fragmented, resulting in unexpected allocation<br />

failures. In this paper the problems are outlined in detail and an<br />

approach to deterministic dynamic memory allocation is detailed.<br />


I. C/C++ MEMORY SPACES<br />

It may be useful to think in terms of data memory in C and<br />

C++ as being divided into three separate spaces:<br />

Static memory. This is where variables, which are defined<br />

outside of functions, are located. The keyword static does not<br />

generally affect where such variables are located; it specifies<br />

their scope to be local to the current module. Variables that are<br />

defined inside of a function, which are explicitly declared<br />

static, are also stored in static memory. Commonly, static<br />

memory is located at the beginning of the RAM area. The<br />

actual allocation of addresses to variables is performed by the<br />

embedded software development toolkit: a collaboration<br />

between the compiler and the linker. Normally, program<br />

sections are used to control placement, but more advanced<br />

techniques, like Fine Grain Allocation, give more control.<br />

Commonly, all the remaining memory, which is not used for<br />

static storage, is used to constitute the dynamic storage area,<br />

which accommodates the other two memory spaces.<br />

Automatic variables. Variables defined inside a function,<br />

which are not declared static, are automatic. There is a<br />

keyword to explicitly declare such a variable – auto – but it is<br />

almost never used. Automatic variables (and function<br />

parameters) are usually stored on the stack. The stack is<br />

normally located using the linker. The end of the dynamic<br />

storage area is typically used for the stack. Compiler<br />

optimizations may result in variables being stored in registers<br />

for part or all of their lifetimes; this may also be suggested by<br />

using the keyword register.<br />

The heap. The remainder of the dynamic storage area is<br />

commonly allocated to the heap, from which application<br />

programs may dynamically allocate memory, as required.<br />

II. DYNAMIC MEMORY IN C<br />

In C, dynamic memory is allocated from the heap using<br />

some standard library functions. The two key dynamic memory<br />

functions are malloc() and free().<br />

The malloc() function takes a single parameter, which is<br />

the size of the requested memory area in bytes. It returns a<br />

pointer to the allocated memory. If the allocation fails, it<br />

returns NULL. The prototype for the standard library function is<br />

like this:<br />

void *malloc(size_t size);<br />

The free() function takes the pointer returned by<br />

malloc() and de-allocates the memory. No indication of<br />

success or failure is returned. The function prototype is like<br />

this:<br />

void free(void *pointer);<br />

To illustrate the use of these functions, here is some code to<br />

statically define an array and set the fourth element’s value:<br />

int my_array[10];<br />

my_array[3] = 99;<br />

The following code does the same job using dynamic<br />

memory allocation:<br />

int *pointer;<br />



pointer = malloc(10 * sizeof(int));<br />

*(pointer+3) = 99;<br />

The pointer de-referencing syntax is hard to read, so normal<br />

array referencing syntax may be used, as [ and ] are just<br />

operators:<br />

pointer[3] = 99;<br />

When the array is no longer needed, the memory may be<br />

de-allocated thus:<br />

free(pointer);<br />

pointer = NULL;<br />

Assigning NULL to the pointer is not compulsory, but is<br />

good practice, as it will cause an error to be generated if the<br />

pointer is erroneously utilized after the memory has been de-allocated.<br />

The amount of heap space actually allocated by<br />

malloc() is normally one word larger than that requested.<br />

The additional word is used to hold the size of the allocation<br />

and is for later use by free(). This “size word” precedes the<br />

data area to which malloc() returns a pointer.<br />

There are two other variants of the malloc() function:<br />

calloc() and realloc().<br />

The calloc() function does basically the same job as<br />

malloc(), except that it takes two parameters – the number<br />

of array elements and the size of each element – instead of a<br />

single parameter (which is the product of these two values).<br />

The allocated memory is also initialized to zeros. Here is the<br />

prototype:<br />

void *calloc(size_t nelements, size_t elementSize);<br />

The realloc() function resizes a memory allocation<br />

previously made by malloc(). It takes as parameters a<br />

pointer to the memory area and the new size that is required. If<br />

the size is reduced, data may be lost. If the size is increased and<br />

the function is unable to extend the existing allocation, it will<br />

automatically allocate a new memory area and copy data<br />

across. In any case, it returns a pointer to the allocated<br />

memory. Here is the prototype:<br />

void *realloc(void *pointer, size_t size);<br />
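A short worked example of realloc() in use, with the minimum error handling; the helper name grow_example is ours, added for illustration:

```c
#include <stdlib.h>

/* Grow a 10-element array to 20 elements with realloc().
   Contents up to the old size are preserved. */
int grow_example(void)
{
    int *p = malloc(10 * sizeof(int));
    if (p == NULL)
        return -1;
    p[9] = 99;                          /* value that must survive the resize */

    int *q = realloc(p, 20 * sizeof(int));
    if (q == NULL) {                    /* on failure, p is still valid */
        free(p);
        return -1;
    }

    int kept = q[9];                    /* still 99 after the resize */
    q[19] = 123;                        /* the new space is usable (uninitialized) */
    free(q);
    return kept;
}
```

Note that the original pointer must not be freed after a successful realloc(), since the function may have moved the block and freed the old one itself.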

III. DYNAMIC MEMORY IN C++<br />

Management of dynamic memory in C++ is quite similar to<br />

C in most respects. Although the library functions are likely to<br />

be available, C++ has two additional operators – new and<br />

delete – which enable code to be written more clearly,<br />

succinctly and flexibly, with less likelihood of errors. The new<br />

operator can be used in three ways:<br />

p_var = new typename;<br />

p_var = new type(initializer);<br />

p_array = new type [size];<br />

In the first two cases, space for a single object is allocated;<br />

the second one includes initialization. The third case is the<br />

mechanism for allocating space for an array of objects.<br />

The delete operator can be invoked in two ways:<br />

delete p_var;<br />

delete[] p_array;<br />

The first is for a single object; the second de-allocates the<br />

space used by an array. It is very important to use the correct<br />

de-allocator in each case.<br />

There is no operator that provides the functionality of the C<br />

realloc() function.<br />

Here is the code to dynamically allocate an array and<br />

initialize the fourth element:<br />

int* pointer;<br />

pointer = new int[10];<br />

pointer[3] = 99;<br />

Using the array access notation is natural.<br />

De-allocation is performed thus:<br />

delete[] pointer;<br />

pointer = NULL;<br />

Again, assigning NULL to the pointer after de-allocation is<br />

just good programming practice.<br />

Another option for managing dynamic memory in C++ is to use<br />

the containers in the Standard Template Library. This may be<br />

inadvisable for real time embedded systems, as the containers allocate heap memory behind the scenes.<br />

IV. ISSUES AND PROBLEMS<br />

As a general rule, dynamic behavior is troublesome in real<br />

time embedded systems. The two key areas of concern are<br />

determination of the action to be taken on resource exhaustion<br />

and non-deterministic execution performance.<br />

There are a number of problems with dynamic memory<br />

allocation in a real time system.<br />

The standard library functions (malloc() and free())<br />

are not normally reentrant, which would be problematic in a<br />

multithreaded application. If the source code is available, this<br />

should be straightforward to rectify by locking resources using<br />

RTOS facilities (like a semaphore).<br />

A more intractable problem is associated with the<br />

performance of malloc(). Its behavior is unpredictable, as<br />



the time it takes to allocate memory is extremely variable. Such<br />

non-deterministic behavior is intolerable in real time systems.<br />

Without great care, it is easy to introduce memory leaks<br />

into application code implemented using malloc() and<br />

free(). This is caused by memory being allocated and never<br />

being de-allocated. Such errors tend to cause a gradual<br />

performance degradation and eventual failure. This type of bug<br />

can be very hard to locate.<br />

Memory allocation failure is a concern. Unlike a desktop<br />

application, most embedded systems do not have the<br />

opportunity to pop up a dialog and discuss options with the<br />

user. Often, resetting is the only option, which is unattractive.<br />

If allocation failures are encountered during testing, care must<br />

be taken with diagnosing their cause. It may be that there is<br />

simply insufficient memory available – this suggests various<br />

courses of action. However, it may be that there is sufficient<br />

memory, but not available in one contiguous chunk that can<br />

satisfy the allocation request. This situation is called memory<br />

fragmentation.<br />

V. MEMORY FRAGMENTATION<br />

The best way to understand memory fragmentation is to<br />

look at an example. For this example, it is assumed that there is<br />

a 10K heap. First, an area of 3K is requested, thus:<br />

#define K (1024)<br />

char *p1, *p2;<br />

p1 = malloc(3*K);<br />

Then, a further 4K is requested:<br />

p2 = malloc(4*K);<br />

3K of memory is now free.<br />

Some time later, the first memory allocation, pointed to by<br />

p1, is de-allocated:<br />

free(p1);<br />

This leaves 6K of memory free in two 3K chunks.<br />

A further request for a 4K allocation is issued:<br />

p1 = malloc(4*K);<br />

This results in a failure – NULL is returned into p1 –<br />

because, even though 6K of memory is available, there is not a<br />

4K contiguous block available. This is memory fragmentation.<br />

It would seem that an obvious solution would be to defragment<br />

the memory, merging the two 3K blocks to make a<br />

single one of 6K. However, this is not possible because it<br />

would entail moving the 4K block to which p2 points. Moving<br />

it would change its address, so any code that has taken a copy<br />

of the pointer would then be broken. In other languages (such<br />

as Visual Basic, Java and C#), there are de-fragmentation (or<br />

“garbage collection”) facilities. This is only possible because<br />

these languages do not support direct pointers, so moving the<br />

data has no adverse effect upon application code. This defragmentation<br />

may occur when a memory allocation fails or<br />

there may be a periodic garbage collection process that is run.<br />

In either case, this would severely compromise real time<br />

performance and determinism.<br />

VI. MEMORY WITH AN RTOS<br />

A real time operating system may provide a service which<br />

is effectively a reentrant form of malloc(). However, it is<br />

unlikely that this facility would be deterministic.<br />

Memory management facilities that are compatible with<br />

real time requirements – i.e. they are deterministic – are usually<br />

provided. This is most commonly a scheme which allocates<br />

blocks – or “partitions” – of memory under the control of the<br />

OS.<br />

A. Block/partition Memory Allocation<br />

Typically, block memory allocation is performed using a<br />

“partition pool”, which is defined statically or dynamically and<br />

configured to contain a specified number of blocks of a<br />

specified fixed size. For Nucleus OS, the API call to define a<br />

partition pool has the following prototype:<br />

STATUS NU_Create_Partition_Pool(NU_PARTITION_POOL *pool,<br />

CHAR *name, VOID *start_address, UNSIGNED pool_size,<br />

UNSIGNED partition_size, OPTION suspend_type);<br />

This is most clearly understood by means of an example:<br />

status = NU_Create_Partition_Pool(&MyPool, "any name",<br />

(VOID *) 0xB000, 2000, 40, NU_FIFO);<br />

This creates a partition pool with the descriptor MyPool,<br />

containing 2000 bytes of memory, filled with partitions of size<br />

40 bytes (i.e. there are 50 partitions). The pool is located at<br />

address 0xB000. The pool is configured such that, if a task<br />

attempts to allocate a block, when there are none available, and<br />

it requests to be suspended on the allocation API call,<br />

suspended tasks will be woken up in a first-in, first-out order.<br />

The other option would have been task priority order.<br />

Another API call is available to request allocation of a<br />

partition. Here is an example using Nucleus OS:<br />

status = NU_Allocate_Partition(&MyPool, &ptr, NU_SUSPEND);<br />

This requests the allocation of a partition from MyPool.<br />

When successful, a pointer to the allocated block is returned in<br />



ptr. If no memory is available, the task is suspended, because<br />

NU_SUSPEND was specified; other options, which may have<br />

been selected, would have been to suspend with a timeout or to<br />

simply return with an error.<br />

When the partition is no longer required, it may be de-allocated thus:<br />

status = NU_Deallocate_Partition(ptr);<br />

If a task of higher priority was suspended pending<br />

availability of a partition, it would now be run.<br />

There is no possibility for fragmentation, as only fixed size<br />

blocks are available. The only failure mode is true resource<br />

exhaustion, which may be controlled and contained using task<br />

suspend, as shown.<br />

Additional API calls are available which can provide the<br />

application code with information about the status of the<br />

partition pool – for example, how many free partitions are<br />

currently available.<br />

Care is required in allocating and de-allocating partitions,<br />

as the possibility for the introduction of memory leaks remains.<br />

B. Memory Leak Detection<br />

The potential for programmer error resulting in a memory<br />

leak when using partition pools is recognized by vendors of<br />

real time operating systems. Typically, a profiler tool is<br />

available which assists with the location and rectification of<br />

such bugs.<br />

VII. REAL TIME MEMORY SOLUTIONS<br />

Having identified a number of problems with dynamic<br />

memory behavior in real time systems, a better approach can<br />

be proposed.<br />

A. Dynamic Memory<br />

It is possible to use partition memory allocation to<br />

implement malloc() in a robust and deterministic fashion.<br />

The idea is to define a series of partition pools with block sizes<br />

in a geometric progression; e.g. 32, 64, 128, 256 bytes. A<br />

malloc() function may be written to deterministically select<br />

the correct pool to provide enough space for a given allocation<br />

request. This approach takes advantage of the deterministic<br />

behavior of the partition allocation API call, the robust error<br />

handling (e.g. task suspend) and the immunity from<br />

fragmentation offered by block memory.<br />
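Such a pool-backed malloc() might be sketched as follows. The pool bookkeeping here is a deliberately simplified stand-in, not the Nucleus API; for brevity every slot occupies 256 bytes regardless of its pool's block size, and a production version would record the owning pool in a hidden header so that freeing is O(1):

```c
#include <stddef.h>

/* Pools with block sizes in geometric progression, as described above. */
#define NUM_POOLS       4
#define BLOCKS_PER_POOL 8

static const size_t pool_block_size[NUM_POOLS] = {32, 64, 128, 256};
static unsigned char pool_mem[NUM_POOLS][BLOCKS_PER_POOL][256];
static unsigned char pool_used[NUM_POOLS][BLOCKS_PER_POOL];

/* Bounded search: at most BLOCKS_PER_POOL iterations, hence deterministic. */
static void *pool_alloc(int p)
{
    for (int b = 0; b < BLOCKS_PER_POOL; b++)
        if (!pool_used[p][b]) {
            pool_used[p][b] = 1;
            return pool_mem[p][b];
        }
    return NULL;                       /* pool exhausted */
}

void *det_malloc(size_t size)
{
    for (int p = 0; p < NUM_POOLS; p++)       /* smallest block size that fits */
        if (size <= pool_block_size[p])
            return pool_alloc(p);
    return NULL;                              /* request larger than any pool */
}

void det_free(void *ptr)
{
    for (int p = 0; p < NUM_POOLS; p++)
        for (int b = 0; b < BLOCKS_PER_POOL; b++)
            if ((void *)pool_mem[p][b] == ptr) {
                pool_used[p][b] = 0;
                return;
            }
}
```

Because only fixed-size blocks are ever handed out, this allocator cannot fragment; its only failure mode is genuine pool exhaustion.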

VIII. CONCLUSIONS<br />

C and C++ use memory in various ways, both static and<br />

dynamic. Dynamic memory includes stack and heap.<br />

Dynamic behavior in embedded real time systems is<br />

generally a source of concern, as it tends to be nondeterministic<br />

and failure is hard to contain.<br />

Using the facilities provided by most real time operating<br />

systems, a dynamic memory facility may be implemented which is deterministic, immune from fragmentation and offers good error handling.<br />



Optimized – Cost Effective Implementation of<br />

Widely-Used Safety Mechanisms in Heterogeneous<br />

Software Architectures<br />

Esam Mamdouh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Esam.Mamdouh@ejad.com.eg<br />

Abstract— Functional safety is a key player in the<br />

development of Advanced Driver Assistance Systems (ADAS).<br />

Most ADAS software architectures are developed on either<br />

multi-core targets or multi-chip processors, both of which can<br />

be considered heterogeneous software architectures.<br />

Heterogeneous Software Architectures require special attention in order to utilize the available<br />

software capabilities to implement the safety recommendations<br />

defined by ISO 26262. Following these recommendations in such<br />

complex software architectures has become a major challenge<br />

facing the developers of safety critical applications. Current<br />

methodologies for deploying the safety critical features mainly<br />

rely on component redundancy with extra development time and<br />

effort. This paper will introduce an optimized – cost effective<br />

implementation of safety critical features. The main idea of the<br />

presented approaches is to simplify the implementation of the<br />

safety critical features by utilizing the available capabilities of the<br />

applicable system. These approaches are applied on a case study<br />

in the automotive industry for a Medium Range Radar<br />

application where it is classified as a safety critical application.<br />

The results of this approach show a significant performance<br />

improvement on the multi-core target/processor, and emphasize<br />

the cost saved by avoiding duplicated component development<br />

for safety critical features.<br />

Keywords— ISO 26262; Functional Safety; Multi-core; ADAS;<br />

MPU; IPC; Flow Control Monitoring; Watchdog;<br />

I. INTRODUCTION<br />

Heterogeneous Software Architectures require special<br />

attention in order to utilize the available software capabilities to<br />

implement the additional safety requirements as requested by<br />

the standard ISO 26262 [1]. Some of these additional safety<br />

requirements are used to tolerate some failures in the system. In<br />

this case, they are called safety mechanisms and normally<br />

defined in the Technical Safety Concept (TSC) during the<br />

safety analysis phase according to part 4 of [1]. This paper<br />

explains how the commonly used safety mechanisms such as<br />

Flow Control Monitoring, Memory Protection and Stack<br />

Protection are implemented in a multi-core platform whose<br />

Hossam H. Abolfotuh<br />

Functional Safety Department<br />

eJad L.L.C<br />

Cairo, Egypt<br />

Hossam.Abolfotuh@ejad.com.eg<br />

system’s functions originally do not require multi-tasking on<br />

all cores (e.g. a simple scheduler may be enough) and hence a<br />

multi-core OS is not required. In the proposed solution, only an<br />

ASIL single-core OS is used on one core, while the other two<br />

cores do not need an OS, which saves the high cost of an ASIL<br />

multi-core OS.<br />

Normally, the mentioned safety mechanisms are implemented<br />

through multiple instances of the safety critical components,<br />

such as one watchdog per core; in other cases they are achieved<br />

by complex techniques that rely on OS communication overhead,<br />

which adds CPU load and degrades system performance.<br />

This research starts by demonstrating the most popular safety<br />

mechanism, Flow Control Monitoring, which is applied in most<br />

ADAS systems, and discusses how to avoid the duplicated effort<br />

of using multiple watchdog instances. It then presents an<br />

effective implementation of Memory Protection for proper<br />

software partitioning between ASIL-x and QM components,<br />

followed by a smart implementation of Stack Protection built on<br />

MPU functionality. Finally, these safety mechanisms are applied<br />

in a case study of a Medium Range Radar application in the<br />

automotive industry, illustrating the improved results of the<br />

proposed solutions.<br />

II. FLOW CONTROL MONITORING<br />

A. Current implementation and challenges<br />

The first widely used safety mechanism is the Flow Control<br />

Monitoring. Its main purpose is to ensure the correct execution<br />

of the program sequence. As shown in Fig. 1, the current<br />

implementation for performing a flow control monitoring on a<br />

multi-core platform is typically achieved using multiple<br />

instances of ASIL watchdog stack for each core in order to<br />

implement aliveness supervision and logical supervision as<br />

described by the AUTOSAR standard; this is actually an<br />

expensive solution as it requires a perfect synchronization<br />

between the multiple watchdog instances to report the final<br />

status of the system accurately and on time.<br />



Fig. 1. Multiple instances of Watchdog stack on a Tri-Core platform<br />

B. Proposed Solution<br />

In the suggested proposal, the watchdog stack is only<br />

deployed on the first core (the one having an OS with the<br />

required ASIL) and handles the flow control monitoring on the<br />

other two cores by utilizing the existing watchdog module of<br />

the first core. This is achieved by implementing a simplified<br />

flow control monitoring with the basic required functions on<br />

the other two cores. The implementation includes the<br />

definition of the necessary check points in the program<br />

sequence running on these two cores; then it reports the status<br />

of these check points to the main watchdog stack on the first<br />

core over the Inter-Processor Communication (IPC) as<br />

illustrated in Fig. 2.<br />

The first core then calculates the status of the supervised<br />

entities of the other cores received over the IPC and reports the<br />

overall status to the system. In case of any detected violation,<br />

the system will enter the relevant safe state or perform reset<br />

according to what is described in the TSC.<br />

This solution can be generalized to cover the flow control<br />

monitoring in a multi-chip system (e.g., microcontroller and<br />

DSP) relying on inter-chip communication (e.g., SPI<br />

communication) instead of IPC.<br />
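The checkpoint reporting scheme can be modeled in a few lines of C, with the IPC represented as a shared ring buffer; all names here are illustrative, not an AUTOSAR or production interface:

```c
#include <stdint.h>

/* Shared ring buffer standing in for the IPC channel. */
#define CP_QUEUE_LEN 16

static volatile uint8_t cp_queue[CP_QUEUE_LEN];
static volatile unsigned cp_head, cp_tail;

/* Called on a supervised core at each checkpoint in the program sequence. */
void cp_report(uint8_t checkpoint_id)
{
    cp_queue[cp_head % CP_QUEUE_LEN] = checkpoint_id;
    cp_head++;
}

/* Called on the core running the ASIL watchdog stack. Verifies that the
   checkpoints arrived in the configured order: 0 = OK, -1 = a checkpoint
   is missing (aliveness violation), -2 = wrong order (logical violation). */
int cp_check_sequence(const uint8_t *expected, unsigned n)
{
    for (unsigned i = 0; i < n; i++) {
        if (cp_tail == cp_head)
            return -1;
        if (cp_queue[cp_tail % CP_QUEUE_LEN] != expected[i])
            return -2;
        cp_tail++;
    }
    return 0;
}
```

On a violation, the first core would report the failure to the main watchdog stack, which then drives the system into the safe state or reset defined in the TSC.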

III. MEMORY PROTECTION<br />

A. Current implementation and challenges<br />

Another commonly used safety mechanism is the Memory<br />

Protection which is used to protect the critical memory<br />

partitions that contain the critical data identified by the safety<br />

analysis. In the mixed ASIL software architecture, there is a<br />

possible risk may be caused by an unauthorized accesses from<br />

the QM partition on the ASIL partition. This an unauthorized<br />

access may corrupt the ASIL data and hence leads to a safety<br />

goal violation. Therefore the MPU is typically handled by an<br />

OS with at least Scalability Class 3 (SC3) to support the<br />

software partitioning for mixed ASIL software architecture.<br />

This solution requires an OS on all cores which in turn<br />

acquires an expensive multi-core OS license. On the other hand<br />

it will degrade the performance due to the overhead of Inter-OS<br />

Communication (IOC) used in context switching between QM<br />

partition and ASIL partition.<br />

B. Proposed Solution<br />

To avoid such a complex implementation, it is proposed to<br />

develop a Safety Element out of Context (SEooC) MPU driver<br />

to be used on all cores taking into consideration the different<br />

compiler options of each core. This MPU driver provides a<br />

simple interface that allows the application, through a simple<br />

wrapper called the “Memory Protection Wrapper”, to switch the<br />

MPU device ON/OFF according to the safety level context<br />

change. The software architecture proposed for the memory<br />

protection is illustrated in Fig. 3.<br />

This is valid mainly when there are only two safety levels (e.g., QM and ASIL-x), which is the common case in mixed-ASIL software architectures. In other words, it restricts the access of the lower-ASIL software to the memory partitions belonging to the higher-ASIL software.<br />
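With only two safety levels, the wrapper reduces to a simple ON/OFF switch around the context change. The sketch below assumes a hypothetical driver interface (`Mpu_Enable`/`Mpu_Disable`) standing in for the SEooC MPU driver; the names and the enum are illustrative, not taken from the paper.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Stand-in for the low-level SEooC MPU driver (hypothetical names). */
static bool mpu_enabled;
static void Mpu_Enable(void)  { mpu_enabled = true;  }
static void Mpu_Disable(void) { mpu_enabled = false; }

typedef enum { CTX_QM, CTX_ASIL } SafetyContext;

/* Memory Protection Wrapper: with only two safety levels, entering the
 * QM context arms the MPU so QM code cannot write the ASIL partitions;
 * the trusted ASIL context runs with the MPU switched off.            */
void MemProtWrapper_SwitchContext(SafetyContext next)
{
    if (next == CTX_QM) {
        Mpu_Enable();   /* protect ASIL partitions from QM accesses */
    } else {
        Mpu_Disable();  /* trusted ASIL-x context */
    }
}

bool MemProtWrapper_IsMpuOn(void) { return mpu_enabled; }
```

This is what Fig. 4 later shows as the simplified MPU ON/OFF switch at each ASIL context change.<br />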

Fig. 3. Proposed software architecture of the memory protection<br />

Fig. 2. Single instance of Watchdog stack with simplified flow control<br />

monitoring<br />

806


IV. STACK PROTECTION<br />

A. Current implementation and challenges<br />

Stack Protection is typically realized using an OS that provides a separate stack for each task or interrupt. Thus, the OS of the first core is responsible for protecting its stacks as configured. For the other cores, due to the non-preemptive nature of their tasks, a single stack can be considered safe, so no multi-core OS is needed to protect the stacks on those cores. This single stack is usually protected against stack overflow by placing magic-number patterns at its border, which are checked periodically by software. This continuous check adds overhead to the software processing, consumes a considerable amount of CPU load, and affects the overall system performance.<br />
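The conventional magic-number check described above can be sketched as follows. The guard value, stack size, and function names are hypothetical; a real implementation would place the pattern at the linker-defined stack border and run the check from a periodic task.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

#define STACK_WORDS   64u
#define MAGIC_PATTERN 0xDEADBEEFu   /* hypothetical guard value */

/* Simulated single stack; index 0 models the border guard word. */
static uint32_t stack_area[STACK_WORDS];

void stack_guard_init(void)
{
    stack_area[0] = MAGIC_PATTERN;  /* pattern placed at the border */
}

/* Periodic software check: if the pattern has been overwritten, a
 * stack overflow has occurred.  Running this continuously is the CPU
 * load the proposed MPU-based solution eliminates.                   */
bool stack_guard_intact(void)
{
    return stack_area[0] == MAGIC_PATTERN;
}
```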

B. Proposed Solution<br />

A solution was proposed that takes advantage of the MPU feature limiting accesses across different cores. Using this feature, the stack of each core is located at the top border of that core&#8217;s memory space, so that a stack overflow is treated as an unauthorized write attempt into the memory space of the other core. A memory access violation is thus detected, and the safety reaction causes a reset.<br />
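The MPU decision that turns an overflow into a detected violation can be modelled as a simple per-core range check. The region base/limit addresses below are invented for illustration; on the real device they would come from the memory map of Fig. 5 and be programmed into the MPU region descriptors.<br />

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical per-core RAM windows: each core may only write inside
 * its own region; anything else is an MPU access violation.          */
typedef struct { uint32_t base; uint32_t limit; } CoreRegion;

static const CoreRegion core_region[2] = {
    { 0x40000000u, 0x40010000u },   /* core 1 RAM (example addresses) */
    { 0x40010000u, 0x40020000u },   /* core 2 RAM (example addresses) */
};

/* Model of the MPU check for a write by `core` to address `addr`.
 * Because each stack sits at the top border of its core's region,
 * growth past the border lands in the neighbouring core's region and
 * is rejected, triggering the reset reaction in hardware.            */
bool mpu_write_allowed(unsigned core, uint32_t addr)
{
    return addr >= core_region[core].base && addr < core_region[core].limit;
}
```

Unlike the magic-number scheme, this check costs no CPU cycles: the MPU hardware performs it on every bus access.<br />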

V. CASE STUDY<br />

In the case study, the mentioned safety mechanisms were implemented on a tri-core target, the &#8220;MPC5774-RaceRunner&#8221; [2], with the ASIL-D Micro Controller Abstraction Layer (MCAL) provided by NXP. The application deployed on the target is a Medium Range Radar application, in which the front radars are rated ASIL-B at the software level.<br />

The software architecture is based on AUTOSAR version 3.2.1, complemented by ASIL-B stacks from Vector such as WdgM, E2E, and SafeOS.<br />

In the following three sub-sections, the implementation of the mentioned safety mechanisms is illustrated in the light of the Medium Range Radar application.<br />

A. Flow Control Monitoring<br />

The first core is supplied with an ASIL-B single-core OS and an AUTOSAR package including the ASIL-B watchdog stack used for flow control monitoring on that core, while the checkpoint statuses of the other two cores are communicated through the IPC as defined in section II.<br />

The configuration of the watchdog stack on the main core includes two additional checkpoints for special IPC messages containing the checkpoint statuses of the other two cores.<br />

B. Memory Protection<br />

The other two cores do not need an OS because they do not have any preemptive tasks, so the optimized approach discussed above can be applied without impacting the safety aspects. According to the ASIL decomposition at the system level, there are only two safety levels (QM, ASIL-B) defined in the software architecture; therefore the simplified memory protection solution explained in section III can be implemented as shown in Fig. 4.<br />

Fig. 4. Context ASIL change using the simplified MPU ON/OFF switch<br />

Fig. 5. Stack location for Core1 and Core2 of MCU 'MPC5774-RaceRunner’<br />

C. Stack Protection<br />

The single stacks of the other two cores (Core1 and Core2) are located at the top boundaries of their memory maps, so that the MPU can detect any stack overflow and perform an MCU reset in hardware. An additional memory hole is inserted at the bottom of each stack and configured in the MPU as a restricted memory area to detect stack underflow. The memory layout of the MCU MPC5774-RaceRunner is illustrated in Fig. 5.<br />

VI. SUMMARY/CONCLUSIONS<br />

The major advantage of the proposed solutions is that they are cost-effective alternatives, using only one ASIL OS instead of a multi-core ASIL OS. The proposed flow control monitoring solution saves the effort of developing and configuring multiple instances of the ASIL watchdog stack for the other cores. For the memory protection, the proposed solution performs better because it avoids the IOC overhead caused by using an OS configured with SC3. Finally, the proposed stack protection requires no development effort and places no processing load on the CPU.<br />

VII. REFERENCES<br />

[1] International standard, “Road Vehicles – Functional Safety”, ISO<br />

Standard 26262, first edition, Nov. 2011.<br />

[2] NXP Datasheet, “MPC5775K Reference Manual”, Document Number:<br />

MPC5775KRM, Rev. 2, 2/2014.<br />

www.embedded-world.eu<br />



Design security into your code. Don’t just hope to<br />

remove insecurity<br />

Mark A. Pitchford<br />

Technical Specialist<br />

LDRA<br />

Wirral, UK<br />

mark.pitchford@ldra.com<br />

I. INTRODUCTION<br />

If someone constructed a suspension bridge by guessing at<br />

steel cabling sizes and then loading the deck to see whether it<br />

collapsed, you would be unlikely to suggest that he was a<br />

great civil engineer. And if a lift manufacturer sized their<br />

motors by trying them to see whether they caught fire, you<br />

wouldn’t expect their electrical engineers to win many<br />

awards.<br />

And yet these approaches are exactly analogous to how<br />

security critical software developers often approach their<br />

work.<br />

The development cycle for traditional security markets is a<br />

largely reactive one, where code is developed mostly on an<br />

informal agile basis, with no risk mitigation and no coding<br />

guidelines. The resulting executables are then subjected to<br />

performance, penetration, load and functional tests to attempt<br />

to find the vulnerabilities that almost certainly result. The<br />

hope, rather than the expectation, is that all issues will be<br />

found and the holes adequately plugged.<br />

In short, this paper challenges secure software developers to embrace the concept that it is far better to design in security than to hope to remove insecurity.<br />

II. THE TRADITIONAL APPROACH TO ENTERPRISE<br />

SOFTWARE SECURITY<br />

Figure 1 shows an extract from a slide show based on a<br />

popular text book. The book itself is focused on the<br />

development of software for enterprise systems iii , and was<br />

published as recently as 2011. It typifies an approach to<br />

enterprise software development that focuses only on “end<br />

user business requirements”, with no clear regard for system<br />

security or safety. It goes on to place focus on testing only<br />

after the development phase, such that the application is<br />

developed in accordance with a specification and, when it is<br />

completed, it is tested to see whether requirements are met,<br />

and to “eliminate errors or bugs”.<br />

Safety critical software development belongs to a different<br />

world, with a process that would be far more familiar to<br />

exponents of the more traditional engineering disciplines. A<br />

process that consists of defining requirements, creating a<br />

design to fulfil those requirements, developing a product that<br />

is true to the design, and then testing it to show that it is.<br />

This paper argues that whether their product is safety critical<br />

or not, it is time for security critical software developers to<br />

embrace that same, sound engineering lifecycle. In doing so,<br />

it will compare and contrast the difference in focus between<br />

CERT C i ’s application centric approach to the detection of<br />

issues, versus MISRA ii ’s ethos of using design patterns to<br />

prevent their introduction. It will advocate the use of<br />

reactive penetration and load tests to prove that the product is<br />

sound, rather than to find out where it isn’t.<br />

Figure 1: The traditional enterprise development lifecycle,<br />

with a test phase only after development<br />

It is possible, of course, that security could indeed be one of<br />

the &#8220;business requirements&#8221; even if it is not explicitly<br />

highlighted as such. Even assuming that to be the case, it<br />

remains no surprise that many established security test<br />



techniques focus on the “develop first, test later” model<br />

reflected in Figure 1. Penetration testing iv , for example, is an<br />

authorized simulated attack on a computer system, performed<br />

to evaluate the security of that system. The test is performed<br />

to identify both strengths and vulnerabilities – that is, the<br />

potential for unauthorized parties to gain access to the<br />

system's features and data - enabling a full risk assessment to<br />

be completed.<br />

Fuzz testing v is a related technique where large amounts of<br />

data in varying formats are sent to the inputs of an<br />

application. For example, “File Fuzzing” involves taking a<br />

well-formed file, modifying it to introduce fuzz data, and<br />

then driving the program to open the modified file. The<br />

application will then process the fuzz data and its response<br />

can be monitored.<br />
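The file-fuzzing loop described above can be sketched in a few lines. The toy two-byte &#8220;FZ&#8221; format, the parser, and the function names are invented for illustration; real fuzzers mutate genuine file formats and monitor the target process for crashes and hangs.<br />

```c
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>
#include <stdlib.h>

/* Toy "file" format: 2-byte magic "FZ" followed by payload.  The parser
 * under test must reject malformed input rather than misbehave.        */
bool parse_file(const uint8_t *buf, size_t len)
{
    if (len < 2 || buf[0] != 'F' || buf[1] != 'Z') {
        return false;               /* malformed input rejected */
    }
    return true;
}

/* File fuzzing: start from a well-formed buffer, flip a random byte to
 * introduce fuzz data, then drive the parser with the modified "file"
 * and monitor its response.                                            */
void fuzz_parser(unsigned seed, int iterations)
{
    srand(seed);
    for (int i = 0; i < iterations; i++) {
        uint8_t buf[8] = { 'F', 'Z', 1, 2, 3, 4, 5, 6 };
        buf[rand() % sizeof buf] = (uint8_t)(rand() % 256);  /* fuzz data */
        (void)parse_file(buf, sizeof buf);   /* must not crash or hang */
    }
}
```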

Such techniques, then, fit with the development lifecycle<br />

model advocated in Figure 1. The idea is that armed with<br />

such information, developers and IT engineers can hope to<br />

“plug the gaps” with the aim of ensuring that the system is<br />

adequately secure.<br />

III. SAFE & SECURE APPLICATION CODE DEVELOPMENT<br />

This traditional approach to secure software development is<br />

mostly a reactive one – develop the software, and then use<br />

penetration, fuzz and functional test to expose any<br />

weaknesses. Useful though that is, in isolation it is not good<br />

enough to comply with a functional safety standard such as<br />

DO-178C vi (in the aerospace sector), IEC 62304 vii (medical<br />

devices) or ISO 26262 viii (automotive) which implicitly<br />

demands that security factors with a safety implication are<br />

considered from the outset, because a safety-critical system<br />

cannot be safe if it is not secure.<br />

Using ISO 26262 as an example, Figure 2 illustrates a V-<br />

model with cross-references to both the ISO 26262 standard<br />

and to tools likely to be deployed at each phase in the<br />

development of today’s highly sophisticated and complex<br />

automotive software. This serves as a reference to illustrate<br />

how the introduction of a security perspective impacts each<br />

phase. (Note that other process models such as agile and<br />

waterfall can be equally well-supported.)<br />

Figure 2: Software-development V-model with cross-references<br />

to ISO 26262 and standard development tools<br />

The outputs from the system design phase (top left) include<br />

technical safety requirements refined and allocated to<br />

hardware and software. In a connected system, these will<br />

include many security requirements because the action to be<br />

taken to deal with each safety-threatening security issue needs<br />

to be proportionate to the risk. Hazard analyses are performed<br />

to assess risks associated with safety, whereas threat analyses<br />

identify risks associated with security. Detailed hazard<br />

analysis may involve Fault Tree Analysis (FTA) whereas<br />

threat analysis may consist of Attack Tree Analysis (ATA),<br />

but each contributes key information to the safety case.<br />

Maintaining traceability between these requirements and the<br />

products of subsequent phases can cause a major project<br />

management headache.<br />

The specification of software requirements involves their derivation from the system design, isolating the software-specific elements and detailing the evolution of lower-level, software-related requirements, including those with a security-related element.<br />

Note that the application of such a process does not negate<br />

the value of penetration and fuzz testing. However, it makes<br />

it much more likely that such techniques will provide<br />

evidence of the robustness of systems, rather than being used<br />

to expose their vulnerabilities.<br />

Figure 3: Graphical representation of Control and Data<br />

Flow as depicted in the LDRA tool suite<br />



Next comes the software architectural design phase, perhaps<br />

using a UML graphical representation. Static analysis tools<br />

help here by providing graphical representations of the<br />

relationship between code components for comparison with<br />

the intended design (Figure 3).<br />

Figure 4 illustrates a typical example of a table from ISO<br />

26262-6:2011, relating to software design and<br />

implementation. It shows the coding and modelling<br />

guidelines to be enforced during implementation,<br />

superimposed with an indication of where compliance can be<br />

confirmed with the aid of automated tools.<br />

Figure 4: ISO 26262 coding and modelling guidelines<br />

The “use of language subset” (topic 1b in the table)<br />

exemplifies the impact of security considerations on the<br />

process. Language subsets have traditionally been viewed as<br />

an aid to safety, but security enhancements to the MISRA<br />

C:2012 standard and security-specific standards such as<br />

CWE ix and CERT C reflect an increasing interest in the role<br />

they have to play in combating security issues. These too can<br />

be checked by means of static analysis (Figure 5). Despite<br />

being nominally similar, the underlying ethos can differ<br />

considerably between these language subsets, as discussed<br />

later.<br />

Figure 6: Unit testing with the LDRA tool suite<br />

Figure 6 shows how the software interface is exposed at the<br />

function scope, allowing the user to enter inputs and expected<br />

outputs to form the basis of a test harness. That harness is then<br />

compiled and executed on the target hardware, and actual and<br />

expected outputs compared. Such a technique is useful not<br />

only to show functional correctness in accordance with<br />

requirements, but also to show resilience to issues such as<br />

border conditions, null pointers and default switch cases – all<br />

important security considerations.<br />
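A unit-test harness of the kind described, pairing inputs with expected outputs and exercising boundary values, can be sketched as below. The function under test (`sat_add`) and the table layout are invented for illustration; a tool-generated harness would additionally cross-compile and execute this on the target hardware.<br />

```c
#include <limits.h>
#include <stddef.h>

/* Illustrative function under test: saturating add that must cope
 * with boundary values instead of overflowing.                       */
int sat_add(int a, int b)
{
    if (a > 0 && b > INT_MAX - a) return INT_MAX;
    if (a < 0 && b < INT_MIN - a) return INT_MIN;
    return a + b;
}

/* Minimal harness in the style of tool-generated unit tests: each case
 * holds the inputs and the expected output; actual and expected values
 * are compared when the harness executes.                              */
typedef struct { int a, b, expected; } TestCase;

int run_tests(const TestCase *cases, size_t n)
{
    int failures = 0;
    for (size_t i = 0; i < n; i++) {
        if (sat_add(cases[i].a, cases[i].b) != cases[i].expected) {
            failures++;
        }
    }
    return failures;   /* zero means all actual outputs matched */
}
```

Note that the table deliberately includes the INT_MAX/INT_MIN border conditions the surrounding text calls out as security-relevant.<br />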

In addition to showing that software functions correctly,<br />

dynamic analysis is used to generate structural coverage<br />

metrics. Both MISRA C:2012 (Dir 3.1) and the security<br />

standard CWE (Figure 7) require that code coverage analysis<br />

is used to ensure that there is no hidden functionality<br />

designed to potentially increase an application’s attack<br />

surface and expose weaknesses.<br />

Figure 7: CWE requirement for code coverage analysis<br />

Figure 5: Coding standards violations as represented by the<br />

LDRA tool suite<br />

Dynamic analysis techniques (involving the execution of<br />

some or all of the code) are applicable to unit, integration<br />

and system testing. Unit testing is designed to focus on<br />

particular software procedures or functions in isolation,<br />

whereas integration testing ensures that safety, security and<br />

functional requirements are met when units are working<br />

together in accordance with the software architectural design.<br />

IV. CHOOSING A LANGUAGE SUBSET<br />

Although there are several language subsets (or, less formally, &#8220;coding standards&#8221;) to choose from, these have traditionally focused primarily on safety rather than security. More recently, with the advent of the Industrial Internet of Things, connected cars, and connected heart pacemakers, that focus has shifted towards security, reflecting the fact that such systems, once naturally secure through isolation, are now increasingly accessible to aggressors.<br />



There are, however, subtle differences between the subsets, perhaps reflecting the development dichotomy between designing for security and appending some measure of security to a developed system. To illustrate this, it is useful to compare and contrast the approaches taken by the authors of MISRA C and CERT C with respect to security.<br />

A. Retrospective adoption<br />

MISRA C:2012 x categorically states that “MISRA C should<br />

be adopted from the outset of a project. If a project is building<br />

on existing code that has a proven track record then the<br />

benefits of compliance with MISRA C may be outweighed by<br />

the risks of introducing a defect when making the code<br />

compliant.”<br />

This contrasts in emphasis with the assertion of the CERT C<br />

authors that although “the priority of this standard is to<br />

support new code development…. A close-second priority is<br />

supporting remediation of old code&#8221;.<br />

If static analysis tools are to enforce such rules, the rules must be checkable algorithmically. Compare, for example, the<br />

excerpts shown in Figure 8, both of which address the same<br />

issue. The approach taken by MISRA is to prevent the issue<br />

by disallowing the inclusion of the pertinent construct. CERT<br />

C instead asserts that the developer should “be aware” of it.<br />

Of course, there are advantages in each case. The CERT C<br />

approach is clearly more flexible; something of particular<br />

value if rules are applied retrospectively. MISRA C:2012 is<br />

more draconian, yet by avoiding the side effects altogether the<br />

resulting code is certain to be more portable, and perhaps more<br />

importantly, it can be automatically checked by a static<br />

analysis tool. It is simply not possible for a tool to check<br />

whether a developer is “aware” of side effects – and less<br />

possible still to ascertain whether “awareness” equates to<br />

“understanding”.<br />

Of course, as with the system as a whole, the level of risk involved in a compromise of the system will inform the approaches to be adopted. Certainly, the retrospective<br />

application of any subset is better than nothing, but it does not<br />

represent best practice.<br />

B. Relevance to safety, high integrity and high reliability<br />

systems<br />

MISRA C:2012 “define[s] a subset of the C language in which<br />

the opportunity to make mistakes is either removed or<br />

reduced. Many standards for the development of safety-related<br />

software require, or recommend, the use of a language subset,<br />

and this can also be used to develop any application with high<br />

integrity or high reliability requirements”. The accurate<br />

implication of that statement is that MISRA C was always<br />

appropriate for security critical applications even before the<br />

security enhancements introduced by MISRA C:2012<br />

Amendment 1 xi .<br />

CERT C attempts to be more all-encompassing, covering application programming (e.g. POSIX) as well as the C language. That is reflected in its introductory suggestion that<br />

“safety-critical systems typically have stricter requirements<br />

than are imposed by this standard … However, the application<br />

of this coding standard will result in high-quality systems that<br />

are reliable, robust, and resistant to attack”.<br />

V. DECIDABILITY<br />

The primary purpose of a requirements-driven software<br />

development process as exemplified by ISO 26262 is to<br />

control the development process as tightly as possible to<br />

minimize the possibility of error or inconsistency of any kind.<br />

Although that is theoretically possible by manual means, it<br />

will generally be far more effective if software tools are used<br />

to automate the process as appropriate.<br />

Figure 8: Contrasting approaches concerning the decidability<br />

of coding rules<br />

VI. PRECISION OF RULE DEFINITIONS<br />

The stricter, more precisely defined approach of MISRA not only lends itself to automated checking; it also addresses the issue of language misunderstanding more convincingly than CERT C.<br />

Evidence suggests that there are particular characteristics of<br />

the C language which are responsible for most of the defects<br />

found in C source code xii , such that around 80% of software<br />

defects are caused by the incorrect usage of about 20% of the<br />

available C or C++ language constructs. By restricting the use<br />

of the language to avoid the parts that are known to be<br />

problematic, it becomes possible to avoid writing associated<br />

defects into the code and as a result, the software quality<br />

greatly increases.<br />

This approach also addresses a more subtle issue surrounding<br />

the personalities and capabilities of individual developers.<br />

Simple statistics tell us that of all the C developers in the<br />

world, 50% of them have below average capabilities – and yet<br />

it is very rare indeed to find a development team manager who<br />



would acknowledge that they recruit any such individuals.<br />

More than that, in any software development team, there will be<br />

some who are more able than others and it is human nature for<br />

people not to highlight the fact if there are things they don’t<br />

understand.<br />

Figure 9 uses the handling of variadic functions to illustrate<br />

how this approach differs from that of CERT C. CERT C calls<br />

for developers to “understand” the associated type issues, but<br />

doesn’t suggest how a situation might be handled where a<br />

developer is, despite the best of intentions, harbouring a<br />

misunderstanding.<br />

A counter argument might be that there will be developers<br />

who are very aware of the type issues associated with variadic<br />

functions, who make very good use of them, and who may feel<br />

restricted by the prohibition of their use. However, for highly<br />

safety or security critical systems, MISRA would assert that<br />

because the “opportunity to make mistakes is either removed<br />

or reduced”, that is a price well worth paying.<br />
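The type issue at stake can be made concrete. MISRA C:2012 Rule 17.1 bans the features of `<stdarg.h>` outright, because nothing in a variadic prototype lets the compiler check the count or types of the extra arguments; a wrong count, or a `long` passed where `va_arg` expects an `int`, is undefined behaviour. The sketch below contrasts a variadic sum with a fully checkable fixed-signature alternative; the function names are illustrative.<br />

```c
#include <stdarg.h>
#include <stddef.h>

/* Variadic version: the compiler cannot verify that the caller really
 * passes `n` ints - a mismatched count or type is undefined behaviour
 * that no prototype will catch.  MISRA C:2012 Rule 17.1 therefore bans
 * <stdarg.h> altogether.                                              */
int sum_variadic(size_t n, ...)
{
    va_list ap;
    int total = 0;
    va_start(ap, n);
    for (size_t i = 0; i < n; i++) {
        total += va_arg(ap, int);   /* trusts the caller completely */
    }
    va_end(ap);
    return total;
}

/* Fixed-signature alternative: the array/length pair carries the same
 * information, but every argument type is checked by the compiler and
 * the construct is trivially analysable by tools.                     */
int sum_array(const int *vals, size_t n)
{
    int total = 0;
    for (size_t i = 0; i < n; i++) {
        total += vals[i];
    }
    return total;
}
```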

VII. BI-DIRECTIONAL TRACEABILITY<br />

The principle of bi-directional traceability runs throughout<br />

the V-models of standards such as DO-178C, IEC 62304,<br />

and ISO 26262, with each development phase required to<br />

accurately reflect the one before it. In theory, if the exact<br />

sequence of the standard is adhered to, then the<br />

requirements will never change and tests will never throw<br />

up a problem. But life’s not like that.<br />

For example, it is easy to imagine these processes as they<br />

relate to a “green field” project. But what if there is a need<br />

to integrate many different subsystems? What if some of<br />

those are pre-existing, with requirements defined in<br />

widely different formats? What if some of those systems<br />

were written with no security in mind, assuming an<br />

isolated system? And what if different subsystems are in<br />

different development phases?<br />

Then there is the issue of requirements changes. What if<br />

the client has a change of heart? A bright idea? Advice<br />

from a lawyer that existing approaches could be<br />

problematic?<br />

Should changes become necessary, revised code would<br />

need to be reanalysed statically, and all impacted unit and<br />

integration tests would need to be re-run (regression<br />

tested). Although that can result in a project management<br />

nightmare at the time, in an isolated application it lasts<br />

little longer than the time the product is under<br />

development.<br />

Figure 9: Comparing differing precision of rule definition<br />

A. A question of priorities<br />

The correct application of either CERT C or MISRA C:2012<br />

will certainly result in more secure code than if neither were to<br />

be applied. However, for safety or security critical<br />

applications, MISRA C is considerably less error prone both<br />

because it is specifically designed for such systems and as a<br />

result of its stricter, more decidable rules. Conversely, there is<br />

an argument for using the CERT C standard because it is more<br />

tolerant, perhaps if an application is not critical but is to be<br />

connected to the internet for the first time. The retrospective<br />

application of CERT C would then be a pragmatic choice to<br />

make.<br />

Connectivity, with its inherent need for security, changes<br />

all that. Whenever a new vulnerability is discovered, there<br />

is the potential for a resulting change of requirement to<br />

cater for it, coupled with the additional pressure of<br />

knowing that a speedy response could be critically<br />

important if products are not to be compromised in the<br />

field. Indeed, many IoT systems are very difficult to patch<br />

once in service.<br />

Automated bi-directional traceability links requirements<br />

from a host of different sources through to design, code<br />

and test. The impact of any requirements changes – or,<br />

indeed, of failed test cases - can be assessed by means of<br />

impact analysis, and addressed accordingly. Artefacts can<br />

be automatically re-generated to present evidence of<br />

continued compliance to the appropriate standard.<br />

During the development of a traditional, isolated system,<br />

that is clearly useful enough. But connectivity demands<br />

the ability to respond to vulnerabilities, because each<br />

newly discovered vulnerability implies a changed or new<br />

requirement, and one to which an immediate response is<br />

needed – even though the system itself may not have been<br />

touched by development engineers for quite some time. In<br />

such circumstances, being able to isolate what is needed<br />



and automatically test only the functions implemented<br />

becomes something much more significant.<br />

VIII. CONCLUSIONS<br />

The “develop first, test later” development lifecycle so often<br />

applied to enterprise software security is too prone to error<br />

where the application under development is critical in nature.<br />

Sometimes the requirement to do more is implicit in the safety<br />

implications if security is breached, but the same principle<br />

applies even when the concern centres on only the sensitivity<br />

of data. Happily, there are numerous examples of functional safety process standards such as ISO 26262 in the automotive<br />

industry, DO-178C in aerospace, and IEC 62304 in medical<br />

devices, and these provide a more stringent model for<br />

developers of security critical applications to adopt.<br />

These functional safety standards require the use of coding<br />

rules, and those specified by standards such as CERT C and MISRA<br />

C:2012+AMD 1 are designed for use in secure software<br />

development. MISRA’s mission statement to “…provide<br />

world-leading, best practice guidelines for the safe<br />

application of both embedded control systems and standalone<br />

software” contrasts with CERT C’s wider remit, and so<br />

MISRA C:2012 perhaps lends itself better to highly critical<br />

applications especially in view of the fact that more of its rules<br />

are designed to be automatically decidable by static analysis<br />

tools.<br />

COMPANY DETAILS<br />

LDRA<br />

Portside<br />

Monks Ferry<br />

Wirral CH41 5LH<br />

United Kingdom<br />

Tel: +44 (0)151 649 9300<br />

Fax: +44 (0)151 649 9666<br />

E-mail:info@ldra.com<br />

Presentation Co-ordination<br />

Mark James<br />

Marketing Manager<br />

E:mark.james@ldra.com<br />

Presenter<br />

Mark Pitchford<br />

Technical Specialist<br />

E:mark.pitchford@ldra.com<br />

The nature of the connected system means that the software<br />

development lifecycle effectively continues after product<br />

release. Tools designed to support bi-directional traceability<br />

during development provide the ideal platform to ensure that<br />

responses to security breaches are as rapid as possible, and<br />

that the resulting modified codebase is as compliant to<br />

standards as the version initially released.<br />

i. SEI CERT C Coding Standard, https://www.securecoding.cert.org/confluence/display/c/SEI+CERT+C+Coding+Standard<br />
ii. MISRA &#8211; The Motor Industry Software Reliability Association, https://www.misra.org.uk/<br />
iii. Business Driven Information Systems, Paige Baltzan, McGraw-Hill Education, 2011<br />
iv. TechTarget Definition: pen test (penetration testing), http://searchsoftwarequality.techtarget.com/definition/penetration-testing<br />
v. TechTarget Definition: fuzz testing (fuzzing), http://searchsecurity.techtarget.com/definition/fuzztesting<br />
vi. RTCA DO-178C, &#8220;Software Considerations in Airborne Systems and Equipment Certification&#8221;, prepared by SC-205, December 13, 2011<br />
vii. IEC 62304, Medical device software &#8211; Software life cycle processes, Consolidated Version, Edition 1.1, 2015-06<br />
viii. ISO 26262, Road vehicles &#8212; Functional safety &#8212; Part 6: Product development at the software level<br />
ix. CWE &#8211; Common Weakness Enumeration, https://cwe.mitre.org/<br />
x. MISRA C:2012, Guidelines for the use of the C language in critical systems, March 2013<br />
xi. MISRA C:2012 &#8211; Amendment 1: Additional security guidelines for MISRA C:2012, ISBN 978-906400-16-3 (PDF), April 2016<br />
xii. Applying the 80:20 Rule in Software Development, Jim Bird, Nov 15 2013, https://dzone.com/articles/applying-8020-rule-software<br />



Partitioning of Algorithms for Distributed<br />

Computation<br />

Andreas Rechberger<br />

Institute of Technical Informatics<br />

Graz University of Technology<br />

Graz, Austria<br />

Eugen Brenner<br />

Institute of Technical Informatics<br />

Graz University of Technology<br />

Graz, Austria<br />

Abstract&#8212; Early evaluation of the computing needs is a crucial step when developing embedded systems. Providing measurable metrics for the performance demanded to implement a specified algorithm involves a large amount of target dependency. This paper aims at providing a generalized method that can be applied before mapping the algorithm onto dedicated hardware. The goal is to quantify the runtime with reasonable accuracy when the algorithm is applied to a dedicated hardware architecture. We focus on the analysis and extraction steps of this process and discuss their challenges in transforming the algorithm implementation into a form suited for distribution analysis. Finally, some basic methods for hardware mapping of the generalized algorithm are presented.<br />

Keywords—algorithm, partitioning, compiler, LLVM, data flow<br />

graph<br />

I. INTRODUCTION<br />

When designing embedded systems it is vital to determine the required computing power in advance. With today's<br />
micro-controllers, signal processors and programmable logic the main problem usually is not obtaining sufficient<br />
processing capability, but choosing the appropriate computing platform.<br />

The majority of embedded systems do not operate in an isolated stand-alone environment, but rather communicate and<br />
interact with others. This holds true at system scale, where for example multiple Internet of Things (IoT) sensors<br />
cooperate, as well as at intra-device scale, where a general purpose controller teams up with digital signal<br />
processors (DSPs) for wireless communication and data acquisition.<br />

Within section II a method to analyse the data flow and<br />

processing needs of arbitrary algorithms is described. Such<br />

an analysis is one of the initial steps required in order to<br />

distribute the various processing tasks to the components<br />

within a system.<br />

By combining the results of the dynamic and static analysis<br />

the data flow as well as the control flow graph can be extracted.<br />

Section III describes the challenges that occur when extracting<br />

the computational effort and data amount out of the<br />

analysis data.<br />

Examples for such challenges are:<br />

• Dealing with aspects of the static single assignment (SSA) form of the intermediate language in the presence of<br />
loop constructs (Φ-nodes).<br />

• Handling of code sequences that do not contribute to<br />

the algorithm itself, such as calling subroutines with the<br />

required parameter passing.<br />

• Decomposing of data access to aggregate structures (arrays).<br />

With these techniques applied the impact of the compiler's optimization level can be minimized. The goal is to<br />
achieve similar results whether or not the compiler performs aggressive function inlining.<br />

Finally a proof of concept tool has been implemented which is able to automatically process an application, identify<br />
the function to be analysed, run the static and dynamic analysis and generate a combined data and control flow graph.<br />

With this graph a reasonable metric of the computational<br />

effort, as well as the required processing data for each part<br />

of the algorithm can be given. This provides the required<br />

quantitative data material for formulating a suitable system<br />

partitioning. Such a partition might be on a macroscopic scale, like partitioning computation between a web server<br />
and a (resource limited) embedded client with constrained data flow in between, or on a microscopic scale, like a<br />
general purpose<br />

CPU paired with a DSP.<br />

II. ANALYSIS METHOD<br />

Prior to analysing the data flow dependencies of an algorithm, it has to be formulated in a machine readable manner.<br />
This usually means expressing the sequence of computations in a suitable programming or scripting language. While<br />
data flow analysis could operate directly on the engineer's input (for example by analysing the C++, C or Matlab<br />
code, or the mathematical formulas), this approach does not allow dynamic analysis to be performed.<br />
<br />
Especially for descriptions in imperative languages even very simple constructs - like iterating over all elements<br />
of a data array - cannot always be satisfactorily analysed statically. For the previous example (iterating over a<br />
data array) this would require the data size to be a compile time constant.<br />



While almost all languages support idioms for compile time computations, for reasons of simplicity it is not always<br />
desirable to formulate the code in such a way. In the case of the C++ language, completely determining all template<br />
parameters and propagated constants basically requires a fully functional compiler front-end.<br />

Modern compilers for high level languages aim to separate<br />

the language front end, the optimization stages and the code<br />

generation. This simplifies the handling of multiple language<br />

front ends for similar languages like various dialects of C or<br />

C++ as well as handling significantly different languages like<br />

Ada, Java and C++ or Go in a single compiler project.<br />

The GNU compiler collection (GCC front ends) supports<br />

a wide collection of languages (common ones like C/C++<br />

and Java, as well as less prominent ones like Ada, Pascal,<br />

Mercury, Cobol, Go or Modula-2). The GCC suite uses several<br />

internal code representation formats called GIMPLE, RTL and<br />

GENERIC [1]. Other high level compilers, like the LLVM project compiler, use a single intermediate language [2]. The<br />
intermediate languages used within the optimization phase, and as such those which are handed over to the code<br />
generators, usually have static single assignment (SSA) form. This has shown to be beneficial for various<br />
optimization techniques, like dead code elimination, constant propagation and variable range analysis.<br />

In order to maintain the benefits of using a well known language, and to be able to utilize the front end processing<br />
and tooling already present, the analysis is suitably based on a representation different from the language used to<br />
describe the algorithm. As basically all high level languages (such as C++, Java, C#) are transformed into a<br />
generalized intermediate representation by a standard compilation flow [3], a method operating on this level of<br />
abstraction is beneficial.<br />

Not only does this exempt the algorithmic analysis from the details of the front end language, it also allows<br />
combining all front end languages the compiler is able to handle [4], [5].<br />

Based on the intermediate representation the algorithm can be<br />

analysed dynamically by means of executing an instrumented<br />

binary in addition to a purely static analysis.<br />

Instrumentation<br />

As code base for the analysis tool the compiler framework of the LLVM project has been used, specifically the C++<br />
front end. While the control flow graph can be trivially extracted via static analysis, for a dynamic analysis the<br />
algorithm has to be executed. Within the control flow graph the algorithm is decomposed into a set of elementary<br />
blocks, which are connected via conditional or unconditional branches. Within the LLVM assembly language such an<br />
elementary block is called a BasicBlock. Simplified, this is a sequence of arbitrary instructions terminated by a<br />
branch instruction which denotes the next block to be executed.<br />

As example a very simple function (Listing 1) which computes the sum of all elements in an array is used. This<br />
operation is called foldl (left fold) [6]. In contrast to the<br />

entry:<br />
br label %for.cond<br />
<br />
for.cond:<br />
%sum.0 = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
%i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]<br />
%exitcond = icmp eq i64 %i.0, 3<br />
br i1 %exitcond, label %for.cond.cleanup, label %for.body<br />
<br />
for.body:<br />
%arrayidx = getelementptr inbounds i32, i32* %array, i64 %i.0<br />
%0 = load i32, i32* %arrayidx, align 4, !tbaa !3<br />
%add = add nsw i32 %0, %sum.0<br />
%inc = add nuw nsw i64 %i.0, 1<br />
br label %for.cond<br />
<br />
for.cond.cleanup:<br />
ret i32 %sum.0<br />
<br />
Fig. 1. Control Flow Graph for Array Sum<br />

given C code, other programming languages might have built-in support for this operation (like Haskell), or<br />
implement it via library functions (std::accumulate in C++). The control flow of the intermediate language<br />
representation (Listing 2) consists of four blocks (Fig. 1).<br />

Listing 1<br />
FOLDL C CODE<br />
<br />
1 #define N 3<br />
2 int TestFunction(const int array[N])<br />
3 {<br />
4     int sum = 0;<br />
5     for (size_t i = 0; i < N; ++i)<br />
6         sum += array[i];<br />
7     return sum;<br />
8 }<br />
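The std::accumulate library function mentioned above expresses the same left fold as Listing 1; a minimal sketch (the wrapper name SumArray is ours, not from the paper):

```cpp
#include <cassert>
#include <iterator>
#include <numeric>

// Library-level left fold equivalent to Listing 1: std::accumulate
// walks the range once, threading the running sum through exactly
// the adder chain that appears in the data flow graph of Fig. 2.
int SumArray(const int (&array)[3])
{
    return std::accumulate(std::begin(array), std::end(array), 0);
}
```

The compiler lowers both formulations to essentially the same for.cond/for.body loop shown in Fig. 1.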


1) Load the source module (LLVM assembly language code)<br />
2) Instrument the module's functions<br />
3) Execute the module's entry function<br />
4) Post-process the generated tracking data<br />

As the execution of the module to be analysed takes place within the thread context of the analysis tool, its<br />
functions and global variables are shared. For the tracking a single function call is inserted at the very beginning<br />
of each BasicBlock.<br />

For reasons of simplicity within the instrumentation code generator this is done in a two-step approach. The<br />
function call inserted into the module operates with untyped memory addresses. It conveys the address of a second<br />
level function as well as a reference to the tracking instance and the first instruction of the block to be executed.<br />

While using raw memory addresses (or void pointers) is generally considered bad practice, it allows the type<br />
information of the algorithm to be analysed and the tracking module to be fully decoupled. The first level function<br />
(called SpringBoard) simply re-generates the type information and invokes the corresponding trace function within<br />
the tracker, while the second level function (called Trampoline) calls the actual tracking functions of the<br />
instrumentation handler's object instance (CodeTracker_t).<br />
<br />
The first level function is depicted in Listing 3.<br />

[Graph omitted: the input array is decomposed into the slice nodes array_Slice_0 .. array_Slice_2, which feed the<br />
adder chain add (0 | 5) → add (1 | 12) → add (2 | 19) → ret (3 | 24).]<br />
Fig. 2. Data Flow Graph for Array Sum<br />


Listing 3<br />

FIRST LEVEL TRACKING FUNCTION (SPRING BOARD)<br />

#include <cstdint><br />
<br />
namespace CodeTracker<br />
{<br />
    class CodeTracker_t;<br />
}<br />
namespace llvm<br />
{<br />
    class Instruction;<br />
}<br />
<br />
typedef int (*Trampoline)(CodeTracker::CodeTracker_t*, llvm::Instruction*);<br />
<br />
extern "C"<br />
{<br />
    void TrackBasicBlock_SpringBoard(uint64_t FctPtr, uint64_t me, uint64_t InstrcPtr)<br />
    {<br />
        Trampoline trampoline = reinterpret_cast<Trampoline>(FctPtr);<br />
        trampoline(reinterpret_cast<CodeTracker::CodeTracker_t*>(me),<br />
                   reinterpret_cast<llvm::Instruction*>(InstrcPtr));<br />
    }<br />
}<br />

Regular instructions (for example binary operations (add, sub, mul, ... [7])) can be directly added into the data<br />
flow graph. For data flow and dependency analysis each node is accompanied by some metadata. Within the metadata set<br />
the cycle count (incremented upon each instruction traced) is the most prominent entry. It provides a reliable<br />
mechanism to identify the most recent graph node in case the instruction (or basic block) is executed multiple<br />
times. With this approach the data flow and control flow elements can be composed into a single directed acyclic<br />
graph (DAG). The data flow graph for Listing 1 is shown in Fig. 2.<br />
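The cycle-count bookkeeping described above can be sketched as a minimal tracker; all names here (Tracker, Trace, Latest) are illustrative stand-ins, not the paper's actual CodeTracker_t API:

```cpp
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Simplified tracker: every traced instruction bumps a global cycle
// counter, and each dynamic graph node records the cycle at which it
// was produced.  When the same static instruction executes again, the
// node with the highest cycle is the most recent one.
struct Node
{
    std::string instr;
    unsigned cycle;
};

class Tracker
{
    unsigned cycle_ = 0;
    std::map<std::string, std::vector<Node>> nodes_;
public:
    void Trace(const std::string& instr)
    {
        nodes_[instr].push_back(Node{instr, cycle_++});
    }
    // Most recent dynamic instance of a static instruction.
    const Node& Latest(const std::string& instr) const
    {
        return nodes_.at(instr).back();
    }
    unsigned Cycles() const { return cycle_; }
};
```

Because the counter only ever increments, the resulting cycle values are strictly monotonic, which is exactly the property the text relies on for node identification.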

[Graph omitted: Fig. 3 extends the adder chain of Fig. 2 with the loop house-keeping nodes — the inc increments with<br />
their icmp / br pairs.]<br />
Fig. 3. Control/Data Flow Graph for Array Sum<br />
<br />
Besides the expected adder tree the graph depicts the dissolved elements of the input array (the slice nodes), as well<br />

as a single instruction (add, Listing 1, Line 19) of which the graph nodes are built. The numbers within the<br />
parentheses denote the operation's instruction cycle (second number) and its rank (first number). The instruction<br />
rank denotes a virtual instruction cycle within a fully parallelized execution. A node with a rank of 0 only<br />
requires static inputs in order to be computed, while a node of rank N has at least one input which is of rank N − 1.<br />
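The rank definition above admits a direct recursive computation over the DAG; a sketch under that definition (the Deps map and node names are hypothetical, and a production version would memoize the recursion):

```cpp
#include <algorithm>
#include <cassert>
#include <map>
#include <string>
#include <vector>

// Maps a compute node to the operand nodes it consumes.  Static
// inputs (constants, array slices) simply do not appear as keys.
using Deps = std::map<std::string, std::vector<std::string>>;

// Rank of a node: 0 if it depends only on static inputs, otherwise
// one more than the highest rank among its compute-node operands.
int Rank(const Deps& deps, const std::string& node)
{
    int r = -1;                        // stays -1 if all inputs are static
    auto it = deps.find(node);
    if (it != deps.end())
        for (const auto& op : it->second)
            if (deps.count(op))        // operand is itself a compute node
                r = std::max(r, Rank(deps, op));
    return r + 1;                      // all-static inputs -> rank 0
}
```

On the adder chain of Fig. 2 this reproduces the ranks shown there: the first add is rank 0 (slices only), each following add is one higher, and ret is highest.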

By inspecting the instruction cycles of the add nodes it is obvious that there are additional instructions executed.<br />
The instruction cycle is therefore guaranteed to be strictly monotonic (hence unique), but not necessarily<br />
contiguous. Extending the value nodes with the operations required to perform the loop house-keeping yields Fig. 3.<br />

This reveals that a second adder tree is required (upper left corner). This adder tree reflects the pointer/index<br />
arithmetic of the loop implementation. Each of the increment operations is followed by an icmp, br (integer compare<br />
and branch) pair used to implement iterating over the input array. This<br />



aspect depends on whether or not the optimization steps of the compiler have unrolled the loop. Within this paper<br />
the optimization has been configured not to unroll loops in order to demonstrate the generic case.<br />

III. DATA FLOW EXTRACTION<br />

For extracting the data flow from the instrumented code execution some extensions to the previously described method<br />
are required. Some of these are caused by the fact that the LLVM assembly language is an SSA based language, while<br />
others are a property of the LLVM assembly language itself.<br />
<br />
The most prominent member of the first group (SSA induced) is the handling of Φ nodes (Listing 2, Line 8-9). These<br />
are required to deal with values produced by one out of multiple possible predecessor blocks. Commonly this pattern<br />
is used to implement loop counters, which are loaded with the value 0 in case the predecessor has been the entry<br />
block, while being assigned the value n + 1 during all other iterations.<br />
being denoted the value n + 1 during all other iterations.<br />

Resolving the Φ nodes to their corresponding value can easily be achieved, provided that the tracking engine keeps a<br />
record of the previously executed basic blocks. This queue is basically similar to a function call stack (a last in<br />
first out (LIFO) queue) but operates on BasicBlock level rather than on function scope. Dealing with function calls<br />
as such is a necessity of the LLVM assembly language, which shares this property with almost all programming<br />
languages. Depending on the optimization settings of the language front-end certain functions will be inlined, but<br />
this process is not reliable enough to relieve the tracking module from handling function calls.<br />
function calls.<br />

For properly embedding the data and control flow of a subroutine into the global DAG the following two tasks need to<br />
be performed. First the input arguments and return values of the called function need to be mapped to the<br />
corresponding nodes within the parent block's scope, and second the function's instructions as such are to be<br />
processed with the proper instruction cycle offset, corresponding to the current instruction cycle count of the<br />
parent function when handling the call instruction. Resolving the value nodes out of the subroutine is a recursive<br />
issue as soon as the depth of the call tree exceeds two. With this approach the function calls transparently vanish<br />
within the analysis graph, as depicted in Listing 4 and Fig. 4, which demonstrate that the arguments a1 and b1 of<br />
the innermost function f1 are resolved to the top level inputs a and b. This behaviour is independent of the<br />
in-lining behaviour (optimization level) of the compiler.<br />
(optimization level) of the compiler.<br />

A. Aggregate Data and Memory Access<br />

When dealing with data arrays, or aggregate data structures<br />

in general, another aspect of the intermediate language has to<br />

be taken into account. Whenever a computational operation is<br />

applied to data stored in an array it is required to explicitly<br />

reference a single entry within the aggregate set. In this context<br />

the generation of the control and data flow graph competes<br />

with the vectorisation optimization passes of the compiler.<br />

As the vectorization capabilities of the front end are usually<br />

Listing 4<br />

RECURSIVE ARGUMENT LOOKUP<br />

int TestFunction(int a, int b) __attribute__((noinline));<br />
extern "C"<br />
{<br />
    int f1(int a1, int b1) { return a1 + b1; }<br />
    int f2(int a2, int b2) { return f1(a2, b2); }<br />
    int f3(int a3, int b3) { return f2(a3, b3); }<br />
    int f4(int a4, int b4) { return f3(a4, b4); }<br />
}<br />
<br />
int TestFunction(int a, int b)<br />
{<br />
    return f4(a, b);<br />
}<br />
<br />
int main(int argc, char* argv[])<br />
{<br />
    return TestFunction(1, 2);<br />
}<br />

[Graph omitted: the inputs a and b feed directly into add (0 | 0) [RecCallTree.cpp:4:42] (add.i.i.i.i), followed by<br />
ret (1 | 1) [RecCallTree.cpp:12:5].]<br />
Fig. 4. Recursive Argument Lookup<br />

limited to a certain width (usually less than 4) there is not much lost by inhibiting the vector optimization in the<br />
front end. Throughout this paper the front end and optimization configuration has been chosen such that the<br />
vectorization passes are disabled. Doing so causes memory access to happen via the pointer arithmetic like scheme of<br />
the getelementptr instruction. This closely resembles the index operator of the C language.<br />

Digging into the details of the C index operator (and as such the getelementptr) reveals that not only an arbitrary<br />
number of arguments is required, but also that resolving the corresponding element of the aggregate structure<br />
requires the run-time values of the arguments. The analysis concept so far performs a dynamic analysis of which<br />
blocks are to be traced (and their order of execution), backed up by a purely static analysis of the block content<br />
itself. Hence the operand data for the array indexing is only available as a reference to the variable or<br />
instruction computing it, but not the actual value it has been assigned when executing the array indexing.<br />

In order to make these values available to the control and data flow extraction engine the dynamic part of the<br />
instrumentation requires some extension. First of all the static processing of the elementary block has to be<br />
interrupted upon reaching an instruction which requires the actual values of its operands. Doing so requires the<br />
instrumentation mechanism to be changed. Rather than simply inserting an informative<br />



callback (“this block is now executed”) at the beginning of each elementary block, and recursively doing so for all<br />
subroutines, the memory access instructions need to be handled. In contrast to a simple informative callback, the<br />
extended tracing function is required to additionally convey the actual values of the arguments.<br />

The fully instrumented code is shown in Listing 5. The numeric arguments of the Track*() functions denote the memory<br />
addresses of the referenced instructions and trampoline functions as described in section II. Their values have been<br />
simplified from the 64-bit address space to low digit decimals for better demonstration. Each elementary block<br />
starts with a regular tracking spring board function as demanded by the previous analysis method, while computing<br />
the memory address (Listing 5, Line 23) requires three steps: pausing the static sweep over the elementary block,<br />
fetching the parameter values and finally assigning them to the proper node in the graph. The fact that there are<br />
two invocations of the springboard function within the loop body (Lines 20, 22) is attributable to the<br />
implementation details of the analysis tool. The first one prepares the value tracking and is interrupted before<br />
evaluating the getelementptr instruction. The second invocation continues the static analysis to the end of the<br />
elementary block (Line 27).<br />
<br />
The nodes for aggregate memory access are labelled as slices in the previous graphs, with each slice representing a<br />
single entry within the aggregate or array. This decomposition allows better separation of the control and data flow<br />
graph in future processing steps. For proper handling of subsequent accesses into the aggregate the index values for<br />
multiple iterations need to be recursively processed.<br />
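The run-time index resolution for getelementptr-style accesses boils down to stride arithmetic; a simplified sketch (GepOffset is a hypothetical helper, not part of the paper's tool):

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Byte offset of a getelementptr-style access: the sum of each
// run-time index multiplied by the size of the type it steps over.
// For the flat i32 array of Listing 5 a single index i with stride
// sizeof(i32) = 4 selects slice array_Slice_i.
std::size_t GepOffset(const std::vector<std::size_t>& indices,
                      const std::vector<std::size_t>& strides)
{
    std::size_t offset = 0;
    for (std::size_t d = 0; d < indices.size(); ++d)
        offset += indices[d] * strides[d];   // index * sizeof(stepped type)
    return offset;
}
```

The traced index values from @Track_GEP_i64 feed this arithmetic, mapping each dynamic access to its slice node.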

Listing 5<br />

FOLDL LLVM CODE<br />

1 ; Function Attrs: minsize noinline norecurse nounwind optsize readonly uwtable<br />
2 define i32 @TestFunction(i32* nocapture readonly %array) local_unnamed_addr #0<br />
3 {<br />
4 entry:<br />
5   call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 5167)<br />
6   br label %for.cond<br />
7 <br />
8 for.cond:        ; preds = %for.body, %entry<br />
9   %sum.0 = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
10  %i.0 = phi i64 [ 0, %entry ], [ %inc, %for.body ]<br />
11  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 679)<br />
12  %exitcond = icmp eq i64 %i.0, 3<br />
13  br i1 %exitcond, label %for.cond.cleanup, label %for.body<br />
14 <br />
15 for.cond.cleanup:        ; preds = %for.cond<br />
16  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 3240)<br />
17  ret i32 %sum.0<br />
18 <br />
19 for.body:        ; preds = %for.cond<br />
20  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 9328)<br />
21  call void @Track_GEP_i64(i64 1406, i64 516, i64 9328, i64 %i.0)<br />
22  call void @TrackBasicBlock_SpringBoard(i64 140, i64 516, i64 9328)<br />
23  %arrayidx = getelementptr inbounds i32, i32* %array, i64 %i.0<br />
24  %0 = load i32, i32* %arrayidx, align 4, !tbaa !3<br />
25  %add = add nsw i32 %0, %sum.0<br />
26  %inc = add nuw nsw i64 %i.0, 1<br />
27  br label %for.cond<br />
28 }<br />

Differing from the example shown in Listing 5, the loop could for example have routed the pointer argument through a<br />
Φ node and kept the index operand constant at the value 1 (Listing 6, Lines 3 and 10). This demands that the<br />
tracking engine perform the arithmetic required to resolve the final index position on its own.<br />

Listing 6<br />

LOOP ALTERNATIVE (INSTRUMENTED)<br />

1 for.cond:        ; preds = %for.body, %entry<br />
2   %Val.addr = phi i32 [ 0, %entry ], [ %add, %for.body ]<br />
3   %First.addr = phi i32* [ %arraydecay.i, %entry ], [ %inc.ptr, %for.body ]<br />
4   %cmp = icmp eq i32* %First.addr, %add.ptr.i<br />
5   br i1 %cmp, label %".exit", label %for.body<br />
6 <br />
7 for.body:        ; preds = %for.cond<br />
8   %0 = load i32, i32* %First.addr, align 4, !tbaa !9<br />
9   %add = add nsw i32 %0, %Val.addr<br />
10  %inc.ptr = getelementptr inbounds i32, i32* %First.addr, i64 1<br />
11  br label %for.cond<br />

B. Detecting Input and Output Arguments<br />

Another challenge is the automated detection of input and output parameters of the function to be analysed.<br />
Automating this process is required when retaining the language front end independence of the analysis tool is a<br />
design goal. As the LLVM assembly language does support constant data types, it can safely be concluded that any<br />
data which is marked as constant within the intermediate language is to be treated as input. However the inverse is<br />
not necessarily the case. This is particularly true as the analysis tool distinguishes between the algorithm's entry<br />
function (usually the main function in a C or C++ program) and the function implementing the algorithm (denoted<br />
TestFunction within this paper). Differentiating between those functions allows the algorithm to conveniently be<br />
developed as a regular application, but prevents the analysis from tracking functionality which shall not contribute<br />
to the performance results - like reading input data from a file.<br />

Therefore, to achieve reliable as well as correct results, the analysis tool determines the input (and output)<br />
parameters dynamically by tracing the data flow. Nodes which only have output edges are considered inputs, nodes<br />
with input edges only as outputs, and nodes with both edge types as combined input/output arguments. It is to be<br />
mentioned that the latter will break the acyclic property of the graph. This classification is done for the<br />
individual elements of aggregates/arrays, such that the later processing is capable of decomposing an array onto<br />
multiple hardware entities.<br />
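The edge-degree classification described above is straightforward to state in code; a sketch (the Role enum and the Internal case for isolated nodes are our additions, not the paper's terminology):

```cpp
#include <cassert>

enum class Role { Input, Output, InputOutput, Internal };

// Classify a graph node by its edge degrees, following the rule
// above: only outgoing edges -> input argument, only incoming
// edges -> output, both on an argument node -> combined
// input/output (which breaks the acyclic property of the graph).
Role Classify(unsigned inEdges, unsigned outEdges)
{
    if (inEdges == 0 && outEdges > 0) return Role::Input;
    if (outEdges == 0 && inEdges > 0) return Role::Output;
    if (inEdges > 0 && outEdges > 0)  return Role::InputOutput;
    return Role::Internal;  // isolated node, not part of the flow
}
```

Applied per array element, this is what allows a single aggregate to be split across multiple hardware entities.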

C. Raising the Abstraction Level<br />

One of the greatest drawbacks of evaluating the algorithm at LLVM assembly language scope is its very low level view<br />
of things. For data flow and sequencing this has shown to be a very effective abstraction. Quite commonly, though,<br />
it is beneficial to raise the abstraction level for certain computations. This can be the case for well known<br />
functions, like computing the Fast Fourier transform (FFT), or if the targeted hardware natively supports them. On<br />
some CPU architectures, for example, this is the case for the square root function (sqrt).<br />

Rather than decomposing each function into its assembly instructions the instrumentation engine handles them similar<br />
to high level instructions. These functions are treated as simple compute nodes, with their corresponding inputs and<br />
outputs. Listing 7 and Fig. 5 demonstrate the method for computing the L2 norm of a real valued vector<br />
(x = √(∑_{i=0}^{N} a_i²)) implemented as a variant of the foldl function of Listing 1. While the<br />



[Graph omitted: each input slice a_Slice_0_0 .. a_Slice_0_2 is squared via an fmul node [cmath:239:15] and<br />
accumulated through the fadd chain fadd (1 | 7), fadd (2 | 14), fadd (3 | 21) [Vector_Norm.cpp:13:42], followed by<br />
sqrtf (4 | 26) [cmath:287:10] and ret (5 | 27) [Vector_Norm.cpp:12:5].]<br />
Fig. 5. VectorNorm Dataflow<br />

std::pow function has been decomposed, and optimized by inlining and special coding within the runtime library<br />
(x² = x · x), the sqrt function has been kept as a high level primitive.<br />

This can also be done at a much different scope, as is the case for a sparsely distributed system with a web server<br />
and a resource limited interface node [8]. In this case a geo-location algorithm (GPS data to address lookup) is<br />
run, with only the server part being capable of running the lookup in full detail (solving the point in polygon<br />
problem for a large set of complex polygons). Such polygons occur when mapping a GPS location to a district with a<br />
certain area, as district boundaries are usually of highly irregular shape.<br />

Listing 7<br />

VECTOR NORM<br />

#include <cmath><br />
#include <numeric><br />
#include <iterator><br />
<br />
template <typename T, size_t N><br />
T TestFunction(const T(&)[N]) __attribute__((noinline));<br />
<br />
template <typename T, size_t N><br />
T TestFunction(const T(& a)[N])<br />
{<br />
    using namespace std;<br />
    return sqrt(accumulate(begin(a), end(a), T(0),<br />
        [](T sum, T next){ return sum + pow(next, 2); }));<br />
}<br />
<br />
float a[] = { 1., 2., 5. };<br />
<br />
int main()<br />
{<br />
    return TestFunction(a);<br />
}<br />

D. Parallelization and Distribution<br />

Decomposing the computation onto several hardware modules requires the control/data flow graph to be divided. For an<br />
initial approach the separation can be based on the graph model itself. Considering that finding the optimal<br />
partitioning with respect to parallelisation is an intricate set of problems, some of which are NP-complete [9],<br />
aiming for a generalized exact and optimal solution is not economic. Also, doing the partitioning with a dedicated<br />
set of computation nodes in mind (an embedded CPU like the ARM Cortex-M4, or a specific digital signal processor<br />
(DSP)) might require further processing of the computation graph prior to partitioning it. The parallelisation width<br />
(the number of operations which can be performed simultaneously) is highly dependent on the architecture of the<br />
hardware. The ARM Cortex-M4 for example is only capable of simultaneously processing a very restricted set of data<br />
types (small integer types of 8 or 16 bits), while a decent DSP usually can handle at least two floating point<br />
instructions in parallel. As such vectorization can significantly improve the performance of the computation, but<br />
does not alter the input and output dependencies. Under the assumption that a good partitioning is dominated by<br />
minimizing the data flow between distinct computation engines, its impact on the partitioning results is limited.<br />

A higher impact, however, can occur if instruction reordering is utilized. Assuming N to be even, (1) and (2)<br />
are mathematically identical, while their data flow graphs are very different: the former consists of a single<br />
large adder tree (as depicted in Fig. 5), while the latter results in two virtually independent adder trees.<br />
Generally, a composition of nodes representing a fold-like instruction (such as this adder tree) can be executed<br />
in log2(N) cycles, provided a sufficient number of computing nodes (N/2) is available.<br />

n = \sqrt{\sum_{i=0}^{N} x_i^2} \quad (1)<br />
<br />
n = \sqrt{\sum_{i=0}^{N/2} x_i^2 + \sum_{i=N/2+1}^{N} x_i^2} \quad (2)<br />

This is partially covered by the instruction rank (refer to Section II). Its purpose is to provide a simple<br />
metric for parallelisation and distribution without performing instruction reordering. Such a reordering<br />
procedure would increase the utilization of a distributed computing structure, but it requires a certain<br />
(mathematical) understanding of the instructions within the graph, namely their commutative and associative properties.<br />

Another challenge in partitioning the graph is dealing with the control flow. Considering an obviously<br />
separable, artificial function like (3) from Fig. 6, it is trivial to conclude that isolating the left and<br />
right parts of the computation is optimal.<br />

c(i) = \begin{cases} 2a(\lfloor i/2 \rfloor), & i \in 2\mathbb{N} \\ 2b(\lfloor i/2 \rfloor), & i \in 2\mathbb{N}+1 \end{cases} \quad (3)<br />

Not only do the computations allow an easy separation; the input data (vectors a and b) can also be fed<br />
independently into two distinct computation engines. This is only true in cases where the loop is fully<br />
unrolled. As soon as the control flow nodes are additionally considered (Fig. 7), a significant asymmetric<br />
portion of the graph appears, caused by the sequential (imperative) nature of the LLVM assembly language.<br />
Care has to be taken when partitioning such graphs, either by duplicating the control flow nodes (the left<br />
part of Fig. 7) for the detached components, or by applying an across-the-board performance decrease to the<br />
simplified half.<br />
<br />
Fig. 6. Vector split example (data flow).<br />
Fig. 7. Vector split example (data and control flow).<br />

IV. CONCLUSION<br />

In this paper an approach to analyse algorithms with respect to their distribution capability has been<br />
presented. Using a rather low-level representation (the LLVM assembly language) as the base for the analysis<br />
has proven to be suitable, especially with respect to the reuse of existing tools and optimization techniques,<br />
and the ability to raise the abstraction level when needed. As a proof of concept, a tool has been implemented<br />
which performs a combined static and dynamic analysis of the algorithm by means of an executable specification.<br />
For common low-level routines (fold instructions and vector norm) the generated data and control flow graphs<br />
match the expected well-known results. Additionally, the decomposition of aggregate data structures/arrays has<br />
been demonstrated. By addressing the needs of vectorization and parallelisation, the initial steps towards a<br />
semi-automated partitioning have been made.<br />

b_Slice_0_0<br />

V. FUTURE WORK<br />

With a DAG-based representation of the data flow as well as the control flow needs of an algorithm, future<br />
work can focus on evaluating the performance metrics when running it on a dedicated set of computation<br />
entities. Considering a computation entity as a composition of nodes offering the capability to execute a set<br />
of instructions, the control data flow graph can be mapped to that of the hardware. The instructions can be<br />
either the low-level idioms of the LLVM assembly language or higher-level abstractions. With the goal of<br />
finding a generic graph-based representation of such a hardware component, the mapping process leads to a<br />
graph-on-graph mapping procedure. With the ability to replay the algorithm on various hardware models and<br />
under varying boundary conditions, a reasonable estimate of the performance metric for a specific<br />
hardware/algorithm pair can be generated. This would allow an easy exploration of different hardware models<br />
running the same algorithm.<br />

REFERENCES<br />

[1] J. Merrill, “GENERIC and GIMPLE: A new tree representation for entire functions,” in Proceedings of the 2003 GCC Summit, 2003.<br />
[2] C. Lattner and V. Adve, “The LLVM compiler framework and infrastructure tutorial,” in LCPC’04 Mini Workshop on Compiler Research Infrastructures, West Lafayette, Indiana, Sep. 2004.<br />
[3] A. Dijkstra, J. Fokker, and S. D. Swierstra, “Implementation and application of functional languages,” in Implementation and Application of Functional Languages, O. Chitil, Z. Horváth, and V. Zsók, Eds. Berlin, Heidelberg: Springer-Verlag, 2008, ch. The Structure of the Essential Haskell Compiler, or Coping with Compiler Complexity, pp. 57–74. [Online]. Available: http://dx.doi.org/10.1007/978-3-540-85373-2_4<br />
[4] D. A. Terei and M. M. T. Chakravarty, “An LLVM backend for GHC,” in ACM SIGPLAN Haskell Symposium, Baltimore, MD, United States, 2010.<br />
[5] C. Lattner, “LLVM and Clang: Next generation compiler technology,” in The BSD Conference, 2008.<br />
[6] G. Hutton, “A tutorial on the universality and expressiveness of fold,” J. Funct. Program., vol. 9, no. 4, pp. 355–372, Jul. 1999. [Online]. Available: http://dx.doi.org/10.1017/S0956796899003500<br />
[7] LLVM Project. (2018) LLVM language reference. [Online]. Available: https://llvm.org/docs/LangRef.html<br />
[8] S. Tani, A. Rechberger, B. Süsser-Rechberger, R. Teschl, and H. Paulitsch, “Application of crowdsourced hail data and damage information for hail risk assessment in the province of Styria, Austria,” EGU 2017, IE2.1/NH9.19, April 2017. [Online]. Available: http://meetingorganizer.copernicus.org/EGU2017/EGU2017-6822.pdf<br />
[9] M. R. Garey and D. S. Johnson, Computers and Intractability: A Guide to the Theory of NP-Completeness. New York, NY, USA: W. H. Freeman & Co., 1990.<br />


www.embedded-world.eu


How to efficiently combine test methods for an<br />

automated ISO 26262 compliant software<br />

unit/integration test<br />

Markus Gros<br />

Vice President Marketing & Sales<br />

BTC Embedded Systems AG<br />

Berlin, Germany<br />

markus.gros@btc-es.de<br />

Abstract— The verification of embedded software in today’s development projects is becoming more and more of a<br />
challenge. This is particularly true for the automotive industry, where we can observe rapidly growing software<br />
complexity combined with shortened development cycles and an increasing number of safety-critical applications.<br />
New methodologies like Model-based design or agile processes on the one hand clearly help to make development<br />
more efficient; on the other hand, they bring additional challenges related to the test process. One effect,<br />
for example, is that tests need to be executed earlier and more often and, due to the Model-based development<br />
approach, on more execution levels like MIL/SIL/PIL. One more dimension of complexity comes from the fact that<br />
one test method is not enough to gain the necessary confidence regarding the correctness and robustness of the<br />
system under test. This conclusion is also<br />

and robustness of the system-under-test. This conclusion is also<br />

part of several standards like ISO 26262, which recommend a<br />

combination of different test activities on model and code level.<br />

This paper presents a concept for an integrated verification<br />

platform for models and production code, which addresses the<br />

challenges explained above by focusing on three main aspects:<br />

integration, separation and automation. The integration aspect<br />

can be divided into two different approaches. First of all, the platform should be integrated with other<br />
development tools like the modelling tool, requirements management tool or code generator.<br />

All information needed for the verification of a component<br />

should be extracted as automatically as possible, including<br />

information about interfaces, data types, data ranges,<br />

requirements or code files. As this kind of information is needed<br />

in a similar way for different verification methods, the second<br />

integration approach consists of integrating different test<br />

methodologies on top of a shared database within one<br />

environment. The first obvious benefit is that the information<br />

described above needs to be extracted only once for all<br />

verification activities which can include guideline checking, static<br />

analysis, dynamic analysis and formal methods. We will also<br />

describe a second benefit coming from the fact that these different methods can deeply leverage each other’s results.<br />

Separation means that software units shall be thoroughly verified<br />

before they are integrated into software components. Integrated<br />

components are then being verified according to the software<br />

architecture definition. The verification platform should support<br />

this divide and conquer approach as recommended and<br />

described in ISO 26262 or Automotive SPICE. One final topic to<br />

be discussed is automation, which should be made possible by a<br />

complete API as well as integration with technologies like<br />

Jenkins. The discussed verification platform approach automates<br />

many testing activities, from the more mundane work of developing MBD and code-centric test harnesses to the<br />
more sophisticated activities of automatic test generation.<br />

Keywords—Model-based Development, ISO 26262, Software<br />

Unit Test, Software Integration Test<br />

I. INTRODUCTION<br />

In today’s development projects for embedded software, the<br />

complexity is growing in many dimensions, which in particular brings many challenges for the test and verification process.<br />

The size of software in terms of lines of code and number of<br />

software components is constantly growing, which obviously<br />

also increases the number of test projects and test cases. On top<br />

of this, the Model-based development approach is becoming<br />

more and more popular and, despite all advantages, it brings<br />

some additional challenges to the testing workflow because test<br />

activities need to be done on model level as well as on code<br />

level. While these observations seem to lead to an increasing<br />

test effort, it is also obvious that the competitive pressure in the<br />

industry leads to a need to control or even reduce development<br />

cost and time. The amount of test activities is even further<br />

increased by the adoption of agile development methods,<br />

which require frequent repetition of test tasks on slightly<br />

modified components. As a consequence, software tools are<br />

introduced in the process in order to automate tasks like test<br />

execution, test evaluation or report generation.<br />

One more challenge we can see in particular in the automotive industry is that software is more and more<br />
taking over safety-critical features related to steering or braking,<br />

slowly leading the way to fully autonomous vehicles. The level<br />

of confidence which is needed for these kinds of features can<br />

only be achieved by combining multiple test methods. This is<br />

also reflected in standards like ISO 26262 and leads to a<br />

growing number of software tools which contribute to the<br />



overall quality metrics. While on one hand specialized software<br />

tools for individual verification tasks are available, the growing<br />

number of tools inside development projects becomes more<br />

and more difficult to manage. Reasons are:<br />

• Every software tool comes with specific<br />

limitations regarding the supported environment<br />

(e.g. versions of Microsoft Windows, Matlab etc.)<br />

and the supported language subset (e.g. supported<br />

Simulink blocks). Cross-checking all limitations<br />

before selecting the tools and tool versions for a<br />

specific project is always a time-consuming and<br />
error-prone task.<br />

• While different software tools in a project address<br />

different use cases, they often also have features<br />

and needs in common. One example in the<br />

verification context is the fact that every tool<br />

needs information about the system under test (or<br />

SUT), which typically includes details about<br />

interface, data ranges or the list of files needed to<br />

compile or simulate. Importing the SUT into<br />

different tools is not only a redundant task, it is<br />

also error-prone, as the user needs to learn and<br />

apply different workflows for a similar task.<br />

• As software tools often use different file formats<br />

for storing data or reports, users need to learn<br />

different tool specific aspects and need to store<br />

and analyze reports in different environments and<br />

formats. For automation, APIs, if available at all,<br />

might be based on different concepts or are only<br />

available in different programming languages.<br />

• When different test methods in a model-based<br />

process are applied independently, they typically<br />

do not benefit from each other’s results.<br />

This paper presents the concept of a test platform for<br />

software unit test and software integration test within a model-based<br />
development process including automatic code<br />

generation. While Section II presents the core features of the<br />

platform, sections III to VI focus on the main benefits that we<br />

call integration, separation and automation. Several aspects of<br />

the described approach have already been integrated in the<br />

commercial tool BTC EmbeddedPlatform.<br />

II. CORE FEATURES<br />

This chapter describes some common needs and features<br />

that we find in a redundant way in different tools being<br />

designed for different test methods. The benefits of providing<br />

these features once and making them available to different test<br />

methods will be described in section III.<br />

A. Import of the system under test<br />

The starting point of any test activity is to provide<br />

information about the SUT to the test tool. As we assume a<br />

model-based development process, we will consider at least<br />

two levels for test activities: Simulink/Stateflow models as well<br />

as production C code. Relevant information includes:<br />

• List of needed files and libraries for model<br />

(models, libraries, data dictionaries, .m/.mat files)<br />

and code level (.c/.h files, include paths)<br />

• Structure of subsystems in the model and structure<br />

of functions in the production code<br />

• List of interface objects on both levels. The main<br />

interface types are inputs, outputs as well as<br />

calibration parameters and observable internal<br />

signals. Interface objects can be scalar variables,<br />

vectors, arrays, or they can be structured in the form of<br />

bus signals or C code structures. Additional<br />

important information for each interface object<br />

includes data types, scalings and data ranges.<br />

• For test execution, a test frame needs to be<br />

available on both levels. In particular on unit test<br />

level, this might include the need to generate stub<br />

implementations for external functions and<br />

variables.<br />

B. Requirements<br />

The traceability to requirements is an important aspect of<br />

test methods like requirements-based testing or formal<br />

verification. The platform should be able to link test artifacts to<br />

requirements in a bi-directional way.<br />

C. Debugging<br />

If tests fail, the platform should support debugging<br />
activities on model and code level.<br />

D. Reporting<br />

It should be possible to generate report documents for all<br />

test activities in an open format like html. Creating the different<br />

types of reports with a common look and feel can support<br />

clarity and make them easier to read.<br />

III. TOOL INTEGRATION<br />

A tight integration between the test platform and other tools<br />

used inside the development project is a key prerequisite for an<br />

efficient and automated workflow. In this context, we can<br />

identify three main types of tools to connect to.<br />

In the context of a model-based development approach with<br />

automatic code generation, the most important tools to<br />

integrate with are the modelling environment (e.g.<br />

Simulink/Stateflow) and the code generator (e.g. dSPACE<br />

TargetLink or EmbeddedCoder). This integration should<br />

enable a highly automated import and analysis of the SUT as<br />

described in II.A. A manual setup of the test project or a semiautomated<br />

approach with third-party formats like Excel should<br />

be avoided for efficiency reasons and to avoid errors.<br />

As requirements play an important role, the platform should<br />

provide a direct connection to requirements management tools<br />

like IBM DOORS or PTC Integrity. It should be possible to<br />

automatically import the desired subset of requirements and to<br />

write information about test results back to the requirements<br />

management tool as additional attributes.<br />

Especially in larger projects where a lot of developers and<br />

test engineers are involved, a global data management platform<br />

822


might be available providing features like centralized access to<br />

all development and test artifacts, version and variant<br />

management or the control of access rights. This kind of tool<br />

also has the potential to collect quality metrics for different<br />

components and make them accessible on a project wide level.<br />

Therefore, the test platform should be able to integrate with<br />

such a data management platform in a bi-directional way in<br />

order to obtain information about the SUT and in order to<br />

provide test metrics back to it.<br />

IV. INTEGRATION OF TEST METHODS<br />

As already mentioned above, the needed confidence for the<br />

development of embedded systems can only be achieved by a<br />

combination of different test methods. Combining different test methods inside one platform will bring two<br />
main benefits. The first obvious benefit is that the features described in II can be<br />

accessed and shared by the test methods, increasing efficiency<br />

and avoiding the need for redundant tasks. Being located in the<br />

same environment, some of the relevant test methods also have<br />

the potential to benefit from having information about each<br />

other’s results. Relevant tasks in this context are:<br />

a. Requirements-based Testing: Functional test cases<br />

should be derived from requirements and applied on<br />

model and code level. The creation of these test cases<br />

clearly benefits from the detailed information the<br />

platform has about the SUT including available<br />

interface variables and data ranges. This way, the test<br />

editor can already protect the user against invalid data<br />

entry. Other platform features which are needed for<br />

this task contain the capability to run simulations, the<br />

availability of requirements as well as debugging and<br />

reporting features.<br />

b. Analysis of equivalence classes and boundary values:<br />

Both methods are recommended by ISO 26262 and<br />

target an analysis of different values and value ranges<br />

for interface variables. These tasks will benefit from<br />

the fact that the platform already contains information<br />

about all available functions, their interface signals and<br />

the data ranges. The outcome of this activity should be<br />

a set of test cases which cover the defined variable<br />

ranges and values, therefore it makes sense to combine<br />

this analysis with the Requirements-based Testing<br />

activity.<br />

c. Analysis of model and code coverage: In order to<br />

assess the completeness of the test activities, structural<br />

coverage metrics should be measured on model and<br />

code level. Due to an integration with the<br />

Matlab/Simulink environment, model coverage can<br />

easily be measured via standard mechanisms. For code<br />

coverage, the code needs to be instrumented and all<br />

available tests need to be executed on the instrumented<br />

code. As the platform should have access to a<br />

compileable set of code and header files, this analysis<br />

can be handled fully automatically.<br />

d. Check for Modelling and coding guidelines: These<br />

kind of static analysis methods can be fully automated<br />

in case the list of model and code artifacts is available.<br />

Modelling guidelines for example can check for<br />

prohibited block types, wrong configuration settings or<br />

violations of naming rules. An example for coding<br />

guidelines are the widely used MISRA C rules.<br />

e. Analysis of runtime errors: This static analysis is<br />

typically done on code level by applying the abstract<br />

interpretation method. This methodology requires<br />

access to the list of code and header files and it also<br />

benefits from getting information about data ranges of<br />

variables. If some analysis goals are already covered<br />

by existing tests, it might be possible to exclude them from<br />

the analysis to increase efficiency.<br />

f. Resource consumption: This means analyzing the<br />

resource consumption on the target processor regarding<br />

RAM, ROM, stack size and execution time. One<br />

option is to measure these metrics during the test<br />

execution on a real or virtual processor, which the<br />

platform should be able to call. This measurement is of<br />

course only possible, if a sufficient set of test cases is<br />

available, which covers different paths in the software.<br />

g. Structural Test Generation: In order to maximize<br />

structural coverage metrics on model and code level,<br />

test cases can be generated automatically either by<br />

random methods or using model checking. This task<br />

can benefit dramatically from the availability of<br />

requirements-based test cases, as only uncovered parts<br />

need to be analyzed. Structural tests can be used e.g.<br />

for showing robustness of the SUT and for Back-to-<br />

Back as well as regression testing.<br />

h. Back-to-Back Testing: Back-to-Back Testing between<br />

models and code is (highly) recommended by ISO<br />

26262 and it obviously requires test cases (functional<br />

and/or structural), the ability to run them on the<br />

different execution levels and the generation of<br />

corresponding reports.<br />

i. Formal Specification: Textual (or informal)<br />

requirements often leave some room for ambiguities or<br />

misunderstandings. Expressing requirements in semi-formal<br />

or formal notation (as recommended by ISO<br />

26262) not only improves their quality, it also allows<br />
them to be used as a starting point for some highly<br />

automated and efficient verification methods (see<br />

below). The formalization process requires information<br />

about the architecture of the SUT and it should also<br />

provide traceability to existing informal requirements<br />

from which the formal notation is derived. Both are<br />

already provided by the platform concept.<br />

j. Requirements-based Test Generation: As the<br />

previously described formalized requirements are<br />

machine-readable, they can be used as a starting point<br />

for an automatic generation of test cases which will test<br />

and cover the requirements. If these requirements don’t<br />

describe the full behavior of the system, the SUT itself<br />

(available in the platform) can contribute to the<br />

process. If manual test cases already exist, they can be<br />

analyzed regarding their requirements coverage, so that<br />

only missing tests need to be generated.<br />



k. Formal Test: In a Requirements-based Testing process,<br />

every test case is usually only evaluated with respect to<br />

the requirement from which it has been derived. A<br />

situation where a particular test case violates a<br />

different requirement typically goes undetected. By<br />

performing a Formal Test, all test cases are evaluated<br />

against all requirements, which dramatically increases<br />

the testing depth without the need to create additional<br />

test data. Obviously, this method benefits from a<br />

platform in which formalized requirements and<br />

functional/structural test cases are managed together<br />

for a particular SUT.<br />

l. Formal Verification: The number of possible value<br />

combinations for input signals and calibration values is<br />

almost infinite for a typical software component. It is<br />

therefore obvious that even a large number of test<br />

cases can never cover all possible paths through the<br />

component. Formal Verification with Model-Checking<br />

technology can automatically provide a complete<br />

mathematical proof that shows a requirement cannot be<br />

violated by the analyzed SUT. This guarantees that<br />

there is no combination of input signals and calibration<br />

values that would drive the system to a state in which<br />

the requirement is violated. The analysis takes the SUT<br />

as well as the formalized requirement(s) as an input. If<br />

a requirement can be violated, a counter example is<br />

provided in form of a test case, which can then be<br />

debugged to find the root cause for the possible<br />

requirement violation.<br />

V. SEPARATION<br />

The growing complexity in today’s embedded software<br />

development projects can only be managed by a divide and<br />

conquer approach. This concerns different disciplines including<br />

requirement authoring, software architecture design, software<br />

development and also testing. System requirements need to be<br />

broken down into smaller units as part of a bigger architecture.<br />

Afterwards, these units should be developed and tested<br />

independently before being integrated. This process is also<br />

reflected in the so-called V-Cycle as well as in ISO 26262<br />

which on software level contains a clear separation between<br />

software unit test and software integration test.<br />

The test platform should support this approach mainly in<br />

two ways. First of all, the tool should be flexible enough to<br />

separate the SUT structure from the model structure. This<br />

means, it should be possible to individually test<br />

subsystems/subcomponents which are managed inside one<br />

single model or code file. Therefore, it is necessary to separate<br />

individual subsystems from their original model and embed<br />

them in a newly created test frame. A similar approach is also<br />

needed on code level. When it comes to the integration testing<br />

phase, the tool should be able to focus on the new tasks which are<br />

related to potential integration issues. It should not be<br />

necessary to repeat activities (like importing unit test cases) on<br />

the integration level again. This also means, for example, that<br />
metrics like MC/DC coverage on individual units should be<br />

excluded from the test process, as this has already been shown<br />

in the unit test. This can be achieved by avoiding the code<br />

annotation for the units during the integration testing.<br />

VI. AUTOMATION<br />

As mentioned before, the number of test executions needed<br />

within a project is growing constantly. One obvious reason is<br />

the growing number of functions and features that need to be<br />

tested. Also, the introduction of model based development with<br />

its different simulation levels MIL, SIL and PIL contributes to<br />

this effect. However, probably the biggest contribution comes<br />

from the fact that agile development methods become more<br />

popular, which leads to tests being created early and more<br />

frequently within a project, up to a situation where tests (at<br />

least for the modified modules) run automatically as part of<br />

nightly builds within a continuous integration approach.<br />

For maximum flexibility in this context, the platform<br />

should provide a complete API, allowing the automation of all tool<br />

features including test execution and reporting. An integration<br />

with established continuous integration environments like<br />

Jenkins is also helpful and can reduce the need to manually<br />

script standard workflows.<br />

VII. CONCLUSION<br />

This paper presented a concept for a verification platform<br />

focusing on the software unit test and software integration test<br />

of embedded software as part of an ISO 26262 compliant<br />

model-based development process. As software becomes more and more safety-critical in automotive applications,<br />
more test methods need to be combined to achieve sufficient confidence, leading to more tools being introduced into the<br />

process. This number of independent tools leads to several<br />

challenges and problems, which were described in section I. As<br />

a solution, we propose a platform concept which provides some<br />

common core features (described in section II) on top of which<br />

the different test methods can be realized. This way they can<br />

benefit from a shared database which provides general and<br />

reusable information about the system under test, avoiding<br />

redundant tasks that would need to be repeated for every test<br />

method in different tool environments. We also described three<br />

key features of this platform: integration, separation, and

automation. Several aspects of this concept are already<br />

implemented in the commercially available product BTC<br />

EmbeddedPlatform, which is also certified for ISO 26262 by<br />

German TÜV Süd. Thanks to an open Eclipse-based<br />

architecture, additional test methods described in this paper<br />

could be added in the future, either by BTC Embedded Systems or by third parties.

824


Continuous Integration and Test<br />

from Module Level to Virtual System Level<br />

Johannes Foufas, Martin Andreasson<br />

Volvo Car Corporation<br />

Gothenburg, Sweden<br />

Michael Hartmann, Andreas Junghanns<br />

QTronic GmbH<br />

Berlin, Germany<br />

Abstract— Software-in-the-Loop (SiL) testing is a strategic sweet spot between Model-in-the-Loop (MiL) and Hardware-in-the-Loop (HiL) testing. We show in this paper how to use automatic C-code instrumentation to harness the superior properties of SiL technology for module tests, even when the C-code is generated as a few large controller functions combining the modules to be tested.

Furthermore, we show how to re-use module test

specifications in integration and system tests by separating the<br />

test criteria from the test stimulus. We call these test criteria<br />

requirements watchers and define them as system invariants.<br />

This powerful technique, combined with efficiently handling<br />

large numbers of controller variants by annotating watchers and<br />

scripts, allows the automatic validation of hundreds of<br />

requirements in module, integration, and system tests, improving software quality dramatically very early in the software development process.

Last but not least, we extend the idea of continuous<br />

integration to continuous validation to leverage all of the above to<br />

reach high levels of software maturity very early in the software<br />

development process. That will also benefit later test phases – like<br />

HiL system and system integration tests – by dramatically<br />

reducing commissioning efforts.<br />

Keywords— Software-in-the-Loop; continuous integration<br />

I. MOTIVATION AND CHALLENGES<br />

Engineers are under pressure to deliver improvements at a<br />

growing pace while satisfying an increasing number of regulatory demands concerning performance, safety, reliability

and ecology. The combination of more functionality and<br />

smaller turnaround times between new versions requires new<br />

methods of test and validation to keep software quality up to<br />

par. While traditional testing on the target hardware maintains<br />

a role in integration testing and satisfying strict safety norms, it<br />

is too slow, resource intensive and late with feedback for<br />

earlier phases of the control-software development cycle to<br />

increase robustness in a meaningful way.<br />

Common unit/module test approaches rely on MiL, which is prone to missing certain classes of bugs. SiL

simulation can alleviate these concerns by providing a testable<br />

system that is much closer to the C-code reality: using the<br />

generated C-code, the target integer variable scaling and the<br />

(variant-coded) parameter values for the target system, often<br />

even including parts of the basic-software and communication<br />

stacks [1,2]. And despite being so close to reality, SiL still offers all the strong points of MiL: a cheap and early-available execution platform (the PC), determinism, flexibility when integrating into different simulation tools (for example as FMUs), fully accessible and debuggable internals, easy automation for all system variants, and many more benefits.

But moving to SiL is not without challenges. First and most<br />

obviously, hardware-dependent parts of the control software<br />

cannot be included and suitable SiL-abstractions have to<br />

replace the missing code. Recent standardized software architectures, like AUTOSAR or ASAM MDX, ease such replacement and IO connectivity considerably, as standard APIs can be provided by the SiL platform, or standard description formats can be used to generate the connection layers, e.g. SiL AUTOSAR RTE generation from .arxml files. Even for pre-AUTOSAR ECUs this task can be handled quite efficiently these days: a limited number of tier-1 suppliers produced a limited number of vendor-dependent RTOS-inspired architectures that allow for high levels of reuse [3].

Another challenge is dealing with generated C-code for<br />

module test. The generated code is optimized for target use and<br />

may fuse many software modules into one large C-function<br />

(task). Stimulating individual software modules from the<br />

outside is not possible. Regenerating individual modules is out<br />

of the question, because changing the code generation process would lead to different C-code, defeating the purpose of SiL: to test exactly the code that will be compiled for the target, without changes. The solution: we instrument the generated C-code to gain control over all input variables of the module(s) under test.

Ideally one would like to reuse tests from MiL to SiL to<br />

HiL. However, the different levels of simulation detail,<br />

restrictions on measurement bandwidth, availability of the<br />

execution platforms, setup cost for different variants, etc. require a more sophisticated test strategy than “simple reuse”.

Focusing on the strengths of each platform and running each

test as early as possible will frontload, as one example,<br />

application layer function and integration tests to SiL, while<br />

leaving hardware related diagnostic tests on the HiL platform.<br />

And optimizing control strategies as early as possible will<br />

move these tasks to MiL simulations. Re-using test definitions<br />

is therefore limited by the different test goals and platform-related restrictions.

However, module and system-level tests can still share the<br />

same requirements, if not the same test focus. The solution to<br />

high levels of reuse for test specifications is separating the<br />

www.embedded-world.eu<br />

825


implementation of the requirements tests from the stimulus.<br />

While classic test automation combines test stimulus and<br />

requirement tests into the same script, we define requirement<br />

watchers as formal, stimulus and system-state independent<br />

invariants: conditions that must always hold. Engineers need to<br />

spend more time and care in writing such requirement<br />

watchers, but the payoff justifies this extra effort: Requirement<br />

watchers can be tested with any kind of stimulus, be it scripted tests, field measurements, short test vectors, hour-long load-collective simulations, or auto-generated test stimuli (e.g. by TestWeaver) [4]. Here we will show how to reuse module

requirements defined for module testing in system-level testing<br />

when written as requirement watchers.<br />

The increasing number of variants of control systems requires special measures during test and validation to reduce the manual matching of test cases to variants of the control software. We show how annotating requirement watchers and stimuli with filter properties enables automatic selection of relevant test cases.

Continuous Integration is a state-of-the-art method to detect<br />

integration problems. Combining CI with more than<br />

rudimentary tests is difficult if the target binary is the test<br />

object. Using SiL as the execution platform allows high levels of

automation for large numbers of tests because they can run on<br />

the same platform as the build process: the PC. Extending the<br />

idea of Continuous Integration (nightly builds) to Continuous<br />

Validation (nightly test) improves early detection of large<br />

classes of software problems considerably.<br />

II. TESTING AT VOLVO CARS CORPORATION

At Volvo Cars Corporation (VCC), SiL testing is at the<br />

core of a new Continuous Integration strategy. Through<br />

increasing the frequency of integration points and<br />

corresponding tests, control software reaches a higher level of<br />

maturity when final acceptance tests are carried out close to<br />

production. In order to achieve this, a large number of tests<br />

need to be defined and used throughout the development<br />

process.<br />

One concern so far has been the incompatibility of test<br />

cases and stimuli between MiL and SiL setups. The structure<br />

that is designed by a developer in modeling tools is often<br />

disregarded during code generation. This means testing is<br />

limited to module level, with modules growing in scope over<br />

time. Developers, on the other hand, design around smaller units represented as subsystems.

Fig. 1: Basic module with subfunctions<br />

The difficulty in test design for large models can be<br />

illustrated by the simple example in Figure 1. Subfunction A is<br />

defined by a set of requirements that define the behavior of the<br />

outputs (y) as a function of the intermediate signals (m).<br />

Historically, testing these requirements in anything other than<br />

MiL simulation would require the engineer to invert<br />

Subfunction B in order to design the correct set of inputs (u)<br />

for the test.<br />

Fig. 2: Function requiring transient stimuli<br />

For more complex modules, this approach is very costly<br />

and error-prone. As loops inside functions and state diagrams<br />

are introduced, tests for simple functionality require<br />

increasingly complicated transient stimuli.<br />

We aim to present an instrumentation approach that offers<br />

the opportunity to bypass parts of a function and allows<br />

developers to define stimuli and test criteria around arbitrarily<br />

small subfunctions of a module.<br />

Fig. 3: System under test with bypassing: test stimuli can be defined as m(t)

The requirements that are defined using this process shall<br />

remain independent of the stimulus and usable throughout all<br />

levels of testing, up to integration and robustness tests.<br />

III. INSTRUMENTATION APPROACH

Modelling tools like Simulink allow developers to<br />

structure their models into subsystems which can be used like<br />

atomic blocks. The subsystem can be copied or moved freely<br />

across models and can be tested independently in MiL. When<br />

generating code from a model using TargetLink, the complete model is represented by a single C function. Statements that are

generated from blocks within a subsystem are spread across the<br />

entire compilation unit. This means a subsystem cannot be<br />

executed on its own, preventing any kind of meaningful unit<br />

testing. To remove this limitation from SiL tests, we analyze<br />

the resulting C-code and inject bypass opportunities wherever a<br />

measurable signal is written.<br />

The injected code remains inactive unless the source is<br />

compiled for a SiL target and the user enables bypassing for<br />



the respective variable. This way, MISRA compliance of the production software is ensured even if the instrumented code makes it into release builds by accident.
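As a rough illustration of this idea, the injected hooks could look like the following sketch. The names and macro shape are ours, not the actual instrumentation output; SIL_BUILD is a hypothetical build switch (defined inside the block so the sketch is self-contained):

```c
#include <stdint.h>

#define SIL_BUILD 1            /* hypothetical switch; absent in target builds */

#ifdef SIL_BUILD
/* One bypass slot per measurable signal.  Only the test harness
   enables it and supplies the stimulus value. */
typedef struct {
    int     enabled;
    int16_t value;
} bypass_t;

bypass_t bypass_m1;

/* Injected after every write to a measurable signal: if the bypass is
   enabled, the stimulus overwrites the computed value.  In target
   builds the macro expands to nothing, so the release code is unchanged. */
#define BYPASS(slot, var) do { if ((slot).enabled) (var) = (slot).value; } while (0)
#else
#define BYPASS(slot, var) do { } while (0)
#endif

int16_t m1;                    /* intermediate signal m from Fig. 3     */
int16_t y1;                    /* output of the subfunction under test  */

void controller_step(int16_t u)
{
    m1 = (int16_t)(u * 2);     /* "Subfunction B" computes m ...        */
    BYPASS(bypass_m1, m1);     /* ... injected hook may override it     */
    y1 = (int16_t)(m1 + 1);    /* "Subfunction A" consumes m            */
}
```

With the bypass disabled, `controller_step` behaves like the original code; enabling it lets a test stimulate m(t) directly, as in Fig. 3.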

As code generators tend to use temporary, local variables<br />

where signals are not specifically made measurable, further<br />

analysis of the generated code is necessary. In cases where<br />

such a temporary variable is always equal to a measurable<br />

signal, it has to be set to the correct value as well. This<br />

specifically applies to signals transcending subsystem borders,<br />

which can be represented by two different variables in code.<br />

Fig. 4: Instrumentation of temporary variables<br />

State machines can be bypassed entirely, so no transitions are necessary to provide the system under test with the correct state and/or corresponding flags.

After the code is instrumented, the virtual basic software is<br />

automatically set up with regard to task scheduling and

supplier-dependent modifications. Compilation results in a<br />

virtual ECU containing the entire OEM-part of the control<br />

software which can be coupled with a plant model and/or other<br />

ECUs for system-level simulation.<br />

Without recompilation, engineers can trim the V-ECU to fit<br />

their use-case. Depending on a specification file provided by<br />

the user, the Virtual ECU will reconfigure its scheduler to only<br />

execute a subset of the included functions. The same<br />

specification can be extended by a detailed interface<br />

specification listing the ports of a subsystem. If this<br />

specification is present, all bypasses on the input side are<br />

activated and the variables are overwritten by stimuli during<br />

simulation.<br />

IV. DESIGN OF STIMULUS-INDEPENDENT TESTS

The instrumentation method described reduces the effort in<br />

test design significantly. Unit-Tests of small subfunctions can<br />

be created through traditional scripting and deployed as part of<br />

an automated test framework. While this method can produce<br />

comprehensive results with regard to verification and coverage,

it relies heavily on developers being able to foresee all possible<br />

problems.<br />

During the specification phase, requirements are written in<br />

a broad scope. Often a requirement will define a certain<br />

behavior that shall be true under certain conditions. In essence:<br />

Condition A => Behavior B<br />

Defining test cases around such requirements would be<br />

difficult, especially if the condition contains several continuous<br />

signals. The widespread approach of testing by creating a<br />

stimulus and checking for a specific reaction fails to capture a<br />

large number of possible scenarios as engineering hours and<br />

therefore the number of defined test cases are limited.<br />

In addition, a stimulus-reaction based test becomes obsolete<br />

once the object under test is integrated into a system, as the<br />

previously defined stimulus often cannot be reproduced due to<br />

its artificial nature.<br />

Side effects that appear based on the interaction of several<br />

components cannot be tested. A developer might cover all the<br />

expected combinations of outputs from another module or subfunction, but faulty signals resulting from a bug in that module might not be considered.

TestWeaver by QTronic provides the means to define<br />

requirements in a way that closely resembles the original<br />

specification. The test for a requirement is defined by<br />

precondition and expected behavior instead of stimulus and<br />

reaction.<br />

The definition of a requirement watcher entails conditions<br />

to activate the instrument and the criteria to be tested. A<br />

watcher intended to test the simple example above would<br />

remain inactive until Condition A is met and, once active, check for Behavior B.

For more complex cases, additional options such as tolerance times can be specified. Inverse usage, i.e. the specification of unwanted behavior, is also supported.
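A requirement watcher of this shape can be sketched as a small state machine evaluated at every simulation step. The structure and names below are our illustration, not TestWeaver's actual API:

```c
/* Watcher for "Condition A => Behavior B": inactive until the
   condition holds; while armed, the behavior must hold within an
   optional tolerance (grace period) after activation. */
typedef struct {
    int (*condition)(void);   /* Condition A: arms the watcher        */
    int (*behavior)(void);    /* Behavior B: must hold while armed    */
    int tolerance_steps;      /* steps the behavior may lag behind    */
    int armed_for;            /* internal: steps since activation     */
    int violated;             /* latched verdict                      */
} watcher_t;

/* Called once per simulation step, for any stimulus whatsoever. */
void watcher_step(watcher_t *w)
{
    if (!w->condition()) {
        w->armed_for = 0;     /* condition gone: disarm               */
        return;
    }
    w->armed_for++;
    if (!w->behavior() && w->armed_for > w->tolerance_steps)
        w->violated = 1;      /* requirement broken: latch it         */
}

/* Example invariant: "whenever speed > 100, the limiter is active". */
int speed, limiter;
int cond_fast(void)     { return speed > 100; }
int behav_limited(void) { return limiter != 0; }
```

Because the watcher only reads signals, it can ride along with scripted tests, field measurements, or generated scenarios alike.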

Each requirement can be tested at every point in time<br />

during a simulation. Requirements defined at subsystem level<br />

remain valid in system context and vice versa, and can be tested regardless of scope. As requirement watchers do not require

write-access to any signals, the definitions implemented for<br />

unit tests are still applicable in larger contexts where the code<br />

instrumentation might be omitted. Module and integration tests<br />

can thus be executed on final production code.<br />

Fig. 5: Requirement watchers can be reused throughout and refined for different scopes. All requirements are tested at every point. Stimuli are selected from a pool where applicable.

Code coverage is measured with Testwell’s CTC++. The<br />

decoupled requirements described above provide the option to<br />

use any input vector to increase coverage. Any scripts or<br />

measurements that are available can be added to the stimulus<br />

pool and simulated. This way, high code coverage can be<br />



achieved without specifically designing additional tests. The<br />

requirement definitions can also be reused with TestWeaver’s<br />

scenario generation for focused explorative tests, further<br />

increasing coverage and robustness.<br />

V. CONTINUOUS INTEGRATION AND VERIFICATION<br />

At VCC Powertrain, code is deployed to a Jenkins-based<br />

continuous integration system. Pipelines are defined to<br />

automatically build virtual ECUs and run applicable tests.<br />

Commits by function developers into the common model base<br />

trigger the execution of interface verification and module tests<br />

as well as integration tests relevant to the module, in SiL and HiL.

Fig 6: A typical Jenkins Pipeline.<br />

As a result, function developers get quick and reliable<br />

feedback about the behavior of their models in the context of a<br />

wider system. To verify and keep track of code quality and<br />

open issues, the full test suite is executed nightly.<br />

VI.<br />

CONCLUSION<br />

In this paper we present a number of critical building<br />

blocks necessary to improve software maturity early in the<br />

software development process. Software-in-the-Loop (SiL)<br />

allows test execution in a Continuous Validation process of the<br />

target C-code. Instrumentation of the target C-code allows<br />

manipulation of any input of the software module, enabling

module tests even if target code generation merges many<br />

modules into larger C-functions (tasks).<br />

When expressing module and system requirements as requirement watchers, we can reuse them more easily in most of the test stages, more than compensating for the extra effort of defining requirements as invariants.

Annotating requirement watchers and stimulation scripts<br />

with variant information allows automatic filtering to matching<br />

ECU configurations. This way, a single test database can be<br />

used to handle a multitude of variants while at the same time ensuring that all relevant requirements will be tested on all variants with all test stimuli, reaching code-coverage and requirement-coverage goals more quickly and more easily than with traditional test methods.

As the virtual ECU can be reconfigured within Silver to<br />

include or exclude any function in the entire application<br />

software, build times are kept to a minimum.<br />

In order to reduce the amount of work needed to design<br />

tests even further, closed loop simulations including detailed<br />

plant models will be integrated into the VCC CI and CT<br />

toolchain. Reusing the existing requirement watchers,<br />

TestWeaver’s scenario generation will be employed in order to<br />

increase robustness and test coverage even further.<br />

[1] Brückmann, Strenkert, Keller, Wiesner, Junghanns: Model-based<br />

Development of a Dual-Clutch Transmission using Rapid Prototyping<br />

and SiL. International VDI Congress Transmissions in Vehicles 2009,<br />

30.06.–01.07.2009, Friedrichshafen, Germany

[2] Rui Gaspar, Benno Wiesner, Gunther Bauer: Virtualizing the TCU of<br />

BMW's 8 speed transmission, 10th Symposium on Automotive<br />

Powertrain Control Systems, 11. - 12.09.2014, Berlin, Germany<br />

[3] René Linssen, Frank Uphaus, Jakob Maus: Software-in-the-Loop at the<br />

junction of software development and drivability calibration, 16th

Stuttgart International Symposium (FKFS), 15. - 16.03.2016, Stuttgart,<br />

Germany<br />

[4] Mugur Tatar: Enhancing the test and validation of complex systems with<br />

automated search for critical situations, VDA Automotive SYS<br />

Conference, 06. - 08.07.2016, Berlin, Germany<br />



Self-testing in Embedded Systems<br />


Colin Walls<br />

Mentor, a Siemens business<br />

Newbury, UK<br />

colin_walls@mentor.com<br />

Abstract—All electronic systems carry the possibility of<br />

failure. An embedded system has intrinsic intelligence that<br />

facilitates the possibility of predicting failure and mitigating its<br />

effects. This paper reviews the options for self-testing that are<br />

open to the embedded software developer. Testing algorithms for<br />

memory are outlined and some ideas for self-monitoring software<br />

in multi-tasking and multi-CPU systems are discussed.<br />

Keywords—embedded software, self-testing<br />

I. INTRODUCTION<br />

Things go wrong. Electronic components die. Systems fail.<br />

This is almost inevitable and, the more complex systems

become, the more likely it is that failure will occur. In complex<br />

systems, however, that failure might be subtle; simple systems<br />

tend to just work or not work.<br />

As an embedded system is "smart", it seems only<br />

reasonable that this intelligence can be directed at identifying<br />

and mitigating the effects of failure...<br />

Self-testing is the broad term for what embedded systems<br />

do to look for failure situations. This paper primarily identifies<br />

some of the key issues.<br />

Broadly, an embedded system can be broken down into 4<br />

components, each of which can fail:<br />

• CPU<br />

• Peripherals<br />

• Memory<br />

• Software<br />

II. CPU FAILURE<br />

CPU failure is not too common, but is far from unknown.<br />

Unfortunately, there is very little that a CPU can do to predict<br />

its own demise. Of course, in a multicore system, there is the<br />

possibility of the CPUs monitoring one another.<br />

III. PERIPHERAL FAILURE<br />

Peripherals can fail in many and varied ways. Each device<br />

has its own possible failure modes. To start with, the self-test<br />

software can check that each peripheral is responding to its<br />

assigned address and has not failed totally. Thereafter, any<br />

further self-test is very device dependent. For example, a<br />

communications port may have a "loop back" mode, which<br />

enables the self test to verify transmission and reception of<br />

data.<br />
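To make the loop-back idea concrete, here is a sketch of such a self test. The register layout and the `uart_sync` hook are hypothetical stand-ins; a real port would use the device's datasheet registers and poll a receive-ready flag with a timeout:

```c
#include <stdint.h>

/* Hypothetical memory-mapped register block of a serial port. */
typedef struct {
    volatile uint8_t ctrl;   /* bit 0: loop-back enable */
    volatile uint8_t tx;     /* transmit data register  */
    volatile uint8_t rx;     /* receive data register   */
} uart_regs_t;

#define UART_LOOPBACK 0x01u

/* Platform hook: wait until the transmitted byte has looped back.
   This mock simply copies tx to rx when loop-back is enabled. */
static void uart_sync(uart_regs_t *u)
{
    if (u->ctrl & UART_LOOPBACK)
        u->rx = u->tx;
}

/* Returns 1 if every test pattern sent re-appears on the receive
   side, 0 otherwise.  The control register is restored afterwards. */
int uart_selftest(uart_regs_t *u)
{
    static const uint8_t patterns[] = { 0x00u, 0xFFu, 0xAAu, 0x55u };
    uint8_t saved = u->ctrl;
    int ok = 1;

    u->ctrl |= UART_LOOPBACK;
    for (unsigned i = 0; i < sizeof patterns; i++) {
        u->tx = patterns[i];
        uart_sync(u);
        if (u->rx != patterns[i])
            ok = 0;
    }
    u->ctrl = saved;
    return ok;
}
```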

IV. MEMORY FAILURE<br />

Memory is, of course, a critical component of an embedded system and is certainly subject to failure from time to time.

Considering how much memory is installed in modern systems,<br />

it is surprising that catastrophic failure is not more common.<br />

Like all electronic components, the most likely time for<br />

memory chips to fail is on power up, so it is wise to perform a<br />

comprehensive test then, before vital data is committed to<br />

faulty memory.<br />

If a memory chip is responding to being addressed, there<br />

are broadly two possible failure modes: stuck bits [i.e. bits that<br />

are set to 0 or 1 and will not change]; cross-talk [i.e.<br />

setting/clearing one bit has an effect on one or more other bits].<br />

If either of these failures occurs while software is running, it is<br />

very hard to trace. The simplest test to look for these failures<br />

on start-up is a "moving ones" [and "moving zeros"] test. The<br />

logic for moving ones is simple:<br />

set every bit of memory to 0
for each bit of memory
{
   verify that all bits are 0
   set the bit under test to 1
   verify that it is 1 and that all other bits are 0
   set the bit under test to 0
}

A moving zeros test is the same, except that 0 and 1 are<br />

swapped in this code.<br />

Coding this test such that it does not use any RAM to<br />

execute [assuming start up code is running out of flash] is an<br />

interesting challenge, but most CPUs have enough registers to<br />

do the job.<br />
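Translated into C over a word-addressed RAM region, a moving-ones pass could look like the sketch below. This is for illustration only: the real power-up test must run register-only from flash, as noted above, and for large memories the full cross-talk check over all other bits is usually restricted to neighbouring words, since checking everything is quadratic in the memory size.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Moving-ones test: walk a single 1 through every bit position and
   verify that no other bit changes (detects stuck bits and cross-talk).
   Returns 1 on pass, 0 on the first failure. */
int moving_ones(uint32_t *mem, size_t words)
{
    memset(mem, 0, words * sizeof *mem);

    for (size_t w = 0; w < words; w++) {
        for (unsigned b = 0; b < 32; b++) {
            uint32_t bit = (uint32_t)1 << b;

            mem[w] = bit;
            if (mem[w] != bit)             /* bit stuck at 0?       */
                return 0;
            for (size_t k = 0; k < words; k++)
                if (k != w && mem[k] != 0) /* cross-talk elsewhere? */
                    return 0;

            mem[w] = 0;
            if (mem[w] != 0)               /* bit stuck at 1?       */
                return 0;
        }
    }
    return 1;
}
```

A moving-zeros pass swaps the roles of 0 and 1, exactly as in the pseudocode.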



Of course, such comprehensive testing cannot be performed<br />

on a running system. A background task of some type can carry<br />

out more rudimentary testing using this kind of logic:<br />

for each byte of memory
{
   turn off interrupts
   save memory byte contents
   for values 0x00, 0xff, 0xaa, 0x55
   {
      write value to byte under test
      verify value of byte
   }
   restore byte data
   turn on interrupts
}
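In C, with interrupt locking reduced to two hypothetical platform hooks (empty stubs here; a Cortex-M port might use `__disable_irq()`/`__enable_irq()`), this background test could be sketched as:

```c
#include <stdint.h>
#include <stddef.h>

/* Platform-specific interrupt control; stubbed out for the sketch. */
static void irq_disable(void) { }
static void irq_enable(void)  { }

/* Non-destructive byte test for a background task: each byte is
   saved, exercised with four patterns, and restored before
   interrupts are re-enabled.  Returns 1 on pass, 0 on failure. */
int background_ram_test(volatile uint8_t *mem, size_t len)
{
    static const uint8_t patterns[] = { 0x00u, 0xFFu, 0xAAu, 0x55u };
    int ok = 1;

    for (size_t i = 0; i < len; i++) {
        irq_disable();               /* nothing may touch the byte now */
        uint8_t saved = mem[i];
        for (unsigned p = 0; p < sizeof patterns; p++) {
            mem[i] = patterns[p];
            if (mem[i] != patterns[p])
                ok = 0;
        }
        mem[i] = saved;              /* restore live data              */
        irq_enable();
    }
    return ok;
}
```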

These testing algorithms, as described, assume that all you<br />

know about the memory architecture is that it spans a series of<br />

[normally contiguous] addresses. However, if you have more<br />

detailed knowledge - which memory areas share chips or how<br />

rows and columns are organized - more optimized tests may be<br />

devised. This is desirable, as a slow start up will impact user<br />

satisfaction with a device.<br />


V. SOFTWARE ERROR CONDITIONS<br />

Software failure is obviously a possibility and defensive<br />

code may be written to avoid some possible failure modes. Of<br />

course, a bug in the software might lead to a totally<br />

unpredictable failure.<br />

All non-trivial software has bugs. Obviously, well-designed software is likely to have fewer, and the application of modern embedded software development tools can keep them to a

minimum. Of course, specific bugs cannot be predicted<br />

[otherwise they could be eradicated], but certain types of<br />

software problem can be identified and it may be possible to<br />

spot a problem before it becomes a disaster.<br />

I would divide such software problems into two broad<br />

categories:<br />

• data corruption<br />

• code looping<br />

As a significant amount of embedded code is written in C,<br />

that means that developers are likely to be making use of<br />

pointers. Used carefully, pointers are a powerful feature of the<br />

language, but they are also one of the most common sources of<br />

programmer error. Problems with pointer usage are hard to<br />

identify statically and the bugs introduced might manifest<br />

themselves in subtle ways when the code is executed. Some<br />

things, like dereferencing a null pointer are easily detected, as<br />

they normally cause a trap. Others are harder, as a pointer<br />

could end up pointing just about anywhere - more often than<br />

not it will be to a valid address, but, unfortunately, it may not<br />

be the correct one. There is little that self-testing code can do<br />

about this. There are, however, two special cases of pointer<br />

usage where there is a chance: stack overflow and array bound<br />

violations.<br />

Stack overflow should not occur, as the stack allocation<br />

should be carefully determined and its usage verified during the<br />

debug phase. However, it is quite possible to overlook a special<br />

situation or make use of a less testable construct [like a<br />

recursive function]. A simple solution is to include an extra<br />

word at either end of the stack space - "guard words". These<br />

are pre-loaded with a specific value, which is monitored by a<br />

self-test task [which may run in the background]. If the value<br />

changes, the stack limits have been violated. The value should<br />

be chosen carefully. An odd number is best, as that would not<br />

represent a valid address for most processors. Perhaps<br />

0x55555555. So long as the value is "unlikely" - so not<br />

0x00000001 or 0xffffffff, for example - the chance of a false alarm is about 1 in 4 billion.

In some languages, there is built-in detection for addressing<br />

outside the bounds of an array, but this introduces a runtime<br />

overhead, which may be unwelcome. So this is not<br />

implemented in C. Also, it is possible to access array elements<br />

using pointers, instead of the [ ] operator, so any checking<br />

might be circumvented. The best approach is to just check for<br />

buffer overrun type of errors by locating a guard word at the<br />

end of an array and monitoring in the same way as the stack<br />

overflow check.<br />
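The same guard-word mechanism serves both cases. A minimal sketch, with names of our choosing (a real system would typically place the stack guards at the task stack limits via the linker script):

```c
#include <stdint.h>

#define GUARD 0x55555555u   /* odd, "unlikely" value, as discussed above */

/* A data buffer fenced by guard words at both ends.  The identical
   layout applies to a task stack: one guard word at each limit. */
struct guarded_buf {
    uint32_t guard_lo;
    uint8_t  data[64];
    uint32_t guard_hi;
};

void guarded_init(struct guarded_buf *b)
{
    b->guard_lo = GUARD;
    b->guard_hi = GUARD;
}

/* Polled periodically by a background self-test task: returns 0 as
   soon as an overrun (or stack overflow) has smashed a guard. */
int guarded_check(const struct guarded_buf *b)
{
    return b->guard_lo == GUARD && b->guard_hi == GUARD;
}
```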

Code should never get stuck in an infinite loop, but a logic<br />

error or the non-occurrence of an expected external event might<br />

result in code hanging. In any kind of multi-threaded<br />

environment - either an RTOS or mainline code with ISRs - it<br />

is possible to implement a "watchdog" mechanism. Each task<br />

that runs continuously [which might be just the mainline code]<br />

needs to "check in" with the watchdog task [which may be a<br />

timer ISR] every so often. If a timeout occurs, action needs to<br />

be taken. I discussed this matter, from a different perspective,<br />

in a blog about user displays a little while ago.<br />
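A minimal software watchdog of this kind can be sketched with a check-in bitmask (names and task count are ours):

```c
#include <stdint.h>

/* Each continuously running task owns one bit.  Tasks set their bit
   ("check in"); the watchdog timer ISR verifies that every bit was
   set since the last tick, then clears the mask for the next period. */
#define TASK_COUNT 3u
#define ALL_TASKS  ((1u << TASK_COUNT) - 1u)

static volatile uint32_t checkins;

void task_check_in(unsigned task_id)
{
    checkins |= 1u << task_id;
}

/* Called from the watchdog timer ISR.  Returns 1 if all tasks are
   alive; 0 means some task failed to check in and action must be
   taken (alarm, safe state, or reset, depending on the application). */
int watchdog_tick(void)
{
    int all_alive = (checkins & ALL_TASKS) == ALL_TASKS;
    checkins = 0;
    return all_alive;
}
```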

So, what is to be done when a stack overflow, array bound<br />

violation or hanging task is detected? This depends on the<br />

application. It may be necessary to stop the system, sound an<br />

alarm of some kind, or simply reset the system. The choice<br />

depends on many factors, but broadly the goal is for something<br />

better than a crashed system.<br />

VI. FAILURE RECOVERY AND REPORTING<br />

A final question is what to do if a failure is detected. Of<br />

course, this will be different for every system. Broadly, the<br />

system should be put in a safe state [shut down?] and the user<br />

advised.<br />

VII. CONCLUSIONS<br />

It is important to accept that failure of parts of a system is a<br />

possibility. Consideration must be given to all possible failure<br />



modes. Code may be added to a system to monitor its “health”<br />

and take action if a failure is detected. That action may be a<br />

warning to the user or perhaps a rectification of the problem.<br />



Efficient Software Variants Testing<br />

Michael Wittner<br />

Razorcat Development GmbH<br />

Berlin, Germany<br />

www.razorcat.com<br />

Abstract—The challenge in testing software variants is that<br />

every variant needs to be tested completely. In the following, a method to reuse and inherit variant tests is introduced. By defining base tests from which variant tests inherit, redundant work can be avoided. For every application change, tests need to be maintained in one place only.

Keywords—unit integration testing, requirement, certification,<br />

variant management<br />

I. INTRODUCTION<br />

Safety-critical standards in various industries, such as ISO<br />
26262 in automotive engineering or IEC 62304 in the medical<br />
industry, demand complete code coverage. This requires that<br />
every single software variant be tested completely. In<br />
practice this is often realized by copying the tests of one variant<br />
and adapting the copy to the respective other variant.<br />
New software requirements and software changes increase the<br />
cost of such variant testing, because those changes have to be<br />
implemented redundantly in all variants. Besides the high<br />
effort of maintaining and extending such tests, there is also a<br />
high risk of copy-and-paste mistakes, which might<br />
eventually lead to undiscovered safety-critical defects in the<br />
application.<br />

A. What is a variant?<br />

There are various ways to create software variants<br />
(e.g. of C/C++ source code):<br />
• Enabling/disabling code parts via defines<br />
• Generating code variants with tools (e.g. out of<br />
MATLAB)<br />
• Copying, renaming, and changing the source file<br />
• Executing identical sources on different hardware<br />
platforms (applicable for high safety requirements)<br />

A software variant is defined by a particular software<br />

module configuration (e.g. a C source file). Such a variant does<br />

not necessarily need to be functional; it might as well be an<br />
abstract variant. Only through specific settings (mostly defines)<br />
does an abstract base variant turn into actually applicable<br />
software variants.<br />

B. Test goal: Code coverage<br />

To obtain complete code coverage of each variant, one<br />
might simply add the measured coverage of the variant-specific<br />
code to the measured coverage of the commonly used code.<br />
Fig. 1 shows a simple example of a code variant in which<br />
another value should be added to the variable “level”. The<br />
yellow-marked programming error (the missing addition<br />
operator in line 16) could remain undiscovered if the commonly<br />
used code in lines 19-23 remains untested in the variant.<br />

Fig. 1. Code variant with mistake<br />

Therefore it is not enough to test only individual parts of the<br />
variant code assembled by defines and to add up the code<br />
coverage measurements: every code variant needs to be treated<br />
as an independent program, because hidden or added parts<br />
can influence the shared parts.<br />

II. SOLUTION APPROACH<br />

In the following, a method to establish and maintain variant<br />
tests is presented by means of an example. The example<br />
contains a function for indicating the tank level status of<br />
various vehicles (passenger cars and trucks). An additional<br />
difficulty is that the vehicle variant “truck” can be equipped<br />
with a supplementary tank whose fuel level should also be<br />
considered.<br />

A. Example function<br />

The example is about a function that supplies a status<br />

related to the filling level of a vehicle tank. The specification is<br />

graphically displayed in Fig. 2: The function is expected to<br />



give a warning or an alarm when the fuel level falls below<br />

defined marks. If none of this is the case, the function is supposed<br />

to deliver the value “normal”.<br />

On the other hand, the variant hierarchy definition above all<br />
serves clarity in the case of deeply nested software<br />
variants. Within the presented method the variants are arranged<br />
in a variant tree, which can have many levels and displays the<br />
software’s variant structure. This tree serves as an orientation<br />
and shows which test should be created on which variant level.<br />

Fig. 4. Variants hierarchy<br />

Fig. 2. Fuel level value definition with calculation thresholds<br />

A simple implementation of this function could look as<br />
shown in Fig. 3. Through a variant configuration (#define<br />
TRUCK), the supplementary tank’s fuel level can be<br />
included in the calculation.<br />
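A rough C sketch of such an implementation follows. The threshold values and names are hypothetical; only the #define TRUCK variant mechanism is taken from the text:<br />

```c
#define ALARM_LEVEL  5          /* alarm below this mark (hypothetical) */
#define WARN_LEVEL  10          /* warning below this mark (hypothetical) */

typedef enum { FUEL_ALARM, FUEL_WARNING, FUEL_NORMAL } fuel_status_t;

/* #define TRUCK */             /* variant configuration */

#ifdef TRUCK
static int aux_level;           /* supplementary tank, trucks only */
void set_aux_level(int l) { aux_level = l; }
#endif

fuel_status_t fuel_status(int level)
{
#ifdef TRUCK
    level += aux_level;         /* include the supplementary tank */
#endif
    if (level < ALARM_LEVEL)
        return FUEL_ALARM;
    if (level < WARN_LEVEL)
        return FUEL_WARNING;
    return FUEL_NORMAL;
}
```

With TRUCK defined, the shared threshold comparisons run on a different effective input, which is exactly why the common code must be re-tested per variant.<br />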

The example above shows a variant hierarchy of various<br />

vehicles and includes variants with fixed or optional<br />

supplementary tanks for trucks. The variant “truck” in this case<br />

could be either an abstract configuration or a concrete vehicle<br />
type.<br />

C. Variant tests definition<br />

Tests can be divided into two types: basic tests and variant<br />
tests. Basic tests refer to abstract functionality, e.g. the basic<br />
configuration of a software module. On this level all potential<br />
test cases of the variants derived from this basic model can be<br />
defined. Initial test data for the basic tests can also be specified,<br />
though it does not have to be complete. (One does not want to<br />
execute these tests anyway.) Possible test cases can be defined on<br />
the top level of a variant tree, e.g. by means of a classification tree.<br />
The test specification below takes all possible variant<br />
configurations of our example into account.<br />

Fig. 3. Implementation of the filling level function<br />

B. Software hierarchy analysis<br />

The number of software variants to test follows<br />
automatically from all possible software configurations. For the<br />
test it has to be taken into account whether a variant (e.g. the<br />
basic configuration) can actually be tested, depending on<br />
whether it is executable software or just a basic configuration<br />
that is not a functional unit on its own. Sometimes abstract tests<br />
can be defined for those abstract variants. However, actually<br />
executable tests only arise through further implementation of<br />
such tests for a concrete variant.<br />

Fig. 5. Test specification for all variants<br />

The test specification describes the necessary tests that are<br />

now propagated to the children in the variant tree: The<br />

inherited tests can be changed, hidden or completed with<br />

specific tests in every variant. In Fig. 6, for example, all tests<br />
that do not refer to the variant “passenger car” are hidden.<br />



The value “40” within base test case 2.1 is a normal tank level<br />
for an 80-liter tank. This value needs to be increased significantly<br />
within the “Truck” variant, assuming a 1000-liter truck tank.<br />
Therefore the value has been overwritten with “500” for the<br />
variant test.<br />

Fig. 6. Test cases for the variant “passenger car“<br />

As a consequence every superordinate variant test only<br />

needs to be created once and then maintained in only one place<br />

for every new requirement and application change.<br />

D. Unique identification of test cases<br />

A “Universally Unique Identifier” (UUID) is assigned to<br />
every basic test case to uniquely identify test cases. These<br />
UUIDs are unique world-wide and over time. If a test<br />
case is now passed on to a variant, the inherited and possibly<br />
modified test case is still the same (basic) test case. During a<br />
review of the tests, the inherited test cases can be unambiguously<br />
compared with the basic tests or with the tests of other variants.<br />

The UUID assignment also enables geographically<br />
distributed work on variant tests, because all test cases are<br />
uniquely identified and tests can therefore be merged<br />
and updated again without any problem.<br />
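The mechanism can be sketched in C (names and fields are hypothetical): the UUID travels with a test case when it is inherited, so an inherited and locally modified copy can still be matched with its basic test:<br />

```c
#include <string.h>

/* Each basic test case carries a UUID; inherited copies keep it. */
typedef struct {
    char        uuid[37];   /* e.g. "f81d4fae-7dec-11d0-a765-00a0c91e6bf6" */
    const char *name;
    int         expected;   /* a test datum that a variant may overwrite */
} test_case_t;

/* Two test cases are the "same" basic test iff their UUIDs match. */
int same_basic_test(const test_case_t *a, const test_case_t *b)
{
    return strcmp(a->uuid, b->uuid) == 0;
}

/* Inherit a base test into a variant: the UUID is preserved,
   the test data may be overwritten. */
test_case_t inherit(const test_case_t *base, int overwritten_expected)
{
    test_case_t v = *base;
    v.expected = overwritten_expected;
    return v;
}
```

This mirrors the “40” vs. “500” example: the truck variant overwrites the value, yet both copies remain comparable as one basic test case.<br />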

E. Rules for test case inheritance<br />

In our example all possible test cases were defined on the<br />

top level. For some variants (e.g. “passenger car”) only some of<br />

the test cases make sense. Therefore the following inheritance<br />

operations are necessary:<br />

• Changing inherited test data<br />
• Deleting/hiding inherited test cases<br />
• Adding additional test cases<br />

On the test data level it also needs to be distinguished whether a<br />
value was inherited or only defined locally. Synchronizing the<br />
variants updates all inherited values. The following<br />
value statuses therefore result:<br />
• Value was inherited<br />
• Value was inherited and overwritten<br />
• Value was defined locally for this variant test<br />

These values can easily be distinguished through color<br />
coding as shown in Fig. 7: the light blue colored values were<br />
inherited and the purple colored values were overwritten in the<br />
variant.<br />
The example in Fig. 7 shows the inheritance of values from<br />
the base tests to the tests of the variant “Truck”. Most of the<br />
values were assigned within the test specification in the<br />
classification tree, so they are displayed with a grey background.<br />

Fig. 7. Color coding of inherited and overwritten values<br />

III. RESULTS<br />

The following strategy was used for the presented variant<br />

management solution:<br />

• Deduction of the variant definition from the application<br />
design<br />
• Definition of all possible test cases on the highest level<br />
for all variants<br />
• Hiding of the test cases not needed in every variant<br />
• Completing/implementing tests separately in every<br />
variant<br />

The advantage of this approach lies in the centralized<br />
test case specification in one single classification tree. This<br />
increases transparency and offers a complete overview of all<br />
tests in a review. Hiding unnecessary test cases in the individual<br />
variants is relatively easy, while at the same time it directs the<br />
test engineer’s mind to essential questions such as: which test<br />
case is in fact relevant for the actual variant?<br />

A. “Natural” way of creating test cases<br />

For a test engineer it is normally easier to develop tests for<br />

a concrete software variant than thinking about abstract tests<br />

for all potential software variants. Depending on the kind of<br />
software to be tested, it can make sense to first develop a<br />
complete test suite for a specific variant and then transfer these<br />
tests to the topmost base variant. This way the tests can be<br />

inherited down the variant hierarchy and the test engineer can<br />

use all the features of the variant management.<br />

B. Variants within the classification tree<br />

One could also think of introducing variants already into<br />

the test specification (i.e. the classification tree). Filtering of<br />

sub trees according to the selected variant would result in<br />

specific test specifications for each software variant. The whole<br />
classification tree itself would still be available and<br />
maintained at the topmost level of the variant hierarchy.<br />

Reviewers could either look at the overall tree or at each<br />

specific filtered variant test specification.<br />



The Impact of Test Case Quality<br />

Frank Büchner<br />

Principal Engineer Software Quality<br />

Hitex GmbH<br />

Karlsruhe, Germany<br />

frank.buechner@hitex.de<br />

Abstract—Even “good looking” sets of test cases can fail to<br />

detect defects in the source code, e.g. during unit testing, even if<br />

the tests achieve 100% code coverage. However, how do we<br />

develop good tests? This paper tries to give some insights.<br />

Keywords— Test case specification, equivalence partitioning,<br />

boundary values, code coverage, mutation testing, error seeding,<br />

Classification Tree Method (CTM).<br />

I. INTRODUCTION<br />

When it comes to testing, a lot of effort is spent selecting<br />

the “right” testing tool. However, often this effort is expended<br />

for a secondary goal. Certainly, you need a tool that works<br />

for you, your development environment, your project, and your<br />

process. However, what is paramount for good testing is not<br />

the testing tool, but the quality of the test cases. Only “good”<br />

test cases will find defects in the software.<br />

II. SIMPLE EXAMPLE<br />

The specification for a simple test object could be as<br />

follows:<br />

A start value and a length define a range of values.<br />

Determine if a given value is within the defined range or not.<br />

The end of the range shall not be inside the range. All data<br />

types are integer.<br />
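A possible C implementation of this test object could look as follows (a sketch; the name and the convention that the range is [start, start + length) are assumptions):<br />

```c
/* Test object: is `value` inside the range defined by `start` and
   `length`? The end of the range (start + length) shall NOT be
   inside the range. */
int in_range(int start, int length, int value)
{
    /* A typical boundary defect here would be writing `<=` instead
       of `<`, which wrongly puts the end of the range inside it. */
    return value >= start && value < start + length;
}
```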

The following three test cases pass and reach 100%<br />
(MC/DC) coverage.<br />
Fig. 1. These three test cases pass and reach 100% coverage<br />

So, what is wrong with our test cases? All three test cases<br />
pass and we have reached 100% code coverage. The answer is<br />
that we have not tested all requirements (i.e. we have not<br />
tested that the end of the range is not in the range) and, in<br />
consequence, we have not tested using boundary values. A test<br />
using the values 5, 2 and 7 for start, length and value respectively<br />
fails because of a software defect in the test object. This<br />
software defect probably is a wrong check, e.g. ‘<=’ instead of ‘<’.<br />


For each of the defective implementations, the second column<br />
tries to give a combination of the likelihood that (a) the<br />
error is made by the programmer and (b) the defect is not<br />
detected by a review. The likelihood for (a) is considered to be<br />
related to the number of wrong characters in the relation (e.g.<br />
(i <= j) differs from the correct (i < j) by a single character).<br />


Fig. 5. The minimum() function with a programming defect<br />

For the example of a sorting function, extreme inputs<br />
could be an array that is already sorted, one sorted in<br />
reverse order, or one in which all elements have the same value.<br />

D. Illegal Values<br />

Let us go back to the simple example of section II. The<br />

specification says, “All data types are integer”. This also holds<br />
for the length of the range and, as a consequence, the length<br />
could be negative. Is a negative length a valid input? Probably<br />
not. At least, it is an interesting test to try the input 5 for the start<br />
and -2 for the length, and to see whether the implementation<br />
considers 4 to be inside the range or not.<br />

As a rule of thumb: Always look for (maybe) invalid input<br />

and construct test cases out of it.<br />

IV. EQUIVALENCE PARTITIONING<br />

A ubiquitous problem related to test case specification is<br />

that an input variable can take on too many values, and it is not<br />

possible / not efficient to use all values in test cases, especially<br />

in combination with many values of other input variables<br />

(“combinatorial explosion”). Generation of equivalence classes<br />
(also called “equivalence partitioning”) solves this problem.<br />

Equivalence partitioning divides all input values into classes.<br />

Values are assigned to the same class, if the values are<br />

considered equivalent for the test. Equivalent for the test means<br />

that if one value out of a certain class causes a test to fail and<br />

hence reveals an error, every other value out of this class will<br />

also cause the same test to fail and will reveal the same error.<br />

In other words: It is not relevant for testing which value out<br />

of a class is used for testing, because they all are considered to<br />

be equivalent. Therefore, you may take an arbitrary value out<br />

of a class for testing, even the same value for all tests, without<br />

decreasing the relevance of the tests. However, the prerequisite<br />
for this is that the equivalence partitioning was done correctly.<br />
This is the responsibility of the person applying equivalence<br />
partitioning.<br />
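The idea can be made concrete with a small C driver. The input, the class boundaries and the representative values below are invented for illustration:<br />

```c
/* One representative per equivalence class of the input "speed". */
enum { V_MAX = 250 };           /* hypothetical maximum speed */

int speed_is_valid(int speed)
{
    return speed >= 0 && speed <= V_MAX;
}

/* Classes: negative | zero | normal | v_max | above v_max */
static const int representative[] = { -10, 0, 120, V_MAX, V_MAX + 1 };
static const int expect_valid[]   = {   0, 1,   1,     1,         0 };

/* Exercise the check once per class; by the equivalence assumption,
   any other member of a class would yield the same verdict. */
int all_classes_pass(void)
{
    unsigned n = sizeof representative / sizeof representative[0];
    for (unsigned i = 0; i < n; i++)
        if (speed_is_valid(representative[i]) != expect_valid[i])
            return 0;
    return 1;
}
```

Five values stand in for the whole integer input domain, which is the entire point of the method.<br />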

Fig. 6. Example for equivalence partitioning (according to shape)<br />

V. THE CLASSIFICATION TREE METHOD (CTM)<br />

The Classification Tree Method (CTM) is a method for test<br />

case specification supporting the methods for test case<br />

derivation discussed so far.<br />

The CTM starts by analyzing the requirements. The<br />

requirements determine which inputs are relevant (i.e. which<br />

inputs should be varied, i.e. which inputs should have<br />

different values during the test). In the next step the possible<br />

input values are divided into classes according to the<br />

equivalence partitioning method. The third step is to consider<br />

boundary / extreme / invalid input values. These three steps<br />

result in the classification tree. The classification tree forms the<br />

upper part of the (graphical) representation of the test case<br />

specification according to the CTM. The root of the tree is at<br />

the top; the tree grows from top to bottom; classifications have<br />

frames; classes are without frames; the leaf classes form the<br />

head of the combination table. The combination table consists<br />

of lines, each line specifying a test case. Markers on the<br />

respective line select equivalence classes, from which the<br />

values for the test are taken. A human draws the tree, giving<br />

names to classifications and classes, and sets the markers on<br />

the test case line, i.e. test case specification is a human activity<br />

(subject to human error, unfortunately).<br />
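A combination table like the one described can also be modelled directly in C, which makes the rule that every class must be used at least once mechanically checkable. The seven table lines below follow the suspension example; the exact names are assumptions:<br />

```c
/* Leaf classes of the two classifications, encoded as enums. */
typedef enum { SPEED_NEGATIVE, SPEED_ZERO, SPEED_NORMAL,
               SPEED_VMAX, SPEED_TOO_HIGH } speed_class_t;
typedef enum { ANGLE_LEFT, ANGLE_CENTRAL, ANGLE_RIGHT } angle_class_t;

/* One line of the combination table = one test case specification:
   one marker per classification, selecting the class the value is
   taken from. */
typedef struct {
    const char   *name;
    speed_class_t speed;
    angle_class_t angle;
} ctm_test_case_t;

static const ctm_test_case_t table[] = {
    { "normal left",    SPEED_NORMAL,   ANGLE_LEFT    },
    { "normal central", SPEED_NORMAL,   ANGLE_CENTRAL },
    { "normal right",   SPEED_NORMAL,   ANGLE_RIGHT   },
    { "zero",           SPEED_ZERO,     ANGLE_CENTRAL },
    { "v_max",          SPEED_VMAX,     ANGLE_CENTRAL },
    { "negative",       SPEED_NEGATIVE, ANGLE_CENTRAL },
    { "above v_max",    SPEED_TOO_HIGH, ANGLE_CENTRAL },
};

/* A valid CTM specification uses every leaf class at least once. */
int every_class_used(void)
{
    int speed_used[5] = {0}, angle_used[3] = {0};
    unsigned n = sizeof table / sizeof table[0];
    for (unsigned i = 0; i < n; i++) {
        speed_used[table[i].speed] = 1;
        angle_used[table[i].angle] = 1;
    }
    for (int s = 0; s < 5; s++) if (!speed_used[s]) return 0;
    for (int a = 0; a < 3; a++) if (!angle_used[a]) return 0;
    return 1;
}
```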

Fig. 7. Test case specification using the Classification Tree Method (CTM)<br />

In the figure above an example for the test case<br />

specification according to the CTM is given. The root of the<br />

tree is labelled “Suspension”, i.e. the test object obviously is a<br />

suspension. Also quite obviously, two inputs are relevant for<br />

the test: “Speed” and “Steering Angle”. “Speed” and “Steering<br />

Angle” are classifications (in frames), at the topmost level also<br />

called “test relevant aspects”. Both classifications are divided<br />

into equivalence classes (which do not have a frame). For<br />

“Steering Angle” there are three equivalence classes: “left”,<br />

“central”, and “right”. From the classification tree, we cannot<br />

conclude which values are inside a certain class, e.g. “left”, and<br />

how the values are represented. This is implementation-dependent,<br />
and this is not relevant for the CTM, being a black-box<br />
test specification method. (The test case specification is<br />

abstract.) If one does not take “central” as an extreme steering<br />

angle position, no boundary / invalid / extreme values are<br />

forced for “Steering angle”. This is different for “Speed”. The<br />

classification “Speed” is divided into the two equivalence<br />

classes “valid” and “invalid”. The latter class guarantees that<br />



invalid values for speed will be used during testing, because in<br />

a valid specification according to the CTM, all classes<br />
present in the tree need to be used in at least one test case<br />
specification. The class “invalid” is divided again using<br />

the classification “Too low or too high?”. This results in<br />

additional classes “negative” and “> v_max”. Test cases using<br />

values from these two classes will find out what happens if the<br />

unexpected hits the (software) test object. The valid speeds are<br />

divided into “normal” speeds and “extreme” speeds. We can<br />

assume that the class “zero” for a valid speed contains only one<br />
value (probably the value 0), as does the class “v_max”, which<br />
probably contains only the maximum speed as specified in the<br />
requirements.<br />

The combination table (the lower part of the figure above)<br />

consists of seven lines and, hence, specifies seven test cases.<br />

The test case specifications can have names. The markers set<br />
on each line indicate which classes provide a value for the test.<br />
The values are eventually combined to form a test case<br />
specification, which depicts the purpose of a test case. In our<br />
case, this is also indicated by the name of the test case<br />
specification, but this does not always have to be the case.<br />

From the test case specification, it is clearly visible that<br />

there are only three “normal” test cases (the first three test<br />

cases). For instance, a test case specification that requires<br />

testing e.g. low speed with the steering angle right does not<br />

exists. This is obvious. If you feel that three normal test case<br />

specifications are not enough, you might opt to add an<br />

additional one. However, the question is not if three is enough;<br />

the point is that it is obvious for everyone that there are only<br />

three. This is an important advantage of the CTM.<br />

The unit testing tool TESSY includes an editor for<br />

classification trees, i.e. unit tests for TESSY can be specified<br />
using the CTM.<br />

VI. RECOMMENDATIONS FROM ISO 26262<br />

ISO 26262:2011 lists in part 6, section 9, table 11 the<br />

methods for deriving test cases for software unit testing [7].<br />

Fig. 8. Methods for deriving test cases from ISO 26262<br />

Hint for the interpretation of the table in the figure above:<br />

The recommendation depends on the Automotive Safety<br />

Integrity Level (ASIL). ASIL ranges from A to D, where D is<br />

the highest level (i.e. the level requiring the most effort to<br />

reduce risk). Methods that are “highly recommended” are<br />

marked by a double plus sign (“++”); methods that are<br />

“recommended” are marked by a single plus sign (“+”).<br />

Methods numbered 1a, 1b, 1c, … are alternative entries;<br />

methods numbered 1, 2, 3, … are consecutive entries. For<br />

alternative entries, an appropriate combination of methods shall<br />

be applied in accordance with the ASIL; for consecutive<br />

entries, all methods shall be applied in accordance with the<br />

ASIL.<br />

Method 1a of table 11 requires that the test cases for<br />

software unit testing are derived from the requirements. This is<br />

highly recommended for all ASILs. Starting from the<br />

requirements is the naive approach.<br />

Method 1b of table 11 requires that generation and analysis<br />

of equivalence classes is used to derive test cases for software<br />

unit testing. This is recommended for ASIL A and highly<br />

recommended for ASIL B to D.<br />

Method 1c of table 11 requires analysis of boundary values<br />

to derive test cases for software unit testing. This is<br />

recommended for ASIL A and highly recommended for ASIL<br />

B to D.<br />

Method 1a, 1b, and 1c were already discussed in the<br />

preceding sections of this paper.<br />

Method 1d of table 11 requires error guessing to derive test<br />

cases for software unit testing. This is recommended for all<br />

ASILs. Error guessing is discussed in the following section.<br />

A. Error Guessing<br />

Error guessing usually requires an experienced tester who is<br />

able to find error-sensitive test cases from experience. Hence, it<br />

is usually an unsystematic method (as opposed to the first three<br />

methods). [I admit, you could use checklists or failure reports<br />

of previous systems or something similar as basis for<br />

guessing.] Error guessing relates to thinking about possible<br />

invalid / unexpected / extreme test cases, because this is<br />

actually error guessing. If a system under test has two buttons,<br />

and it is supposed that only one of these buttons is pressed at a<br />

time: What happens if the two buttons are pushed<br />

simultaneously? Can a button be pushed too fast / too often /<br />

too long? These are examples of error guessing.<br />

VII. ALTERNATIVES<br />

This section discusses alternative methods for test case<br />

derivation, which were not discussed in the previous sections.<br />

A. Test Cases from the Source Code<br />

It is tempting to use a tool to automatically generate test<br />
cases from the source code, e.g. with the objective that the test<br />

cases reach 100% code coverage. Different technical<br />

approaches exist, e.g. genetic algorithms or backtracking. Both<br />

open source and commercial tools implement these approaches.<br />

So why not leverage these tools? Generating test cases from the<br />

source code has some aspects that you should be aware of:<br />

1. Omissions: You will not detect omissions in the code. I.e. if<br />

a requirement is “if the first parameter is equal to the<br />

second parameter, then an error shall be returned” and the<br />

implementation of this check is missing: This problem will<br />

not be detected by test cases derived from the source code.<br />

This is evident. You need test cases that check if each<br />

requirement is implemented correctly. Such a test case will<br />

detect the missing implementation.<br />

2. Correctness: You cannot decide from the code whether it is<br />
correct or not. E.g. you cannot decide whether a decision<br />
should read (i < j) or (i <= j) just by looking at the code.<br />
You need to check the behavior of the code against the<br />
requirements.<br />

Because of these two aspects, it is not sufficient to use only<br />

test cases generated automatically from the source code; you<br />

need test cases that test the requirements (at least you need a<br />

test oracle). But isn’t it still a good idea to let a tool do a lot<br />
of the work and check afterwards whether the generated test<br />
cases also test the requirements and, if not, change/extend the<br />
tests accordingly?<br />

Recently I came across a study [5] that tries to answer<br />

exactly that question. The main statements from this study are:<br />

1. Automatically generated test suites achieve higher code<br />

coverage than manually created test suites.<br />

2. Using automatically generated test suites does not lead to<br />

detection of more defects.<br />

3. Automatically generated test cases have a negative effect<br />

on the ability to capture intended class behavior.<br />

4. Automated test generation does not guarantee higher<br />

mutation scores.<br />

The study used the tool EvoSuite that automatically<br />

generates JUnit tests for Java classes. It was an empirical study,<br />

where students tried to detect defects in Java code, some of<br />

them starting from test cases generated by EvoSuite, some of<br />

them creating the test cases by themselves.<br />

The conclusion I draw from this study is that automated test<br />

case generation does not bring an advantage to testing (more<br />
defects found, less effort spent, etc.). On the other hand, it is<br />
also not a disadvantage.<br />

Obviously, conditions of the study can be discussed<br />

(programming language used, programming skills of the<br />

students, etc.) but the tenor is surprising in my opinion.<br />

B. Random Test Data / Fuzzing<br />

Like the generation of test cases from the source code, it is<br />

also tempting to use automatically generated test input data.<br />

Many test cases can be generated and run quite effortlessly<br />
once automated test execution is in place. However, a<br />
(functional) test case needs an expected result, and it can be<br />
quite an effort to verify that every one of those many test cases<br />
delivers the expected result, unless you have some kind of test oracle at<br />

hand. Running randomly generated test cases without checking<br />

the expected result is robustness testing. Only obvious<br />

misbehavior (e.g. denial of service, crash, etc.) will be detected.<br />

On the other hand, this can lead to the surprising detection of<br />

safety and security vulnerabilities. The process of stressing a<br />

test object with syntactically correct, but more or less randomly<br />

generated test input is called “fuzzing”.<br />

C. Artificial Intelligence<br />

Nowadays, artificial intelligence (AI) can accomplish many<br />

astonishing things, but I am currently not aware that AI is used<br />

successfully in the automated generation of test cases for<br />

embedded systems.<br />

VIII. MUTATION TESTING<br />

As we have seen in the previous sections, 100% code<br />
coverage does not guarantee the quality of the test cases. But<br />
how can we rate the quality of our test cases? One possibility is<br />
mutation testing (called “error seeding” in IEC 61508 [8]).<br />
Having a set of passing test cases, you can mutate your code.<br />
Mutation means changing the code semantically while keeping it<br />
syntactically correct. E.g. you can change a decision from (i < j)<br />
to (i <= j).<br />
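As a minimal C illustration of such a seeded error (reusing the range-check example from section II; names are assumptions), only a boundary-value test distinguishes the original from the mutant:<br />

```c
/* Original decision ... */
int in_range(int start, int length, int value)
{
    return value >= start && value < start + length;
}

/* ... and a mutant that flips `<` to `<=` (the seeded error). */
int in_range_mutant(int start, int length, int value)
{
    return value >= start && value <= start + length;
}

/* The boundary-value test from section II kills this mutant; a test
   suite without boundary values may let it survive despite 100%
   code coverage. */
int mutant_killed(void)
{
    return in_range(5, 2, 7) != in_range_mutant(5, 2, 7);
}
```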


RISC-V; the Software and Hardware Aspects of an<br />

Open Source ISA<br />

Rob Oshana<br />

Vice President, Software Engineering<br />

Microcontrollers, NXP Semiconductor<br />

Austin, TX, USA<br />

robert.oshana@nxp.com<br />

Abstract—In this paper we discuss innovation and<br />

education using RISC-V. RISC-V is an open, free ISA<br />

enabling a new era of processor innovation through open<br />

standard collaboration.<br />

Keywords—RISC-V, ISA, Chisel<br />

I. INTRODUCTION<br />

RISC-V is a high-quality, license-free, royalty-free RISC<br />

ISA specification originally from UC Berkeley. The standard<br />
is maintained by the non-profit RISC-V Foundation. This<br />

technology is suitable for all types of computing systems, from<br />

microcontrollers to supercomputers. There are numerous<br />

proprietary and open-source cores in the industry today. The<br />

technology is experiencing rapid uptake in industry and<br />

academia, and is supported by a growing shared software<br />

ecosystem. RISC-V technology can also be used for<br />

experiments in innovation and education and these two areas<br />

will be explored in this paper.<br />

RISC-V has these key characteristics:<br />
• Simple: far smaller than other commercial ISAs<br />
• Clean-slate design: clear separation between user and<br />
privileged ISA, avoiding microarchitecture- or<br />
technology-dependent features<br />
• Modular: an ISA designed for<br />
extensibility/specialization, implying a small<br />
standard base ISA with multiple standard extensions<br />
and sparse, variable-length instruction encoding<br />
for vast opcode space<br />
• Stable: the base and standard extensions are<br />
frozen, and additions come via optional extensions, not<br />
new versions<br />
• Community-designed: developed with leading<br />
industry/academic experts and software developers<br />

II. SOFTWARE DRIVES HARDWARE<br />

Our interest in RISC-V is driven primarily by a model of<br />
software driving hardware architecture. One of the tools we<br />

experimented with is Chisel. Chisel is an open-source<br />

hardware construction language from UC Berkeley that<br />

supports hardware design using parameterized generators and<br />

layered domain-specific hardware languages.<br />

We chose to use Chisel for architectural investigation<br />
primarily because:<br />
• Chisel can be used by software teams for ISA<br />
architectural exploration<br />
• It supports use-case investigation, e.g. extending RISC-V<br />
with custom instructions<br />

As an example we chose an IP checksum for an IPv4 packet:<br />
ipcsum rd, off(rs)<br />
The Chisel workflow we used went from design to<br />
simulation and testing. Chisel allows relatively easy<br />
integration with existing Verilog RTL designs. Chisel is easy<br />
to extend and benchmark; it is possible to benchmark<br />
specialized RISC-V instructions against C programs.<br />

The advantages of Chisel architecture evaluation include:<br />

• High-level programming-language features<br />

• Opens the hardware to software engineers and architects<br />

• Lower SLOC compared with equivalent Verilog (~1/3)<br />

• Free and fast simulation tools such as Verilator<br />

• Program execution can be analyzed at a lower level than usual debuggers (e.g. gdb) allow<br />

• Easy integration into existing Verilog RTL projects (minimal Verilog glue logic may be required)<br />

• From GPP to ASIP through custom instructions<br />



We chose to evaluate an IP checksum for an IPv4 packet – ipcsum rd, off(rs). This benchmark has a 20-byte IPv4 header located at offset off from the address in register rs; the checksum is stored in register rd.<br />

We developed a Chisel class implementing the instruction (~60 lines), as shown in Figure 1.<br />

• Hardware cost: functional unit, decoding logic, unit<br />

control logic<br />
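The paper does not reproduce the benchmarked C source, but the computation ipcsum performs is the standard Internet (one's-complement) checksum of RFC 1071. The sketch below is an illustrative host-testable stand-in, not the paper's actual code:<br />

```c
#include <stdint.h>
#include <stddef.h>

/* One's-complement checksum over an IPv4 header (RFC 1071 style).
 * Illustrative stand-in for the C routine benchmarked against ipcsum. */
static uint16_t ip_checksum(const uint8_t *hdr, size_t len)
{
    uint32_t sum = 0;

    /* Sum the header as big-endian 16-bit words. */
    for (size_t i = 0; i + 1 < len; i += 2)
        sum += (uint32_t)((hdr[i] << 8) | hdr[i + 1]);

    /* Fold carries back into the low 16 bits. */
    while (sum >> 16)
        sum = (sum & 0xFFFF) + (sum >> 16);

    return (uint16_t)~sum;
}
```

Running the routine over a header whose checksum field is zeroed yields the value to store in that field; running it over the completed header yields 0, the usual receive-side validity check.<br />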

III. EDUCATION<br />

We are also interested in broader educational endeavors<br />

with RISC-V-based architectures. We are building a package<br />

of hardware and software to allow for educational<br />

experimentation. Our system includes multiple RISC-V cores<br />

and associated peripherals and interconnect, plus software<br />

enablement (Figure 3).<br />

Figure 1. Chisel benchmark<br />

The Chisel design flow we used is shown in Figure 2. The<br />

key steps used in this flow are:<br />

• Step 1 - Chisel compiler<br />

• Step 2 - Verilator translator<br />

• Step 3 - Integration into Verilog code base (manual or<br />

by extending Chisel BlackBox class functionality)<br />

• Step 4 – Implement test scenarios and build the<br />

emulator<br />

Figure 3. Microcontroller device based on RISC-V<br />

Conclusion<br />

The “openness” of RISC-V allows different ways of working<br />

in a hardware/software environment. We are moving towards a<br />

model of “software drives hardware” in the creation of IoT<br />

systems. RISC-V can enable innovation and education, which<br />

is our primary interest in this technology.<br />

Figure 2. Chisel design flow<br />

We chose to benchmark a set of specialized RISC-V instructions against the sample C program. The target was a RISC-V processor – rv32_3stage (z-scale) from the Sodor designs (https://github.com/ucb-bar/riscv-sodor).<br />

The C IP-checksum function was compiled with GCC (riscv32, -O2), which produced a 35-cycle execution time. The ipcsum instruction takes 7 cycles (1F/5EX/1WB), a 5X speedup over the C implementation.<br />

ACKNOWLEDGMENTS<br />

I would like to thank Alex Badicioio for his contributions to<br />

this paper.<br />

REFERENCES<br />

[1] Jonathan Bachrach, Huy Vo, Brian Richards, Yunsup Lee, Andrew Waterman, Rimas Avižienis, John Wawrzynek, Krste Asanović (EECS Department, UC Berkeley), "Chisel: Constructing Hardware in a Scala Embedded Language," Design Automation Conference, 2012<br />

[2] Jonathan Bachrach, Krste Asanović, John Wawrzynek (EECS Department, UC Berkeley), "Chisel Tutorial," Design Automation Conference, October 26, 2012<br />

This hardware/software tradeoff can be summarized as:<br />

• Software gain: execution speed, code size reduction<br />



When is a Custom SoC<br />

the Right Choice for IoT Products?<br />

Economic benefits and challenges when building Custom SoCs<br />

Michele Riga<br />

Embedded and Automotive<br />

Arm<br />

110 Fulbourn Road, Cambridge, UK<br />

michele.riga@arm.com<br />

The Internet of Things (IoT) will result in an explosion in the<br />

number of connected devices. Many of these devices will be built<br />

using standard off-the-shelf silicon, while others will significantly benefit<br />

from using a System-on-Chip (SoC), customized for the specific<br />

application. Benefits of a custom SoC/ASIC include reduced cost,<br />

improved performance and added functionality, all within smaller<br />

form factors. Designing a custom SoC no longer requires large<br />

investments, thanks to the availability of proven IP and many<br />

competitive design services, as well as access to mature process<br />

nodes - enabling SoC design at reduced cost. This paper explores<br />

the economic benefits and challenges of building custom SoCs,<br />

supported by a real case study of an industrial control application,<br />

with real world data on cost, size and power consumption, and the<br />

impact of feature enhancements.<br />

Keywords - ASIC, custom SoC, custom ASIC, IoT, PCB, FPGA,<br />

MCU, microcontroller, Cortex-M, sensor IC, mixed-signal,<br />

microchip.<br />

I. INTRODUCTION<br />

The past decades have been marked by an exponential growth in the number of digital devices shipped every year and present in everyday life – a glance at the exponential growth in the number of microchips shipped by Arm’s partners in the last 20 years (Fig. 1) conveys the scale of this impressive growth.<br />


Fig. 1. Exponential growth over time of the number of chips shipped by<br />

Arm’s partners.<br />

Many new devices have reached consumers, such as first mobile phones and then smartphones, wearables, and so on, while various existing products have become more and more digital, such as automobiles, white goods, and many others.<br />

This is just the beginning of the new IoT revolution. In the<br />

next few years, all areas of life and work will be shaped by the<br />

explosion of new technologies: a “typical family home could<br />

contain more than 500 smart devices by 2022” [1], cars will<br />

become autonomous and connected, powered by hundreds of<br />

digital devices, and also many industrial and agricultural<br />

applications will become smart and part of the IoT. It is<br />

predicted that this explosion of new technologies and new<br />

applications will lead to a trillion connected devices in the next<br />

decades.<br />

Many of these devices will share similar functionalities and<br />

requirements, and will be built using off-the-shelf parts chosen<br />

from the large selection provided by large manufacturers. A<br />

part, however, will be characterized by very specific needs,<br />

while also having very strict requirements in performance,<br />

power efficiency and cost – these may be difficult to address<br />

with off-the-shelf parts.<br />

For this second group of applications, the realization of<br />

custom SoCs will enable the creation of compelling and<br />

differentiated solutions, providing large improvements in all<br />

performance, power, area, and cost, as further detailed in the<br />

following sections.<br />

II. SILICON TECHNOLOGY<br />

Everyone who has been active in the silicon industry in the last few decades has heard at least once about Moore’s Law, which states that transistor count doubles approximately every two years. Similarly, the semiconductor manufacturing process used to build chips has evolved, enabling fabrication of designs with transistors of smaller dimensions. Traditionally, a process node takes its name from the size of its transistors, with process nodes that have evolved from several µm to a few nm.<br />

www.embedded-world.eu<br />



The exponential growth in the number of transistors, the<br />

basic building block of silicon hardware, has enabled the<br />

digital revolution, with devices capable of greater and greater<br />

performance over time. It is astonishing to think how the<br />

processing power of a modern smartphone has similar<br />

processing power compared to supercomputers of a few<br />

decades ago – for instance, the Cray-2 supercomputer from<br />

1985, which was the fastest machine in the world for its time,<br />

has a processing power equal to that of an iPhone 4 [2].<br />

The continuous technology push for more advanced process nodes and for an increase in the number of transistors per chip has followed Moore’s Law quite consistently, sustained by ever increasing investments. At the same time, this astonishing progress in chip technology opens up completely new technological paradigms. Previous generations of process nodes can be used for all those applications where added value is provided by incorporating functionalities that do not necessarily scale according to Moore’s Law, such as sensors and actuators. These latter application areas, characterized by functional diversification, are termed More-than-Moore [3].<br />

The availability of mature process nodes provides an<br />

opportunity for the realization of custom SoCs, mixed-signal<br />

devices that combine both functional diversification and digital<br />

components (Fig. 2). This is an opportunity that is becoming<br />

more appealing than ever today, as access to previous<br />

technology becomes cheaper (Fig. 3), while still being<br />

technologically valid, with the possibility to build full masksets<br />

on older process nodes with just a few thousand dollars of<br />

investment.<br />

III. CUSTOM SOC BENEFITS<br />

There are many potential reasons for and benefits of building a custom SoC:<br />

• Make a product more compelling and differentiated, as new features can be added (e.g. connectivity)<br />

• Lower the overall cost of end products, as many discrete<br />

components can be replaced with only one microchip<br />

• Reduce component count, overall complexity, and Printed<br />

Circuit Board (PCB) size – potentially providing higher<br />

efficiency and reliability<br />

• Protect the technology solution for a product, making it<br />

harder, or even impossible, to reverse engineer and copy<br />

• Reduce the complexity of the supply chain, assuming<br />

complete ownership of the microchip design, and thus<br />

ensuring long life supply of the component – through the<br />

use of established foundries and process nodes<br />

• Meet performance and/or cost requirements for a specific<br />

application or product that are impossible to reach with a<br />

PCB or FPGA solution.<br />

IV. CASE STUDY<br />

When deciding to build a custom SoC, the first step is<br />

usually to determine the key targets that the project is<br />

aiming for, and then make the various design choices based on<br />

these targets. The following section analyzes a case study to<br />

provide details on why and how a company decided to build a<br />

custom SoC.<br />

A company active in the oil and gas industry used to build<br />

valve controllers to sense pressure and temperature, while also<br />

performing controlling functions. The solution was based on a<br />

PCB containing a large variety of off-the-shelf parts, digital<br />

and analog.<br />

For its next-generation product, the company decided to replace<br />

the many off-the-shelf parts with one integrated solution. Key<br />

drivers were to reduce costs, improve reliability, and simplify<br />

the inventory and supply management - since some of the<br />

vendors were planning to discontinue the production of<br />

components used in the current solution. In addition to that,<br />

they planned to add connectivity capability, so as to be able to<br />

remotely manage valves deployed in the field and hence reduce<br />

the management costs.<br />

This company had no in-house expertise for the<br />

development of silicon hardware. Therefore, they decided to<br />

entirely outsource the project to S3 Group, one of the many<br />

design houses that provides full design services for the creation<br />

of custom SoCs.<br />

Fig. 2. “The combined need for digital and non-digital functionalities in<br />

an integrated system is translated as a dual trend […]:<br />

miniaturization of the digital functions […] and functional<br />

diversification (More-than-Moore)” [3].<br />

Fig. 3. Process node cost for 80nm, 130nm, and 65nm over time, showing<br />

a steady decline in price (courtesy of IMEC).<br />



The S3 Group built for them a low-power chip based on a<br />

cost-effective process node, 180nm, integrating a Digital-to-<br />

Analog Converter (DAC), an Analog-to-Digital Converter<br />

(ADC), and many interfaces to enable easy connectivity, such<br />

as I²C, UART, SPI – all packaged in a low-power design<br />

consuming 160 µW/MHz. Overall results of this project have<br />

been impressive, with large improvements in cost, power, and<br />

area:<br />

• 80% cost reduction<br />

• 70% power consumption reduction<br />

• 75% smaller PCB size<br />

In addition, the new solution considerably simplified inventory and supply management. The company no longer had to deal with many different vendors, each with a different product roadmap, or stock many different components. With the new custom SoC approach, owned completely by the company itself, they only needed to deal with a single point of contact provided by the S3 Group.<br />

V. ECONOMICS OF CUSTOM SOC<br />

As seen in the previous sections, there are many potential<br />

benefits in building custom SoCs. At the same time, silicon<br />

design is not an easy job, and a custom SoC project should not<br />

be started without full awareness of all the project phases and<br />

costs involved in such a project.<br />

From a high-level perspective, there are five different<br />

stages in custom SoC creation, each with its own required<br />

know-how and related costs: SoC definition, IP selection,<br />

Design & integration, Verification, Implementation (Fig. 4).<br />

Fig. 4. High-level project phases when creating a custom SoC.<br />

A. SoC definition<br />

As with all projects, the first step is defining the SoC and<br />

its key requirements. Typically, there are some feature<br />

requirements, such as security, wireless connectivity, etc. In<br />

addition, there are some minimum performance, power and<br />

area targets to be met.<br />

The performance target directly affects the minimum frequency of the device. Power and efficiency targets can be very important in general, but since custom SoCs usually replace PCB solutions, they typically provide power savings so large that power is not necessarily a focus area. Area, instead, is another important factor, since it directly translates into silicon cost.<br />

B. IP selection<br />

When building a microchip, many required components can<br />

be easily purchased from third-party vendors. On the open market<br />

it is possible to find processor IP, peripherals, radio, and IP to<br />

perform almost any function at a variety of price points.<br />

At the heart of a microchip there is usually a<br />

microprocessor, which can be programmed to perform all the<br />

required tasks. Typically, processors are one of the most<br />

complex IPs in the design and often there is a license fee and<br />

royalties to be paid on the units shipped. Access to IP usually<br />

involves a negotiation to discuss and agree all the details of the<br />

licensing contract, which can take several months to complete.<br />

Arm is the major IP provider, and many off-the-shelf<br />

devices are currently based on Arm Cortex-M processors. To<br />

address the needs of companies that want to build custom SoCs<br />

with reduced initial investment and rapid access to the IP, Arm<br />

has recently enhanced the Arm DesignStart program. One of<br />

the key changes is the DesignStart Pro offering, that enables<br />

companies to quickly access the Cortex-M0 and Cortex-M3<br />

processors through a fixed-term contract, with $0 upfront fee<br />

and a success-based royalty model. Together with the<br />

processor of choice, both DesignStart Pro packages also<br />

provide a wide range of building blocks and peripherals to<br />

build or customize the memory system. In addition to that, the<br />

Cortex-M0 DesignStart Pro provides a simple example system,<br />

while Cortex-M3 DesignStart Pro contains a fully validated<br />

subsystem, named CoreLink SSE-050.<br />

C. Design & integration<br />

Just as each of the off-the-shelf parts in a PCB design needs to be connected, with custom SoCs all the separate components need to be connected in the integrated design. This task is usually performed using a<br />

Hardware Description Language (HDL), such as Verilog.<br />

These are specialized computer languages that enable the<br />

designer to provide a description of the electronic circuit in a<br />

text format; the code is then used as input to Electronic Design<br />

Automation (EDA) tools that convert it into actual transistors,<br />

ready to be built into a chip.<br />

One of the ways to reduce the effort required to perform<br />

this task is to use a good starting point, such as the preassembled<br />

CoreLink SSE-050 IoT subsystem (Fig. 5), included<br />

in the Cortex-M3 DesignStart Pro package. It contains all the<br />

common elements of such a system, and can be used as a<br />

starting point or simply as a reference design:<br />

• Cortex-M3 processor<br />

• Configurable memory system<br />

• Ready-made connectivity to Flash memory with<br />

integrated Flash cache to improve performance and<br />

power efficiency<br />

• Connectivity to the peripherals (not included,<br />

available from many third-party providers).<br />

• Real-time clock<br />

• True Random Number Generator (TRNG) to provide<br />

the foundation for security, that can be upgraded (for a<br />

fee) to one of the Arm CryptoCell IP for enhanced<br />

functionalities<br />

• Dedicated port for radio integration, with pre-verified<br />

integration of the Arm Cordio Radio (available for a<br />

fee)<br />



D. Verification<br />

Deciding on the right set of IP and assembling the system<br />

together in an appropriate way is very important. Similarly,<br />

ensuring that the assembled system meets the functional<br />

requirements and functions properly is essential. Quite<br />

counterintuitively, the amount of effort necessary for<br />

verification usually exceeds that for Design & Integration.<br />

Verification is time consuming. Generally, it is necessary to<br />

run thorough verification tests to ensure that all main use cases<br />

are covered. The reason is that the earlier in the cycle a bug or issue is found, the easier and less expensive it is to solve. When an issue is found during the development stage, it is usually quite easy to fix, and it is generally just a matter of normal engineering effort. However, if a bug is found after tape-out, it might become very difficult to solve. At that point, all that can be done is in software, either by changing the sequence of operations performed so as to ensure that the bug is not exposed<br />

or, potentially, by completely disabling a feature. As a result, a<br />

bug in silicon can result in decreasing the value of the custom<br />

solution or potentially can make the whole development<br />

worthless, if a required feature is completely compromised and<br />

cannot be used.<br />

There are various ways to limit the effort required on<br />

verification when developing hardware:<br />

1. Divide and conquer: verification is more effective<br />

when done first in small blocks. Focus on a small block<br />

of the design, or potentially an IP. For instance, when a<br />

Digital Signal Processor (DSP) block is implemented in-house, it is highly advisable to perform all the<br />

verification on that block separately, before integrating<br />

it with the rest of the system, so as to ensure that it<br />

functions properly in all the possible corner cases that<br />

might become particularly difficult to stimulate when<br />

running the whole system.<br />

2. Reuse: there is no need to reinvent the wheel. Literally<br />

hundreds of IPs can be found already verified,<br />

removing any need to perform block-level verification.<br />

The more complex the IP, the more time-consuming<br />

the verification. Processors are usually very complex,<br />

with the possibility to handle a large variety of<br />

different operations. Rather than building the processor<br />

solution in house, or using unproven solutions, Arm<br />

Cortex-M0 and Cortex-M3 processors can provide the<br />

computing solutions for many different applications –<br />

these processors are fully verified and proven in over<br />

20 billion devices to date.<br />
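The divide-and-conquer point can be made concrete with a host-side block-level test bench: exercise a block's corner cases in isolation before integrating it. The saturating adder below is a hypothetical stand-in for an in-house DSP building block, not an example from the text:<br />

```c
#include <stdint.h>

/* Hypothetical block under test: a 16-bit saturating adder,
 * standing in for an in-house DSP building block. */
static uint16_t sat_add_u16(uint16_t a, uint16_t b)
{
    uint32_t s = (uint32_t)a + b;             /* widen to catch overflow */
    return s > 0xFFFF ? 0xFFFF : (uint16_t)s; /* clamp instead of wrapping */
}
```

Block-level tests can then hammer exactly the corner cases (overflow boundaries, identities) that become hard to stimulate once the block sits inside a full-system simulation.<br />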

Cutting corners in verification means taking large risks.<br />

Various examples from the past can help to understand the<br />

huge potential cost of finding a bug in the field, after the silicon<br />

hardware has been manufactured. For instance, in the 90s a<br />

large microchip provider had to recall a whole series of chips<br />

from the market and provide upon-request replacements [4]<br />

because one of the key features did not work in an extremely<br />

unlikely, but possible, scenario [5].<br />

E. Implementation<br />

The last step in the journey to building a custom SoC is to<br />

realize the physical implementation and start manufacturing. If<br />

the implementation is done in-house, this phase involves<br />

carrying out the physical design using an EDA tool and a<br />

physical library developed for a specific foundry and process<br />

node. The majority of custom SoCs are built with both digital<br />

and analog parts, and this requires a slightly more elaborate<br />

process that is nowadays supported by the major EDA vendors.<br />

Through Arm DesignStart, companies can access, with $0<br />

upfront fee, hundreds of physical libraries from 18 partner<br />

foundries, with availability of Logic IP, Standard Cell,<br />

Embedded Memory Compilers, and Interface IP. Arm is the<br />

industry’s leading supplier of foundation physical IP and<br />

processor implementation solutions providing access from the<br />

most mature and least expensive process nodes to the most<br />

Fig. 5. CoreLink SSE-050 system block diagram, illustrating the IP included in DesignStart, other IP available from Arm, and any additional third-party IP<br />

that can be easily integrated into the system.<br />



leading-edge ones (Fig. 6). This enables the implementation of<br />

SoC solutions which address performance, power and cost<br />

requirements for literally all application markets.<br />

Many leading applications require the use of the latest and<br />

most advanced process nodes to be competitive in the market,<br />

and to offer tangible benefits to the users. However, for<br />

applications like custom SoCs, the frequency and power targets<br />

required can usually be met with more mature process nodes,<br />

that also enable mixed-signal designs. Section II Silicon<br />

Technology already addressed how pricing for mature process<br />

nodes is falling considerably over time, providing an offering<br />

that is technologically compelling and cost effective.<br />

When the implementation is ready, the last step is to<br />

connect with the selected design house or foundry for<br />

production. Tape-out costs can vary a lot, mostly depending on<br />

the process node and the type of chosen maskset, which is a<br />

series of physical masks used during the photolithography steps<br />

of semiconductor fabrication. There are three possible masksets<br />

that can be created:<br />

• Multi-project wafer (MPW): this maskset consists of<br />

many projects, potentially from different customers,<br />

distributing the cost of the maskset among all the<br />

projects involved. It can be used for early prototypes, or<br />

also for full production when a very low volume is<br />

required.<br />

• Multi-layer mask (MLM): this maskset contains various<br />

masks that are combined into one, reducing the overall<br />

number of masks required and thus the cost. This solution<br />

reduces the non-recurrent costs, but it results in higher<br />

variable production cost (wafer cost), since more foundry<br />

production time is required for production.<br />

• Full maskset: the most optimized masks for at-scale<br />

production in the chosen process node. Ideal for large<br />

volume projects.<br />

Which maskset to choose mostly depends on the<br />

production volume and the die size. For larger volume<br />

implementations, after the initial test tape-out with MPWs or<br />

MLMs, it is cost effective to move to a full production<br />

maskset. For low-volume implementations, or mid-volume<br />

ones with small die size, MPWs and MLMs are usually very<br />

effective solutions. Many foundries now provide accessible<br />

MPWs or MLMs services, for process nodes such as 90nm,<br />

80nm, 65nm and 40nm.<br />

F. Outsourcing<br />

The implementation phase can easily be outsourced to one of the<br />

many design houses, as in the case study above. Many of<br />

these design houses have services to help in any phase of the<br />

custom SoC realization, providing the possibility of<br />

outsourcing part of the design project, from definition, to<br />

design, integration, and verification. Some of them can provide<br />

a complete service, delivering an already manufactured and<br />

packaged microchip starting only from a requirement<br />

specification document that they can also help define.<br />

To help OEMs and companies that have never built silicon<br />

before, Arm has established the Approved Design Partner<br />

program. This program connects companies to various selected<br />

design houses, audited and chosen for the quality of the<br />

services that they can provide, and with a proven track record<br />

of success using Arm IP.<br />

G. Overall cost considerations<br />

As seen in the journey to build a custom SoC, there are<br />

various key cost areas:<br />

• Engineering costs for design, integration, and verification<br />

– depending on the internal expertise, the engineering<br />

tasks can be easily outsourced<br />

• Cost of IP – there are many IP providers to fit any<br />

requirement. Some, like Arm DesignStart, offer a success-based<br />

royalty model to access proven IP, completely removing<br />

any upfront license fee and thus transforming<br />

this into a pure variable cost<br />

• EDA tools access – there are many EDA tool providers that nowadays offer competitive pricing for small or medium-sized projects<br />

Fig. 6. Physical IP available through DesignStart with no upfront fee: addressing all markets with IP from the most mature process nodes to the leading edge.<br />

• Manufacturing and packaging – many foundries now<br />

provide MPW and MLM services to enable projects with<br />

small volumes. Many design houses and silicon aggregators<br />

can help to connect with the foundries, providing full<br />

support.<br />

All this has made the realization of custom SoCs cost-effective and affordable for small companies or projects with low production volumes (Fig. 7). An analysis performed by IMEC shows that, on 180nm, production of only a few thousand units is required to make a custom SoC project cost-effective,<br />

providing a positive Return-On-Investment (ROI), together<br />

with all the other benefits of custom SoCs seen in III Custom<br />

SoC Benefits.<br />

Fig. 7. Minimum number of units required for an investment in a custom SoC. On 180nm, at current technology cost, a few thousand units per year are the minimum requirement (courtesy of IMEC).<br />

VI. BEYOND SILICON<br />

Assembling the microchip and manufacturing it is not the<br />

end of the project. Hardware has little value when not paired<br />

with software to run and use all the functions implemented.<br />

Software is usually built using development tools and a<br />

compiler, which translates high-level code into machine code,<br />

ready to be executed by the microprocessor embedded in the<br />

SoC. Proven IP from Arm, such as the Cortex-M0 or Cortex-M3<br />

processors, have full support from all the major compilers and<br />

development tools. In addition, when applying for DesignStart<br />

Pro, companies receive a free 90-day time-limited license to<br />

the Keil MDK Professional tool and/or IAR Embedded<br />

Workbench, both of which provide a compiler and debugger in<br />

a GUI.<br />

Depending on the selected processor, there might be some<br />

restrictions on the programming language that can be used. For<br />

instance, many 8-bit and 16-bit microcontrollers need to be<br />

programmed using low-level assembly code. Other processors<br />

are built to enable ease of programming and use. For all the<br />

Cortex-M processor family, a lot of effort has been put into<br />

removing the need for the user to write any assembly code. In<br />

fact, all code, including exception handlers and fault handlers,<br />

can be written in high level languages such as C or C++. In<br />

addition, for any specific situation where an assembly code<br />

instruction would be useful and cannot be inferred from high<br />

level language code, there is the free-to-use Cortex Microcontroller Software Interface Standard (CMSIS), which provides ways to access the processor’s internal registers and standard calls for low-level assembly instructions.<br />

Depending on the requirements and complexity of the<br />

custom solution, it might be possible to develop software that<br />

runs directly on the hardware, also called bare-metal<br />

environment. When a more scalable and modular solution is<br />

required, a Real-Time Operating System (RTOS) can run on<br />

the microprocessor, on top of which the various software tasks<br />

can be developed. Using a standard architecture, such as Arm<br />

Cortex-M processors, enables companies to choose literally<br />

any RTOS provider, saving considerably in development and<br />

porting effort.<br />

Fig. 8. Arm processors provide access to the largest technology ecosystem, with tools, compilers, OS support, accessible software, a thriving developer base, and a<br />

large wealth of resources.<br />



Choosing a proven architecture and vendor for key IP such<br />

as the microprocessor enables access to an established base of<br />

developers who have already built hardware or software using<br />

that architecture or processor family, and are therefore familiar<br />

with all kinds of issues that can be encountered when building<br />

a custom solution. For instance, in the last 20 years, thousands<br />

of companies have joined the Arm partnership and built chips<br />

based on Arm IP. This ecosystem has grown into the largest<br />

technology ecosystem in the world, providing a wide choice of<br />

development tools, compilers and RTOS, and also access to an<br />

extremely large developer base, as evidenced by more than 5<br />

million downloads of CMSIS in 2016. All of this is backed by<br />

the largest open-access development resource library – with<br />

thousands of articles, how-to guides and other resources easily<br />

accessible online.<br />

In summary, the choice of microprocessor IP has direct implications for the development time required to build the software necessary to make use of the hardware solution. Using proven and established vendors can considerably reduce development risk, not only on the hardware side but also in software development.<br />

VII. CONCLUSIONS<br />

Custom SoCs can provide huge benefits to IoT and<br />

embedded applications. Thanks to the huge progress in the<br />

production processes and technology nodes, as well as the<br />

availability of mask sets at reduced cost, companies can now<br />

access suitable process nodes for advanced mixed-signal<br />

designs at competitive prices.<br />

In addition, there is a large variety of IP available, from<br />

proven complex microprocessor IP, such as the Cortex-M3, to<br />

peripherals and accelerators. Pricing for this IP can vary considerably depending on the IP chosen and the vendor; some of these<br />

companies, like Arm through DesignStart, offer access to their<br />

IP with no upfront cost and a success-based royalty model,<br />

transforming the use of the IP into a pure variable cost that<br />

depends on the volume shipped.<br />

Finally, many design houses offer design services,<br />

providing the possibility to outsource part or all of the<br />

development cycle of a custom SoC, potentially providing<br />

tested, packaged chips to make the process even simpler.<br />

The barrier to developing microchips is lowering, with<br />

reduced investment size and lower development risk. This will<br />

result in an explosion in the number of custom SoC solutions<br />

that will power future IoT and embedded applications.<br />

REFERENCES<br />

[1] Contained in Gartner Special Report "Digital Business Technologies".<br />

[2] Published on Experts-exchange.<br />

[3] W. Arden, M. Brillouet, P. Cogez, M. Graef, B. Huizing, R. Mahnkopf,<br />

“More-than-Moore” white paper, International Technology Roadmap<br />

for Semiconductors, 2011.<br />

[4] “Intel adopts upon-request replacement policy on Pentium processors<br />

with floating point flaw; Will take Q4 charge against earnings".<br />

Business Wire. 1994-12-20<br />

[5] “Statistical Analysis of Floating Point Flaw: Intel White Paper”, 9 July<br />

2004. p. 9. Solution ID CS-013007. Retrieved 5 April 2016.<br />



The Business Case for Affordable Custom Silicon<br />

Edel Griffith, Darren Hobbs, Dermot Barry<br />

S3 Semiconductors<br />

Dublin, Ireland<br />

info@s3semi.com<br />

Abstract—This paper will show how custom silicon – every customer’s ideal: a highly optimized, efficient system designed exactly to their application requirements – is no longer the high-cost solution it once was, and is now a very feasible option even at lower volumes.<br />

Keywords—custom integrated chips, custom ASICs, System<br />

Integrators, OEMs<br />

I. INTRODUCTION<br />

Historically, custom chips or ASICs were considered cost-prohibitive and only possible for companies that were shipping<br />

millions of units a year. These days however, custom integrated<br />

chips are possible for many device makers and original<br />

equipment manufacturers (OEMs) who in the past may have<br />

found such designs outside their budgets. With high volume<br />

consumer products pushing the cutting edge of advanced process<br />

nodes, foundry capacity at mature process nodes is being freed<br />

up. These mature process nodes are almost fully depreciated, making it more affordable than ever to fabricate custom chips on them.<br />

Year on year the number of discrete components in a system<br />

increases. As the number of components increase, the associated<br />

cost and size of the printed circuit board (PCB) also increases.<br />

Being able to integrate these components onto a single custom<br />

integrated chip has the potential to offer considerable bill of<br />

materials (BOM) cost and footprint savings. In this paper, S3<br />

Semiconductors will examine two recent case studies whereby<br />

BOM cost and footprint savings achieved were of the order of<br />

80-90% reduction on the previous generation products where<br />

discrete components on a PCB were utilized.<br />

S3 Semiconductors will also show how inputting a number<br />

of variables into their online BOM calculator allows you to<br />

quickly attain an estimate of the breakeven volume and total<br />

BOM savings achievable using a custom ASIC versus a discrete<br />

solution. Taking inputs such as how many devices are expected<br />

to be manufactured every year, the expected product lifetime,<br />

coupled with some design questions like how many data<br />

converters are on board, what level of integrated processing is<br />

needed, and whether you have RF requirements and what type<br />

of connectivity is required. The results will show how there is a<br />

business case for affordable custom silicon now available to<br />

even low- and medium-volume applications.<br />

We will then discuss two recent case studies, from two very different market segments, and show how moving from a discrete solution to a custom integrated chip delivered reductions of over 80% in both BOM cost and size for two companies.<br />

II. INDUSTRY CHANGES<br />

Technology and manufacturing techniques have moved on a<br />

lot in recent years. Changes in market and technology are<br />

driving the demand for custom systems-on-chip (SoCs) but are also<br />

helping to reduce the cost of such solutions versus similar<br />

solutions in the past.<br />

A. Market Changes<br />

Industry 4.0 (or the fourth industrial revolution) is a<br />

collective term embracing many contemporary automation, data<br />

exchange and manufacturing technologies. It facilitates the<br />

vision and execution of a modular ‘smart factory’ in which<br />

cyber-physical systems monitor physical processes, create a<br />

virtual copy of the physical world and make decentralized<br />

decisions. Over the Industrial Internet of Things (IIoT), cyber-physical systems communicate and cooperate with each other<br />

and with humans in real time, and via the Internet of Services.<br />

Sensors in the global IIoT market generated revenue of $3.77B. This is estimated to increase to $11.23B in 2021, a CAGR of 16.8% [1]. Within this market, the largest share, 38%, is attributed to industrial control applications.<br />
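The quoted growth figures can be sanity-checked with a few lines of arithmetic. Since the base year of the $3.77B figure is not stated in the text, the sketch below assumes a roughly seven-year horizon to 2021, the assumption under which the numbers reproduce the quoted rate:<br />

```python
# Sanity check of the IIoT sensor market growth figures quoted above.
# Assumption (not stated in the source): the $3.77B base-year figure
# is about seven years before the 2021 forecast.
start_revenue_b = 3.77    # base-year revenue, $B
end_revenue_b = 11.23     # 2021 forecast, $B
years = 7                 # assumed horizon

cagr = (end_revenue_b / start_revenue_b) ** (1.0 / years) - 1.0
print(f"implied CAGR: {cagr:.2%}")  # close to the quoted 16.8%
```
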

As the level of automation in factories grows and processing<br />

becomes more automated, the number of inputs and outputs in<br />

industrial electronic systems has also increased. There are now<br />

inputs coming from many different sources – multitudes of<br />

sensors, as well as the traditional switches, keyboards, touch<br />

pads, encoders and scanners to name a few. The sensors are<br />

required to monitor input conditions like pressure, temperature,<br />

humidity, air quality, acceleration or a host of other<br />

environmental conditions. On the other side, there are also many<br />

outputs like drivers for display, heaters, motors, actuators or<br />

switches.<br />

Factory automation requires real-time data. Process<br />

monitoring reduces waste and improves efficiency. Being able<br />

to access this information remotely and in real time allows for the optimization of resources, energy and manpower. In cases of<br />

variation in performance, operators can be warned in advance to<br />



mitigate production downtime. Predictive maintenance<br />

and improved efficiencies should impact positively on the<br />

bottom-line.<br />

In the past, solutions like these sensor-based systems were<br />

designed using printed circuit boards (PCBs) containing discrete<br />

off-the-shelf components. These catalogue products are<br />

designed to service many different applications and therefore<br />

can be over specified for what the application actually requires.<br />

This can add to the cost of the products but also means that<br />

OEMs requiring low to medium volumes can become lower-tier customers to the component suppliers, and the risk of obsolescence is greater if the high-volume customers no longer require the product. Obsolescence can also cause major issues<br />

if single or multiple components on a board are no longer<br />

manufactured. In the best case, a new component can be used to replace the obsolete part; in the worst case, a whole re-spin of the board may be required, with potential changes to functionality if an exact replacement product is not available.<br />

B. Technology Changes<br />

Process nodes (the minimum feature size in an integrated<br />

circuit) have been reducing every year, as consumer device<br />

manufacturers try to push the boundaries of silicon performance.<br />

This push has seen the number of transistors on-chip continuing<br />

to increase. Designers of custom electronics are moving towards<br />

the most advanced process nodes available, enabling even<br />

thinner mobile phones and faster computers. These high-volume<br />

opportunities are therefore centered around cutting-edge<br />

processes. To make these new generations of ICs, ever more sophisticated fabs are opening, which in turn has left the world with many fabs that are no longer quite at the cutting edge.<br />

Designers may often see these fabs as outdated, but the production technology is proven, with high yield and reliability, and, being fully depreciated and lower cost, they present an opportunity to produce better products.<br />

According to reports by SEMI [2], the foundry market is<br />

projected to grow to $97.5B by 2025. While the volume growth<br />

will be in process nodes like 10nm and below, it is interesting to<br />

observe that still over 50% of the market will be for processes<br />

greater than 20nm.<br />

Fig. 1. Foundry Market by Feature Dimension<br />

So, while the volume demand may be centered around the<br />

smaller process nodes as foundries build newer and newer<br />

factories to meet the demand for the ever-smaller geometries,<br />

they find themselves with mature, reliable factories at capable<br />

geometries that are not at the risky leading edge. The foundries wish to maximize fab utilization across all process nodes. They want to get the best return on the investments already made in these process nodes, and as the factories are fully depreciated, their costs can be lower.<br />

As a result, an excellent opportunity exists to be able to<br />

leverage this technology for custom integrated chips in an<br />

economical manner.<br />

S3 Semiconductors, with over 20 years’ experience in the semiconductor market, has been fostering and developing partnerships with these foundries for many years and as a result has access to a broad range of technology nodes that allow performance-leading mixed-signal and RF IP and custom integrated chips to be developed.<br />

III. CASE STUDIES<br />

Here, we will examine two recent developments whereby the end customer achieved cost and area savings while also realizing additional favorable results. S3semi takes a holistic view of the discovery process, which can enable the development of a chip that allows for future-proofing and lets customers take advantage of market changes, scale, and enter parallel application areas.<br />

Finally, we will examine an online calculator that S3<br />

Semiconductors has developed to assist customers in comparing<br />

the savings possible when choosing a custom integrated chip<br />

route rather than using commercial catalogue standard products.<br />

A. Case Study 1<br />

A major supplier of plant equipment to the oil and gas sector<br />

had been brainstorming internally while working on their<br />

product roadmap to try and understand options to remain<br />

competitive while maintaining control of their product costs.<br />

They approached S3semi to establish and quantify the exact benefits they could expect from going a custom integrated chip route.<br />

The key criteria that became apparent during these early<br />

discussions with the customer were the need for:<br />

• Ability to allow for portfolio tiering<br />

• Multiple sensor interfaces (pressure, temperature,<br />

diagnostics)<br />

• Integrated smart control loop<br />

• Accurate valve positioning<br />

• Multiple communications protocols<br />

• Integrated ARM processor and PIC controller<br />

• A SoC that was designed to be intrinsically safe<br />

• Low power<br />



The incumbent solution consisted mainly of discrete, commercial off-the-shelf components. The<br />

total bill of materials (BOM) cost for the solution was high, and while the end customer was not happy with the prevailing costs, they also felt it was unnecessarily over-specified for the<br />

performance required and did not allow them to implement<br />

product tiering. A feasibility study followed by a design study<br />

phase established the details of the integration options possible.<br />

Central to the discussions were the sensing needs,<br />

measurement needs, control and programmability needs,<br />

connectivity needs and the security needs for the desired<br />

solution. Within 3 months the functional specification of the<br />

custom chip was clear and a detailed project plan for<br />

implementation of the design right up to the qualification and<br />

production was presented and agreed with the customer.<br />

The final custom integrated chip was manufactured using a<br />

0.18µm TSMC foundry process. The final system-on-chip<br />

(SoC) delivered:<br />

• Analog Front End<br />

o 14-bit ultra-low power SAR ADCs<br />

o 12-bit control DACs<br />

o Power switches<br />

o Analog Multiplexers<br />

o Analog Operational Amplifiers<br />

• Multiple industrial communication interfaces<br />

o FOUNDATION Fieldbus<br />

o Highway Addressable Remote Transducer<br />

(HART)<br />

• ARM Cortex-M4 core<br />

• PIC microcontroller<br />

• FLASH & SRAM memories<br />

• Multiple peripheral interfaces<br />

o SPI<br />

o UART<br />

o I²C<br />

o Parallel<br />

Fig. 2. Visual Representation of the benefits of custom integrated chips<br />

B. Case Study 2<br />

S3 Semiconductors has for many years been designing and<br />

manufacturing custom integrated circuits (ICs) that ensure<br />

seamless integration of analog and digital subsystems within<br />

wired and wireless communication systems.<br />

Mobile satellite services (MSS), a more niche area of<br />

wireless communications, provides two-way voice and data<br />

communications to users worldwide who are mobile or in<br />

remote locations. The terminals range in size from handheld to<br />

laptop-size units and can be mounted in a vehicle, with<br />

communications maintained while the vehicle is moving.<br />

Today’s solutions that incorporate satellite terrestrial<br />

modems are generally ASSP-based. As a result, they tend to be large, either not optimized enough or over-specified. They can be noisy, with poor signal integrity, and have<br />

poor blocker performance. They are also inefficient and costly.<br />

MSS customers are no different to other communications<br />

customers. They are demanding greater asset tracking,<br />

monitoring and control. And they require all of this with<br />

increased broadband speed and no disruption in remote<br />

locations.<br />

A satellite operator that services the MSS industry had heard that integration could offer many benefits for their next-generation modem product, and that with enhanced connectivity they could introduce new functionality, which in turn could open up new service-centric revenue streams.<br />

A meeting was arranged with S3 Semiconductors and the<br />

customer to review their product features and understand their<br />

product roadmap. Taking a holistic view of this discovery<br />

process, the key criteria for the customer quickly became apparent:<br />

• Footprint that was much less than that of the current<br />

discrete solution<br />

• Integrated L-band transceiver that could support<br />

multiple modulation schemes<br />

• Choice of converter line-ups (super-het, zero or low-IF,<br />

direct-up conversion)<br />

• Low power<br />

• Economic semiconductor integration node<br />

The development schedule was 12 months and the final<br />

custom ASIC was developed on time and within budget. The<br />

final product was fabricated on a 0.18µm RF-CMOS process from TSMC and included:<br />

• Embedded algorithms for DC offset correction<br />

• RC Time constant calibration<br />

• IP2 Calibration<br />

• IQ Gain/Phase Calibration<br />

• Image Rejection Calibration for low IF, AGC and AFC<br />



• Integrated receiver blocks<br />

• Integrated transmitter blocks<br />

• BIST, Analog-test multiplexer, SPI and<br />

communications digital signal processor<br />

C. Online Calculator<br />

When people think about custom silicon their first thoughts<br />

are inevitably about the perceived high cost associated with it.<br />

However, they also tend to look at the cost of development only.<br />

An area that is often missed in comparisons is the cost of the<br />

product over its entire lifecycle. In application areas within IIoT,<br />

the lifetime of products can be 10 years or more. There is also<br />

the largely unconsidered cost associated with sourcing, storing and<br />

testing the large number of components involved when a<br />

discrete solution is employed, as well as the placement costs<br />

incurred in the board assembly operation. Consideration should also be given to reductions in assembly costs and improved reliability resulting from the significant drop in overall component count when the custom chip incorporates this functionality on a single die.<br />

Taking these elements into account, the true comparison cost<br />

should consider the number of devices expected to be<br />

manufactured every year and the expected product lifetime.<br />

Then, consideration can be given to the design aspect of the<br />

product. S3 Semiconductors has developed an easy-to-use online calculator that takes these inputs and gives a budgetary estimate of the potential savings that can be made by going a custom ASIC route.<br />

other analog components will be needed. The next question concerns the level of processing power needed: is a cost-effective 8-bit MCU all that is required, or will the design need one or more 64-bit multicore processors? Finally, the calculator asks whether wireless connectivity is needed. When all the above inputs are entered into the free S3semi online calculator (https://www.s3semi.com/bom-calculator/), the output shows a break-even volume and the total savings that can be expected. A graphical representation of this is shown in Figure 3 versus the cost associated with going with a discrete solution. This example shows that a custom chip route will give total savings of over $15 million, with a break-even volume of 73,909, on a product that ships 50,000 units per year for 6 years.<br />
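The economics behind such a calculator can be sketched in a few lines. This is an illustrative model only, not S3semi’s actual formula; the NRE figure and unit costs below are hypothetical, and a real estimate would also fold in the sourcing, inventory and test costs discussed earlier:<br />

```python
import math

def break_even_volume(nre_cost, discrete_unit_cost, asic_unit_cost):
    """Units at which cumulative ASIC cost drops below discrete cost."""
    saving_per_unit = discrete_unit_cost - asic_unit_cost
    if saving_per_unit <= 0:
        return None  # a custom chip never pays back
    return math.ceil(nre_cost / saving_per_unit)

def lifetime_saving(nre_cost, discrete_unit_cost, asic_unit_cost,
                    units_per_year, lifetime_years):
    """Total BOM saving over the product lifetime, net of NRE."""
    total_units = units_per_year * lifetime_years
    return total_units * (discrete_unit_cost - asic_unit_cost) - nre_cost

# Hypothetical example: $2.5M NRE, $60 discrete BOM vs $8 ASIC BOM,
# 50,000 units/year over a 6-year product lifetime.
volume = break_even_volume(2_500_000, 60.0, 8.0)
saving = lifetime_saving(2_500_000, 60.0, 8.0, 50_000, 6)
```

With these invented inputs the break-even lands near 48,000 units, so a product shipping 50,000 units per year would pay back its NRE within the first year of production.<br />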

IV. RESULTS<br />

Two case studies for different end markets were highlighted<br />

in the previous section. In both instances, the end customer was<br />

looking for a smaller and cheaper solution without having to<br />

compromise on performance. The results achieved and the<br />

savings made versus the previous discrete implementation are<br />

detailed below.<br />

Case study 1<br />

In case study 1, the industrial customer wanted to develop a custom integrated chip that allowed them to remain competitive while maintaining control of their product costs. The final SoC offered many advantages over the previous solution. It was shown to:<br />
• Achieve a BOM cost reduction of 90%<br />
• Substantially reduce the footprint, with a SoC in a 19mm x 19mm TFBGA package<br />
• Meet the low power budget supplied by the 4-20mA control loop<br />
• Allow portfolio tiering<br />

Fig. 3. Calculator inputs and output<br />

The calculator looks for inputs on the number of devices<br />

being shipped per year and for how many years. Then it queries<br />

how many data converters you expect will be required and what<br />

Case study 2<br />

In case study 2, the communications customer wished to<br />

achieve substantial area savings while introducing new<br />

functionality to their product. When the final solution was compared with the previous solution (see the photo comparison below), it delivered:<br />

• 80% reduction in size<br />

• Improved signal integrity and reliability<br />

• Reduced Power<br />

• Large saving in electronics BOM<br />

Much of the previous discussion has been centered on the<br />

BOM cost savings that custom silicon affords the end user. An<br />

additional saving that may not be obvious is that of area. While<br />

a user might recognize that going a custom route will save area,<br />

they may not realize the actual extent of the savings possible.<br />

Some key products that would be on board a typical analog front-end design are highlighted in Table 1 below. The<br />



typical sizes of an integrated option versus a typical discrete<br />

option are also shown in Fig. 4. As can be seen, the savings<br />

are considerable.<br />

Product Integrated Size Discrete Size<br />
RTC 0.07 mm² 15.2 mm²<br />
12-bit DAC 0.09 mm² 10 mm²<br />
14-bit ADC 0.24 mm² 14.7 mm²<br />
1.8V LDO 0.06 mm² 8.26 mm²<br />
4:1 Analog Mux 0.21 mm² 9 mm²<br />
Power Switch 0.01 mm² 4.2 mm²<br />
Table 1: Area comparison of products<br />
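Summing the two columns of Table 1 shows how large the cumulative effect is. The snippet below simply totals the published figures (die area only; package and board routing overhead are not included):<br />

```python
# (integrated mm², discrete mm²) pairs taken from Table 1
areas = {
    "RTC":            (0.07, 15.2),
    "12-bit DAC":     (0.09, 10.0),
    "14-bit ADC":     (0.24, 14.7),
    "1.8V LDO":       (0.06, 8.26),
    "4:1 Analog Mux": (0.21, 9.0),
    "Power Switch":   (0.01, 4.2),
}

integrated_total = sum(i for i, _ in areas.values())
discrete_total = sum(d for _, d in areas.values())
reduction = 1.0 - integrated_total / discrete_total
print(f"{integrated_total:.2f} mm² vs {discrete_total:.2f} mm² "
      f"({reduction:.1%} smaller)")
```

For these six functions alone, roughly 61 mm² of discrete silicon collapses to under 1 mm² of integrated area, a reduction of about 99%.<br />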

Fig. 4. Case study 2 – Before and After product size comparison.<br />

Creating a fully custom integrated chip design for your<br />

product can ensure you are getting exactly the technical<br />

functionality you need, in a package that is usually considerably<br />

smaller, cheaper and more efficient than one made up of<br />

numerous off the shelf components. Coupled with this is easier<br />

integration of the single custom chip and cheaper and faster final<br />

testing of your end product as you have less components to deal<br />

with. You can also enjoy the peace of mind knowing you will<br />

be able to produce the same device for many years to come and<br />

not risk obsolescence.<br />

V. CONCLUSION<br />

Custom integrated chips are becoming increasingly popular. A totally bespoke chip that incorporates most of the functionality you need in a single device is the ideal for designers. Taking advantage of available capacity at mature process nodes and using an older geometry removes many of the potential headaches of leading-edge nodes, since one has access to all that the mature nodes offer, such as extensive libraries of silicon-proven IP. The resultant chip is usually smaller, cheaper<br />

and more efficient. Differentiation becomes easier and your<br />

intellectual property is better protected as a single chip is much<br />

more difficult to reverse engineer than a board.<br />

Going the custom route often does require you to work with<br />

an accomplished design partner. This offers the advantage of<br />

leveraging the experience of that partner and removes the need to build this competence within your company. The ideal<br />

partner should have extensive experience in the area of custom<br />

chip development. S3 Semiconductors has more than 20 years<br />

of experience in designing analog and mixed-signal custom<br />

silicon, and can help guide you through the complete process.<br />

S3semi can be the perfect partner to help you realize large cost<br />

savings. Using the online calculator, customers get immediate feedback on the costs associated with going the custom chip route. This is backed by 20 years of custom silicon design, including the two case studies highlighted earlier, and by easy access to the most economical, highest-quality production facilities, ensuring higher performance, lower power and proven yield for low-cost custom silicon.<br />

REFERENCES<br />

[1] Frost & Sullivan, Analysis of Sensors in the Global Internet of Industrial<br />

Things Market, 2015<br />

[2] Dr. Handel Jones, Semiconductor Industry from 2015 to 2025, International Business Strategies (IBS), http://www.semi.org/en/node/57416<br />

[3] James O’Riordan, The Compelling Economics for OEMs to Commission<br />

their own Semiconductor Chips.<br />



Demand based and Software guided Selection of<br />

Microcontrollers<br />

Thomas Stolze, Klaus-Dietrich Kramer<br />

Department of Automation and Computer Science<br />

Harz University<br />

Wernigerode, Germany<br />

Abstract— Recent years of development have led to a huge diversification of the microcontroller market. Even though ARM-based microcontrollers have become widespread across most applications, more and more controller families and derivatives are becoming available for product development. On the one hand, this situation appears great for development engineers, as they can select devices from a large pool of hardware. On the other hand, comparisons between these derivatives, especially with a focus on project-based requirements, become very complex and time-consuming. They may even influence the time-to-market of the final application. Furthermore, a quantified selection is not possible with common methods.<br />

A solution to this selection problem is given in the current paper. It focuses on the selection process for microcontrollers in the very early stages of development; a more detailed view of the topic is given in [1].<br />

Choosing the best-suited microcontroller is not a trivial task, so the paper first deals with the elementary key domains that have to be investigated. These key domains are separated into calculation power, properties of peripherals, memory and core features, and external factors such as economic values and software support. Then an algorithmic selection process based on these domains is presented that takes care of user constraints and allows different microcontrollers to be assessed so that they become easily comparable. A software tool has been developed that supports the user in this process. As a result, each considered microcontroller can not only be sorted into the common groups of suited and non-suited devices, but also gets a score that makes it directly comparable to other ones in all relevant domains. With this method it is possible to quantify the suitability of each microcontroller. It is even possible to evaluate microcontrollers regarding their reserves for future developments. A new vector-based visualisation of microcontroller qualification supports the selection of the best microcontroller, too. This new approach leads to a well-founded selection and helps developers make the correct decision in a very short time. It thus also helps lower development costs.<br />

Keywords—microcontroller; selection; choice; evaluation<br />

I. INTRODUCTION<br />

In recent years the microcontroller market has developed into a billion-dollar market [2], offering a huge number of different microcontroller families with their respective derivatives, providing 8-, 16- and 32-bit controllers for every kind of use. The demand from industrial and consumer applications reinforces this trend, especially as mobile and IoT devices become more and more widespread.<br />

But what sounds like a big deal for consumers has a major<br />

drawback for developers. The development process and<br />

especially the selection of well-suited controllers become more and<br />

more complex. It is not only the number of possibly suitable<br />

devices, it is more the comparability of devices from different<br />

vendors with different properties and features. But the selection<br />

of an optimally suited device is essential for the later success of<br />

the product or application. What's more, this process is limited<br />

by time and costs. The longer the selection process takes, the<br />

later the real production can start, and the more expensive the<br />

whole development gets. And, even worse, selecting a device<br />

that may not fulfill a requirement very late in the development<br />

process may cause huge costs for changing to a better one or it<br />

may even lead to project failure. Because of this dilemma<br />

developers have to take the utmost care when selecting a<br />

certain derivative, but have to do it effectively, too. So, from a<br />

developer's point of view, what are possible helpful<br />

approaches?<br />

II. SELECTION OF A MICROCONTROLLER<br />

There are different ways to make a choice, each of which has pros and cons. The most intuitive way is a manual selection process. Here the developer will attempt to gather information about microcontrollers that seem suitable for a given task with the corresponding constraints and then, based on a comparison of the information, make a decision. Because all the research is done manually (e.g. individually deciding which devices to look at, extracting information from data sheets), this tends to be the most time-demanding way. The<br />

developer has to manually process and compare various<br />

properties. Furthermore, it is also the most limited way, since<br />

the selection will be affected by personal experiences and<br />

relations with certain manufacturers and devices.<br />



A second strategy is to use comparison tools provided by<br />

hardware manufacturers. These often come in the form of websites<br />

that allow developers to put their requirements into forms that<br />

automatically limit the displayed results to those which fulfill<br />

all of them. All other derivatives are discarded. While this<br />

procedure helps developers to save time, it also limits the<br />

results to those of the respective manufacturer. That means a<br />

comprehensive comparison may only be possible if other<br />

manufacturers are included manually by the developer, or by<br />

using web-tools that are provided by controller distributors<br />

with a portfolio supporting different manufacturers.<br />

However, both selection strategies have other drawbacks<br />

which shall be discussed. First, the processing power is in most<br />

cases not part of the selection process, especially when using<br />

web tools. Therefore developers either have to test<br />

recommended devices on their own, or they have to rely on<br />

benchmark results published by manufacturers or third parties<br />

which may not be applicable to current application demands.<br />

Unfortunately, even today results of the Dhrystone benchmark are often published to compare microcontroller systems, but this type of benchmark is outdated and was developed for demands other than those of today's microcontrollers<br />

[3]. Possible solutions are the EEMBC benchmarks, for<br />

instance the Autobench benchmark [4]. It provides detailed<br />

performance information for many use cases, but it has to be<br />

either run by developers themselves or by EEMBC as part of a<br />

commissioned work. As processing power is an important<br />

selection constraint, it is necessary to have reliable results, but<br />

this may lead to additional time and cost efforts. It may even be<br />

impossible to realize pre-development tests because of the<br />

mentioned efforts, or even because appropriate test software is<br />

not written yet. To summarize, a combination of an automatic<br />

selection including reliable performance data for multiple<br />

hardware manufacturers would be a convenient solution. That<br />

would also minimize efforts of the developers.<br />

Second, the comparison which is made only divides all<br />

considered devices into two groups: those which are suited for<br />

the intended use, and those which are not. There is no<br />

quantifying examination which may lead to a more<br />

sophisticated selection process showing an ordered lineup of<br />

suited devices. The selection process would become more<br />

transparent and flexible if there were dedicated qualifications<br />

for each derivative. An example may be a microcontroller that<br />

is disregarded in the old-fashioned scheme because of a minor<br />

lack of memory, but which would outperform other ones in all<br />

other aspects. Here a quantifying approach would allow a more detailed view of the situation and thus help to find a workaround, e.g. some compiler optimization with regard to code size, that makes this controller the number one to choose.<br />

Third, selecting a microcontroller is not only limited to<br />

processing power or peripheral demands. Furthermore, it is<br />

also depending on factors like costs per device, availability and<br />

software support. Therefore, these factors have to be included<br />

in the selection process, too. Some of these criteria are included<br />

in manual and web-based searches, e.g. the costs per device.<br />

Other constraints like the support by different development<br />

environments still have to be examined separately.<br />

In conclusion, there are certain useful methods that partly support the selection process, but additional features are necessary to assess all aspects of the selection. In order to comprehensively support the selection<br />

process a new assessment system has been developed that<br />

integrates all mentioned aspects to make a well-founded<br />

selection without time-demanding and costly research. This<br />

assessment system shall be described subsequently.<br />

III. VECTOR-BASED BENCHMARK OF EMBEDDED CONTROLLERS<br />

The main purpose of the development of the vector-based<br />

benchmark of embedded controllers (abbreviated "VBEC") is<br />

to support the selection process by providing a software tool<br />

that automatically calculates the qualification of<br />

microcontrollers with respect to given requirements. The<br />

qualification is displayed in the form of dedicated values for<br />

different domains. Therefore, the following three assessment<br />

domains were defined:<br />

• processing power<br />

• peripherals, memory, features<br />

• external factors<br />

In order to work, VBEC communicates with an underlying<br />

database which contains all relevant information for different<br />

microcontrollers. The data maintenance is realized by a special<br />

frontend to collect data and manage datasets. This is done by<br />

the administrator. Developers use the VBEC frontend to put in<br />

the project's requirements and get the assessment results. The<br />

following figure illustrates the software architecture.<br />

Fig. 1: Software Architecture of VBEC<br />

Each derivative is saved together with benchmark results from tests that were previously run on the device. The derivative is also linked to general information<br />

about its peripherals, memory system and special processor<br />

features like floating point units (FPU). The dataset for each<br />

device is being completed by information about unit costs at<br />

different distributors, availability information in terms of lead<br />

times at different distributors and software support in form of<br />

integrated development environments supporting the device.<br />

Developers using the software system key in their project-specific requirements via the VBEC frontend; for this purpose, a form-based software has been developed. As a first step, each<br />



element used for the detailed comparison has to be weighted<br />

according to project specific needs. That ensures a calculation<br />

where every single aspect matches the current project's<br />

priorities and is not only generalized information. It is also<br />

possible to declare requirements as essential, so that these have<br />

to be fulfilled in any case. Afterwards, the project's<br />

requirement values are collected, ordered by the three domains.<br />

In the domain of calculation power this is a score value<br />

which specifies the overall performance needed. The score is<br />

automatically calculated using 12 different benchmark<br />

modules. Developers can decide whether all of them are<br />

weighted equally or in a more detailed fashion, e.g. weight<br />

certain benchmark modules higher than others to match the<br />

intended software design. The underlying database contains<br />

result values of the 12 different benchmark modules for each<br />

microcontroller, ranging from simple math operations over<br />

library functions to FFTs or cosine-transforms. These<br />

benchmarks were pre-run so that developers neither have to do<br />

the test work on their own, nor have to rely on numbers from<br />

the web which may not be reliable enough or do not reflect the<br />

project adequately.<br />

In the second domain the demands regarding peripherals,<br />

memory or special processor features have to be defined. This<br />

is done either by entering numeric values for the required elements (e.g. the number of available AD-channels) or by checking required features to search for, for instance an FPU.<br />

The third domain is processed in a similar fashion, selecting<br />

values for external factors like maximum allowed costs per<br />

unit. The datasets provide detailed information to support the<br />

search, so that it is possible to keep costs up-to-date with<br />

regard to certain distributors and buying quantities.<br />

Additionally, this domain handles lead time and software<br />

support by a certain number of IDEs. Especially the<br />

information about usable IDEs may contribute to the overall<br />

project costs because of licenses that have to be acquired.<br />

Showing possible alternatives may help to decide for or against<br />

an IDE.<br />

Developers may now directly compare their inputs with<br />

two real microcontrollers selectable from a list of all<br />

derivatives available in the database. An easy comparison is<br />

given since all relevant values are lined up next to each other.<br />

Using a coloring scheme it is possible to mark underachievements<br />

orange and violations of the essential<br />

requirements red, helping to find any problems very quickly.<br />

Moreover, VBEC is now able to calculate triple values of<br />

the qualification for each microcontroller in the database,<br />

where each value represents the qualification in one of the<br />

three domains. In order to determine those triple values, a<br />

single qualification for each property has to be calculated first.<br />

This is done by using the following equation:<br />
<br />
q_i = a_i / r_i        (1)<br />
<br />
where a_i denotes the available and r_i the required quantity of property i. The calculation can be done in a non-saturated or a saturated way, where in the latter the ratio between available and required quantity is limited to 1 or 100 %. The following formula is then used to summarize the single qualifications q_i into a domain qualification Q:<br />
<br />
Q = ( \sum_{i=1}^{n} w_i \cdot q_i ) / ( \sum_{i=1}^{n} w_i )        (2)<br />
<br />
Equation (2) shows the calculation of the domain qualification Q by summarizing the single qualifications q_i and the according weights w_i using the weighted arithmetic mean. As a result, VBEC calculates three Q values between 0 and 1 (or between 0 % and 100 % with regard to percent) when all q_i are saturated. If they are not, the resulting Q may be greater than 100 %, indicating that there are over-achievements which may be used for further development. The three Q values provide information about how well a device matches the requirements in the three described domains. They form the basis of the selection decision.<br />
<br />
The triple values can also be interpreted as a vector drawn within a 3D qualification cube to simplify the comparison of the results (see fig. 2). The axes of the cube represent the three domains. Devices that are fully suitable and fulfill all requirements will have a qualification triple of Q = {1, 1, 1} and show up as a vector from (0, 0, 0) to (1, 1, 1) in the 3D cube. That is why point (1, 1, 1) can be seen as an optimum, whereas all deviations of result vectors from this point are under-achievements, as long as the values are calculated with the saturated method. If they are not saturated, then it is possible to get values greater than 1, indicating that there is an over-achievement or some future reserves.<br />
<br />
Fig. 2: 3D qualification cube for two compared microcontrollers A and B (example assessment)<br />



Fig. 2 shows a graphical comparison between two different microcontrollers assessed with the same requirements. It can easily be seen that controller B matches all requirements and therefore its qualification vector ends at (1, 1, 1). On the contrary, microcontroller A does not fulfill all requirements. In fact, it falls short in all three domains and has disadvantages on all three axes, even though only a slight one on the axis "external factors". By rotating the view this situation becomes clearer, as shown in fig. 3.<br />

collect performance data. As of now, the database contains 25 microcontrollers from major manufacturers as representatives of different microcontroller families. Their specification data<br />

was also saved in the database. Based upon that information<br />

VBEC has been tested with several example projects including<br />

the development of E-Bike control modules, surveillance of<br />

chemical processes, robot vehicles and consumer goods like<br />

home meteorological stations. It turned out that VBEC<br />

achieves an easy and fast comparison and provides useful<br />

information for selecting the best suited microcontroller.<br />

However, it is noticeable that the database is yet too small and<br />

needs to be extended. The more microcontrollers there are, the<br />

more meaningful VBEC gets. It also turns out that some under-achieving devices may still be usable because their missing elements can be substituted. This is often the case when the affected device is cheaper than others, which may allow implementing<br />

workarounds using some extra hardware.<br />

Fig. 3: 3D qualification cube, rotated (example assessment)<br />

VBEC can also calculate an overall qualification value from the triple values, allowing microcontrollers to be compared by that single value. Developers<br />

may decide whether to do that using the arithmetic mean, the<br />

geometric mean or the length of the assessment vector in the<br />

3D qualification cube. The geometric mean represents a more<br />

pessimistic view because it emphasizes under-achievements,<br />

whereas the arithmetic mean represents a more balanced view<br />

and the length of the assessment vector provides a more<br />

optimistic view emphasizing the over-achievements.<br />

By selecting one of these methods the overall qualification value can be fitted to the project's peculiarities.<br />

Furthermore, VBEC is able to show a detailed comparison<br />

table including main information like the triple values, the<br />

overall qualification values and other relevant properties like<br />

current device costs. That table can be sorted in different ways,<br />

so it is possible to quickly determine which microcontroller<br />

scores the highest triple values or overall qualification.<br />

Detailed module based benchmark results are provided by a<br />

performance diagram. By using these techniques, developers<br />

only need to pick the best-suited derivatives from the table and<br />

the diagram.<br />

IV. CURRENT TESTS<br />

Before VBEC could be tested, many microcontrollers with<br />

their according development kits had to be benchmarked to<br />

V. CONCLUSION AND FUTURE DEVELOPMENT<br />

Summarizing the presented software-guided selection of<br />

microcontrollers VBEC is able to effectively support<br />

developers making the correct choice. It integrates all relevant<br />

aspects of the selection process and provides data in order to<br />

compare microcontrollers directly with the project's<br />

requirements. Moreover, it automatically assesses all<br />

derivatives that are saved in the database and thus provides an<br />

easy selection. With the help of the 3D rendering and tables it<br />

is very easy to compare different controllers.<br />

In order to work properly and provide adequate data VBEC<br />

has to be updated with new devices as they appear on the<br />

market. That includes not only putting the specifications into<br />

the database, but also to run benchmark tests on every single<br />

device. The effort for doing so is high - but developers benefit<br />

from these tests, and the tests only have to be run once.<br />

Nevertheless, an appropriate way to keep VBEC up to date seems to be to select certain representative derivatives from each microcontroller family as a reasonable middle course.<br />

REFERENCES<br />

[1] T. Stolze, Application Criteria of Single Chip Microcontrollers/<br />

Embedded Controllers, Ilmenau, Isle Verlag, 2018, unpublished. (ISBN<br />

978-3-938843-90-1)<br />

[2] IC Insights, Inc., “MCU Market Forecast to Reach Record High<br />

Revenues Through 2020: After recent years of price erosion, MCU<br />

ASPs are forecast to rise and help lift sales to new highs,” Scottsdale<br />

AZ: IC Insights Inc., Research Bulletin, August 2016.<br />

[3] A. Weiss, “Dhrystone Benchmark : History, Analysis, 'Scores' and<br />

Recommendations,” El Dorado Hills, CA: EEMBC, 2002, URL:<br />

http://www.eembc.org/techlit/datasheets/dhrystone_wp.pdf, last access<br />

2018-01-18.<br />

[4] EEMBC, “AUTOBENCH™: An EEMBC Benchmark,” El Dorado Hills, CA: EEMBC, 2015, URL:<br />

https://www.eembc.org/benchmark/automotive_sl.php, last access 2018-<br />

01-18.<br />



Tips and Tricks for Debugging<br />

Greg Davis, Director of Engineering, Compiler Development<br />

Green Hills Software, Inc.<br />

Copyright 2011‐2018 by Greg Davis<br />

When people think about what a software engineer does for a living, they say he or she writes<br />

code. While this is certainly true, it is hardly an accurate description of what the job is like since<br />

writing code is unlike just about every other form of writing. A journalist might write several<br />

articles in a given month on a variety of topics in his domain. At a high level, the process of<br />

authoring a particular article might be like the process of writing a piece of code. You might<br />

start with an outline view of the whole, then work on the key points until you arrive at a rough<br />

draft. Then you might start fixing flaws and improving on various aspects until you arrive at<br />

something that is usable. This is probably where the similarity ends. A journalist may follow<br />

such a process to write his article, then he may take a step back, do some more research, and then<br />

start over again on a new article. The next article may stand independently of his previous<br />

works. Or, as is often the case, it may build upon previous articles. Yet once an article is<br />

published, it is on the record as published, and revisions are usually limited to corrections to<br />

factual errors.<br />

This alone sets the world of an engineer far apart from the world of a journalist. Except when<br />

changing jobs, very rarely is an engineer ever really done with his code. Typically, an<br />

engineer works on an accumulating body of code; from version to version, new capabilities are<br />

added, but much of the old code remains. The old code needs to constantly be revisited to make<br />

it work with new aspects of the evolving system.<br />

An even more overlooked aspect of being a software engineer is the act of debugging. Studies<br />

show that an engineer spends roughly 80% of his time debugging, so it is surprising how little<br />

attention is paid to this time consuming activity. Debugging is unlike how we write code. When<br />

we write code, we start with a conceptual model of how we believe our code works, and then we<br />

augment the code to match the model of how we believe the code should work. The fact that our<br />

conceptual model of a program’s behavior is flawed is what leads to needing to debug in the first<br />

place. On the other hand, debugging is centered on the reality of how your program is<br />



actually behaving. Debugging is the process of discovering how things really work so that we<br />

can set them right.<br />

We have seen books upon books about how to write code. Books on topics such as software<br />

design, software development methodologies, teamwork, coding styles, and the like are<br />

commonplace. But relatively little has been written about debugging. Similarly, universities are<br />

full of classes teaching about operating systems, algorithms, programming languages, theory,<br />

graphics, but little is actually taught about debugging. Students are forced to develop their own<br />

approaches to debugging based on the tools at hand.<br />

This paper focuses on techniques that are effective in debugging a wide range of embedded<br />

systems.<br />

Connecting to Your Target<br />

The first step in debugging a system is to get everything running under a debugger. While there<br />

are certain circumstances where a debugger is not appropriate, the vast majority of debugging<br />

can and should be done using a debugger. The main alternative, printf-debugging, will be<br />

discussed later.<br />

In order for your debugger to function, it needs to be able to connect to your embedded system.<br />

There are three main varieties of debug connections.<br />

JTAG Probe – The probe is a device that connects to your processor through special<br />

debugging channels, which are typically exposed by extra pins on the processor.<br />

Although there are many kinds of debug channels, JTAG is the most well known, and it<br />

is sometimes used as a generic term to refer to any such debug channel. Using this debug<br />

channel, the probe can inspect registers and memory, and it can start and stop the<br />

processor.<br />

Debug Agent – A debug agent is a process running under an operating system that is used<br />

to control other threads and processes on the system. The debug agent uses the operating<br />

system to inspect memory, set breakpoints, and to start and stop processes.<br />



Monitor – A monitor is like a debug agent, except that it runs directly on the hardware<br />

rather than on top of an operating system. It uses a timer to periodically halt the system<br />

so that it can communicate with the debugger.<br />

Typically a JTAG Probe or Monitor is used if you are running without an operating system or<br />

with a microkernel. When running under a full fledged operating system or RTOS, a Debug<br />

Agent is typically used, although a JTAG probe is still useful for debugging problems in the<br />

hardware or kernel.<br />

Basic Debugging<br />

Once your system is connected to your debugger, you can start debugging. To illustrate a basic<br />

debugging session, let’s pretend that your embedded system is a traffic light monitor. It is<br />

connected to traffic lights and traffic sensors, and it periodically sends signals over a serial port<br />

to a nearby computer. Your system works fine when it first starts up, but after a while it begins<br />

to send corrupted packets over the serial port.<br />

What you might do to start debugging is to set a breakpoint at the communication routine that<br />

sends packets over the serial port. You restart the system, and every time a packet is sent, you hit<br />

the breakpoint. This means that the system has stopped running, and it is in a mode where you<br />

can inspect the state of the program. Perhaps the function looks something like:<br />

error_t send_message(char *message, size_t message_len)<br />
{<br />
    ...<br />
}<br />

A logical next step is to view the message parameter and to look at it to see if it is corrupted. If<br />

it is corrupted, you know that the problem is that the caller of send_message passed along a<br />

corrupt message. You should now start looking at the caller to see how that happened. If, on the<br />

other hand, the message parameter looks correct, you might set the system running again.<br />

Now that the system is running again, the message should be sent over the serial port. Now look<br />



at the computer on the other end of the serial port and see what the message looks like. If it<br />

looks OK, then you have not yet experienced the problem. You can let the system keep running<br />

and repeat the process the next time the breakpoint at send_message is hit. However, if the<br />

message looks corrupt, then most likely there is some sort of problem in the sending of the<br />

message. At that point, you can review how a UART works and start stepping forward in<br />

send_message() to see when things start to go wrong.<br />

This sort of investigative approach is the bread and butter of debugging. If you are new to<br />

debugging with a debugger, you can stop reading here and start trying it out on your own. You’ll<br />

find that using a debugger is light years ahead of what you were doing before.<br />

Call Stacks<br />

Another very useful feature of debuggers is the call stack. The call stack shows the functions<br />

that are currently active. For example, if you had found that the message parameter passed<br />

into send_message() was corrupted, you would want to look to see how it had happened.<br />

The call stack is the first thing to check. The call stack might show:<br />

main() -> event_loop() -> status_check() -> periodic_update() -> send_message()<br />

At this point, you might ask the debugger to climb up from the current function,<br />

send_message() to the periodic_update() function to see what is going on.<br />



At this point, you can see the context where periodic_update() called<br />

send_message(). There are a few possibilities:<br />

1. status_message could have come out corrupted from get_status_message()<br />

2. The status could have been corrupted at some point after get_status_message()<br />

3. The accessor functions GET_MESSAGE() and GET_LEN() might be to blame.<br />

It’s hardly conclusive at this point, but at least we know something already. Without a call<br />

stack, you’d probably end up setting breakpoints earlier in the program to try to debug before<br />

the call to send_message().<br />
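As a concrete illustration, the calling context might look like the following sketch. This is a hypothetical reconstruction: only the names send_message(), get_status_message(), GET_MESSAGE() and GET_LEN() come from the discussion above; the status type, the buffer size and the stub bodies are assumptions.

```c
#include <stddef.h>
#include <string.h>

typedef int error_t;

/* Hypothetical status record held between the producer and the sender. */
typedef struct {
    char   data[64];
    size_t len;
} status_t;

/* Suspect 3: the accessors themselves could be wrong. */
#define GET_MESSAGE(s) ((s).data)
#define GET_LEN(s)     ((s).len)

static status_t status_message;

/* Suspect 1: the message could already come out of here corrupted. */
static void get_status_message(void)
{
    strcpy(status_message.data, "LIGHTS OK");
    status_message.len = strlen(status_message.data);
}

/* Stub so the sketch is self-contained; the real routine writes the
 * packet out over the serial port. */
error_t send_message(char *message, size_t message_len)
{
    (void)message;
    (void)message_len;
    return 0;
}

error_t periodic_update(void)
{
    get_status_message();
    /* Suspect 2: status_message could be corrupted between the call
     * above and the call below. */
    return send_message(GET_MESSAGE(status_message),
                        GET_LEN(status_message));
}
```

Each of the three possibilities maps onto a different place to set the next breakpoint, which is exactly what the call stack buys you.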

Hardware Breakpoints<br />

Software breakpoints are used to stop the execution of the program when a given line is reached.<br />

The way they work is that the debugger replaces the first instruction at that source line with a<br />

trap instruction. When the trap instruction is executed, the program stops and the debugger gains<br />

control. The implementation details of software breakpoints vary from system to system.<br />



Unlike software breakpoints, hardware breakpoints come in two varieties, hardware execution<br />

breakpoints, and hardware data breakpoints. Hardware execution breakpoints are like the<br />

standard breakpoints in that they stop the program when a given line of code is executed.<br />

Typically, hardware execution breakpoints are unnecessary as long as the program is running out<br />

of RAM. However, when the program is running out of ROM, hardware execution breakpoints<br />

are probably the only execution breakpoint that will be available. Hardware breakpoints are<br />

implemented in the CPU itself, so the capabilities vary from system to system.<br />

Hardware data breakpoints are unlike software breakpoints in that they watch over a designated piece of<br />

memory. The capabilities of hardware breakpoints vary from target to target, but generally they<br />

can be set to halt on a read, a write, or either a read or write, to the designated memory.<br />

A common use case is when you find that a given piece of memory is corrupted: you set a write-only hardware breakpoint on the memory so that you can see every time the memory is being<br />

modified. In my experience, most data is set once or twice and then left alone until it goes out<br />

of scope or is freed. So a hardware breakpoint will quickly show you the culprit. The biggest<br />

problem in practice is identifying where to set the hardware breakpoint. You might know that<br />

your system is crashing because a packet is being corrupted, but unless the packet always ends<br />

up at the same location in memory, you’ll need some way to identify which packet is being<br />

corrupted. You’ll need to stop the system and set the hardware breakpoint once this packet is<br />

created.<br />

Hardware breakpoints can also be implemented under an operating system by unmapping the<br />

page that the data resides at. When the data is accessed, there will be a page fault, and the<br />

operating system can take over. However, since pages generally contain many variables, an<br />

operating system implementation slows down the system whenever unrelated variables that<br />

reside on the same page are accessed.<br />

Hardware data breakpoints are also redundant if your system is running under a simulator<br />

or is interpreted. The simulator or interpreter should have such a capability already. But for the<br />

rest of us, hardware data breakpoints are invaluable.<br />



Printf Debugging<br />

Printf debugging is the practice of instrumenting your code with print statements as a means of<br />

debugging. Typically, you add printf statements to the code, run it, and see what was going on<br />

before the bug occurred. Usually the information is incomplete, so you typically iterate by<br />

adding more printf statements to the code in places where the details are sketchy. At some point,<br />

you end up with enough print statements in the code that you are able to determine what is going<br />

wrong.<br />

One deficiency with printf debugging is that it requires multiple compilations as you keep adding<br />

more printf’s to the code. Printf debugging can also make bugs disappear since the printf’s add a<br />

fair bit of overhead to the code, change the timing all over, cause the optimizer to unnecessarily<br />

make pessimistic assumptions, and perturb the register allocation in the routines calling printf().<br />

Printf debugging is all too often used in cases where there is no debugger available, or when the<br />

user doesn’t know how to use it. This is generally a bad reason to resort to printf debugging.<br />

There are a few legitimate cases where a debugger cannot be used to debug a system, but these<br />

are few and far between.<br />

A legitimate reason to use printf debugging is when there is a lot of data to sort through to<br />

narrow down where the problem lies. If you can direct the output onto the host machine's disk<br />

or into a large memory buffer, you can sift through it all. Often times printf debugging is used to<br />

narrow down the problem to a particular software component at which time a debugger can be<br />

brought up on the component to track down the problem further.<br />
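The memory-buffer variant of this idea can be sketched as a small trace facility that formats into a ring buffer instead of a slow console, so the logging perturbs timing far less. The buffer size and the function name are assumptions made for this sketch; the buffer is then dumped from the debugger or a post-mortem tool.

```c
#include <stdarg.h>
#include <stdio.h>

#define TRACE_SIZE 4096

/* The debugger (or a post-mortem dump) reads these symbols directly. */
static char     trace_buf[TRACE_SIZE];
static unsigned trace_head;

/* printf-style logging into the ring buffer; old entries are
 * overwritten once the buffer wraps around. */
static void trace_printf(const char *fmt, ...)
{
    char line[128];
    va_list ap;

    va_start(ap, fmt);
    int n = vsnprintf(line, sizeof line, fmt, ap);
    va_end(ap);
    if (n < 0)
        return;                     /* formatting error: drop the entry */
    if ((size_t)n >= sizeof line)
        n = sizeof line - 1;        /* entry was truncated by vsnprintf */

    for (int i = 0; i < n; i++) {
        trace_buf[trace_head] = line[i];
        trace_head = (trace_head + 1) % TRACE_SIZE;
    }
}
```

A usage such as trace_printf("packet %d\n", seq) costs one vsnprintf and a short copy, with no I/O on the target, which makes the "bug disappears when I add printf" effect much less likely.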

Debugging for Comprehension<br />

As I stated earlier, one of the big reasons that we end up debugging in the first place is because<br />

our conceptual model of how a system works is flawed. During the process of software<br />

development, debugging can be used to increase your understanding of the code.<br />



“Why Isn’t this Failing Every Time?”<br />

Sometimes we look at an incomprehensible piece of code and it looks so dumb, so broken, and<br />

so wrong that we wonder how the code has survived for so long. We wonder what the author<br />

could have possibly been thinking. Was he pulling an all nighter? This is totally broken!<br />

Sometimes, we are the author, and we still have no clue.<br />

Sometimes we even vaguely remember writing the code, and this still doesn’t help.<br />

Chances are, the code isn’t totally broken. There’s probably some condition that makes the code<br />

work, at least most of the time. The only way to understand this is to see what’s going on when<br />

the code is executed. Set a breakpoint on the code, start looking around at the relevant<br />

variables, and prepare to be amazed.<br />

“Is this Code Ever Reached?”<br />

Sometimes something looks so ancient or quaint that you have a hard time believing it matters<br />

anymore. Maybe the code is commented using acronyms that you haven’t heard since you were<br />

in high school.<br />

A debugger can be useful to browse code between different modules. It can bring you to call<br />

points in different modules far faster and more accurately than the corresponding grep command.<br />

Often times, the code isn’t a function, but just a part of a function that is only reached under can<br />

be met.<br />

A more common technique is to replace the code in question:<br />
<br />
...<br />
if (guard_var == 3 && ptr == NULL) {<br />
    ... // you have a hard time believing this code works<br />
<br />
with:<br />
<br />
...<br />
if (guard_var == 3 && ptr == NULL) {<br />
    send_message("Dubious code", 13);<br />
    // Loop until you bring up a debugger to debug further<br />
    while (one) ; // one is a volatile variable with value 1<br />
    ... // you have a hard time believing this code works<br />

Then test your system like you normally would. Oftentimes, you’ll see the “Dubious code” message, and the program hangs due to the while (one) loop. You can attach to the program and see what’s happening. From the call stack and from inspecting different variables, you can probably figure out everything that you needed to know. Occasionally, you’ll find that the code is never reached. That means one of two things: either the code can be deleted, or you need to improve your testing system.<br />

“I hope this works…”<br />

Once you’ve identified a problem, you modify the code to fix it. If you’re like most people, you often find yourself hoping that your code does the trick. You run your system again, and if it works properly, you consider your job done.<br />

A better technique is to first watch your new code run under the debugger. Oftentimes, your new code is still wrong, and a quick look at it under the debugger will set you back on the right track more quickly than waiting for your system to fail again, and the feedback will be immediate. Another advantage is that even if your code fixes the problem, it might not fix it for the reason you think. When you see the code run, you might realize what you were doing wrong.<br />



Advanced Debugging Techniques<br />

Data Visualization<br />

Some data structures are difficult to visualize quickly in order to understand their contents. For example, when viewing a C++ STL string, it typically requires descending a level into the data structure to find the string. STL maps are even more complicated. Consider an STL map that maps from a C++ string into an STL list of integers. The fully resolved name for this data type is:<br />

std::map<std::string, std::list<int> ><br />

Actually viewing the data structure requires descending a data structure which is typically<br />

implemented as a tree. While there are efficiency reasons for the data structure being<br />

implemented as a tree, it isn’t convenient to have to view it as such if you’re just trying to figure<br />

out what strings the map contains and what they map to.<br />

Many debuggers offer the capability of viewing STL data structures simply. For example, an<br />

instance of the sorted map above that contains the words “better debugging is good” with each<br />

word having the index 1, 2, 3, and 4, respectively, might look like:<br />

The importance of this goes beyond debugging standard C++ data structures. A real world<br />

embedded application contains numerous important data structures. These data structures are<br />

probably performance optimized, but they need to be debugged by real people.<br />



If your debugger has data visualization capabilities, they are probably extensible. It is well worth your time to write extensions so that you can view your own data structures. It might take<br />

several hours to a couple of days to write the extension, but this will quickly pay off if these data<br />

structures are often used. If your debugger does not have these visualization capabilities built in,<br />

the next best thing is to provide a textual representation of the data. You could easily iterate<br />

through the above data structure to print out something like:<br />

{{"better", {1}},<br />
{"debugging", {2}},<br />
{"good", {4}},<br />
{"is", {3}}}<br />

This textual representation can be viewed from the debugger or dumped out to a file on your host<br />

computer.<br />

Conclusion<br />

We have examined a number of ways to debug a system. These techniques can greatly speed up<br />

the process of debugging when compared to other approaches.<br />



Jump Starting Code Development to Minimize Bugs<br />

Jacob W. Beningo<br />

Beningo Embedded Group<br />

Linden, Michigan USA<br />

jacob@beningo.com<br />

Abstract—Debugging an embedded system is one of the greatest challenges that face developers during the development cycle. Using a mix of traditional and modern development techniques such as assertions, RTOS Aware debugging, streaming trace and code reviews can greatly decrease the time spent debugging and keep debugging from becoming a major development challenge.<br />

I. INTRODUCTION<br />

Debugging an embedded system is one of the most time-consuming and expensive activities that embedded software<br />

developers engage in. Survey results [1] show that the average<br />

team can spend as much as 20% of a development cycle<br />

debugging their software. Despite these survey results, during<br />

2017 I polled developers at three Embedded Systems<br />

Conferences in Boston, Minneapolis and San Jose in addition to<br />

the Arm Technical Conference and found that from the several<br />

hundred engineers I encountered, these developers spent on<br />

average 40% of their development cycle debugging! Combining<br />

these two results, in a yearlong project, these developers are<br />

spending anywhere from 2.4 to 4.8 months debugging their<br />

software!<br />

Developers can easily prevent, detect and eliminate bugs to<br />

dramatically decrease the time they spend debugging their<br />

embedded system. In this paper, we are going to examine several<br />

techniques ranging from traditional techniques such as<br />

assertions and code reviews to modern techniques like real-time<br />

tracing that can be used to quickly detect bugs. In the last section of this paper, we develop a robust process that readers can follow and implement to decrease the time they spend debugging and spend more time innovating.<br />

II. WHERE DO BUGS COME FROM?<br />

I don’t think there are very many developers who start out<br />

wanting to put bugs into their software. Sure, many of us enjoy<br />

a debugging challenge but given the pressure that we are under<br />

to deliver product, the preference is to just skip the bugs and get<br />

the job done right the first time. The problem is that it is truly<br />

impossible to develop a bug-free system, and anyone who tells you otherwise is just kidding themselves and trying to fool you.<br />

We can certainly do everything in our power to mitigate and<br />

minimize the bugs that are present, but they are still undoubtedly<br />

there. Edsger W. Dijkstra once stated:<br />

“Program testing can be used to show the presence of bugs,<br />

but never to show their absence!”<br />

We can never truly declare that we develop bug-free code<br />

because no matter how strictly we test the system, we can only<br />

show that under certain circumstances the system behaves as<br />

expected.<br />

We could discuss how bugs are introduced into a system at<br />

great length, but since we want to focus on bug prevention and<br />

detection techniques, let’s just examine a few possibilities. The<br />

most common causes (in my opinion) for bugs in an embedded<br />

system are:<br />

• Faulty requirements<br />

• Poorly architected software<br />

• Complex implementation<br />

• Not using industry best practices<br />

• Rushed development cycle<br />

These are just a few examples, but there are dozens more that we could undoubtedly list. The trick is to develop a process for preventing and immediately detecting bugs in a system. Before developing a process or procedure, it is useful to first evaluate your debugging skills.<br />

III. HOW SOPHISTICATED ARE DEBUGGING TECHNIQUES?<br />

Back when I first started developing embedded software<br />

twenty years ago, it always felt like we would cross our fingers,<br />

press the debug button and hope for the best. If things appeared<br />

to work, we would heave a sigh of relief and nervously announce<br />

the system was working. If something went wrong, we now had<br />

to guess and infer what on earth was going on in that little silicon<br />

core that we could only glimpse at through breakpoints, watches<br />

and maybe printf, if it didn’t interfere with the system’s real-time<br />

performance. In general, these techniques are inefficient, error<br />

prone and require a lot of guess work.<br />

www.embedded-world.eu<br />



The modern developer has far more than these simple,<br />

traditional techniques available to them. Newer techniques<br />

include but are not limited to:<br />

• Statistical profiling by sampling the program counter<br />

(PC)<br />

• Data profiling<br />

• Task and data tracing<br />

• Instruction tracing<br />

In order to determine how sophisticated a debugger you are, review the techniques that are listed in Figure 1. Rank yourself on a scale from 0 to 10, where 10 indicates technique mastery and 0 indicates that either the technique is never used or that you know little to nothing about it. Sum up the total for each debugging technique and see where you currently rank in Figure 2.<br />

Fig. 1. This diagram highlights common debugging techniques, starting with simple breakpoints and moving to more modern and complex techniques in a clockwise direction. (See the text for how to rank yourself.)<br />

Debugging Technique Evaluation<br />
Rank | Status<br />
0 – 40 | Stumbling in the dark ages<br />
40 – 60 | Crawling out of the abyss<br />
60 – 80 | Expert bug squasher<br />

Fig. 2. Rankings for how sophisticated your use of debugging techniques is.<br />

As you can see from the table, in order to be truly efficient at debugging an embedded system, you need to be an expert in at least six of these debugging techniques.<br />

Let’s now examine several different techniques that developers can use to prevent and find software bugs, before exploring a simple process developers can follow to rid themselves of bugs.<br />

IV. REVISION CONTROL SYSTEMS<br />

Using a revision control system doesn’t really help prevent bugs, but it can be very useful in finding them. More often than not, a bug will surface or be discovered in code, which then leads the developer to track down when the bug was introduced and what might have changed in the code to create it. Finding the bug can be very difficult if a revision control system is not used.<br />

A revision control system, when used properly, will allow a developer to revert their code and create difference reports that can be critical to discovering where the bug may have crept into the system. For this reason and many others, every development team, whether a hundred engineers or just one, should use a revision control system.<br />

V. CODING STYLE GUIDES AND STANDARDS<br />

Two useful techniques that developers can use to help minimize the opportunity for bugs to get into their software are to use a good coding style guide and industry coding standards.<br />

A. Coding Style Guides<br />

A coding style guide is nothing more than a guide that specifies how the software will be organized and how it should look. The style guide specifies things such as:<br />

• Naming conventions<br />
• White space and tab spacing<br />
• Documentation blocks<br />
• Where brackets go (new line or inline)<br />
• Etc.<br />

The reason that a coding style guide should be used is that it<br />

helps provide a uniform look and feel to the software even if<br />

multiple developers with different preferences are working on<br />

the code base.<br />

A uniform look and feel can remove distractions during a code<br />

review from minor nuance differences in the coding style and<br />

allow a developer to focus in on the code and finding potential<br />

implementation flaws.<br />

Every developer has their own preferences, so it’s a good idea to create a style guide for your own team’s work. Jack Ganssle has put together a useful template that can be leveraged and modified for any particular team’s purpose [2].<br />

In addition to using a style guide, it can also be helpful to put<br />

together template header and source files that match your style<br />

guide. My preference is to comment my code so that<br />

documentation can be generated by Doxygen. For this reason,<br />

I’ve created several Doxygen templates [3] that already meet my<br />

style guide and can be copied and pasted into any new module<br />

that is being created.<br />

B. Coding Standards<br />

A coding standard is a set of industry best practices that a<br />

developer can follow that removes the opportunity for error and<br />

confusion. The two best and perhaps well-known examples are<br />

MISRA-C and CERT-C which are internally renown and proven<br />

coding standards.<br />



MISRA-C provides developers with a set of mandatory and<br />

advisory rules for which C constructs are safe to use in safety<br />

critical applications. These recommendations eliminate potential<br />

issues that are associated with the C Standard and provide<br />

developers with a C subset that they can use in their application.<br />

CERT-C provides developers with a secure coding standard<br />

that is designed to help developers create secure software.<br />

Following this standard obviously also helps improve software robustness: it not only minimizes the opportunity for someone to successfully hack the application but also decreases the opportunity for bugs to exist in the software.<br />

Both of these standards are designed to help developers prevent bugs from ever entering their application code. To properly follow these standards, developers can use code analysis techniques, which help prevent bugs as well.<br />

VI. CODE ANALYSIS TO PREVENT BUGS<br />

There are several different code analyses that can be performed in order to prevent bugs from getting into the software. The three analysis types that I have found to be the most useful are:<br />

• Static code analysis<br />
• Dynamic code analysis<br />
• McCabe Cyclomatic Complexity<br />

Before committing any code to a repository, it is useful to first perform each analysis on the code base and resolve any issues that might be found. Let’s briefly look at each analysis.<br />

A. Static Code Analysis<br />

Static code analysis scans a developer’s software while the code is still in source form and is not yet executing on the target platform. Static analysis provides developers with an automated tool that goes beyond the checks performed by the compiler, such as precision tracking, initialization checking, value tracking, strong type checking and macro analysis [5], and can detect potential bugs in an application.<br />

Static analysis can be used not just to detect potential issues in the way that C/C++ was written but also to check whether the code meets the rules of coding standards such as MISRA-C or a team’s style guide.<br />

B. Dynamic Code Analysis<br />

Dynamic code analysis is performed on the software while it is executing on the embedded target. Dynamic code analysis can provide developers with useful information such as:<br />

• Stack usage<br />
• Heap usage<br />
• Execution timing<br />
• System inputs and outputs<br />

C. McCabe Cyclomatic Complexity<br />

The typical human mind can only simultaneously keep track of between 7 and 10 pieces of information before it starts to lose track of the big picture. When developing a software function, these numbers don’t change. It turns out that a developer creating a function can only keep track of 7 to 10 paths through the function; beyond that, the risk increases that bugs will exist in the code or be introduced during maintenance. In computer science, the number of paths through a particular function is considered a measurement of the function’s complexity and has a special name: the McCabe Cyclomatic Complexity measurement [4].<br />

Cyclomatic complexity quantitatively measures the number of linearly independent paths through a software function. The greater the function complexity, the greater the risk that bugs will exist in that code. Figure 3 shows how the function complexity measurement relates to the risk that a bug exists in the code.<br />

Complexity versus Reliability Risk<br />
Complexity | Reliability Risk<br />
1 – 10 | A simple function, little risk<br />
11 – 20 | More complex, moderate risk<br />
21 – 50 | Complex, high risk<br />
51+ | Untestable, high risk<br />

Fig. 3. The greater the complexity, the greater the risk that bugs will be present in the function.<br />

As the reader can see, as a function’s complexity rises above 10, the risk starts to increase dramatically. For this reason, a developer can analyze their software functions for cyclomatic complexity, and any functions that have a value greater than 10 can be reworked and simplified.<br />

Monitoring the function complexity along with performing<br />

static and dynamic analysis can help detect and prevent bugs.<br />

VII. RTOS AWARE DEBUGGING<br />

Many modern embedded systems now employ a real-time operating system (RTOS) to schedule tasks and manage the complex timing requirements and microcontroller resources. Introducing an RTOS into an embedded system has many advantages, but it can also introduce potential issues related to:<br />

• Memory management<br />

• Stack utilization<br />

• System timing<br />

• etc<br />

Developers will often need some way to know:<br />

• how much stack space is being utilized by a task<br />

• the minimum, maximum and average period at which a task executes<br />

• the minimum, maximum and average task response<br />

times<br />

• task, semaphore, queue and other RTOS resource states<br />

and availability<br />



Developers can use RTOS Aware debugging to monitor these<br />

critical features within their application which can help them<br />

answer important design questions, detect stack overflows and<br />

many other potential bugs. These debugging techniques are<br />

often instrumented within the RTOS and it is the responsibility<br />

of the IDE toolchain provider to make these RTOS details<br />

readily available to the developer.<br />

A simple example of RTOS Aware Debugging can be seen in Figure 4. In this example, the stack of a Blinky task (or thread) is being monitored while executing worst-case system test cases in order to determine the worst-case stack usage. As you can see, the stack size is 1024 bytes, but the maximum stack used was 232 bytes. In this example, we have dramatically oversized the stack, but this example could easily have shown that the stack had overflowed.<br />

Fig. 4. Using RTOS Aware debugging to monitor the stack usage in e2 Studio<br />

running the Renesas Synergy SSP.<br />

Another useful example is to use RTOS Aware debugging to monitor the period, frequency and response times of the tasks in a system. Figure 5 shows an example using SEGGER SystemView, where we can see how many times individual tasks executed along with other useful information such as task frequency, execution time and other data.<br />

Fig. 5. Using RTOS Aware debugging to monitor task execution.<br />

RTOS Aware debugging provides insights into an embedded<br />

system that not only helps a developer debug the system but also<br />

gain insights into its behavior, execution and response times.<br />

With this kind of information, debugging a system can be<br />

extremely fast and efficient.<br />

VIII. A SIMPLE BUG PREVENTION AND DETECTION PROCESS<br />

The best approach any developer can take to detecting bugs<br />

is to prevent them from ever entering their system. While we<br />

have discussed several ways developers can do this, completely<br />

preventing bugs from entering the system is unrealistic. We are, after all, human, and unexpected interactions and behaviors are bound to spring up in our code. Therefore, the best way to really<br />

prevent and detect bugs is to develop a robust process for<br />

detecting them the moment that they appear in the system. Bug<br />

detection requires that a very simple process be followed when<br />

developers start writing their software.<br />

Over the years I have put together a checklist for the bug<br />

prevention and detection process that developers can follow at<br />

the beginning of the software implementation phase. The goal is<br />

to setup all the tools necessary to immediately detect a bug and<br />

monitor the system performance and behavior so that with every<br />

version commit, there is a baseline for how the system is<br />

expected to behave and perform. By doing so, if a suspected bug<br />

is detected, developers can go back through their revision<br />

control system, examine the baseline data and track down where<br />

the bug came from and hopefully arrive at a quick solution to<br />

remove it.<br />

A. Phase 1 – Project Setup<br />

There are several steps that should be done before an IDE is<br />

ever opened that can help prevent software bugs. These include:<br />

• Setup revision control system<br />

• Creating the empty project<br />

• Creating the project directory structure<br />

• Setting the white tab spacing<br />

These steps start to lay the baseline for, first, being able to revert code and perform a forensic bug analysis, and second, avoiding the issue of inconsistent tab spacing, which can be distracting and an eyesore.<br />

B. Phase 2 – Documentation Facility Configuration<br />

The configuration phase is used to put in place the tools<br />

necessary to ensure good documentation. In this phase I do<br />

several things such as:<br />

• Add Doxygen code templates (examples can be found in [3])<br />

• Configure Doxygen wizard<br />

• Import skeleton HALs and APIs<br />

• Create a version log<br />

• Create a hardware configuration module<br />

C. Phase 3 – Code Analysis<br />

There are many potential causes for bugs and using<br />

automated tools that can highlight potential issues in the code<br />

can dramatically decrease time spent debugging by detecting<br />

these issues in an automated fashion. For this reason, there are<br />

several steps to take in the code analysis setup phase such as:<br />

• Setup static code analysis tool<br />

• Setup software metrics analyzer tool<br />

• Setup dynamic code analysis tool (if one is available)<br />

Running these tools on every build will help to identify potential<br />

bugs in the code along with areas that are highly complex and<br />

could pose future risk for bug injection during maintenance.<br />

D. Phase 4 – Scheduler Setup<br />

At this stage, the toolchain is configured, and we are ready<br />

to start bringing up a board to begin software development. The<br />

first thing that is usually done at this stage is to get the hardware<br />

doing something. In this phase, the following items should be<br />

completed:<br />



• Setup an RTOS or bare-metal scheduler<br />
o Will need a timer or system tick<br />
• Setup a single task to blink an LED (the electrical engineer’s “Hello World” program)<br />

E. Phase 5 – Setup RTOS Aware Debugging (Optional)<br />

If the application being developed uses an RTOS, developers should utilize RTOS Aware debugging techniques.<br />

These techniques are often implemented directly into the IDE<br />

and can help developers monitor task stack usage, semaphores,<br />

message queues and other RTOS-related objects. RTOS Aware debugging elements are often implemented in the IDE, and the following should be set up at this stage:<br />

• Setup and become familiar with the IDE RTOS Aware<br />

debugging capabilities<br />

• Setup task stack monitoring and recording<br />

F. Phase 6 – Setup Debug Messages and Tracing<br />

I believe this is perhaps the most critical phase in the entire<br />

process. Up to this point we have been doing everything that we<br />

can to prevent bugs from entering the system. Now, we setup the<br />

tools to catch bugs when they show themselves in our<br />

application. At this point, a developer should setup the<br />

following:<br />

• Setup the trace channel<br />
o Serial<br />
o TCP/IP<br />
o RTT<br />
• Setup trace tool(s)<br />
o SystemView (SEGGER)<br />
o Percepio Tracealyzer<br />
• Setup printf<br />
o UART driver implementation<br />
o UART mapped to printf<br />
• Configure assert<br />
o assert function implemented<br />
• Configure real-time data graphing<br />
• Setup watch points on critical variables<br />

These tools will allow a developer to output debug messages, halt the moment a bug is detected and monitor their application’s performance, all before any production or prototype code is written.<br />

G. Phase 7 – Record a Baseline<br />

In order to understand how the system behavior and<br />

execution change over time, it’s important that an initial baseline<br />

trace be taken. These baselines should be taken periodically and<br />

can be referenced when the system starts to misbehave to<br />

identify potential causes. This helps to provide an application<br />

footprint and prevents developers from scratching their heads<br />

and wondering when a specific behavior was introduced in the<br />

code. They can instead simply refer to their baseline traces and<br />

observe the change.<br />

At this point, a developer should:<br />

• Perform a baseline trace<br />

• Perform a statistical analysis on the function execution<br />

H. Phase 8 – Software Implementation<br />

At this point, all the necessary tools to prevent and detect<br />

bugs are setup and ready to be put into action. The developer can<br />

now start to implement their software. Even during the<br />

development phase, there are several things that a developer<br />

should be doing in order to minimize bugs and quickly catch the<br />

ones that do make it into the system:<br />

• Schedule regular code reviews<br />

• Run analysis tools on every version before committing<br />

• Monitor their system trace and debug messages<br />

• Perform a baseline trace with every new version<br />

IX. CONCLUSIONS<br />

The modern developer has a wide range of techniques<br />

available to them to prevent and detect bugs that will help<br />

minimize the time spent debugging an embedded system. The<br />

problem faced by many developers is that in the fast-paced push<br />

to get products to market, they may feel they don’t have the time<br />

to follow a disciplined approach or learn more modern<br />

techniques. With the average developer spending 2.4 – 4.8<br />

months debugging their system, there is plenty of time to employ<br />

the processes and tools discussed in this paper which have the<br />

potential to decrease the debugging time by as much as, if not more than, 50%.<br />

REFERENCES<br />

[1] Aspencore, “2017 Embedded Market Survey”, 2017.<br />

https://www.embedded.com/electronics-blogs/embedded-marketsurveys/4458724/2017-Embedded-Market-Survey<br />

[2] www.ganssle.com/misc/fsm.doc<br />

[3] J. Beningo, “Doxygen C Templates” https://www.beningo.com/162-<br />

code-templates/<br />

[4] McCabe, Thomas Jr. Software Quality Metrics to Identify Risk.<br />

Presentation to the Department of Homeland Security Software<br />

Assurance Working Group, 2008.<br />

(http://www.mccabe.com/ppt/SoftwareQualityMetricsToIdentifyRisk.pp<br />

t#36) and Laird, Linda and Brennan, M. Carol. Software Measurement<br />

and Estimation: A Practical Approach. Los Alamitos, CA: IEEE<br />

Computer Society, 2006.<br />

[5] http://www.gimpel.com/html/lintfaq.htm<br />

[6] J. Beningo, “Embedded Software Start-up Checklist”,<br />

https://www.beningo.com/tools-embedded-software-start-up-checklist/.<br />

Feb 2016<br />



On-Chip Debug and Test Infrastructures<br />

of Embedded Systems from the Users Perspective<br />

Jens Braunes<br />

PLS Programmierbare Logik & Systeme GmbH<br />

Lauta, Germany<br />

Jens.Braunes@pls-mc.com<br />

Abstract—In the world of embedded systems developers are<br />

facing widely different challenges when it comes to debugging<br />

and test of software. On one hand, there is a need for costefficient<br />

standard components fulfilling the requirement of<br />

minimal integration effort. On the other hand, extremely<br />

powerful multicore systems, used in automotive and industrial<br />

applications, where the requirements on debugging and system<br />

observability are much higher, are present. Of course, that<br />

affects the available interfaces, and not least, the on-chip debug<br />

and trace functions.<br />

The paper will give a brief overview of all aspects of debug<br />

support that is implemented on hardware. This includes<br />

interfaces, on-chip debug support and trace solutions. The focus<br />

of the paper is on widely used solutions and implementations<br />

which are standardized or have become quasi industrial<br />

standards.<br />

Keywords—Debugging; debug interfaces; on-chip debug; on-chip trace; JTAG; CoreSight; Nexus; multicore<br />

I. INTRODUCTION<br />

With ever more complex embedded applications, error diagnostics and test get more and more expensive. The efficiency of the development and test process for the software of today’s multicore systems depends significantly on the internal debug infrastructure of the particular chip.<br />

As we learned from the past, for low-cost standard systems the semiconductor industry does not invest very much in the debug infrastructure. For the automotive and industrial areas, however, it is entirely different. Because of the rise of more and more powerful multicore systems, the software development process makes higher and higher demands on debugging and observability. Interfaces for accessing the system as well as on-chip debug and trace functions have to address this concern.<br />

The paper will give a brief overview of present common debug interfaces of embedded systems as well as of the on-chip debug hardware providing essential functions for system observation and multicore debugging. Finally, the paper will give an overview of available trace implementations for trace-based debugging, non-invasive system observation and trace-based analysis of the system’s run-time behavior.<br />

II. DEBUG INTERFACES<br />

Debugging and test with real hardware depend crucially upon efficient communication with the target and the possibility to observe the system state from the outside. In the simplest case, the software itself reports its state and important values. This is known as ‘printf debugging’ and typically uses a console for text messages or a serial interface. Of course, ‘printf debugging’ requires additional code in the application and has a great impact on the run-time behavior. It is completely unsuitable for debugging multicore or real-time critical applications.<br />

Dedicated debug interfaces allow far more efficient and convenient access to the system, but require some effort and incur extra costs. Across all microcontroller architectures and semiconductor vendors, the IEEE 1149.1 JTAG (Joint Test Action Group) interface is still the most common. Originally developed for testing integrated circuits, it plays the role of a quasi-standard for debug access. However, a JTAG implementation is relatively expensive in terms of required pins: at least five (TDI, TDO, TCK, TMS, TRST) are needed. In some cases additional pins are required, for instance for target reset, reference voltage and vendor-specific signals of the debug system. That makes JTAG unattractive for cost-sensitive and small devices. Furthermore, regarding speed and robustness against disturbances, JTAG is no longer state-of-the-art.<br />

In recent years, a number of different, often vendor-specific, alternatives to JTAG have emerged. Due to the market dominance of some microcontroller architectures, some quasi industry standards have become apparent. First of all, ARM’s SWD (Serial Wire Debug) has to be mentioned, which is part of the ARM CoreSight Debug and Trace IP (intellectual property) [1]. It needs only two pins, one for bidirectional data transfer and one for the clock. Another advantage over JTAG is the roughly two times higher transfer speed: with a 50 MHz clock, up to 4 MB/s can be realized. SWD uses a packet-oriented protocol that allows simple error detection and increases robustness against disturbances. The new SWD protocol version 2, introduced with CoreSight SoC-600 [2], specifies a so-called multi-drop architecture which allows addressing multiple processors in a multi-processor system through one single debug interface.<br />
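As a rough sanity check, the usable bandwidth of such a serial debug link can be sketched as raw bit rate times protocol efficiency. The 65% efficiency figure below is an illustrative assumption, not a documented SWD parameter; with it, a 50 MHz clock lands near the 4 MB/s quoted above:<br />

```c
#include <stdint.h>

/* Back-of-the-envelope throughput of a serial debug link: raw bit rate
 * scaled by an assumed protocol efficiency (header, ACK and turnaround
 * overhead).  The efficiency value is illustrative only. */
static uint32_t link_throughput_bytes_per_s(uint32_t clock_hz,
                                            uint32_t efficiency_percent)
{
    /* 64-bit intermediate avoids overflow for clock rates above ~85 MHz */
    return (uint32_t)(((uint64_t)clock_hz * efficiency_percent) / 100u / 8u);
}
```

The same estimate applied to the other interfaces below (DAP at 160 MHz, LPD at 10 MHz) gives the right order of magnitude for the quoted block transfer rates.<br />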

www.embedded-world.eu<br />

874


The Device Access Port (DAP) from Infineon is another vendor-specific debug interface. DAP is used exclusively in Infineon products such as the AURIX multicore microcontroller family. Like SWD, DAP manages the target communication with only two pins (bidirectional data transfer, clock). For error detection, the transmitted data are protected by a CRC code. With a maximum clock rate of 160 MHz, DAP is one of the fastest debug interfaces at the moment and achieves up to 15 MB/s for block data transfers. Furthermore, DAP can be used in two additional modes: wide mode and SPD (single-pin DAP). In wide mode, an additional pin increases the data rate to up to 30 MB/s for block reads or writes. In contrast, SPD reduces the pin count to a single pin. In SPD mode, each data bit is encoded by the distance between SPD signal edges. This way, no separate clock signal has to be transmitted for decoding the DAP data. With SPD, the achieved speed is not very high and is comparable to a 1.3 MHz regular DAP connection, but SPD is suitable for transmitting debug signals over CAN.<br />
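The edge-distance idea can be illustrated with a small decoder sketch. The 16-tick threshold and the LSB-first bit order are assumptions made for this example; Infineon’s actual SPD timing parameters differ:<br />

```c
#include <stdint.h>
#include <stddef.h>

/* Illustrative single-pin (SPD-style) decoder: each bit is encoded by the
 * time between two consecutive edges on the single data line.  A short
 * gap represents a 0, a long gap a 1, so the receiver recovers the bits
 * without a separate clock signal.  The 16-tick threshold is made up for
 * this sketch and does not reflect Infineon's real timing. */
#define SPD_THRESHOLD_TICKS 16u

static uint8_t spd_decode_bit(uint32_t ticks_between_edges)
{
    return (ticks_between_edges >= SPD_THRESHOLD_TICKS) ? 1u : 0u;
}

/* Decode eight edge distances into a byte, LSB first (assumed order). */
static uint8_t spd_decode_byte(const uint32_t gaps[8])
{
    uint8_t value = 0;
    for (size_t i = 0; i < 8; i++)
        value |= (uint8_t)(spd_decode_bit(gaps[i]) << i);
    return value;
}
```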

A real successor to JTAG, officially standardized as IEEE 1149.7, is cJTAG, implementations of which can be found in some devices from NXP. cJTAG is backward compatible with JTAG but needs only two pins. Some extensions of the protocol address multicore and multi-processor systems, e.g. support for multiple test access ports (TAPs), as well as systems with power management.<br />

Renesas also provides a proprietary debug interface for its microcontrollers, called low-pin-count debug (LPD). LPD can be used with different pin counts – two at minimum – which can be configured by the user. The comparatively efficient protocol allows data transfers of up to 1 MB/s using a 10 MHz clock.<br />

III. DEBUGGING OVER FUNCTIONAL INTERFACES<br />

For several years, there have been attempts to avoid dedicated debug interfaces completely in order to save costs. Instead, functional interfaces like CAN, Ethernet or USB are to be used for debugging. Without the need for dedicated debug interfaces, this can be much cheaper, because functional interfaces are often already implemented on the devices and can possibly be instantiated a second time for debugging purposes.<br />

One example is DXCPL (DAP over CAN Physical Layer). DXCPL uses the physical layer of CAN for transferring the debug signals of Infineon’s single-pin DAP. Because of the limited bandwidth of CAN, the achievable transfer speed is only 10 to 40 KB/s. Hence, DXCPL is used primarily for in-field debugging, where the actual debug interface (DAP, etc.) is no longer physically accessible, e.g. because of the housing of an ECU (electronic control unit).<br />

Another interesting alternative is the reuse of standardized interfaces used by calibration tools. A working group of the ASAM (Association for Standardization of Automation and Measuring Systems), for example, is currently pursuing the goal of utilizing the XCP protocol [3], which until now has been used exclusively for the calibration of ECUs, for debugging purposes as well. In the future, debugging of ECU software will be possible under real or even extreme conditions.<br />

ARM goes considerably further with its SoC-600 specification [2]. SoC-600 offers a library of IP blocks that allows using almost any functional interface for debugging, including USB, CAN bus, Ethernet or WiFi.<br />

In general, using functional interfaces for debugging has advantages but also some disadvantages. A clear advantage is that debugging remains possible even if dedicated debug interfaces are no longer accessible, e.g. later in the field. It may also save costs, because expensive, specialized hardware solutions on the target as well as on the tool side can be substituted by cost-efficient standard components and IP blocks which are often already implemented. On the other hand, as a disadvantage, the interfaces may no longer be available to the actual application, and the software itself has to make sure that the interface is initialized properly and the debug channel is opened.<br />

IV. ON-CHIP DEBUG SYSTEMS<br />

From the user’s perspective, the debug system provided by the chip is much more important than the debug interfaces, because the offered functions determine how deeply the system can be observed and how users can control the application from outside. In the end, the debug tool running on the PC relies on them, and the functionality it provides depends strongly on the available on-chip debug functions.<br />

The debug infrastructure, also known as the on-chip debug system, has primarily two tasks:<br />
1. Provide target information, e.g. memory and register contents. In addition, users should be able to modify them.<br />
2. Control the program execution on the target. That includes:<br />
− breaking the running application, triggered both by the debugger and by breakpoints,<br />
− starting the halted application,<br />
− single stepping.<br />

As with the debug interfaces, on-chip debug systems are dominated by vendor-specific and architecture-specific solutions. In the end, the debug tool has to hide the differences and should provide a common user interface.<br />

The only real standard among the frequently implemented on-chip debug solutions is Nexus (IEEE-ISTO 5001) [4]. The Nexus standard defines four compliance classes, where each higher class builds on the lower one. An excerpt of the compliance classes and the associated debug functions can be found in Table I. A chip vendor has to implement at least all required functions of a class to conform to that class. In order to fulfil the two basic tasks mentioned above, a realization of all Nexus class 1 functions is sufficient. On the market, Nexus-compliant implementations can be found for Power Architecture based SoCs from NXP and STMicroelectronics, but also in some RH850 devices from Renesas.<br />



TABLE I. EXCERPT FROM NEXUS COMPLIANCE CLASSES.<br />
(◼ = required; columns: Class 1 | Class 2 | Class 3 | Class 4)<br />
STATIC DEVELOPMENT FEATURES a<br />
Read or write user registers and memory in debug mode: ◼ ◼ ◼ ◼<br />
Single-step instruction in user mode and re-enter debug mode: ◼ ◼ ◼ ◼<br />
Enter / exit a debug mode from / to user mode: ◼ ◼ ◼ ◼<br />
Stop program execution on instruction/data breakpoint and enter debug mode (minimum 2 breakpoints): ◼ ◼ ◼ ◼<br />
Ability to set breakpoint or watchpoint: ◼ ◼ ◼ ◼<br />
DYNAMIC DEVELOPMENT FEATURES b<br />
Read or write memory locations while program runs in real time: – – ◼ ◼<br />
a. Development features available on halted target<br />
b. Development features available on running target too<br />
<br />
With the already mentioned CoreSight, ARM offers a complete set of debug IP including breakpoints, watchpoints (data breakpoints), reading and writing of memory at run-time as well as trace and cross-trigger functionality (Fig. 1). Due to the basic CoreSight concept of defining a set of configurable IP blocks, silicon vendors and IP licensees can decide which debug functions will actually be implemented. They often take customer requests into account, but in the end cost usually plays the decisive role.<br />
<br />
A completely proprietary solution comes from Infineon. The OCDS (On-Chip Debug Solution) is a hardware block used exclusively in their own architectures (TriCore, C16x and successors). Of course, OCDS supports breakpoints, data breakpoints (at least for data addresses) and provides access to memory and register contents at run-time. With the AURIX family – the latest TriCore based multicore devices – the cross-triggering was completely reworked and is now suitable for processors with a large number of cores. Fig. 2 shows the so-called OCDS cross trigger switch, which also allows sending or receiving signals to or from external pins.<br />

All on-chip debug systems provide hardware breakpoints. These are in fact based on hardware comparators for code addresses, sometimes also for data addresses. A comparator hit triggers a configured debug action, which can be, for example, issuing a HALT signal. The debug tool hides this hardware realization and offers a breakpoint to the user. However, hardware breakpoints are a limited resource: for typical microcontrollers and embedded multicore processors, only two to eight hardware breakpoints are available per core. Once all hardware breakpoints are in use, the debug tool has to fall back to software breakpoints. Software breakpoints are generally based on a code patch. The debug tool replaces the original instruction at the desired breakpoint location by a special breakpoint instruction, which is provided by selected processor architectures, or sometimes by an illegal instruction. This patched instruction causes a trap when it is executed, which is caught by the debug tool. The debug tool revokes the code patch and presents the halted application to the user. Software breakpoints are only applicable if the code to be patched is executed from RAM. For applications located in flash, software breakpoints are much more complicated to realize, and users have to make do with the available hardware breakpoints.<br />
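The patch/restore mechanism can be sketched as follows. The buffer stands in for code in target RAM, and the ARM Thumb BKPT opcode (0xBE00) serves as the breakpoint instruction; a real debug tool would perform these writes through the debug interface rather than directly:<br />

```c
#include <stdint.h>

/* Sketch of a software breakpoint: the original instruction at the
 * target address is saved and replaced by a dedicated breakpoint
 * instruction (on ARM Thumb, BKPT #0 encodes as 0xBE00).  Executing
 * the patched location traps into the debugger, which then restores
 * the original instruction before resuming. */
#define THUMB_BKPT 0xBE00u

typedef struct {
    uint16_t *addr;   /* patched location in target RAM  */
    uint16_t  saved;  /* original instruction to restore */
} sw_breakpoint_t;

static void bp_set(sw_breakpoint_t *bp, uint16_t *addr)
{
    bp->addr  = addr;
    bp->saved = *addr;              /* remember original instruction */
    *addr     = (uint16_t)THUMB_BKPT; /* patch in the trap */
}

static void bp_clear(const sw_breakpoint_t *bp)
{
    *bp->addr = bp->saved;          /* revoke the code patch */
}
```

This also makes the flash limitation obvious: the patch is a plain memory write, which only works when the code resides in RAM.<br />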

V. MULTICORE RUN-CONTROL<br />
The emergence of multicore microcontrollers and SoCs is of course accompanied by the need for enhanced debug functionality, especially for run-control: breaking, stepping and starting. Depending on the application and the debug scenario, the cores have to be synchronized, e.g. to halt them at the same time at a breakpoint. Because of the significant differences between the cores’ clock frequencies and the clock of the debug interface, an external synchronization by the debug probe or the debug tool itself is not practical; the latencies to trigger a halt signal for all cores would be too high. This approach would lead to a completely inconsistent view of the halted system. In fact, a cross-trigger mechanism is required to do the signaling for run-control directly on-chip. As already mentioned in section IV, some of the on-chip debug solutions already provide such cross-triggering functionality.<br />

However, different clock domains, signal delays as well as different pipeline depths cause latencies when entering the debug mode after a break or when leaving the debug mode once the system is started again. Simultaneous breaking, stepping or starting of multiple cores is therefore only quasi-synchronous. But with a slippage of only a few cycles or executed instructions, it is low enough to be ignored in most use cases.<br />
<br />
Fig. 1. System overview of ARM CoreSight Debug and Trace IP, including components for target access, run-control, cross-triggering and trace.<br />
<br />
Fig. 2. The Infineon OCDS trigger switch allows a quasi-synchronous break of multiple cores. The signal distribution via different trigger lines is configurable.<br />

At this point, it must be noted that cross-triggering for synchronized run-control and for trace may compete, especially for CoreSight and for particular Nexus implementations. CoreSight, for example, uses the same cross-trigger matrix for distributing HALT signals to the cores as for triggering the trace capture. As a consequence, the user must expect that simultaneous trace recording and debugging with breakpoints is only possible to a limited extent.<br />

VI. TRACE-BASED SYSTEM OBSERVATION<br />

Besides traditional debug support (stop/go, memory read/write), today’s microcontrollers often provide on-chip trace for observing the system behavior at run-time and for exact measurements. Trace-based debugging and measurement has a great advantage over traditional debugging: the observation is non-intrusive, i.e. on-chip trace does not influence the run-time behavior of the system. However, a significant additional expense arises on the silicon as well as on the tool side. Trace units are quite expensive because of their required chip area. They need to be directly connected to the cores and buses and have to transmit the captured trace data across the chip and over an appropriate interface to the debug tool. Therefore, chip vendors try to find a trade-off for their implementations, which has direct consequences for the users:<br />

1. The observation of chip-internal activities is limited. For CoreSight implementations, for example, only program trace is available in many cases; data trace is omitted for cost efficiency. Or the number of simultaneously observable cores is limited, as is the case for Infineon’s MCDS (Multicore Debug Solution).<br />
2. Either the captured data is buffered in an on-chip trace memory before it is transferred off-chip to the debug tool, or the captured data is directly transferred and stored in the debug probe. The former needs only a debug interface to transmit the data, which makes the approach quite cost-efficient in terms of pin count, but requires much more chip area for implementing the on-chip trace memory. The latter requires a specific and much more powerful trace interface, which goes along with higher pin counts, chip area and costs.<br />

Especially the second point needs to be detailed a bit more from the user’s point of view. As described, trace data can either be buffered in on-chip trace memory or be directly transferred off-chip via a high-bandwidth trace interface and stored by the debug tool. In the first case, the recording time is strictly limited by the capacity of the on-chip trace memory – depending on the device, typically a few KB up to 2 MB are available. The resulting recording time until the trace memory is filled is in the range of a few milliseconds. An exact number can hardly be stated; it is strongly influenced by the executed code and the compression algorithms used by the trace hardware. While this is suitable for trace-based debugging – for example for tracing back the history of an exception – it is seldom applicable for trace-based analysis like code coverage or profiling. One exception is a special trace mode available in the latest Infineon MCDS implementations, called Compact Function Trace (CFT). CFT records only function entries and exits, which saves a lot of trace memory and is sufficient for trace-based profiling and call graph analysis even with a limited amount of trace memory.<br />
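The order of magnitude is easy to verify: the recording window is simply buffer capacity divided by the compressed trace data rate. The 500 MB/s rate below is purely illustrative; real rates depend heavily on the executed code and the trace compression:<br />

```c
#include <stdint.h>

/* Rough fill-time estimate for an on-chip trace buffer in microseconds.
 * The trace data rate is an assumed, code-dependent figure; the point is
 * only that even a 2 MB buffer fills within a few milliseconds. */
static uint32_t trace_window_us(uint32_t buffer_bytes,
                                uint32_t trace_rate_bytes_per_s)
{
    /* 64-bit intermediate avoids overflow of bytes * 1000000 */
    return (uint32_t)(((uint64_t)buffer_bytes * 1000000u) /
                      trace_rate_bytes_per_s);
}
```

At an assumed 500 MB/s a 2 MB buffer yields a window of roughly 4 ms, and a 16 KB buffer only a few tens of microseconds, which is why longer analyses need the off-chip path described next.<br />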

For longer trace recordings, where a large amount of data arises, the captured data needs to be transferred via an appropriate interface to the debug tool. For several years, parallel interfaces were used for that purpose (e.g. ARM CoreSight, Nexus). However, at a reasonable expense in terms of pin count, the achievable bandwidth is limited to approximately 250 MB/s at maximum. For today’s multicore systems, which also run at high clock rates, this bandwidth is often no longer sufficient. For this reason, serial high-speed interfaces are more commonly used nowadays. Current implementations of serial trace interfaces achieve, with only four lines – two differential lines per data lane and two differential clock lines – somewhat higher transfer speeds than parallel interfaces. It can be anticipated that serial trace interfaces of some future devices will transmit data at several GB/s.<br />

At the moment, serial trace interfaces relying on the Xilinx AURORA protocol are implemented by Infineon for the AURIX device family as well as by NXP and STMicroelectronics for the latest Power Architecture devices (MPC/SPC57xx, SPC58x). With HSSTP (High Speed Serial Trace Port), ARM has defined a serial high-speed interface which is also based on the AURORA protocol [5]. First devices supporting HSSTP can be expected within the next two years.<br />

Indeed, serial high-speed trace interfaces are much more elaborate regarding their hardware realization on both the silicon and the tool side (Fig. 3 shows a typical setup). But to achieve the required bandwidth for ever more powerful microcontrollers with more cores, more complex interconnects and increasing clock frequencies, serial trace interfaces represent a good trade-off between pin count and on-chip logic. The main reason is that, with ever higher integration density, the costs of transistors and thus of the required logic are much lower than those of the pins that would be required to increase the bandwidth of parallel trace interfaces in the same way. Especially in the area of deeply embedded systems, with high demands on functional safety and hard real-time requirements, forgoing high-bandwidth trace interfaces altogether is not an option. For non-intrusive debugging, measurement and in particular system analysis, a comprehensive trace is essential.<br />
<br />
Fig. 3. Example of a setup for serial high-speed trace interfaces (Universal Access Device 3+ from PLS with an AURORA trace POD and an Infineon AURIX device)<br />

VII. CONCLUSION<br />

In view of the current market for microcontrollers and embedded SoCs, the provided debug infrastructure is mostly vendor-specific. Apart from JTAG, truly vendor-independent standards are seldom implemented. As a consequence, the decision for a specific microcontroller or platform architecture is in most cases also a decision for a specific debug infrastructure. System designers and integrators have to consider this when they plan the next generation of their products or completely new ones. The decision concerns not only the debug interfaces, on-chip debug systems and trace; it also influences the ecosystem, especially the debug and system analysis tools.<br />

REFERENCES<br />
[1] ARM Limited, “CoreSight Debug Trace”, https://www.arm.com/products/system-ip/coresight-debug-trace<br />
[2] ARM Limited, “ARM CoreSight SoC-600 Technical Reference Manual”<br />
[3] R. König, “ASAM AE MCD-1 XCP SW-Debug V1.0”, Release Presentation, https://www.asam.net/standards/detail/mcd-1-xcp/<br />
[4] Nexus 5001 Forum, “IEEE-ISTO 5001-2012, The Nexus 5001 Forum Standard for a Global Embedded Processor Debug Interface”, http://nexus5001.org/nexus-5001-forum-standard/<br />
[5] Xilinx, Inc., “Aurora 8B/10B Protocol Specification”, https://www.xilinx.com<br />



Debugging Live Cortex®-M based Embedded Systems<br />

Jean J. Labrosse<br />

Micriµm Software, part of the Silicon Labs Portfolio<br />

Weston, FL, USA<br />

Jean.Labrosse@Micrium.com<br />

Abstract— Debugging embedded systems has always been challenging. Now, however, MCUs based on the ARM Cortex-M architecture have a secret weapon: the CoreSight™ Debug and Trace port. CoreSight is a block of IP that resides alongside all Cortex-M CPUs and offers varied capabilities based on the actual Cortex-M core found on the MCU you are using. CoreSight has many features, including the ability to start/stop a target. It contains a breakpoint unit, includes a data watchpoint, allows printf()-like output, has an optional instruction trace capability, and enables developers to read and write memory locations (including I/Os, since those are memory-mapped) without interfering with the CPU.<br />

In the past, this last feature has been underused by tool vendors, yet it offers unprecedented<br />

insight into a running embedded system. There are many applications where you simply<br />

cannot stop at a breakpoint and examine variables using the debugger: process control,<br />

engine control, communications protocols and more. Indeed, using printf() statements,<br />

which requires instrumenting your code, is not practical in these situations. Instead, having<br />

a tool allowing values of interest to be read directly from memory and displayed graphically<br />

has much greater value; you can show trends, oscillations and other abnormalities that<br />

would not be immediately apparent with just a numeric representation.<br />

Keywords: RTOS; Data Visualization; Dashboards; Real-Time; Debugging; IoT<br />



I. INTRODUCTION<br />

Most, if not all, embedded systems read data from sensors, process that data, and, most<br />

likely, produce some form of output. The data read or produced by these systems often needs to<br />

be monitored and possibly displayed for human use. In many cases, the data is only available and<br />

meaningful when the system is running. An example of this is the engine/compressor control<br />

system shown in Figure 1. There is a tremendous amount of data being read, computed and output<br />

when the engine is running, such as spark plug firing angles, fuel injector flow rates, cylinder<br />

temperatures, RPM, valve positions, etc. The only way to debug such a system is to look at the<br />

overall system during operation because the dynamics of one subsystem will affect the operation<br />

of another. For example, an increasing engine load must be compensated for with an increase in<br />

fuel.<br />

Fig 1. Industrial Engine / Compressor<br />

You find this situation in many real-world and real-time applications, such as flight control<br />

systems, chemical reactions, food processing plants, printing presses and more.<br />

There are a number of ways developers display the status of their real-time systems.<br />

LEDs<br />

Developers trying to determine whether or not their code is running as expected often turn to LEDs.<br />

These components can be manipulated with relatively little code, and most evaluation boards<br />

include at least one or two of them. LEDs, then, are a low-cost method for gaining visual feedback<br />

from an embedded system.<br />

Although the feedback provided by LEDs can prove useful to developers, a blinking light hardly<br />

constitutes a wealth of information. Using LEDs, developers can see which portions of their<br />

application code are being executed, but other more advanced diagnostics cannot easily be<br />

performed. LEDs are ill-suited for displaying the values of variables, for instance.<br />



printf() statements<br />

printf() is another means of obtaining feedback from embedded systems. With printf(),<br />

developers can display the contents of memory buffers, the values of error codes, the results of<br />

analog-to-digital conversions, and other important information. In order for them to do so, however,<br />

their software must include drivers for printf(), and their development environment must<br />

include some sort of console for viewing printf() output.<br />

The code associated with printf() is generally not trivial. It includes both drivers and the<br />

function itself. In some development environments, the addition of a single printf() call to an<br />

application can bring about an increase in code size of as much as 10 kBytes. An application’s<br />

RAM footprint can also increase substantially as a result of printf().<br />

Larger memory footprints are not the only side effect of printf() usage; application<br />

performance can also be affected. Typically, any drops in performance are only noticeable in<br />

debugging, since completed systems generally do not utilize printf(). Even these changes can<br />

be harmful, though. They can actually create new bugs or mask existing ones. This phenomenon<br />

is sometimes referred to as the Heisenberg effect.<br />

The Heisenberg effect aside, unnecessary printf() calls are pollution, and developers must often<br />

generate an excessive amount of it in order for printf() to provide a comprehensive view of<br />

their systems. Thus, even on high-performance platforms with abundant memory resources, the<br />

use of printf() can be problematic.<br />

Full Graphics Display<br />

Developers dealing with complex systems often find graphical feedback to be more helpful than<br />

text. The graphical LCDs present on some hardware platforms are one means of obtaining such<br />

feedback.<br />

Although the information provided by an LCD can prove highly beneficial, the use of a display for<br />

monitoring an embedded system can cause many of the same problems associated with LEDs and<br />

printf(). For instance, graphical displays, when used for monitoring, are a source of code<br />

pollution. Developers must add code to their application whenever they wish to display new data.<br />

Since LCDs typically serve as user interfaces (not diagnostic tools) in completed systems, this extra<br />

code must eventually be removed.<br />

Graphical displays also necessitate drivers, and these drivers can be highly complex. Accordingly,<br />

displays can also be a source of the Heisenberg effect. Even in systems that will ultimately employ<br />

display drivers as part of a user interface, the utilization of these drivers for debugging can introduce<br />

unforeseen problems.<br />

Debugger Live Watch Feature<br />

Many developers turn to a debugger for feedback from their embedded systems. A variety of<br />

information can be gleaned from a typical debugger, including the values of variables. These values<br />

are usually listed in what is known as a watch window, and some actually provide live watch<br />

capabilities, displaying values while the target is running. Different tools offer slightly different<br />

versions of the watch window, but, in most debuggers, it is little more than a table of variables and<br />

their values.<br />

A key limitation of live watch windows is that they are typically refreshed only once per second.<br />

Live watch windows also only show numerical data, whereas, in some cases, the information being<br />

conveyed would be substantially improved with a graphical representation of the same data.<br />



Unfortunately, a debugger is not a practical tool when it comes to monitoring live data from<br />

deployed applications. What’s needed is a tool that can serve both the embedded developer as well<br />

as field service personnel.<br />

Commercial MMIs<br />

Many commercial Man-Machine Interfaces (MMIs) are available and are often found alongside<br />

Programmable Logic Controllers (PLCs) in factory floor automation. Such tools are rarely used by<br />

embedded systems engineers, but the visual aspect of such tools is exactly what many developers<br />

need. Unfortunately, these types of tools are often ignored either for reasons of cost or lack of<br />

interface mechanisms to the embedded system being developed; MMI software is typically<br />

compatible with PLC communications protocol but little else.<br />

Fig 2. Typical Man-Machine Interface Display Screen<br />



II. THE ARM CORTEX-M CORESIGHT™ DEBUG PORT<br />

ARM Cortex®-M processors are equipped with special and very powerful debug hardware built onto each chip. CoreSight contains features that require stopping the processor or changing its program execution flow; these are considered invasive. Such features can be problematic when monitoring and controlling a live system because, in many cases, we cannot afford to stop the CPU at a breakpoint. CoreSight also provides capabilities that are non-intrusive, which allow us to monitor and control live systems without halting the CPU:<br />
- On-the-fly memory/peripheral access (read and write)<br />
- Instruction trace (requires that the chip also include an Embedded Trace Macrocell, ETM)<br />
- Data trace<br />
- Profiling using profiling counters<br />

Figure 3 shows a simplified block diagram of the relationship between the CoreSight debug port,<br />

the CPU and the Memory/Peripherals.<br />

Fig 3. Relationship between CoreSight, CPU and Memory/Peripherals on a Cortex-M<br />

III. TOOLS FOR TESTING/DEBUGGING LIVE SYSTEMS

Figure 4 shows how CoreSight connects to your development environment.

F4-1 Your development environment typically consists of an Integrated Development Environment (IDE) that includes a code editor, compiler, assembler, linker, debugger and possibly other tools.

F4-2 When you are ready to debug your application, download your code to the target through a debugger interface, such as the Segger J-Link.

F4-3 J-Link connects to the CoreSight debug port and is able to start/stop the CPU, download code, program the onboard flash, and more. J-Link can also read and write directly to memory as needed, while the target is executing code.

F4-4 Micrium's µC/Probe is a stand-alone, vendor-agnostic, Windows-based application that reads the ELF file produced by the toolchain. The ELF file contains the code that was downloaded to the target as well as the names of all globally accessible variables, their data types and their physical locations in the target memory.

F4-5 µC/Probe allows a user to display or change the value (at run time, i.e. live) of virtually any variable or memory location (including I/O ports) on a connected embedded target. The user simply populates µC/Probe's graphical environment with gauges, numeric indicators, tables, graphs, virtual LEDs, bar graphs, sliders, switches, push buttons and other components and associates each of these with variables or memory locations in your embedded device. µC/Probe doesn't require you to instrument the target code in order to display or change variables at run time. By adding virtual sliders or switches to µC/Probe's screens, you can easily change parameters of your running system, such as filter coefficients and PID loop gains, or actuate devices and test I/O ports.

F4-6 µC/Probe sends requests to J-Link to read or write from/to memory.

F4-7 J-Link requests are converted to CoreSight commands, which are fulfilled, and the variable values are displayed graphically on µC/Probe's screens.

F4-8 Another highly useful tool for testing/debugging live embedded systems is Segger's SystemView. This tool typically works in conjunction with an RTOS and displays the execution profile of your tasks and ISRs on a timeline. You can thus view how long each task takes to execute (minimum/average/maximum), when tasks are ready to run, when execution actually starts for each task, when ISRs execute and much more. SystemView can help you uncover bugs that could otherwise go unnoticed, possibly for years. However, SystemView requires that you add code to your target that records RTOS events and ISRs. SystemView also consumes a small amount of RAM to buffer events.

F4-9 J-Link allows multiple processes to access CoreSight concurrently, so you can use all three tools at once.
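As F4-4 and F4-5 note, a tool of this kind only needs the monitored data to live in globally accessible variables, since their names and addresses appear in the ELF symbol table; no target-side instrumentation is required. A minimal sketch of what that looks like on the target (the variable names and the update function here are hypothetical illustrations, not part of any µC/Probe API):

```c
#include <stdint.h>

/* Global (non-static) variables are emitted into the ELF symbol table,
 * so a host tool can resolve their names to addresses and read or write
 * them over the debug port while the CPU keeps running. */
volatile uint32_t g_motor_rpm;      /* hypothetical: shown on a gauge      */
volatile int32_t  g_temperature_c;  /* hypothetical: shown numerically     */
volatile uint32_t g_pid_kp_q16;     /* hypothetical: tuned from a slider   */

/* The application only updates its own state; it contains no
 * tool-specific code. */
void control_step(uint32_t rpm, int32_t temp_c)
{
    g_motor_rpm     = rpm;
    g_temperature_c = temp_c;
    /* g_pid_kp_q16 may be changed asynchronously by the host tool,
     * so the control loop should re-read it each cycle. */
}
```

Because the host side reads memory through CoreSight while the application runs, marking such variables `volatile` also keeps the compiler from caching values the tool may change underneath the application.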

Fig 4. Tools for debugging and testing live systems.

IV. SUMMARY

Embedded systems are often black-box devices with little or no display capability, making it difficult to see what is happening inside. Developers use different techniques to show what's going on inside these devices, but, often, this requires additional hardware and instrumented code.

This paper presented tools available from Micrium (µC/Probe) and Segger (SystemView) that can provide unique insights into your live embedded systems while remaining non-intrusive. Both tools should be used in the early stages of product development, as the feedback they provide can help in better optimizing your design.

V. REFERENCES

[1] Micrium, "µC/Probe, Graphical Live Watch®," https://micrium.com/ucprobe/about/
[2] Segger, "SystemView for µC/OS," https://www.micrium.com/systemview/about/, www.segger.com/systemview.html
[3] Segger, "Debug Probes," https://www.segger.com/jlink-debug-probes.html
[4] Silicon Labs, "Simplicity Studio," http://www.silabs.com/products/mcu/Pages/simplicity-studio.aspx

Time Sensitive Networks for Industry 4.0

Thomas Leyrer
Texas Instruments Incorporated
Freising, Germany
t-leyrer@ti.com

Abstract— The digital revolution in the manufacturing process demands a communication standard which meets the requirements of the manufacturing floor. Additional sensing technology for predictive maintenance adds new quality-of-service requirements to the industrial network. Managing different communication requirements for motion control, programmable logic control and predictive maintenance is the key challenge of applying the IEEE Time Sensitive Network (TSN) standard to the trends in the industrial automation market.

Keywords— Industry 4.0, real-time control, real-time Ethernet, Time Sensitive Networks, Industrial Ethernet

I. INTRODUCTION

Industrial communication and control systems in a factory continuously increase the efficiency and flexibility of a production system. Modern factory floors support multiple control systems for different applications. Figure 1 shows the various control systems of a production cell with wired and wireless communication interfaces.

Industrial control systems use Programmable Logic Controllers (PLC) to automate a large 24-volt input/output (IO) system. These IO modules can reside next to the PLC CPU or connect through Industrial Ethernet to remote IO systems at the machine. For cabinet-deployed IO functions, protection class IP20 is sufficient. Machine-deployed IOs support the high protection class IP67. Industrial sensors connect either via the IO system or directly over Industrial Ethernet to the PLC network.

Many products require tight control of temperature, humidity, air purity and light during the production process to maintain best and consistent product quality. These parameters become even more important with additive manufacturing. Besides handling of raw material and product, the automated transport of chips is part of the smart factory.

Machine tools support concurrent processing of multiple parts with multiple tools. Such machines have up to 100 axes which are very dynamic in speed and very precise in position. Automated tool changers with data logging and quality checks of the tool support full online documentation of the production process. User-friendly control panels support additional visualization of all process parameters and connect the machine to the Information Technology (IT) world.

Cameras are used to detect the presence and position of objects. More enhanced systems include precise measurement and quality checks with reference parts. In addition, the surroundings of machines and robots are scanned for safe workspaces. Multiple cameras or scanners are combined to provide a real-time view of the production process. Direct integration into control systems enables a more efficient collaboration of man and machines.
Figure 1: Industry 4.0 Production Cell

A special variant of industrial control exists for the manipulators of industrial robots. Up to seven axes allow flexible movement in all directions. A number of interfaces are needed to integrate a robot arm into a production system. Interaction with PLC, tools and cameras enables a higher degree of automation. Through additional sensing technologies and functional safety design, collaboration with humans is possible.

Electronic tools in the manufacturing line have dedicated control units which are specific to the function, like welding, painting and milling. These tools may only power up when they are taken out of the tool magazine.

Flow of raw material, products and packaging requires identification, tracking and transportation. In certain applications, parallel movement with multi-carrier systems increases the throughput of the system. Palletizing is used to stack many objects in one area to reduce the transport overhead.

All components on the manufacturing floor are connected and controlled inside a network domain which is called Operational Technology (OT). In order to protect this domain from the IT world, a secure gateway function is needed. Such a gateway adjusts the data format and communication protocol at the edge between the IT and OT domains.

Many different control systems with hundreds of sensors and actuators share a common, deterministic backbone to make sure information flows at the right speed and at the correct time. The time-sensitive network can only meet this requirement through wired communication interfaces. However, wireless technologies gain more momentum for service ports and mesh networks collecting less time-critical process data.

Autonomous guided vehicles (AGV) are a new way of transport in factories. They use certain intelligence to find the shortest route, pick the right parts and avoid collisions with other vehicles or humans. Equipped with a robot arm and vision system, AGVs can take over complex tasks of machine loading and unloading without human interaction.

While there are different control systems in the Industry 4.0 production cell, they can be viewed as a generic real-time control system using industrial communication to connect remote IO devices to a central processing unit.

II. INDUSTRIAL REAL-TIME CONTROL

A. Field Level Control Systems

Production systems are organized in various levels to manage the production process. Optimization of the production process is only possible with a fully transparent and timely accurate view of the IO functions. Compared to consumer and office communication, industrial control systems need to be real-time deterministic and safe. Figure 2 shows the basic functions and parameters of an industrial control system.

There is cyclic exchange of IO data between devices on the field level of a manufacturing floor and the IO controller (IOC), which manages multiple IO devices (IOD) organized in a line or ring structure. Multiple IOCs can be connected at the control level to exchange data between various machines or between machines and automation components such as robots, conveyor belts and tool magazines. Communication to the office level bridges between the OT and the IT. This bridge function requires security for authentication and data while still maintaining the timing context of IO operation.

IO communication over Industrial Ethernet is repeated with a pre-configured cycle time. There are different classes of cycle times in a production system, ranging from 31.25 µs for motion control applications to more than 10 ms for a complete manufacturing site. Based on the communication cycle time (t_cycle), the IOC sends new data packets to IODs. The IOC serves as a timing master with a local time reference (t_ref). The IODs synchronize to the master time reference, and all IODs in the network have the same understanding of time. The time synchronization inside an IO communication network allows giving all input and output data at each node a reference time. For example, IOD2 data_in2 is captured at a pre-configured time t_in_2. The same behavior exists for the output data, which is triggered at time t_out_2.

The time-synchronized control of input and output data over a network of many IO devices serves as a basis for the programmable logic controllers (PLC) and multi-axis motion controllers used in machine tools and robotics applications.

B. Fieldbus and Industrial Ethernet

Figure 2 - Industrial Control System

The introduction of the PLC has led to a need for deterministic communication interfaces and protocols. Over time, serial-based communication was replaced with Ethernet-based communication. The transition to Ethernet technology allows building larger systems and exchanging more data in a single packet. With a 100 Mbit/s data rate, there is enough bandwidth to control hundreds of devices in a short cycle time.
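A quick back-of-the-envelope check of that bandwidth claim (a sketch; the 8 bytes of preamble/SFD and 12 bytes of inter-frame gap per frame are an assumption of standard Ethernet overhead, not stated in the text):

```c
#include <stdint.h>

/* Wire time of one Ethernet frame in nanoseconds, including
 * 8 bytes of preamble/SFD and 12 bytes of inter-frame gap. */
int64_t frame_wire_time_ns(int64_t frame_bytes, int64_t rate_mbps)
{
    int64_t bits = (frame_bytes + 8 + 12) * 8;
    return bits * 1000 / rate_mbps;   /* ns = bits / (Mbit/s) * 1000 */
}

/* How many frames of a given size fit into one communication cycle. */
int64_t frames_per_cycle(int64_t cycle_ns, int64_t frame_bytes,
                         int64_t rate_mbps)
{
    return cycle_ns / frame_wire_time_ns(frame_bytes, rate_mbps);
}
```

At 100 Mbit/s a 64-byte frame occupies 6.72 µs on the wire, so roughly 148 such frames fit into a 1 ms PLC cycle; with several IO channels aggregated per frame, this is consistent with controlling hundreds of devices.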

While there is enough bandwidth with 100 Mbit/s Industrial Ethernet to drive a single control system, a converged network for multiple applications requires even more flexibility and bandwidth. Table 1 lists examples of different control systems with key parameters of cycle time, bandwidth, jitter and number of devices in the network.

Control System | Cycle Time | #Devices | Bandwidth [Mbit/s] | Jitter
PLC | 1 ms | 512 | 100 | 250 ns
CNC | 125 µs | 128 | 100 | 40 ns
Robotics | 250 µs | 64 | 100 | 250 ns
Vision | 4 ms | 8 | 1000 | 100 ns
Transport | 10 ms | 16 | 100 | 1 µs

Table 1 – Examples of Industrial Control Parameters

The most critical applications in terms of timing parameters are CNC machines. Inside the multi-axis application there are still serial-based encoder protocols, because Industrial Ethernet cycle times do not reach the required 10 µs today. Larger PLC systems may span hundreds of devices with cycle times in milliseconds. For automated transportation solutions the cycle time can be even slower, and wireless communication can be used.

A new challenge for industrial control systems is the integration of machine vision. The bandwidth of vision sensors in high-throughput applications such as printing and bottling exceeds the 100 Mbit/s data rate of today's Industrial Ethernet protocols. In addition, there is a high computation challenge to extract relevant object information out of an image capture at more than 50 frames per second.

A converged network carrying multiple disciplines in one Ethernet cable is the opportunity for IEEE 802.1Q [1] Ethernet bridges to gain significant footprint in industrial control systems. Being real-time deterministic for one discipline is solved with protocols such as Profinet and EtherCAT. Supporting multiple control systems as a backbone for an Industry 4.0 production cell requires a new approach, which is discussed next.

III. TIME SENSITIVE NETWORKS

A. TSN Standard

The IEEE 802.1 Working Group is further developing Ethernet bridge technology for different networks. The customer bridge with Ethertype 0x8100 typically applies to industrial networks. With the addition of a Virtual Local Area Network (VLAN) header inside an Ethernet packet, there are more quality-of-service options based on VLAN identifiers and priority flags. For example, the classification of packets used in the context of motion control can use a unique identifier. The association of motion control traffic with a stream can now be managed in the network. The forwarding rules at a Customer-VLAN (C-VLAN) bridge for a motion stream can be mapped to a certain traffic class and priority. These mechanisms are used by TSN to support a more deterministic distribution of IO data over the Ethernet network.

Figure 3 shows the basic flow of an Ethernet packet through a customer bridge. Before packets are processed for the forwarding decision, certain rules for the ingress port apply. For example, a port can have a state in which only port-to-port traffic is allowed and all other traffic is dropped. Another ingress rule is port membership: only VLAN IDs which are registered can enter the bridge. One enhancement of TSN is stream filtering and policing under the 802.1Qci module [2]. A stream can be filtered at the receive port in case it exceeds the data rate specified for the stream or the time window for the stream is not open.

Figure 3 – TSN enhancements to customer bridge

The IEEE 802.1Q standard defines store-and-forward bridges. This means a packet needs to be received without errors to ensure error-free forwarding. The penalty of the store-and-forward scheme is high latency and high jitter through a daisy-chained network of industrial IO devices. With the combination of the 802.3br [3] and 802.1Qbu [4] standard modules, it is now possible to reduce the latency and jitter to about 1/10 of a maximum-sized packet with 1500 bytes. Frame preemption works with a maximum latency of 123 bytes; only packets larger than 123 bytes can be interrupted by express packets. Maximum jitter in a customer bridge with frame preemption at gigabit rate is therefore about 1 µs. This number comes closer to what dedicated Industrial Ethernet standards reach at 100 Mbit. However, the latency and jitter numbers of protocols such as Profinet IRT and EtherCAT are still a factor of 10 better.
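The 1 µs figure can be reproduced from serialization time alone (a sketch; preamble and inter-frame gap are ignored here, which makes the numbers slightly optimistic):

```c
#include <stdint.h>

/* Serialization time of a byte run on the wire, ignoring preamble/IFG. */
int64_t serialize_ns(int64_t bytes, int64_t rate_mbps)
{
    return bytes * 8 * 1000 / rate_mbps;   /* ns at rate_mbps Mbit/s */
}
```

A 123-byte non-preemptable fragment takes 123 x 8 = 984 ns, about 1 µs at 1 Gbit/s, while a full 1500-byte frame takes 12 µs, so preemption indeed cuts the worst-case blocking to roughly one tenth.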

With the introduction of the 802.1Qbv [5] time-aware shaper (TAS), it is possible to separate streams for real-time (RT) packets into time windows and avoid jitter caused by non-real-time (NRT) packets. The time-aware shaper assumes there is time synchronization according to 802.1AS-rev [6]. The network and the streams are managed using the 802.1Qcc [7] standard module. Before packets are transferred to egress queues, the filter database (FDB) defines the route of a packet through a bridge. From an IEEE 802.1 Ethernet bridge standard perspective, the frame filter looks at the Ethernet header, including the VLAN tag, and its FDB entry to decide which egress ports the packet goes to. Industrial Ethernet protocols like EtherCAT do not have
this extra step of forwarding decision, as all frames pass through a network node on the fly. The bridge delay for automatic frame forwarding is in the range of 320 ns. Industrial Ethernet protocols which make the forwarding decision based on the Ethernet header and protocol header reach a forwarding delay of < 3 µs when using a cut-through scheme, i.e. forwarding starts once the decision point in the frame has been verified. Cut-through switching at 100 Mbit reduces the latency from 125 µs to 3 µs. At gigabit rate this latency reduces from 12.5 µs to 1 µs.
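The gap between the two schemes compounds over a daisy-chained line of IO devices, since store-and-forward pays the full frame serialization time at every hop (a sketch; the ~3 µs cut-through figure from the text is taken as a given per-hop constant):

```c
#include <stdint.h>

/* Per-hop store-and-forward latency: the whole frame must be received
 * before forwarding starts, so it is at least the serialization time
 * of the frame (preamble/IFG and processing excluded). */
int64_t store_forward_hop_ns(int64_t frame_bytes, int64_t rate_mbps)
{
    return frame_bytes * 8 * 1000 / rate_mbps;
}

/* Worst-case accumulated latency over a daisy chain of bridges. */
int64_t chain_latency_ns(int64_t hops, int64_t per_hop_ns)
{
    return hops * per_hop_ns;
}
```

A 1500-byte frame at 100 Mbit/s serializes in 120 µs per hop (the text's 125 µs adds preamble, IFG and internal processing), so 50 daisy-chained devices accumulate about 6 ms with store-and-forward versus about 150 µs with 3 µs cut-through hops, which is why cut-through matters for line topologies.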

For TSN to reach the latency and jitter performance of today's Industrial Ethernet, it requires a combination of the time-aware shaper and cut-through switching. Figure 4 shows the transmit queues, shaper and transmit selection logic with time-aware gates. In theory, the IEEE standard would allow mixing various shapers, cyclic queuing and frame preemption on a single transmit port. Engineering such a complex traffic model over a larger network is very difficult. What makes sense is the time-aware shaper for IO data, mapping different control systems into different streams with dedicated time windows.

Figure 4 - Transmit Port Scheduler

B. TSN Profile for Industry 4.0 Production Cell

With the use cases described in the introduction of this paper, the following converged network configuration using TSN with cut-through switching is discussed. IO packets representing one control system characteristic are mapped into one stream. This stream is identified through a single VLAN ID and mapped to one traffic class which has its own transmission window.

Fig 5. Communication Cycle with TAS: a 100 µs cycle in which time-aware gates open per queue (queue 7, motion: 9 x 64 B = 6.0 µs; queue 6, IO: 32 x 64 B = 21.5 µs; queue 5, vision: 4 x 512 B = 17.0 µs), followed by best-effort traffic with strict priority; the RT windows use cut-through, the NRT remainder store and forward.

Cyclic queuing and forwarding as defined in 802.1Qch [8] supports mapping of different streams to traffic classes which are controlled through time gates in a cyclic manner. In the example shown in Figure 5, there are three traffic classes with associated streams executed in reserved time windows, by defining gate open and close times exclusively for one traffic class.

Motor control parameters such as PWM output data, current values and position data can be transferred over the TSN network at the frequency of the PWM cycle time. An 8 kHz torque loop is executed with a 125 µs cycle time on the TSN network. The size of the Ethernet packet can be the minimum of 64 bytes. For one 3-phase motor, the PWM data fits into 12 bytes, so the motor parameters in one 64-byte packet serve up to 3 axes. The mapping of various control systems onto one gigabit TSN network, as shown in Figure 5, supports 9 motion control packets. In total, control parameters for 27 motors are mapped in this example.
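The motion-stream arithmetic above can be checked mechanically (a sketch; the 18 bytes of Ethernet header and FCS assumed for an untagged minimum frame are not stated in the text):

```c
#include <stdint.h>

/* Usable payload of a minimum 64-byte Ethernet frame:
 * 64 - (6 dst + 6 src + 2 ethertype + 4 FCS) = 46 bytes. */
int64_t min_frame_payload(void)
{
    return 64 - (6 + 6 + 2 + 4);
}

/* Motors served by the motion window: bytes_per_motor bytes of PWM
 * data per 3-phase motor, 'packets' minimum-sized frames per cycle. */
int64_t motors_served(int64_t packets, int64_t bytes_per_motor)
{
    return packets * (min_frame_payload() / bytes_per_motor);
}
```

46 / 12 = 3 motors per packet, and 9 packets per 125 µs window give the 27 motors quoted above; with a 4-byte VLAN tag the payload drops to 42 bytes, which still serves 3 motors.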

PLC IO devices can be single sensor/actuator devices, concentrations of multiple channels in a decentralized remote IO, or DIN-rail IO systems with a modular backplane architecture that concentrates many IO modules into a single frame. For the first two device types a typical packet size is 64 bytes. Only the cabinet-deployed remote IO uses larger packet sizes of up to 1500 bytes.

Machine vision sensors span a wide range of image sizes and frame rates. For 2D line scanners and lower-resolution 3D time-of-flight cameras, the bandwidth of 100 Mbit/s Ethernet is sufficient. Adding multiple machine vision sensors to one TSN network requires a gigabit Ethernet network. The frame rate of a scanner is in the range of 100 Hz per image. However, the image size may exceed the 1500-byte limit of Ethernet frames. A full image can be spread out over multiple frames in a cyclic manner and still meet the frame rate, which is more than 100x slower than the TSN cycle time. In the example of Figure 5, there are 4 packets defined in the vision stream, with 512 bytes for each camera. Higher-resolution cameras do not stream raw images over the network, as the required bandwidth exceeds the gigabit data rate. These cameras compress the original image to stream over gigabit interfaces. A vision computer system decodes the stream and runs analytics to detect objects. In addition, machine vision sensors have local intelligence to measure and detect objects. The immediate processing of images supports a much faster reaction time for control systems.

Another traffic class for Industry 4.0 production systems is condition monitoring, to enable predictive maintenance through data analytics outside the classical PLC system. The maintenance cycles to replace tools, lubricate and replace bearings are days, weeks or even years. Data collection for condition monitoring can therefore be part of the non-real-time (NRT) traffic window. For NRT, as part of a cyclic TSN profile, it is important that the frame does not overlap into the next communication cycle, which starts with a TAS window.

IV. EMBEDDED PROCESSOR WITH INTEGRATED TSN SWITCH

The examples given in the previous chapter describe a converged network for industrial control systems based on TSN. All devices in the network need to work on a common time base. IEEE 802.1AS is the protocol which enables a common understanding of time in an Industrial Ethernet network. There is at least one timing master in the system which provides the reference time. Such reference time is then used by each slave to manage the communication parameters of the TSN switch. In the described example, these are the gate times for different streams and the cycle time.

In addition to the time delay through the physical layer and TSN switch, there is an unknown delay which comes with different cable lengths. To compensate for this variable delay, the 802.1AS protocol supports peer-to-peer line delay measurement. Cable delay, Ethernet PHY delay and System-on-Chip (SoC) bridge delay are summed up to calculate the same master time on each network node. A frequency drift between crystals on different network nodes is compensated by comparing the time stamps of Ethernet packets with the local time. Figure 6 shows three different time domains on a forwarding bridge. The receive function of the Ethernet PHY recovers timing from the received symbols. A receive PLL follows the clock oscillator of the previous network node. The interface between the Ethernet PHY and the TSN switch uses the recovered receive clock. On the TSN switch device, a receive time stamp (RX_TS) is taken with a local clock of the device. This time domain is also used for the transmit time stamp (TX_TS), which is needed in case the forwarding delay of the bridge is not constant for time synchronization packets.
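The peer-to-peer delay measurement works with four time stamps: the initiator sends a request at t1, the responder receives it at t2 and answers at t3, and the initiator receives the answer at t4. The mean link delay is then ((t4 - t1) - (t3 - t2)) / 2, independent of the offset between the two clocks. A sketch:

```c
#include <stdint.h>

/* IEEE 802.1AS peer-to-peer mean link delay from the four Pdelay
 * time stamps (t1, t4 in initiator time; t2, t3 in responder time).
 * The constant offset between the two clocks cancels out. */
int64_t mean_link_delay_ns(int64_t t1, int64_t t2, int64_t t3, int64_t t4)
{
    return ((t4 - t1) - (t3 - t2)) / 2;
}
```

For example, with a true one-way delay of 320 ns and a responder clock 5 µs ahead of the initiator: t1 = 0, t2 = 5320, t3 = 6320, t4 = 1640, and the formula recovers the 320 ns link delay.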

Figure 6 - Time Domains on a Single Network Node

Ethernet PHYs use an external clock source to generate the transmit clock between the bridge and the physical layer. As an optimization, one can use a single oscillator source for the SoC bridge and the Ethernet PHY transmit; this will reduce the jitter in the system. An alternative location for time stamp capture is the physical layer. The challenge for physical-layer time stamps comes with gigabit packets, different time synchronization protocols and the need for an extra interface to transfer time stamp data from the Ethernet PHY to the SoC. Time stamp accuracy in embedded processors such as the next-generation Sitara™ [9] processor AM6x is 4 ns for gigabit Ethernet using a 250 MHz clock reference. This time base and accuracy is used for clock synchronization. It can be tuned in 1 ns steps and supports time synchronization over larger networks to well below 100 ns. The jitter between TSN network nodes has a direct impact on the scheduler and time-aware shaper. If the time synchronization jitter is in the range of 1 µs, two minimum-sized packets of 64 bytes cannot be transmitted in a TAS window at gigabit rate.

A flexible and deterministic intellectual property (IP) core to support TSN is the Programmable Real-time Unit and Industrial Communications Subsystem (PRU-ICSS) [10]. Figure 7 shows the new generation of Sitara™ embedded processors with an integrated gigabit version of PRU-ICSS. The external interface is gigabit MII to the Ethernet PHYs. Each instance of PRU-ICSS supports two physical Ethernet ports. With three PRU-ICSS subsystems on one SoC, an implementation with two rings on the field level and one ring on the control level is possible. PRU-ICSS provides programmable logic above the physical layer with non-pipelined cores, a broadside data bus of up to 1000 bits width and hardware timers for Ethernet traffic control. TSN time-aware gates use the hardware timer to transfer packets with zero latency between transmit queue selection and the transmit interface to the physical layer. The PRU can transfer Ethernet packets of 64 bytes in 4 ns. This high-throughput architecture of 128 Gb/s to the physical ports and host port serves as a basis for low-latency gigabit TSN bridges. In order to maintain the low-latency communication up to the application CPU, multiple interfaces with DMA support are available. A high-speed, high-bandwidth crossbar switch distributes multiple real-time (RT), non-real-time (NRT) and network management (NW) interfaces to various masters in the system. A bus master can also be a direct memory access (DMA) peripheral which takes the payload of an IO packet and writes the data directly to an external interface.

Figure 7 - AM6x Embedded Processor with Gigabit TSN Switches

Industry 4.0 production systems connect operational technology (OT) on the factory floor with information technology (IT) inside the company and off-site cloud services. The edge gateway at this boundary between OT and IT requires network security. Security accelerators on embedded processors with the latest ciphers and key management guarantee high throughput of the gateway.

V. CONCLUSION
The Industry 4.0 reference architecture provides a guideline for connectivity through the life cycle of a product. This guideline addresses issues of compatibility, security and communication. The modern production system has many different control characteristics, and in order to reach a converged network, a very flexible communication standard is required. TSN is one of the core standards for wired communication and has a rich set of standard modules to enable a converged network. Time synchronization and the time-aware shaper are the modules most relevant to real-time deterministic communication over Ethernet.

Current Industrial Ethernet standards specify the timing parameters to ensure real-time deterministic communication. In addition, they provide certification tests, interoperability events and application interfaces to control systems. The challenge for TSN to be adopted, and finally accepted as a replacement for existing Industrial Ethernet standards, is the specification and validation of a set of industrial profiles. The most critical parameter for industrial control systems is deterministic, low-latency bridge delay. The IEEE TSN working groups need to address time synchronization jitter of a few tens of ns, cut-through switching and fixed delay through the forwarding bridge in order to compete with existing standards.

An embedded processor solution family with integrated gigabit switch technology was presented. It supports both the current Industrial Ethernet standards and future standards based on TSN. The real-time deterministic behavior of the PRU-ICSS and multiple interfaces to application CPUs on the SoC ensure a real-time data path up to the application controller. SoC integration of security acceleration IP and multiple instances of the PRU-ICSS enable a flexible edge gateway with control CPU and cloud connection.<br />

REFERENCES<br />

[1] IEEE Std 802.1Q-2014 (Revision of IEEE Std 802.1Q-2011) - IEEE Standard for Local and Metropolitan Area Networks - Bridges and Bridged Networks.<br />
[2] IEEE Std 802.1Qci - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Per-Stream Filtering and Policing.<br />
[3] IEEE Std 802.3br - IEEE Standard for Ethernet - Amendment 5: Specification and Management Parameters for Interspersing Express Traffic.<br />
[4] IEEE Std 802.1Qbu - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Frame Preemption.<br />
[5] IEEE Std 802.1Qbv - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Enhancements for Scheduled Traffic.<br />
[6] IEEE Std 802.1AS-Rev - IEEE Standard for Local and Metropolitan Area Networks - Timing and Synchronization for Time-Sensitive Applications.<br />
[7] IEEE Std 802.1Qcc - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Stream Reservation Protocol (SRP) Enhancements and Performance Improvements.<br />
[8] IEEE Std 802.1Qch - IEEE Standard for Local and Metropolitan Area Networks - Media Access Control (MAC) Bridges and Virtual Bridged Local Area Networks - Amendment: Cyclic Queuing and Forwarding.<br />
[9] Sitara(TM) Processors, http://www.ti.com/processors/sitara/overview.html<br />
[10] Programmable Real-time Unit and Industrial Communication SubSystem (PRU-ICSS), http://processors.wiki.ti.com/index.php/PRU-ICSS<br />


TSN<br />

Future Industrial Ethernet Standard or just AVB 2.0?<br />

Dipl.-Ing. (FH) Torsten Rothe<br />

Technology Engineering and Services CE<br />

Avnet EMG AG<br />

Rothrist, Switzerland<br />

Torsten.Rothe@avnet.com<br />

Abstract - Over the last decades Ethernet has become the de-facto standard for high-bandwidth standardized communication between computing devices. Originally invented for Local Area Networks, it has been adopted in many other use cases, such as VoIP communication or AVB (Audio Video Bridging). Ethernet is standardized, exists everywhere and is one of the most cost-efficient ways to connect a huge number of communicating devices. However, when it comes to industrial control applications, additional requirements such as determinism, redundancy and short latencies arise. These requirements have so far been addressed by industrial Ethernet protocols such as EtherCAT or ProfiNET. With the emerging and now mostly finalized IEEE 802.1 TSN (Time-Sensitive Networking) standard this is about to change. Having evolved from the existing 1722.1 AVB standard, it adds abilities such as frame preemption and seamless redundancy to also meet industrial requirements. As an open standard driven by both the automotive and the industrial world, TSN is destined to become the next big step in industrial Ethernet communication. In this paper we take a closer look at TSN from an industrial automation viewpoint and discuss different use cases that could be enabled by this new standard. We discuss the various IEEE 802.1 sub-standards and their respective significance to these use cases, and also review the feasibility of TSN for replacing traditional industrial Ethernet protocols in these scenarios.<br />

Keywords - TSN; 802.1; 1722.1; AVB; real-time Ethernet; deterministic Ethernet; industrial automation; industrial control<br />

I. INTRODUCTION<br />

When Ethernet was defined in the 1970s, the focus was on best-effort communication and on achieving maximum throughput versus cost for the bandwidth available. Clearly, at that time determinism was neither considered nor a design target. However, over the years Ethernet has become increasingly important also for deterministic communication applications due to its openness and wide acceptance. Today it is available in virtually all infrastructures, is easy to use and, despite a multitude of different vendors, compatibility is ensured by standardized and mandatory extensive testing. It is robust and can work seamlessly across media boundaries. The telecom world has already made the change from classic circuit designs, such as Sonet/SDH, to packet-oriented designs using Ethernet, leveraging AVB and, in the future, TSN. So what keeps us from using TSN also as an IEEE-standardized replacement for proprietary and vendor-driven industrial field bus standards? In this paper we will take a closer look at whether TSN is just an extended version of AVB or whether it is suitable to become the future de-facto standard for deterministic Ethernet communication.<br />

We start with a short overview of the current status of TSN from an industrial application perspective and provide a short market overview of the existing deterministic industrial Ethernet solutions. This is followed by an introduction to the TSN standards relevant for industrial applications and their current state of development. We also take a closer look at the most important standards of TSN from an industrial control perspective, followed by more information on the configuration of such a network. Finally, we look into a test where we compare the influence of hardware and software support in a redundant ring system. This is concluded by an outlook on the next relevant steps for TSN for industrial applications.<br />

II. MARKET OVERVIEW<br />

Figure 1 [1] shows the 2016 summary of today's Industrial Networking Protocols and their respective worldwide market share. Obviously this is divided into (traditional) field-bus protocols and Ethernet-based solutions.<br />

Figure 1: 2016 Industrial network protocol shares according to HMS<br />



When looking at Figure 1, two major trends can be observed. There is a quite stable market for field-bus interfaces with only moderate growth, and a fast-growing market for industrial Ethernet-based applications which even today is as big as the one for field busses. If the shown growth rates remain constant, in three years from now Ethernet-based applications will have twice the market of traditional field busses and will therefore have outgrown traditional field-bus protocols. However, the emerging trend of Industry 4.0, which demands much higher data rates, will most likely further accelerate these growth rates. This means that even in 2019, Ethernet-based protocols will make up the majority of freshly installed industrial networks.<br />

While all of the different real-time Ethernet standards have their pros and cons, they have one major drawback in common: they are not compatible with each other. A major reason for this is that most of the big vendors have adopted (or invented) one of these standards and are pushing it into the market. If one of these standards gains market share, this also means that some vendors are gaining market share. So there is not even an interest in improving interoperability. However, for Industry 4.0 to become a success, one standardized and interoperable industrial Ethernet solution is required. The current fragmentation aggravates installation, configuration and maintenance of such networks and results in significantly higher costs. On the other hand, many Industry 4.0 applications require hard real-time networks with a bounded latency. TSN now promises to offer these features as an IEEE standard and is widely seen as the prime candidate to become the future industrial Ethernet standard. Let's have a closer look at its features.<br />

III. TSN - HISTORY, CONCEPT AND TECHNICAL FEATURES<br />

With the current fragmentation of standards, a merge of OT (operation technology) and IT (information technology) in an industrial plant is impossible within a reasonable cost structure. However, with Industry 4.0 this is a major demand in the market, and so far it was not possible without expensive solutions. This was a major reason why TSN was advanced from the existing AVB (Audio Video Bridging) standard by the Avnu Alliance [2]. AVB, which had been developed for high-quality audio and video stream transmission over Ethernet, was the first non-proprietary Ethernet solution for best-effort real-time stream data transmission and was originally standardized as IEEE 1722.1. The TSN group was founded in 2012 out of the AVB group. This was necessary because some of the AVB standards were very interesting for industrial applications, but AVB alone could not fulfill all their needs. By separating both standards, the AVB group could finish their work and TSN had the chance to create enhancements and additional standards with clear design targets from the beginning. The most important design targets of TSN were:<br />

- Open standard to guarantee interoperability and long-term stability<br />
- Real-time with guaranteed latency, low jitter and zero congestion loss<br />
- Reduced cost and complexity for installation and maintenance of networks<br />
- Coexistence of (real-time) industrial control and best-effort standard Ethernet traffic, resulting in a convergence of OT and IT<br />
- Immunity of control-traffic determinism against best-effort traffic influences<br />
- Vendor independence<br />

The new TSN standard should therefore support all typical use cases in industry, with some examples shown below.<br />

Vertical | Use Case<br />
Industrial Automation | Machine control, PLC, motion control, safety applications<br />
Automotive | Media streaming, infotainment, but also control applications and, in the future, connected cars or autonomous driving (ADAS)<br />
Industrial Control | Machine control<br />
Power Generation Plants and Substations | Power providers have proprietary networks today for substation management. Common requirements are reaction times below 3 ms for a substation to initiate the disconnect from the power grid.<br />
Transportation | Passenger information, transport control<br />
Building Automation | Monitoring and control of heating, ventilation, air conditioning, lighting, access control<br />
Table 1<br />

Today, AVB and TSN are merged into the IEEE 802.1 standard. Most sub-standards originate from the AVB 1722.1 standards, for example [3, 4]:<br />

- 802.1BA: Audio Video Bridging (AVB) Systems<br />
- 802.1AS: Timing and Synchronization for Time-Sensitive Applications (gPTP)<br />
- 802.1Qat: Stream Reservation Protocol (SRP)<br />
- 802.1Qav: Forwarding and Queuing for Time-Sensitive Streams (FQTSS)<br />

The TSN standard defines additions to enable additional use cases such as industrial Ethernet. Some of these are looked at in more detail in the next chapter. The following table provides an overview of these additions and their current status as of the publishing time of this paper.<br />

Standard | Name | Status | Ref<br />
IEEE 802.1AS-Rev | Timing and Synchronization for Time-Sensitive Applications | Draft 6.0 | [5]<br />
IEEE 802.1Qbv | Enhancements for Scheduled Traffic | active | [6]<br />
IEEE 802.1Qbu | Frame Preemption | active | [7]<br />
IEEE 802.1Qca | Path Control and Reservation | active | [8]<br />
IEEE 802.1Qcc | Stream Reservation Protocol (SRP) Enhancements | Draft 2.0 | [9]<br />
IEEE 802.1Qci | Per-Stream Filtering and Policing | Draft 2.1 | [10]<br />
IEEE 802.1CB | Frame Replication and Elimination for Reliability | Draft 2.9 | [11]<br />
IEEE 802.3br | Interspersing Express Traffic | active | [12]<br />
IEEE 802.1Qch | Cyclic Queuing and Forwarding | Draft 2.2 | [13]<br />
IEEE 802.1Qcp | YANG Data Model | Draft 2.0 | [14]<br />
IEEE 802.1Qcr | Asynchronous Traffic Shaping | Draft 0.3 | [15]<br />
Table 2: TSN's Industrial Protocol Features<br />



The most important features of industrial networks are:<br />

- Configurable, predictable and guaranteed end-to-end latency and bandwidth between nodes<br />
- A common time basis with minimal jitter and time deviation between participating nodes<br />
- No (or minimal) packet loss, through features to avoid congestion loss and concepts for media redundancy<br />

Some of the TSN additions mentioned in the last chapter help to realize these features. These are:<br />

- Timing and Synchronization for Time-Sensitive Applications<br />
- Enhancements for Scheduled Traffic<br />
- Frame Preemption<br />
- Frame Replication and Elimination for Reliability<br />
- Cyclic Queuing and Forwarding<br />
- Per-Stream Filtering and Policing (IEEE 802.1Qci)<br />

In the following sections we give a short introduction to these and explain how they contribute to meeting industrial requirements. All these enhancements are realized as Layer 2 features in hardware to ensure real-time capabilities. However, some of them need additional software support on higher levels. The additional features are only enabled if all required participants (e.g. both nodes of a link) support them, thereby keeping compatibility with existing Ethernet nodes. For the end user this means that applications can be developed as usual, adding TSN features gradually when needed.<br />

A. Common Time Basis<br />

One basic necessity for realizing deterministic applications with multiple participants is to establish a common time basis through time synchronization. Contrary to time synchronization in IT networks, it is not focused on global time but on minimizing time deviations between nodes. For TSN, the existing AVB standard 802.1AS (gPTP, generalized Precision Time Protocol) [16] was improved into the 802.1AS-Rev standard. Even with AVB, deviations of less than ±500 ns over 7 hops to a time grand master could be achieved by using the BMCA (Best Master Clock Algorithm) [17]. However, if the grand master itself fails, gPTP requires a significant time to switch to a new time grand master. In 802.1AS-Rev [5], multiple domains and time masters can be assigned, which allows switching to a redundant time master without delays in case of failures. To ensure a seamless switchover in critical applications, redundant time masters can be predefined. For AVB, the major application of time synchronization was to ensure synchronous play-out of audio and video streams for multiple talkers and to synchronize samples from multiple listeners. For industrial control applications, it is necessary to guarantee minimum deviations in the start and end of transfer cycles and to synchronize the time slots for IEEE802.1Qbv scheduled traffic [17, 18], which we are looking at in the next section.<br />
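The offset estimation underlying PTP-style synchronization can be sketched in a few lines. This is a hedged illustration of the generic two-timestamp-pair exchange (Sync and Delay_Req), not the 802.1AS peer-delay state machines; the function name and timestamp values are hypothetical, and a symmetric path delay is assumed.<br />

```python
# Sketch of the offset/delay computation behind PTP-style time sync.
# Assumes a symmetric propagation delay between master and slave.

def ptp_offset_and_delay(t1, t2, t3, t4):
    """t1: master sends Sync, t2: slave receives it,
    t3: slave sends Delay_Req, t4: master receives it.
    All timestamps in ns, each taken with the local clock."""
    offset = ((t2 - t1) - (t4 - t3)) / 2.0  # slave clock minus master clock
    delay = ((t2 - t1) + (t4 - t3)) / 2.0   # one-way propagation delay
    return offset, delay

# Hypothetical example: slave clock runs 500 ns ahead, true path delay 100 ns
offset, delay = ptp_offset_and_delay(t1=0, t2=600, t3=1000, t4=600)
# offset -> 500.0 ns, delay -> 100.0 ns
```

Each bridge in a gPTP domain repeats a measurement of this kind per hop, which is why the residual deviation grows with the hop count.<br />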

B. Traffic Shaping and Scheduling<br />

Industrial process control is characterized by cyclically recurring as well as sporadic events. To combine these synchronous and asynchronous transfers with non-mission-critical traffic, TSN has introduced a Time Aware Scheduler, IEEE802.1Qbv [6]. This enhancement became necessary because with the previous means of prioritization (802.1Q and the AVB 802.1Qav Credit Based Traffic Shaping) it is not possible to send packets at a fixed time. IEEE 802.1Q only cares about the prioritization of packets. This means that best-effort traffic that is already queued in the outbound queue of a switch, or that is already in processing, cannot be interrupted by prioritized traffic frames. This leads to cycle times being missed, which is critical for control applications. Similarly, AVB's 802.1Qav Credit Based Traffic Shaper is only concerned with ensuring a guaranteed bandwidth for prioritized traffic. This fully meets the requirements of audio and video applications, but not the requirements for an accurate end-to-end latency as needed for industrial applications. In AVB, streams are defined and reserved between talker and listener endpoints. Both have a common understanding of time, to ensure that these streams are transmitted and played in a timely accurate manner. The intermediate nodes, like switches, are responsible for forwarding the data streams. However, since streams are sent sufficiently far in advance, credit-based traffic shaping is adequate to ensure that streams arrive at the listener in time and with the necessary bandwidth. Only the endpoint uses time-based traffic shaping while playing the streams. Requirements for industrial control are different, since packets need to arrive at all participating stations at a precisely defined and exact time [19, 20].<br />

The Time Aware Scheduler (IEEE802.1Qbv) resolves this by introducing time-slot-based scheduling. Figure 2 shows such a state-of-the-art switch with multiple (usually 8) port output queues representing different frame priorities. Traffic frames are pushed into these queues based on assigned classes and priorities to be forwarded to the MAC (media access controller). In a switch without a time aware scheduler according to IEEE 802.1Q, queues with higher priority are normally served first. Lower priority queues are only served when all higher priority queues have been served and have become empty. However, if a lower-priority frame is already being processed towards the MAC, a freshly arrived high-priority frame has to wait until this lower-priority frame is processed, leading to unpredictable delays of high-priority traffic through a switch. To resolve this issue, the time aware scheduler allows the definition of (repeating) cycles (starting and ending at the t0 times in Figure 2). At pre-defined times in each cycle (d in the figure), all best-effort traffic queues are stopped and one of the defined high-priority traffic queues is granted by the time aware scheduler for a time TS1. This ensures that periodically (with the cycle time period), high-priority traffic passes through the switch at pre-defined times. After this slot is finished, the best-effort queues are served according to IEEE 802.1Q. However, one requirement of Ethernet is that a frame, once its transmission has started, has to be processed completely by the MAC to allow CRC (cyclic redundancy check) checking. Therefore, even at the beginning of the priority time slot of a new cycle, a big (best-effort) frame might not be finished yet and would violate the priority scheduling. To avoid this, a so-called guard band (GB) is introduced at the end of each cycle, in which no new frames are scheduled at all (blocking all existing queues) and only already scheduled frames are drained. The length of the guard band depends on the maximum allowed length of the frames processed by the switch. It should be chosen as small as possible, because (by blocking all traffic) it limits the total possible throughput of the switch. For time aware shaping to work, all network participants have to have a common, synchronized time basis to meet the agreed priority time slots and start new cycles at exactly the same time.<br />
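The cycle-and-gate mechanism described above can be sketched as a small schedule lookup. The entry format and helper below are illustrative simplifications I am assuming for clarity, not the standard's gate control list encoding; the cycle and slot durations are hypothetical.<br />

```python
# Minimal sketch of evaluating an 802.1Qbv-style gate control list:
# given a repeating cycle and a list of (duration, open-gates) entries,
# return which queues may transmit at a given point in time.

def open_queues(gcl, cycle_time_ns, t_ns):
    """gcl: list of (duration_ns, gates), where gates is the set of
    queue numbers whose transmission gate is open in that slot."""
    t = t_ns % cycle_time_ns          # position inside the repeating cycle
    for duration, gates in gcl:
        if t < duration:
            return gates
        t -= duration
    return set()                      # fell past all entries: nothing open

# Hypothetical 1 ms cycle: 200 us slot for time-critical queue 3,
# then the best-effort queues 0-2, then a guard band with all gates closed.
GCL = [(200_000, {3}), (787_700, {0, 1, 2}), (12_300, set())]

open_queues(GCL, 1_000_000, 100_000)    # -> {3}          (time-critical slot)
open_queues(GCL, 1_000_000, 1_500_000)  # -> {0, 1, 2}    (next cycle, TS2)
open_queues(GCL, 1_000_000, 995_000)    # -> set()        (guard band)
```

Because the lookup uses the modulo of a shared time value, the same list yields the same gate states on every synchronized node, which is exactly why Qbv depends on the common time basis of the previous section.<br />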

Figure 2: Time Aware Shaper [19] (output queues Q0-Q3 feeding the MAC via a gate control list: at t0-gb all non-time-critical gates are blocked, at t0 the time-critical queue Q3 is open, at t0+d the best-effort queues Q0-Q2 are served; the schedule repeats each cycle with slots TS1, TS2 and guard band GB)<br />

C. Frame Preemption<br />

In order to optimize the usage of the available bandwidth compared to time aware shaping alone, an Ethernet frame preemption feature (IEEE802.1Qbu [7] for switches and IEEE802.3br [12] for MACs) has been added to the TSN standard. It allows large lower-priority frames to be interrupted by higher-priority frames. The interruption can occur at 64-byte boundaries (the minimum length of an Ethernet frame). This allows the guard band length to be reduced significantly, since the guard band only needs to allow the next 64-byte chunk to finish before starting the priority slot. For a detailed analysis of the minimal required guard times, see [21]. An interrupted frame continues transmission after the transfer of the prioritized frame is finished [19, 22]. Since interrupting frames results in CRC violations, both the sender and the receiver have to support frame preemption for it to be enabled. Figure 3 illustrates the concept.<br />

Figure 3: Frame Preemption (top, without preemption: an interfering frame forces a full-length guard band GB before time slot TS1 of each cycle; bottom, with preemption: the interfering frame is split into part 1 and part 2 around the prioritized traffic, allowing a much shorter guard band)<br />

In the upper part of the figure, the processing of an interfering frame without frame preemption is shown. The (low-priority) white frame requires a guard band which is as long as the largest possible interfering frame. With frame preemption, the situation is different, as shown in the lower part. Although the white frame's (low-priority) transmission has already started, it is preempted at its next 64-byte boundary, and the blue frame and all the prioritized traffic are transmitted. After this is done, the transmission of the white frame is continued. Seamless switching between the prioritized and the preempted frame is achieved by introducing two different kinds of MACs, called pMAC (preemptible MAC) and eMAC (express MAC). The pMAC handles the preemption of non-priority queue frames, handshaking with the eMAC (responsible for priority frames) to ensure a seamless handover at the MAC.<br />
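The guard band saving can be estimated with back-of-the-envelope wire-time arithmetic. The sketch below assumes a 1 Gbit/s link and, for clarity, ignores preamble and inter-frame gap overheads; [21] gives the exact analysis of the minimal required guard times.<br />

```python
# Guard band sizing at 1 Gbit/s, following the reasoning in the text:
# without preemption the guard band must cover the largest possible
# interfering frame; with preemption only a final 64-byte fragment.

def wire_time_ns(frame_bytes, link_bps=1_000_000_000):
    """Time to serialize frame_bytes onto the wire, in nanoseconds."""
    return frame_bytes * 8 * 1e9 / link_bps

gb_no_preemption = wire_time_ns(1522)  # max VLAN-tagged Ethernet frame
gb_preemption = wire_time_ns(64)       # minimum Ethernet frame/fragment

# gb_no_preemption -> 12176.0 ns, gb_preemption -> 512.0 ns:
# roughly a 24x shorter guard band per cycle.
```

The freed-up time directly becomes usable best-effort bandwidth, since the guard band blocks all queues for its entire duration.<br />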

For synchronous traffic, time aware shaping with frame preemption is a feasible solution. However, for (sporadically occurring) traffic in control networks, such as alarm messages, status messages, etc., this is not the case. In order to ensure low-latency communication for this kind of traffic, other mechanisms are required. Currently, the existing Traffic Shaper from AVB, IEEE802.1Qav, could be used, but it wasn't actually designed for deterministic systems. As stated in chapter B (Traffic Shaping and Scheduling), one of the main issues with IEEE802.1Qav is that it cannot be predicted at which moment in time any individual message arrives at the listener. This is essential, for example, in an application where sporadic events need deterministic/predictable behavior. Therefore the IEEE defined a new standard (IEEE P802.1Qch) in order to cope with the demand for such sporadic events.<br />

The IEEE P802.1Qch standard - on top of the behaviour defined in IEEE802.1Qbv - defines the assignment of bandwidth for sporadic events, but only when such an event occurs. This means the periodic traffic in each cycle remains untouched, but sporadic events get some bandwidth and prioritization assigned when needed. The bandwidth used in such an event could otherwise be used for best-effort traffic. The standard defines the traffic between two neighbouring hops, therefore allowing a maximum end-to-end latency to be guaranteed, which is one cycle time more per hop compared to IEEE802.1Qbv [20, 23].<br />
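The per-hop bound stated above turns into a simple linear worst-case formula. The sketch below is one common formulation of that bound under the text's "one extra cycle per hop" reasoning; the hop count and cycle time are hypothetical example values.<br />

```python
# Illustrative worst-case end-to-end latency under cyclic queuing and
# forwarding: each bridge forwards a frame one cycle after receiving it,
# so the bound grows linearly with the number of hops.

def cqf_worst_case_latency_us(hops, cycle_time_us):
    return (hops + 1) * cycle_time_us

cqf_worst_case_latency_us(hops=5, cycle_time_us=250)  # -> 1500 us
```

The practical consequence is a trade-off: shortening the cycle time tightens the latency bound but shrinks the bandwidth available per cycle for each traffic class.<br />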

D. Functions to protect against Malfunctions and Faults<br />

Another important aspect for industrial networks is to maximize availability and immunity against single failing nodes. Basically, two different general types of failures are imaginable: misbehaving nodes and failing nodes. Misbehaving nodes are usually harder to detect than completely failing nodes. Some examples of misbehaviors affecting the network are:<br />

- erroneous communication or corruption of messages<br />
- slow communication or forwarding performance<br />
- flooding the network with traffic<br />

While the first two misbehaviors can be detected by verifying messages and timing in the receiving node, the last one is a serious danger to the whole network, because it can prevent and overrule other important communication in the network and make it fail entirely. So in the next subchapter we will take a look at Ingress Traffic Control, a method to protect nodes against traffic flooding.<br />

Failing nodes are usually easier to detect. Examples of such failures are:<br />

- failure due to faulty (re-)configuration<br />
- failure due to software bugs, attacks or single event upsets<br />
- physically damaged devices (e.g. power failure)<br />
- media faults ("cut power or network cable")<br />

Usually such defects can be mitigated in the network by enabling a second (redundant) communication channel which does not rely on the defective device. In the second subchapter we will look into concepts to implement such redundancy.<br />

1) Ingress Traffic Control<br />

Several vendors have implemented proprietary mechanisms to protect switches against erroneous ingress traffic. TSN defines "Per-Stream Filtering and Policing" (IEEE802.1Qci) as an interoperable standard. Fundamentally, it allows frame counting, filtering, policing and service class selection to be performed for a frame, based on the particular data stream to which the frame belongs and on a synchronized cyclic time schedule. Policing and filtering functions include the detection and mitigation of disruptive transmissions, such as high-bandwidth traffic or packet flooding by other systems in a network, improving the robustness of that network [20, 23].<br />
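A token bucket is one simple flow-metering building block of the kind such per-stream policing can apply against flooding. The class below is a generic sketch under that assumption, not the standard's flow-meter state machine; the rate and burst figures are hypothetical.<br />

```python
# Generic token-bucket policer: frames conforming to the configured
# rate/burst pass, excess frames from a flooding stream are dropped.

class TokenBucketPolicer:
    def __init__(self, rate_bytes_per_s, burst_bytes):
        self.rate = rate_bytes_per_s
        self.burst = burst_bytes
        self.tokens = float(burst_bytes)  # start with a full burst credit
        self.last_t = 0.0

    def allow(self, t, frame_bytes):
        """Return True if a frame of frame_bytes arriving at time t
        (seconds) conforms; False means the policer drops it."""
        # Refill tokens for the elapsed time, capped at the burst size.
        self.tokens = min(self.burst,
                          self.tokens + (t - self.last_t) * self.rate)
        self.last_t = t
        if frame_bytes <= self.tokens:
            self.tokens -= frame_bytes
            return True
        return False

# 1 MB/s with a 1500-byte burst: back-to-back flooding is throttled.
p = TokenBucketPolicer(1_000_000, 1500)
p.allow(0.0, 1500)     # True  (burst credit available)
p.allow(0.0001, 1500)  # False (only ~100 bytes refilled in 100 us)
```

Per-stream policing additionally keys such meters to individual streams and to the synchronized cycle, so one misbehaving talker cannot consume another stream's budget.<br />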

2) Redundancy<br />

A typical fault is, for example, an open/short-circuited media line or a broken plug connection. For IT network systems, a variety of spanning tree protocol implementations for switches exist, such as STP (IEEE 802.1D) [24] and RSTP (IEEE 802.1w) [25]. These implementations reconfigure the network if there is a redundant path available. However, for real-time Ethernet packet-integrity requirements, this technology is not suitable for several reasons. First of all, it completely neglects endpoint availability. So, in case of a cable loss between a switch and an endpoint, this endpoint would be permanently disconnected from the network. Furthermore, in case of a malfunction, all packets would be lost until the reconfiguration is finished, along with the time synchronization. Finally, for simple loose contacts at plugs, resulting in toggling (on/off) connections, protocols such as RSTP would cause the network to constantly reconfigure and fail. One solution to handle such malfunctions is building a permanent redundant path to endpoints. HSR (High-availability Seamless Redundancy) and PRP (Parallel Redundancy Protocol) according to IEC 62439-3 are existing solutions. Highly simplified, HSR achieves redundancy by establishing a ring structure between endpoints, while PRP uses redundant parallel cabling. Preferences are mostly dictated by physical and environmental restrictions. TSN adds the same mechanisms into the new 802.1CB [11] standard, which works as follows: to achieve redundancy, the packets are duplicated at the sender endpoint and sent (redundantly) via separate paths to the recipient. The receiver forwards the first arriving packet to the application level. The (later arriving) duplicate gets discarded. The concept is illustrated in Figure 4, showing an HSR-like ring structure. Talker T duplicates the packet and injects it into the red and blue data paths [26].<br />

Figure 4: Frame Replication and Elimination (talker T replicates the frame, bridges B forward it along both ring directions, listener L removes duplications)<br />

This task can also be taken over by intelligent bridges or switches, depending on the technology available at the endpoint. The duplicated packets are distinguished by "sequence numbers". In this example, the red packet arrives first at the listener L and is directly forwarded to the application. The redundant blue packet is eliminated to prevent infinite loop cycles. If the red packet had been lost due to an issue in the red path, the blue packet would have arrived first and been used instead. If no duplicate arrives at the endpoint, this is an indication of a defect in the network. Compared to HSR and PRP (IEC 62439-3), IEEE802.1CB is not tied to a specific topology and can also use more than two redundant paths. This means that parallel and ring structures can be combined in the same network. However, this requires compliance with the required latencies on these paths and thus a configuration of the TSN network [19].<br />
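The listener-side behaviour described above reduces to "first sequence number wins". The sketch below illustrates this with an unbounded set of seen sequence numbers for brevity; actual 802.1CB recovery functions use a bounded history window, which this simplification deliberately omits.<br />

```python
# Minimal sketch of duplicate elimination at the listener: the first
# frame carrying a given sequence number is delivered, the redundant
# copy arriving over the other path is silently discarded.

def eliminate_duplicates(frames):
    """frames: iterable of (sequence_number, payload) in arrival order,
    possibly interleaved from several redundant paths."""
    seen = set()
    for seq, payload in frames:
        if seq not in seen:      # first arrival wins
            seen.add(seq)
            yield seq, payload   # deliver to the application

# Interleaved arrivals from the "red" and "blue" paths of Figure 4:
rx = [(1, "red"), (1, "blue"), (2, "blue"), (2, "red"), (3, "red")]
list(eliminate_duplicates(rx))
# -> [(1, 'red'), (2, 'blue'), (3, 'red')]
```

Note that for sequence number 2 the blue copy wins: the application always gets the faster of the two paths, which is what makes the failover seamless.<br />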

IV. NETWORK CONFIGURATION<br />

To enable huge Industry 4.0 networks, just having a common<br />

deterministic communication standard is not enough. A<br />

common standard for network configuration, along with tools to<br />

design, optimize and monitor such networks is required as well.<br />

Compared to classic Ethernet, the TSN standard allows for a<br />

huge variety of configuration alternatives and, for most<br>
functions to work, requires a common understanding between<br>

communicating devices. To give an example, even just<br />

configuring IEEE802.1Qbv on a single node is quite complex.<br />

However, for IEEE802.1Qbv to work, priority slots, cycle times<br />

and message queue priorities between communicating devices<br />

have to be aligned and maximum latencies and required<br />

bandwidth have to be calculated. This is especially true for larger<br />

networks with many hops and different topologies.<br />

But let’s move from bottom to top. For the configuration of<br />

single network devices (e.g. an MPU attached to a small switch<br />

to connect to the TSN network) a solution using NETCONF [27]<br />

as the management protocol exists. NETCONF provides<br />

mechanisms to read and write standardized configuration files to<br />

nodes and also offers layers to ensure secure and reliable<br />

transport of the configuration data. The configuration data are<br />

usually XML files and can be generated from a YANG [28]<br />

language model. YANG is a standardized language which can<br />

www.embedded-world.eu<br />

897


be used to describe network configuration and state data in a<br>

more human readable format than XML [29].<br />

But having means to configure single nodes is not sufficient<br />

for network configuration, because time-critical networks are<br />

more than just the sum of their atomic nodes. Standardized<br />

methods and tools to calculate, monitor and push common<br />

network configuration to these nodes are required.<br />

One approach to further standardize configuration of nodes<br />

could be to use OPC UA. OPC UA is a standard protocol to<br />

support client/server or publisher/subscriber based<br />

communication between nodes and can also be used to distribute<br />

configuration as well as status and monitoring information. By<br />

supporting authentication and encryption, OPC UA is inherently<br />

secure and reliable. There are ongoing discussions about using<br />

OPC UA for TSN Network configuration [30] and organizations<br />

such as VDMA [31] and ODVA have already committed to this<br>

[32].<br />

The complexity of calculating and finding a solution that fulfils all<br>

bandwidth and timing requirements grows exponentially with<br />

the number of nodes. Automatic monitoring, administration and<br />

reconfiguration of devices is essential to improve turnaround<br />

times and reduce costs, because just adding or replacing a single<br />

node might require a recalculation and reconfiguration of the<br />

network. As a consequence, the complete network has to be<br />

administered and controlled by a central authority or tool. Right<br />

now, a few solutions exist in the market [33, 34]. Although these<br />

solutions provide cross-vendor support, they are not vendor<br />

independent [35, 36].<br>

V. EXPERIMENTAL RESULTS<br />

A. Concept<br />

Unfortunately, at the time of writing, no TSN-conformant switch<br>
for evaluating features such as frame preemption, time-aware<br>
scheduling or redundancy was available. The authors<br>

therefore decided to focus on evaluating the behavior of a<br />

redundancy ring structure using HSR, which is fully compliant<br />

to 802.1CB and for which current implementations exist. The<br />

concepts can directly be re-used and expanded, once TSN<br />

hardware becomes available.<br />

B. Test Setup<br />

The most important feature to test in an HSR setup is the<br />

redundancy itself by ensuring that no packets are lost if the ring<br />

is opened at a single point of failure. For performance<br />

measurement, two basic indicators exist, latency and throughput.<br />

Three different kinds of latency exist for a redundant network<br />

node: the egress (t_e), the ingress (t_i) and the cut-through latency<br>
(t_t). Ingress and egress latencies describe the latencies of packets<br>

passing from the CPU to the redundant port, or from the<br />

redundant port to the CPU respectively. While these latencies<br />

are definitely interesting for local optimization of the node’s<br />

software stack, they have little effect on the total ring<br />

performance, since they add only once each to the total latency<br />

of a packet (at the sending and the receiving node). The<br>
cut-through latency, on the other hand, has a very big influence,<br>

because in a worst case scenario (ring open between two<br />

neighboring nodes, packet transmitted between these two<br />

nodes), the total latency of a packet in a network with N nodes<br />

is:<br />

t_max = t_e + Σ_{n=1}^{N} t_t,n + t_i<br>

In a sane redundant network, the average latency should be<br>
t_max/2. This is why latency measurements were limited to t_t. For<br>
most industrial networks, the network throughput is less<br>
important than the latency, but it is an important means of detecting<br>
bottlenecks. For the sake of simplicity and traceability, no<br>

complicated analysis tools were utilized. Instead, ping [37] was<br />

used to detect packet losses and to measure latencies, while iperf<br />

[38] was used to analyze the network throughput.<br />

To measure the cut-through latency of each node, the<br />

following setup was used:<br />

1. Directly connect sender and receiver node with one<br />

Ethernet cable and measure ping time.<br />

2. Add an additional node (the device under test) between<br />

the sender and receiver and ping again. The cut through<br />

time is the time difference between these two ping<br />

times.<br />

3. To ensure that all nodes work correctly as expected and<br />

as a sanity test, close the redundant Ethernet port<br />

between sender and receiver (hereby creating a ring) and<br />

ensure that the ping times return to the ones measured in<br />

step 1. Open and close the Ethernet connections at<br />

various steps to ensure that no packet losses occur<br />

during the ping.<br />
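The steps above reduce to simple arithmetic. The sketch below uses invented numbers (all microsecond values are illustrative, not measurements from this paper):

```python
# Illustrative ping round-trip times (invented, in microseconds):
rtt_direct_us = 500.0     # step 1: sender and receiver directly connected
rtt_with_dut_us = 580.0   # step 2: device under test inserted in the path

# Per the procedure above, the cut-through time is the difference
# between the two ping times.
t_t_us = rtt_with_dut_us - rtt_direct_us

# Worst case in an open ring with N nodes (formula from the text),
# assuming equal per-node cut-through latencies:
#   t_max = t_e + sum(t_t,n for n = 1..N) + t_i
N = 10
t_e_us = 20.0             # egress latency at the sending node (invented)
t_i_us = 20.0             # ingress latency at the receiving node (invented)
t_max_us = t_e_us + N * t_t_us + t_i_us
t_avg_us = t_max_us / 2   # expected average in a sane redundant ring

print(t_t_us, t_max_us, t_avg_us)  # 80.0 840.0 420.0
```

The worst case grows linearly with the ring size, which is why per-node cut-through latency dominates the total ring performance.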

The same principle was used to measure the network throughput.<br />

To ensure that performance limitations are detected, both sender<br />

and receiver have to perform with a higher throughput than the<br />

actual device under test. Our setup therefore used known-good<br>
deterministic software HSR devices, which perform<br>

worse in terms of latency, but very well in terms of throughput<br />

compared to hardware devices.<br />

Figure 5 shows the principle of the test setup with three boards. Each<br>
node is a different evaluation board configured<br>
as a DANH (Double Attached Node HSR).<br>
Node A is the sending device, node B is the device under<br>
test (DUT) and node C is the receiver. At the beginning of every<br>
test we connect nodes A and C directly and measure the ping<br>
round-trip time of the two devices in both directions, from A to C and<br>

C to A (Route A). In the next step we connect node B like in<br />

figure 5 below and disconnect route A. By doing this, we ensure<br />

that the system is forced to reroute the ping-test via node B<br />

which is the DUT. At the end we close route A again. Repeating<br />

the test should show us the same results as in the previous test<br />

from A to C. In the above-mentioned test via node B (route A open)<br>

the signal takes one more hop and therefore experiences a minor<br>
delay when passing through node B. This delay is called the<br>
cut-through time t_t.<br>




The main advantages of the software solution are clearly the<br>
cost and the flexibility. But, as the measured delays (t_t) in table 3<br>

show, the hardware implementation from Microchip and<br />

Renesas can forward the frames without any help from the CPU<br />

in times below 1 µs. The software solution from NXP on the<br>

other hand needs an average of 40 µs to forward a frame. The<br />

LS1021A TSN board needs 80 µs. This additional delay is likely<br />

the result of the Real-Time extension which guarantees better<br />

determinism at the cost of lower performance. All results were<br />

measured without additional CPU load. In real-world<br />

applications, there usually would be an application running in<br />

parallel on the CPU, which could have an unpredictable impact<br />

on the software bridging and HSR implementation. This might<br />

result in even larger forwarding delays. The conclusion is that<br>
for deterministic applications like TSN, a hardware-implemented<br>
redundancy (as defined in IEEE802.1CB) seems<br>

inevitable. Otherwise it is impossible to guarantee determinism<br />

for the system. For further details about the results and test setup,<br />

please contact the authors.<br />

Figure 5 HSR Ring<br>

C. Results<br />

From our test we got four different solutions. Two with HSR<br />

software implementation and two with hardware<br />

implementation. From NXP [39] we used the LS1021A TSN<br />

reference design (running Linux and the PREEMPT_RT Linux<br />

patch) and the LS1021A Tower Board (running mainline Linux<br />

4.14). Both are equipped with a dual core Cortex-A7 and two<br />

SGMII (Serial Gigabit Medium Independent Interface) ports,<br />

each with a single PHY. Both boards enable the HSR bridge<br />

function between the two ports on a software basis. As a first<br />

examples of HSR hardware implementations, we used the<br />

Microchip [40] 7 port KSZ9477 switch attached to a SAMA5<br />

Cortex-A5 MPU as management CPU. The second hardware<br />

implementation example is Renesas’ RZ/N1D [41], an<br />

integrated solution with a dual Core-A7 and an on-chip 5 port<br />

real-time Ethernet switch. The following table shows the<br>
average cut-through latency results t_t (measured with 100 pings).<br>

DUT | t_AtoC / µs | t_CtoA / µs | t_t / µs<br>

MCP


REFERENCES<br />

[1] https://www.hms-networks.com/images/librariesprovider6/defaultalbum/company-images/network-shares-according-tohms.jpg?sfvrsn=ff60d2d6_2<br />

[2] http://avnu.org/<br />

[3] http://www.ieee802.org/1/pages/avbridges.html,<br />

[4] http://www.ieee802.org/1/pages/tsn.html<br />

[5] http://www.ieee802.org/1/pages/802.1AS-rev.html<br />

[6] http://standards.ieee.org/findstds/standard/802.1Qbv-2015.html<br />

[7] http://standards.ieee.org/findstds/standard/802.1Qbu-2016.html<br />

[8] http://www.ieee802.org/1/pages/802.1ca.html<br />

[9] http://www.ieee802.org/1/pages/802.1cc.html<br />

[10] http://www.ieee802.org/1/pages/802.1ci.html<br />

[11] http://www.ieee802.org/1/pages/802.1cb.html<br />

[12] http://standards.ieee.org/findstds/standard/802.3br-2016.html<br />

[13] http://www.ieee802.org/1/pages/802.1ch.html<br />

[14] http://www.ieee802.org/1/pages/802.1cp.html<br />

[15] http://www.ieee802.org/1/pages/802.1cr.html<br />

[16] http://ieeexplore.ieee.org/document/7466451/<br />

[17] http://www.elektroniknet.de/elektronik-automotive/bordnetzvernetzung/viel-mehr-als-nur-echtzeit-141430.html<br />

[18] http://www.strategiekreis-automobilezukunft.de/public/projekte/seis/das-sichere-ip-basiertefahrzeugbordnetz/pdfs/TP2_Vortrag4.pdf<br />

[19] D. Pannel and J. Bergen, "IEEE TSN Standards Overview &<br />

Update," Marvell, 2015.<br />

[20] Dr. René Hummen, Stephan Kehrer and Dr. Oliver Kleineberg,<br />

"TSN-Time-Sensitive-Networking-White-Paper-<br />

EMEA_EN.pdf," Hirschmann, 2016.<br />

[21]https://zenodo.org/record/263879/files/2016ETFA-TUBS.pdf<br />

[22] U. Schulze, "Keine Zeit verschwenden," iX, no. 1, pp. 94-96, 2018.<br />

[23] https://mentor.ieee.org/802.24/dcn/17/24-17-0020-00-sgtgcontribution-time-sensitive-and-deterministic-networkingwhitepaper.pdf<br />

[24] http://ieeexplore.ieee.org/document/1309630<br />

[25] http://ieeexplore.ieee.org/document/4039960/<br />

[26] http://netmodule.com/en/technologies/industrialethernet/IEC62439<br />

[27] https://tools.ietf.org/html/rfc624<br />

[28] http://www.yang-central.org/twiki/bin/view/Main/WebHome<br />

[29] https://www.nxp.com/docs/en/user-guide/OPEN-LINUX-IND-<br />

UM.pdf<br />

[30] http://opcconnect.opcfoundation.org/2017/12/opc-ua-over-tsn-anew-frontier-in-ethernet-communications/<br />

[31] https://ias.vdma.org/viewer/-/article/render/15646006<br />

[32] https://www.odva.org/Optimization-40/Optimization-of-Machine-<br />

Integration-OMI<br />

[33] http://www.hirschmann.de/de/Hirschmann/Industrial_Ethernet/<br />

Netzmanagement/Industrial_HiVision_Network_Management_D<br />

E/index.phtml1<br />

[34] https://www.tttech.com/products/industrial/deterministicnetworking/network-configuration/slate-xns/<br />

[35] A. Hennecke and S. Weyer, "http://www.computerautomation.de/feldebene/vernetzung/artikel/143840/,"<br />

[Online].<br />

[36] George A. Ditzel and Paul Didier, "Time Sensitive Network (TSN)<br>
Protocols and use in EtherNet/IP Systems," 2015 ODVA Industry<br>
Conference & 17th Annual Meeting, Frisco, Texas, USA,<br>
October 13-15, 2015.<br>

[37] https://en.wikipedia.org/wiki/Ping_(networking_utility)<br />

[38] https://iperf.fr/<br />

[39] https://www.nxp.com/support/developer-resources/referencedesigns/time-sensitive-networking-solution-for-industrialiot:LS1021A-TSN-RD?fsrch=1&sr=1&pageNum=1<br />

[40] http://www.microchip.com/DevelopmentTools/<br />

ProductDetails.aspx?PartNO=EVB-KSZ9477<br />

[41] https://www.renesas.com/en-eu/products/microcontrollersmicroprocessors/rz/rzn/rzn1d.html#productInfo<br />

[42] https://www.iiconsortium.org/press-room/11-28-17.htm<br />

[43] https://lni40.de/<br />



Demystifying Time Aware Traffic Shaping<br />

Technologies for TSN<br />

A Case Study for Linux Driver Enabling<br />

Ong, Boon Leong<br />

Internet of Things Group<br />

Intel Corporation<br />

Penang, Malaysia<br />

boon.leong.ong@intel.com<br />

Abstract— The evolution of IEEE802.1 Audio Video Bridging<br />

(AVB) Task Group to Time-Sensitive Network (TSN) Task Group<br />

and the creation of Avnu Alliance have generated much attention<br />

in industrial, automotive and professional Audio/Video<br />

applications. As hardware Intellectual Properties are created<br />

according to IEEE standards, software components are developed<br />

to enable them. This paper provides a concise yet extensive<br>
introduction to various TSN technologies and the readiness of the TSN<br>
framework in the Linux world. In addition, the paper describes a<br>

modular approach that has been taken in-house to develop a<br>
TSN-capable Ethernet driver.<br>

Keywords— Linux; Time Sensitive Network; TSN; IEEE802.1<br />

Qav; IEEE802.1 Qbv; IEEE802.1 Qbu; Frame Preemption; Gate<br />

Control; Credit Based Shaper; Traffic Class; Ethtool; Networking.<br />

I. INTRODUCTION<br />

The Internet has been around for decades and the range of<br />

applications that are powering it has grown tremendously from<br />

simple web pages with text and pictures, Internet Relay Chat to<br />

on-demand streaming of audio/video contents, Voice over<br />

Internet Protocol telephony and two-way interactive video calls.<br />

The need for networking bandwidth too has sky-rocketed, from<br>
Fast Ethernet to Gigabit Ethernet (1G), 10G, 40G, 100G and<br>
beyond in the data-center backhaul. Ethernet technology was<br>
originally designed to provide best-effort delivery for lightly<br>
loaded networks and has since been enhanced to provide traffic<br>
prioritization for certain data-streaming applications that need<br>
bandwidth reservation.<br>

The dawn of the Internet of Things and market movements such<br>
as Industry 4.0 (“Smart Factory”) and autonomous vehicles are<br>
driving Ethernet technology to provide data transfer in a reliable<br>
and timely manner. This is known as Time-Sensitive<br>
Networking, and “TSN” has become a buzzword in the<br>
Ethernet technology domain, especially for automotive and<br>
industrial automation applications, ever since the evolution of the<br>
IEEE802.1 Audio Video Bridging (AVB) Task Group (TG) into the<br>
TSN TG, because the scope of the former TG has grown beyond<br>

time-sensitive A/V stream. The objectives of TSN are manifold:<br />

(a) time synchronization, (b) deterministically low latency for<br />

scheduled traffics (industrial and automotive control loop) and<br />

bandwidth reserved traffics (audio and video streaming), (c)<br />

bandwidth utilization reservation and (d) fault<br />

tolerance/reliability. In layman’s terms, TSN enables networked<br>
applications/entities to interact in a real-time fashion: with bounded<br>
delay and a well-known, shared time base.<br>

For data centers and embedded devices, a Linux-based<br>
Operating System (OS) from commercial companies or<br>
community releases, e.g., Red Hat Enterprise Linux, SUSE<br>
Enterprise Linux, Ubuntu Linux, Yocto Project/OpenEmbedded-built<br>
Linux and OpenWrt, is a popular choice due to its feature-rich<br>
networking stack and broad device driver support for many<br>
different types of network interface controllers (NICs). The<br>
same argument applies to the Android mobile OS, which also<br>
uses the Linux kernel. For decades, the software industry has seen<br>

a continuous innovation on higher level protocols above the<br />

standard socket interface provided by Linux kernel and within<br />

the Linux kernel. To name a few, NAPI (New API) packet<br />

processing framework for receive interrupt mitigation,<br />

iptables for configuring Netfilter (an IP rules-based packet<br />

filtering and packet mangling), tc (traffic control) for<br />

configuring Linux kernel packet scheduler i.e. queuing<br />

discipline and XDP (eXpress Data Path) pre-stack packet<br />

processing. ethtool is also another popular utility for showing<br />

and making changes to the parameters belong to Ethernet NIC,<br />

e.g., speed, duplex mode, auto-negotiation, checksum offload,<br />

DMA ring sizes, interrupt moderation/coalesce and receive flow<br />

hashing for load-balancing across multi-queue NICs.<br />

The TSN TG belongs to the IEEE802 standards committee and is not<br>
chartered for standards certification. The Avnu Alliance was formed with<br>
the aim of creating an interoperable TSN ecosystem and has<br>
certification labs that test and certify commercial products against<br>
a rich set of conformance and interoperability tests. Though in<br>
its infancy, the Avnu Alliance already has a first set of conformance<br>
tests on time synchronization, i.e., IEEE802.1AS, and is partnering<br>



with Open Platform Communications (OPC) Foundation to<br />

provide conformance testing and certification of OPC UA over<br />

TSN devices [1] for the industrial ecosystem. To fuel the<br />

creation of AVB/TSN solutions, the OpenAvnu project [2],<br>
sponsored by the Avnu Alliance, aims primarily to provide<br>
building-block components. The project contains both<br>
GPLv2-licensed kernel driver ingredients and BSD-licensed<br>
user-space sample applications, libraries, and daemons.<br>

In this paper, the two key technologies of TSN time<br />

synchronization and traffic shaping are discussed in Section II<br />

and Section III. Section IV focuses on the current state of the art<br />

for TSN technologies in the Linux kernel and other parallel<br>
kernel-related projects such as traffic control and ethtool. An overview<br>

of the technologies offered in OpenAvnu is also described in<br />

Section IV. Section V describes the modular software<br>
architecture approach taken for enabling a TSN-capable Ethernet<br>
kernel driver despite the lack of a complete TSN framework in<br>
the Linux networking subsystem at the time this paper was created.<br>

Section VI provides two examples covering how to apply<br />

various TSN technologies for a network that carries a mix of<br />

traffic patterns: scheduled, time-sensitive and best effort.<br />

II. OVERVIEW OF TIME SYNCHRONIZATION<br />

IEEE Std. 1588-2008 [3] also popularly known as Precision<br />

Time Protocol Version 2 (PTPv2) enhances the accuracy of time<br />

synchronization between two networked nodes from<br />

millisecond (achievable by Network Time Protocol (NTP)) to<br />

microsecond or sub-microsecond. This is made possible as<br />

packet time-stamping is done at hardware level instead of<br />

software level in the case of NTP. The transport of PTP message<br />

can be over UDP/IPv4, UDP/IPv6, IEEE802.3 Ethernet and<br />

several industrial automation control protocols, e.g.,<br />

DeviceNET, ControlNET, and PROFINET.<br />

IEEE Std. 802.1AS-2011 [4] also known as generalized<br />

Precision Time Protocol (gPTP) is based on IEEE Std 1588-<br />

2008 but differs in various aspects as documented in section 7.5<br />

of the specification. For example, gPTP has a faster best master<br>
clock algorithm (BMCA) convergence time, all gPTP messaging<br>
is done only over the IEEE 802 MAC, and the gPTP time domain can<br>
span across heterogeneous networks, e.g., Ethernet,<br>

Wireless, Media over Coax Alliance and HomePlug.<br />

To bring about new enhancements and performance<br />

improvements to IEEE Std. 802.1AS-2011, IEEE 802.1ASbt<br />

was started late 2011 and eventually superseded by IEEE<br />

802.1AS-Rev which is still at the draft stage as of this writing.<br />

Examples of the enhancements are Link Aggregation support,<br>
Fine Timing Measurement for IEEE 802.11 transport, one-step<br>
processing, faster grandmaster changeover and further reduced<br>
BMCA convergence time.<br>

III. OVERVIEW OF TRAFFIC SHAPING<br />

In packet-switched computer networking technology such as<br />

Ethernet, network packets flow through an inter-connected mesh<br />

of network bridges/switches in bandwidth optimized fashion.<br />

Depending on the bandwidth utilization at the certain point of<br />

time, traffic congestions may happen unpredictably. Such<br />

randomness contributes towards varying transmission latency in<br />

the application data flow and eventually causes poor service<br />

experience to its users.<br />

Quality of Service (QoS) in computer networking is about<br />

ensuring certain application data flows are given higher priority<br />

over others. IEEE Std. 802.1Q-2005 defines (1) Virtual Local<br />

Area Network (VLAN) which includes Priority Code Point<br />

(PCP) for marking packet priority, (2) strict priority transmission<br />

selection algorithm for prioritization of traffics. Multiple queues<br />

are added in both ingress and egress side of a networked device<br />

in order to reorder higher priority packets over lower priority<br />

packets. Traffic prioritization helps improve the application<br />

service quality in a lightly-loaded network. In Ethernet<br>
technology, a frame that is in the midst of transmission must be<br>

fully transmitted together with its checksum or else it is treated<br />

as a corrupted frame. Therefore, a higher priority frame is not<br />

allowed to be transmitted until an earlier lower priority packet is<br />

completely transmitted. The transmission latency of the higher<br />

priority frame becomes greater if the earlier frame has a long<br>
payload, such as a jumbo frame. Clearly, the situation worsens if<br>

the entire network is heavily loaded. In short, traffic prioritization<br>
does not assure low transmission time with bounded latency.<br>
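A quick back-of-the-envelope sketch makes the blocking effect concrete (preamble and interframe gap are ignored for simplicity):

```python
def serialization_delay_us(frame_bytes, link_mbps):
    """Time the wire stays busy with one frame (preamble/IFG ignored)."""
    return frame_bytes * 8 / link_mbps  # bits / (Mbit/s) gives microseconds

# A high-priority frame that just missed its chance must wait until the
# lower-priority frame already on the wire has finished, e.g. a maximum
# VLAN Ethernet frame of 1522 bytes:
print(round(serialization_delay_us(1522, 100), 1))   # 121.8 us at Fast Ethernet
print(round(serialization_delay_us(1522, 1000), 2))  # 12.18 us at Gigabit Ethernet
```

Even at gigabit speed, one maximum-size frame in flight adds over 12 µs of worst-case waiting time, which is the gap that time-aware shaping and frame preemption (discussed below) aim to close.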

Internet applications such as audio/video streaming, VoIP<br />

telephony, and two-way video calls use the Real-time Transport Protocol (RTP), which<br>
runs over the User Datagram Protocol (UDP) on the Internet Protocol<br>
(IP). The data streams of such applications carry large contents<br>
and are sensitive to transmission latency. RTP streams contain a<br>

time-stamp and a sequence number which are used to manage<br />

stream transmission jitter, packet loss and out-of-order delivery. For<br>
in-vehicle infotainment or professional AV systems whereby the media<br>
source, speakers, and display unit are located close together,<br>
Ethernet-based AVB technology is a better option than RTP because AVB<br>

uses IEEE 1722 AV Transport Protocol Layer 2 payload to carry<br />

multiple streams and has less header overhead, no IP and UDP<br />

headers.<br />

Both AVTP and RTP media streams require low bounded<br />

latency and latency variation in the packet-switched network and<br />

this means reserving transmission bandwidth for AV streams on<br />

the usually congested network. The recommendation to map and<br />

regenerate VLAN tag encoded priority for bandwidth reserved<br />

streams and a controlled-bandwidth queue draining algorithm<br>
called Credit-Based Shaper (CBS) are defined in IEEE Std.<br>

802.1Qav-2009 [5]. CBS, in essence, is a means to space out AV<br>

streams as far as possible to prevent the formation of long bursts<br />

of high priority traffic that both (1) degrade QoS offered by<br />

lower priority traffic classes and (2) interfere with other high<br />

priority traffic [6]. IEEE Std. 802.1Qat-2010 [7] describes<br />

Stream Reservation Protocol (SRP) for registering and<br />

deregistering AV streams and their associated Traffic<br />

Specification (TSpec). SRP has been implemented on top of an<br />

existing network management protocol called the Multiple<br />

Registration Protocol (MRP). Both Multiple VLAN Registration<br />

Protocol (MVRP) and Multiple MAC Registration Protocol<br />

(MMRP) use MRP and may be used with SRP [8].<br />

It is worth noting that, per section 33.6.1 of [5], end stations that<br>
are SR talkers shall apply the CBS algorithm to the per-stream queue<br>
and the per-traffic class queue. The TSpec that describes the bandwidth<br>

reservation for an SR class, for specification scalability reasons,<br>
does not include the overhead of the underlying Ethernet MAC<br>
service. Section 34.4 of [5] describes the way to calculate frame-level<br>
bandwidth requirements (used as the CBS idle slope) based on the SR<br>
stream TSpec.<br>
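A hedged sketch of that frame-level calculation follows. The 42-byte wire overhead is the usual Ethernet framing value and the TSpec numbers are invented; section 34.4 of [5] should be consulted for the normative formula:

```python
# Per-frame wire overhead: preamble+SFD (8) + Ethernet header with VLAN
# tag (18) + FCS (4) + interframe gap (12) = 42 bytes (assumed values).
WIRE_OVERHEAD_BYTES = 8 + 18 + 4 + 12

def idle_slope_bps(max_payload_bytes, frames_per_interval, interval_us):
    """Frame-level reserved bandwidth, used as the CBS idleSlope.

    The TSpec itself excludes MAC overhead (for scalability, per [5]),
    so the overhead is added back at frame level before converting
    to bits per second.
    """
    wire_bytes = max_payload_bytes + WIRE_OVERHEAD_BYTES
    return wire_bytes * 8 * frames_per_interval * 1_000_000 / interval_us

# Invented TSpec: one 1000-byte payload per SR class A interval (125 us).
print(idle_slope_bps(1000, 1, 125))  # 66688000.0 bits/s, i.e. ~66.7 Mbit/s
```

The result is the rate at which CBS credit is replenished; a talker reserving this stream would program it as the idleSlope of the SR queue.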



Control applications in automotive and industrial networks<br />

require even lower latencies than AV applications, and their traffic<br>

pattern is categorized as scheduled traffic by IEEE Std.<br />

802.1Qbv-2015 [9]. IEEE Std. 802.1Qbv-2015 defines Time-<br />

Aware Shaper (TAS) whereby the selection of transmit frame<br />

from transmit queue is controlled by the associated gate control<br />

that opens or closes based on a pre-defined time schedule called<br />

gate control list (gate, open/close, time interval). To protect<br />

scheduled traffic from being delayed by other traffic, a guard<br>
band with a duration as long as the time for transmitting the longest<br>
VLAN Ethernet frame (1522 bytes) is set before the gate open time<br>

for scheduled traffic. In other words, there shall be no frame<br>

transmission, i.e. loss of bandwidth usage, during guard band in<br />

order for scheduled traffics to be selected for transmission<br />

without delay.<br />
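A gate control list can be modeled as a cyclic schedule of (open-gates, interval) entries. The cycle and intervals below are purely illustrative, not taken from any standard profile:

```python
# Gate control list: (set of open traffic classes, interval in microseconds).
# Illustrative 250 us cycle: the scheduled class (TC7) gets an exclusive
# 50 us window, then the best-effort classes share the remainder.
GCL = [
    ({7}, 50),
    ({0, 1, 2, 3}, 200),
]
CYCLE_US = sum(interval for _, interval in GCL)

def open_gates(t_us):
    """Return the traffic classes whose transmit gate is open at t_us."""
    t = t_us % CYCLE_US          # the schedule repeats every cycle
    for gates, interval in GCL:
        if t < interval:
            return gates
        t -= interval

print(open_gates(10))    # {7}: inside the scheduled-traffic window
print(open_gates(120))   # {0, 1, 2, 3}: best-effort window
print(open_gates(260))   # {7}: 260 us wraps around into the next cycle
```

A frame is only eligible for transmission selection while its traffic class appears in the returned set, which is what gives scheduled traffic its contention-free window.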

To reduce the effect of bandwidth loss in guard band, IEEE<br />

Std. 802.3br-2016 [10] enhances the capability of Media Access<br />

Control (MAC) to include express MAC (eMAC), preemptable<br />

MAC (pMAC) and MAC Merge sublayers for the purpose of<br />

interspersing express traffic (scheduled traffic) with<br />

preemptable traffic into a normal Ethernet frame transparent to<br />

Physical Layer. IEEE Std. 802.1Qbu-2016 [11], complementary<br />

to IEEE Std. 802.3br, defines a means to (1) map traffic priority<br>

to frame preemption (FPE) status (express or preemptable) and<br />

(2) hold or release the transmission of preemptable frames in<br />

pMAC. Since preemption occurs only if at least 60 bytes of the<br />

preemptable frame have been transmitted and at least 64 bytes<br />

(including the frame CRC) remain to be transmitted, the guard<br />

band for scheduled traffic can be reduced to as small as 64 bytes.<br />
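The 60/64-byte constraint can be stated directly as a predicate (a simplified check; the standards bound fragment counts and sizes in more detail):

```python
MIN_SENT_BYTES = 60    # minimum already-transmitted fragment size
MIN_REMAIN_BYTES = 64  # minimum remainder, including the frame CRC

def can_preempt(frame_len, bytes_sent):
    """True if an express frame may cut in at this point of transmission."""
    remaining = frame_len - bytes_sent
    return bytes_sent >= MIN_SENT_BYTES and remaining >= MIN_REMAIN_BYTES

print(can_preempt(1522, 100))  # True: both fragments are large enough
print(can_preempt(1522, 30))   # False: sent fragment still below 60 bytes
print(can_preempt(120, 60))    # False: only 60 bytes would remain (< 64)
```

Because a preemptable frame can be cut almost anywhere past its first 60 bytes, the guard band no longer has to cover a full 1522-byte frame, only the small remainder that cannot be fragmented.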

Prior to the introduction of TAS and FPE, the capability to<br />

determine the time to pre-fetch data from system memory into<br />

NIC internal memory and the transmission time of a packet from<br />

internal memory to physical line is available in Intel Ethernet<br />

Controller I210. Both of these times (prefetching and<br>
transmission) of a packet are calculated based on the per-packet<br>
LaunchTime value set (in 32-nanosecond units) in the associated<br>

transmission descriptor entry. LaunchTime is available on<br />

stream reservation (SR) transmission queue and not in best effort<br />

queue. By assigning SR queue with higher priority than best<br />

effort queue and proper configuration of LaunchTime, we can<br />

segregate time-sensitive traffics from best effort traffics, i.e., a<br />

close resemblance of traffic pattern modulated by TAS.<br />
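Since the LaunchTime value is expressed in 32-nanosecond units, a driver has to quantize the desired transmit time when filling the descriptor. The rounding policy below is an assumption for illustration, not taken from the I210 datasheet:

```python
LAUNCH_TIME_UNIT_NS = 32  # LaunchTime granularity stated for the I210

def to_launch_ticks(launch_ns):
    """Quantize a desired launch time (ns) to descriptor ticks.

    Rounding down is an assumed policy: it avoids launching later
    than requested; the hardware's actual behavior would need to be
    confirmed against the datasheet.
    """
    return launch_ns // LAUNCH_TIME_UNIT_NS

def quantization_error_ns(launch_ns):
    """Nanoseconds lost when the launch time is quantized to ticks."""
    return launch_ns - to_launch_ticks(launch_ns) * LAUNCH_TIME_UNIT_NS

print(to_launch_ticks(1_000_000))        # 31250 ticks for a 1 ms offset
print(quantization_error_ns(1_000_005))  # 5 ns lost to quantization
```

The sub-32 ns quantization error is far below the microsecond-scale latencies discussed here, which is why LaunchTime can approximate TAS-style traffic segregation.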

IV. CURRENT STATE OF ART: TSN SUPPORT IN LINUX<br />

This section provides a glimpse of what is currently<br />

supported in OpenAvnu project and Linux kernel as of this<br />

writing.<br />

A. OpenAvnu project<br />

Table I tabulates a partial list of software components<br />

available under the OpenAvnu [2] open source project. The I210 driver<br>
(igb_avb.ko) is an out-of-tree Linux kernel module (maintained<br>

outside of Linux mainline) with an intention to offer hardware<br />

capabilities such as (1) direct transmit and receive descriptor<br />

ring and data buffer access (known as media queue [12]) with<br />

LaunchTime configuration, (2) hardware PTP clock access, (3)<br>

CBS configuration, and (4) receive flex filter configuration<br />

through I210 user-space library (libigb). The driver<br />

architecture bypasses the TCP/IP suite in Linux networking<br />

stack and avoids its associated data path latency because<br />

IEEE1722 AVTP frame is Layer 2 protocol without TCP and IP<br />

headers.<br />

Essentially, OpenAvnu sample applications and gPTP<br />

daemon listed in Table 1 use the Application Programming<br />

Interface (API) defined by libigb directly for all time-sensitive<br>
data path operations. The drawbacks of this approach are<br>
its scaling capability as the Linux mainline evolves and the<br>
ease of porting other types of Ethernet NIC to the OpenAvnu<br>
framework.<br>

TABLE I. PARTIAL LIST OF OPENAVNU SOFTWARE<br>
Component | Directory Path | Description<br>
MRP Daemon | daemons/mrpd | MRP daemon that supports MSRP, MMRP & MVRP.<br>
gPTPd | daemons/gptp | gPTP daemon.<br>
MAAP | daemons/maap | MAC Address Acquisition Protocol for allocating multicast MAC address for AVTP.<br>
AVTP talker & listener | examples/simple_talker, examples/simple_rx | Sample AVTP talker & listener that register with MRPD and use the assigned VLAN ID & PCP for Ethernet traffic prioritization.<br>
AVTP Live Stream | examples/live_stream | Sample AVTP talker and listener that can be piped with other AV applications, e.g. GStreamer and ALSA.<br>
MRP Client | examples/mrp_client | Sample MRP client to demonstrate stream joining/leaving and MMRP or MVRP query.<br>
I210 driver & library | kmod/igb, lib/igb | Linux kernel module and library for Intel Ethernet NIC I210 to demonstrate AVB.<br>

B. Linux and its ecosystem<br />

The linuxptp project [13] contains time-synchronization user-space programs (ptp4l and phc2sys) to synchronize (1) the hardware PTP clock (in the Ethernet NIC) with the master clock (remote end station), and (2) the hardware PTP clock with the system clock (local end station). Unlike OpenAvnu gPTPd, the linuxptp project uses the APIs of modern Linux mainline and makes use of ethtool to query and set hardware time-stamping of the Ethernet NIC. The per-packet transmit or receive time-stamp value recorded in the transmit or receive descriptor by the Ethernet controller is subsequently passed to ptp4l as a control message with cmsg_level SOL_SOCKET, cmsg_type SCM_TIMESTAMPING and payload type struct scm_timestamping [14].<br>
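As an illustrative sketch (not taken from ptp4l itself), the receive side of this mechanism walks the ancillary data of a message looped back on the socket error queue. The structure below mirrors struct scm_timestamping from <linux/errqueue.h>; the helper name and the fallback constant are assumptions for this example.

```c
/* Sketch of how a user-space program such as ptp4l reads a hardware
 * timestamp back from the kernel: the sent packet is looped on the
 * socket error queue (recvmsg() with MSG_ERRQUEUE) with an attached
 * SOL_SOCKET / SCM_TIMESTAMPING control message. */
#include <string.h>
#include <sys/socket.h>
#include <time.h>

#ifndef SCM_TIMESTAMPING
#define SCM_TIMESTAMPING 37          /* equals SO_TIMESTAMPING on Linux */
#endif

/* Local mirror of struct scm_timestamping:
 * ts[0] = software, ts[1] = legacy, ts[2] = raw hardware timestamp. */
struct scm_timestamping_sketch {
    struct timespec ts[3];
};

/* Walk the control messages of a received msghdr and pull out the raw
 * hardware timestamp slot; returns 0 on success, -1 if not present. */
static int extract_hw_timestamp(struct msghdr *msg, struct timespec *hw)
{
    for (struct cmsghdr *c = CMSG_FIRSTHDR(msg); c != NULL;
         c = CMSG_NXTHDR(msg, c)) {
        if (c->cmsg_level == SOL_SOCKET &&
            c->cmsg_type == SCM_TIMESTAMPING) {
            struct scm_timestamping_sketch t;
            memcpy(&t, CMSG_DATA(c), sizeof(t));
            *hw = t.ts[2];           /* raw hardware timestamp slot */
            return 0;
        }
    }
    return -1;
}
```

In a real program the msghdr would be filled by recvmsg() on the error queue; the parsing loop itself is the part ptp4l-like tools share.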

The Linux traffic control subsystem provides traffic shaping on egress and traffic policing on ingress. The user-space application tc, developed in parallel with the Linux kernel, is used to configure the traffic control operation inside the kernel, e.g. queue disciplines (qdisc) for scheduling, classes for shaping and filters for classification or policing.<br>

IEEE 802.1Qav Credit Based Shaper support has recently been added to Linux mainline [15][16][17]. It is implemented as the cbs classful qdisc with user-configurable parameters (locredit, hicredit, sendslope and idleslope) set through the tc application, and it supports a software fallback for NICs that do not have hardware CBS capability. The Multiqueue Priority Qdisc (mqprio) is a classful qdisc that maps traffic flows to hardware queues: a traffic flow (as identified by the socket option SO_PRIORITY, 16 priorities in total) is mapped to a traffic class, which is 1:1 mapped to a hardware queue. To attach CBS to a hardware queue, a cbs qdisc is attached as a child of the mqprio qdisc. Thus, cbs qdisc support for per-traffic-class queues is available in Linux mainline.<br>

There has been recent discussion and a Request For Comments (RFC) submission on the netdev mailing list about LaunchTime technology [19]. The technique stores the per-packet LaunchTime value in a control message (cmsg) with cmsg_level SOL_SOCKET, cmsg_type SO_TXTIME and a 64-bit unsigned integer payload. Control messages are sent to the Linux kernel through the sendmsg() socket API. The tbs qdisc was proposed as a means to reorder transmit frames before they are committed to the network device hardware queue.<br>
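A minimal sketch of the per-packet interface proposed in [19] is shown below. Note that this interface was not yet in mainline at the time of writing; the SCM_TXTIME constant (which upstream later defined equal to SO_TXTIME, value 61) and the helper name are assumptions for this example.

```c
/* Sketch of the proposed per-packet LaunchTime interface: the
 * application places a 64-bit absolute transmit time (nanoseconds)
 * into a SOL_SOCKET control message handed to sendmsg(). */
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

#ifndef SCM_TXTIME
#define SCM_TXTIME 61                /* matches the later mainline value */
#endif

/* Fill the first control message of msg with the launch time; the
 * caller must have set msg_control/msg_controllen to a buffer of at
 * least CMSG_SPACE(sizeof(uint64_t)) bytes before calling. */
static void set_launch_time(struct msghdr *msg, uint64_t txtime_ns)
{
    struct cmsghdr *c = CMSG_FIRSTHDR(msg);
    c->cmsg_level = SOL_SOCKET;
    c->cmsg_type = SCM_TXTIME;
    c->cmsg_len = CMSG_LEN(sizeof(txtime_ns));
    memcpy(CMSG_DATA(c), &txtime_ns, sizeof(txtime_ns));
    msg->msg_controllen = CMSG_SPACE(sizeof(txtime_ns));
}
```

After this call the msghdr can be passed to sendmsg(); a qdisc such as the proposed tbs then releases the frame at the requested time.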

The decades-old ethtool application, developed in parallel with Linux mainline, has primarily been used to query and control general hardware settings of Ethernet NICs and PHYs. In addition, ethtool covers the settings for advanced receive flow classification and steering capabilities such as the receive flow hash indirection hardware table and the receive side scaling hash key. Based on discussion [18] on the Linux netdev mailing list, the community is leaning towards the Linux traffic control subsystem instead of ethtool for enabling the TSN-related traffic scheduling and shaping capabilities discussed above. To summarize the state of TSN support in Linux mainline today: there is no CBS support for per-stream queues, no per-traffic-class gate control (IEEE 802.1Qbv), no per-transmission-port frame preemption (IEEE 802.1Qbu) and no LaunchTime technology.<br>

V. CASE: LINUX DRIVER ENABLEMENT FOR TSN<br />

For the purpose of validating the health of a next-generation TSN-capable Ethernet NIC before design tape-in, we have developed a Linux kernel driver that covers the traffic shaping capabilities LaunchTime, IEEE 802.1Qbv and IEEE 802.1Qbu. As seen in Section IV, the TSN framework is still in the software architecture definition phase in Linux mainline as of this writing. Therefore, to ensure that source code developed today can be easily adapted to the future TSN framework in Linux mainline, we have taken a modular approach for the kernel driver, which has two sub-components: (1) the TSN Core Library and (2) the TSN Glue Logic. For LaunchTime, we adopted the approach contributed in [19]. In the ensuing sections, for brevity, we refer to IEEE 802.1Qbu frame preemption as FPE and to the IEEE 802.1Qbv Enhancements for Scheduled Traffic as EST.<br>

The TSN Core Library contains a collection of functions implemented for the TSN hardware capabilities in the Ethernet controller, serving two purposes: (1) hooks for the device driver frameworks' general initiation and setup (Fig. 1), and (2) run-time, user-driven input for TSN configuration (Fig. 2).<br>

Fig. 1. Functions for TSN Init, Setup, ISR and Enable/Disable.<br />

Fig. 2. Functions for TSN Configuration.<br />

Fig. 3. TSN Configuration via ethtool ioctl.<br />

The Linux kernel has well-defined PCI driver and network device frameworks that include a list of common management callback functions (hooks), as displayed in Fig. 1. Fig. 1 shows how these common management hooks are associated with six TSN Core Library functions, whose purposes are tabulated in Table II. Across multiple Linux versions, the design of the PCI driver and network device frameworks is fairly stable and mature, so the implementation of the functions in Table II is fairly future-proof. IEEE 802.1Qbu [11] Annex R.2 describes how the EST and FPE capabilities can be used in isolation, which is why separate enable/disable functions were developed. IEEE 802.3br [10] Section 99.4.2 describes the method to determine link partner preemption capability through the use of verify and response mPackets. When a user enables frame preemption through set_fpe_enable(), a verify mPacket is sent to the link partner automatically. In return, a frame-preemption-capable link partner sends a response mPacket, which is processed within the interrupt service routine fpe_irq_status(). Likewise, when the physical link does not support full-duplex mode, frame preemption is automatically disabled, as is the automatic mPacket response from the local station.<br>
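The handshake just described can be summarized by a small state sketch. This is an illustrative model of the behavior, not the actual driver source; the type and function names are hypothetical.

```c
/* Hedged sketch of the IEEE 802.3br verify/response handshake:
 * enabling FPE sends a verify mPacket; preemption becomes active only
 * after the link partner's response mPacket arrives, and a drop to
 * half-duplex disables it again. */
#include <stdbool.h>

enum fpe_state { FPE_OFF, FPE_VERIFYING, FPE_ACTIVE };

struct fpe {
    enum fpe_state state;
    bool verify_sent;                /* verify mPacket sent to partner */
};

static void set_fpe_enable_sketch(struct fpe *f, bool enable)
{
    if (enable && f->state == FPE_OFF) {
        f->verify_sent = true;       /* hardware emits the verify mPacket */
        f->state = FPE_VERIFYING;
    } else if (!enable) {
        f->state = FPE_OFF;          /* also stops the auto response */
        f->verify_sent = false;
    }
}

/* Called from the interrupt path when a response mPacket is received. */
static void fpe_irq_response_sketch(struct fpe *f)
{
    if (f->state == FPE_VERIFYING)
        f->state = FPE_ACTIVE;
}

/* Preemption requires full duplex; disable it on a half-duplex link. */
static void fpe_link_change_sketch(struct fpe *f, bool full_duplex)
{
    if (!full_duplex)
        set_fpe_enable_sketch(f, false);
}
```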

TABLE II. TSN CORE LIBRARY FUNCTIONS FOR DEVICE DRIVER FRAMEWORK<br>

Function Name | Description<br>
tsn_init | Discover hardware capabilities, e.g. support for FPE and EST and the depth of the gate control list.<br>
tsn_setup | Set up interrupts for TSN, e.g. gate control errors (EST) and preemption support in the link partner (a) (FPE).<br>
est_irq_status, fpe_irq_status | Interrupt service routines for EST and FPE.<br>
set_est_enable, set_fpe_enable | Enable/disable EST and FPE independently (b).<br>
a. IEEE 802.3br Section 99.4.2.<br>
b. IEEE 802.1Qbu Annex R.2, preemption used in isolation.<br>

TABLE III. TSN CORE LIBRARY FUNCTIONS FOR RUN-TIME CONFIGURATION<br>

Function Name | Description<br>
set_est_gce | Configure a gate control entry according to index: command, gate state and time interval. EST command: SetGateStates (a). FPE commands: Set-And-Hold-MAC (b) and Set-And-Release-MAC (b).<br>
set_est_gcl_len, get_est_gcl_len | Set/get the length of the gate control list as bounded by the depth of the list discovered by tsn_init().<br>
set_est_gcl_times | Set the gate-control-associated time parameters (c) AdminBaseTime, AdminCycleTime and AdminCycleTimeExtension, and trigger a GCL change from the Admin copy to the Operational copy.<br>
get_est_gc_cfgs | Get the gate control configurations in the state machines, i.e. the gate control list and its associated time parameters.<br>
reconfigure_cbs | Reconfigure the CBS IdleSlope parameter of a per-traffic-class queue based on total gate control open time (d).<br>
set_fpe_config, get_fpe_config | Set/get the preemption state of a traffic class queue.<br>
get_fpe_pmac_sts | Get the current pMAC holdRequest state (e).<br>
get_tsn_err_stat, clear_tsn_err_stat | Get/clear error status related to EST & FPE.<br>
set_tsn_tunables, get_tsn_tunables | Set/get tunable parameters for hardware tuning/offset and TSN standards, e.g. holdAdvance and releaseAdvance for FPE (e).<br>
a. IEEE 802.1Qbv Table 8-6.<br>
b. IEEE 802.1Qbu Table 8-6.<br>
c. IEEE 802.1Qbv Section 8.6.9.4.<br>
d. IEEE 802.1Qbv Section 8.6.8.2.<br>
e. IEEE 802.1Qbu Section 12.30.1.<br>

Fig. 2 shows the part of the TSN Core Library that is mainly developed for user-driven TSN run-time configuration, such as setting the gate control list and getting the holdRequest state of the pMAC. A functional summary of these TSN functions is provided in Table III. The Gate Control List (GCL) is a collection of gate control entries that define the gate operational state (open or closed) and the interval/duration of the operation. A gate control is 1:1 associated with a traffic-class queue. The set_est_gce() function is used repeatedly, each time with a different entry index, to program the GCL. To set the length of the gate control list, set_est_gcl_len() is used. The main reason for not implementing a set_est_gcl() that programs the entire GCL is to offer the flexibility of modifying individual gate control entries of an older GCL at will before committing the newly updated GCL to the state machine. The set_est_gcl_times() function is meant for setting time-related GCL parameters such as base time, cycle time and cycle time extension, and it implicitly triggers a GCL commit to the hardware, i.e. a switch of GCL ownership from the administrative copy to the operational copy, as described in IEEE 802.1Qbv [9] Section 8.6.9.4.7. Therefore, there is no need to offer a separate function to set ConfigChange. IEEE 802.1Qbv Section 8.6.8.2 discusses the formula to adjust the CBS idleSlope value inversely to the total gate open time during the gating cycle for the queue; it is implemented in reconfigure_cbs(), which is executed whenever (1) the idleSlope value set by the tc command changes, as discussed in Section IV part B, or (2) a new GCL or gate control entry is configured.<br>
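The inverse relationship can be illustrated with a small helper. This reflects our reading of the statement above (when a gate is open for only a fraction of the cycle, idleSlope is scaled up by cycleTime/gateOpenTime to preserve the reserved rate); it is a sketch, not a verbatim transcription of the standard's formula or of reconfigure_cbs().

```c
/* If a queue's gate is open for only open_ns out of every cycle_ns,
 * the credit-based shaper can transmit during that fraction alone, so
 * the configured idleSlope is scaled up to keep the effective reserved
 * bandwidth unchanged. */
static unsigned long long scaled_idle_slope(unsigned long long idle_slope_bps,
                                            unsigned long long cycle_ns,
                                            unsigned long long open_ns)
{
    return idle_slope_bps * cycle_ns / open_ns;
}
```

For example, a 10 Mbit/s idleSlope on a queue whose gate is open for 25% of the gating cycle would be reconfigured to 40 Mbit/s.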

The value of framePreemptionAdminStatus, as defined in IEEE 802.1Qbu [11] Section 12.30.1.1.1, deserves some further explanation. The “priority” mentioned in the standard for framePreemptionAdminStatus refers to the frame priority as defined by the VLAN tag PCP (8 priorities in total). The per-frame-priority value of framePreemptionAdminStatus specifies whether a frame with that priority shall be transmitted using the preemptable or the express MAC service. As multiple frames with different priorities may be enqueued to the same traffic class queue, the values of framePreemptionAdminStatus set for these priorities must be consistent. As discussed in Section IV part B, the priority of a transmit frame is specified per socket session through the socket option SO_PRIORITY (per-stream priority). With hardware offload, the mqprio qdisc maps these socket priorities to traffic class queues, which in turn are 1:1 mapped to hardware transmit queues in the Ethernet controller. Based on the above, we conclude that it is sufficient to specify the preemption configuration at the hardware queue level instead of the data stream level, and this is what the set_fpe_config() function was designed for.<br>

The TSN Glue Logic, shown in Fig. 2 and Fig. 3, is a thin layer of logic that binds the TSN Core Library to a higher-layer networking subsystem such as traffic control or ethtool. Before the discussion of the TSN framework on the Linux mailing list [18], and for ease of prototyping for hardware validation, the TSN Glue Logic was implemented to hook the TSN Core Library into the ethtool subsystem as shown in Fig. 3. The areas of development for this purpose are as follows:<br>

Linux kernel: include/uapi/linux/ethtool.h, include/linux/ethtool.h and net/core/ethtool.c.<br>

ethtool application [20]: ethtool-copy.h and ethtool.c.<br>

There are two ethtool.c files shown in Fig. 3, labelled Circle-1 and Circle-2, for user space and kernel space respectively. The user-space ethtool.c mainly contains operations to parse string-based inputs for commands and TSN parameters into the data structures newly introduced for TSN capabilities in ethtool-copy.h. The same set of data structure definitions is mirrored in include/uapi/linux/ethtool.h, a file in the Linux kernel project. The kernel-space ethtool.c, on the other hand, covers data marshalling operations, i.e. data copying between kernel and user space for the above-mentioned input parameters and responses, and eventually calls the ethtool_ops functions implemented in TSN Glue Logic, as labelled by Circle-3. The function prototypes of ethtool_ops are defined in include/linux/ethtool.h.<br>

Fig. 4 shows an excerpt of the TSN configuration implementation that uses the software architecture we have just discussed. The command ethtool set-gcl file FILE is for setting the administrative copy of the GCL. The content of the text file named FILE is a list of gate control entries that follow the format suggested in [18]: a gate control operation, a gate state and a time interval. The gate control operation is a character chosen from the set {S, H, R}, corresponding to the {SetGateStates, Set-And-Hold-MAC, Set-And-Release-MAC} gate control operations defined in IEEE 802.1Qbv and IEEE 802.1Qbu. The next field, the gate state, is simply a hexadecimal number which represents the state of the gate controls. For example, to set the 1st, 4th and 8th gate controls open for 10 microseconds and stop the pMAC from sending preemptable frames, the gate control entry in the FILE text file is “H 0x89 10000”.<br>

Fig. 4. An example of TSN configuration implementation.<br>

To set the GCL-associated time parameters, e.g. base time, cycle time and cycle time extension, the command ethtool set-est-info cycle N.D base N.D ext N.D is used, whereby N is the numerator value of the time in seconds and D is the denominator value of the time, with up to nanosecond accuracy.<br>

From Fig. 4, it should be clear that the ethtool set-gcl command calls the drv_set_gcl() function in TSN Glue Logic to parse the GCL stored in the ethtool_gcl data structure. For each of the gate control entries, drv_set_gcl() repetitively calls set_est_gce() to set the value of the respective gate control entry. Before drv_set_gcl() exits, it calls set_est_gcl_len() to set the length of the GCL. To set up the GCL-associated time-related parameters and commit the GCL to hardware, the ethtool set-est-info command is used. This command, as shown in Fig. 4, calls drv_set_est_info() and then set_est_gcl_times() to set the GCL-related time parameters.<br>

VI. EXAMPLES: APPLICATION OF TSN TECHNOLOGIES<br>

In this section, we take a look at how the various TSN technologies discussed earlier can be used to provide network service for three traffic patterns: scheduled, time-sensitive and best effort.<br>

Fig. 5. TSN technologies for a network with scheduled, time-sensitive and best-effort traffic.<br>



Fig. 5 shows an example of how CBS, gate control and frame preemption can be used in an end point that hosts various applications generating a mix of scheduled (industrial and automotive control), time-sensitive (A/V streams) and best-effort traffic.<br>

Scheduled traffic is periodic and short and by nature needs very low latency. These frames are enqueued to the highest-priority transmit queue, which in Fig. 5 is TxQ3. At the fixed period of the gating cycle, only the gate control of TxQ3 opens while the gate controls belonging to the other traffic class queues must close. The interval during which the TxQ3 gate is open is normally small in order not to add significant delay to frames from other queues. To hold back earlier preemptable frames from delaying the actual start time of a scheduled frame, the set-and-hold-MAC gate control operation is used. The scheduled frame from TxQ3 is serviced by the express MAC (eMAC) service.<br>

For time-sensitive traffic, SR Class A and Class B data streams are enqueued to TxQ2 and TxQ1, which are attached to independent credit-based shapers. The CBS parameters of each (idleSlope, sendSlope, locredit and hicredit) are set with different values to ensure that an SR Class A stream generates 8000 packets per second (125 µs apart) and a Class B stream generates 4000 packets per second (250 µs apart). A convenient Python script (calc_cbs_params.py) is provided in [15] for calculating these parameters based on the bandwidth allocated to each SR stream. The gate control pattern for TxQ2 and TxQ1 may be set as shown in the GCL of Fig. 5:<br>

At T02, Gate#2 and Gate#1 are opened and Gate#0 is closed; this helps prevent best-effort frames from delaying SR frames. In addition, the set-and-release-MAC gate control operation should be used to allow the preemptable MAC service to transmit the frames queued to it.<br>

At T03, Gate#2, Gate#1 and Gate#0 are all open; this avoids best-effort frames being starved by the scheduled and time-sensitive traffic.<br>

Fig. 6. TSN technologies for a network with time-sensitive and best-effort traffic.<br>

The strict-priority scheduler ensures that frames from TxQ2, TxQ1 and TxQ0 are selected for transmission through the preemptable MAC service in the correct traffic-class order.<br>

Fig. 6 shows the usage of CBS for SR Class A and Class B streams connected to the express MAC service, while best-effort traffic is sent through the preemptable MAC service. Such a configuration reduces the latency that a best-effort frame introduces to an SR frame to the transmission time of as little as 123 bytes, as explained in IEEE 802.1Qbu [11] Annex R.2. Without frame preemption, the latency imposed on an SR frame can be as large as a maximum-size frame, i.e. 1522 bytes.<br>
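Converting these byte counts into blocking time on a 100 Mbps link makes the benefit concrete; the helper below is a simple illustration, not part of the driver.

```c
/* Worst-case blocking of a time-critical frame behind a frame already
 * on the wire: without preemption the full 1522-byte maximum-size frame
 * must drain; with IEEE 802.1Qbu only a non-preemptable remainder on
 * the order of the 123 bytes quoted above remains. */
static double wire_time_us(unsigned bytes, double mbps)
{
    return bytes * 8.0 / mbps;   /* bits over Mbit/s yields microseconds */
}
```

At 100 Mbps, wire_time_us(1522, 100.0) is about 121.76 µs of blocking, versus roughly 9.84 µs for 123 bytes with preemption.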

VII. CONCLUSION<br />

In conclusion, we have discussed two important elements of TSN, time synchronization and traffic shaping, and provided a concise introduction to many TSN-related standards. The current state of the art of TSN-related software components, e.g. gPTPd, linuxptp, tc and ethtool, was discussed, and we concluded that at the time of this writing the TSN framework in the Linux networking subsystem is still in the definition stage. We have demonstrated how, with a modular software architecture (TSN Core Library and TSN Glue Logic), a TSN-capable Ethernet kernel driver can be developed for IP validation, with a solution that scales to the future TSN framework in the Linux kernel, which appears to be heavily based on the traffic control subsystem. Lastly, the paper provided two examples of how various TSN technologies may be used to cater for a network with a mix of traffic patterns: scheduled, time-sensitive and best effort.<br>

ACKNOWLEDGMENT<br />

Much of the discussion in this paper is derived from various IEEE standards, public mailing list discussions and numerous papers; the author would like to thank all of the contributors listed in the references section.<br>

In addition, the author would like to thank the Ethernet and TSN hardware and software teams in the Intel Corporation Internet of Things Group (IOTG) for countless hours of effort towards the creation and validation of TSN technologies in Intel products. Special mentions go to Mr. Gavin Hindman, Mr. Jesus Sanchez-Palencia, Mr. Kweh Hock Leong, Mr. Vinicius Gomes and Mr. Voon Weifeng of Intel Corporation for the good partnership in the Linux kernel driver development.<br>

REFERENCES<br />

[1] Avnu Alliance Delivers First TSN Conformance Tests for Industrial<br />

Devices. http://avnu.org/wp-content/uploads/2014/05/Avnu-SPS-IPC-<br />

Conformance-Testing-Release-FINAL-UPDATED.pdf<br />

[2] https://github.com/AVnu/OpenAvnu<br />

[3] IEEE, “IEEE Standard for a Precision Clock Synchronization Protocol<br>

for Networked Measurement and Control Systems”, IEEE Std 1588-<br />

2008.<br />

[4] IEEE, “Timing and Synchronization for Time-Sensitive Applications in<br />

Bridged Local Area Networks”, IEEE Std 802.1AS-2011.<br />

[5] IEEE, “Forwarding and Queuing Enhancements for Time-Sensitive<br />

Streams”, IEEE Std 802.1Qav-2009.<br />

[6] M. D. Johas Teener et al., "Heterogeneous Networks for Audio and Video:<br />

Using IEEE 802.1 Audio Video Bridging," in Proceedings of the IEEE,<br />

vol. 101, no. 11, pp. 2339-2354, Nov. 2013.<br />

[7] IEEE, “Stream Reservation Protocol”, IEEE Std 802.1Qat-2010.<br />



[8] Levi Pearson, “Stream Reservation Protocol”, Avnu Alliance Best<br />

Practices, Nov. 2014.<br>

[9] IEEE, “Enhancements for Scheduled Traffic”, IEEE Std 802.1Qbv-<br />

2015.<br />

[10] IEEE, “Specification and Management Parameters for Interspersing<br />

Express Traffic”, IEEE Std 802.3br-2016.<br />

[11] IEEE, “Frame Preemption”, IEEE Std 802.1Qbu-2016.<br />

[12] Eric Mann, “Linux Network Enabling Requirements for Audio/Video<br />

Bridging”, Linux Plumber 2012.<br />

https://linuxplumbers.ubicast.tv/videos/linux-network-enablingrequirements-for-audiovideo-bridging-avb/<br />

[13] https://sourceforge.net/projects/linuxptp/<br />

[14] https://www.kernel.org/doc/Documentation/networking/timestamping.txt<br>

[15] Vinicius Costa Gomes, “TSN: Add qdisc based config interface for CBS”,<br />

https://marc.info/?l=linux-netdev&m=150820212927379&w=2<br />

[16] Vinicius Costa Gomes, “net/sched: Introduce Credit Based Shaper (CBS)<br />

qdisc”, https://marc.info/?l=linux-netdev&m=150820214127384&w=2<br />

[17] Vinicius Costa Gomes, “net/sched: Add support for HW offloading for<br />

CBS”, https://marc.info/?l=linux-netdev&m=150820212927381&w=2<br />

[18] Vinicius Costa Gomes, “[RFC] TSN: Add qdisc-based config interfaces<br />

for traffic shapers”,<br />

https://marc.info/?l=linux-netdev&m=150422919415560&w=2<br />

[19] Jesus Sanchez-Palencia, “Time based packet transmission”,<br />

https://marc.info/?l=linux-netdev&m=151623051317593&w=2<br />

[20] The ethtool project.<br />

https://www.kernel.org/pub/software/network/ethtool/<br />



TSN and OPC UA for Industrial Automation<br />

Challenges in Getting Fieldbus-like Performance<br />

and Scalability within Convergent Networks<br />

Henke, Torben; Zahn, Peter; Frick, Florian; Lechler, Armin<br />

Institute for Control Engineering of Machine Tools and Manufacturing Units<br />

University of Stuttgart<br />

Stuttgart, Germany<br />

Torben.Henke@isw.uni-stuttgart.de<br />

Abstract—Convergent communication networks with a<br />

multitude of differing data flows are a promising approach to<br />

increase information consistency and device interoperability and<br />

so enable new applications and business models in industrial<br />

automation and production environments. A required key<br />

feature for the application of control applications is deterministic<br />

real-time behavior of such networks. With the presence of IEEE<br />

802.1 TSN, there is a promising vendor and application<br />

independent networking technology available. This paper<br />

analyzes typical parameters and requirements of fieldbus<br />

technology which is used today for real-time applications and<br />

their relation to TSN mechanisms within convergent networks.<br />

Also, the potential of future realizations using TSN in<br />

conjunction with OPC UA Pub/Sub is shown. Focus is laid on<br />

specific challenges which will require some further work to<br />

enable fieldbus-like performance over such infrastructures.<br />

Time Sensitive Networking (TSN), an extension of the IEEE 802.1 bridging standard that is currently under development, extends standard Ethernet with real-time capability. Thus, TSN makes it possible to replace a core functionality of existing fieldbus systems, namely deterministic data exchange, with a vendor- and industry-independent standard.<br>

For component providers, a uniform communication standard means a significant reduction in development effort. In addition, due to the expected wide adoption of TSN in the automotive, industrial, IT, entertainment and finance industries, network adapters will be available in large quantities and at correspondingly lower prices. This leads to lower equipment costs.<br>

Keywords—TSN, OPC UA Pub/Sub, fieldbus, communication<br />

I. INTRODUCTION<br>

In the course of Industry 4.0, machine communication across different layers is of increasing importance to enable new technical applications and business models. Direct data exchange between individual components of a production machine, within the factory network or even up to cloud infrastructures will be requested more and more.<br>

Industrial communication within single production machines and automation devices today is based on fieldbus technology, which guarantees real-time-capable communication and takes care of the semantics of data and device descriptions (profiles) as well as the configuration of the communication entities. A major part of the fieldbus standards does not allow direct communication between devices in the fieldbus network and the rest of the IT network using a common protocol. Furthermore, the multitude of different, non-interoperable fieldbus systems leads to high costs. On the one hand, this is due to higher development effort, as component manufacturers have to adapt their devices to the various fieldbuses. On the other hand, the communication hardware required to guarantee real-time behavior is only produced in small quantities, which makes it quite expensive.<br>

Fig. 1. Pyramid of automation<br />

II. INDUSTRIAL COMMUNICATION – STATE OF THE ART<br>

Industrial communication networks are vital resources for transferring information related to control, diagnostics, tracing and configuration within production environments. Different layers of communication can be distinguished; to date, the pyramid of automation is mostly used as the underlying model. Here, from bottom to top, the requirements concerning time determinism decrease while the abstraction and amount of data increase.<br>



On the field level, typically small telegrams with a short cycle time, often depending on control loop frequencies, are exchanged. While today several specialized fieldbus environments are used for that purpose, things get more challenging when using standard IT protocols. In the next section, the related requirements are presented.<br>

A. Fieldbus Technologies and their Requirements<br />

Field buses are used today for communication between field devices such as sensors, actuators and automation devices. Their main purpose is to transmit process data between the participants with deterministic real-time behavior, often in the form of cyclic telegrams. Additionally, service- and configuration-related data has to be considered. To ensure the deterministic behavior, the communication topology, the participants and the timing schedule are often configured statically within given bounds.<br>

The following describes typical requirements placed on fieldbus systems today. Their weighting depends on the respective application, resulting in different fieldbus environments being used for different applications.<br>

1) CYCLE TIME<br />

The cycle time specifies the length of one communication cycle, in which a data transfer between the participating network nodes takes place. Often, actual values, which were sampled at the beginning of the cycle, and set points, which become valid with the next cycle, are transferred. The cycle time is limited both by the transmission time of a packet and by the processing time in the subscribers. With a bandwidth of 100 Mbps on an Ethernet link, it takes 5.76 µs to send one frame of minimum size. If several frames are sent from/to a subscriber, the Inter Frame Gap (IFG) of 0.96 µs must be added, so that the minimum total duration per frame is 6.72 µs. In addition to the transmission time, there are also the propagation delay on the line and the time that the participants need to process the data.<br>
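These figures can be reproduced with a short calculation; note that our reading of the 5.76 µs value is that it includes the 8-byte preamble/SFD on top of the 64-byte minimum frame (64 bytes alone would take only 5.12 µs at 100 Mbps).

```c
/* Wire time of a minimum-size Ethernet frame at 100 Mbps:
 * 64-byte frame + 8-byte preamble/SFD = 72 bytes = 576 bits -> 5.76 us;
 * adding the 12-byte (0.96 us) Inter Frame Gap gives 6.72 us. */
static double frame_time_us(unsigned frame_bytes, int with_ifg)
{
    unsigned wire_bytes = frame_bytes + 8 + (with_ifg ? 12 : 0);
    return wire_bytes * 8.0 / 100.0;     /* bits over 100 Mbit/s -> us */
}
```

frame_time_us(64, 0) gives the single-frame figure, frame_time_us(64, 1) the back-to-back per-frame duration.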

Depending on the application, different demands are placed on cycle times. Typically, for decentralized control systems, 1 to 2 ms are sufficient, while cycle times between 25 and 250 µs are required for a central control system. As the cycle time corresponds to a dead time within an overlaid control loop, it is often also bounded from a control performance point of view.<br>

2) SYNCHRONICITY<br />

Synchronization of several entities is crucial, for example, in motion tasks, where several drives have to operate coupled at high speeds. A slight shift in the time-position profile of a drive may lead to poor product quality or even machine damage. Fieldbuses for such applications must therefore provide a common time base for all components; the accuracy requirements of the process determine the amount of allowed jitter. Various protocols and methods exist to compensate for uncertainties in communication delays as well as clock differences.<br>

3) RELIABILITY

Reliability is of importance in automation and control technology because a system failure leads to a standstill of the plant and thus to financial loss for the operator. In order to minimize such situations, various options for setting up redundant structures are available. One example is line redundancy, where several independent communication channels are established between the participants. If one is interrupted or disturbed, it is still possible to communicate via the others. If a frame is forwarded multiple times in the network, care has to be taken that it is processed only once at the receiver.
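The duplicate-discard requirement can be illustrated with a minimal filter. Real redundancy protocols (e.g. PRP/HSR) use bounded sequence-number windows; the unbounded set below is only a sketch of the idea, and the class and field names are invented for illustration.

```python
class DuplicateFilter:
    """Accept each (sender, sequence number) pair only once.

    Minimal illustration of duplicate elimination for line-redundant
    networks; real protocols use bounded drop windows instead of an
    unbounded set of seen keys.
    """
    def __init__(self):
        self.seen = set()

    def accept(self, sender, seq):
        key = (sender, seq)
        if key in self.seen:
            return False        # redundant copy from the second path
        self.seen.add(key)
        return True             # first copy: process it

f = DuplicateFilter()
print(f.accept("dev1", 7))   # True  - first copy is processed
print(f.accept("dev1", 7))   # False - duplicate is discarded
```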

4) NUMBER OF STATIONS

For decentralized control scenarios with many distributed stations, it is important that a fieldbus can handle a sufficient number of stations. The maximum number of participants is limited by two factors: on the one hand by the addressing range of the protocol, on the other hand by the amount of data exchanged between participants. If the amount of data exceeds the available bandwidth, the scheduled transmission cannot be finished within a cycle.

5) DATA RATE

To achieve short cycle times with a high number of subscribers, the available data rate and the amount of application payload are decisive. The latter depends on how the protocol is structured and how much data is transferred per subscriber. For an Ethernet-based fieldbus, for example, each data packet sent has a minimum size of 64 bytes. If it contains only two bytes of payload, bandwidth utilization is inefficient. A high data rate is also desirable in terms of minimum transmission delays.
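The efficiency penalty of small payloads can be quantified. The sketch below assumes an 18-byte Ethernet header/FCS, the 64-byte minimum frame, plus preamble/SFD and IFG; the exact overhead accounting is an assumption for illustration, not taken from the paper.

```python
def payload_efficiency(payload_bytes, min_frame=64):
    """Share of wire time that carries application payload for one frame.

    Assumes 18 bytes of Ethernet header/FCS, padding up to the 64-byte
    minimum frame, plus 8 bytes preamble/SFD and 12 byte times of IFG.
    """
    frame = max(min_frame, payload_bytes + 18)
    wire_bytes = frame + 8 + 12
    return payload_bytes / wire_bytes

print(f"{payload_efficiency(2):.1%}")     # ~2.4% for 2 bytes of payload
print(f"{payload_efficiency(1400):.1%}")  # ~97% for a large frame
```

With two bytes of payload in a minimum-size frame, barely 2–3% of the wire time is useful, which is exactly the inefficiency the text describes.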

Fig. 2. Overview: Requirements concerning fieldbus technologies<br />

B. Key mechanisms and representative protocols

To give an overview of the different mechanisms that can be used to fulfill the presented requirements, some typical fieldbus protocols are explained in the following section.

1) EtherCAT

EtherCAT uses so-called summing frames; all participants form a logical ring. The initial telegram is sent from the master to the first slave in the ring and from there in sequence from slave to slave before it finally reaches the master again. The summing frame acts as a placeholder for all data exchanged in the current cycle. The slaves know where in the summing frame their input data are located and to which address they should write their output data. In order to keep the processing delay in the slaves as low as possible, the summing frames are processed in hardware on the fly. This means that the frame is not received completely before it is processed; instead, the received data is continuously interpreted and forwarded, so that a delay of only a few bits occurs. By using summing telegrams within Ethernet frames, the protocol efficiency can be significantly increased.

In addition to the direct exchange of cyclic process data, it is possible to tunnel other communication through a specific mailbox mechanism. In order to avoid collisions in the network, only one master must be present; no other network device may transfer data on its own initiative, and devices that are not compliant must not be connected to the network. Direct communication between the devices is only possible to a limited extent: data can only be transferred directly from a slave arranged earlier in the logical ring to a later slave. In the other direction, the data must first be forwarded by the master in the next cycle.

As the master sends and receives standard Ethernet frames, no special hardware is required on its side. Only slave devices need a network interface capable of processing frames on the fly.
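The offset-based read/write scheme of the summing frame can be sketched in a few lines. The slave list, offsets, and 4-byte data sizes below are hypothetical; real EtherCAT slaves perform these accesses in hardware while the frame passes through, not on a complete buffer.

```python
# Hypothetical configuration: where each slave's data lives in the frame.
SLAVES = [
    {"name": "slave1", "in_off": 0, "out_off": 4},
    {"name": "slave2", "in_off": 8, "out_off": 12},
]

def cycle(frame, outputs):
    """One pass of the summing frame through the logical ring.

    Each slave reads its input bytes and writes its output bytes at
    fixed, pre-configured positions in the shared frame.
    """
    inputs = {}
    for s in SLAVES:                       # ring order: slave1, slave2, ...
        inputs[s["name"]] = bytes(frame[s["in_off"]:s["in_off"] + 4])
        frame[s["out_off"]:s["out_off"] + 4] = outputs[s["name"]]
    return inputs  # the master sees all written data when the frame returns

frame = bytearray(16)                      # the summing frame as placeholder
ins = cycle(frame, {"slave1": b"\x01\x02\x03\x04",
                    "slave2": b"\xaa\xbb\xcc\xdd"})
print(frame.hex())
```

Because every slave touches only its own fixed region, one frame can carry the process data of the whole ring in a single cycle.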

2) Profinet

Profinet works according to the provider-consumer model. All network participants send their data to the respective recipients at fixed, pre-configured points in time without being asked. If a controller sends output data, it acts as the provider of the data, and the receiving field devices are the consumers of this data. This is reversed for the input data.

The devices participating in a Profinet network are divided into three classes:

1. Supervisor: used to configure and diagnose participants of a Profinet network.
2. IO controller: PLCs and NC controls are located here.
3. IO device: all decentralized I/O field devices, such as bus terminals and drives.

Several IO controllers may be present in a Profinet network simultaneously, and an IO controller can consume the data of different IO devices. Cyclic communication between IO devices is not supported in Profinet.

Dynamic Frame Packing (DFP) makes it possible to transmit data to several subscribers within one Ethernet frame, similar to a summing frame. To do this, the participants must be placed in a line. Within a DFP frame, the Profinet frames for the participants are arranged in the order of the line. The first node takes its frame and forwards a new Ethernet frame together with the remaining Profinet frames for the following nodes.

3) EtherNet/IP

EtherNet/IP is based on the very common TCP/IP and UDP/IP layers, while the exchange of data between the nodes is carried out via the Common Industrial Protocol (CIP) from layer 5 of the ISO/OSI model upward. CIP is an object-oriented data protocol for communication between field devices and controllers. A CIP object consists of data and services, among other things, and a network node is described as a collection of objects. Within a node, the data of the objects is mapped to the internal data. Communication between nodes also takes place via so-called communication objects.

Two different communication relations are available: point-to-point connections using TCP/IP are used for the exchange of non-real-time data, while real-time data is exchanged via UDP/IP packets and transmitted as multicast. Here, four different modes are available:

- Cyclic transmission of the full payload
- Cyclic transmission of incremental data only
- Requesting individual nodes to send their data (polling)
- Requesting all nodes to send their data at the same time

In conclusion, TABLE I gives an overview of typical performance parameters of the presented fieldbus protocols. It shows that the requirements of the application have to be taken into account when selecting a protocol, as there is no global optimum.

TABLE I. COMPARISON OF TECHNICAL FIELDBUS PARAMETERS

                             EtherCAT              PROFINET              EtherNet/IP
Min. cycle time              11 µs                 31.25 µs              1 ms
Synchronicity                ± 20 ns               < 1 µs                < 200 ns
Data rate                    100 Mbit/s            1 Gbit/s              Ethernet max.
RT mechanism                 Exclusive bus access  Exclusive bus access  Quality of Service
Frame format                 Summing frame         Ethernet              TCP/IP
Topology                     arbitrary             arbitrary             arbitrary
Direct cross communication   no                    optional              yes

Due to the high demands on determinism and the incompatibility of most fieldbuses with standard Ethernet, it is necessary to separate devices belonging to Information Technology (IT) from those belonging to Operational Technology (OT). As a result, several physically independent networks are set up.

III. PATH TO CONVERGENT NETWORKS

One of the main justifications for fieldbuses is the lack of determinism in standard Ethernet. With TSN, however, standard Ethernet becomes real-time capable. This makes it possible to create so-called convergent networks, in which different traffic classes coexist, which offers various advantages over proprietary, sealed infrastructures. Separate cabling is no longer necessary. If all devices are in one single network, it is also possible to access individual devices directly from upper layers. This adds value for the customer, as no special tools or methods are needed anymore to access devices; they can instead be configured, for example, via an integrated web server. Also, direct access to huge amounts of production data allows them to be used for big-data analytics.



Another advantage of a convergent network is its lower technical complexity. This concerns both the network itself and the devices, as standardized hardware and stacks can be used. The complexity of the field devices is reduced by the fact that they no longer have to be adapted to several different fieldbuses. On the other hand, much more effort has to be put into the management of such complex networks.

When considering convergent networks for control and automation applications, some basic functionalities have to be provided. On the one hand, it is necessary to have a uniform time base throughout the entire network, which ensures synchronous operation of distributed systems. Time-deterministic data transmission is also mandatory and is currently the main obstacle for convergent networks. Another challenge to be solved is the configuration of such networks: in addition to the traditional network parameters determining switching and routing, a convergent network in control technology also includes timing-related parameters.

Besides the requirements of the network itself, there are further aspects that need to be considered. These include migration strategies for existing devices and protocols. Efficient methods for integrating such devices into a convergent network will be crucial for the acceptance of such novel communication infrastructures. Interoperability between devices is moving in a similar direction. Although it is often not absolutely necessary at the process-data level, the benefit from convergent networks correlates with the availability of data.

IV. BENEFITS FROM TSN & OPC UA

One technical solution that covers the requirements for such convergent networks is currently being specified by the IEEE under the term TSN. It comprises a collection of standards that extend the IEEE 802.1 Ethernet standard and cover various aspects of time-critical data transmission via Ethernet. Some standards, such as IEEE 802.1Qbv [1], deal with the time-deterministic transmission of data; others increase the transmission reliability of the data through redundancy. IEEE 802.1AS-Rev [2] specifies a time synchronization protocol; however, there are several preliminary variants currently in the field which are not fully interoperable.

TSN focuses only on the transport layer and does not specify higher-level protocols and descriptions. Here, OPC UA has gained increasing importance for enabling manufacturer-independent communication and is already widely used for diagnosis and configuration data, although to date it does not provide any real-time guarantees.

In addition to the pure data description, OPC UA also contains mechanisms for the verification and protection of data through encryption, which also makes it suitable for Industry 4.0 scenarios. OPC UA basically works according to the client-server principle, whereby a device can be both client and server simultaneously. This is not suitable for cyclic communication with low cycle times, as the overhead in the communication increases and bandwidth efficiency drops. A solution to this problem is the Publisher/Subscriber (Pub/Sub) extension for OPC UA [3]. Here, a publisher transmits its data, cyclically or on change, as multicast telegrams into the network, and other devices can subscribe to and receive this data. As a device can also take both roles at the same time, arbitrary communication relations can be established.

The combination of OPC UA Pub/Sub and TSN is expected to be the future multi-vendor communication solution in convergent networks. However, there are still some challenges to be solved before broad industrial application, as shown below.

1) Resource Consumption

For devices with limited resources, a full OPC UA stack can be challenging with respect to CPU, RAM, and the available memory resources. The OPC Foundation is currently working on a possible reduced communication model, in which the content of the messages and the position of individual data items within them are permanently configured at startup. This eliminates the need for a complete decoding of Pub/Sub messages and allows static access to the process data in them.

2) Network Configuration

Another challenge is the configuration of the network devices according to the needed communication relations. Several models for this are currently being developed which are not necessarily compatible. Also, they have not yet been integrated into common engineering tools, which is crucial for the industrial acceptance of the technology.

3) Scalability

A third challenge when using OPC UA Pub/Sub and TSN for control technology in convergent networks is the limited scalability, depending on the scenario. This can be seen from two different points of view when simply increasing the number of devices that participate in the cyclic communication, even with small payloads.

On the one hand, real-time data traffic can consume a large part of the network's bandwidth. In an example network with 100 devices that send 90 bytes (including protocol overhead) to the central controller every millisecond, this application occupies 72% of the bandwidth of a network with a data rate of 100 Mbps. Of course, network devices and links with 1 Gbps or more are common today. However, in industrial line installations with less powerful devices, lower speeds will still be used in the coming years for reasons of robustness or cost.

On the other hand, data transfer needs time, and several frames cannot be received at the same time by a centralized controller. If the data from all devices in this example shall be present at the controller at the same point in time, the earliest telegram has to be sent 720 µs in advance, which means the data are more outdated than in a scenario with fewer devices.
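Both figures follow from a short calculation. The illustrative sketch below treats the 90 bytes per device as the complete on-wire cost per frame (ignoring preamble and IFG, as the paper's numbers do).

```python
def cyclic_load(n_devices=100, bytes_per_frame=90,
                cycle_ms=1.0, data_rate_mbps=100):
    """Bandwidth share and serialization time of the cyclic traffic."""
    bits_per_cycle = n_devices * bytes_per_frame * 8
    capacity_bits = data_rate_mbps * 1e6 * cycle_ms / 1e3  # bits per cycle
    utilization = bits_per_cycle / capacity_bits
    serialization_us = bits_per_cycle / data_rate_mbps     # Mbps = bits/µs
    return utilization, serialization_us

util, t_us = cyclic_load()
print(f"bandwidth utilization: {util:.0%}")  # 72%
print(f"serialization time:    {t_us:.0f} us")  # 720 us
```

100 devices × 90 bytes × 8 bits = 72,000 bits per millisecond, i.e. 72% of the 100,000 bits a 100 Mbps link carries per cycle; sent back to back, those frames occupy the link for 720 µs.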

Furthermore, the high data volume, especially when using complex layered stacks, leads to a high computing load on the controller, which is then not available for the control application itself. This is especially important when using embedded devices with limited resources.

One possible solution could be suitable aggregation methods for many-to-one communication using Pub/Sub. Together with an exact time base in the network, the transmission time could be optimized for each device to guarantee minimum total latency of the transmission.

V. SUMMARY AND OUTLOOK

TSN together with OPC UA has the potential to be an enabling technology for convergent networks and thus to shape the future production environment. As many aspects are subject to current discussions and specification activities, several questions remain open to date, which need to be solved before broad industrial application.

To bridge the gap between network and automation device vendors as well as machine builders, the industry working group "TSN for Automation" [4] was established at the Institute for Control Engineering of Machine Tools and Manufacturing Units at the University of Stuttgart. Its scope is to build a bridge between the IT and automation worlds and to assist in getting started with the technology. The latest information from various related committees is provided, as well as an accessible laboratory together with a reference solution for TSN end devices. Solutions to the named challenges regarding scalability and configuration will be addressed in future research work in close cooperation with partners from industry.

REFERENCES

[1] LAN/MAN Standards Committee of the IEEE Computer Society: IEEE Std 802.1Qbv-2015 (Amendment to IEEE Std 802.1Q-2014 as amended by IEEE Std 802.1Qca-2015, IEEE Std 802.1Qcd-2015, IEEE Std 802.1Q-2014/Cor 1-2015), IEEE Standard for Local and metropolitan area networks—Bridges and Bridged Networks—Amendment 25: Enhancements for Scheduled Traffic, 2015.
[2] LAN/MAN Standards Committee of the IEEE Computer Society: IEEE 802.1: 802.1AS-Rev - Timing and Synchronization for Time-Sensitive Applications. URL: http://www.ieee802.org/1/pages/802.1ASrev.html
[3] OPC Foundation: OPC Unified Architecture Part 14: PubSub, Release Candidate 1.04.24, 2017.
[4] http://www.tsn4automation.com



Beyond the Capabilities of Wireshark: Effective and Efficient Generation of Mostly-Valid Messages for Bad-Case Testing of Communication Protocol Implementations

Dipl.-Phys. Andreas Walz, Prof. Dr.-Ing. Axel Sikora
Institute of Reliable Embedded Systems and Communication Electronics (ivESK)
Offenburg University of Applied Sciences


Introduction

(Security) testing of software is highly important
(Mostly) positive testing is much more common than negative testing
• Verify correct behavior of the DUT given valid and expected input
• What about mostly-valid, invalid, or generally unexpected input?
Negative (bad-case) testing
• Cumbersome for rich and complex message formats
• Numerous implicit and explicit consistency requirements
• Many ways messages can be invalid
• Involves parsing, interpretation, and manipulation of “on-the-wire” encoded messages
Here: a powerful concept & toolbox supporting negative testing campaigns with mostly-valid messages for complex communication protocol stacks

2018-03-01, embedded world Conference 2018


Agenda<br />

1. The hassle with negative testing<br />

2. The concept of Generic Message Trees (GMTs) and<br />

GMT manipulation operators<br />

3. How GMTs can help with negative testing …<br />

4. The TLS Presentation Language (TPL) and automatic<br />

code generation<br />



A Typical/Simple Test Setup

Transport Layer Security (TLS) implementations
• Server: the Device Under Test (DUT)
• Client: the test agent

Message flow: the TLS client (test agent) sends a ClientHello to the TLS server (DUT), which answers with a ServerHello.





A Typical/Simple Test Setup<br />

TLS Client<br />

Test Agent<br />

“read-only” <br />

160301011c0100011803035c6ac7070b212bef9e9c8c03c131802<br />

5e3b38257ff9e42bbb9bb4f004bdcc2e5000082c030c02cc028c0<br />

24c014c00a00a3009f006b006a0039003800880087c032c02ec02<br />

ac026c00fc005009d003d00350084c02fc02bc027c023c013c009<br />

00a2009e0067004000330032009a009900450044c031c02dc029c<br />

025c00ec004009c003c002f00960041c011c007c00cc002000500<br />

04c012c00800160013c00dc003000a00ff0100006d000b0004030<br />

00102000a00340032000e000d0019000b000c00180009000a0016<br />

00170008000600070014001500040005001200130001000200030<br />

00f0010001100230000000d0020001e0601060206030501050205<br />

03040104020403030103020303020102020203000f000101<br />

020000430303497a816ad6e3411002c14a172aad5935e4a7d9bd2<br />

d782a27658b7aa5cd1be7c21019c33f380b13a56a9dec0da6d89b<br />

08d9c01300000bff01000100000b00020100<br />

“On-the-wire” representation<br />

TLS Server<br />

DUT<br />







Structured Message Manipulation<br />

Parsing / dissecting<br />

Tree-like representation<br />

Flat “on-the-wire” representation<br />

160301011c0100011803035c6ac7070b212bef9e9c8c03c131802<br />

5e3b38257ff9e42bbb9bb4f004bdcc2e5000082c030c02cc028c0<br />

24c014c00a00a3009f006b006a0039003800880087c032c02ec02<br />

ac026c00fc005009d003d00350084c02fc02bc027c023c013c009<br />

00a2009e0067004000330032009a009900450044c031c02dc029c<br />

025c00ec004009c003c002f00960041c011c007c00cc002000500<br />

04c012c00800160013c00dc003000a00ff0100006d000b0004030<br />

00102000a00340032000e000d0019000b000c00180009000a0016<br />

00170008000600070014001500040005001200130001000200030<br />

00f0010001100230000000d0020001e0601060206030501050205<br />

03040104020403030103020303020102020203000f000101<br />

Serialization / encoding<br />



Generic Message Trees<br />

Protocol messages are represented as a tree data structure<br />

• Called Generic Message Trees (GMTs*)<br />

• Similar to parse trees<br />

• Structure (i.e. composition rules) given by protocol definition<br />

• Internal nodes: composite structures<br />

• Leaf nodes: atomic data with raw (binary) data representation<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />
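As a rough illustration of the GMT idea, a tree of composite (internal) and atomic (leaf) nodes that serializes back to a flat byte string might look as follows. The node names echo the TLS example from the slides; the classes and hex values are invented for illustration and are not the toolbox's actual data structures.

```python
class Leaf:
    """Atomic field: a name plus its raw (binary) data representation."""
    def __init__(self, name, raw):
        self.name, self.raw = name, raw
    def serialize(self):
        return self.raw

class Node:
    """Composite structure: serializes by concatenating its children."""
    def __init__(self, name, children):
        self.name, self.children = name, children
    def serialize(self):
        return b"".join(c.serialize() for c in self.children)

# Toy ClientHello fragment: record length 0x0006 covers the 4-byte
# handshake header plus the 2-byte version field.
msg = Node("TLSRecord", [
    Leaf("record_header", bytes.fromhex("1603010006")),
    Node("Handshake", [
        Leaf("handshake_header", bytes.fromhex("01000002")),
        Leaf("client_version", bytes.fromhex("0303")),
    ]),
])
print(msg.serialize().hex())  # flat "on-the-wire" representation
```

Parsing/dissecting builds such a tree from the flat bytes; serialization is the inverse walk shown here.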





Input Generation Strategy<br />

1. Use valid message as input<br />

2. Convert message to tree (GMT) representation<br />

3. Apply randomized or deterministic manipulation(s)<br />

4. Serialize to flat message and use as test message<br />

Random or deterministic message manipulation<br />



Message Manipulation (1)<br />

Different generic manipulation operators are available:<br />

Deterministic operators<br />

• Voiding / Removing operators<br />

Void or remove node/subtree<br />

• Duplicating operator<br />

Duplicate subtree and add as sibling<br />

Randomized (fuzzing) operators<br />

• Truncating fuzz operator<br />

Truncate subtree (may remove nodes)<br />

• Integer fuzz operator<br />

Randomize value of integer or<br />

enumeration field<br />

• Content fuzz operator<br />

Fill a leaf node with random content<br />

(random raw data)<br />

• Appending fuzz operator<br />

Append random content to a leaf node<br />



Message Manipulation (2)<br />

Two special operators:<br />

• Repairing operator<br />

Restore consistency among children<br />

of a tree node (from last to first)<br />

• Repairing fuzz operator<br />

Traverse tree towards root and<br />

apply repairing operator on each<br />

visited node with a fixed probability<br />



A Few Concrete Examples …<br />

1. Manipulating a Single Length Field<br />

2. Removing a certain message component<br />

3. Full-fledged randomized testing<br />





Example 1:<br />

Manipulating a Single Length Field<br />

TLSRecord record;<br />

record.dissect(message);<br />

record.propSet(".value@**/extensions/_N", 0);<br />





Example 2:<br />

Removing a Certain Message Component<br />

Remove the ec_point_formats extension from a TLS ClientHello message<br />



Example 2:<br />

Removing a Certain Message Component<br />

Implementation in C++ code using GMTs<br />

TLSRecord record;<br />

record.dissect(message);<br />

Cursor cursor(record);<br />

cursor.seekByPath("**/Extension:ec_point_formats%");<br />

cursor.doRemove();<br />

RepairingOperator repairer;<br />

repairer.apply(cursor);<br />



Example 2:<br />

Removing a Certain Message Component<br />

Before manipulation<br />



Example 2:<br />

Removing a Certain Message Component<br />

After manipulation<br />



Example 3:<br />

Randomized Testing: Input Generation<br />

Select and apply random operator<br />

Perform randomized repairing<br />

Recursive manipulation of subtree<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />



Example 3:<br />

Randomized Testing: Test Diversity<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />





Example 3:<br />

Randomized Testing: Identified Bug<br />

Inconsistent treatment of length fields by MatrixSSL 3.8.4<br />

A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box Testing of TLS<br />

Implementations," in IEEE Transactions on Dependable and Secure Computing, doi: 10.1109/TDSC.2017.2763947<br />





GMT Approach<br />

Benefits:<br />

• User-friendly navigation through and access to message fields<br />

• Allows (semi-)automatic manipulation of messages<br />

• Allows rapid prototyping<br />

Available as open-source library (C++) under 3-clause BSD license<br />

[https://github.com/phantax/gmt-cpp]<br />

How to obtain format-specific dissectors?<br />



TLS Presentation Language (1)<br />

TLS Presentation Language (TPL)<br />

• Data format description language (not a programming language!)<br />

• Introduced with the draft specification of SSL by Netscape (now TLS)<br />

• Used to describe (“present”) the on-the-wire format of SSL/TLS messages<br />

• Enhanced version (eTPL*) used for automatic code generation of parsers/dissectors<br />

• Suitable not only for (D)TLS, but also other protocols<br />

[RFC 5246]<br />

struct {<br />

ProtocolVersion client_version;<br />

Random random;<br />

SessionID session_id;<br />

CipherSuite cipher_suites;<br />

CompressionMethod compression_methods;<br />

select (extensions_present) {<br />

case false: struct {};<br />

case true: Extension extensions;<br />

};<br />

} ClientHello;<br />


* A. Walz and A. Sikora, "eTPL: An enhanced version of the TLS<br />

presentation language suitable for automated parser generation,"<br />

IDAACS 2017, doi: 10.1109/IDAACS.2017.8095200


TLS Presentation Language (2)<br />

Elements of the (enhanced) TLS Presentation Language<br />

• Basic built-in types (integer fields)<br />

• Enumerated fields<br />

• Composite (constructed) types<br />

• Variants (dynamic choices within composite types)<br />

• Vectors (both fixed and variable length)<br />

[RFC 5246]<br />

struct {<br />

ProtocolVersion client_version;<br />

Random random;<br />

SessionID session_id;<br />

CipherSuite cipher_suites;<br />

CompressionMethod compression_methods;<br />

select (extensions_present) {<br />

case false: struct {};<br />

case true: Extension extensions;<br />

};<br />

} ClientHello;<br />



eTPL/GMT Tool Chain<br />

Tool-chain flow: a format definition written in (e)TPL is fed into the eTPL<br />

parser and code generator (etpl-tool, implemented in Python)<br />

[https://github.com/phantax/etpl-tool]; the generated format-specific C++ code,<br />

together with user code, builds on the generic GMT library (gmt-cpp)<br />

[https://github.com/phantax/gmt-cpp].<br />



Summary & Conclusion<br />

Addressed the hassle related to bad-case (negative) testing of protocol implementations<br />

• How to obtain mostly-valid test messages<br />

Built an efficient bidirectional bridge between two types of protocol message representations<br />

• Flat “on-the-wire” representation<br />

• Tree-like representation<br />

Presented the GMT concept and the eTPL/GMT tool chain<br />

• Systematic message manipulation on a structured message representation<br />

• All the message parsing/encoding done automatically<br />

• Dealing with complex messages can be made easy, efficient, and user-friendly<br />



Thank you for your attention!<br />

Questions… ?<br />

Prof. Dr. Axel Sikora, Dr.-Ing. Dipl.-Wirt.-Ing.<br />

Scientific Director<br />

Institute of Reliable Embedded Systems and Communication Electronics<br />

Andreas Walz, Dipl.-Phys.<br />

Research Engineer<br />

Institute of Reliable Embedded Systems and Communication Electronics<br />

Phone +49 (0)781 205-416<br />

Fax +49 (0)781 205-45 416<br />

axel.sikora@hs-offenburg.de<br />

Badstraße 24<br />

77652 Offenburg<br />

www.hs-offenburg.de<br />

Phone +49 (0)781 205-4803<br />

Fax +49 (0)781 205-45 4803<br />

andreas.walz@hs-offenburg.de<br />

Badstraße 24<br />

77652 Offenburg<br />

www.hs-offenburg.de<br />



References<br />

• A. Walz and A. Sikora, "Exploiting Dissent: Towards Fuzzing-based Differential Black Box<br />

Testing of TLS Implementations," in IEEE Transactions on Dependable and Secure<br />

Computing, doi: 10.1109/TDSC.2017.2763947<br />

• A. Walz and A. Sikora, "eTPL: An enhanced version of the TLS presentation language<br />

suitable for automated parser generation," IDAACS 2017, doi:<br />

10.1109/IDAACS.2017.8095200<br />

• gmt-cpp: https://github.com/phantax/gmt-cpp<br />

• etpl-tool: https://github.com/phantax/etpl-tool<br />



System and Device Design Recommendations for<br />

CAN FD Networks<br />

Holger Zeltwanger<br />

CAN in Automation (CiA) e. V.<br />

90429 Nuremberg, Germany<br />

headquarters@can-cia.org<br />

Abstract—Several semiconductor manufacturers have already<br />

implemented the CAN FD protocol in stand-alone controller<br />

chips as well as microcontrollers. Some OEMs (original equipment<br />

manufacturers) have started to integrate CAN FD networks in<br />

their in-vehicle network architectures. This paper provides some<br />

guidelines and recommendations, in particular for the bit-timing<br />

settings for the arbitration phase and the data-phase.<br />

Keywords—CAN FD, in-vehicle network, bit-timing, network<br />

topology, ringing suppression, phase-margin<br />

I. INTRODUCTION AND BASICS<br />

In 2012, the CAN FD (controller area network with flexible<br />

data-rate) protocol was launched at the 13 th international CAN<br />

Conference in Hambach castle. In the meantime, the protocol<br />

has been internationally standardized in ISO 11898-1:2015.<br />

This standard just specifies the CAN FD and the Classical<br />

CAN protocol. In order to avoid misunderstandings, it was<br />

agreed to use the term ISO CAN FD or just CAN FD, when the<br />

implementation complies with the ISO 16845:2016 conformance<br />

test plan. Implementations based on CAN FD version 1.0<br />

should be named non-ISO CAN FD.<br />

Transceiver chips and System Base Chips (SBCs) compliant<br />

with ISO 11898-2:2016 optionally support bit-rates up to 2<br />

Mbit/s or 5 Mbit/s. Parameters for higher bit-rates are not specified.<br />

Nevertheless, higher bit-rates can be achieved, for example by<br />

limiting the operating temperature.<br />

Neither ISO 11898-1:2015 nor ISO 11898-2:2016<br />

provides any system and device design recommendations,<br />

etc. The legacy standards contained some system and<br />

device design rules.<br />

The new editions of ISO 11898-1 and ISO 11898-2 are<br />

written for semiconductor manufacturers. Device designers<br />

need additional guidelines and recommendations for the CAN<br />

FD device interface. Normally, these depend on the CAN<br />

FD system design given by the OEM.<br />

The ISO 11898-1:2015 document does not specify the interface<br />

to the host controller in detail. It just gives some basic<br />

information, which is not sufficient for interoperability and<br />

system design aspects. For example, the oscillator frequency is<br />

not specified, because this is a device design issue. The CiA<br />

601-2 CAN controller interface specification recommends<br />

using 20 MHz, 40 MHz, or 80 MHz. Other frequencies should<br />

not be used. Another recommendation in this document is the<br />

number of bit-timing registers to be implemented. The ISO<br />

standard just requires a small register, which is sufficient for<br />

some bit-rate combinations. The CAN FD protocol may use<br />

two bit-rates: one for the arbitration phase and another one or<br />

the same for the data-phase. In case of using a large ratio between<br />

arbitration and data-phase bit-times, the standardized<br />

size of the bit-timing registers is not appropriate. Therefore the<br />

CiA 601-2 document recommends for the arbitration-phase<br />

register a programmability of 5 time-quanta (tq) to 385 tq. The<br />

configurability for the data-phase register should be the range<br />

from 4 tq to 49 tq. Additionally, the CiA 601-2 specification<br />

contains some recommendations regarding interrupt sources<br />

and message buffer behaviors.<br />

In order to understand the ISO 11898-2:2016 standard from<br />

a device designer’s point-of-view, the CiA 601-1 specification<br />

provides some useful information about the transceiver loop<br />

delay symmetry, the bit-timing symmetry, and the transmitter delay<br />

compensation (TDC). This document explains how to interpret<br />

and consider the parameters given by the transceiver chip suppliers.<br />

II. BIT-TIMING SETTINGS FOR CAN FD<br />

A. General guidelines<br />

As said, the ISO 11898 series does not specify device or system<br />

design aspects. In order to achieve interoperability of devices,<br />

the bit-timing should be the same in all nodes. This is<br />

nothing new for engineers familiar with Classical CAN network<br />

designs. However, in Classical CAN networks there are<br />

some tolerances allowed regarding the bit-timing settings. They<br />

are necessary, when nodes with different oscillator frequencies<br />

are in the same network. Typically, the sample-point (SP) is<br />

given as a range such as 85 % to 90 % with a nominal value of<br />

87,5 % (CANopen). The SP is between the phase segment 1<br />

and the phase segment 2 of a bit-time. The bit-time comprises<br />

the synchronization segment (always one time-quantum), the<br />

propagation segment, the phase segments 1 and 2.<br />

In CAN FD networks, the rules and recommendations<br />

need to be stricter, because higher bit-rates bring the network<br />

closer to the physical limits. Of course, when not using<br />

the bit-rate switch function, the bit-timing is as in Classical<br />

CAN. But when using two bit-rates, the system designer should<br />

take care that all nodes apply the very same bit-timing settings.<br />

The nonprofit SAE (Society of Automotive Engineers) International<br />

association developed two recommended practices<br />

for CAN FD node and system designers. The SAE J2284-4<br />

document specifies a bus-line network running at 2 Mbit/s with<br />

all necessary device and system parameters including the bit-timing<br />

settings. The SAE J2284-5 document does the same for<br />

a point-to-point CAN FD communication running at 5 Mbit/s.<br />

The given parameter values derive mainly from General<br />

Motors' first CAN FD system designs. Reading between the<br />

lines, the specification can also be adapted for other topologies<br />

and bit-rates.<br />

The Japan Automotive Software Platform and Architecture<br />

(JasPar) association also develops guidelines for CAN FD<br />

device and system design. The Japanese nonprofit group cooperates<br />

with CAN in Automation (CiA). Both associations exchange<br />

documents and comment on each other's drafts. Recently,<br />

there was a joint meeting to discuss the ringing suppression, in<br />

order to achieve higher bit-rates or to support hybrid topologies<br />

such as multi-star networks.<br />

Currently, CiA has released its CiA 601-3 document. Besides<br />

the oscillator frequency (see above), it recommends the<br />

bit-timing configuration and gives some optimization hints for the<br />

phase margin. This includes recommendations for the topology<br />

and the device design (especially limiting parasitic capacitance).<br />

B. Bit-timing configuration recommendations<br />

The bit-timing configuration has two aspects: Setting the<br />

nominal time-quantum for the arbitration phase and the data<br />

time-quantum for the data-phase, as well as setting the related<br />

sample-points including the secondary sample-point (SSP)<br />

in the data-phase, when the TDC is used.<br />

The recommendations given below consider that with each<br />

resynchronization, a receiving node can correct a phase error of<br />

sjw_D in the data-phase and sjw_A in the arbitration phase. The<br />

larger the ratio sjw_D:BT_D, the larger the resulting CAN clock<br />

tolerance in the data-phase. The same holds for the arbitration<br />

phase with sjw_A:BT_A. The absolute number of resynchronizations<br />

per unit of time increases towards higher bit-rates. However,<br />

the absolute value of sjw_D or sjw_A decreases proportionally<br />

with the bit-time. In other words, a higher bit-rate leads to<br />

more, but smaller resynchronizations. A CAN FD node performs<br />

the bit-rate switching at the SP of the BRS (bit-rate<br />

switch) bit and the CRC (cyclic redundancy check) delimiter<br />

bit. All three available SPs (arbitration phase SP, data phase<br />

SP, and data phase SSP) can be chosen independently.<br />

In the arbitration phase, the nodes are synchronized and<br />

need the propagation segment as a waiting time for the round-trip<br />

of the bit-signal. In the data-phase, the nodes are not synchronized;<br />

therefore, no delays need to be considered. Nevertheless,<br />

phase segment 1 should be large enough to guarantee<br />

a stable signal.<br />

For the data-phase bit-timing settings all the following recommendations<br />

should be considered. For the arbitration phase,<br />

just recommendations 3 and 5 apply.<br />

Recommendation 1: Choose the highest available CAN<br />

clock frequency<br />

This allows shorter values for the tq. Use only recommended<br />

CAN clock frequencies (see above).<br />

Recommendation 2: Set the BRP_A bit-rate prescaler equal<br />

to BRP_D<br />

This leads to identical tq values in both phases and prevents<br />

an existing quantization error from turning into a phase<br />

error during bit-rate switching inside the CAN FD data frame.<br />

Recommendation 3: Choose BRP_A and BRP_D as low as<br />

possible<br />

Lower BRPs lead to shorter tq, which allows a higher resolution<br />

of the bit-time. This has the advantage that the SP can be<br />

placed more accurately at the optimal position. The size of the<br />

synchronization segment is shorter and reduces the quantization<br />

error. Additionally, the receiving node can synchronize<br />

more accurately to the transmitting node, which increases the<br />

available robustness.<br />

Recommendation 4: Configure all CAN FD nodes to have<br />

the same arbitration phase SP and the same data phase SP<br />

The simplest way to achieve this is to use the identical bit-timing<br />

configuration in all CAN nodes. This is not always<br />

possible, when different CAN clock frequencies are used. Within a<br />

node, the arbitration phase SP and the data phase SP can differ<br />

without any impact on robustness. Different SPs across the CAN<br />

FD nodes, however, reduce robustness, because they lead to different<br />

lengths of the BRS bits and CRC delimiter bits in the different<br />

nodes and a phase error introduced by the bit-rate switching.<br />

The SSP can be different in the CAN nodes without influencing<br />

robustness.<br />

Recommendation 5: Choose sjw_D and sjw_A as large as possible<br />

The maximal possible values are min(ps1_A/D, ps2_A/D). A<br />

large sjw_A value allows the CAN node to resynchronize quickly<br />

to the transmitting node. A large sjw_D value maximizes the<br />

CAN clock tolerance.<br />

Recommendation 6: Enable TDC for data bit-rates higher<br />

than 1 Mbit/s<br />

In this case, BRP_D shall be set to 1 or 2 (see ISO 11898-<br />

1:2015). It is not recommended to configure the TDC with a<br />

fixed value, because of the large transmitter delay variations.<br />

C. SP positioning<br />

The SP locations of the arbitration phase and the data-phase<br />

may be different. If in the arbitration phase the SP is at the very<br />

far end of the bit-time, the maximum possible network length<br />

can be achieved. Sampling earlier reduces the achievable network<br />

length, but increases robustness. A value of higher than<br />

80 % is not recommended for automotive applications due to<br />

robustness reasons.<br />

956


[Figure: n CAN nodes (CAN node 1 … CAN node n), each µC connected via TX/RX to a transceiver driving CAN_H/CAN_L of the CAN network topology under test; scope probes capture TX1 (trigger), the bus at two points, and RX1/RX2. Trigger once at all nodes and measure once at all nodes for all trigger positions; result: matrix with n² measurements.]<br />
Fig. 1. Measurement setup for the evaluation of the asymmetry introduced by the topology (source: CiA 601-3)<br />

The SP location in the data-phase depends on the maximum<br />

possible bit asymmetries. There are two asymmetries, one for<br />

the worst lengthening of dominant bits (A1) and another for the<br />

worst shortening of dominant bits (A2) in a given network setup.<br />

Both values are given normally in ns. Both values are the<br />

sum of asymmetries caused by the physical network elements<br />

including transceiver, cabling, connectors, and optional circuitry<br />

(e.g. galvanic isolation). In order to avoid compensations,<br />

absolute values are added. ISO 11898-2:2016 specifies the<br />

asymmetry values for 2 Mbit/s and for 5 Mbit/s qualified transceivers.<br />

The asymmetries caused by the other physical network<br />

components are given by datasheets or needs to be estimated or<br />

measured. The system designer selects the worst-case connections<br />

in network and calculates or measures the both asymmetry<br />

values. Another option is to simulate it. There are providers<br />

offering such simulation services.<br />

A1_topology and A2_topology values are different for every communication<br />

relationship. This means in a setup with n CAN<br />

nodes there are n² values for A1_topology and n² values for<br />

A2_topology. To represent the worst case, the maximal A1_topology<br />

and the maximal A2_topology values are used to calculate A1 and A2,<br />

respectively. With the CiA 601-3 specification, CiA provides a<br />

spreadsheet to prove the robustness of the chosen bit-timing<br />

settings and the sample-points.<br />

D. Phase margin (PM) calculation<br />

The PM is the allowed shift of a bit edge towards the SP of<br />

the bit, at a given tolerance of the CAN clock frequency (df_used).<br />

In other words, this is the edge shift caused by physical layer<br />

effects that is still tolerated by the CAN protocol.<br />

The worst-case bit sequence, i.e. the one that leads to the lowest<br />

PMs, is when the transmitting node sends five dominant bits<br />

followed by one recessive stuff bit (for details see /CiA601-1/).<br />

This is the longest possible sequence of dominant bits followed<br />

by a recessive bit inside a frame. Current transceiver designs<br />

cause the largest bit asymmetry at this bit sequence, i.e. the<br />

recessive bit is typically shorter than its nominal value. Further<br />

effects additionally raise the asymmetry: e.g. asymmetric rise<br />

and fall times, bus topology, EMC jitter, etc.<br />

The PM1 and PM2 values, given in s (seconds), can be calculated<br />

by the equations (1) and (2) defined in the CiA 601-3 specification,<br />

with PM1 = phase margin 1, PM2 = phase margin 2, BT_D =<br />

data-phase bit-time, PS2_D = data-phase phase segment 2, and<br />

df_used = the applied CAN clock tolerance.<br />

III. OPTIMIZATION HINTS<br />

The transceiver chips or the SBCs cause a significant part<br />

of the overall asymmetry. Therefore, it is recommended to<br />

always use components qualified for higher bit-rates. Even for<br />

2-Mbit/s CAN FD networks, 5-Mbit/s-qualified chips should be<br />

chosen.<br />

A badly designed wiring harness can add considerable asymmetry.<br />

The following recommendations should be considered:<br />

• Use a linear topology, terminated at both ends.<br />

• Reduce the total bus length.<br />

• Limit the number of CAN nodes.<br />

• Avoid long, unterminated stubs, which are branches<br />

from the well-terminated CAN lines; use stubs in the<br />

“cm-range” instead of the “m-range”. Consider a high-ohmic<br />

termination of otherwise unterminated stubs.<br />


• Optimize the low-ohmic termination (resistor position<br />

and resistor value). Another option is to increase the<br />

low-ohmic termination resistance (e.g. 124 Ω instead<br />

of 120 Ω) to compensate for the high-ohmic terminations<br />

in systems with many nodes.<br />

• Reduce the number of stubs per star point. The more<br />

stubs are connected to one star point, the higher the reflection<br />

factor gets.<br />

• If a star point with many branches is required<br />

due to mechanical constraints, avoid identical stub<br />

lengths per star point.<br />

• If multiple star points are required, keep a significant<br />

distance between the star points.<br />

• Cable cross-section: increase it to approximately 2 ×<br />

0,35 mm² for the CAN_H and CAN_L wires.<br />

Besides these system design recommendations, the device<br />

designer should consider the following hints:<br />

• Limit the parasitic capacitance of the device. The parasitic<br />

capacitance of the device includes the following<br />

parameters: additional ESD protection elements; parasitic<br />

capacitance of the connector; parasitic capacitance<br />

of the CAN_H or CAN_L wire; parasitic capacitance<br />

of the CMC; the parasitic capacitance of the transceiver<br />

input pins. All this parasitic capacitance should be<br />

below 80 pF per channel.<br />

• CAN_H and CAN_L PCB tracks from connector to<br />

transceiver should be of equal length and routed in parallel.<br />

• Keep the TXD and RXD PCB tracks between host<br />

controller and transceiver short.<br />

• Configure the host controller TXD output pin with<br />

strong push-pull behavior: a pull-up or pull-down resistor<br />

behavior can cause additional asymmetries and<br />

propagation delays.<br />

• Avoid any serial components like logical gates or resistors<br />

within the TXD and RXD connection lines between<br />

host controller and transceiver. In case galvanic<br />

isolation is required, take care of the potential additional<br />

asymmetry and select components accordingly.<br />

• Use a CAN clock source with lower clock jitter.<br />

• Avoid galvanic isolation, or use a galvanic isolation solution<br />

that adds only a small asymmetry.<br />

In order to optimize the PM, the following hints should be<br />

considered:<br />

• Optimize the bit-timing configuration by reducing the<br />

tq length. This increases PM1 by reducing the quantization<br />

error.<br />

• Use a CAN clock with lower tolerance (df_used). This<br />

improves PM1 and PM2.<br />

IV. OTHER PHYSICAL LAYER OPTIONS<br />

Besides galvanic isolation, there are some other options,<br />

which the system and device designer may consider. European<br />

carmakers often use common-mode chokes, for example. Further<br />

add-on circuitry includes a split-termination (two 60-Ω<br />

resistors) with a capacitor to ground.<br />

The CiA community discusses a ringing suppression option,<br />

which will be specified in the CiA 601-4 document. It is<br />

still under development. In general, such ringing suppression<br />

circuitry dynamically changes the network impedance to reduce<br />

the ringing at the beginning of the bit-time. Before the SP,<br />

the impedance is dynamically switched back to the nominal<br />

value. Two approaches are discussed:<br />

• Ringing suppression circuitry on the critical receiving<br />

nodes (CiA 601-4 version 1.0)<br />

• Ringing suppression circuitry on the transmitting nodes<br />

The updated CiA 601-4 will just specify the requirements<br />

and not the implementations. The automotive industry is highly<br />

interested in ringing suppression. It would allow higher<br />

bit-rates (8 Mbit/s is desired) or tolerate higher asymmetries<br />

caused by the network topology.<br />

The common-mode choke specification for CAN FD networks<br />

will be given in CiA 601-6. This document is also under<br />

development. It will mainly contain recommendations on how<br />

to measure the values to be provided in datasheets. The goal is<br />

to make datasheet values more comparable than they are today.<br />

CiA members are also working on a cable specification<br />

(CiA 110). It is intended to define parameters and how to<br />

measure them, in order to make datasheets comparable.<br />

V. SUMMARY AND OUTLOOK<br />

The ISO 11898 series does not provide device and system<br />

design recommendations and specifications. CiA, JasPar, and<br />

SAE give those in their documents. Some of these documents<br />

are still under development. CiA, for example, provides in the<br />

CiA 601 series additional node and system design recommendations<br />

and design guidelines.<br />

A requirement specification for ringing suppression circuitry<br />

is under development (CiA 601-4). Guidelines for common-mode<br />

chokes are also in preparation (CiA 601-6).<br />

REFERENCES<br />

[1] Arthur Mutter: "Robustness of a CAN FD bus system – About oscillator<br />

tolerance and edge deviations”, 14th international CAN Conference<br />

(iCC 2013), Paris, France, 2013<br />

[2] Florian Hartwich: “The configuration of the CAN bit-timing”, 6th<br />

international CAN Conference (iCC 1999), Turin, Italy, 1999<br />

[3] Marc Schreiner: “CAN FD system design”, 15th international CAN<br />

Conference (iCC 2015), Vienna, Austria, 2015<br />

[4] Y. Horii, “Ringing suppression technology to achieve higher data rates<br />

using CAN FD,” 15th international CAN Conference (iCC 2015),<br />

Vienna, Austria, 2015<br />

[5] CAN Newsletter magazine 2012 to 2017 (several articles), Nuremberg,<br />

Germany.<br />

[6] CiA 601 series, Nuremberg, Germany<br />



CANopen FD<br />

Embedded network as base for IoT applications<br />

Reiner Zitzmann<br />

CAN in Automation GmbH<br />

Nuremberg, Germany<br />

headquarters@can-cia.org<br />

Abstract—In 2012, Bosch presented the improved CAN with flexible data rate (CAN FD). Since then, the international standardization of CAN FD has been finalized, the conformance test plan has been released, implementation guidelines for system and device design are available (CiA 600 document series), and the first microcontrollers from several manufacturers are on the market. As CAN FD is therefore ready to use, CAN in Automation and its members would like to offer the advantages of CAN FD to CANopen users as well. Therefore the CiA working group SIG application layer has prepared the improved and simple-to-use CANopen FD. CANopen FD combines the advantages of the well-accepted CANopen with being well equipped to meet future requirements in embedded networking.<br />

CANopen FD offers a lengthened PDO that provides high data throughput for big-data applications with a large data base. The larger PDO payload additionally accommodates sophisticated security measures. The new USDO, which replaces the classical SDO, supports high design flexibility: any CANopen FD device is able to access any other CANopen FD device. In contrast to classical CANopen, no system designer is required; cross-communication between CANopen FD devices can be established dynamically at runtime. This supports not only the trend toward flexible systems, where the end user acts as system designer by adding or removing system components at runtime. Configuration and diagnostics are also simplified, as a tool can access any network participant at runtime.<br />

CANopen FD was released by CAN in Automation in the document CiA 1301. In addition to the aforementioned features, CANopen FD provides an improved EMCY write service and a comprehensive error history, including time stamps and detailed information on the type of error. Nevertheless, most of the well-known CANopen functionality was kept, so that CANopen users can easily transition to CANopen FD and can reuse most of their CANopen know-how.<br />

Keywords—CAN FD, CANopen FD, Internet of things (IoT),<br />

Big data, Security<br />

I. INTRODUCTION<br />

More and more applications can be controlled and monitored via web-based applications, e.g. from a tablet or smartphone. These web-based applications, such as temperature control in a private household or the rental of a bike from a bike-sharing provider, rely on data that is often generated in an embedded or deeply embedded network of the application to be monitored or even controlled. To enable as many reasonable web-based applications as possible in the future, system designers of embedded networks will provide huge amounts of data in addition to the control data that is actually required.<br />
As a consequence of this trend, we can forecast a higher demand for data throughput and a high degree of communication flexibility at the embedded network level. To meet these requirements in the best way, CAN in Automation (CiA) has updated the well-accepted CANopen application layer and communication profile so that it combines the advantages of the new CAN FD data link layer with those of classical CANopen.<br />

II. CAN WITH FLEXIBLE DATA RATE (CAN FD)<br />

In 2012, on the occasion of CiA’s 13th international CAN conference, Bosch presented the improved CAN, called CAN FD [1]. It is as reliable as classical CAN but enables the user to achieve a higher data throughput in the embedded network.<br />

[Figure 1 shows the structure of a CAN FD frame: an arbitration phase at 50 kbit/s to 1 Mbit/s, a data transmission phase carrying up to 64 byte of data at a freely selectable transmission rate, and an ACK phase at 50 kbit/s to 1 Mbit/s.]<br />
FIGURE 1 – CAN WITH FLEXIBLE DATA RATE (CAN FD)<br />

As illustrated in Figure 1, a CAN FD frame is able to carry<br />

up to 64 byte of data. To increase the efficiency, the data field<br />

of a CAN FD frame can be transmitted with a higher<br />

transmission speed. Because of additional features for checking<br />



the data integrity [9], CAN FD can provide data at least with<br />

the same reliability as classical CAN.<br />

As many car manufacturers plan to use CAN FD to complement automotive Ethernet, we can expect microcontrollers with integrated CAN FD controllers to become as widely available as they are today for classical CAN. An improved CAN is thus available, combining robustness, flexibility, and simplicity with high data throughput and increased reliability. By updating CANopen with regard to CAN FD, CiA intends to make these attributes available to CANopen users.<br />

III. CANOPEN FD<br />

A. Impacts of CAN FD on CANopen<br />

In order to offer the benefits of CAN FD to their users, CAN-based higher layer protocols need to be adapted. Among others, the standardized CANopen protocol [5], successfully deployed in many projects, was updated as well. As a result, in September 2017, CAN in Automation released the CAN FD-based successor of CANopen, CANopen FD [6].<br />

At the beginning of the review process, the CAN in Automation (CiA) working group “SIG application layer” evaluated [2] which of the existing CANopen services would benefit from CAN FD and its increased data throughput. CANopen services can be divided into data-transport-oriented services and network-management-oriented services. A detailed examination of the network-management-oriented services shows that these services do not suffer from any bandwidth issues; most of them use only a few bytes of the classical CAN data frame. An overview is provided in Table I.<br />

TABLE I. CANOPEN NETWORK-MANAGEMENT-ORIENTED SERVICES<br />
CANopen service | Used data bytes<br />
Synchronization service | 0 to 1 byte<br />
Time stamp service | 6 byte<br />
EMCY write service | 8 byte<br />
Network management service | 2 byte<br />
Error control services | 1 byte<br />

Only the EMCY write service, which is intended for diagnostic tasks, utilizes the maximum size of eight byte of a classical CAN frame. A closer examination shows, however, that the EMCY write service uses just three standardized data bytes; the remaining data field can optionally be filled with manufacturer-specific error information. As a result of the evaluation of the network-management-related services, the SIG application layer decided to keep these services and the related protocols unchanged in an updated, CAN FD-capable version of CANopen [4]. Only the EMCY write service is to be reorganized to provide more detailed error diagnostic information, including a time stamp.<br />
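As an illustration of that classical frame layout, the sketch below packs an 8-byte EMCY data field. It assumes the CiA 301 coding of the three standardized bytes (16-bit error code plus 8-bit error register) followed by up to five manufacturer-specific bytes; the helper name is our own, not part of any specification.<br />

```python
import struct

def pack_emcy(error_code: int, error_register: int,
              manufacturer: bytes = b"") -> bytes:
    """Pack a classical 8-byte CANopen EMCY data field.

    Layout assumed here (CiA 301): 16-bit error code (little endian),
    8-bit error register, then up to 5 manufacturer-specific bytes,
    zero-padded to fill the classical CAN frame.
    """
    if len(manufacturer) > 5:
        raise ValueError("at most 5 manufacturer-specific bytes")
    return struct.pack("<HB", error_code, error_register) + \
           manufacturer.ljust(5, b"\x00")
```

For example, `pack_emcy(0x8130, 0x11)` yields an 8-byte field in which only the three standardized bytes are populated.<br />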

Subsequently, the CiA working group evaluated the data-transport-oriented services of CANopen. As illustrated in Table II, these services are limited by the size of the classical CAN frame data field. Mapping these services to CAN FD offers the possibility of providing enhanced functionality to the user.<br />

TABLE II. CANOPEN DATA-TRANSPORT-ORIENTED SERVICES<br />
CANopen service | Used data bytes<br />
Service data object, expedited | Payload limited to 4 byte by classical CAN frame<br />
Service data object, segmented | Payload limited to 7 byte per classical CAN frame<br />
Service data object, block transfer | Payload limited to 7 byte per classical CAN frame<br />
Process data object | Payload limited to 8 byte by classical CAN frame<br />
Multiplex process data object | Payload limited to 4 byte by classical CAN frame<br />

Both the Process Data Object (PDO) and the different types of Service Data Object (SDO) are limited by the size of the classical CAN data field. Mapping these objects to CAN FD adds increased performance to CANopen [3].<br />

B. Adaptation of the Process Data Object (PDO)<br />

CANopen PDOs, intended for high-priority command and status information, are determined by two parameter sets: the PDO communication parameters and the PDO mapping parameters. In general, the description of PDOs can remain unchanged. The communication parameters specify the CAN identifier used for a PDO, the event that triggers the transmission of a PDO, and some busload management features. Only the decision as to which kind of CAN frame is used for the PDO depends on whether a classical CAN or a CAN FD data link layer is present. Everything else is completely independent of the data link layer used and can therefore remain unchanged. As CANopen FD uses the CAN FD frame format throughout, no additional settings selecting the type of CAN-ID are required. As a result, the entire PDO communication parameter set can remain unchanged.<br />

With regard to the PDO mapping parameters, the result is rather similar. Currently, the content of a PDO is determined by a link list that allows linking 64 different parameters to one PDO. If the smallest unit to be linked to a PDO has the size of one byte, the existing table of 64 references is sufficient to fill a PDO with 64 byte of payload. In this regard it is very advantageous that the SIG application layer decided to maintain only data in the CANopen object dictionary whose size is an integer multiple of 1 byte. Therefore, the style of today’s CANopen PDO mapping parameter sets can be reused for CANopen FD in the same way. With regard to the description of PDOs, no adaptations are required from the specification’s point of view. Stack manufacturers and users have to be aware that PDOs can no longer be triggered by means of CAN remote frames, as CANopen FD does not support them. The implementation of PDOs can remain largely unchanged; only today’s 8-byte limit has to be adapted to the CAN FD frame sizes used (see Table III).<br />



TABLE III. SIZE OF DATA FIELD IN CAN FD FRAMES<br />
CAN FD DLC | Size of CAN FD data field<br />
0 | 0 byte<br />
1 | 1 byte<br />
2 | 2 byte<br />
3 | 3 byte<br />
4 | 4 byte<br />
5 | 5 byte<br />
6 | 6 byte<br />
7 | 7 byte<br />
8 | 8 byte<br />
9 | 12 byte<br />
10 | 16 byte<br />
11 | 20 byte<br />
12 | 24 byte<br />
13 | 32 byte<br />
14 | 48 byte<br />
15 | 64 byte<br />

Users have to be aware that the run-time verification of PDO lengths is of limited use in CAN FD-based systems [3]. As CAN FD increases the size of the data field in fixed steps, misconfigurations may be harder to detect. If, for example, due to a configuration error a transmitter sends just 17 instead of the intended 18 process data bytes in a PDO, a receiver cannot detect this error by checking the length of the CAN FD frame: in either case it receives a 20-byte data field, as this is the next supported CAN FD frame size.<br />
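The stepwise padding described above can be made concrete with a small helper. This is a generic sketch: the size values follow Table III, while the function name is our own, not taken from CiA 1301.<br />

```python
# CAN FD DLC (0..15) to data field size in bytes, as listed in Table III.
CANFD_DLC_TO_LEN = [0, 1, 2, 3, 4, 5, 6, 7, 8, 12, 16, 20, 24, 32, 48, 64]

def padded_size(payload_len: int) -> int:
    """Return the size of the CAN FD data field actually transmitted."""
    if not 0 <= payload_len <= 64:
        raise ValueError("CAN FD payload is limited to 0..64 bytes")
    # Pick the smallest supported data field that holds the payload.
    return next(size for size in CANFD_DLC_TO_LEN if size >= payload_len)

# A misconfigured 17-byte PDO and the intended 18-byte PDO both travel
# in a 20-byte data field, so a pure length check cannot tell them apart:
assert padded_size(17) == padded_size(18) == 20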

C. Adaptation of the Service Data Object (SDO)<br />

In contrast to the PDO, adapting the SDO to CAN FD was much more complex. Today’s SDOs use a bit-coded command specifier to distinguish between the different SDO protocols. Unfortunately, there is no coding left to indicate a new, enhanced, CAN FD-based SDO service, so there was no way to simply extend the well-known CANopen SDO service. In addition, the members of the SIG application layer saw limitations in the existing SDO service, especially for systems that are modified at run-time, e.g. by the end user; such systems would require a system integrator to configure the necessary cross-communication. Furthermore, the members of the SIG application layer recommend a strict separation between systems based on classical CAN and on CAN FD. The SIG application layer therefore discarded the idea of adapting the well-known SDO service and instead introduced an entirely new CANopen service, the Universal SDO (USDO).<br />

D. Universal Service Data Object (USDO)<br />

Introducing a new service offers the chance to overcome the limits of the existing SDO service. In times of the Internet of Things (IoT) and Industry 4.0, embedded networks such as CANopen face more and more challenges. There may be a demand for remote diagnostics, remote monitoring, or remote (fleet) management of the applications. To meet the requirements of such applications, high design flexibility is demanded: communication links are not necessarily known in advance but are established during runtime. Furthermore, the end user accessing an embedded network remotely is not necessarily an expert in CAN and CANopen. As end users may only be familiar with their application, they would prefer logical addressing over geographical addressing with many CANopen details (e.g. a command “set temperature in the living room to 22 °C” might be more convenient than the corresponding CANopen command “write 22 to index 3000h, sub-index 05h, in the device with CANopen node number 35”). Furthermore, if an embedded network is managed from outside the network, communication services that encapsulate many single tasks in one comprehensive service would be appreciated. If, for example, the very same configuration has to be applied to several or all devices, a confirmed broadcast and multicast service would be of interest. For accessing cascaded CANopen systems, an inherently routing-capable service would be appreciated.<br />

To be prepared for future challenges, the USDO provides solutions for all of these requirements. The USDO, as illustrated in Figure 2, allows unicast, multicast, and broadcast communication relationships between one client and one or several servers. The coding of the “Destination address” determines whether one or several servers are addressed.<br />

[Figure 2 shows the USDO download protocol between client and server. The USDO download request carries the fields Destination address (byte 0), Command specifier (byte 1), Session ID (byte 2), Index (bytes 3 and 4), Sub-index (byte 5), Data type (byte 6), Size (byte 7), and application data (bytes 8 up to 63). The USDO download response carries Destination address, Command specifier, Session ID, Index, and Sub-index (bytes 0 to 5). The destination address is coded as follows: 00h = broadcast (to all nodes); 01h to 7Fh = unicast (to the node with the indicated node-ID); 80h to FFh = multicast (to the nodes that are part of the indicated group).]<br />
FIGURE 2 – CANOPEN FD USDO DOWNLOAD<br />
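The destination-address ranges of Figure 2 can be expressed as a small classifier. This is an illustrative sketch with a function name of our own choosing, following the coding listed above.<br />

```python
def usdo_destination_kind(addr: int) -> str:
    """Classify a one-byte USDO destination address (coding of Figure 2)."""
    if not 0x00 <= addr <= 0xFF:
        raise ValueError("destination address must fit in one byte")
    if addr == 0x00:
        return "broadcast"   # to all nodes
    if addr <= 0x7F:
        return "unicast"     # to the node with the indicated node-ID
    return "multicast"       # to the nodes of the indicated group
```

For example, `usdo_destination_kind(0x23)` classifies the address as unicast to node-ID 23h.<br />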

In addition to a “Command specifier”, which informs the server about the type of the intended access, a “Session ID” is provided in the protocol. This allows one client to run several USDO transfers in parallel with the very same server. In contrast to today’s SDOs, the USDO therefore offers the opportunity to execute a program download to a device and to monitor the process via an additional USDO communication session. During such a USDO access, the client can upload or download single entries of a USDO server’s object dictionary. Furthermore, the USDO may provide source as well as final destination address information, as illustrated in Figure 3.<br />



[Figure 3 compares the local and the long-distance (remote) USDO upload request. The local request carries Local destination address, Command specifier, Session ID, Index, and Sub-index (bytes 0 to 5). The long-distance request additionally carries Destination network-ID, Destination node-ID, Source network-ID, and Source node-ID (bytes 0 to 9). The network-ID is coded as follows: 00h = network-ID unknown, only limited USDO remote protocol handling possible; 01h to 7Fh = unique network-ID; FFh = unconfigured CANopen FD device.]<br />
FIGURE 3 – COMPARISON OF LOCAL AND REMOTE USDO UPLOAD REQUEST<br />

As the network-ID and node-ID of the initial client (source) and of the final server (destination) are always provided in the USDO remote access format, data exchange across local network borders can be realized rather simply.<br />
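As a sketch, the addressing part of such a long-distance request could be modeled as follows. The field and method names are our own illustration; the actual on-wire coding is defined in CiA 1301, and only the network-ID value ranges of Figure 3 are used here.<br />

```python
from dataclasses import dataclass

@dataclass
class RemoteUsdoAddress:
    """Addressing fields of a long-distance USDO request (after Figure 3)."""
    dest_network_id: int
    dest_node_id: int
    src_network_id: int
    src_node_id: int

    def routable(self) -> bool:
        # 01h to 7Fh marks a unique network-ID; 00h means the network-ID
        # is unknown (only limited remote handling possible) and FFh marks
        # an unconfigured CANopen FD device, so neither allows full routing.
        return all(0x01 <= nid <= 0x7F
                   for nid in (self.dest_network_id, self.src_network_id))
```

With both network-IDs in the unique range, e.g. `RemoteUsdoAddress(1, 35, 2, 10)`, the request can be routed across network borders; an unknown (00h) network-ID cannot.<br />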

In one of the next versions, the USDO shall be given the ability to transfer complete arrays or records by means of Multiple Sub-index Access (MAS). This will simplify the transfer of data structures. In addition, the SIG application layer intends to meet IoT requirements such as logical addressing, which will be derived from the solution currently being developed by the SIG CANopen Internet of Things.<br />

E. Current status of CANopen FD<br />

The basic CANopen FD specification CiA 1301 was released in September 2017, and further specifications will follow. Currently, CiA working groups are specifying network start-up and management, including dynamic node-ID, network-ID, and bit timing adjustment. As a prerequisite for the obligatory conformance testing, the related CiA working groups are focusing on updating the conformance test plan as well as the electronic device description. In parallel, CiA working groups are evaluating CiA device and application profiles, with the focus on making the best use of the new CANopen FD features.<br />

IV. CANOPEN FD AND IOT<br />

Today’s and future embedded systems are faced with the requirement of generating the data base for many web-based applications. Lots of data have to be provided to a huge data pool so that many web applications can generate added value and provide convenient functions to customers. For predictive maintenance, for example, lots of data, such as the working hours in total, since the last service, or today, have to be communicated in addition to the pure control data. The increased data throughput caused by communicating these “nice-to-have” data can be handled rather easily by means of the lengthened CANopen FD PDOs as well as the higher communication speed.<br />

For remote diagnostics, system maintainers would like to have remote access to the embedded network, e.g. to upload diagnostic data or to update the firmware of old-fashioned devices.<br />

To enable such use cases in a simple way, CANopen FD can reuse emerging solutions that are currently being developed by the CiA working group SIG CANopen_IoT. This working group defines the “user handling” within a CANopen (FD)-to-IoT gateway [7]. Depending on the registered user and his or her user class, the gateway application provides gateway resources to that user. In this context, “resources” may be access rights (read or write) to dedicated network participants, memory and processing power in the gateway, etc.<br />

The CANopen FD USDO makes it easy to dynamically establish communication channels to any CANopen FD device in a sub-layered CANopen FD network architecture. So that an external user can learn what the CANopen FD network architecture looks like, system discovery services are introduced by the SIG CANopen_IoT. For the purposes of system discovery, configuration, and diagnostics, new attributes in the electronic device description and configuration file formats provide a mapping of a generic function to the application-specific use case [7]. Along with the so-called nodelist information, GraphML-capable tools, for example, are intended to provide an application-related visualization of the entire control system. An IoT application accesses the gateway and can discover the sub-layered CANopen (FD) system, e.g. based on the nodelist.graphml files. As soon as the system is known, the IoT application can use the remote CANopen access services according to CiA 309 [8].<br />

Currently, the SIG CANopen_IoT specifies the mapping of the generic access services to HTTP requests. This should free the IoT application from needing knowledge of any CANopen (FD) specifics. The logical addressing will additionally support this: CANopen device descriptions already allow logical addressing based on the reference designation system, and CANopen FD will complement this by enhancing the CANopen FD USDO with logical addressing.<br />

Connecting an embedded network to the IoT immediately raises the issue of preventing unintended access to that embedded network. The working group TF security is currently developing a solution that allows CANopen FD users to guarantee that only the intended parties get access to specific data. The TF benefits from the larger payload of the exchanged CAN FD frames, which allows, for example, the transfer of security keys in a single CANopen FD PDO.<br />

V. SUMMARY<br />

CiA released CANopen FD in the specification CiA 1301 in September 2017. Several CANopen FD protocol stack manufacturers have tested the new CANopen FD features on the occasion of CiA plug fests.<br />
On the one hand, CANopen FD keeps the basic attributes of the well-known CANopen; on the other hand, it enriches CANopen with extended data throughput and higher design flexibility. Therefore, today’s CANopen users can reuse most of their existing CANopen knowledge when migrating from CANopen to CANopen FD and can focus on the new functionality. Depending on the communication bit rates used, they can use CANopen FD in existing network topologies.<br />
CANopen FD users will benefit from the new USDO, which serves in CANopen FD as a multi-function tool.<br />



Establishing any kind of communication relationship, dynamically and depending on the use case, will meet the requirements of future system design.<br />
The larger payload provided by CAN FD data frames will support the introduction of safety and security solutions in CANopen FD systems. CiA working groups are currently working on this subject.<br />
The basic specification CiA 1301 has been released. Microcontrollers with integrated CAN FD controllers are available from several semiconductor manufacturers. CiA provides recommendations for CAN FD device design and system design. Everything is therefore available for CANopen networking in the next decade.<br />

REFERENCES<br />

[1] CAN in Automation, Florian Hartwich, Robert Bosch GmbH, CAN with<br />

Flexible Data-Rate, Proceedings of the 13th international CAN<br />

Conference<br />

[2] CAN in Automation, Heinz-Jürgen Oertel, Using CAN with flexible<br />

data-rate in CANopen systems, Proceedings of the 13th international<br />

CAN Conference<br />

[3] CAN in Automation, Dr. Martin Merkel, Ixxat Automation GmbH,<br />

CANopen on CAN FD, Proceedings of the 14th international CAN<br />

Conference<br />

[4] CAN in Automation, SIG application layer meeting minutes 2013 to<br />

2017<br />

[5] CiA 301, CANopen application layer and communication profile,<br />

Version 4.2<br />

[6] CiA 1301, CANopen FD application layer and communication profile,<br />

Version 1.0<br />

[7] CAN in Automation, SIG CANopen Internet of Things (IoT) meeting<br />

minutes 2012 to 2018<br />

[8] CiA 309, CANopen access from other networks – Part 1: General<br />

principles and services, Version 2.0<br />

[9] ISO 11898-1:2015. Road vehicles – Controller area network – Part 1:<br />

Data link layer and physical signalling<br />



High Integrity Software Is Fundamental to<br />

Autonomous Embedded Systems<br />

Jeffrey Fortin<br />

Vector Software<br />

Vector Informatik<br />

East Greenwich, RI USA<br />

jeffrey.fortin@vectorcast.com<br />

The use of autonomous systems is expected to grow substantially over the next decade. This emerging technology is a disruptive force in the market, enabling new business models that challenge current mobility solutions. But will these autonomous systems be trusted and accepted by the public? For these predictions to come true, the new autonomous systems must have at least the same level of integrity as the existing solutions we use today.<br />

These new autonomous embedded systems are expected to<br />

have a significant amount of functionality implemented in<br />

software. This is foreshadowed in the recent trend of software<br />

playing a significant role in automation systems in general.<br />

Understanding the behavior of this software is the key to assessing its integrity. To meet the goal of a fully automated driving system, for instance, we must, at a minimum, adopt the proven methods for developing high integrity systems currently used by the safety-critical industries.<br />

The automobile industry itself has taken the lead in establishing standards for software integrity. ISO 26262 and MISRA are the two software standards that apply to the verification and validation of vehicle-based software.<br />

The application of these methods must be balanced with business objectives. The systems must be neither under-tested nor over-tested, with the amount and level of testing determined by the level of risk.<br />

Keywords—Autonomous; Automated Software Testing; High Integrity Software; Safety; ISO 26262; MISRA<br />

I. INTRODUCTION<br />

Integrity means being trustworthy. High Integrity means<br />

having a high degree of trust. A fundamental principle of an<br />

autonomous embedded system is that it must be trusted to do<br />

the right thing. It is this trust that provides the value of the<br />

system.<br />

To address this issue of trust, the automobile industry has<br />

established standards for software integrity. ISO 26262 and<br />

MISRA are the two standards that apply to the verification and validation of vehicle-based software. ISO 26262 is a functional safety standard entitled “Road vehicles – Functional safety”.<br />

The standard is an adaptation of the Functional Safety standard<br />

IEC 61508 for electrical/electronic/programmable electronic<br />

safety-related systems. Part 6 of the ISO 26262 standard<br />

addresses the recommendations for software testing and<br />

verification as part of the standard for software development.<br />

Recommended activities include both unit level and system<br />

level testing such as functional tests and structural coverage<br />

tests. Test tools that support capture and reporting of structural<br />

code coverage are highly recommended in the standard for all<br />

Automotive Safety Integrity Levels (ASIL) defined by ISO<br />

26262.<br />

These standards must, however, be practical from a business perspective. The level of testing effort must be correlated to the associated level of risk.<br />

replaced with an automated repeatable software quality testing<br />

process allowing for rapid innovation (agility) while at the<br />

same time maintaining the integrity levels mandatory for an<br />

autonomous embedded system.<br />

II. AN AUTOMATION EXAMPLE<br />

Let us take an example from the automotive industry. Early<br />

automobiles required a high degree of operator interaction for<br />

the automobile to function. The operator had to be mindful of a<br />

wide array of complex interactions and be skilled in<br />

understanding the correct settings and operations required to<br />

run the automobile. The Ford Model-T was a very popular<br />

early automobile. For the owners of these vehicles, here are the<br />

steps that were needed to start the Model-T:<br />

1. Check the fuel level by raising the front seat cushion, inserting a dipstick, and verifying that you have enough fuel.<br />
2. Check the oil by going underneath the car and opening the top petcock; if oil drips out, you have enough oil.<br />



3. Making sure the ignition is switched off, go to the front of the car and prime the engine by pulling the choke and turning the engine crank.<br />

4. Get in the car, turn on the ignition and adjust the<br />

spark advance to the top, open the throttle slightly.<br />

5. Get out of the car and turn the crank, making sure<br />

to use your left arm, that way if the engine should<br />

backfire there is less chance of breaking your arm.<br />

6. If all goes well, the car will start.<br />

Over time these systems, such as the starter and the carburetor choke, became automatic, meaning the operator no longer needed to worry about that aspect of the operation of the<br />

car. (It must have been a great relief to no longer have to risk<br />

breaking an arm just to start the car.) These improved systems<br />

worked automatically and could be relied on to do the right<br />

thing. They also improved the overall safety of the car. In<br />

today’s modern automobile we have moved beyond<br />

mechanical automation systems to using Electronic Control<br />

Units (ECUs), leveraging the power of computers and software<br />

to provide sophisticated control systems for anti-lock braking,<br />

collision detection, and lane change warnings just to name a<br />

few. Ultimately this pattern continues and now we see the<br />

emergence of Advanced Driver Assistance Systems (ADAS)<br />

and Automated Driving Systems promising a future when<br />

manually operating an automobile will be considered reckless<br />

behavior.<br />

III. AUTOMATION IS NOT THE SAME AS AUTONOMOUS<br />

Automation systems surround us in our daily lives and have<br />

become an essential part of industry. Robotics, industrial<br />

control, integrated avionics systems and medical devices all<br />

leverage automation. But when we use the term “autonomous”<br />

we are really taking these systems to a new level. A familiar example use case is the autonomous vehicle: a vehicle capable of being navigated and maneuvered not under the control of a human but under computer control across a range of driving situations. Because of this higher level of automation, the risks are higher, making it all the more important to ensure the integrity levels necessary for these systems to be trusted and successful in the marketplace.<br />

IV. STEPS NEEDED TO DEVELOP AN AUTONOMOUS EMBEDDED SYSTEM<br />

To develop an autonomous system, we can look to the steps<br />

that are used currently for the development of high integrity<br />

systems.<br />

A. Requirements<br />

The fundamental starting point for development is a set of testable requirements. Taking the time to ensure the requirements are well understood and testable is paramount.<br />

Software bugs are often the result of poorly written<br />

requirements. When developers are given incomplete<br />

requirements, they will make assumptions to fill in the gaps.<br />

These assumptions are then encoded into the system and, in<br />

most cases, are not consistent with the overall system<br />

requirements. Ensure your requirements are correct and can be<br />

tested.<br />

B. Using Test-Driven Development<br />

An effective way to ensure your requirements are testable<br />

and correct is to use a Test-driven development approach. In<br />

this approach the tests are written up front and agreed upon by<br />

the system engineers to fulfil the intent of the requirements. In<br />

this way the development proceeds with the goals or objectives<br />

already defined, limiting wasted effort on coding incorrect<br />

software or software that does not meet the requirements. The<br />

tests act as guardrails on the development process, allowing rapid development with a lower risk of introducing costly and potentially dangerous bugs.<br />
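The red-green cycle described above can be sketched for an embedded C unit. The requirement and function below are invented for illustration and are not taken from any particular project.<br />

```c
#include <assert.h>
#include <stdbool.h>

/* Hypothetical requirement (illustrative only): "Emergency braking
 * shall engage when the gap to an obstacle is below 5.0 m and the
 * vehicle speed is above 30 km/h." */
bool emergency_brake_required(double gap_m, double speed_kmh)
{
    return (gap_m < 5.0) && (speed_kmh > 30.0);
}

/* The test encodes the agreed requirement as concrete input/output
 * pairs. In TDD it is written and run first, before the function body
 * exists, to confirm that it fails. */
void test_emergency_brake_requirement(void)
{
    assert(emergency_brake_required(4.0, 50.0));   /* close and fast: engage */
    assert(!emergency_brake_required(10.0, 50.0)); /* sufficient gap: no-op  */
    assert(!emergency_brake_required(4.0, 20.0));  /* low speed: no-op       */
}
```

In practice the test function is agreed upon and run first (and fails); the implementation shown is the minimal code that makes it pass.<br />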

C. Use Industry Best Practices<br />

Look to industry best practices such as ISO 26262 and<br />

MISRA that are used to develop high integrity automotive<br />

systems. Take advantage of the lessons learned from years of<br />

industry expertise. Even if your system is not regulated, it still<br />

needs to be trusted in order for it to provide the value intended.<br />

D. Leverage Test Automation<br />

Test Automation allows you to free scarce resources for use<br />

on other tasks. Tests should be fast and easy for anyone to run.<br />

This facilitates collaboration, so everyone is involved in<br />

improving the integrity of the system.<br />

V. AN INDUSTRY BEST PRACTICE - ISO 26262<br />

ISO 26262 defines four Automotive Safety Integrity Levels (ASILs), with ASIL A being the least critical and ASIL D the most critical. For each level there are associated requirements for<br />

testing. There are five types of tests defined in the standard but<br />

Requirements-based tests are highly recommended for all<br />

integrity levels. This reflects the importance requirements play<br />

in high integrity systems.<br />

The standard is also very clear that structural code coverage<br />

metrics must be collected. The standard defines three types of<br />

coverage metrics: Statement coverage, Branch coverage and<br />

Modified Condition/Decision Coverage. Structural coverage<br />

metrics can be measured using software tools making this task<br />

much easier to conduct. The standard also calls out tool<br />

confidence levels and the methods that are used to qualify a<br />

tool.<br />
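The difference between these coverage levels can be illustrated on a small C decision; the function and its conditions below are invented for illustration.<br />

```c
#include <stdbool.h>

/* Illustrative decision with two conditions. Branch coverage needs only
 * two tests (decision true, decision false). MC/DC additionally requires
 * showing that each condition can independently flip the outcome, which
 * for this decision needs three vectors. */
bool airbag_fire(bool crash_detected, bool occupant_present)
{
    return crash_detected && occupant_present;
}

/* A minimal MC/DC set for the decision above:
 *   (true,  true)  -> true    baseline
 *   (false, true)  -> false   only crash_detected changed, outcome flipped
 *   (true,  false) -> false   only occupant_present changed, outcome flipped */
```

Coverage tools count which of these vectors the test suite has actually exercised, which is what makes the metric measurable by software.<br />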

VI. THE MISRA CODING STANDARDS<br />

MISRA, the Motor Industry Software Reliability Association, has developed standards focused on language restrictions that mitigate reliability faults. To test<br />

against these standards, static analysis tools are used to<br />

examine the code and find any violations of the rules specified<br />

within a given standard. For example, there is a rule that a<br />

switch statement needs to include a default case. Using an<br />

automatic static analysis tool greatly simplifies the code<br />

inspection requirement to show compliance with the many<br />

rules that are specified in the standards.<br />
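A minimal sketch of the switch-default rule in practice is shown below; the gear-selection function and its values are invented for illustration.<br />

```c
#include <stdint.h>

typedef enum { GEAR_PARK, GEAR_DRIVE, GEAR_REVERSE } gear_t;

/* Without the default case, a corrupted or out-of-range `gear` value
 * would fall through silently. The default routes it to a defined safe
 * value, and a static analysis tool checks for its presence
 * mechanically. */
int32_t select_torque_pct(gear_t gear)
{
    switch (gear) {
    case GEAR_PARK:
        return 0;
    case GEAR_DRIVE:
        return 100;
    case GEAR_REVERSE:
        return -100;
    default:            /* required by the MISRA rule cited above */
        return 0;       /* safe fallback for unexpected values    */
    }
}
```
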

VII. AN EXAMPLE TEST AUTOMATION ENVIRONMENT<br />

The effort required to analyze and test all the code and to<br />

collect the code coverage metrics is made much easier with the<br />

use of automated testing tools. These tools can parse the code<br />

and automatically generate a test driver that can be used to call<br />

the function under test and to instrument the code to collect the<br />

code coverage metrics. They also have the ability to<br />

automatically generate detailed test reports and provide testing<br />

metrics that can be used to show compliance with standards<br />

and to determine release readiness. An example test harness is<br />

shown in Fig 1. Here we can see the original source code, the<br />

test driver that will be used to call the Module Under Test as<br />

well as stub functions that are used to mock the software<br />

interfaces that are external to the unit. In addition, it may be<br />

desirable to have real functions present in the test environment.<br />
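A hand-written sketch of such a harness is shown below. In practice the driver and stub are generated by the tool; all names here are invented for illustration.<br />

```c
#include <assert.h>

/* Module under test (invented for illustration): converts a raw ADC
 * reading to a temperature using an external calibration interface. */
int get_calibration_offset(void);   /* external interface, stubbed below */

int adc_to_temperature(int raw)
{
    return (raw / 4) + get_calibration_offset();
}

/* Stub standing in for the external interface: it returns a value set
 * by the test driver instead of touching real hardware. */
int stub_calibration_offset;
int get_calibration_offset(void)
{
    return stub_calibration_offset;
}

/* Generated-style test driver: configure the stub, call the module
 * under test, check the expected output. */
void test_adc_to_temperature(void)
{
    stub_calibration_offset = -10;
    assert(adc_to_temperature(100) == 15);  /* 100/4 + (-10) */
}
```
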

The test automation environment automatically generates<br />

test harnesses for each code coverage metric. For ISO 26262<br />

testing, the type of code coverage is defined for each integrity<br />

level. For efficiency, it’s important to be able to perform the<br />

right level of testing for the given level. In this way there is a<br />

balance between the effort required to perform the test and the<br />

associated risk level.<br />

Fig 1 Automated creation of a test harness<br />

A. Support for Test-Driven Development<br />

The test automation system should support a Test-driven<br />

Development methodology. This means it must have the ability<br />

to generate a test harness purely on the interface definition for<br />

the unit under test and the test inputs and expected outputs. In<br />

this way the code developers can use the test harness to prove<br />

the work they have done meets the requirements and they have<br />

a clear definition of done. Before any code is written the test<br />

should be run to be sure it fails. If the test can pass with no code written, it has no value in proving the code is correct. The developer then adds just enough code to make the test pass. Ideally only the code necessary to pass the test should be written, as any additional code is not required and wastes effort that could be spent elsewhere. These tests should<br />

be incorporated into the overall test process and run whenever<br />

a change is made to the code or to the test itself.<br />

B. Reporting and Metrics<br />

Included with the test automation system should be a<br />

reporting and metrics facility. Most software development<br />

standards require test artifacts that must be shown to an auditor<br />

to confirm compliance. Safety standards such as ISO 26262<br />

also have reporting requirements that are audited to show<br />

compliance with the regulations.<br />

Metrics drive the efficiency of the development process.<br />

They focus the resources on the highest risk areas and provide<br />

insight into the progress of the development. With metrics,<br />

clear release readiness criteria can be defined and tracked.<br />

Ideally these metrics should be visible to the entire team so<br />

everyone is aware of the project status and can participate in<br />

improving the overall quality and integrity of the software.<br />

C. Tool Qualification<br />

The tools used for test automation should be evaluated for<br />

use by a certification authority. These authorities will evaluate<br />

the software development procedures used to develop the tool<br />

and certify that the tool fulfills the requirements for a particular<br />

safety related standard such as ISO 26262. The certification<br />

organization acts as a trusted agent providing a recognized<br />

authority for the qualification of the tool.<br />

VIII. CONCLUSION<br />

The growth of autonomous embedded systems will create high demand for efficient testing methods that prove the integrity of the software. Within the next decade, today's autonomous systems will look as outdated as a Model T does to us now. But for this growth to happen, the systems must be trustworthy. Following the steps outlined in this paper will go a long way toward meeting the challenge of providing a trusted autonomous embedded system in a way that is in line with business objectives.<br />

Safety & Security Testing of Cooperative<br />

Automotive Systems<br />

Dominique Seydel, Gereon Weiss<br />

Application Architecture Design & Validation<br />

Fraunhofer ESK<br />

Munich, Germany<br />

{dominique.seydel, gereon.weiss}@esk.fraunhofer.de<br />

Daniela Pöhn, Sascha Wessel<br />

Secure Operating Systems<br />

Fraunhofer AISEC<br />

Garching, Germany<br />

{daniela.poehn, sascha.wessel}@aisec.fraunhofer.de<br />

Franz Wenninger<br />

Design, Test & System Integration<br />

Fraunhofer EMFT<br />

Munich, Germany<br />

franz.wenninger@emft.fraunhofer.de<br />

Abstract— Cooperative behavior of automated traffic<br />

participants is a next step towards the goals of reducing the<br />

number of traffic fatalities and optimizing traffic flow. The<br />

notification of a traffic participant’s intentions and coordination<br />

of driving strategies increase the reaction time for safety<br />

functions and allow foresighted maneuver planning. When<br />

developing cooperative applications, a higher design complexity<br />

has to be handled, as components are distributed over<br />

heterogeneous systems that interact with a varying timing<br />

behavior and less data confidence. In this paper, we present a<br />

solution for the development, simulation and validation of<br />

cooperative automotive systems together with an exemplary<br />

development flow for safety and security testing.<br />

Keywords— automotive safety; cooperative applications;<br />

security testing; validation; autonomous systems; ITS<br />

I. INTRODUCTION<br />

In comparison to the development of traditional ADAS<br />

functions, testing and simulation of connected applications<br />

have to consider the interaction of heterogeneous systems that<br />

are distributed within a wireless networked architecture. As the<br />

communication link is less reliable than common<br />

input sensors, the application has to cope with varying timing<br />

behavior and less data confidence. However, the higher<br />

complexity in the development process of cooperative<br />

applications is justified by several advantages. They arise from the fact that foreign traffic participants are no longer solely observed from the outside to predict their behavior, but give insights into their status, intentions and their involvement in<br />

cooperative maneuvers. This results in an increased reliability<br />

of predicted vehicle movements, which in turn can be used for<br />

safety functions and allow an increased reaction time of safety<br />

mechanisms. In terms of driving comfort, the traffic<br />

participant’s cooperation allows foresighted maneuver<br />

planning.<br />

By the current state of available tools, the development, test<br />

and certification of autonomous systems is complex, costly in<br />

terms of time and equipment, potentially hazardous and often<br />

incomplete. For instance, when it comes to complex applications<br />

that require a distributed consensus, e.g. Merging Assistance<br />

[1], an application distributed among various foreign entities<br />

has to be validated. Thus, it appears that development and<br />

simulation environments are not yet ready to rapidly develop<br />

prototypes of cooperative driving functions.<br />

Therefore, we provide an approach for an integrated testing<br />

environment that can cover the whole innovation cycle for<br />

prototype development of cooperative automotive systems.<br />

Incorporating safety and security aspects, it starts from the<br />

design of applications through simulation to integrating and<br />

validating the respective prototypes. Thus, our approach<br />

supports the whole development process for prototyping and<br />

testing cooperative functions.<br />

The following Chapter II gives an overview of an efficient<br />

approach for the development of cooperative applications.<br />

Further aspects of the simulation and testing phase are<br />

discussed in Chapter III. The current scope of software analysis<br />

is presented in Chapter IV and a secured deployment process in<br />

the following Chapter V. We conclude our work with Chapter<br />

VI and provide a brief outlook to next steps.<br />

II. RAPID APPLICATION DEVELOPMENT<br />

A. Application Development Cycle<br />

The testbed combines several aspects of the Vehicle-to-X<br />

(V2X) application development life cycle. It covers<br />

development steps beginning from application design,<br />

continuous integration into simulation and testing environments, and tools performing functional and security analyses, up to a secure deployment and update process. The<br />

application development flow of the innovation and testbed concept is shown in Fig. 1. It is also designed so that single aspects can be used in a building-block style when developing innovative applications.<br />

The testing layer consists of several testing methods that<br />

are specific for each step in the development process and<br />

include several simulation environments, integration testing<br />

and field tests. The testbed provides the ezCar2x ® framework,<br />

described in [2], that allows testing connected applications<br />

within a simulation environment, using network and traffic<br />

simulation as well as integration of hardware-in-the-loop, e.g.<br />

Road Side Units (RSUs). Another main feature of the testbed is<br />

a combined simulation and field testing approach, where<br />

virtual and real traffic participants can be tested in a<br />

synchronized environment.<br />

Additional analysis tools make it possible to examine the developed<br />

application. On the one hand, our DANA tool (“Description<br />

and Analysis of Networked Applications”) for functional<br />

validation can be used iteratively in every testing step [3]. For<br />

the integration testing of the application, the analysis toolbox<br />

provides application security testing methods in order to detect<br />

software vulnerabilities in an early development stage. Using<br />

static application security testing (SAST) as well as dynamic<br />

application security testing (DAST) allows quick analyses<br />

during integration testing in order to detect potential software<br />

vulnerabilities.<br />

Finally, the application is built and subsequently signed<br />

within the software repository and pushed to the update server,<br />

which is part of the back end. The update server again signs the<br />

application and deploys it to V2X devices.<br />

B. Application Design<br />

For the initial development step of designing a cooperative<br />

driving function, the testing environment comprises interfaces<br />

to common automotive modelling tools, like Matlab Simulink<br />

or ADTF. The deployed application uses the ezCar2x ®<br />

framework, an ETSI ITS (Intelligent Transport Systems)<br />

compliant communication stack, which can either run on real<br />

communication hardware or on a virtual node within a network<br />

simulation. Furthermore, application security testing can be<br />

conducted with static and dynamic methods.<br />

Fig. 1 Application Development Cycle using the Testbed Services<br />

If enhanced safety mechanisms are required for the<br />

intended application, state-of-the-art software methods, e.g.<br />

graceful degradation strategies [4] or model-based<br />

communication [5], can be incorporated into the application<br />

model within this development step as well. For example, a<br />

connected application with safety-critical functionality, such as<br />

Platooning, strongly depends on Quality of Service (QoS)<br />

parameters of the communication link. Our safety function for<br />

Resilient Control uses these QoS parameters, such as the<br />

current Packet Loss Rate (PLR), to decide which degradation<br />

mode is sufficient, e.g. readjusting the distance to the vehicle<br />

ahead. The safety mechanisms for resilient control are<br />

developed as a generic component and can be integrated into common automotive software architectures, such as AUTOSAR,<br />

AUTOSAR Adaptive and further concepts. Also existing<br />

architectures from non-safety domains like infotainment, as<br />

developed by the GENIVI Alliance, can be integrated to handle<br />

the unreliability of the communication link.<br />
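A minimal sketch of such a QoS-driven mode selection is shown below. The thresholds and mode names are invented for illustration and are not taken from the described framework.<br />

```c
/* Hypothetical mapping from a measured Packet Loss Rate (PLR) to a
 * degradation mode, in the spirit of the resilient-control function
 * described above. All thresholds are illustrative. */
typedef enum {
    MODE_NOMINAL,        /* full cooperative control, minimum gap      */
    MODE_INCREASED_GAP,  /* readjust the distance to the vehicle ahead */
    MODE_FALLBACK        /* leave the platoon, rely on local sensors   */
} degradation_mode_t;

degradation_mode_t select_degradation_mode(double plr)
{
    if (plr < 0.05) {
        return MODE_NOMINAL;      /* link healthy: keep nominal behavior */
    }
    if (plr < 0.20) {
        return MODE_INCREASED_GAP; /* degraded link: widen the gap       */
    }
    return MODE_FALLBACK;          /* link unusable: local control only  */
}
```

Keeping the mapping in one pure function makes the degradation policy itself easy to cover with requirements-based tests.<br />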

Another aspect that is becoming more relevant for application design and validation is so-called Plastic Architectures. Parts of an application can be distributed over several entities. For example, in the case of a Collision Warning application [6], the interaction of the originating, the warning and optionally edge or cloud components forms the overall function. As the specific architecture may change frequently depending on the context, it becomes formable, or plastic. In the future, parts of the application may also be relocated dynamically at runtime, e.g. from cloud over edge to in-vehicle components. Thereby, the system boundaries change dynamically depending on the current communication relations. Although there are concepts to solve the underlying network aspects [7], these runtime conditions already have to be covered within the design phase of the specific applications.<br />

III. SIMULATION & TESTING<br />

One goal of simulation and testing for cooperative<br />

automated driving is to achieve a fail-operational behavior of<br />

the application, even when the context information is of low confidence. Therefore, the coverage of the test cases used within the simulation environment and during virtual testing should be as realistic and as comprehensive as possible. This is achieved by (automatically) defining reference scenarios and generating variations of them, e.g. stochastic variations [8].<br />

One of the parameter variations is the realistic behavior of<br />

the communication channel during a certain driving scenario.<br />

Therefore, a network simulation tool, e.g. ns-3 or OMNET++,<br />

and a traffic simulation tool, e.g. SUMO, VTD or CarMaker,<br />

are integrated into the simulation environment. Our testbed<br />

could also be integrated with other microscopic and<br />

macroscopic traffic simulation tools, as each of them has<br />

advantages when testing a specific connected application.<br />

A. Simulation Environment<br />

The suggested concept combines three different simulation<br />

aspects into one integrated simulation environment.<br />

The first component is a traffic simulator that is used to<br />

model and run driving test cases on a realistic road network.<br />

The second component is a network simulation tool for<br />

evaluating applications under real communication conditions.<br />

For the heterogeneous use of common vehicular<br />

communication technologies, e.g. 802.11p, 4G or LTE, the<br />

ezCar2x ® framework provides additional network layer<br />

components. The network simulation tool also facilitates<br />

interfaces to control the traffic simulation and to integrate hardware-in-the-loop or vehicle-in-the-loop tests (as for<br />

Virtual Platooning), e. g. including RSUs.<br />

The third component of the testbed is for test control.<br />

Traces from all simulation components are monitored and<br />

analyzed within the test control component. For ensuring the<br />

security of cooperative systems, testing covers white-, gray-,<br />

and black-box approaches (e.g. Data-Flow Analysis, Fuzzing<br />

or Penetration Testing). In order to validate the applications,<br />

test cases have to reach full coverage and should therefore be<br />

generated (semi-)automatically for each application.<br />

B. Integrated and Hybrid Simulation<br />

As already described in Chapter II.A, the application<br />

implementation is deployed on each virtual V2X node within<br />

the network simulation environment. Together with the<br />

ezCar2x ® Framework each virtual node can be equipped with<br />

developed applications and also with V2X communication<br />

ability. Hence, its interaction with other nodes can be simulated<br />

as realistically as possible.<br />

The interaction between all virtual nodes is realized using a<br />

virtual wireless channel. Thereby, we consider the specific<br />

characteristics of each communication technology by using<br />

individual channel models, e.g. dedicated models for ITSG5,<br />

LTE or 5G. This virtual wireless channel can also be used to<br />

integrate real hardware into the simulation by using a channel<br />

proxy and creating a mirror node for each hardware component<br />

within the network simulation.<br />

The network simulation is coupled with a macroscopic<br />

traffic simulation for large scale traffic scenarios, e.g., to test<br />

security mechanisms for V2X messages, and with a<br />

microscopic traffic simulation for smaller driving scenarios,<br />

e.g. Cooperative Merging [6] or Platooning. The coupling via a<br />

control interface is needed to synchronize the behavior of<br />

communication nodes and traffic participants in each<br />

simulation for the given driving scenario.<br />

The environment can be extended with hybrid simulation<br />

capabilities by including hardware-in-the-loop. An RSU<br />

comprising an application, e.g. Smart Lighting, can be<br />

integrated into the simulation loop, by connecting it to the<br />

wireless channel interface of the simulation environment. The<br />

RSU can again interact with further communication hardware,<br />

e.g. test vehicles that are in communication range. The RSU<br />

can also be connected with sensors that are integrated into the<br />

testing environment and which can provide status data to<br />

generate event messages, e.g. Decentralized Environmental<br />

Notification Messages (DENMs).<br />

C. Sensor Integration<br />

The effectiveness of a connected application’s simulation<br />

depends on how realistic the input data for a certain driving<br />

scenario is. The input data from distributed sources, e.g. the<br />

status data within Cooperative Awareness Messages (CAMs)<br />

from other vehicles or roadside sensors, have to be<br />

synchronized during the recording and replay phases.<br />

Synchronization is required to setup the intended driving<br />

conditions for the application that are to be tested. To achieve<br />

realistic input data, it is beneficial to have sensors integrated in<br />

an early development step, through Hardware-in-the-Loop<br />

(HiL) or recorded input data stream integration.<br />

The realistic sensor data can also be used to develop<br />

algorithms for sensor analysis, timing and improvement of<br />

machine learning processes. In addition, to develop tamperproof<br />

algorithms it is advantageous to have a realistic behavior<br />

of sensor data available in order to avoid manipulation or<br />

attacks with sensor hardware or sensor data.<br />

Within our simulation environment, vehicle sensors and<br />

infrastructure sensors are integrated as components in<br />

ezCar2x® via generic sensor interfaces. When recording test<br />

data, the real sensors can easily be included into the<br />

synchronized recording process. The same setup can be used<br />

during field tests, where infrastructure sensors usually are<br />

integrated into RSUs and thereby provide their environment<br />

data. Thus, virtual, hybrid and integration testing can be carried<br />

out with low effort.<br />

IV. SOFTWARE ANALYSIS<br />

In each step of the development process it is beneficial to<br />

perform additional analyses to get detailed knowledge of the<br />

overall system and the applications behavior for debugging,<br />

monitoring, security, and validation purposes. In this chapter,<br />

we give an overview of our methods and tools for software<br />

analysis.<br />

An exemplary flow of the development process for an<br />

application is shown in Fig. 2. The application under<br />

development can be prototyped as implemented source code or<br />

as a software model to be further developed and optimized<br />

within the testbed.<br />

A. Monitoring and Functional Validation<br />

For software validation and verification, model-based<br />

techniques are advantageous during the design and integration<br />

phase. Our DANA platform [3], an open and modular<br />

environment based on Eclipse, is a tool built for specifying and<br />

analyzing networked applications. For this purpose, the<br />

specified valid behavior of the application is described as a<br />

layered reference model. This model provides a basis for<br />

further model-based development steps. On the one hand, it<br />

can be used for various transformations of behavior models,<br />

e.g., for generating test cases or code for running simulations.<br />

On the other hand, it can be used for static analyses to check<br />

conformance to modeling guidelines, metrics for interfaces,<br />

and the compatibility of behavior models. The model-based<br />

approach also allows a quick integration of new message<br />

sources, e.g. additional communication protocols or wireless<br />

channels. Furthermore, the DANA tool can be used for<br />

verifying and validating software interface behavior, as<br />

messages in these interfaces can contain complex data and<br />

include intricate interactions.<br />

In our proposed testbed we use DANA as a central<br />

monitoring tool to have all the status, debug and behavior<br />

Fig. 2 Exemplary Development Flow for Safety & Security Testing<br />

information, the error messages and timing data centrally<br />

available from each component. This aggregation helps to<br />

simplify and to speed up the debugging process during<br />

development and runtime. Further validation checks can be<br />

applied on this collected data, as described in the previous<br />

paragraph.<br />

B. Application Security Testing<br />

By integrating static as well as dynamic testing methods<br />

into the development process, applications can be tested against<br />

a broad spectrum of software vulnerabilities, as outlined in [9].<br />

Detecting vulnerabilities in early development stages is crucial,<br />

as it prevents the necessity of expensive software patching<br />

post-release. Therefore, we propose a work flow, where<br />

application security testing is part of continuous integration.<br />

Static code analysis as part of SAST allows inspecting<br />

program behavior without actually running the program. A<br />

large number of tools is available for the programming languages typically used within the V2X domain, detecting violations of software requirements, e.g. erroneous program behavior, unreachable or dead code. Moreover, many<br />

tools are specifically designed to detect program flaws that lead<br />

to potential vulnerabilities, e.g. buffer overflows.<br />

SAST depends on the application source code and<br />

vulnerability specification as input, where the latter defines<br />

what kind of vulnerabilities should be detected by the tool. The<br />

static analysis then runs fully automated and outputs a report of<br />

the detected potential vulnerabilities.<br />

Available tools can be applied directly on source code as<br />

well as on binaries and bytecode. The applied methods differ in<br />

efficiency and effectiveness depending on the underlying<br />

approach. Broadly, applied methods can be divided into lexical<br />

scanning and data or control flow analysis.<br />
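A classic example of a flaw that data-flow-based SAST reports is an unbounded copy of externally controlled input. The functions below are invented for illustration.<br />

```c
#include <stdio.h>
#include <string.h>

/* Flawed version: a received message field is copied into a fixed-size
 * buffer with no bounds check. Data-flow analysis tracks `msg` from the
 * external interface into strcpy() and reports a possible overflow. */
void handle_station_id_flawed(const char *msg)
{
    char id[16];
    strcpy(id, msg);             /* flagged: possible buffer overflow */
    printf("station: %s\n", id);
}

/* Repaired version: the copy is bounded and the result is always
 * null-terminated. */
void handle_station_id(const char *msg, char *id, size_t id_len)
{
    strncpy(id, msg, id_len - 1);
    id[id_len - 1] = '\0';
}
```
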

While SAST has shortcomings, e.g. detection of<br />

vulnerabilities in authentication, access control or<br />

cryptographic protocols, as well as uncovering flaws in the<br />

security design, DAST can tackle these problems. Furthermore,<br />

it provides deeper analysis and can imitate attack scenarios.<br />

Since applications are executed within this testing method,<br />

DAST depends on predefined input data. These inputs either<br />

represent specific test cases or are randomly chosen.<br />

Dynamic code analysis is a white-box testing approach.<br />

This means that the internal structure of an application, e.g.<br />

source code or intermediate representation, is available and can<br />

be leveraged for analyzing the application.<br />

Monitoring of function calls by hooking into security<br />

critical functions, e.g. system calls, is one typical testing<br />

method. In addition, function parameter analysis is applied,<br />

inspecting the input-output relations of function calls. More<br />

sophisticated methods like dynamic taint analysis are able to<br />

analyze execution paths an attacker may use to exploit an<br />

application. This method can also be applied if source code of<br />

an application is not available.<br />

All of these methods aim at detecting potential<br />

vulnerabilities that occur during runtime of the application.<br />

This allows verifying alarms that have been reported by SAST.<br />

Furthermore, it can be used as preparation for penetration<br />

testing, as it provides potential entry points for actual attacks.<br />
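A minimal sketch of dynamic testing with randomly chosen inputs is shown below: an invented message-field decoder is executed repeatedly while a runtime invariant is checked on every call, roughly as a DAST tool would do.<br />

```c
#include <assert.h>
#include <stdlib.h>

/* Decoder for a hypothetical V2X speed field (invented for
 * illustration): raw values outside the valid encoding are rejected. */
int parse_speed_field(int raw)
{
    if (raw < 0 || raw > 16383) {
        return -1;                  /* reject out-of-range encodings */
    }
    return raw / 100;               /* decoded value in m/s */
}

/* Fuzz-style driver: feed in-range and out-of-range inputs and check
 * a runtime invariant on every execution. */
void fuzz_parse_speed(unsigned iterations)
{
    srand(1234u);                   /* fixed seed: reproducible run */
    for (unsigned i = 0; i < iterations; i++) {
        int raw = (rand() % 40000) - 2000;  /* spans both input classes */
        int v = parse_speed_field(raw);
        /* invariant: input is either rejected or decoded into range */
        assert(v == -1 || (v >= 0 && v <= 163));
    }
}
```

An invariant violation during such a run is exactly the kind of dynamically confirmed finding that can corroborate a SAST alarm or feed penetration testing.<br />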

V. SECURE DEPLOYMENT AND PROTOTYPING<br />

To rapidly transfer the developed application into a<br />

prototype, as needed for field tests, an integrated deployment<br />

process is beneficial. The testbed offers an integrated solution<br />

containing a secure V2X platform, a continuous integration<br />

workflow, secure deployment and update processes and a<br />

communication link to back end or cloud platforms.<br />

Novel cooperative functions can be integrated with<br />

ezCar2x® into secure ITS prototype devices. These build upon<br />

trust2X [10], a hardened platform that includes hardware- and<br />

software-based security in order to isolate and protect<br />

processes and data of cooperative driving functions from other<br />

operating systems (e. g. AUTOSAR), functional modules and<br />

communication interfaces (e. g. backend communication for<br />

secure software updates and app deployment).<br />

The hardened V2X platform trust2X is a Linux-based operating system that provides hardware- and software-based security features for target devices of V2X applications.<br />

Its main goal is the secure isolation of different software<br />

entities that run concurrently on top of a single Linux kernel.<br />

ezCar2x® can run in a Linux container, isolated from other<br />

guest operating systems (e.g. AUTOSAR) and other functional<br />

modules or services (e.g. TLS, VPN). By isolating software<br />

entities, each entity is protected against compromised or<br />

malfunctioning software that runs in a different container.<br />

Therefore, safety- and security-critical functions can run within<br />

an isolated environment, unaffected by failure or compromise<br />

of separate entities in the system.<br />

Each commit to the central software repository<br />

automatically triggers a build of the application under<br />

development. While each build is self-testing, SAST and<br />

DAST are triggered automatically. Static security testing tools<br />

are applied either directly on source code, or on compilation<br />

products such as bytecode or binaries. Thus single commits can be tested with lightweight analysis tools, as can individual software modules or the complete application.<br />

Static application testing tools can run without user interaction<br />

and provide a report, which lists each finding of potential<br />

vulnerabilities. DAST tools might require user interaction<br />

depending on the applied testing method.<br />

If a potential vulnerability has been found, the<br />

continuous integration server alerts the development team.<br />

Based on the review of the testing reports either further<br />

security tests can be applied, or the developer commits a patch<br />

in order to eradicate the vulnerability. If the application has<br />

been successfully tested and no potential vulnerabilities were<br />

found, the continuous integration server triggers a merge<br />

request, or directly merges into master. Subsequently, the application is built and signed.<br />

VI. CONCLUSION<br />

We provided an approach for an integrated testing<br />

environment that covers the whole development process for<br />

prototyping and testing cooperative functions. Incorporating<br />

safety and security aspects starting from the design phase, the complex task of simulating cooperative applications and<br />

several testing steps have been described. Finally, a software<br />

solution for the validation and deployment of the prototypes<br />

was presented, which makes tools available for the whole<br />

development cycle. Thereby, a toolkit is provided that is<br />

intended to rapidly bring an idea for a connected application<br />

into a prototype with a decreased investment risk.<br />

In the future, all testbed services described in the previous<br />

chapters could also be made available as an online web-service.<br />

For this purpose, a next step of the described solution is to<br />

enable access to configurable or pre-configured simulations via<br />

an online service. On this website, users could remotely control<br />

simulation parameters, define new scenarios and get qualified<br />

evaluation results. This ongoing development addresses the<br />

rapid development of connected applications by abstracting away technology know-how and shortening time to market.<br />

Thus, innovators and developers can concentrate on the actual<br />

function and idea of the intended application and are able to<br />

experience, improve, and validate their solution in early stages<br />

prior to competitors.<br />

ACKNOWLEDGMENT<br />

This project was partially funded by the Bavarian Ministry<br />

of Economic Affairs and Media, Energy and Technology<br />

within the High Performance Center Secure Networked<br />

Systems.<br />

REFERENCES<br />

[1] Ntousakis, I. A., Nikolos, I. K., & Papageorgiou, M. (2017). Cooperative<br />

Vehicle Merging on Highways-Model Predictive Control (No. 17-<br />

00930).<br />

[2] Roscher, K., Bittl, S., Gonzalez, A. A., Myrtus, M., and Jiru, J. (2014).<br />

ezCar2X: Rapid-Prototyping of Communication Technologies and<br />

Cooperative ITS Applications on Real Targets and Inside Simulation<br />

Environments, In: 11th Conference Wireless Communication and<br />

Information. vwh, pp. 51 – 62.<br />

[3] Drabek, C., Weiss, G. (2017) DANA - Description and Analysis of<br />

Networked Applications. In: International Workshop on Competitions,<br />

Usability, Benchmarks, Evaluation, and Standardisation for Runtime<br />

Verification Tools (RV-CuBES), pp. 71-80.<br />

[4] Schleiss P., Drabek C., Weiss G., Bauer B. (2017) Generic Management<br />

of Availability in Fail-Operational Automotive Systems. In: Tonetta S.,<br />

Schoitsch E., Bitsch F. (eds) Computer Safety, Reliability, and Security.<br />

SAFECOMP 2017.<br />

[5] Moradi-Pari, E., Mahjoub, H. N., Kazemi, H., Fallah, Y. P., and<br />

Tahmasbi-Sarvestani, A. (2017). Utilizing Model-Based Communication<br />

and Control for Cooperative Automated Vehicle Applications. IEEE<br />

Transactions on Intelligent Vehicles.<br />

[6] Zhang, R., Cao, L., Bao, S., & Tan, J. (2017). A method for connected<br />

vehicle trajectory prediction and collision warning algorithm based on<br />

V2V communication. International Journal of Crashworthiness, 22(1),<br />

15-25.<br />

[7] An, X., et al. (2017) On end to end network slicing for 5G<br />

communication systems. In: Transactions on Emerging<br />

Telecommunications Technologies, 28. Jg., Nr. 4.<br />

[8] Damm W., Heidl P. (eds.) (2017) Positionspapier und Roadmap zu „Hochautomatisierte Systeme: Testen, Safety und Entwicklungsprozesse“, SafeTRANS e. V. http://www.safetrans-de.org/de/Aktuelles/?we_objectID=2, accessed 18.1.2018.<br />

[9] Eckert, Claudia. (2017) Cybersicherheit beyond 2020!. In: Informatik-<br />

Spektrum, 40. Jg., Nr. 2, pp. 141-146.<br />

[10] Waidner M. (2018) Safety und Security. In: Neugebauer R. (eds)<br />

Digitalisierung. Springer Vieweg, Berlin, Heidelberg<br />



Sensor Simulation<br />

Validation of safety-related sensors with real time capability<br />

Dr.-Ing. Kristian Trenkel<br />

Research<br />

iSyst Intelligente Systeme GmbH<br />

Nürnberg, Germany<br />

Kristian.Trenkel@isyst.de<br />

Abstract— Currently there are no suitable solutions for the real-time simulation of advanced sensors with digital interfaces, although this is necessary for development and testing. The presented solution enables the simulation of sensors in real time; both the simulation of sensor values and the injection of errors are possible. Thanks to the CAN interface used, the solution can easily be integrated into existing test systems. Furthermore, it is easy to use at the developer's workstation.<br />

Keywords— Simulation of sensors, Simulation of sensor buses, Environment simulation<br />

I. INTRODUCTION<br />

In the automotive industry the number of electronic control units (ECUs) and the number of functions increase with every vehicle generation. More and more functions need information about the environment of the vehicle, and the number of sensors is therefore increasing. Besides sensors with analog or PWM interfaces, digital sensor interfaces like SPI, SENT or PSI5 are becoming more and more important in the development of automotive ECUs. These interfaces lead to new requirements for development and test.<br />

For development and test it is necessary to simulate the sensor values, and the sensor's behavior in the case of faults, in real time. With a real sensor, not all possible behaviors and tolerances can be covered. Compared to the simulation of analog or PWM signals, the simulation of sensors with a digital interface is much more complex: besides the sensor signal, the sensor provides more and more diagnostic and configuration information. For safety-relevant applications (e.g. according to ISO 26262) it is necessary to test all elements of the sensor communication with a focus on fault detection. The currently available sensor simulation systems do not provide real-time simulation. Over the commonly used USB interface it is only possible to program fixed values or predefined signal sequences. Also, there is no way to synchronise the simulation with the test system.<br />

The introduced simulation platform makes it possible to simulate sensor values in real time, meaning a response time below 1 µs. The system uses a single CAN interface for integration into the test system. It is possible to simulate the real, dynamic behavior of a sensor in a simulated test environment such as a HIL system, or on a PC at the workstation of a development engineer. The presented system from iSyst Intelligente Systeme GmbH allows the simulation of sensors with an SPI, PSI5, LIN or SENT interface [1]. This enables the engineer to carry out effective development and testing of sensors with digital interfaces in safety-relevant areas, too.<br />

The paper shows the possibilities and limits of the sensor<br />

simulation platform. The advantages of the presented solution<br />

are illustrated by results from real test projects.<br />

II. STATE OF THE ART<br />

Until now, sensors that output the sensor value as an analog or PWM signal have been widely used in the automotive sector, but also in other areas of industry. These sensors are typically connected via three wires, which carry the supply voltage (e.g. 5 V), ground and the actual sensor signal. The transmitted information can be encoded as a current or a voltage. It is not possible to exchange further information, such as diagnostic information, with the sensor. An error is detected by checking whether the actual sensor signal lies within defined limits.<br />



III. DIGITAL SENSOR INTERFACES<br />

A. Used sensor interfaces<br />

Currently, sensors with digital interfaces are frequently used. On the one hand, these make it possible to receive several sensor signals from a single sensor; on the other hand, diagnostic and status information can be obtained directly from the sensor. This enables extended fault detection, which in turn is important for safety-critical applications such as airbag control units.<br />

SPI, PSI5 and SENT are examples of digital sensor<br />

interfaces.<br />

SPI (Serial Peripheral Interface) [2] is a bi-directional, synchronous serial protocol that is often used for communication between ICs on boards.<br />

PSI5 (Peripheral Sensor Interface 5) [3] is a two-wire<br />

interface that can be operated synchronously or asynchronously.<br />

A current modulation with Manchester encoding is used for<br />

communication from the sensor (slave) to the control unit<br />

(master). Modulation of the supply voltage is used for<br />

communication from the control unit to the sensor. PSI5 has<br />

been developed for the connection of sensors in the automotive<br />

sector.<br />
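The Manchester coding on the sensor-to-ECU link can be sketched as follows. This is an illustrative sketch only: it assumes the IEEE convention (a '0' bit sent as high-then-low, a '1' bit as low-then-high) and abstracts the two current levels of the PSI5 modulation into logic half-bits.<br />

```c
#include <stdint.h>

/* Manchester coding sketch for the PSI5 sensor->ECU link.
 * Convention assumed here (IEEE): bit 0 -> half-bits {1,0},
 * bit 1 -> half-bits {0,1}.  On the wire these would be the two
 * current levels of the PSI5 modulation. */

/* Encode n bits (LSB first) into 2*n logic half-bits. */
static void manchester_encode(uint16_t bits, int n, uint8_t *halfbits)
{
    for (int i = 0; i < n; i++) {
        uint8_t b = (uint8_t)((bits >> i) & 1u);
        halfbits[2 * i]     = b ? 0u : 1u;
        halfbits[2 * i + 1] = b ? 1u : 0u;
    }
}

/* Decode 2*n half-bits; returns -1 on an invalid symbol (no mid-bit
 * transition), which a receiver would flag as a coding error. */
static int manchester_decode(const uint8_t *halfbits, int n, uint16_t *bits)
{
    uint16_t out = 0;
    for (int i = 0; i < n; i++) {
        uint8_t h0 = halfbits[2 * i];
        uint8_t h1 = halfbits[2 * i + 1];
        if (h0 == h1)
            return -1;
        if (h0 == 0u)        /* {0,1} encodes a '1' bit */
            out |= (uint16_t)(1u << i);
    }
    *bits = out;
    return 0;
}
```

A real receiver additionally has to recover the bit clock from the guaranteed mid-bit transitions, which the sketch leaves out.<br />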

SENT (Single Edge Nibble Transmission - SAE J2716) [4]<br />

is a unidirectional, asynchronous protocol using three wires for<br />

supply voltage, ground and signal. The signal is transmitted as<br />

modulated signal voltage with constant amplitude and different<br />

pulse lengths for each nibble (4 bits). It has been developed for the<br />

connection of sensors in the automotive sector.<br />
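The pulse-length coding described above can be sketched as follows; the 12-tick base length, 56-tick sync pulse and nominal 3 µs tick follow the figures commonly quoted for J2716, and the status and CRC nibbles of a real SENT frame are omitted.<br />

```c
#include <stdint.h>

/* SENT (SAE J2716) pulse-length coding sketch: each 4-bit nibble is
 * sent as a pulse of 12 + value ticks (value 0 -> 12 ticks, value 15
 * -> 27 ticks), preceded by a 56-tick sync pulse.  The 3 us tick is
 * the nominal value; real sensors deviate within a tolerance, which
 * is exactly what a sensor emulator must be able to reproduce. */

#define SENT_TICK_US    3u
#define SENT_SYNC_TICKS 56u
#define SENT_BASE_TICKS 12u

static uint32_t sent_nibble_ticks(uint8_t nibble)
{
    return SENT_BASE_TICKS + (nibble & 0x0Fu);
}

static uint8_t sent_ticks_nibble(uint32_t ticks)
{
    return (uint8_t)((ticks - SENT_BASE_TICKS) & 0x0Fu);
}

/* Frame length in ticks for n data nibbles (sync + data only; the
 * status and CRC nibbles of a real frame are omitted here). */
static uint32_t sent_frame_ticks(const uint8_t *nibbles, int n)
{
    uint32_t t = SENT_SYNC_TICKS;
    for (int i = 0; i < n; i++)
        t += sent_nibble_ticks(nibbles[i]);
    return t;
}
```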

In addition to the actual sensor value or sensor values,<br />

advanced sensors provide a large number of diagnostic<br />

functions. They perform cyclical self-tests (e. g. testing of the<br />

clock source and memory) and report the results to the connected<br />

processor of the system. Furthermore, the sensors can also<br />

monitor the processor by providing watchdog functions. The<br />

processor in turn checks the sensor values (e. g. checking for the<br />

presence of noise) to ensure that the sensor functions correctly.<br />

B. Testing of sensors<br />

During the development and testing of embedded systems<br />

with sensors, the task now is to support and test all functions that<br />

the real sensor provides via its interface. The real sensors can<br />

only be used to implement the behaviour of this special sensor,<br />

which lies somewhere in the permissible tolerance band. On the<br />

one hand, this makes it difficult for software developers to check<br />

their implemented plausibility checks. On the other hand, the test<br />

department cannot enforce a faulty behaviour of the sensor and<br />

therefore cannot check the correctness of the monitoring. In<br />

safety-critical applications, however, the testing of failures and<br />

the safeguarding of monitoring functions are an absolute<br />

necessity.<br />

In order to enable the realization of these development and<br />

test tasks, a platform for the emulation of sensors was developed,<br />

whose function and possibilities are illustrated by the following<br />

example.<br />

Figure 1: Emulation module for sensors with SPI interface<br />

IV. EXAMPLE OF USE<br />

In the following, the challenges and solution proposals for<br />

safeguarding acceleration sensors connected via SPI are<br />

presented. The procedure is shown by way of example on a<br />

control unit, which uses three acceleration sensors to provide<br />

measured values for the Electronic Stability Program (ESP). The<br />

recording and transmission of the sensor values is classified as<br />

ASIL B (Automotive Safety Integrity Level). All error detection<br />

mechanisms of the sensor were tested as well as the transmission<br />

of the sensor data and error states via FlexRay to other ECUs.<br />

Therefore, it was necessary to integrate the sensor emulation into<br />

a hardware-in-the-loop (HIL) test system.<br />

V. INTEGRATION INTO THE TEST SYSTEM<br />

Sensor simulation, or rather emulation, is integrated into the HIL test system via the CAN bus. Only the default values for the sensor data are transmitted cyclically, in a 500 µs pattern, via the CAN bus from the HIL test system to the sensor emulation; the error injection is controlled on demand to keep the load on the CAN bus as low as possible. This makes it possible to define the sensor data from the HIL test system in real time. In the given example, the sensor data is queried by the control unit with a cycle time of 1 ms. The transfer of the sensor values takes place on the SPI interface according to the requests of the ECU. Furthermore, the sensor emulation also generates noise on the sensor signals synchronously with the queries of the ECU, which is necessary for a realistic simulation of the sensor behaviour.<br />
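The noise generation can be sketched as follows. The generator and the ±3 LSB amplitude are assumptions for illustration, not the actual emulator implementation; the point is that one sample is drawn per ECU query, keeping the noise synchronous with the 1 ms request cycle.<br />

```c
#include <stdint.h>

/* Illustrative sketch of the noise generation: on each SPI query from
 * the ECU, the emulator returns the default value from the HIL system
 * plus a small pseudo-random deviation.  The LCG and the +/-3 LSB
 * amplitude are assumptions, not the actual emulator implementation. */

static uint32_t noise_state = 0x12345678u;

static int32_t noise_sample(int32_t amplitude)
{
    noise_state = noise_state * 1664525u + 1013904223u;   /* LCG step */
    /* map the high bits onto the range [-amplitude, +amplitude] */
    return (int32_t)(noise_state >> 16) % (2 * amplitude + 1) - amplitude;
}

/* Called once per ECU query, so the noise stays synchronous with the
 * 1 ms SPI request cycle described above. */
static int16_t emulated_accel(int16_t default_value)
{
    return (int16_t)(default_value + noise_sample(3));
}
```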



Figure 2: Test system integration<br />

The schematic structure of the test system can be seen in Figure 2. Python and the test automation system iTestStudio are used to implement and execute the tests on the control PC. This is connected to the dSPACE real-time computer in the HIL test system via a proprietary connection. The control PC can also read the FlexRay communication (rest bus simulation) between the HIL test system and the control unit and thus also check the sensor values on the FlexRay. Furthermore, the control PC can use the Universal Measurement and Calibration Protocol (XCP) to read and write variables within the ECU software. With this test system design, both white-box and black-box tests for the sensors are possible.<br />

VI. EXECUTION OF THE TESTS<br />

In the following, two tests are described in more detail to illustrate the possibilities of sensor emulation and the HIL test system.<br />

The plausibility of the sensor data is checked in the first test case. The plausibility check is carried out in the real application by a second ECU (ESP), which also has acceleration sensors and makes these values available on the FlexRay. In the example, the second ECU is simulated by the HIL test system. It is now possible to provide sensor values on the FlexRay and, by means of sensor emulation, via SPI in real time to the ECU to be tested. The result of the plausibility check must also be measured as a value on the FlexRay. Different plausible and non-plausible sensor values can be specified and the evaluation by the ECU can be checked.<br />

A second test case is the falsification of the cyclic redundancy check (CRC) checksum within the responses on the SPI interface. The error injection can be performed for individual responses or for all of them. In the case of a faulty CRC, the control unit must mark the sensor values on the FlexRay bus as invalid, which can easily be checked by the control PC.<br />

Furthermore, the sensor emulation makes it possible to change all values provided by the sensor (e.g. status bits, chip ID and clock counter) via CAN. In addition, faulty lengths for the responses, a sensor failure and a missing jitter of the sensor values can be set.<br />

Errors in the requests to the sensor emulation are detected and reported cyclically every 500 ms to the HIL test system via CAN.<br />
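The CRC test case can be sketched as follows. The CRC-8 polynomial 0x1D with init value 0xFF and the frame layout are assumptions for illustration, not the checksum definition of the actual sensor.<br />

```c
#include <stddef.h>
#include <stdint.h>

/* Sketch of the CRC falsification test case: compute a CRC-8 over an
 * SPI response, then inject an error so the ECU's check must fail.
 * Polynomial 0x1D with init 0xFF is a common automotive choice but is
 * an assumption here, not the sensor's actual checksum definition. */

static uint8_t crc8(const uint8_t *data, size_t len)
{
    uint8_t crc = 0xFF;
    for (size_t i = 0; i < len; i++) {
        crc ^= data[i];
        for (int b = 0; b < 8; b++)
            crc = (crc & 0x80u) ? (uint8_t)((crc << 1) ^ 0x1D)
                                : (uint8_t)(crc << 1);
    }
    return crc;
}

/* Response frame: payload followed by one CRC byte. */
static void response_seal(uint8_t *frame, size_t payload_len)
{
    frame[payload_len] = crc8(frame, payload_len);
}

static int response_ok(const uint8_t *frame, size_t payload_len)
{
    return crc8(frame, payload_len) == frame[payload_len];
}

/* Error injection: flip one payload bit after sealing. */
static void inject_crc_fault(uint8_t *frame)
{
    frame[0] ^= 0x01u;
}
```

Because a CRC is linear, flipping a single payload bit after sealing always changes the expected checksum, so the control unit's check must fail for the injected frame.<br />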

VII. CONCLUSION<br />

For the development and test of embedded systems with advanced sensors, real-time sensor emulation is essential. Only with the help of an emulation is it possible to implement and test all functions, including the diagnostic functions, and to safeguard all plausibility checks.<br />

REFERENCES<br />

[1] iSyst GmbH, “Test Components” [Online]. Available: https://www.isyst.de/en/products/test-components/<br />

[2] Wikipedia, “Serial Peripheral Interface” [Online]. Available: http://de.wikipedia.org/wiki/Serial_Peripheral_Interface<br />

[3] Robert Bosch GmbH, “PSI5 Peripheral Sensor Interface 5.” [Online]. Available: http://psi5.org/<br />

[4] SAE International, “J 2716 - SENT - Single Edge Nibble Transmission for Automotive Applications,” 27.01.2010. [Online]. Available: http://standards.sae.org/j2716_201001/<br />



Effective Power Interruption Testing<br />

How Best to Fail<br />

Thom Denholm<br />

Technical Product Manager<br />

Datalight, Inc.<br />

Bothell, WA<br />

Thom.Denholm@datalight.com<br />

Abstract—From dropped batteries to system failures,<br />

embedded designs need solid power interruption testing.<br />

Reliability demands for embedded products have increased as the<br />

desired lifetime of high reliability products has grown. To achieve<br />

the most comprehensive reliability test in the least time, stress<br />

testing must utilize I/O at the point of power interruption.<br />

This session will survey the failure points of file systems and<br />

flash media, with a discussion of the most effective strategies for<br />

ensuring that test design accounts for the variety of real world<br />

failures that can occur. Validation of data and hardware<br />

requirements will also be discussed.<br />

Keywords—reliability; NAND; flash media; power interruption;<br />

file system; O_DIRECT<br />

I. INTRODUCTION<br />

When your team is responsible for validating the reliability<br />

of a design for the embedded marketplace, you need to do more<br />

than just tick a box or read a marketing document. Real testing<br />

of reliability involves getting into the guts of the embedded<br />

design, and covers everything between the application and the<br />

hardware. This whitepaper is focused on testing the file system<br />

and its interactions. Catching a vulnerability in design or<br />

testing is far more efficient (and far less expensive) than<br />

handling problems in the field.<br />

II. DEFINITION OF RELIABILITY<br />

Let’s start by defining just what reliability is – whatever a<br />

given customer thinks it means.<br />

For some customers, reliability is just being able to turn on,<br />

start up, work properly. This is akin to a pre-electronic<br />

appliance, ready when you needed it. Other customers expect<br />

their favorite settings or programmed routes to be available.<br />

They want changes they have made to be reflected in their next<br />

use of the device. Still other customers have an even higher<br />

definition of reliability. They want to start up with settings that<br />

were just changed before the device was shut off. Medical<br />

designs are one class of device where this level of reliability is<br />

mandated – it must be known exactly what the device was<br />

programmed for, even if the power is removed immediately<br />

thereafter.<br />

It is safe to say that even though the needs of the individual<br />

customer vary, the bar must be set quite high. Likely this is<br />

even higher than the average developer expects.<br />

A. System reliability vs Data on the media<br />

Examining the previous cases from the computer’s point of<br />

view, the first is related directly to the system files. Sudden<br />

power loss could corrupt other files, as long as it doesn’t affect<br />

the system files. One trick used by developers is to put the<br />

system files on a separate partition (or even separate media) to<br />

achieve this goal. Files can be marked read only, though that<br />

only keeps the file system from touching them. Media failures<br />

can overwhelm both attributes and partitions to corrupt files,<br />

and this is a particular problem with NAND flash media. [1]<br />

To save the end user’s settings, those changes have to make<br />

it to the media. On embedded Linux, for instance, those<br />

changes must make it through the write-back cache and any<br />

buffers. The data also needs to clear any hardware block cache.<br />

In time, all the data will eventually be flushed – if the user<br />

waits long enough, their settings will be saved.<br />

In addition to user data, the file system metadata also must<br />

be in the appropriate place. For standard Linux file systems, a<br />

journal is used to commit data immediately, and these<br />

journaled metadata writes also make it to the media in time.<br />

For most file systems, Linux flush and fsync commands are<br />

used to achieve complete control of the file system, which is<br />

the only way to ensure data is committed to the media. [2]<br />

That control over the file system and block device are the<br />

key to the final use case, where the data must be committed<br />

immediately – waiting for the block cache to time out is not an<br />

option.<br />

The next step is to refine the granularity even more – what<br />

happens when the power is interrupted?<br />

B. Write interrupted mid-block<br />

Writes can be interrupted at any point, especially when<br />

power is lost unexpectedly. This results in two options – the<br />

system could be in the middle of writing a block or in between<br />

writing blocks.<br />

On older magnetic media, an interruption in the middle of<br />

writing a block would leave a partial write. The block being<br />

written is corrupted – it likely contains fragments of new data<br />



and old data. If this was the only copy of that data, it (and the<br />

entire file that contained that block) are now useless. Keeping a<br />

copy can be done via techniques such as writing to another file<br />

and then renaming, or using a copy-on-write file system.<br />

Testing for this situation should be part of any comprehensive<br />

device test.<br />
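The write-to-another-file-and-rename technique mentioned above can be sketched as follows, assuming a POSIX system where rename(2) atomically replaces the target; the helper name and the .tmp suffix are illustrative.<br />

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Sketch of the write-then-rename technique: the new contents go to a
 * temporary file, are forced to the media with fsync(), and only then
 * replace the original via rename(2), which is atomic on POSIX file
 * systems.  A power interruption at any point leaves either the
 * complete old file or the complete new one. */
static int save_atomic(const char *path, const void *buf, size_t len)
{
    char tmp[4096];

    if (snprintf(tmp, sizeof tmp, "%s.tmp", path) >= (int)sizeof tmp)
        return -1;

    int fd = open(tmp, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    if (write(fd, buf, len) != (ssize_t)len || fsync(fd) != 0) {
        close(fd);
        unlink(tmp);
        return -1;
    }
    if (close(fd) != 0) {
        unlink(tmp);
        return -1;
    }
    return rename(tmp, path);   /* atomically replaces the old copy */
}
```

For full durability the containing directory would also need an fsync() after the rename, which the sketch omits for brevity.<br />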

New NAND based media, from SSDs to eMMC to SD and<br />

Compact Flash, has an even larger problem with an interrupted<br />

write. It is so large that vendors specifically recommend<br />

maintaining power during block writes just to avoid it. Going<br />

into further detail on this type of failure is outside the scope of<br />

this paper. Assuming that media power is maintained with a<br />

capacitor or other technique, an interrupted write becomes<br />

instead write interruption between blocks.<br />

C. Write interrupted between blocks<br />

From the user perspective, each write consists of one or<br />

more blocks of data, along with any metadata that must be<br />

written. If a given write is a small amount of data, up to one<br />

block, with no metadata changes, then power interruption<br />

won’t be a problem. For larger writes, an interruption between<br />

write blocks will interrupt what is known as an atomic<br />

operation.<br />

File systems such as FAT don’t do anything special for an<br />

atomic operation, so interrupting one of those is not much<br />

better than a mid-block interruption on magnetic media. The<br />

file being written to has some updated blocks and some not yet<br />

updated. For most user data, this is enough to render the file<br />

useless. Earlier techniques like writing to a separate file or<br />

using a file system that works with atomic writes will at least<br />

keep the older file intact.<br />

Testing to make sure longer writes are committed the way a<br />

system designer expects is another requirement for<br />

comprehensive testing.<br />

III. VALIDATE THE WRITTEN DATA<br />

At this point, the developer has an expectation of how data<br />

will be written and/or recovered as part of a power interruption.<br />

This is all assuming the writes happen in the order they were<br />

initiated, which is not always the case.<br />

Another important consideration is whether data has been overwritten. If a file data write is not a completely atomic operation, then a multiple-block data write may be only partially completed. This would leave it to the application to understand just what one of those partial writes means. There are file systems which provide the granularity for a completely atomic write; is that what you need for your customers?<br />

Metadata should also be part of that atomic write. If it isn't, one of two situations can occur. If the data is written but the metadata has not been, the data would be lost as the system recovered from power interruption. A potentially worse case is if the metadata is written but the data has not yet been: a system recovering in this situation would then try to open the updated file and find either garbage or, worse, other data from the media – a potential security risk.<br />

IV. ALL I NEED IS O_DIRECT<br />

On Linux file systems, there is a particularly persistent myth that opening a file with O_DIRECT is all that is required for data to be reliably stored on the media; the fsync() call would not be necessary in this case. To assess the validity of this, our team measured the performance of sequential writes using fsync(), O_DIRECT, and neither.<br />

Performance was high – faster than the physical speed of 225 MB/s – for tests with neither protection, and for tests with only O_DIRECT. When fsync() was factored in, performance dropped to a reasonable number below the physical maximum of the device.<br />

We did find that when the amount of data written is larger than any cache in the physical media, the write performance is roughly the same with or without fsync(). For reliability purposes, when the data absolutely has to be on the media, either write large files or use fsync() – O_DIRECT is not enough.<br />
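A minimal sketch of the durable write path discussed above: regardless of the flags the file was opened with, it is the explicit fsync() that forces the data out of the write-back cache and onto the media before the function returns. The helper name is illustrative.<br />

```c
#include <fcntl.h>
#include <stdio.h>
#include <unistd.h>

/* Durable write sketch: write() may land in the page cache and any
 * device cache; only fsync() guarantees the data (and the file's
 * metadata) has reached the media before we return success. */
static int write_durable(const char *path, const void *buf, size_t len)
{
    int fd = open(path, O_WRONLY | O_CREAT | O_TRUNC, 0644);
    if (fd < 0)
        return -1;

    ssize_t written = write(fd, buf, len);              /* may be cached */
    int rc = (written == (ssize_t)len) ? fsync(fd) : -1; /* flush to media */

    if (close(fd) != 0)
        rc = -1;
    return rc;
}
```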

V. STRESS!<br />

That covers the basic reliability testing for normal system<br />

use. The next thing to focus on is system stress – unusual cases<br />

that will likely occur in the field.<br />



The first of these, which most developers don't examine, is what happens when the disk is full. Besides the potential performance implications, a disk-full situation can generate more writes as garbage is collected. Related to that are extreme situations in the number of files – does the system latency increase noticeably beyond the first 100 or so files? Both of these situations lead to a larger write error window – and a larger potential for failure.<br />

Another stressor is a system update. This is a situation that<br />

is especially important to test thoroughly, especially the results<br />

of a potential power interruption. Atomic writes here can be a<br />

major factor which allows the device to recover from a failed<br />

update.<br />

Extreme use cases also hit the media especially hard. In the<br />

case of NAND flash media, thousands of reads from a given<br />

file can cause read disturb, adding bit errors to the media. If these bit errors are not dealt with (by a process called scrubbing), the correctable errors can grow into uncorrectable errors.<br />
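A scrub decision can be sketched as a simple threshold on the corrected bit errors reported by the ECC; the 75% threshold is an assumption for illustration.<br />

```c
/* Scrub decision sketch for NAND read disturb: once the corrected bit
 * errors in a block approach the ECC capability, rewrite (scrub) the
 * block before the errors become uncorrectable.  The 75% threshold is
 * an assumption for illustration, not a vendor recommendation. */
static int needs_scrub(unsigned corrected_bits, unsigned ecc_capability)
{
    return 4u * corrected_bits >= 3u * ecc_capability;
}
```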

Other potential media failures include some of these items:<br />

• Specific write patterns – on NAND media without a<br />

randomization filter, writing all zeroes (0x00) is<br />

actually worse for the flash than writing other patterns.<br />

• Hot Spots – media locations that are prone to failure for<br />

reasons unknown to the developer (and possibly to the<br />

vendor)<br />

VI. DISCARDS (THE TRIM STATEMENT)<br />

Another storage related item is discards – using the trim<br />

command to inform the media that data is no longer in use. On<br />

devices where discards are not used (or under-utilized), latency<br />

can increase noticeably once all the blocks on the media have<br />

been written once. [3] This increased latency causes a noticeable<br />

drop in performance, and of course when writes take longer,<br />

the potential error window grows.<br />

For that matter, the firmware on most NAND based<br />

solutions is a black box. What is happening when the media is<br />

busy discarding data or wear leveling or garbage collecting?<br />

What happens when the power is interrupted during those<br />

operations? Using the file system to generate these sorts of<br />

failures can be very hit or miss, and a custom media test should<br />

be developed to validate its operation during a variety of power<br />

interruptions.<br />

VII. TACTICS FOR EFFECTIVE TESTING<br />

We have examined a number of methods that can be used<br />

to generate more effective power interruption testing, so now<br />

we must put them all together.<br />

First of all, interrupting an embedded device while<br />

quiescent or while reading will result in nothing being lost,<br />

unless it is updating something in the background. This means<br />

power interruption testing needs to trigger those background<br />

writes where possible and, most importantly, focus on the<br />

writes.<br />

To validate this assumption, we modified the standard<br />

power interruption tests regularly performed by Datalight.<br />

These tests are on Linux, and can utilize any file system. For<br />

this test, we chose VFAT. This stochastic test performs random<br />

operations, reading and writing, creating and removing folders,<br />

etc. We found that with default values, a power interruption<br />

would cause a chkdsk failure 5.3% of the time. When we<br />

halved the chances of write operations occurring, these failures<br />

dropped to 1.4% of tests.<br />
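The weighting used in the stochastic test can be sketched as follows; the operation set, the write percentage and the deterministic generator are illustrative, not Datalight's actual test harness.<br />

```c
#include <stdint.h>

/* Sketch of a stochastic test driver: operations are picked with a
 * configurable write weight, so halving the write percentage halves
 * the share of operations exposed to a power interruption mid-write.
 * The operation set and the deterministic LCG are illustrative. */

typedef enum { OP_WRITE, OP_READ, OP_MKDIR, OP_REMOVE } fs_op;

static uint32_t rng = 1u;

static uint32_t next_pct(void)           /* deterministic LCG, 0..99 */
{
    rng = rng * 1664525u + 1013904223u;
    return (rng >> 16) % 100u;
}

static fs_op pick_op(unsigned write_pct)
{
    uint32_t r = next_pct();
    if (r < write_pct)
        return OP_WRITE;
    /* distribute the remainder across the other operations */
    switch (r % 3u) {
    case 0:  return OP_READ;
    case 1:  return OP_MKDIR;
    default: return OP_REMOVE;
    }
}
```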

These random failures during writes are likely to exercise<br />

the safety routines of the file system and application. The next<br />

step is validating the data written, not just the structure. Make<br />

sure that what is most important to your customer is being<br />

confirmed here – data ordering, overwrites, and completely<br />

atomic operations.<br />

Data order is most important to databases, and has some<br />

importance to journaled file systems. If data is overwritten in<br />

place, then an entire operation must be atomic to prevent a<br />

corrupt state – half old data, half new data, all useless.<br />
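One common way to make such torn writes detectable can be sketched as follows. This is an illustrative record layout, not any vendor's actual implementation: the field sizes, the CRC-32 choice, and the `Record` structure are assumptions. Each record carries a sequence number and a checksum computed last, so an interrupted overwrite fails validation instead of silently surfacing as half old, half new data:

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical on-media record layout: payload guarded by a sequence
// number and a checksum written last. A power cut mid-write leaves a
// record whose checksum does not match -- detectably torn rather than
// silently half old / half new.
struct Record {
    uint32_t sequence;     // monotonically increasing write counter
    uint8_t  payload[56];
    uint32_t checksum;     // covers sequence + payload
};

// Simple CRC-32 (reflected, polynomial 0xEDB88320) over a byte buffer.
inline uint32_t crc32(const uint8_t* data, std::size_t len) {
    uint32_t crc = 0xFFFFFFFFu;
    for (std::size_t i = 0; i < len; ++i) {
        crc ^= data[i];
        for (int b = 0; b < 8; ++b)
            crc = (crc >> 1) ^ (0xEDB88320u & (0u - (crc & 1u)));
    }
    return ~crc;
}

// Compute and store the checksum over everything that precedes it.
inline void seal(Record& r) {
    r.checksum = crc32(reinterpret_cast<const uint8_t*>(&r),
                       offsetof(Record, checksum));
}

// After an interrupted write, decide whether the record is intact.
inline bool is_intact(const Record& r) {
    return r.checksum == crc32(reinterpret_cast<const uint8_t*>(&r),
                               offsetof(Record, checksum));
}
```

A post-interruption validation pass then treats any record failing `is_intact` as torn, which is exactly the evidence a custom media test needs to collect.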

While the user data is important, the system files can be<br />

even more important. If corruption of other files affects system<br />

files, the entire device can be rendered unusable. The same<br />

could happen if power interrupts an unprotected system update.<br />

Most file systems provide a utility based on chkdsk to detect<br />

these sorts of failures, though they can’t usually correct them<br />

very well.<br />

The original MS-DOS chkdsk found data that was on the<br />
media but not represented in the file allocation table (FAT), and<br />
also found a number of errors within the FAT. It was not able to<br />
connect that data to file names, so lost chains and<br />
other errors resulted in nearly useless files such as<br />
FILE0001.CHK – taking up space but not useful for most<br />
applications.<br />

VIII. SUMMARY<br />

The best testing meets both the needs of the strictest users<br />
and the goal of stressing the system effectively.<br />

Interruptions during writes will demonstrate the most failures<br />

and accurately reflect field stress. Other factors to consider<br />

include validation of data, atomic operations and cases that also<br />

stress the media. Real reliability testing is more than just a<br />
requirement; it also leads to the long-term success of the embedded<br />
design.<br />

REFERENCES<br />

[1] P. Slocum, "Are read only partitions safe from corruption if there's also<br />
a read/write partition on the same sd card?",<br />
https://raspberrypi.stackexchange.com/questions/67035/<br />
[2] T. Denholm, "Reliably committing data in Linux", first presented at<br />
Embedded World 2017,<br />
https://www.datalight.com/resources/whitepapers/reliably-committing-data-in-linux<br />
[3] T. Denholm, "Performance drop without discards", May 31, 2017,<br />
https://www.datalight.com/blog/2017/05/31<br />



Using Google Test for Safety-critical Software<br />

Development<br />

Miroslaw Zielinski<br />

Principal Software Engineer<br />

Parasoft<br />

Krakow, Poland<br />

miroslaw.zielinski@parasoft.com<br />

Abstract— This paper explores the essential elements of the<br />

unit testing environment for safety-critical projects. It describes<br />

how to augment open source unit testing frameworks to<br />

successfully certify the software.<br />

Keywords—unit testing; Google Test; safety-critical software;<br />

software certification<br />

I. INTRODUCTION<br />

The volume of safety-sensitive software has grown<br />

significantly, along with the continuously increasing number of<br />

connected devices and recent advancements in AI, notably AI’s<br />

applications to autonomous driving. As a result, it is becoming<br />

much more difficult to work around software certification. This<br />

is because the project modules subject to the rigorous criteria<br />

of safety standards, such as IEC 61508, ISO 26262, and DO-<br />
178B/C, are becoming a much bigger part of the codebases. To<br />

this end, software must be rigorously tested.<br />

Safety standards mandate numerous testing practices and<br />

processes on software development. There is a cost associated<br />

with all of them. Some software quality practices, such as unit<br />

testing, require significant investment in tools and impact<br />

development schedules. In addition to the initial investment in<br />

tools and process implementation, there is an overhead related<br />

to the creation and maintenance of test cases proportional to the<br />

amount of created code. A thorough implementation of unit<br />

testing, including all the reporting required to get certification<br />

credit, causes it to be one of the most expensive techniques for<br />

assuring software quality. This is especially true for the C and C++<br />
languages. The cost associated with implementing this<br />
technology, as well as qualifying the tool chain itself, makes<br />
the difficult process of selecting a unit testing solution very<br />
important because it affects the development process in many<br />
ways.<br />

In this paper, we explore the essential components of a<br />

complete unit testing solution required for developing<br />

safety-critical software. We also discuss the feasibility of<br />

building the unit testing solution based on software that is free<br />

for commercial use. The discussion includes commercial<br />
tools for functionalities that lack sufficiently capable<br />
counterparts in the open source world. The main intention of<br />

this paper is to analyze practices and methods recommended by<br />

the safety standards at a high-level and link them to specific<br />

features of the unit testing solution. This will help build an<br />

understanding of where it is reasonable to rely on the open<br />

software. The discussion assumes that C and C++ languages<br />

are the most popular for safety-critical software development<br />

and uses Google Test as an example of a free unit testing<br />

framework. Most conclusions, however, apply equally well to<br />

any of the frameworks available in the open source ecosystem.<br />

II. WHY CONSIDER OPEN SOURCE UNIT TESTING SOLUTIONS?<br />

Teams producing safety-critical systems traditionally select<br />

commercial unit testing solutions. This is mainly because open<br />

frameworks do not provide sufficient functionality to assure<br />

successful certification of the software, especially for the most<br />

stringent levels of safety. Commercial unit testing solutions<br />

usually come with modules for creating unit tests,<br />

stubbing/mocking, calculating various code coverage metrics,<br />

and, of course, colorful reporting. All-in-one solutions that<br />

completely satisfy the requirements imposed by safety<br />

standards, such as ISO 26262 or DO-178C, certainly have many<br />

benefits. A significant benefit is tool certification (in terms of<br />

IEC 61508 and related standards) and qualification kits for<br />

other standards, which greatly reduce tool qualification<br />

workloads. Not only do we have a good answer for all<br />

requirements from safety standards, but we can also consult our<br />

vendor’s support team to address any concerns.<br />

What are the benefits of an open source unit testing<br />

framework when a commercial solution can meet the<br />

requirements put forth by a standard? If we assume for the<br />

moment that a hypothetical open source unit testing tool meets<br />

all safety standards requirements, the following benefits can be<br />

achieved:<br />

• Relying on a popular free framework increases our<br />

ability to find software engineers already familiar with<br />

the tool.<br />

• Developers are more willing to learn and use a popular<br />

and readily available solution.<br />

• There are usually many open libraries and modules with<br />

existing sets of unit tests that may potentially be<br />

integrated into our projects. The same is valid for the<br />



code developed in-house. Modules developed without<br />

a safety standard in mind and covered with open source<br />

test cases sometimes fall unexpectedly into the scope of<br />

software certification.<br />

• Test cases created in an open source format protect our<br />

investment in unit testing and free us from the<br />

constraints of a vendor’s commercial solution. If a<br />

company decides to give up the commercial tool or<br />

switch vendors, all test cases created with the abandoned<br />

solution may need to be rewritten or imported.<br />

• It is easier to function in long supply chains (as is<br />
common with automotive software) when executing the test<br />
cases that validate the supplied source code does not<br />
require commercial tools.<br />

These benefits represent important decision factors and can<br />

have strategic importance in many cases.<br />

There are also disadvantages. The most severe is that by<br />

selecting an open source unit testing framework, we cover only<br />

a fraction of the requirements that are typically imposed by the<br />

safety standard. We will return to this point later, but a quick<br />

example is the support for structural code coverage. Open<br />

source solutions provide reasonable support for only the<br />

simplest metrics. If our project requires a more advanced<br />

metric, such as MC/DC (modified condition/decision<br />

coverage), we will need to augment the selected framework<br />

with a commercial solution to provide code coverage statistics.<br />

There are many more functionalities in the broader<br />
area of unit testing that may require a commercial plugin to<br />

provide sufficient functionality.<br />

Despite the disadvantages, the fact remains that creating<br />

unit tests is a very expensive process. When we decide to fully<br />

base our unit testing process on a commercial solution, we will<br />

need to implement the unit test cases in the format supported<br />

by the commercial tool, which ties us to the vendor for a long<br />

time. Attempts to reuse code created along with the tests will<br />

require using the same tooling. If our toolbox includes an open<br />

source framework augmented with dedicated tools to<br />

implement specific sub-functions, reusing the assets we create<br />

is much easier. We can hand the created code and test cases over to<br />

our contractor, enabling them to use their own coverage tools<br />

and verify that code quality is as expected. Relying on the<br />

freely available formats for unit test creation is a reasonable<br />
choice in the absence of a standardized format for describing<br />
unit tests.<br />

III. WHAT DO WE NEED FROM A UNIT TESTING SOLUTION?<br />

The essential features of a solution depend on the safety<br />

standard and our project’s risk classification level (SIL, ASIL,<br />

or DAL). The discussion includes not only the core features of<br />

the unit testing framework but also the accompanying<br />

functionalities, such as stubbing/mocking tools, traceability<br />

frameworks (which are obligatory to assure completeness of<br />

testing), and the ability to produce the required reports.<br />

For the sake of simplicity, let's analyze the requirements from<br />
two popular industry standards: ISO 26262 (automotive) and<br />
DO-178C (aviation). Table 1 lists a selection of important<br />
unit testing methodologies required to meet the<br />
objectives of these standards. The selection focuses on the most<br />
important practices only and presents a generalized view.<br />

In contrast to ISO 26262, DO-178C does not explicitly<br />

require unit testing. The standard does, however, impose<br />

requirements that are often difficult to meet without<br />

implementing a unit testing process. As a result, many<br />

organizations effectively assume that unit testing is a de-facto<br />

requirement for DO-178C compliance. Looking at the<br />

objectives from safety standards, the solution for unit testing<br />

would ideally contain:<br />

• Unit testing framework (assertions, test suites,<br />

execution automation)<br />

• Code coverage tool<br />

• Stubbing/mocking framework<br />

• Integration with hardware processor or simulator<br />

• Reporting<br />

• Validation test cases to qualify the entire solution<br />

• Tool certification and/or tool qualification kit<br />

Each of these modules plays a specific role in the<br />

development process and is expected to generate artifacts that<br />

support the certification claim. The following sections discuss<br />

each module, highlight crucial features, and assess the<br />

feasibility of using freely available modules.<br />

A. Unit Testing Framework<br />

Safety standards do not list any specific features expected<br />

from the unit testing framework itself. There are, however,<br />

some requirements stemming from the framework<br />

implementation process and safety-oriented structures in the<br />

organization responsible for controlling and documenting the<br />

verification and validation processes. One such requirement<br />

relates to reports generated by the framework. It is a common<br />

practice that test results are reviewed by a separate, dedicated<br />

team in the organization. For this purpose, unit test execution<br />

should be well documented. Generated reports shall contain a<br />

section that details the following:<br />

• The function or method tested<br />

• The initial values for the parameters<br />

• The configuration and expectations related to test<br />

doubles<br />

• Results of all assertions, including assertions that were<br />

positively verified<br />

• Correlation with the requirement validated by the given<br />

test<br />
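The "results of all assertions, including assertions that were positively verified" requirement is worth a concrete sketch, since most frameworks report only failures by default. The helper below is a hypothetical, framework-agnostic illustration (the `CheckLog` name and `REQ-…` IDs are invented): every check is recorded, pass or fail, tagged with the requirement it validates, giving reviewers a self-contained execution record.

```cpp
#include <sstream>
#include <string>
#include <vector>

// Hypothetical logging assertion helper: every check is recorded,
// whether it passed or failed, together with the requirement it
// validates -- producing the review-friendly record described above.
struct CheckLog {
    std::vector<std::string> lines;

    // Record one assertion: requirement ID, a human-readable
    // description, and the verdict. Returns the verdict so it can be
    // chained into the framework's own assertion macros.
    bool check(const std::string& requirement,
               const std::string& what,
               bool condition) {
        std::ostringstream os;
        os << (condition ? "PASS" : "FAIL")
           << " [" << requirement << "] " << what;
        lines.push_back(os.str());
        return condition;
    }
};

// Example use inside a test body (REQ-042 is a placeholder ID):
//   CheckLog log;
//   log.check("REQ-042", "speed limited to 100", limiter(140) == 100);
```

In Google Test, the same idea can be wired in via custom assertion macros or per-test metadata, so the generated report carries both verdicts and requirement correlation without reviewers opening the test source.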

This level of detail may seem like overkill, but in<br />
reality it allows reviewers to confirm the correctness of the test case and its<br />
execution result without looking at the body of the test case. It<br />
simplifies the work by facilitating the “independent review”<br />
process. All required information is in one document, and<br />
there is no need to reach into the code base for additional<br />



TABLE 1: REQUIRED UNIT TESTING METHODOLOGIES FOR DO-178C AND ISO 26262<br />
Methodology | DO-178C | ISO 26262<br />
Unit testing (methodology) | 6.4.c, 6.4.d | Part 6, clause 9<br />
Requirements-based testing / traceability | 6.5.a, 6.5.b, 6.5.c | Part 6, 9.4.2<br />
Statement coverage | 6.4.4.c (Levels A, B, C) | Part 6, 9.4.4 (ASIL A, ASIL B)<br />
Branch coverage | 6.4.4.c (Levels A, B) | Part 6, 9.4.4 (ASIL B, ASIL C, ASIL D)<br />
MC/DC coverage | 6.4.4.c (Level A) | Part 6, 9.4.4 (ASIL D)<br />
Fault injection / robustness test cases | 6.4.2.2 | Part 6, 9.4.2<br />
Test environment representative of production env. | 6.4.1 | Part 6, 9.4.5<br />
Software tool qualification | 12.2 | Part 8, 11<br />

information. An example of such a report from a<br />
commercial unit testing framework is presented in Fig. 1.<br />

Free unit testing frameworks do not usually provide<br />

sufficiently detailed reports out of the box. Frameworks with<br />

an open architecture, such as Google Test, can usually integrate<br />

multiple plugins that contribute to the execution, which makes<br />

the extension relatively simple. When extending the open<br />

source framework, users should consider:<br />

• Test case association with the requirements<br />

• Extending assertions to generate messages for positive<br />

and negative assertions verification<br />

• Dedicated macros for outputting additional meta-data<br />

about the test case<br />

Safety standards also suggest that the test environment<br />

should be as close as possible to the production environment.<br />

As a result, executing unit test cases on the target processor or<br />

at least on the processor simulator is desirable. Teams often try<br />

to limit this type of testing because it is more time consuming<br />

and difficult to automate than testing on the host platform. But<br />

even if source code can easily be tested on the host computer, a<br />

periodic verification with the target processor is typically<br />

conducted to avoid difficult equivalence arguments and to prove<br />

that differences between the target and the host processor do<br />

not hide potential errors.<br />

For example, DO-178C states that “The difference between<br />

the target computer and the emulator or simulator, and the<br />

effects of these differences on the ability to detect errors and<br />

verify functionality should be considered.” [1]. Preparing the<br />

data that supports the claim that host and on target testing are<br />

equivalent is not easy, especially for the more stringent levels<br />

of safety. In most cases, it is easier to adapt the unit testing<br />

framework for on-target test execution. Commercial unit<br />

testing frameworks include dedicated support for a large<br />

collection of cross-development environments. Support usually<br />

includes the integration with debuggers, allowing seamless<br />

communication for uploading test binaries and downloading<br />

results. Open source frameworks require some modification to<br />

adapt for on-target execution:<br />

• Cross compilation of the framework with the target<br />

compiler<br />

• Implementation of a plug-in that outputs results from<br />

the target to the host machine<br />

• Scripts to automate the interaction with the target to<br />

upload the test binary, start execution, and download the<br />

results<br />

• Conformance to the hardware resource limitations of the<br />

target, such as processing speed and available memory<br />

Although these gaps seem challenging at first, they are not<br />

difficult to bridge. Google Test requires a<br />
“C++98-standard-compliant compiler,” which is a reasonable requirement. A<br />

bigger challenge would be if C-only compilation were<br />

required. In this case, we need to look for a C-based unit<br />

testing framework, such as CUnit or cmocka. Implementation<br />

of the communication layer to transport testing results<br />

commonly requires providing a function that can transport a<br />

buffer from the target to the host. The remaining part of report<br />

building happens in the upper layer of the framework. Finally,<br />

automation of test execution can typically be achieved<br />
with the interface provided by the debugger. Most<br />
cross-development environments support some simple scripting to<br />

automate debugging activities, which is more than enough to<br />

automate unit tests execution. It is important, however, that an<br />

open source testing framework can also work within the<br />

hardware resource constraints of an embedded target.<br />

Frameworks with heavy memory and processing requirements<br />

might be impractical for embedded devices.<br />
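The communication layer mentioned above usually reduces to a single function that moves a buffer from target to host. A minimal sketch of that seam, assuming a pluggable transport hook (the `uart0_write` name is an invented placeholder for whatever a given board support package provides):

```cpp
#include <cstdio>
#include <cstring>

// Hypothetical transport hook: on the host it defaults to stdout; on
// target it would be re-pointed at a board-specific routine such as a
// UART or semihosting write. Names here are placeholders.
using TransportFn = int (*)(const char* buf, unsigned len);

static int stdout_transport(const char* buf, unsigned len) {
    return static_cast<int>(std::fwrite(buf, 1, len, stdout));
}

static TransportFn g_transport = &stdout_transport;

// Upper layers of the test framework call this to ship a chunk of the
// report; report formatting itself stays unchanged above this seam.
int report_write(const char* buf) {
    return g_transport(buf, static_cast<unsigned>(std::strlen(buf)));
}

// On target, before running tests (board routine is an assumption):
//   extern int uart0_write(const char*, unsigned);
//   g_transport = &uart0_write;
```

Everything above this function (assertion evaluation, report building) is unaware of the transport, which is why the adaptation tends to be localized and manageable.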

B. Code Coverage<br />

Once we have settled on a unit testing framework, the<br />
next step is to select a code coverage tool. Coverage<br />

metrics are consistently required by all safety standards. The<br />

implied objective is to identify code that was not exercised by<br />

requirements-based testing and refine the tests, requirements,<br />

or both. The type of required metrics depends on the risk level<br />

associated with the system (see Table 1). The process of<br />

refining the tests and requirements with the help of a coverage<br />

report is time consuming. Thus, coverage reports should be<br />

easy to analyze and well-integrated with other components of<br />

the unit testing solution to minimize the amount of manual<br />

work. The coverage tool should at least include the following<br />

features:<br />

• Support for all required coverage metrics<br />

• Ability to present coverage results generated per<br />
execution of a specific test case<br />
• Ability to present coverage results in the context of a<br />
specific selected requirement (traceability)<br />



Fig. 1: Example report from a commercial unit testing solution.<br />

• Ability to merge coverage results from different testing<br />

sessions and different working stations<br />

• Ability to merge coverage results from different types<br />

of testing, such as unit testing, integration testing, and<br />

system level testing<br />

• Ability to collect results from host, target, and simulator<br />

• Ability to annotate reports with additional information,<br />

such as date, tester name, session ID, tool identification<br />

(i.e., compiler and linker hashes)<br />

• Ability to collect coverage results on a per-build basis<br />

to allow for comparisons between builds and baselines<br />

There are popular coverage tools in the free software<br />

domain, including GNU gcov and Clang-based tools. There are<br />

also several promising projects in early phases of development,<br />

but none of them satisfy all the enumerated criteria. One of the<br />

most significant limitations is lack of support for statement and<br />

MC/DC coverage. Most of the available free tools only support<br />

line and branch coverage metrics. Line coverage can<br />

sometimes be used as a replacement for statement coverage if<br />

we engage static analysis tools to enforce a coding convention<br />

to place only one statement in each line. Such an approach,<br />

however, is far from convenient and actually obfuscates the<br />

code. There does not seem to be a good option among free<br />

tools for MC/DC coverage, which represents a significant<br />

roadblock to adoption in high safety integrity level projects.<br />

C. Traceability<br />

Support for traceability from requirements to code and<br />

associated tests is another important feature. Traceability<br />

facilitates requirements-based testing and, in the context of code<br />
coverage, means correlating a test case with the code<br />
coverage results generated when executing that test case.<br />

Additionally, the test-case-to-requirement correlation helps<br />

developers understand how well a specific requirement is being<br />

tested.<br />

A traceability framework needs to provide bidirectional<br />

links between all important artifacts created during the<br />

software verification process. In the context of code coverage<br />

tools, the important element is to assure the ability to annotate<br />

code coverage results with the information about the executed<br />

test case. Commercial solutions for unit testing offer these<br />

capabilities out of the box, whereas open source unit testing<br />

frameworks require integration with the coverage tool to fulfill<br />

this requirement. A convenient method of doing this is via the<br />



API, which is often provided by coverage tools. Such an API<br />

enables the integration of the unit testing framework with the<br />

coverage tool. There are commercial solutions that offer<br />

integration APIs, enabling easy interaction with any testing<br />

framework.<br />

The API does not have to be complex. In most cases,<br />

simple functions to notify about test start/stop events are<br />

sufficient:<br />

void TestStart(const char* testName);<br />

void TestStop(void);<br />

This kind of API assumes that calling TestStart annotates<br />

the coverage results stream with the ID of the executed test<br />

case. Calling TestStop closes the results section assigned to the<br />

specific test case. This simple integration schema is presented<br />

in Fig. 2.<br />

With the API discussed above, integration of coverage tools<br />

with unit testing frameworks is relatively simple. Most of the<br />

open frameworks support plug-ins for monitoring test<br />

execution, which can be used to send messages to the coverage<br />

tool about the beginning and end of the test execution. The<br />

following example shows how to use Google Test’s<br />

testing::TestEventListener interface to bridge the unit testing<br />

framework and coverage tool:<br />

class CoverageAnnotator : public ::testing::EmptyTestEventListener<br />
{<br />
public:<br />
  virtual void OnTestStart(const ::testing::TestInfo& test_info)<br />
  {<br />
    TestStart(test_info.test_case_name()); /* Coverage tool API call */<br />
  }<br />
  virtual void OnTestEnd(const ::testing::TestInfo& test_info)<br />
  {<br />
    TestStop(); /* Coverage tool API call */<br />
  }<br />
};<br />

Reports generated from the integrated unit testing<br />

framework and coverage tool allow developers to review<br />

coverage results generated by a specific unit test. An example<br />

showing a collection of Google Test test cases with associated<br />

coverage results is shown in Fig. 3.<br />

D. Target-based Testing and Metrics Collection<br />

ISO 26262 and DO-178C recommend that the test environment<br />
be as close as possible to the production environment.<br />

This means that results, including code coverage, shall be<br />

collected from the target processor or at least from the<br />

simulator. The code coverage tool therefore needs to support<br />
collecting results from the embedded hardware. The subject is<br />

broad, but with some simplification, code coverage can be<br />

collected using two types of technology:<br />

• Source code instrumentation (injecting extra code into<br />

original code)<br />

• Processor core trace logic (collecting information about<br />

instructions executed by the core)<br />

Source instrumentation is flexible and can be applied at<br />

build time. It is independent of the hardware and allows<br />

execution on the target processor. Dedicated integration may<br />

be required, however, to work with specific cross-compilers.<br />

The technology supports all known coverage metrics, from the<br />

statement through the path and condition coverage up to full<br />

MC/DC coverage. The main limitation of this technology is<br />

that it imposes some overhead on the execution time (i.e., time<br />

for executing the injected instrumentation) and it increases the<br />

footprint of the binary executable, which may be an issue for<br />

smaller MCUs.<br />
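What such instrumentation injects can be sketched concretely. The tool assigns each statement an index and inserts a probe before it; the hit table is dumped after the run and mapped back to source lines. The `COV_HIT` macro, probe numbering, and the instrumented function below are illustrative assumptions, not any specific tool's output:

```cpp
#include <cstddef>

// Hypothetical statement-coverage instrumentation: the tool assigns
// each statement an index and injects COV_HIT(i) before it. The hit
// table is dumped after the run and mapped back to source lines.
constexpr std::size_t kNumProbes = 3;
static unsigned g_hits[kNumProbes] = {0};
#define COV_HIT(i) (++g_hits[i])

// Instrumented version of:
//   int clamp(int v) { if (v > 100) v = 100; return v; }
int clamp_instrumented(int v) {
    COV_HIT(0);                            // function entry statement
    if (v > 100) { COV_HIT(1); v = 100; }  // the clamping branch
    COV_HIT(2);                            // the return statement
    return v;
}
```

Running only `clamp_instrumented(50)` leaves probe 1 at zero, which is exactly the uncovered-branch evidence a coverage report surfaces; the extra increments and the hit table are the execution-time and footprint overhead described above.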

An instruction trace is an alternative method of collecting<br />

code coverage metrics. This method requires dedicated support<br />

from the hardware. Processors must contain core trace logic,<br />

which generates a stream of information about machine<br />

instructions executed by the core. This information is recorded<br />

and later mapped to the high-level language (C/C++) to<br />

provide source code level coverage metrics. This approach is<br />

also feasible for simulators. The significant advantage of this<br />

technology is that it does not impose any overhead on<br />

execution time or binary footprint, which may be important for<br />

testing portions of the code in which timing is critical.<br />

Fig. 2: Diagram showing a simple Google Test integration with a code coverage solution.<br />



Fig. 3: Example report from a commercial tool showing Google Test test cases associated with coverage results.<br />

The severe limitation of this technology is that mapping of<br />

machine instructions to the high-level language structures is<br />

not trivial. For C/C++, the object code does not contain enough<br />

information to support full tracing of machine instructions to<br />

high-level language constructs. In effect, solutions that are<br />

currently available only support statement and condition<br />

coverage in a reasonable way. More complex metrics are not<br />

supported. This eliminates this methodology from the<br />

applications in the projects where MC/DC coverage is<br />

required.<br />

Practical implementation of the unit testing solution for<br />

software certification induces some additional requirements on<br />

the coverage tools, which do not stem directly from the safety<br />

standards. Example is the ability to merge coverage results<br />

from different execution sessions into one report, proving there<br />

are no unexpected gaps in the testing process. The need for this<br />

functionality is well known to any practitioner. In real-world<br />

projects, some sections of the code can be tested only when<br />

using the actual hardware, while others can be examined using<br />

simpler, less expensive setups with simulators or even host<br />

processors. The ability to combine the testing results from PIL<br />
(processor-in-the-loop) and SIL (software-in-the-loop) testing<br />
sessions saves time. Time savings can also be<br />
obtained by combining system-level coverage testing<br />

results with unit testing coverage results.<br />

The analysis of the features expected from the code<br />

coverage tool deployed for safety-critical software<br />

development suggests that free coverage tools are not yet at a<br />

maturity level that would allow organizations to integrate them<br />

into safety-oriented production development environments. At<br />
the moment, commercial tools are better suited to this goal.<br />

E. Fault Injection and Robustness Test Cases<br />

Fault injection is a method used in software testing to<br />

assure that the system can safely handle all the errors that<br />

survived the verification and validation processes, the so-called<br />

residual errors. This method assumes that there is a safety<br />

mechanism that can bring the system into a safe state when an<br />

unexpected error occurs. The goal of fault injection testing is to<br />

prove that those safety mechanisms are there and are effective.<br />

This methodology is explicitly listed in ISO 26262 in<br />

context of unit testing: “includes injection of arbitrary faults in<br />

order to test safety mechanism” [2]. DO-178C also discusses<br />

this methodology “Robustness test cases demonstrate the<br />

ability of the software to respond to abnormal inputs and<br />

conditions.” [3]<br />

There are numerous studies and publications discussing the<br />

effectiveness and viable approaches to fault injection testing<br />



[4], [5]. Fault injection testing is recommended by safety<br />

standards for software testing at unit, integration, and system<br />

levels. The easiest and most common way of implementing this<br />

methodology is mutating a section of the software and<br />

observing the response of the test suites. In the context of unit<br />

testing, a legitimate strategy is to replace a function or method<br />

with an alternative implementation that returns a mutated value<br />

or injects a side effect, such as a modification of a global<br />

variable (although this capability is not limited to the fault<br />

injection testing only). The natural application is isolating<br />

tested components during unit testing to make the tests faster,<br />

more robust, and less complicated. An ideal framework should<br />

offer the ability to intercept any function or method call in the<br />

tested code and:<br />

• Stub it (provide a dummy implementation in case the<br />

original definition is not available)<br />

• Simulate the return value<br />

• Modify global variables or object state<br />

• Check asserted expectations about the call, such as the<br />

values of the parameters used for the call, etc.<br />

• Perform a proxy call to an original symbol, potentially<br />

with modified parameters or other side effects<br />

There are several technologies that can help achieve the<br />

above goals. Some of them only help when replacing the<br />

original calls with a “test double,” while others support the process of<br />

programming the expected behavior. For complete control of<br />

the outcome, however, users may need to rely on the<br />

combination of the following:<br />

• Mocking frameworks (e.g., Google Mock or<br />

CppUMock).<br />

• Link time substitution: replacing object files containing<br />

original definitions with prepared test doubles.<br />

• Runtime function pointer substitution: for every<br />

function expected to be replaced with the test double,<br />

declare a corresponding function pointer and assign it<br />

with the default definition. At test time, the pointer can<br />

be reassigned to an alternative implementation. This<br />

approach is usable only for C-style functions.<br />

• Source code instrumentation, which uses a dedicated<br />

tool to analyze the source code and apply<br />

instrumentation that replaces the invocations of<br />

functions with desired “test doubles.”<br />

• Binary instrumentation: another dedicated tool that<br />

analyzes the binaries at project link time and performs<br />

all the required rewiring to call desired test doubles in<br />

place of the original functions.<br />

• Preprocessor substitution: renaming function<br />

calls using preprocessor macros.<br />

F. Test Framework Fault Injection Implementation<br />

Mocking frameworks, such as Google Mock, offer very<br />

convenient APIs for programming expected behavior, the<br />

benefits of which are the readability of programmed<br />

expectations for mock object behavior and the ability to store<br />

those definitions together with the test case. A serious<br />

limitation, however, is that mocking with frameworks like<br />

GMock only works effectively for virtual methods of C++<br />

classes. Mocking C-style functions or non-virtual methods<br />

requires a significant redesign of the code, which is often<br />

unacceptable. If our code is compiled as C, we may not be able to<br />

use a mocking framework at all, because many of them rely on<br />

C++ language features. In general, mocking<br />

frameworks seem to be a viable option for fault injection<br />

testing and isolation testing, but only if this was planned from<br />

the beginning of the project and the software architecture was<br />

designed with this intention.<br />

There is also the possibility of using the linker for<br />

substituting test doubles for the real code. At least two<br />

approaches are possible. The first is link-time substitution, which<br />

assumes entire modules can be replaced with alternative<br />

implementations during the linking phase [6]. This approach may<br />

work in the early phases of the project where it is easy to<br />

substitute entire modules with test implementations. But as<br />

complexity increases, it is more and more difficult to inject<br />

alternative implementations. In real-world projects, this<br />

approach is seldom used and is rarely recommended.<br />

The second approach is to rewire calls to inject alternate<br />

code at link time with the dedicated support of the linker. Some<br />

linkers, such as the GNU linker, provide the ability to redirect<br />

all references to a symbol to an alternative definition. On some<br />

platforms, the GNU linker accepts the "--wrap" option<br />

with the name of the symbol to be rewired (which needs<br />

to exist somewhere in the objects or libraries). After gaining<br />

some experience with this approach, practitioners find it to be<br />

quite powerful, as it enables point injections to existing objects<br />

or code without modifications. Problems emerge, however,<br />

when used with C++ because the information passed to the linker<br />

must be in the form of a mangled symbol name, which is<br />

inconvenient and error prone. In general, it is an interesting<br />

alternative if fault injection testing is conducted in a<br />

limited scope and the toolchain offers the corresponding<br />

functionality. Overall, using this approach during unit testing<br />

for standard mocking and isolation may be too inconvenient<br />

due to the difficulty in managing test doubles.<br />

And finally, there is the possibility of injecting alternative<br />

behavior into the tested code using so-called code patching.<br />

This process usually requires user-provided configuration for<br />

the test doubles and the instrumentation of the source code it<br />

replaces. Although more technologically advanced than<br />

previously described methods, this approach gives a good level<br />

of flexibility. Users do not have to design their code in a<br />

specific way to accommodate stubbing and mocking. There<br />

is no need to remove the original definition of a mocked<br />

function from the test binary. Moreover, most fault injection<br />

frameworks support so-called proxy calls, where the injected<br />

test double performs some of the operations required by the test<br />

scenario and then invokes the original function, which is useful in<br />

many situations. There are commercial solutions that offer<br />

these capabilities, which prove useful for fault injection<br />

testing, as well as for regular isolation during unit testing.<br />



A critical factor related to all the described implementation<br />

techniques is how to program the behavior of the test double.<br />

How do we express the action required to inject a fault into the<br />

testing process? Mocking frameworks, such as Google Mock,<br />

include a convenient and easy-to-use API. The significant<br />

advantage of the Google Mock API is that the definition of the<br />

mock can be stored inside the test case. Those who have tried<br />

implementing unit tests manually understand how important it<br />

is to see the preconditions of the test and the test double<br />

definition in one place. If an alternative method is used, such as<br />

link-time substitution, we will need to program the test double<br />

behavior inside the test double’s definition, which complicates<br />

maintaining the alternative logic for multiple test cases. There<br />

are interesting frameworks, such as CppUMock, that address<br />

exactly this issue by providing generic functionality that<br />

enables the separation of the test-specific logic definition for<br />

the test double from its body. This allows storing the test<br />

doubles configuration together with the test case.<br />

G. Tool Qualification<br />

When making a decision about tools for safety-critical<br />

development, it is important to consider the tool qualification<br />

process. According to ISO 26262, “The objective of tool<br />

qualification is to provide evidence of software tool suitability<br />

for use when developing safety-related item or element.” [7].<br />

Safety standards differ in terminology and requirements related<br />

to this process, but the guidance generally requires that the<br />

process starts with the tool classification. The tool<br />

classification process determines whether qualification is<br />

required, as well as the objectives and appropriate methods to<br />

qualify the tool. The actual qualification is conducted<br />

according to guidelines stemming from the classification<br />

process. A commonly chosen method for qualification is based<br />

on the validation of the software tool in the development<br />

environment. This method assumes that the software tool has well-defined<br />

functional requirements and that appropriate test cases<br />

are available to validate those requirements.<br />

Commercial tools usually offer dedicated qualification kits,<br />

which significantly simplify the qualification process. For<br />

example, a commercial code coverage tool will most likely<br />

provide a required set of test cases together with expected<br />

result definitions to help confirm the correct operation of the<br />

tool. It’s reasonable to check with the vendor before purchasing<br />

the tool whether qualification support is provided for the project-specific<br />

environment.<br />

Open source components of the development environment,<br />

such as a unit testing framework, will require additional work<br />

to qualify them. Teams will need to prepare a definition of<br />

functional requirements and collect appropriate test cases that<br />

prove the correctness of functionality. In many cases, it is<br />

possible to reuse the test cases created for the open source tool<br />

for standard quality control. Google Test, for example, is<br />

distributed together with a reasonable set of tests that can be<br />

reused for the qualification process. The qualification process<br />

does not have to cover the entire functionality of the tool—just<br />

the features of the tool that are actually used in the<br />

development process.<br />

The documentation created during the qualification process<br />

should contain instructions for the developers—the so-called<br />

“safety manual,” which clearly defines which functionalities of<br />

the tool are qualified and can be used for safety-critical<br />

development, as well as the settings required for safe usage of<br />

the tool. It is sufficient to perform the tool qualification process<br />

once for a given project, assuming we will not change the<br />

versions of the tools.<br />

The qualification process for a solution containing many<br />

separate components is challenging, especially if the solution<br />

contains open source components that were not designed with<br />

the qualification activities in mind. In the likely case of<br />

selecting validation as the qualification method, the most<br />

expensive part of the process will be the definition of<br />

functional requirements for the open source project and<br />

preparation of the test cases to validate the requirements. The<br />

cost of preparing the validation test cases can be reduced by<br />

using the test cases (if they exist) shipped with the open source<br />

tool.<br />

Unless it is separately regulated by the end customer or the<br />

business’s internal development policy, there are no<br />

obstacles to using open source tools for safety-critical<br />

software development other than the qualification process.<br />

IV. SUMMARY<br />

The aim of this paper was to discuss the feasibility of<br />

building a unit testing solution for safety critical systems based<br />

on free software components. Due to the stringent requirements<br />

of safety critical software standards, such a solution will have<br />

to be a mixture of open source and commercial tools in most<br />

cases. An obvious drawback of such a mixed solution is that the<br />

cost of maintenance and tool qualification will likely be higher<br />

than using a uniform commercial solution. It doesn’t mean,<br />

however, that this approach is unreasonable. An important<br />

aspect of open source solutions is the reliance on open<br />

standards and formats, which secures the investment made in<br />

test case implementation and eases the exchange of compliance<br />

artifacts in the customer’s supply chain. The final decision<br />

about a unit testing solution must be made by the development<br />

team based on their understanding of the specifics of their<br />

development environment, project requirements, and customer<br />

expectations. The concept of using an open source unit testing<br />

framework supported with commercial tools for advanced code<br />

coverage and test doubles should be considered a viable<br />

solution. With the growing number of software projects that<br />

require certification, and the increasing amount of open source<br />

code integrated into safety critical systems, this approach is<br />

likely to gain popularity across software organizations.<br />

REFERENCES<br />

[1] DO-178C, Software Considerations in Airborne Systems and Equipment<br />

Certification, RTCA, Inc. December 13, 2011, 4.4.3.b<br />

[2] ISO 26262 Road vehicles – Functional safety, part 6, 9.4.2, Table 12,<br />

subscript a<br />

[3] DO-178C, Software Considerations in Airborne Systems and Equipment<br />

Certification, RTCA, Inc. December 13, 2011, 6.4.2.2<br />

[4] D. Cotroneo, R. Natella, “Fault Injection for Software Certification,”<br />

IEEE Security & Privacy, July 2013<br />



[5] J. M. Voas and G. McGraw, Software Fault Injection: Inoculating<br />

Programs Against Errors. John Wiley & Sons, Inc., 1998<br />

[6] J. W. Grenning, Test-Driven Development for Embedded C, Pragmatic<br />

Bookshelf, September 2014<br />

[7] ISO 26262 Road vehicles – Functional safety, part 8, 11.1<br />
