A Network Interface Card Architecture for I/O Virtualization in ... - TUM

A Network Interface Card Architecture for I/O Virtualization 

in Embedded Systems 

Holm Rauchfuss, Thomas Wild, Andreas Herkersdorf

Technische Universität München, Institute for Integrated Systems

D-80290 Munich, Germany

holm.rauchfuss@tum.de, thomas.wild@tum.de, herkersdorf@tum.de

ABSTRACT 

In this paper we present an architectural concept for network 

interface cards (NIC) targeting embedded systems and supporting 

I/O virtualization. Current solutions for high performance computing do not sufficiently address embedded system requirements, i.e., guaranteeing real-time constraints and differentiated service levels while utilizing only limited HW resources. The central ideas of our work-in-progress

concept are: A scalable and streamlined NIC architecture 

storing the rule sets (contexts) for virtual network interfaces 

and associated information like DMA descriptors and 

producer/consumer lists primarily in the system memory. 

Only for currently active interfaces or interfaces with special 

requirements, e.g. hard real-time, the required information 

is cached on the NIC. By switching between the contexts 

the NIC can flexibly adapt to service a scalable number 

of interfaces. With the contexts the proposed architecture 

also supports differentiated service levels. On the NIC, (re-)configurable finite state machines (FSMs) handle the data path for I/O virtualization. This allows a more resource-limited NIC implementation. With a preliminary analysis we estimate the benefits of the proposed architecture, and we outline its key components.

Categories and Subject Descriptors 

C.4 [Performance of Systems]: Design Studies, Performance 

Attributes; B.4.2 [Input/Output and Data Communications]: 

Input/Output Devices—Channels and Controllers 

General Terms 

Design, Performance 

Keywords 

I/O Virtualization, Embedded Systems, Network Interface 

Card 

1. INTRODUCTION 

Over the last decade(s), virtualization has become a mainstream technique in data centers for better resource utilization by server consolidation. By abstraction, the physical resources are shared between several virtual machines (VMs), so-called domains. The improvement of the underlying virtual machine monitors (VMM) ([1], [2]) and HW ([4]) for data centers has been targeted extensively by research. However, virtualization is still an emerging topic for embedded systems, in particular multiprocessor systems-on-chip. Their increasing performance and the combination of applications with different requirements on a single shared platform make them particularly well-suited for virtualization. First steps have been taken to analyze and adopt virtualization here ([6], [7]).

This paper appeared at the Second Workshop on I/O Virtualization (WIOV ’10), March 13, 2010, Pittsburgh, PA, USA.

A critical aspect is the virtualization of I/O, since the computational overhead and the performance degradation are high there, in both data centers and embedded systems. Research for High Performance Computing (HPC) shows that near-native throughput, i.e., throughput equal to a set-up without virtualization, can be achieved by improvements in SW packet handling and by offloading virtualization onto the NIC ([9], [10]). Since the focus of this research is on maximizing overall system throughput, not on resource-limited NIC architectures, the proposed architectures are not optimal for use in embedded systems with their specific requirements.

The paper is structured as follows: Section 2 provides an 

overview on state of the art of I/O virtualization. Section 

3 describes the specific requirements for embedded systems 

and the fundamental concepts of the proposed NIC architecture. 

A preliminary performance estimation is given in 

section 4. An exploration of key components is described in 

section 5. Section 6 outlines future work and summarizes 

the paper. 

2. STATE OF THE ART 

Sharing physical network access between domains can be implemented 

in HW, SW or in a mixed mode [12]. The generic 

solution i.e., VMM only, dedicates one virtualization domain 

as driver domain and exclusively assigns the network card to 

it. In such a system, other domains gain network access by 

transferring packets via a SW-based bridge and front- and 

back-end device drivers [1]. Several protocol improvements reduce the overhead of the actual transmission of the packets between the domains; a comprehensive overview is given in [11]. I/O virtualization can also be performed within the VMM itself, i.e., the hypervisor provides drivers for network cards and switches packets between the domains ([3]).

A further improvement to the scenario above is the use of multi-queue network cards [9] such as Intel’s VMDq [13]. Those network cards offer multiple pairs of Tx/Rx queues. This allows HW offloading of packet (de-)multiplexing and queuing for domains based on their MAC address (and VLAN tag). A Tx/Rx pair is assigned to a VM and the driver domain is granted access to the memory region with the respective Tx/Rx buffers. Tx queues are served round-robin.

Domains can also directly access a NIC via virtual network interfaces. Evidently, such approaches require extensions of the NIC, i.e., dedicated queues, buffers, interfaces and additional management logic. Before a domain can use its virtual network interface, the VMM has to configure the NIC accordingly.

This concept is presented as a self-virtualizing network card based on an IXP2400 network processor [8]. Here, one microengine is used for demultiplexing Rx traffic and another one for multiplexing Tx traffic. Management of the network card is performed in SW on the NIC XScale CPU. The set-up is restricted to 8 domains, since the microengine is limited to 8 threads. To avoid coordination by the SW on the XScale, none of the other free microengines can be used for processing Rx or Tx traffic in parallel.

Direct I/O is also addressed by RiceNIC [10]. Here concurrent network access is provided by a network card based on an FPGA. It contains a PowerPC CPU and several dedicated HW components (see Fig. 1 for an abstract representation). The SW on the PowerPC performs data and control path functions for packet processing. Each virtual network interface requires 388 KB of NIC memory: 4 KB for context and 128 KB each for metadata, Tx buffer and Rx buffer.

[Figure 1: RiceNIC with central processing on PowerPC CPU. Block diagram: the NIC contains the Rx/Tx MAC, DMA units, the NIC internal / instruction memory and a NIC-CPU that performs management, DMA management, signaling, header parsing, queueing and scheduling; the NIC is connected via the system bus to the CPUs and to the system memory, which holds P/C lists, Rx/Tx rings and packets.]

Although the aforementioned solutions provide near-native throughput, they have several shortcomings with respect to their applicability in embedded environments.

SW-based bridging and multi-queue network cards rely on a driver domain which is interleaved in the network communication path. This results in an increased latency and (complex) scheduling dependencies. Processing time of the host CPU and system memory are consumed by this driver domain. If the hypervisor performs I/O virtualization directly, the trusted computing base of the hypervisor is broadened, with side effects on security, footprint and verification.

Multi-queue network cards are limited in their number of available queue pairs. To support a scalable number of domains, such a NIC either has to keep unused pairs in reserve or has to fall back to SW-based bridging for excess domains. Rx queues are served in the order given by packet arrival, resulting in possible head-of-line blocking for high-priority packets.

Similarly, the concept for direct I/O is also restricted by the number of virtual network interfaces supported in HW.

The utilized IXP2400 network processor is targeted at line cards for packet forwarding and processing, i.e., it does not represent an optimal reference architecture for network cards supporting virtualization due to its limited interface to the host.

The primary goal of RiceNIC is to have a configurable and flexible NIC architecture. Therefore most functionality is performed by the firmware on the PowerPC. As a negative side effect, the firmware is in the critical path for all packet processing, e.g., header parsing, DMA descriptor generation and packet (de-)multiplexing. Furthermore, extending RiceNIC with extra virtual network interfaces requires additional NIC memory for each of them.

Finally, as overall throughput is the focus of I/O virtualization research, little effort has been put into resource-limited concepts for the network cards themselves. This motivates our proposed concept, which is presented next.

3. CONCEPT FOR AN ES-VNIC ARCHITECTURE

To better illustrate the need for efficient I/O virtualization in embedded systems, we give an introductory example here: An automotive head unit for premium cars represents a flexible and high-performance, but still embedded system. It consolidates infotainment (video, audio, Internet access, etc.) and numerous car-related, safety-critical functions (park distance control, user interface for driver assistance systems, warning signals, etc.) on one HW platform and is connected via network to other electronic control units. Depending on the actual driving situation, different sets of functions (which can be partitioned into domains to achieve robustness via isolation) and their communication are active. Those situations can change quickly, e.g., jumping from normal radio listening to displaying an urgent traffic warning. Most functions have to be running concurrently to avoid the disruptive delay of having to start them first. To be usable in an automotive environment, the head unit also has to be implemented in a very cost- and power-efficient way.

3.1 Requirement Analysis 

To fit both embedded systems and I/O virtualization, NIC architecture concepts need to address special requirements:

• The goal of overall maximum throughput has to be complemented with low latency and real-time processing of packets for specific domains. For an embedded system, a mix of hard real-time, soft real-time and best-effort domains has to be supported. For example, a hard real-time domain with networked closed-loop control needs to transmit traffic without jitter, in contrast to a best-effort domain with bursty video streams. Overall, the network card should provide calculable and predictable response times for traffic transfers. Given this requirement, SW should not be in the critical transmission path, neither on the NIC itself nor via a driver domain.

• Different service levels require enriched methods to 

process packets and to signal specific events to the 

VMM and domains. This includes prioritization of 

packets and interfaces, and also observation of bandwidth 

guarantees and packet dropping probabilities. 

• The general design of the network card has to include 

only a limited number of HW components for enabling 

virtualization. In relation to the power consumption 

and performance of the complete embedded system the 

NIC should only contribute a small fraction to it, but 

still provide high throughput i.e., several 100 Mb/s or 

higher. Furthermore, the usage of NIC memory should 

be limited to a minimum. Instead the system memory 

should be used as much as possible. 

• Performing I/O virtualization by the VMM or domains should be avoided to keep the cores free for actual processing, as CPU power is usually scarcer in embedded systems than in HPC systems.

In general, I/O virtualization requires a NIC to perform the 

following tasks efficiently: 

• Header-Parsing: The header of incoming packets has to be parsed to determine the destination domain. For layer 2 switching, only the MAC destination address and the VLAN tag of the Ethernet header are required.

• Buffering: It must be possible to buffer a packet efficiently, because prior packets may block further processing or packets with higher priority have to be processed first.

• Scheduling: The NIC should be able to switch processing between packets, either because of temporary blocking or to handle packets of domains with higher priority first. Therefore, the NIC can multiplex outgoing packets from the domains and demultiplex incoming traffic in a more sophisticated way than simple round-robin.

• DMA: The NIC should have the ability to transfer a 

packet to or from the (system) memory on its own. 

• Signaling: Based on pre-defined service levels, the NIC should be able to individually signal certain events to the VMM or directly to domains. Events can be interrupts for new packet arrival or requests for new DMA descriptors.

• Management: The basic management for packet processing, i.e., (re-)configuration of HW blocks or coordination of the individual tasks, should be performed within the NIC.

[Figure 2: Concept of ES-VNIC architecture. Block diagram: the NIC contains the Rx/Tx MAC, a local cache for contexts, P/C lists and Rx/Tx queues, the NIC buffer, a DMA unit and FSMs for header parsing, management, scheduling, queue-alloc and signaling; the NIC is connected via the system bus to the CPUs and to the system memory, which holds contexts, P/C lists, Rx/Tx rings and packets.]
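The header-parsing task listed above can be illustrated in software. The following Python sketch is only an illustration under stated assumptions, since the ES-VNIC performs this step in a HW FSM: it extracts the destination MAC address and, if present, the IEEE 802.1Q VLAN ID from a raw Ethernet frame and looks up the destination domain. The table `DOMAIN_TABLE`, the function names and the domain names are hypothetical and not part of the paper.

```python
import struct

VLAN_TPID = 0x8100  # EtherType value that marks an IEEE 802.1Q VLAN tag

# Hypothetical rule table: (destination MAC, VLAN ID or None) -> domain name.
DOMAIN_TABLE = {
    (bytes.fromhex("020000000001"), 10): "hard-rt-domain",
    (bytes.fromhex("020000000002"), None): "best-effort-domain",
}

def parse_l2_header(frame: bytes):
    """Return (destination MAC, VLAN ID or None) of an Ethernet frame."""
    if len(frame) < 14:
        raise ValueError("frame shorter than an Ethernet header")
    dst_mac = frame[0:6]
    (ethertype,) = struct.unpack_from("!H", frame, 12)
    vlan_id = None
    if ethertype == VLAN_TPID:
        if len(frame) < 18:
            raise ValueError("truncated VLAN tag")
        (tci,) = struct.unpack_from("!H", frame, 14)
        vlan_id = tci & 0x0FFF  # lower 12 bits of the tag control information
    return dst_mac, vlan_id

def destination_domain(frame: bytes):
    """Map a frame to its destination domain, or None if no rule matches."""
    return DOMAIN_TABLE.get(parse_l2_header(frame))
```

Only the first 14 to 18 bytes of the frame are touched, which is why such a lookup can complete while the rest of the packet is still being received.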

3.2 Proposed Architecture and Exemplary Packet 

Processing 

The above requirements and considerations drive our proposal for a new Embedded System specific VNIC (ES-VNIC) architecture (see Fig. 2). It should provide the right trade-off: high throughput combined with QoS and real-time support, rather than the ultimate throughput targeted in server or HPC environments (tens of Gb/s). It relies on a tailored set of finite state machines specifically crafted for handling the tasks described above. This reduces the HW footprint of I/O virtualization and provides better support for real-time constraints and service levels of domains. Decoupling those FSMs makes parallel and pipelined processing possible.

To improve scalability, the resources (queues, caches, buffers) on the NIC are not constantly occupied by domains or interfaces, but are instead assigned dynamically. Different levels of service may be provided. For interfaces with real-time constraints, configuration and queues always reside within the ES-VNIC. Best-effort interfaces, in contrast, share the available resources, i.e., their rule sets are loaded on demand from system memory, replacing the information of inactive interfaces.

The NIC contains a standard MAC which is wrapped by 

flexible HW extensions to enable direct I/O. Those extensions 

are described best by explaining their interaction for 

processing an incoming Ethernet packet (see Fig. 3). This figure is a message sequence chart of the incoming packet processing: the communication between the different extensions, i.e., handing over data or triggering an extension, is visualized by directed lines. A block stands for a delay in an extension, either for processing or for storing data. Time progresses down the Y axis, i.e., the figure has to be read from top to bottom.

[Figure 3: Processing packet with ES-VNIC (Rx path). Message sequence chart with the columns MAC, NIC Buffer, Header-Parsing, Scheduling, Queue-Alloc, Management, DMA and System Memory.]

A packet that arrives at the MAC is temporarily stored in 

the NIC buffer and the header is sent in parallel to the 

header-parsing unit where the relevant information regarding 

to which domain this packet should be routed is extracted. 

These actions are performed at line speed. As only 

the header is parsed the header-parsing unit completes before 

the complete packet is stored at the buffer. 

The NIC buffer can store a maximum-sized Ethernet packet as a whole. Any buffered packet can be accessed directly. Therefore, packets do not have to be processed in their arrival order, e.g., high-priority packets for real-time tasks can be preferred. The address of the packet is handed to the header-parsing unit, which combines it with the extracted header information for identifying the packet.
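Processing packets out of arrival order can be sketched as a priority queue over NIC buffer addresses. The Python sketch below is an illustration only: the paper does not fix a scheduling policy, so strict priority with FIFO order among equal priorities is an assumption, as are the class name `PacketScheduler` and the convention that a lower number means a higher priority.

```python
import heapq
import itertools

class PacketScheduler:
    """Pick buffered packets by priority instead of arrival order."""

    def __init__(self):
        self._heap = []
        self._arrival = itertools.count()  # tie-breaker keeps FIFO order per priority

    def enqueue(self, buffer_addr: int, priority: int) -> None:
        # Lower number = higher priority (assumed convention).
        heapq.heappush(self._heap, (priority, next(self._arrival), buffer_addr))

    def next_packet(self):
        """Return the NIC buffer address of the packet to process next, or None."""
        if not self._heap:
            return None
        _, _, buffer_addr = heapq.heappop(self._heap)
        return buffer_addr
```

A packet for a real-time interface that arrives after two best-effort packets is returned first, which avoids the head-of-line blocking described for arrival-ordered Rx queues.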

With the extracted header information the management FSM 

can then start to select the context for processing this packet. 

In this context all relevant information regarding the handling 

is stored, for example which priority such a packet 

should have, which are the conditions for signaling the domain 

of the arrival of the packet, etc. The main store for 

those contexts is on the system memory in order to limit the 

resources in the ES-VNIC. Only a small cache for contexts 

with packets under processing is present on the ES-VNIC. 

Contexts for critical domains can be pinned to the cache 

permanently. Contexts for best-effort or low-priority packets 

instead have to be loaded from system memory, involving 

writing back contexts which need to be replaced due to the 

cache size limitation. A context can contain the rule set 

for a complete domain, but also for individual Rx or Tx 

network interfaces. A context can have several kilobytes of 

data due to containing advanced rules, priority settings and 

configurations. 
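The interplay of the small on-NIC cache, pinned contexts and write-back to system memory can be sketched as follows. This Python model is a simplified illustration, not the proposed HW design: the least-recently-used replacement policy is an assumption (the paper does not name one), and `ContextCache` and its methods are hypothetical names.

```python
from collections import OrderedDict

class ContextCache:
    """Small on-NIC cache for contexts; the main store is system memory."""

    def __init__(self, slots: int, system_memory: dict):
        self.slots = slots
        self.system_memory = system_memory  # interface id -> context (dict)
        self.cache = OrderedDict()          # interface id -> context, in LRU order
        self.pinned = set()

    def pin(self, iface):
        """Load a context and keep it resident (e.g. a hard real-time interface)."""
        self.lookup(iface)
        self.pinned.add(iface)

    def lookup(self, iface):
        """Return the context for iface, fetching it from system memory on a miss."""
        if iface in self.cache:
            self.cache.move_to_end(iface)   # mark as most recently used
            return self.cache[iface]
        if len(self.cache) >= self.slots:
            self._evict()
        self.cache[iface] = dict(self.system_memory[iface])
        return self.cache[iface]

    def _evict(self):
        # Choose the least recently used context that is not pinned.
        victim = next((i for i in self.cache if i not in self.pinned), None)
        if victim is None:
            raise RuntimeError("all cache slots are pinned")
        # Write the (possibly updated) context back before replacing it.
        self.system_memory[victim] = self.cache.pop(victim)
```

Pinning trades capacity for predictability: every pinned context removes one slot from the pool shared by best-effort interfaces.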

As loading and writing back may take a considerable amount of time, the management FSM is designed to handle several such processes and contexts in parallel, switching between them to reduce stalling. At any time, several packets are to be processed by the ES-VNIC in parallel.

Similar to the contexts, the DMA descriptors and the respective producer/consumer lists (P/C lists) have to be available in the local cache or be fetched from the system memory if required. The DMA descriptors are stored in generic queues where they can be read by the scheduler. The queue-alloc unit is responsible for assigning and filling those queues.

Based on the contexts of the current packets, the scheduling unit decides which packet should be processed next and fetches a DMA descriptor from the respective queue. Along with the address of the packet in the NIC buffer, this information is handed over to the DMA unit. The DMA unit then writes the packet over the system bus to the system memory. Afterwards, it informs the management unit about the completion of the transfer. The respective producer/consumer list is updated and written back to the system memory, where it can be read by the domain. Then the management unit configures the signaling unit according to the context, i.e., an immediate interrupt for the packet or waiting until a threshold of packets is reached. The respective signaling concludes the packet processing.
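The two signaling modes mentioned above (immediate interrupt or interrupt after a threshold of packets) can be sketched in a few lines. The Python model below is a hedged illustration: the context field `irq_threshold` and the class `SignalingUnit` are hypothetical names, and a real implementation would most likely also need a timeout so that a partially filled batch is eventually signaled.

```python
class SignalingUnit:
    """Raise an interrupt per packet or only after a threshold of packets."""

    def __init__(self):
        self.pending = {}  # interface id -> packets delivered but not yet signaled

    def packet_delivered(self, iface, context) -> bool:
        """Return True if the domain should be interrupted now."""
        threshold = context.get("irq_threshold", 1)  # 1 = immediate interrupt
        self.pending[iface] = self.pending.get(iface, 0) + 1
        if self.pending[iface] >= threshold:
            self.pending[iface] = 0
            return True
        return False
```

With a threshold of 1 every packet is signaled, as a hard real-time interface would require; a larger threshold reduces the interrupt load caused by best-effort traffic.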

The same units are utilized for sending a packet. Only the 

header-parsing unit is not used as a packet is already associated 

with a Tx interface and therefore with the respective 

context. The ES-VNIC management is triggered from the 

driver to send a packet. The respective context is loaded 

and the DMA descriptor is read to an allocated queue. If 

the scheduling unit decides to send this packet the descriptor 

is handed over to the DMA unit which writes the packet 

to the NIC buffer. Once the packet has been written completely, it is sent out via the MAC.

Changes that domains make to the data structures for contexts and DMA descriptors in the system memory take effect only after they have been validated by the hypervisor, to prevent erroneous or malicious input. This is abstracted via calls to the hypervisor in the driver of the domain. The hypervisor notifies the ES-VNIC, which invalidates cached information and fetches the new input from system memory.

4. PRELIMINARY PERFORMANCE ESTIMATION

Based on the presented ES-VNIC architecture concept, we give a preliminary performance estimation. The focus is on the incoming packet processing sequence as introduced and described for ES-VNIC in section 3.

The processing sequence for a network card performing I/O virtualization via CPU firmware, like RiceNIC, is depicted in Fig. 4. Incoming packets are transferred from the MAC via DMA to the NIC internal memory. Afterwards the NIC CPU is notified. The SW then processes the packet, including header parsing, scheduling and queuing plus managing and configuring the other HW blocks. During processing the SW has to access the NIC internal memory for packet data and instruction code. The number of accesses depends on the size and associativity of the NIC CPU's cache. After being queued, the packet is transferred via DMA to the system memory.

[Figure 4: Processing packet with a CPU-centric NIC (Rx path). Message sequence chart with the columns MAC, DMA, NIC Internal / Instruction Memory, NIC-CPU, DMA and System Memory.]

A simple qualitative comparison of the sequences reveals the following points:

• The firmware on the single CPU performing the tasks for I/O virtualization constitutes a sequential chain of tasks which, due to the processing latency, may become a bottleneck. Adding further CPUs is not a favorable solution as it would contradict the goal of a resource-limited implementation.

• On a CPU with data cache (re-)loading and instruction fetching, it is not optimal to perform tasks like header parsing, queuing or managing DMA descriptors due to the lack of temporal locality (for example, header parsing is performed only once per packet). These tasks can be performed in fewer clock cycles with finite state machines.

• A pipelined architecture whose stages are FSMs achieves the same throughput at a lower frequency than performing the respective tasks in sequential SW on a CPU.

These points lead to the working hypothesis that the ES-VNIC architecture needs a low and deterministic processing time. A prerequisite is that the FSMs are flexible enough to service a mix of hard real-time, soft real-time and best-effort domains.

Formally, the processing time of ES-VNIC can be expressed as follows:

$$T_{\mathrm{DelayRx}} = \max(T_{\mathrm{NIC\,Buffer}},\, T_{\mathrm{Header\text{-}Parsing}}) + T_{\mathrm{Management}} + \max(T_{\mathrm{Scheduling}},\, T_{\mathrm{Queue\text{-}Alloc}}) + T_{\mathrm{DMA}} \qquad (1)$$

$T_{\mathrm{NIC\,Buffer}}$ is the time needed to transfer the incoming packet to the NIC buffer, $T_{\mathrm{Header\text{-}Parsing}}$ the time to parse the respective header. Both actions are performed in parallel and at line speed. Evidently, $T_{\mathrm{NIC\,Buffer}}$ is dominant here and depends on the packet size.

$T_{\mathrm{Management}}$ subsumes setting the configuration for the following FSMs according to the context of this packet. This includes the conditional fetch of this context from the system memory first. If the context is cached, this operation should need only a few clock cycles. The time for fetching a context is dominated by the performance of system bus and memory. Contexts for (hard) real-time interfaces need to be pinned to the cache. On the one hand, this constraint results in an easily calculable upper bound for $T_{\mathrm{Management}}$; on the other hand, it reduces the slots for contexts of best-effort or low-priority packets.

The queue-alloc unit and the scheduler unit are both triggered by the management unit and run concurrently. The queue-alloc unit needs $T_{\mathrm{Queue\text{-}Alloc}}$ to allocate the needed DMA descriptor and the scheduler unit requires $T_{\mathrm{Scheduling}}$ to schedule the next packet to be transferred via DMA. As DMA descriptors need to be fetched from system memory if they are not already on the NIC, the queue-alloc unit then needs more time to finish. For (hard) real-time packets the DMA queues should therefore already be pre-allocated and the descriptors pre-fetched to guarantee an upper bound.

Finally, $T_{\mathrm{DMA}}$ is the time needed to transmit a packet from the NIC buffer to the system memory; it depends on the packet size and on the performance of system bus and memory.
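The structure of equation (1) can be made concrete with a short calculation. In the Python sketch below, the function `rx_delay` is a direct transcription of equation (1); the numbers used in the usage example are hypothetical placeholders, except for the 512 ns and 112 ns that follow from receiving 64 bytes and a 14-byte header at 1 Gbit/s.

```python
def rx_delay(t_nic_buffer, t_header_parsing, t_management,
             t_scheduling, t_queue_alloc, t_dma):
    """Rx latency of one packet according to equation (1); all times in ns."""
    return (max(t_nic_buffer, t_header_parsing)   # parallel, at line speed
            + t_management                        # includes a context fetch on a miss
            + max(t_scheduling, t_queue_alloc)    # both triggered concurrently
            + t_dma)

# Hypothetical stage times for a 64-byte packet with a cached context ...
cached = rx_delay(512, 112, 40, 24, 16, 300)
# ... and with context and DMA descriptors fetched from system memory first.
uncached = rx_delay(512, 112, 400, 24, 200, 300)
```

The comparison shows why pinning contexts and pre-fetching descriptors matters for real-time interfaces: only the management and queue-alloc terms change, but they are the ones that depend on system bus and memory.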

The following term describes the delta time of ES-VNIC i.e., 

the time which can be spent in each stage of the pipelined 

architecture for processing a packet:

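$$T_{\mathrm{DeltaRx}} = \max\big(\max(T_{\mathrm{NIC\,Buffer}},\, T_{\mathrm{Header\text{-}Parsing}}),\; T_{\mathrm{Management}},\; \max(T_{\mathrm{Scheduling}},\, T_{\mathrm{Queue\text{-}Alloc}}),\; T_{\mathrm{DMA}}\big) \qquad (2)$$

[Figure 5: Key component: Queue-Allocation. Rx rings (A, B; m in total) and Tx rings (C, D; n in total) reside in system memory; the NIC holds o assignable queues, one of them blocked for A, with connections from the P/C lists and to scheduling.]

[Figure 6: Key component: Management (with Contexts). Contexts A to Z (m+n in total) reside in system memory; the NIC holds a local cache with v slots and w multithreaded FSMs, with connections from header parsing and to the P/C lists, queue-alloc and scheduling.]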

If this time does not exceed the interval between consecutive incoming packets, ES-VNIC can cope with the speed of this traffic so that no packet drops occur. This is crucial to support network interfaces for hard real-time and critical domains. This time is strongly dominated by the system bus and memory, i.e., the performance of the ES-VNIC is systematically linked to the performance of the (embedded) system itself.
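The pipeline condition can be checked numerically. The Python sketch below transcribes equation (2) and compares the slowest stage with a given packet interval; the function names and all stage times in the usage example are hypothetical and only meant to show how the bound is applied.

```python
def rx_delta(t_nic_buffer, t_header_parsing, t_management,
             t_scheduling, t_queue_alloc, t_dma):
    """Slowest pipeline stage according to equation (2); all times in ns."""
    return max(max(t_nic_buffer, t_header_parsing),
               t_management,
               max(t_scheduling, t_queue_alloc),
               t_dma)

def sustains_rate(stage_times, packet_interval_ns):
    """True if a new packet can enter the pipeline every packet_interval_ns."""
    return rx_delta(*stage_times) <= packet_interval_ns

# Hypothetical example: a slow context fetch makes management the bottleneck.
fast = (512, 112, 40, 24, 16, 300)
slow = (512, 112, 900, 24, 16, 300)
```

In contrast to the latency of equation (1), only the single slowest stage decides whether the packet rate can be sustained.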

As a worst-case scenario for $T_{\mathrm{DeltaRx}}$, the requirement to handle a constant flow of packets with minimum frame size and minimum inter-packet interval for a 1 Gbit/s MAC can be used. A packet size of 64 bytes and 20 bytes of overhead for preamble, start-of-frame delimiter and interframe gap result in:

$$\frac{(64 + 20) \cdot 8\,\mathrm{bit}}{1\,\mathrm{Gbit/s}} = 672\,\mathrm{ns} \qquad (3)$$

This means that every 672 nanoseconds a new packet arrives 

and has to be processed. With a clock of 125 MHz for Gigabit 

Ethernet every pipeline stage would have only 84 cycles 

to complete its task. 
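The worst-case budget can be reproduced with a few lines of arithmetic. The Python sketch below recomputes the packet interval and the resulting cycle budget per pipeline stage; the function names and default parameters are chosen for illustration, and the same functions can be used to explore other frame sizes or clock frequencies.

```python
def packet_interval_ns(frame_bytes=64, overhead_bytes=20,
                       line_rate_bps=1_000_000_000):
    """Time between back-to-back frames on the wire, in nanoseconds."""
    bits = (frame_bytes + overhead_bytes) * 8
    return bits * 1_000_000_000 / line_rate_bps

def cycles_per_stage(interval_ns, clock_hz=125_000_000):
    """Clock cycles available to each pipeline stage within one interval."""
    return int(interval_ns * clock_hz // 1_000_000_000)

worst_case = packet_interval_ns()           # minimum-sized frames at 1 Gbit/s
budget = cycles_per_stage(worst_case)       # cycles per stage at 125 MHz
```

For maximum-sized frames of 1518 bytes the same calculation yields a much more relaxed budget, which is why minimum-sized frames are the relevant worst case.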

5. EXPLORATION OF KEY ARCHITECTURE COMPONENTS

We have started to model the key components of the proposed ES-VNIC architecture in SystemC [14] for simulation. As described in section 3, the architecture should only utilize flexible HW resources. The focus is therefore on the related FSMs, structures and data elements in queue-allocation (see Fig. 5) and management (see Fig. 6). The modeling concentrates on their design, on exploring the size of local buffers and the underlying data paths of the components, and on efficient loading of contexts.

5.1 Queue-Allocation 

The Rx and Tx rings that contain the DMA descriptors are 

stored in the system memory – in this example Rx interfaces 

A, B and Tx interfaces C, D. Their content is defined by the 

network drivers. 

On the NIC, a limited set of assignable queues is available. 

For interfaces with real-time constraints such a queue is 

blocked and filled with the maximum number of available 

descriptors. Otherwise, if triggered by a context for either 

sending or receiving a packet, a queue-allocation is done 

i.e., if no queue already contains the respective descriptor(s) 

for this context a queue is reserved and the descriptors are 

fetched from the system memory. This fetching is done by a 

dedicated HW engine. In Fig. 5 one queue is blocked for A 

(depicted by an inscribed A in this queue), the others have 

to share the second queue. This may result in flushing of 

descriptors for an inactive context or a context with lower 

priority. A further fetch is issued if a threshold for the P/C 

list is reached. That threshold is defined by the context. 
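The allocation behavior described above can be sketched as a small model. The Python code below is an illustration under simplifying assumptions: a queue is flushed purely by lowest priority (the paper also mentions inactive contexts), refilling at a P/C list threshold is omitted, and `QueueAllocator`, `queue_depth` and the priority convention (lower number means higher priority) are hypothetical.

```python
class QueueAllocator:
    """Assign a limited set of on-NIC descriptor queues to interfaces."""

    def __init__(self, num_queues, rings, queue_depth=4):
        self.num_queues = num_queues      # o assignable queues on the NIC
        self.rings = rings                # interface -> descriptors in system memory
        self.queue_depth = queue_depth
        self.queues = {}                  # interface -> (priority, descriptors on the NIC)
        self.reserved = set()             # interfaces whose queue is never flushed

    def reserve(self, iface, priority=0):
        """Block a queue for a real-time interface and fill it with descriptors."""
        self._fill(iface, priority)
        self.reserved.add(iface)

    def allocate(self, iface, priority):
        """Return the queue for iface, flushing another queue if none is free."""
        if iface not in self.queues:
            self._fill(iface, priority)
        return self.queues[iface][1]

    def _fill(self, iface, priority):
        if len(self.queues) >= self.num_queues:
            candidates = [i for i in self.queues if i not in self.reserved]
            if not candidates:
                raise RuntimeError("all queues are reserved")
            # Flush the queue of the lowest-priority interface (largest number).
            victim = max(candidates, key=lambda i: self.queues[i][0])
            del self.queues[victim]
        # Fetch descriptors from the ring in system memory.
        descriptors = list(self.rings[iface][:self.queue_depth])
        self.queues[iface] = (priority, descriptors)
```

With two queues and interface A reserved, all other interfaces compete for the single remaining queue, which mirrors the situation shown for A in Fig. 5.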

There can be more or fewer network interfaces for receiving packets than for sending, since Rx and Tx rings do not have to be paired. With this feature it is possible to have a Tx interface for broadcasting status information and no corresponding Rx interface (if no acknowledgments are needed); this is a quite common scenario for embedded systems. Furthermore, to prevent head-of-line blocking for one domain, several Rx interfaces for receiving packets with different service levels can be established.

In general, the number of assignable queues (o) is limited 

and smaller than the number of Rx rings (m) and Tx rings 

(n) in the system memory i.e., m + n > o. 

5.2 Management (with Contexts) 

Management is assembled from the contexts in system memory, a cache for them on the NIC, multithreaded FSMs and connections to the other units. In our example the interfaces A to Z exist and their contexts are kept in system memory (m for Rx interfaces plus n for Tx interfaces).

If a packet is to be sent or received and the respective context is not present in the ES-VNIC, the context is fetched from the system memory and stored in a cache slot (v). The data of the context is loaded into one of the multithreaded FSMs (w) by a dedicated HW engine. Using fixed entry points, the packet processing management is then started.

Loading the context results in two things:

• First, the FSM is (re-)configured, i.e., the respective state diagram is modified. By default the state diagram is preset to the most common case for an interface. The context can then add or remove states and transitions, adapting the ES-VNIC for processing packets of this specific interface. For example, FSMs for interfaces being polled can be stripped of the states and transitions for signaling incoming messages. Another option is to add (security) steps for a critical packet and its interface that prevent the packet from being deleted from the NIC buffer until it has been copied to the system memory and validated there.

• Second, data from the context is used as input for registers that define and trigger the other FSMs (queue-alloc, scheduling, P/C lists). For multithreading there are multiple sets of the input and output registers for an FSM. By mapping a thread to a packet, the ES-VNIC can switch quickly between the processing of several packets (similar to processing in a multithreaded CPU).
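The first effect, (re-)configuring the state diagram from a context, can be sketched with a table-driven FSM. The Python model below is an illustration only: the state names, the default diagram and the context fields `remove_states` and `add_transitions` are hypothetical, and the multiple register sets for multithreading are not modeled.

```python
# Default state diagram for the most common interface (hypothetical states).
DEFAULT_TRANSITIONS = {
    "select_context": "alloc_queue",
    "alloc_queue": "dma_transfer",
    "dma_transfer": "update_pc_list",
    "update_pc_list": "signal_domain",
    "signal_domain": "done",
}

def configure_fsm(context):
    """Return the transition table after applying the context's modifications."""
    transitions = dict(DEFAULT_TRANSITIONS)
    for state in context.get("remove_states", []):
        successor = transitions.pop(state)
        # Re-route every transition that pointed to the removed state.
        for src in list(transitions):
            if transitions[src] == state:
                transitions[src] = successor
    transitions.update(context.get("add_transitions", {}))
    return transitions

def run_fsm(transitions, start="select_context"):
    """Walk the state diagram from start to 'done' and return the visited states."""
    visited, state = [], start
    while state != "done":
        visited.append(state)
        state = transitions[state]
    return visited

# A polled interface drops the signaling state; a critical interface adds a
# validation step between the DMA transfer and the P/C list update.
polled = configure_fsm({"remove_states": ["signal_domain"]})
critical = configure_fsm({"add_transitions": {"dma_transfer": "validate_copy",
                                              "validate_copy": "update_pc_list"}})
```

Keeping the default diagram and storing only the differences in the context would keep contexts small, which fits the goal of limiting on-NIC memory.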

Similar to queues in queue-alloc, contexts can be pinned to 

cache slots and FSMs. In our example here this would be 

for A representing a hard real-time interface. The other 

interfaces have to share the other available resources. 

6. FUTURE WORK AND SUMMARY 

Future work comprises the simulation of the key components to validate the proposed architecture and the preliminary performance estimations. Here, set-ups which require displacement of contexts, DMA descriptors and P/C lists on the ES-VNIC during run-time are of particular interest. This will involve dimensioning of cache size, packet buffers, queues and the number of multithreaded FSMs as well as functional verification of those FSMs. Afterwards, the network card architecture should be physically implemented as part of an MPSoC demonstrator in an FPGA to prove its applicability to real-world scenarios.

In this work-in-progress paper we introduced a new virtualizing 

NIC architecture concept particularly addressing the 

requirements of I/O virtualization in embedded systems. 

We showed that current concepts that address HPC do not 

match with those requirements. Thus, the needs for this application 

area have been discussed and a favorable design has 

been deduced. A preliminary performance estimation and a 

short presentation of key elements have also been given.

With this paper, it is our objective to raise awareness of research on I/O virtualization in embedded system network cards and of the new challenges there.

7. REFERENCES

[1] P. Barham, B. Dragovic, K. Fraser, S. Hand, T. Harris, A. Ho, R. Neugebauer, I. Pratt, and A. Warfield. Xen and the art of virtualization. In Proceedings of the nineteenth ACM symposium on Operating Systems Principles (SOSP19), ACM Press, 2003.

[2] A. Kivity, Y. Kamay, and D. Laor. kvm: the Linux Virtual Machine Monitor. In Linux Symposium, 2007.

[3] M. Mahalingam and R. Brunner. I/O Virtualization (IOV) For Dummies. In VMWorld, 2007.

[4] L. van Doorn. Hardware virtualization trends. In Proceedings of the 2nd international conference on Virtual execution environments, June 2006.

[5] A. Menon, A. L. Cox, and W. Zwaenepoel. Optimizing network virtualization in Xen. In Proceedings of the USENIX Annual Technical Conference, June 2006.

[6] G. Heiser. The role of virtualization in embedded systems. In Proceedings of the 1st workshop on Isolation and integration in embedded systems, April 2008.

[7] H. Inoue, A. Ikeno, M. Kondo, J. Sakai, and M. Edahiro. VIRTUS: A new processor virtualization architecture for security-oriented next-generation mobile terminals. In Proceedings of the 43rd annual conference on Design automation, 2006.

[8] H. Raj and K. Schwan. Implementing a scalable self-virtualizing network interface on a multicore platform. In Workshop on the Interaction between Operating Systems and Computer Architecture, October 2005.

[9] K. K. Ram, J. R. Santos, Y. Turner, A. L. Cox, and S. Rixner. Achieving 10 Gb/s using safe and transparent network interface virtualization. In Proceedings of the 2009 ACM SIGPLAN/SIGOPS international Conference on Virtual Execution Environments.

[10] P. Willmann, J. Shafer, D. Carr, A. Menon, S. Rixner, A. L. Cox, and W. Zwaenepoel. Concurrent direct network access for virtual machine monitors. In Proceedings of the International Symposium on High-Performance Computer Architecture, February 2007.

[11] J. Wang. Survey of State-of-the-art in Inter-VM Communication Mechanisms. Research Proficiency Report, September 2009.

[12] J. R. Santos, Y. Turner, and J. Mudigonda. Taming Heterogeneous NIC Capabilities for I/O Virtualization. In Proceedings of Workshop on I/O Virtualization, 2008.

[13] S. Chinni and R. Hiremane. Virtual Machine Device Queues. Whitepaper, Intel, 2007.

[14] T. Grötker, S. Liao, G. Martin, and S. Swan. System Design with SystemC. Kluwer Academic Publishers, 2002.
