Flow Classification Optimizations in DPDK
Day01-Session07-SamehGobriel-DPDKUSASummit2016
Sameh Gobriel & Charlie Tai - Intel
DPDK US Summit - San Jose - 2016
Agenda
• Flow Classification in DPDK
• Cuckoo Hashing for Optimized Flow Table Design
• Using Intel Transactional Synchronization Extensions (TSX) for scaling insert performance
• Using Intel AVX instructions for scaling lookup performance
• Research Proof of Concept: 2-level lookup for OVS Megaflow Cache
Flow Classification on Network Appliances vs General Purpose Server H/W

Monolithic purpose-built boxes (TEM/OEM, proprietary OS, ASIC, DSP, FPGA, ASSP):
• Network appliances use purpose-built H/W & ASICs (e.g., TCAM) for flow classification.
• Cost & power consumption are limiting factors in supporting a large number of flows.

Networking VMs on standard servers (NFV: flow classification implemented on general purpose processors, running over a hypervisor such as ESXi or KVM):
• General purpose processors with a cache/memory hierarchy can support much larger flow tables.
• Multicore architectures provide scalable, competitive flow classification performance.
Metrics for Good Flow Table Design

Fields of the packet header are used to form a flow key. A hash function H(..) over the flow key creates a flow table index; the keys retrieved from that table entry are matched against the input key, and the matching entry's action is applied.

1. Higher lookup rate = better throughput & latency
2. Higher insert rate = better flow update & table initialization
3. Efficient table utilization = more flows
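The key ➝ hash ➝ index ➝ compare flow above can be sketched in plain C. This is a toy single-slot illustration, not DPDK code: the 5-tuple layout, the FNV-1a hash, and the table size are all illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 5-tuple flow key; the exact fields vary by application.
 * Packed so the key can be hashed and compared as raw bytes. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
} __attribute__((packed));

#define TABLE_SIZE 1024          /* power of two, so hash & (SIZE-1) works */

struct flow_entry {
    struct flow_key key;
    int action;                  /* toy stand-in for "Action N" */
    int in_use;
};

/* Toy FNV-1a over the key bytes, standing in for the real hash function
 * (e.g. a CRC-based hash). */
static uint32_t flow_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Insert: hash the flow key to a table index and store key + action.
 * Collisions are simply rejected here; the later slides show how
 * cuckoo hashing resolves them properly. */
static int flow_insert(struct flow_entry *t, const struct flow_key *k, int action)
{
    uint32_t idx = flow_hash(k) & (TABLE_SIZE - 1);
    if (t[idx].in_use)
        return -1;
    t[idx].key = *k;
    t[idx].action = action;
    t[idx].in_use = 1;
    return 0;
}

/* Lookup: hash to an index, then confirm with a full key compare,
 * since different keys can hash to the same slot. Returns the action
 * on a match, -1 on a miss. */
static int flow_lookup(const struct flow_entry *t, const struct flow_key *k)
{
    uint32_t idx = flow_hash(k) & (TABLE_SIZE - 1);
    if (t[idx].in_use && memcmp(&t[idx].key, k, sizeof(*k)) == 0)
        return t[idx].action;
    return -1;
}
```

The full key compare after the hash probe is what makes this an exact-match lookup: the hash only narrows the search, it never decides a match by itself.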
DPDK Framework

[Architecture diagram: Network Functions (Cloud, Enterprise, Comms) on top of the DPDK libraries: Core (EAL, MBUF, MEMPOOL, RING, TIMER), Classify (HASH, LPM, ACL), Extensions (KNI, IVSHMEM, JOBSTAT, POWER, METER, IP FRAG, REORDER, DISTRIB, SCHED, VHOST), Packet Framework (PORT, TABLE, PIPELINE), ETHDEV with native & virtual PMDs (IGB, IXGBE, I40E, FM10K, E1000, BNX2X, CXGBE, ENIC, MLX4, MLX5, MPIPE, NFP, SZEDATA2, ENA, VMXNET3, XENVIRT, VIRTIO, VHOST, BONDING, PCAP, AF_PKT, RING, NULL), and CRYPTO accelerators (QAT, AESNI MB, AESNI GCM, SNOW 3G, NULL); KNI, IGB_UIO, VFIO and UIO_PCI_GENERIC sit in the kernel below user space.]
RTE-Hash Exact Match Library

Traditional Exact Match table library:
• Relies on a "sparse" hash table implementation.
• Simple exact match implementation.
• Significant performance degradation with increased table sizes.

Cuckoo Hashing (available since DPDK v2.2) gives better scalability:
• Denser tables fit in cache.
• Can scale to millions of entries.
Cuckoo Hashing: High Level Overview

Each key x has two candidate buckets, H1(x) and H2(x). On insert, x is placed in whichever candidate bucket has a free slot. If both are full, a resident key is evicted to its own alternate bucket (it gets "cuckooed"), possibly displacing further keys in turn, until every key has a slot.
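The displacement scheme above can be sketched as a minimal C insert routine. This is a toy with one slot per bucket and two made-up hash functions; the real rte_hash uses multi-slot buckets and derives the second signature from the first.

```c
#include <stdint.h>

#define NBUCKETS  8    /* toy size; real buckets also hold multiple slots */
#define MAX_KICKS 16   /* give up after this many displacements */
#define EMPTY     0    /* key value 0 is reserved to mean "slot free" */

/* Two illustrative hash functions giving each key its two candidates. */
static unsigned h1(uint32_t k) { return (k * 2654435761u) % NBUCKETS; }
static unsigned h2(uint32_t k) { return ((k ^ (k >> 16)) * 40503u) % NBUCKETS; }

/* Insert with displacement: try a candidate bucket; if it is occupied,
 * evict the resident ("cuckoo" it), take its slot, and re-insert the
 * evicted key into ITS alternate bucket, repeating until a free slot is
 * found or the path gets too long. */
static int cuckoo_insert(uint32_t table[NBUCKETS], uint32_t key)
{
    unsigned b = h1(key);
    for (int kick = 0; kick < MAX_KICKS; kick++) {
        if (table[b] == EMPTY) {
            table[b] = key;
            return 0;
        }
        uint32_t evicted = table[b];
        table[b] = key;
        key = evicted;
        b = (h1(key) == b) ? h2(key) : h1(key);  /* its other bucket */
    }
    return -1;  /* path too long: a real table would rehash or grow */
}

/* Every key lives in one of its two candidate buckets, so lookup only
 * ever probes two locations, regardless of table occupancy. */
static int cuckoo_lookup(const uint32_t table[NBUCKETS], uint32_t key)
{
    return table[h1(key)] == key || table[h2(key)] == key;
}
```

The bounded two-probe lookup is the point: occupancy can climb far higher than in a sparse table without lookups degrading.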
Cuckoo Hashing Performance Benefits

• Cuckoo Hashing allows more flows to be inserted into the flow table.
• RTE-hash can be used to support flow tables with millions of keys (e.g. 64M 5-tuple keys) that fit in the CPU cache.

[Chart: Table Load at First Key Insertion Failure (0% to 100%), Traditional Exact Match vs. Cuckoo Hashing. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]
Code Snippet for RTE-hash API

struct rte_hash *rte_hash_create(const struct rte_hash_parameters *params);
int rte_hash_add_key_data(const struct rte_hash *h, const void *key, void *data);
int rte_hash_lookup_data(const struct rte_hash *h, const void *key, void **data);
int rte_hash_lookup_bulk_data(const struct rte_hash *h, const void **keys,
                              uint32_t num_keys, uint64_t *hit_mask, void *data[]);

Reference: http://dpdk.org/doc/api/rte__hash_8h.html
Long Cuckoo Paths & Multiple Concurrent Writers

Inserting key y can trigger a chain of displacements, e.g. the cuckoo path a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes).

• One insert may move a lot of items, especially at high table occupancy.
• A collision happens when multiple writers have intersecting cuckoo paths.
Flow-Table Insert Performance Optimizations

Make use of IA hardware features: replace traditional locks with TSX hardware concurrency.
• Traditional locks: limited concurrency; threads are serialized in the critical section.
• TSX: the hardware monitors cache lines; when a data conflict is detected, execution is rolled back and retried.
Flow-Table Insert Performance Optimizations (continued)

1. Depth First Search ➝ Breadth First Search: BFS explores the candidate buckets level by level, so it finds a shorter cuckoo path and fewer keys need to move.
2. Split the cuckoo path search from the key movement, to minimize the critical section:

   Before: TSX Lock ➝ Cuckoo Path Search ➝ Move Keys ➝ TSX Unlock
   After: Cuckoo Path Search ➝ TSX Lock ➝ Move Keys ➝ TSX Unlock
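The search/move split can be illustrated schematically. The stub lock functions below stand in for the transactional region (in DPDK an RTM transaction via rte_xbegin()/rte_xend() with a lock fallback), and the "path search" is reduced to a free-slot scan; none of this is the rte_hash implementation.

```c
#include <stdint.h>

/* Toy flow table; slot value 0 means "free". */
#define NSLOTS 16
static uint32_t g_table[NSLOTS];

/* Stand-ins for the transactional region. */
static void tsx_lock(void)   { /* begin transaction */ }
static void tsx_unlock(void) { /* commit transaction */ }

/* Phase 1, outside the transaction: the potentially long search. In the
 * real code this is the BFS cuckoo-path search; here it is just a scan
 * for a free slot. */
static int find_slot(void)
{
    for (int i = 0; i < NSLOTS; i++)
        if (g_table[i] == 0)
            return i;
    return -1;
}

/* Phase 2, inside the transaction: re-validate the result of the search,
 * then perform only the writes. Keeping the critical section this short
 * is what lets concurrent TSX writers commit without conflicting. */
static int insert_split(uint32_t key)
{
    int slot = find_slot();          /* long search, no lock held */
    if (slot < 0)
        return -1;
    tsx_lock();
    if (g_table[slot] != 0)          /* another writer raced us: re-search */
        slot = find_slot();
    if (slot >= 0)
        g_table[slot] = key;
    tsx_unlock();
    return slot;
}
```

The re-validation step inside the lock is essential: the slot found in phase 1 may have been taken by another writer between the two phases.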
Summary of Insert Optimizations

[Chart: Insert Rate (million inserts/sec) vs. Number of Cores (1 to 22), Insert Optimized vs. Original, showing an 11X improvement. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

Insert performance is linearly scalable with the number of cores.
Code Snippet for RTE-hash with TSX (DPDK v16.07)

#define RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD 0x02
/* Default behavior of insertion, single writer/multi writer */
struct rte_hash_parameters {
    ...
    uint8_t extra_flag;
};

rte_hash_parameters.extra_flag |=
    (RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT
     | RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD);

Setting both flags enables the TSX-backed multi-writer insert path.

Reference: http://dpdk.org/doc/api/rte__hash_8h.html
Flow-Table Lookup Performance Optimizations

1. Use AVX instructions (make use of IA hardware features): instead of computing H(x) and probing the table one packet at a time, the hash computation and lookup are vectorized across a batch of packets (H_AVX(x) and Lookup_AVX over Pkt 1..4).
2. Minimize implementation overhead:
   • Prefetching of keys into cache
   • Inline functions
   • Lookup pipelining
Summary of Lookup Optimizations

[Chart: Throughput (million lookups/core/sec) vs. number of flows (1M, 2M, 4M, 8M, 16M, 32M), comparing the default rte_hash cuckoo lookup against the lookup-optimized version. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

~35% improved lookup throughput.
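The prefetching and pipelining idea can be sketched as a two-pass bulk lookup: pass 1 hashes every key in the batch and issues a prefetch for its slot; pass 2 does the compares, by which time the cache lines should have arrived. This mirrors the idea behind rte_hash_lookup_bulk_data, not its actual implementation; the hash, the single-slot table layout, and the batch limit are all illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy single-slot table; a slot with key 0 is treated as empty. */
struct entry { uint32_t key; uint32_t val; };

static uint32_t toy_hash(uint32_t k, uint32_t mask)
{
    return (k * 2654435761u) & mask;   /* stand-in for the real hash */
}

/* Two-pass bulk lookup over up to 64 keys. Returns the number of hits;
 * vals[i] is filled in only on a hit. */
static unsigned lookup_bulk(const struct entry *table, uint32_t mask,
                            const uint32_t *keys, size_t n, uint32_t *vals)
{
    uint32_t idx[64];
    unsigned hits = 0;

    /* Pass 1: compute all indices, start all memory fetches early. */
    for (size_t i = 0; i < n; i++) {
        idx[i] = toy_hash(keys[i], mask);
        __builtin_prefetch(&table[idx[i]], 0, 3);  /* read, keep in cache */
    }
    /* Pass 2: by now the prefetched lines are (hopefully) in cache. */
    for (size_t i = 0; i < n; i++) {
        if (table[idx[i]].key == keys[i]) {
            vals[i] = table[idx[i]].val;
            hits++;
        }
    }
    return hits;
}
```

Overlapping many independent memory accesses this way hides DRAM latency behind useful work, which is where most of the bulk-lookup gain comes from.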
Code Snippet for RTE-hash with AVX (Targeting DPDK v16.11)
DPDK Framework (revisited)

The same DPDK architecture diagram as above, highlighting planned Classify extensions: supporting wild card flow classification and variable key size.
POC: Open vSwitch Flow Lookup

The OVS Megaflow cache groups rules into per-mask sub-tables keyed on the packet header (e.g. flow mask 1111 0000 holds rules 1010 xxxx, 0011 xxxx, 1011 xxxx; mask 1110 0000 holds 110x xxxx, 101x xxxx, 111x xxxx, 011x xxxx; and so on through Mask L to Mask N, down to 0xxx xxxx).

1. The sub-tables form a set of disjoint tables.
2. A rule is inserted into only one sub-table, so lookup terminates after the first match.
3. Lookup sequentially searches each sub-table until a match is found.

Instead of L sequential lookups, what if we knew which sub-table to hit?
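The sequential lookup described in the three points above can be sketched as follows, using toy 8-bit "headers" in place of real packet header fields; the structure sizes are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* One megaflow subtable: all rules in it share one mask, so lookup
 * within a subtable is exact match on (header & mask). */
struct subtable {
    uint8_t mask;
    uint8_t rules[4];   /* rule keys, already masked */
    int     nrules;
};

/* Sequentially search each subtable: mask the header with that
 * subtable's mask and probe for an exact match. Because a rule lives
 * in exactly one subtable, the first match is the only match and
 * lookup terminates. Returns the matching subtable's index, or -1. */
static int classify(const struct subtable *tabs, size_t ntabs, uint8_t hdr)
{
    for (size_t t = 0; t < ntabs; t++) {
        uint8_t masked = hdr & tabs[t].mask;
        for (int r = 0; r < tabs[t].nrules; r++)
            if (tabs[t].rules[r] == masked)
                return (int)t;
    }
    return -1;
}
```

The cost of this scheme is visible in the loop bound: a miss, or a match in the last subtable, pays for a probe of every mask, which is exactly the L-lookups problem the next slides attack.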
OVS with Two Layer Lookup

Add a 1st level of indirection in front of the per-mask sub-tables: the packet header is first used to predict which sub-table will match, so that only that sub-table needs to be probed.
Bloom Filter as 1st Level of Indirection

A Bloom filter is maintained per sub-table (Mask 1 BF, Mask 2 BF, ... Mask L BF, ... Mask N BF). The packet header is first tested against the filters, and only a sub-table whose filter reports a (possible) hit is probed.

L lookups ➝ L Bloom filters + 1 lookup
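A minimal per-subtable Bloom filter might look like the sketch below; the 64-bit filter width, the two hash multipliers, and the 32-bit key type are illustrative assumptions, not the PoC's parameters.

```c
#include <stdint.h>

/* Tiny per-subtable Bloom filter: a 64-bit bitmap probed with two hash
 * functions. */
typedef uint64_t bloom_t;

/* Two toy hashes, each returning a bit position in 0..63. */
static unsigned bh1(uint32_t k) { return (k * 2654435761u) >> 26; }
static unsigned bh2(uint32_t k) { return (k * 2246822507u) >> 26; }

/* Record a masked flow key in this subtable's filter. */
static void bloom_add(bloom_t *bf, uint32_t masked_key)
{
    *bf |= (1ull << bh1(masked_key)) | (1ull << bh2(masked_key));
}

/* Returns 1 if the key MIGHT be in the subtable (false positives are
 * possible), 0 if it definitely is not, letting lookup skip the
 * subtable entirely. */
static int bloom_maybe(bloom_t bf, uint32_t masked_key)
{
    uint64_t want = (1ull << bh1(masked_key)) | (1ull << bh2(masked_key));
    return (bf & want) == want;
}
```

Because the filters are tiny and cache-resident, testing L of them is far cheaper than probing L hash tables; a false positive only costs one wasted sub-table probe, never a wrong classification.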
2 Level Lookup Preliminary Performance Results

[Chart: Cycles vs. Number of Subtables Traversed (1 to 20), for four series: OVS - Hit, OVS - Miss, bloom - Hit, bloom - Miss. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]
Legal Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

© 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What's Inside, and the Intel. Experience What's Inside logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
Questions?

Sameh Gobriel
sameh.gobriel@intel.com