11.08.2016 Views

Flow Classification Optimizations in DPDK

Day01-Session07-SamehGobriel-DPDKUSASummit2016


Flow Classification Optimizations in DPDK

Sameh Gobriel & Charlie Tai - Intel

DPDK US Summit - San Jose - 2016


Agenda

• Flow Classification in DPDK
• Cuckoo Hashing for Optimized Flow Table Design
• Using Intel Transactional Synchronization Extensions (TSX) for scaling Insert performance
• Using Intel AVX instructions for scaling lookup performance
• Research Proof of Concept: 2 level lookup for OVS Megaflow Cache


Flow Classification on Network Appliances vs General Purpose Server H/W

[Figure: left, "Monolithic Purpose-built Boxes": TEM/OEM proprietary OS on ASIC, DSP, FPGA, ASSP, holding Flow/Target, Flow/Action and Flow/In-Out tables. Right, "Networking VMs on Standard Servers" (NFV): flow classification implemented on general purpose processors under a hypervisor (e.g. ESXi, KVM, etc.), with cores (C), per-core caches ($ Lx), last-level cache ($ LLC), NIC and memory.]

• Network appliances use purpose-built H/W & ASICs (e.g., TCAM) for flow classification.
• Cost & power consumption are limiting factors in supporting a large number of flows.
• General purpose processors with a cache/memory hierarchy can support much larger flow tables.
• Multicore architectures provide scalable, competitive flow classification performance.


Metrics for Good Flow Table Design

[Figure: fields of the packet header are used to form a flow key; a hash function H(..) is used to create a flow table index; the hash value indexes the flow table of (Key, Action) entries; the retrieved keys (Key x, Key y, Key z) are matched with the input flow key, and the match yields the action.]

1. Higher Lookup Rate = Better throughput & latency
2. Higher Insert Rate = Better Flow update & Table Initialization
3. Efficient Table Utilization = More Flows
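The lookup flow on this slide (form a key from header fields, hash it to a table index, compare the retrieved key with the input key) can be sketched in plain C. This is a minimal illustration, not the rte_hash implementation: the struct layout, TABLE_SIZE, the FNV-1a hash and the function names are all assumptions made for the example, and the insert deliberately has no collision handling.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define TABLE_SIZE 1024u   /* hypothetical size; must be a power of two */

/* 5-tuple flow key; explicit padding keeps memcmp() well defined. */
struct flow_key {
    uint32_t src_ip, dst_ip;
    uint16_t src_port, dst_port;
    uint8_t  proto;
    uint8_t  pad[3];
};

struct flow_entry {
    struct flow_key key;
    int action;            /* e.g. an output port */
    int used;
};

static struct flow_entry flow_table[TABLE_SIZE];

/* FNV-1a over the key bytes; stands in for H(..) on the slide. */
static uint32_t flow_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++) {
        h ^= p[i];
        h *= 16777619u;
    }
    return h;
}

/* Returns 0 on success, -1 if the slot already holds a different key. */
static int flow_insert(const struct flow_key *k, int action)
{
    struct flow_entry *e = &flow_table[flow_hash(k) & (TABLE_SIZE - 1)];
    assert(k != NULL);
    if (e->used && memcmp(&e->key, k, sizeof(*k)) != 0)
        return -1;
    e->key = *k;
    e->action = action;
    e->used = 1;
    return 0;
}

/* The hash value indexes the table; the stored key is then matched
 * against the input key before the action is returned. */
static int flow_lookup(const struct flow_key *k, int *action)
{
    const struct flow_entry *e = &flow_table[flow_hash(k) & (TABLE_SIZE - 1)];
    if (!e->used || memcmp(&e->key, k, sizeof(*k)) != 0)
        return -1;
    *action = e->action;
    return 0;
}
```

With a single slot per index, the first collision makes an insert fail, which is why table utilization is listed as a design metric alongside lookup and insert rate.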


DPDK Framework

[Figure: block diagram of the DPDK framework.
User Space: Network Functions (Cloud, Enterprise, Comms) on top of the DPDK libraries.
Libraries (grouped as Classify, Extensions, QoS, Pkt Framework): LPM, HASH, ACL, DISTRIB, REORDER, IVSHMEM, POWER, JOBSTAT, KNI, VHOST, IP FRAG, METER, SCHED, PORT, TABLE, PIPELINE.
Core: EAL, MBUF, MEMPOOL, RING, TIMER.
PMDs: Native & Virtual (ETHDEV): IGB, IXGBE, E1000, I40E, FM10K, BNX2X, CXGBE, ENIC, MLX4, MLX5, MPIPE, NFP, SZEDATA2, ENA, VMXNET3, XENVIRT, VIRTIO, VHOST, NULL, BONDING, PCAP, RING, AF_PKT.
Accelerators (CRYPTO): QAT, AESNI MB, AESNI GCM, SNOW 3G, NULL.
Future: TBD.
Kernel: KNI, IGB_UIO, VFIO, UIO_PCI_GENERIC.]


RTE-Hash Exact Match Library

[Figure: the Traditional Exact Match Library next to Cuckoo Hashing (available since DPDK v2.2); keys (IP A, IP B, IP D, IP H, IP J, IP P, IP Q, IP W, IP X, IP Y, IP Z) are placed using two hash functions, H1 and H2, with numbered steps 1 to 3.]

Traditional Exact Match Table library:
• Relies on a “sparse” hash table implementation
• Simple exact match implementation
• Significant performance degradation with increased table sizes

Cuckoo Hashing – Better Scalability:
• Denser tables fit in cache.
• Can scale to millions of entries.


Cuckoo Hashing

High Level Overview

[Figure: six panels (1 to 6) of a table indexed by H1(x) and H2(x), showing where keys X, Y and Z sit as they are inserted one after another.]
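The overview can be made concrete with a small sketch of cuckoo insertion: each key has two candidate slots, and when both are occupied the resident key is displaced to its alternative slot. This is an illustration under stated assumptions, not the rte_hash code: one key per slot, key 0 reserved as the empty marker, and toy hash functions h1 and h2 chosen so the moves can be traced by hand. NUM_SLOTS, MAX_KICKS and all function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_SLOTS 8u    /* hypothetical table size */
#define MAX_KICKS 32    /* give up after this many displacements */
#define EMPTY     0u    /* key 0 is reserved to mark an empty slot */

static uint32_t slots[NUM_SLOTS];

/* Toy hash functions, chosen only so that moves are easy to trace. */
static uint32_t h1(uint32_t key) { return key % NUM_SLOTS; }
static uint32_t h2(uint32_t key) { return (key / NUM_SLOTS) % NUM_SLOTS; }

/* A key can only ever live in one of its two candidate slots. */
static int cuckoo_lookup(uint32_t key)
{
    return slots[h1(key)] == key || slots[h2(key)] == key;
}

/* Returns 0 on success. On failure returns -1 and stores in *homeless
 * the key that was left without a slot (the caller must handle it). */
static int cuckoo_insert(uint32_t key, uint32_t *homeless)
{
    assert(key != EMPTY);
    if (cuckoo_lookup(key))
        return 0;
    if (slots[h1(key)] == EMPTY) { slots[h1(key)] = key; return 0; }
    if (slots[h2(key)] == EMPTY) { slots[h2(key)] = key; return 0; }

    /* Both candidates are taken: evict the resident of the first one
     * and push it to its alternative slot, repeating as needed. */
    uint32_t pos = h1(key);
    for (int i = 0; i < MAX_KICKS; i++) {
        uint32_t victim = slots[pos];
        slots[pos] = key;
        if (victim == EMPTY)
            return 0;
        key = victim;
        pos = (pos == h1(key)) ? h2(key) : h1(key);
    }
    *homeless = key;
    return -1;
}
```

For example, inserting 1 and then 9 (both hash to slot 1 under h1, and 9 has no other candidate) moves key 1 to its alternative slot 0 and leaves 9 in slot 1. Both remain findable because a lookup probes both candidate slots.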


Cuckoo Hashing Performance Benefits

• Cuckoo Hashing allows for more flows to be inserted in the flow table.
• RTE-hash can be used to support a flow table with millions of keys (e.g. 64M – 5 tuple keys) that fits in the CPU cache.

[Figure: bar chart "Table Load at First Key Insertion Failure"; y-axis: Table Load (0.00% to 100.00%); bars: Traditional Exact Match, Cuckoo Hashing. Platform: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]


Code Snippet for RTE-hash API

struct rte_hash *rte_hash_create(const struct rte_hash_parameters *params);

int rte_hash_add_key_data(const struct rte_hash *h, const void *key, void *data);

int rte_hash_lookup_data(const struct rte_hash *h, const void *key, void **data);

int rte_hash_lookup_bulk_data(const struct rte_hash *h, const void **keys,
                              uint32_t num_keys, uint64_t *hit_mask, void *data[]);

Reference: http://dpdk.org/doc/api/rte__hash_8h.html


Long Cuckoo Paths & Multiple Concurrent Writers

[Figure: "Insert y" into a table whose buckets are drawn as rows (a * * *, e * * *, s * * *, x * * *, k * * *, f * * *, d * * *, t * * *), ending at a bucket with an empty slot (* ∅); a second writer's path is marked "←collision".]

cuckoo path: a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes)

• One Insert may move a lot of items, especially at high table occupancy.
• A collision happens when multiple writers have intersecting Cuckoo Paths.


Flow-Table Insert Performance Optimizations

Insert Performance Optimizations:
• Make Use of IA Hardware Features
• Minimize Critical Section

Traditional Locks:
• Limited Concurrency
• Threads are serialized in critical section

TSX Hardware Concurrency:
• Hardware monitors cache lines.
• When a data conflict is detected, execution is rolled back.

[Figure: timeline diagrams for the two approaches, with the labels Build, Detect and Roll Back.]


Flow-Table Insert Performance Optimizations

Insert Performance Optimizations:
• Make Use of IA Hardware Features
• Minimize Critical Section

1. Depth First Search → Breadth First Search
2. Split Path Search from Keys Movement

[Figure: left, a long depth-first path through buckets a, e, s, x, k, f, d, t to an empty slot (Ø) compared with a shorter breadth-first path through buckets e, a, c, s. Right, two orderings of the insert steps: (TSX Lock, Cuckoo Path Search, Move Keys, TSX Unlock) and (Cuckoo Path Search, TSX Lock, Move Keys, TSX Unlock), the second keeping the path search outside the critical section.]
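Both ideas on this slide can be sketched together: a breadth-first search finds a short cuckoo path without modifying the table, and only afterwards are keys moved along that path. The code below is a single-threaded illustration under assumptions, not the DPDK implementation: buckets have 4 slots, key 0 marks a free slot, the hash functions are toy ones, and all names and sizes are hypothetical. No TSX or locking is shown; in a concurrent design only the second phase would sit in the critical section, and the path would presumably need to be re-checked there because the search ran unprotected.

```c
#include <assert.h>
#include <stdint.h>

#define NUM_BUCKETS 8u   /* hypothetical sizes, chosen for readability */
#define SLOTS       4u
#define MAX_NODES   64   /* bound on the breadth-first search */
#define EMPTY       0u   /* key 0 is reserved to mark a free slot */

static uint32_t table[NUM_BUCKETS][SLOTS];

/* Toy hash functions, easy to trace by hand. */
static uint32_t h1(uint32_t k) { return k % NUM_BUCKETS; }
static uint32_t h2(uint32_t k) { return (k / NUM_BUCKETS) % NUM_BUCKETS; }

static uint32_t alt_bucket(uint32_t k, uint32_t b)
{
    return b == h1(k) ? h2(k) : h1(k);
}

static int free_slot(uint32_t b)
{
    for (uint32_t s = 0; s < SLOTS; s++)
        if (table[b][s] == EMPTY)
            return (int)s;
    return -1;
}

static int cuckoo_lookup(uint32_t key)
{
    for (uint32_t s = 0; s < SLOTS; s++)
        if (table[h1(key)][s] == key || table[h2(key)][s] == key)
            return 1;
    return 0;
}

/* BFS node: a bucket, plus the parent node and the parent's slot
 * whose key would be moved into this bucket. */
struct node { uint32_t bucket; int parent; uint32_t parent_slot; };

/* Returns 0 on success, -1 if no path is found within MAX_NODES. */
static int cuckoo_insert_bfs(uint32_t key)
{
    struct node q[MAX_NODES];
    int head = 0, tail = 0;

    assert(key != EMPTY);
    q[tail++] = (struct node){ h1(key), -1, 0 };
    q[tail++] = (struct node){ h2(key), -1, 0 };

    /* Phase 1: read-only path search, nearest buckets first. */
    for (; head < tail; head++) {
        uint32_t b = q[head].bucket;
        int s = free_slot(b);
        if (s >= 0) {
            /* Phase 2: move keys from the free slot back to the root. */
            uint32_t dst_b = b, dst_s = (uint32_t)s;
            for (int n = head; q[n].parent >= 0; n = q[n].parent) {
                uint32_t src_b = q[q[n].parent].bucket;
                uint32_t src_s = q[n].parent_slot;
                table[dst_b][dst_s] = table[src_b][src_s];
                dst_b = src_b;
                dst_s = src_s;
            }
            table[dst_b][dst_s] = key;
            return 0;
        }
        /* Bucket is full: each resident's alternative bucket is a child. */
        for (uint32_t i = 0; i < SLOTS && tail < MAX_NODES; i++)
            q[tail++] = (struct node){ alt_bucket(table[b][i], b), head, i };
    }
    return -1;
}
```

Because the search visits buckets in order of distance from the new key's candidates, the path it returns is as short as possible, so the key-moving phase touches few entries.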


Summary of Insert Optimizations

Results

[Figure: chart of Insert Rate (Million ins/sec, 0 to 100.00) vs. Number of Cores (1 to 22); series: Insert Optimized, Original; annotation: 11 X. Platform: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

Insert Performance Linearly Scalable with Number of Cores


Code Snippet for RTE-hash with TSX (DPDK v16.07)

/* Default behavior of insertion, single writer/multi writer */
#define RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD 0x02

struct rte_hash_parameters {
    ...
    uint8_t extra_flag;
};

/* To enjoy TSX-enabled multi-writer insertion, set both flags on the
 * parameters passed to rte_hash_create(). "params" is an illustrative
 * name for a struct rte_hash_parameters instance. */
params.extra_flag |= (RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT
                      | RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD);

Reference: http://dpdk.org/doc/api/rte__hash_8h.html


Flow-Table Lookup Performance Optimizations

Lookup Performance Optimizations:
• Make Use of IA Hardware Features
• Minimize Implementation Overhead

1. Use AVX Instructions
2. Minimize Overhead:
   1. Prefetching of Keys in Cache
   2. Inline Functions
   3. Lookup Pipelining

[Figure: a single packet (Pkt) hashed with H(x) and then looked up, compared with four packets (Pkt 1, Pkt 2, Pkt 3, Pkt 4) hashed together with H AVX (x) and looked up with Lookup AVX.]


Summary of Lookup Optimizations

Results

[Figure: chart "Throughput vs. # of Flows"; y-axis: Throughput (Millions Lookups/Core/Sec, 0.00 to 18.00); x-axis: Number of Flows (1M, 2M, 4M, 8M, 16M, 32M); legend labels: Default, Lookup Optimized, rte_hash, Cuckoo; annotation: 35%. Platform: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

~35% Improved Lookup Throughput


Code Snippet for RTE-hash with AVX (Targeting DPDK v16.11)


DPDK Framework

[Figure: the same DPDK framework block diagram as on the earlier "DPDK Framework" slide (user-space libraries, Core, PMDs: Native & Virtual, Accelerators, kernel modules), with an added callout next to the classification libraries: "Supporting Wild Card Flow Classification and Variable Key Size".]


POC: Open vSwitch Flow Lookup

[Figure: the packet header is matched against a list of sub-tables, one per flow mask (1111 0000, 1110 0000, ..., Mask L, ..., Mask N). Rules shown under mask 1111 0000: 1010 xxxx, 0011 xxxx, 1011 xxxx. Rules shown under mask 1110 0000: 110x xxxx, 101x xxxx, 111x xxxx, 011x xxxx. Further sub-tables hold rules such as 01xx xxxx, 10xx xxxx and 1xxx xxxx, 0xxx xxxx. One rule is marked "Match".]

1. Set of disjoint sub-tables
2. A rule is only inserted into one sub-table (lookup terminates after the first match)
3. Lookup is done by sequentially searching each sub-table until a match is found

Instead of L sequential lookups, what if we know which sub-table to hit?
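The sequential search described in the three points above can be sketched as follows. This is a simplified model, not Open vSwitch code: headers are 8 bits wide as in the slide's example, each sub-table is a small array scanned linearly (a real sub-table would be a hash table keyed on the masked header), and the struct and function names are hypothetical.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RULES 8

/* One sub-table per flow mask; rule values are stored already masked. */
struct subtable {
    uint8_t mask;                /* e.g. 0xF0 for mask 1111 0000 */
    uint8_t rules[MAX_RULES];
    int     actions[MAX_RULES];
    int     num_rules;
};

/* Probes the sub-tables in order and stops at the first match.
 * Returns the action, or -1 on a miss; *searched reports how many
 * sub-tables had to be probed. */
static int classify(const struct subtable *tables, int num_tables,
                    uint8_t header, int *searched)
{
    assert(tables != 0 && searched != 0);
    *searched = 0;
    for (int t = 0; t < num_tables; t++) {
        uint8_t masked = header & tables[t].mask;
        (*searched)++;
        for (int r = 0; r < tables[t].num_rules; r++)
            if (tables[t].rules[r] == masked)
                return tables[t].actions[r];
    }
    return -1;
}
```

A hit in the last sub-table, or a miss, costs one probe per sub-table, which is the L sequential lookups the slide asks about.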


OVS with Two Layer Lookup

[Figure: the flow mask and rule sub-tables of the "POC: Open vSwitch Flow Lookup" slide, with a "1st Level of Indirection" placed between the packet header and the sub-tables.]


Bloom Filter as 1st Level of Indirection

[Figure: the 1st level of indirection is drawn as one Bloom filter per sub-table (Mask 1 BF, Mask 2 BF, ..., Mask L BF, ..., Mask N BF) between the packet header and the flow mask and rule sub-tables.]

L Lookups → L Bloom Filters + 1 lookup
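One way to read this slide: every sub-table gets a small Bloom filter over its masked rule values, and a sub-table is searched only when its filter reports a possible match. The sketch below illustrates that idea under assumptions and is not the proof-of-concept code: headers are 8 bits wide, each filter is a single 64-bit word with two toy hash functions, sub-tables are scanned linearly, and all names are hypothetical. Bloom filters can give false positives but never false negatives, so the result matches that of a plain sequential search.

```c
#include <assert.h>
#include <stdint.h>

#define MAX_RULES 8

struct subtable {
    uint8_t  mask;               /* e.g. 0xF0 for mask 1111 0000 */
    uint8_t  rules[MAX_RULES];   /* rule values, stored already masked */
    int      actions[MAX_RULES];
    int      num_rules;
    uint64_t bloom;              /* 64-bit Bloom filter over rules[] */
};

/* Two toy hash functions, each selecting one of 64 bit positions. */
static uint64_t bloom_bits(uint8_t v)
{
    unsigned b1 = (v >> 2) & 63u;
    unsigned b2 = ((v * 37u) >> 3) & 63u;
    return (1ULL << b1) | (1ULL << b2);
}

static void add_rule(struct subtable *t, uint8_t value, int action)
{
    assert(t->num_rules < MAX_RULES);
    value &= t->mask;
    t->rules[t->num_rules] = value;
    t->actions[t->num_rules] = action;
    t->num_rules++;
    t->bloom |= bloom_bits(value);
}

/* Returns the action, or -1 on a miss; *searched counts the sub-tables
 * that were actually searched after passing the Bloom filter check. */
static int classify_bloom(const struct subtable *tables, int num_tables,
                          uint8_t header, int *searched)
{
    *searched = 0;
    for (int t = 0; t < num_tables; t++) {
        uint8_t masked = header & tables[t].mask;
        uint64_t bits = bloom_bits(masked);
        if ((tables[t].bloom & bits) != bits)
            continue;            /* definitely not in this sub-table */
        (*searched)++;           /* possible match: do the full lookup */
        for (int r = 0; r < tables[t].num_rules; r++)
            if (tables[t].rules[r] == masked)
                return tables[t].actions[r];
    }
    return -1;
}
```

The filter checks are cheap bit tests on one word per sub-table, so in the common case the expensive sub-table search runs once for a hit and not at all for a miss.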


2 Level Lookup Preliminary Performance

Results

[Figure: chart of Cycles (0 to 7000) vs. Number of Subtables Traversed (1 to 20); series: OVS - Hit, OVS - Miss, bloom - Hit, bloom - Miss. Platform: Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]


Legal Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies’ features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

© 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What’s Inside, and the Intel. Experience What’s Inside logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.


Questions?

Sameh Gobriel

sameh.gobriel@intel.com
