Flow Classification Optimizations in DPDK
Day01-Session07-SamehGobriel-DPDKUSASummit2016
Sameh Gobriel & Charlie Tai - Intel
DPDK US Summit - San Jose - 2016
Agenda
• Flow Classification in DPDK
• Cuckoo Hashing for Optimized Flow Table Design
• Using Intel Transactional Synchronization Extensions (TSX) for scaling insert performance
• Using Intel AVX instructions for scaling lookup performance
• Research Proof of Concept: 2-level lookup for OVS Megaflow Cache
Flow Classification on Network Appliances vs General Purpose Server H/W

Monolithic purpose-built boxes (TEM/OEM, proprietary OS, ASIC, DSP, FPGA, ASSP):
• Network appliances use purpose-built H/W & ASICs (e.g., TCAM) for flow classification.
• Cost & power consumption are limiting factors in supporting a large number of flows.

Networking VMs on standard servers (NFV: flow classification implemented on general purpose processors, running over a hypervisor such as ESXi or KVM):
• General purpose processors with a cache/memory hierarchy can support much larger flow tables.
• Multicore architectures provide scalable, competitive flow classification performance.
Metrics for Good Flow Table Design

Fields of the packet header are used to form a flow key. A hash function H(..) over the flow key creates a flow table index; the keys retrieved from that table entry are matched against the input key, and the matching entry's action is applied.

1. Higher lookup rate = better throughput & latency
2. Higher insert rate = better flow update & table initialization
3. Efficient table utilization = more flows
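The key ➝ hash ➝ index ➝ compare flow above can be sketched in plain C. This is a toy single-slot illustration, not DPDK code: the 5-tuple layout, the FNV-1a hash, and the table size are all illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>
#include <string.h>

/* Hypothetical 5-tuple flow key; the exact fields vary by application.
 * Packed so the key can be hashed and compared as raw bytes. */
struct flow_key {
    uint32_t src_ip;
    uint32_t dst_ip;
    uint16_t src_port;
    uint16_t dst_port;
    uint8_t  proto;
} __attribute__((packed));

#define TABLE_SIZE 1024          /* power of two, so hash & (SIZE-1) works */

struct flow_entry {
    struct flow_key key;
    int action;                  /* toy stand-in for "Action N" */
    int in_use;
};

/* Toy FNV-1a over the key bytes, standing in for the real hash function
 * (e.g. a CRC-based hash). */
static uint32_t flow_hash(const struct flow_key *k)
{
    const uint8_t *p = (const uint8_t *)k;
    uint32_t h = 2166136261u;
    for (size_t i = 0; i < sizeof(*k); i++)
        h = (h ^ p[i]) * 16777619u;
    return h;
}

/* Insert: hash the flow key to a table index and store key + action.
 * Collisions are simply rejected here; the later slides show how
 * cuckoo hashing resolves them properly. */
static int flow_insert(struct flow_entry *t, const struct flow_key *k, int action)
{
    uint32_t idx = flow_hash(k) & (TABLE_SIZE - 1);
    if (t[idx].in_use)
        return -1;
    t[idx].key = *k;
    t[idx].action = action;
    t[idx].in_use = 1;
    return 0;
}

/* Lookup: hash to an index, then confirm with a full key compare,
 * since different keys can hash to the same slot. Returns the action
 * on a match, -1 on a miss. */
static int flow_lookup(const struct flow_entry *t, const struct flow_key *k)
{
    uint32_t idx = flow_hash(k) & (TABLE_SIZE - 1);
    if (t[idx].in_use && memcmp(&t[idx].key, k, sizeof(*k)) == 0)
        return t[idx].action;
    return -1;
}
```

The full key compare after the hash probe is what makes this an exact-match lookup: the hash only narrows the search, it never decides a match by itself.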
DPDK Framework

[Architecture diagram: Network Functions (Cloud, Enterprise, Comms) on top of the DPDK libraries: Core (EAL, MBUF, MEMPOOL, RING, TIMER), Classify (HASH, LPM, ACL), Extensions (KNI, IVSHMEM, JOBSTAT, POWER, METER, IP FRAG, REORDER, DISTRIB, SCHED, VHOST), Packet Framework (PORT, TABLE, PIPELINE), ETHDEV with native & virtual PMDs (IGB, IXGBE, I40E, FM10K, E1000, BNX2X, CXGBE, ENIC, MLX4, MLX5, MPIPE, NFP, SZEDATA2, ENA, VMXNET3, XENVIRT, VIRTIO, VHOST, BONDING, PCAP, AF_PKT, RING, NULL), and CRYPTO accelerators (QAT, AESNI MB, AESNI GCM, SNOW 3G, NULL); KNI, IGB_UIO, VFIO and UIO_PCI_GENERIC sit in the kernel below user space.]
RTE-Hash Exact Match Library

Traditional Exact Match table library:
• Relies on a "sparse" hash table implementation.
• Simple exact match implementation.
• Significant performance degradation with increased table sizes.

Cuckoo Hashing (available since DPDK v2.2) gives better scalability:
• Denser tables fit in cache.
• Can scale to millions of entries.
Cuckoo Hashing: High Level Overview

Each key x has two candidate buckets, H1(x) and H2(x). On insert, x is placed in whichever candidate bucket has a free slot. If both are full, a resident key is evicted to its own alternate bucket (it gets "cuckooed"), possibly displacing further keys in turn, until every key has a slot.
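The displacement scheme above can be sketched as a minimal C insert routine. This is a toy with one slot per bucket and two made-up hash functions; the real rte_hash uses multi-slot buckets and derives the second signature from the first.

```c
#include <stdint.h>

#define NBUCKETS  8    /* toy size; real buckets also hold multiple slots */
#define MAX_KICKS 16   /* give up after this many displacements */
#define EMPTY     0    /* key value 0 is reserved to mean "slot free" */

/* Two illustrative hash functions giving each key its two candidates. */
static unsigned h1(uint32_t k) { return (k * 2654435761u) % NBUCKETS; }
static unsigned h2(uint32_t k) { return ((k ^ (k >> 16)) * 40503u) % NBUCKETS; }

/* Insert with displacement: try a candidate bucket; if it is occupied,
 * evict the resident ("cuckoo" it), take its slot, and re-insert the
 * evicted key into ITS alternate bucket, repeating until a free slot is
 * found or the path gets too long. */
static int cuckoo_insert(uint32_t table[NBUCKETS], uint32_t key)
{
    unsigned b = h1(key);
    for (int kick = 0; kick < MAX_KICKS; kick++) {
        if (table[b] == EMPTY) {
            table[b] = key;
            return 0;
        }
        uint32_t evicted = table[b];
        table[b] = key;
        key = evicted;
        b = (h1(key) == b) ? h2(key) : h1(key);  /* its other bucket */
    }
    return -1;  /* path too long: a real table would rehash or grow */
}

/* Every key lives in one of its two candidate buckets, so lookup only
 * ever probes two locations, regardless of table occupancy. */
static int cuckoo_lookup(const uint32_t table[NBUCKETS], uint32_t key)
{
    return table[h1(key)] == key || table[h2(key)] == key;
}
```

The bounded two-probe lookup is the point: occupancy can climb far higher than in a sparse table without lookups degrading.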
Cuckoo Hashing Performance Benefits

• Cuckoo Hashing allows more flows to be inserted into the flow table.
• RTE-hash can be used to support flow tables with millions of keys (e.g. 64M 5-tuple keys) that fit in the CPU cache.

[Chart: Table Load at First Key Insertion Failure (0% to 100%), Traditional Exact Match vs. Cuckoo Hashing. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]
Code Snippet for RTE-hash API

struct rte_hash *rte_hash_create(const struct rte_hash_parameters *params);
int rte_hash_add_key_data(const struct rte_hash *h, const void *key, void *data);
int rte_hash_lookup_data(const struct rte_hash *h, const void *key, void **data);
int rte_hash_lookup_bulk_data(const struct rte_hash *h, const void **keys,
                              uint32_t num_keys, uint64_t *hit_mask, void *data[]);

Reference: http://dpdk.org/doc/api/rte__hash_8h.html
Long Cuckoo Paths & Multiple Concurrent Writers

Inserting key y can trigger a chain of displacements, e.g. the cuckoo path a➝e➝s➝x➝k➝f➝d➝t➝∅ (9 writes).

• One insert may move a lot of items, especially at high table occupancy.
• A collision happens when multiple writers have intersecting cuckoo paths.
Flow-Table Insert Performance Optimizations

Make use of IA hardware features: replace traditional locks with TSX hardware concurrency.
• Traditional locks: limited concurrency; threads are serialized in the critical section.
• TSX: the hardware monitors cache lines; when a data conflict is detected, execution is rolled back and retried.
Flow-Table Insert Performance Optimizations (continued)

1. Depth First Search ➝ Breadth First Search: BFS explores the candidate buckets level by level, so it finds a shorter cuckoo path and fewer keys need to move.
2. Split the cuckoo path search from the key movement, to minimize the critical section:

   Before: TSX Lock ➝ Cuckoo Path Search ➝ Move Keys ➝ TSX Unlock
   After: Cuckoo Path Search ➝ TSX Lock ➝ Move Keys ➝ TSX Unlock
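The search/move split can be illustrated schematically. The stub lock functions below stand in for the transactional region (in DPDK an RTM transaction via rte_xbegin()/rte_xend() with a lock fallback), and the "path search" is reduced to a free-slot scan; none of this is the rte_hash implementation.

```c
#include <stdint.h>

/* Toy flow table; slot value 0 means "free". */
#define NSLOTS 16
static uint32_t g_table[NSLOTS];

/* Stand-ins for the transactional region. */
static void tsx_lock(void)   { /* begin transaction */ }
static void tsx_unlock(void) { /* commit transaction */ }

/* Phase 1, outside the transaction: the potentially long search. In the
 * real code this is the BFS cuckoo-path search; here it is just a scan
 * for a free slot. */
static int find_slot(void)
{
    for (int i = 0; i < NSLOTS; i++)
        if (g_table[i] == 0)
            return i;
    return -1;
}

/* Phase 2, inside the transaction: re-validate the result of the search,
 * then perform only the writes. Keeping the critical section this short
 * is what lets concurrent TSX writers commit without conflicting. */
static int insert_split(uint32_t key)
{
    int slot = find_slot();          /* long search, no lock held */
    if (slot < 0)
        return -1;
    tsx_lock();
    if (g_table[slot] != 0)          /* another writer raced us: re-search */
        slot = find_slot();
    if (slot >= 0)
        g_table[slot] = key;
    tsx_unlock();
    return slot;
}
```

The re-validation step inside the lock is essential: the slot found in phase 1 may have been taken by another writer between the two phases.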
Summary of Insert Optimizations

[Chart: Insert Rate (million inserts/sec) vs. Number of Cores (1 to 22), Insert Optimized vs. Original, showing an 11X improvement. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

Insert performance is linearly scalable with the number of cores.
Code Snippet for RTE-hash with TSX (DPDK v16.07)

#define RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD 0x02
/* Default behavior of insertion, single writer/multi writer */
struct rte_hash_parameters {
    ...
    uint8_t extra_flag;
};

rte_hash_parameters.extra_flag |=
    (RTE_HASH_EXTRA_FLAGS_TRANS_MEM_SUPPORT
     | RTE_HASH_EXTRA_FLAGS_MULTI_WRITER_ADD);

Setting both flags enables the TSX-backed multi-writer insert path.

Reference: http://dpdk.org/doc/api/rte__hash_8h.html
Flow-Table Lookup Performance Optimizations

1. Use AVX instructions (make use of IA hardware features): instead of computing H(x) and probing the table one packet at a time, the hash computation and lookup are vectorized across a batch of packets (H_AVX(x) and Lookup_AVX over Pkt 1..4).
2. Minimize implementation overhead:
   • Prefetching of keys into cache
   • Inline functions
   • Lookup pipelining
Summary of Lookup Optimizations

[Chart: Throughput (million lookups/core/sec) vs. number of flows (1M, 2M, 4M, 8M, 16M, 32M), comparing the default rte_hash cuckoo lookup against the lookup-optimized version. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]

~35% improved lookup throughput.
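The prefetching and pipelining idea can be sketched as a two-pass bulk lookup: pass 1 hashes every key in the batch and issues a prefetch for its slot; pass 2 does the compares, by which time the cache lines should have arrived. This mirrors the idea behind rte_hash_lookup_bulk_data, not its actual implementation; the hash, the single-slot table layout, and the batch limit are all illustrative assumptions.

```c
#include <stdint.h>
#include <stddef.h>

/* Toy single-slot table; a slot with key 0 is treated as empty. */
struct entry { uint32_t key; uint32_t val; };

static uint32_t toy_hash(uint32_t k, uint32_t mask)
{
    return (k * 2654435761u) & mask;   /* stand-in for the real hash */
}

/* Two-pass bulk lookup over up to 64 keys. Returns the number of hits;
 * vals[i] is filled in only on a hit. */
static unsigned lookup_bulk(const struct entry *table, uint32_t mask,
                            const uint32_t *keys, size_t n, uint32_t *vals)
{
    uint32_t idx[64];
    unsigned hits = 0;

    /* Pass 1: compute all indices, start all memory fetches early. */
    for (size_t i = 0; i < n; i++) {
        idx[i] = toy_hash(keys[i], mask);
        __builtin_prefetch(&table[idx[i]], 0, 3);  /* read, keep in cache */
    }
    /* Pass 2: by now the prefetched lines are (hopefully) in cache. */
    for (size_t i = 0; i < n; i++) {
        if (table[idx[i]].key == keys[i]) {
            vals[i] = table[idx[i]].val;
            hits++;
        }
    }
    return hits;
}
```

Overlapping many independent memory accesses this way hides DRAM latency behind useful work, which is where most of the bulk-lookup gain comes from.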
Code Snippet for RTE-hash with AVX (Targeting DPDK v16.11)
DPDK Framework (revisited)

The same DPDK architecture diagram as above, highlighting planned Classify extensions: supporting wild card flow classification and variable key size.
POC: Open vSwitch Flow Lookup

The OVS Megaflow cache groups rules into per-mask sub-tables keyed on the packet header (e.g. flow mask 1111 0000 holds rules 1010 xxxx, 0011 xxxx, 1011 xxxx; mask 1110 0000 holds 110x xxxx, 101x xxxx, 111x xxxx, 011x xxxx; and so on through Mask L to Mask N, down to 0xxx xxxx).

1. The sub-tables form a set of disjoint tables.
2. A rule is inserted into only one sub-table, so lookup terminates after the first match.
3. Lookup sequentially searches each sub-table until a match is found.

Instead of L sequential lookups, what if we knew which sub-table to hit?
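The sequential lookup described in the three points above can be sketched as follows, using toy 8-bit "headers" in place of real packet header fields; the structure sizes are illustrative.

```c
#include <stdint.h>
#include <stddef.h>

/* One megaflow subtable: all rules in it share one mask, so lookup
 * within a subtable is exact match on (header & mask). */
struct subtable {
    uint8_t mask;
    uint8_t rules[4];   /* rule keys, already masked */
    int     nrules;
};

/* Sequentially search each subtable: mask the header with that
 * subtable's mask and probe for an exact match. Because a rule lives
 * in exactly one subtable, the first match is the only match and
 * lookup terminates. Returns the matching subtable's index, or -1. */
static int classify(const struct subtable *tabs, size_t ntabs, uint8_t hdr)
{
    for (size_t t = 0; t < ntabs; t++) {
        uint8_t masked = hdr & tabs[t].mask;
        for (int r = 0; r < tabs[t].nrules; r++)
            if (tabs[t].rules[r] == masked)
                return (int)t;
    }
    return -1;
}
```

The cost of this scheme is visible in the loop bound: a miss, or a match in the last subtable, pays for a probe of every mask, which is exactly the L-lookups problem the next slides attack.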
OVS with Two Layer Lookup

Add a 1st level of indirection in front of the per-mask sub-tables: the packet header is first used to predict which sub-table will match, so that only that sub-table needs to be probed.
Bloom Filter as 1st Level of Indirection

A Bloom filter is maintained per sub-table (Mask 1 BF, Mask 2 BF, ... Mask L BF, ... Mask N BF). The packet header is first tested against the filters, and only a sub-table whose filter reports a (possible) hit is probed.

L lookups ➝ L Bloom filters + 1 lookup
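A minimal per-subtable Bloom filter might look like the sketch below; the 64-bit filter width, the two hash multipliers, and the 32-bit key type are illustrative assumptions, not the PoC's parameters.

```c
#include <stdint.h>

/* Tiny per-subtable Bloom filter: a 64-bit bitmap probed with two hash
 * functions. */
typedef uint64_t bloom_t;

/* Two toy hashes, each returning a bit position in 0..63. */
static unsigned bh1(uint32_t k) { return (k * 2654435761u) >> 26; }
static unsigned bh2(uint32_t k) { return (k * 2246822507u) >> 26; }

/* Record a masked flow key in this subtable's filter. */
static void bloom_add(bloom_t *bf, uint32_t masked_key)
{
    *bf |= (1ull << bh1(masked_key)) | (1ull << bh2(masked_key));
}

/* Returns 1 if the key MIGHT be in the subtable (false positives are
 * possible), 0 if it definitely is not, letting lookup skip the
 * subtable entirely. */
static int bloom_maybe(bloom_t bf, uint32_t masked_key)
{
    uint64_t want = (1ull << bh1(masked_key)) | (1ull << bh2(masked_key));
    return (bf & want) == want;
}
```

Because the filters are tiny and cache-resident, testing L of them is far cheaper than probing L hash tables; a false positive only costs one wasted sub-table probe, never a wrong classification.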
2 Level Lookup Preliminary Performance Results

[Chart: Cycles vs. Number of Subtables Traversed (1 to 20), for four series: OVS - Hit, OVS - Miss, bloom - Hit, bloom - Miss. Intel(R) Xeon(R) CPU E5-2699 v4 @ 2.20GHz, Hyper-Threading: disabled.]
Legal Disclaimers

No license (express or implied, by estoppel or otherwise) to any intellectual property rights is granted by this document.

Intel disclaims all express and implied warranties, including without limitation, the implied warranties of merchantability, fitness for a particular purpose, and non-infringement, as well as any warranty arising from course of performance, course of dealing, or usage in trade.

This document contains information on products, services and/or processes in development. All information provided here is subject to change without notice. Contact your Intel representative to obtain the latest forecast, schedule, specifications and roadmaps.

Intel technologies' features and benefits depend on system configuration and may require enabled hardware, software or service activation. Performance varies depending on system configuration. No computer system can be absolutely secure. Check with your system manufacturer or retailer or learn more at intel.com.

© 2016 Intel Corporation. Intel, the Intel logo, Intel. Experience What's Inside, and the Intel. Experience What's Inside logo are trademarks of Intel Corporation in the U.S. and/or other countries.

*Other names and brands may be claimed as the property of others.
Questions?

Sameh Gobriel
sameh.gobriel@intel.com