<strong>Soft</strong>iwarp<br />
A <strong>Soft</strong>ware <strong>iWARP</strong> Driver for OpenFabrics<br />
Bernard Metzler, Fredy Neeser, Philip Frey<br />
IBM Zurich Research Lab<br />
www.openfabrics.org
Contents<br />
‣ Background<br />
‣ What is it?<br />
‣ Do we need <strong>Soft</strong>ware RDMA?<br />
‣ How is it made?<br />
‣ Some first Test Results<br />
‣ Feedback: OFED Issues<br />
‣ Project Status & Roadmap<br />
‣ Summary<br />
Background<br />
‣ RDMA (from proprietary to standard):<br />
• VIA, Quadrics, Myrinet, …, InfiniBand, <strong>iWARP</strong><br />
‣ Ethernet (from lame to fast):<br />
• 1, 10, 100, 1000, 10000, 40000, … Mbit/s<br />
‣ Unified Wire:<br />
• Single link, single switch, single technology, or dumb adapter<br />
[Chart: Top500, Nov 2008]<br />
‣ OpenIB<br />
• Focused on InfiniBand<br />
‣ OpenFabrics<br />
• InfiniBand + <strong>iWARP</strong> HW<br />
• + <strong>iWARP</strong> SW?<br />
‣ IBM Zurich Research<br />
• RDMA API standardization<br />
• IETF work on <strong>iWARP</strong><br />
• <strong>Soft</strong>ware <strong>iWARP</strong> stack<br />
<strong>Soft</strong>iwarp: What is it?<br />
‣ Just another OFED <strong>iWARP</strong> driver<br />
• ../hw/cxgb3/, ../hw/siw/<br />
‣ Purely software based <strong>iWARP</strong> protocol<br />
stack implementation<br />
• Kernel module<br />
• Runs on top of TCP kernel sockets<br />
• Exports OFED Interfaces (verbs, IWCM,<br />
management, …)<br />
‣ Client support<br />
• Currently only user level clients<br />
• libsiw: user space library to integrate with<br />
libibverbs, librdmacm<br />
[Diagram: OpenFabrics stack: user libraries (libsiw, ulib x, ulib y) on the OFA uverbs/API; kernel ULPs (IPoIB, SRP, iSER, NFS-RDMA) and a connection manager abstraction on the OFA kverbs/API; the siw driver sits beside hardware drivers (driver x, driver y)]<br />
‣ Current build<br />
• OFED 1.3<br />
• Linux 2.6.24<br />
• ~9,000 lines of *.[ch], including comments<br />
[Diagram, cont.: HCA x and RNIC y attach directly; siw runs over TCP/IP on a plain Ethernet NIC]<br />
OFED and Kernel Integration<br />
Approach: Keep things simple and standard<br />
‣ TCP interface: Kernel Sockets<br />
• TCP stack completely untouched<br />
• Non-blocking write() with pause and resume<br />
• softirq-based read()<br />
‣ Linux Kernel Services<br />
• List-based QP/WQE management<br />
• Workqueue-based asynchronous sending/CM<br />
• …<br />
‣ OFED interface<br />
• Verbs<br />
• Event callbacks<br />
• Device registration<br />
‣ Fast Path<br />
• No private interface between user lib and<br />
kernel module<br />
• Syscall for each post(SQ/RQ) or reap(CQ)<br />
operation<br />
[Diagram: Application → OFED libibverbs/librdmacm → libsiw via uverbs; in the kernel, OFED core and OFED CM (kverbs, cm-pi) → siw → kernel socket → TCP]<br />
Why RDMA in <strong>Soft</strong>ware?<br />
‣ Enable systems without RNIC to speak RDMA<br />
• Conventional ENIC sufficient<br />
• Peer with real RNICs<br />
• Help busy server to offload<br />
• Speak RDMA outside the cluster<br />
• Enable real RNICs(!)<br />
• Benefit from RDMA API semantics<br />
• Application benefits<br />
• Async. comm., parallelism<br />
• One-sided operations<br />
• CPU benefits<br />
• Copy avoidance in tx<br />
• Named buffers in rx<br />
‣ Early system migration to RDMA<br />
• Migrate applications before RNICs are available<br />
• Mix RNIC equipped systems with ENICs<br />
‣ Test/Debug real HW<br />
‣ RDMA transport redundancy/failover<br />
‣ Help to grow OFED Ecosystem for Adoption and Usage beyond HPC<br />
[Figure: <strong>Soft</strong>iwarp-enabled clients peering with a server farm using RDMA HW]<br />
RDMA Use Case != HPC<br />
Multimedia Data Dissemination via RDMA<br />
‣ RNIC-equipped video server, Chelsio T3, 10 GbE<br />
• Complete content in Server RAM<br />
• IBM HS21 BladeServers (4core Xeon 2.33 GHz, 8GB Mem.)<br />
‣ Up to 1000 VLC clients pull FullHD (8.7 Mbps)<br />
• VLC client extended for OFED verbs<br />
• Client may seek in data stream<br />
‣ HTTP get (Apache w/sendfile()) or RDMA READ<br />
• Service degradation w/o sendfile<br />
• Increasing load with sendfile<br />
• Zero server CPU load for RDMA<br />
‣ Very simple pull protocol for RDMA<br />
• Minimum <strong>iWARP</strong> server state per Client: RDMA READ!<br />
[Charts: total CPU load [%] (0–400) and interrupts/sec (0–40000) vs. number of clients (0–1000), comparing <strong>iWARP</strong> at 8.7 Mbps, HTTP sendfile on at 8.7 Mbps, and HTTP sendfile off at 8.7 Mbps]<br />
<strong>Soft</strong>iwarp TX Path Design<br />
‣ Syscall through OFED verbs API<br />
to post SQ work<br />
‣ Synchronous send out of user<br />
context if socket send space available<br />
‣ Nonblocking socket operation:<br />
• Pause sending if socket buffer full<br />
• Resume sending when TCP signals sk_write_space()<br />
• Use a Linux workqueue to resume sending<br />
‣ Lock-free source memory validation on<br />
the fly<br />
‣ sendfile()-semantic possible<br />
‣ Post work completions onto CQ<br />
‣ Reap CQEs asynchronously<br />
[Diagram: post_send triggers direct SQ processing (process SQ WR, DDP segmentation, MPA framing, CRC, put CQE onto CQ, call CQ handler), which feeds TCP output via tcp_sendmsg()/sendpage(); when the socket callback sk_write_space() fires (TCP ACKs freed send space), the TX workqueue registers and runs SQ work to resume; completions are reaped from the CQ via reap_cqe]<br />
<strong>Soft</strong>iwarp RX Path Design<br />
‣ All RX processing done in<br />
softirq context:<br />
• in sk_data_ready() upcall:<br />
• Header parsing<br />
• RQ access<br />
• Immediate data placement<br />
• CRC<br />
• No context switch<br />
• No extra thread<br />
‣ Lock-free target memory<br />
validation on the fly<br />
[Diagram: softirq TCP input (sk_data_ready, tcp_read_sock) drives RX processing via skb_copy_bits(): parse headers, process DDP/RDMAP, place data, check CRC, post inbound RDMA READ requests (RREQ) to the SQ, put CQEs onto the CQ and call the CQ handler; READ responses (RRESP) are handed to the TX workqueue, which registers and runs the SQ work on the TX path]<br />
‣ Inbound RREQs are just posted to the SQ; SQ processing is scheduled to resume later<br />
First Tests: <strong>Soft</strong>iwarp<br />
‣ Non-tuned software stack on both sides<br />
‣ Application level flow control<br />
(ping-pong buffers)<br />
‣ SENDs for synchronization<br />
‣ 1 connection<br />
[Chart: RDMA WRITE bandwidth test: throughput [Mbps] (0–10000) and total CPU load [%] (0–200) for client and server vs. message size 1–256 KB; <strong>Soft</strong>iwarp rev 87, test series s07, IBM Zurich Research Lab; write/read application data: off, MPA CRC32C: off, MTU = 9000]<br />
[Sequence diagram: the client WRITEs the payloads of chunk 1 and SENDs Chunk Done (1), continues with chunk 2 while the server processes chunk 1 data, and waits for the server's Chunk ACK (1) SEND (RECV-WC) before reusing the chunk 1 buffer]<br />
First Tests: <strong>Soft</strong>iwarp<br />
[Chart: RDMA WRITE bandwidth test: throughput [Mbps] (0–10000) and total CPU load [%] (0–100) for client and server vs. message size 1–256 KB; <strong>Soft</strong>iwarp rev 87, test series s14; same WRITE/SEND/ACK chunk sequence as before]<br />
‣ Same application-level flow control (ping-pong buffers), plus:<br />
• 1 Core only<br />
• MPA CRC off<br />
• MTU=9000<br />
‣ Sending CPU on its limit<br />
First Tests: <strong>Soft</strong>iwarp + CRC<br />
[Chart: RDMA WRITE bandwidth test: throughput [Mbps] (0–10000) and total CPU load [%] (0–100) for client and server vs. message size 1–256 KB; <strong>Soft</strong>iwarp rev 87, test series s15; same WRITE/SEND/ACK chunk sequence as before]<br />
‣ Same application-level flow control (ping-pong buffers), plus:<br />
• 1 Core only<br />
• MPA CRC ON<br />
• MTU=9000<br />
‣ CRC is killing performance<br />
‣ Sending CPU still at its limit<br />
First Tests: <strong>Soft</strong>iwarp-Chelsio<br />
[Sequence diagram: the client streams WRITE payloads for each chunk, closing each chunk with a signaled WRITE; the resulting work completion (WC WRITE (N)) paces the next chunk toward the server]<br />
‣ Test 1: <strong>Soft</strong>iwarp peering Chelsio T3<br />
• Setup:<br />
• RNIC sends WRITEs to <strong>Soft</strong>iwarp target<br />
• Target just places data w/o appl. interaction<br />
• Result:<br />
• Close to line speed at 8KB<br />
• Oops: some issues at larger buffers<br />
‣ Test 2: <strong>Soft</strong>iwarp peering <strong>Soft</strong>iwarp<br />
• Same setup<br />
• Result:<br />
• Maximum Bandwidth from 128KB on<br />
‣ Conclusions:<br />
• Promising for first test on non-tuned stack<br />
• <strong>Soft</strong>ware stack may perform well on client side<br />
• Further improvement with sendfile() possible<br />
<strong>Soft</strong>iwarp: Work in progress<br />
Core Functionality<br />
• RDMAP/DDP/MPA (x)<br />
• QP/CQ/PD/MR Objects (x)<br />
• Send (x)<br />
• Receive (x)<br />
• RDMA WRITE (x)<br />
• RDMA READ (x)<br />
• Connection Mgmt (IWCM, TCP) (x)<br />
• Memory Management (x)<br />
Features (incomplete)…<br />
• MPA CRC (x)<br />
• MPA Markers (-)<br />
• Memory Windows (w)<br />
• Inline Data (w)<br />
• Shared Receive Queue (-)<br />
• Fast Memory Registration (-)<br />
• Termination Messages (w)<br />
• Remote Invalidation (-)<br />
• STag 0 (-)<br />
• Resource Realloc. (MR/QP/CQ) (-)<br />
• TCP header alignment (w)<br />
• Relative addressing (ZBVA) (w)<br />
(x): done, (w): work in progress, (-): not done<br />
<strong>Soft</strong>iwarp Roadmap<br />
‣ Going open source very soon<br />
‣ Discuss current code base in the community<br />
• Be open to changes/criticism<br />
• Identify core must-haves which are missing<br />
• Stability!<br />
• Invite others to contribute<br />
• Feed known OFED core issues back to the team<br />
• Don’t touch TCP<br />
‣ Start compliance testing (OFA IWG) soon<br />
‣ Investigate private fast path user interface option<br />
‣ Start working on kernel client support<br />
‣ Investigate partial offloading of CPU-intensive tasks<br />
• CRC, tx-markers<br />
• Data placement,..<br />
Feedback: OFED Issues<br />
‣ Late RDMA Transition<br />
• Something not part of RNIC integration is now possible<br />
• Very simple to do with <strong>Soft</strong>iwarp, benefits iSER & Co.<br />
• <strong>Soft</strong>iwarp allows late RDMA mode transition w/o TCP context<br />
migration<br />
‣ OFED CM<br />
• How to coexist with RNIC if SW stack shares link, shall we?<br />
• Can we exist within OFED w/o full (complex) IWCM support?<br />
• Current code spends 2000 lines out of 9000 for CM!<br />
‣ Device Management<br />
• A wildcard listen on multiple interfaces must be translated into individual listens on each (port, ipaddr) combination<br />
‣ Zero-based virtual addressing<br />
‣ …<br />
Summary<br />
‣ <strong>Soft</strong>ware RDMA is useful<br />
‣ <strong>Soft</strong>ware RDMA is efficient on client side (at least)<br />
‣ RDMA semantics help to use transport efficiently<br />
‣ <strong>Soft</strong>iwarp helps to grow RDMA/OFED ecosystem<br />
• Establish RDMA communication model<br />
• Prepare applications to use RDMA<br />
• Prepare systems to introduce RDMA HW<br />
• Peer & thus enable RDMA HW<br />
‣ <strong>Soft</strong>iwarp is work in progress<br />
• Please join.<br />