
Softiwarp
A Software iWARP Driver for OpenFabrics
Bernard Metzler, Fredy Neeser, Philip Frey
IBM Zurich Research Lab
www.openfabrics.org


Contents
‣ Background
‣ What is it?
‣ Do we need Software RDMA?
‣ How is it made?
‣ Some first Test Results
‣ Feedback: OFED Issues
‣ Project Status & Roadmap
‣ Summary



Background
‣ RDMA (from proprietary to standard):
  • VIA, Quadrics, Myrinet, …, InfiniBand, iWARP
‣ Ethernet (from lame to fast):
  • 1, 10, 100, 1000, 10000, 40000, … Mbit/s
‣ Unified wire:
  • Single link, single switch, single technology, or dumb adapter
[Figure: Top500, Nov 2008]
‣ OpenIB
  • Focused on InfiniBand
‣ OpenFabrics
  • InfiniBand + iWARP HW
  • + iWARP SW?
‣ IBM Zurich Research
  • RDMA API standardization
  • IETF work on iWARP
  • Software iWARP stack



Softiwarp: What is it?
‣ Just another OFED iWARP driver
  • ../hw/cxgb3/, ../hw/siw, …
‣ Purely software-based iWARP protocol stack implementation
  • Kernel module
  • Runs on top of TCP kernel sockets
  • Exports OFED interfaces (verbs, IWCM, management, …)
‣ Client support
  • Currently only user-level clients
  • libsiw: user-space library to integrate with libibverbs, librdmacm (see the sketch below)

[Figure: OFED stack — user libraries (ulib x, ulib y, libsiw) above the OFA uverbs API; kernel clients (IPoIB, SRP, iSER, NFS-RDMA) and a connection manager abstraction above the OFA kverbs API; device drivers (driver x, driver y, siw driver) on top of HCA x, RNIC y, and TCP/IP over a plain Ethernet NIC]

‣ Current build
  • OFED 1.3
  • Linux 2.6.24
  • ~9000 lines of *.[ch], including comments

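Because libsiw plugs in underneath the standard verbs stack, an application opens a siw device exactly as it would any hardware RDMA device. A minimal sketch, assuming (for illustration only) that the device name carries a "siw" prefix:

    /* Sketch: find and open the software iWARP device via libibverbs.
     * The "siw" name prefix is an assumption for illustration only. */
    #include <stdio.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int n;
        struct ibv_device **list = ibv_get_device_list(&n);
        struct ibv_context *ctx = NULL;

        for (int i = 0; list && i < n; i++)
            if (!strncmp(ibv_get_device_name(list[i]), "siw", 3)) {
                ctx = ibv_open_device(list[i]); /* same path as for HW */
                break;
            }
        if (ctx) {
            printf("using %s\n", ibv_get_device_name(ctx->device));
            ibv_close_device(ctx);
        }
        if (list)
            ibv_free_device_list(list);
        return 0;
    }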



OFED and Kernel Integration
Approach: keep things simple and standard
‣ TCP interface: kernel sockets
  • TCP stack completely untouched
  • Non-blocking write() with pause and resume
  • softirq-based read()
‣ Linux kernel services
  • List-based QP/WQE management
  • Workqueue-based asynchronous sending/CM
  • …
‣ OFED interface
  • Verbs
  • Event callbacks
  • Device registration
‣ Fast path
  • No private interface between user library and kernel module
  • Syscall for each post (SQ/RQ) or reap (CQ) operation

[Figure: application over OFED libibverbs/librdmacm and libsiw in user space (uverbs); in the kernel, the siw module sits between the OFED core/CM (kverbs, cm-pi) and a kernel TCP socket]
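A hedged sketch of the non-blocking kernel-socket send this design implies; kernel_sendmsg() and the MSG_DONTWAIT flag are standard kernel interfaces, while siw_tx_segment() and its calling convention are illustrative:

    /* Sketch: push one framed segment into a kernel TCP socket without
     * blocking. On -EAGAIN the caller pauses SQ processing and relies
     * on the sk_write_space() callback (see the TX path slide) to
     * resume it later. */
    #include <linux/net.h>
    #include <linux/uio.h>

    static int siw_tx_segment(struct socket *s, struct kvec *vec,
                              size_t nvec, size_t len)
    {
        struct msghdr msg = {
            .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
        };

        /* Returns bytes queued, or -EAGAIN once send space is gone. */
        return kernel_sendmsg(s, &msg, vec, nvec, len);
    }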



Why RDMA in Software?
‣ Enable systems without an RNIC to speak RDMA
  • Conventional Ethernet NIC sufficient
  • Peer with real RNICs
  • Help a busy server to offload
  • Speak RDMA out of the cluster
  • Enable real RNICs(!)
‣ Benefit from RDMA API semantics
  • Application benefits
    • Async. communication, parallelism
    • One-sided operations
  • CPU benefits
    • Copy avoidance in TX
    • Named buffers in RX
‣ Early system migration to RDMA
  • Migrate applications before an RNIC is available
  • Mix RNIC-equipped systems with plain Ethernet NICs
‣ Test/debug real HW
‣ RDMA transport redundancy/failover
‣ Help grow the OFED ecosystem for adoption and usage beyond HPC
[Figure: Softiwarp-enabled clients peering with a server farm using RDMA HW]



RDMA Use Case != HPC
Multimedia Data Dissemination via RDMA
‣ RNIC-equipped video server, Chelsio T3 10Gb
  • Complete content in server RAM
  • IBM HS21 BladeServers (4-core Xeon 2.33 GHz, 8 GB memory)
‣ Up to 1000 VLC clients pull FullHD (8.7 Mbps)
  • VLC client extended for OFED verbs
  • Client may seek in the data stream
‣ HTTP GET (Apache w/ sendfile()) or RDMA READ
  • Service degradation w/o sendfile
  • Increasing load with sendfile
  • Zero server CPU load for RDMA
‣ Very simple pull protocol for RDMA (see the sketch below)
  • Minimum iWARP server state per client: RDMA READ!
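A hedged sketch of such a pull on the client side, using standard libibverbs; qp, buf, lkey, rkey and remote_addr are assumed to come from connection setup (e.g., via librdmacm), and the function name is illustrative:

    /* Sketch: the client pulls the next media chunk with a one-sided
     * RDMA READ; the server CPU is not involved in serving it. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int pull_chunk(struct ibv_qp *qp, void *buf, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_READ,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED, /* WC once the data landed */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad;

        return ibv_post_send(qp, &wr, &bad);
    }

Because the READ is one-sided, the server posts nothing per request — which is why the per-client iWARP server state stays minimal.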

[Charts: server CPU load [%] and interrupts/sec vs. number of clients (0–1000) at 8.7 Mbps per stream, comparing iWARP, HTTP sendfile on, and HTTP sendfile off]



Softiwarp TX Path Design
‣ Syscall through the OFED verbs API to post SQ work
‣ Synchronous send out of user context if socket send space is available
‣ Non-blocking socket operation:
  • Pause sending if the socket buffer is full
  • Resume sending when TCP indicates sk_write_space()
  • Use a Linux workqueue to resume sending (see the sketch below)
‣ Lock-free source memory validation on the fly
‣ sendfile() semantics possible
‣ Post work completions onto the CQ
‣ Reap CQEs asynchronously

[Figure: TX path — post_send drives direct SQ processing (process SQ WR, DDP segmentation, MPA framing, CRC, put onto CQ, call CQ handler) into TCP output via tcp_sendmsg()/sendpage(); completions are reaped from the CQ; when a TCP ACK frees send space, the sk_write_space() socket callback has the TX workqueue register and run SQ work to resume]
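The pause/resume scheme hinges on the socket's write-space callback. A minimal sketch, assuming a module-global TX workqueue and a per-QP work item; struct siw_qp and all siw_* names are illustrative:

    /* Sketch: resume a paused SQ when a TCP ACK frees send space.
     * sk_write_space and workqueues are real kernel interfaces;
     * struct siw_qp and the siw_* names are illustrative. */
    #include <linux/workqueue.h>
    #include <net/sock.h>

    struct siw_qp {
        struct work_struct tx_work;        /* deferred SQ processing */
        /* ... SQ state ... */
    };

    static struct workqueue_struct *siw_tx_wq; /* created at module init */

    static void siw_write_space(struct sock *sk)
    {
        struct siw_qp *qp = sk->sk_user_data;  /* set at conn setup */

        /* We are in softirq context: don't run SQ processing here,
         * hand the resume over to the TX workqueue instead. */
        if (qp && sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
            queue_work(siw_tx_wq, &qp->tx_work);
    }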



Softiwarp RX Path Design
‣ All RX processing done in softirq context:
  • In the sk_data_ready() upcall:
    • Header parsing
    • RQ access
    • Immediate data placement
    • CRC
  • No context switch
  • No extra thread
‣ Lock-free target memory validation on the fly

‣ An inbound RREQ is just posted to the SQ; SQ processing is scheduled to resume later (the RRESP is sent via the TX workqueue)
[Figure: RX path — softirq TCP input (sk_data_ready, tcp_read_sock) drives RX processing via skb_copy_bits(): parse headers, process DDP/RDMAP, place data, do CRC, post RREQs to the SQ, put completions onto the CQ and call the CQ handler; the TX workqueue registers and runs SQ work (do RRESP)]
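A hedged sketch of this softirq read path: sk_data_ready(), tcp_read_sock() and skb_copy_bits() are real kernel interfaces, while the siw_* names and the header handling are illustrative:

    /* Sketch: all RX work driven from the data-ready upcall in softirq
     * context, with no extra thread and no context switch. */
    #include <linux/kernel.h>
    #include <net/tcp.h>

    static int siw_rx_actor(read_descriptor_t *rd, struct sk_buff *skb,
                            unsigned int off, size_t len)
    {
        char hdr[64];   /* illustrative bound for iWARP header size */
        size_t n = min_t(size_t, len, sizeof(hdr));

        if (skb_copy_bits(skb, off, hdr, n))
            return 0;   /* nothing consumed; more data needed */
        /* ... parse MPA/DDP/RDMAP headers, validate target memory,
         *     place payload, update CRC, post completions ... */
        return n;       /* bytes consumed */
    }

    static void siw_data_ready(struct sock *sk)
    {
        read_descriptor_t rd = {
            .count    = 1,
            .arg.data = sk->sk_user_data,  /* per-connection state */
        };

        tcp_read_sock(sk, &rd, siw_rx_actor);
    }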



First Tests: Softiwarp
‣ Non-tuned software stack on both sides
‣ Application-level flow control (ping-pong buffers; sketched below)
‣ SENDs for synchronization
‣ 1 connection

[Figure: ping-pong sequence — the client posts RDMA WRITEs for the payload segments of a chunk, then a SEND ("Chunk Done"); the server, on its RECV completion, processes the chunk data and SENDs a "Chunk ACK", which the client waits for before reusing that chunk's buffer]
[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s07 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); write/read application data: off, MPA CRC32C: off, MTU = 9000]
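For reference, a hedged sketch of the client side of this flow control; struct conn and the post_write()/post_chunk_done()/wait_chunk_ack() helpers are illustrative wrappers around ibv_post_send()/ibv_poll_cq():

    /* Sketch of the client side of the ping-pong flow control.
     * struct conn and the helpers below are illustrative only. */
    struct conn;                                    /* opaque connection */
    void post_write(struct conn *, unsigned int, unsigned int);
    void post_chunk_done(struct conn *, unsigned int);
    void wait_chunk_ack(struct conn *);             /* blocks on RECV WC */

    enum { NBUF = 2 };   /* ping-pong: at most two chunks in flight */

    static void stream_chunks(struct conn *c, unsigned long nchunks,
                              unsigned int nseg)
    {
        for (unsigned long n = 0; n < nchunks; n++) {
            unsigned int slot = n % NBUF;

            if (n >= NBUF)
                wait_chunk_ack(c);       /* server freed this buffer */
            for (unsigned int i = 0; i < nseg; i++)
                post_write(c, slot, i);  /* RDMA WRITE payload (n,i) */
            post_chunk_done(c, slot);    /* SEND used for sync */
        }
    }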



First Tests: Softiwarp

[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s14 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); same ping-pong sequence as above]

‣ Same application-level flow control (ping-pong buffers), plus:
  • 1 core only
  • MPA CRC off
  • MTU = 9000
‣ Sending CPU at its limit




First Tests: Softiwarp + CRC

[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s15 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); same ping-pong sequence as above]

‣ Same application-level flow control (ping-pong buffers), plus:
  • 1 core only
  • MPA CRC ON
  • MTU = 9000
‣ CRC is killing performance
‣ Sending CPU still at its limit
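A hedged sketch of where that cost sits: the MPA CRC32C is computed in software over every FPDU, e.g., with the kernel's crc32c() helper (the function name is illustrative; RFC 5044 seed/finalization details are elided):

    /* Sketch: software MPA CRC32C over a complete FPDU using the
     * kernel's crc32c() helper. This extra pass over every payload
     * byte is what saturates the sending CPU. */
    #include <linux/crc32c.h>

    static u32 siw_mpa_crc(const void *fpdu, unsigned int len)
    {
        /* RFC 5044 seed/finalization details elided in this sketch. */
        return crc32c(~0, fpdu, len);
    }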




First Tests: Softiwarp-Chelsio
[Figure: sequence — the client streams unsignaled WRITEs for each chunk's payload and signals the last WRITE of a chunk (Sig.WRITE), whose work completion (WC WRITE (N)) gates the next chunk]

‣ Test 1: Softiwarp peering Chelsio T3
  • Setup:
    • RNIC sends WRITEs to the Softiwarp target
    • Target just places data w/o application interaction
  • Result:
    • Close to line speed at 8 KB
    • Oops – some issues at larger buffers
‣ Test 2: Softiwarp peering Softiwarp
  • Same setup
  • Result:
    • Maximum bandwidth from 128 KB on
‣ Conclusions:
  • Promising for a first test on a non-tuned stack
  • A software stack may perform well on the client side
  • Further improvement with sendfile() possible



Softiwarp: Work in Progress

Core Functionality                      Status
RDMAP/DDP/MPA                           x
QP/CQ/PD/MR objects                     x
Send                                    x
Receive                                 x
RDMA WRITE                              x
RDMA READ                               x
Connection management (IWCM, TCP)       x
Memory management                       x

Features (incomplete) …                 Status
MPA CRC                                 x
MPA markers                             -
Memory windows                          w
Inline data                             w
Shared receive queue                    -
Fast memory registration                -
Termination messages                    w
Remote invalidation                     -
STag 0                                  -
Resource realloc. (MR/QP/CQ)            -
TCP header alignment                    w
Relative addressing (ZBVA)              w

(x): done, (w): work in progress, (-): not done



Softiwarp Roadmap
‣ Open source very soon
‣ Discuss the current code base in the community
  • Be open to changes/criticism
  • Identify core must-haves that are missing
  • Stability!
  • Invite others to contribute
  • Feed known issues of the OFED core back to the team
  • Don't touch TCP
‣ Start compliance testing (OFA IWG) soon
‣ Investigate a private fast-path user interface option
‣ Start working on kernel client support
‣ Investigate partial offloading of CPU-intensive tasks
  • CRC, TX markers
  • Data placement, …



Feedback: OFED Issues
‣ Late RDMA transition
  • Something that is not possible with RNIC integration becomes possible
  • Very simple to do with Softiwarp; benefits iSER & co.
  • Softiwarp allows a late RDMA mode transition w/o TCP context migration
‣ OFED CM
  • How to coexist with an RNIC if the SW stack shares the link – shall we?
  • Can we exist within OFED w/o full (complex) IWCM support?
  • The current code spends 2000 of its 9000 lines on CM!
‣ Device management
  • A wildcard listen on multiple interfaces must be translated into individual listens on each (port/ipaddr) combination
‣ Zero-based virtual addressing
‣ …



Summary
‣ Software RDMA is useful
‣ Software RDMA is efficient, on the client side at least
‣ RDMA semantics help to use the transport efficiently
‣ Softiwarp helps to grow the RDMA/OFED ecosystem
  • Establish the RDMA communication model
  • Prepare applications to use RDMA
  • Prepare systems for the introduction of RDMA HW
  • Peer with, and thus enable, RDMA HW
‣ Softiwarp is work in progress
  • Please join.

