
Softiwarp
A Software iWARP Driver for OpenFabrics
Bernard Metzler, Fredy Neeser, Philip Frey
IBM Zurich Research Lab
www.openfabrics.org


Contents
‣ Background
‣ What is it?
‣ Do we need Software RDMA?
‣ How is it made?
‣ Some first Test Results
‣ Feedback: OFED Issues
‣ Project Status & Roadmap
‣ Summary



Background
‣ RDMA (from proprietary to standard):
  • VIA, Quadrics, Myrinet, …, InfiniBand, iWARP
‣ Ethernet (from lame to fast):
  • 1, 10, 100, 1000, 10000, 40000, … Mbit/s
‣ Unified wire:
  • Single link, single switch, single technology, or dumb adapter
[Figure: Top500, Nov 2008]
‣ OpenIB
  • Focused on InfiniBand
‣ OpenFabrics
  • InfiniBand + iWARP HW
  • + iWARP SW?
‣ IBM Zurich Research
  • RDMA API standardization
  • IETF work on iWARP
  • Software iWARP stack



Softiwarp: What is it?
‣ Just another OFED iWARP driver
  • ../hw/cxgb3/, ../hw/siw, …
‣ Purely software-based iWARP protocol stack implementation
  • Kernel module
  • Runs on top of TCP kernel sockets
  • Exports OFED interfaces (verbs, IWCM, management, …)
‣ Client support
  • Currently only user-level clients
  • libsiw: user-space library to integrate with libibverbs, librdmacm (see the sketch below)

[Figure: OFED stack — user libraries (ulib x, ulib y, libsiw) above the OFA uverbs API; kernel clients (IPoIB, SRP, iSER, NFS-RDMA) and a connection manager abstraction above the OFA kverbs API; device drivers (driver x, driver y, siw driver) on top of HCA x, RNIC y, and TCP/IP over a plain Ethernet NIC]

‣ Current build
  • OFED 1.3
  • Linux 2.6.24
  • ~9000 lines of *.[ch], including comments

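Because libsiw plugs in underneath the standard verbs stack, an application opens a siw device exactly as it would any hardware RDMA device. A minimal sketch, assuming (for illustration only) that the device name carries a "siw" prefix:

    /* Sketch: find and open the software iWARP device via libibverbs.
     * The "siw" name prefix is an assumption for illustration only. */
    #include <stdio.h>
    #include <string.h>
    #include <infiniband/verbs.h>

    int main(void)
    {
        int n;
        struct ibv_device **list = ibv_get_device_list(&n);
        struct ibv_context *ctx = NULL;

        for (int i = 0; list && i < n; i++)
            if (!strncmp(ibv_get_device_name(list[i]), "siw", 3)) {
                ctx = ibv_open_device(list[i]); /* same path as for HW */
                break;
            }
        if (ctx) {
            printf("using %s\n", ibv_get_device_name(ctx->device));
            ibv_close_device(ctx);
        }
        if (list)
            ibv_free_device_list(list);
        return 0;
    }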



OFED and Kernel Integration
Approach: keep things simple and standard
‣ TCP interface: kernel sockets
  • TCP stack completely untouched
  • Non-blocking write() with pause and resume
  • softirq-based read()
‣ Linux kernel services
  • List-based QP/WQE management
  • Workqueue-based asynchronous sending/CM
  • …
‣ OFED interface
  • Verbs
  • Event callbacks
  • Device registration
‣ Fast path
  • No private interface between user library and kernel module
  • Syscall for each post (SQ/RQ) or reap (CQ) operation

[Figure: application over OFED libibverbs/librdmacm and libsiw in user space (uverbs); in the kernel, the siw module sits between the OFED core/CM (kverbs, cm-pi) and a kernel TCP socket]
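A hedged sketch of the non-blocking kernel-socket send this design implies; kernel_sendmsg() and the MSG_DONTWAIT flag are standard kernel interfaces, while siw_tx_segment() and its calling convention are illustrative:

    /* Sketch: push one framed segment into a kernel TCP socket without
     * blocking. On -EAGAIN the caller pauses SQ processing and relies
     * on the sk_write_space() callback (see the TX path slide) to
     * resume it later. */
    #include <linux/net.h>
    #include <linux/uio.h>

    static int siw_tx_segment(struct socket *s, struct kvec *vec,
                              size_t nvec, size_t len)
    {
        struct msghdr msg = {
            .msg_flags = MSG_DONTWAIT | MSG_NOSIGNAL,
        };

        /* Returns bytes queued, or -EAGAIN once send space is gone. */
        return kernel_sendmsg(s, &msg, vec, nvec, len);
    }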



Why RDMA in Software?
‣ Enable systems without an RNIC to speak RDMA
  • Conventional Ethernet NIC sufficient
  • Peer with real RNICs
  • Help a busy server to offload
  • Speak RDMA out of the cluster
  • Enable real RNICs(!)
‣ Benefit from RDMA API semantics
  • Application benefits
    • Async. communication, parallelism
    • One-sided operations
  • CPU benefits
    • Copy avoidance in TX
    • Named buffers in RX
‣ Early system migration to RDMA
  • Migrate applications before an RNIC is available
  • Mix RNIC-equipped systems with plain Ethernet NICs
‣ Test/debug real HW
‣ RDMA transport redundancy/failover
‣ Help grow the OFED ecosystem for adoption and usage beyond HPC
[Figure: Softiwarp-enabled clients peering with a server farm using RDMA HW]



RDMA Use Case != HPC
Multimedia Data Dissemination via RDMA
‣ RNIC-equipped video server, Chelsio T3 10Gb
  • Complete content in server RAM
  • IBM HS21 BladeServers (4-core Xeon 2.33 GHz, 8 GB memory)
‣ Up to 1000 VLC clients pull FullHD (8.7 Mbps)
  • VLC client extended for OFED verbs
  • Client may seek in the data stream
‣ HTTP GET (Apache w/ sendfile()) or RDMA READ
  • Service degradation w/o sendfile
  • Increasing load with sendfile
  • Zero server CPU load for RDMA
‣ Very simple pull protocol for RDMA (see the sketch below)
  • Minimum iWARP server state per client: RDMA READ!
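A hedged sketch of such a pull on the client side, using standard libibverbs; qp, buf, lkey, rkey and remote_addr are assumed to come from connection setup (e.g., via librdmacm), and the function name is illustrative:

    /* Sketch: the client pulls the next media chunk with a one-sided
     * RDMA READ; the server CPU is not involved in serving it. */
    #include <stdint.h>
    #include <infiniband/verbs.h>

    static int pull_chunk(struct ibv_qp *qp, void *buf, uint32_t lkey,
                          uint64_t remote_addr, uint32_t rkey, uint32_t len)
    {
        struct ibv_sge sge = {
            .addr = (uintptr_t)buf, .length = len, .lkey = lkey,
        };
        struct ibv_send_wr wr = {
            .opcode     = IBV_WR_RDMA_READ,
            .sg_list    = &sge,
            .num_sge    = 1,
            .send_flags = IBV_SEND_SIGNALED, /* WC once the data landed */
            .wr.rdma    = { .remote_addr = remote_addr, .rkey = rkey },
        };
        struct ibv_send_wr *bad;

        return ibv_post_send(qp, &wr, &bad);
    }

Because the READ is one-sided, the server posts nothing per request — which is why the per-client iWARP server state stays minimal.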

[Charts: server CPU load [%] and interrupts/sec vs. number of clients (0–1000) at 8.7 Mbps per stream, comparing iWARP, HTTP sendfile on, and HTTP sendfile off]



Softiwarp TX Path Design
‣ Syscall through the OFED verbs API to post SQ work
‣ Synchronous send out of user context if socket send space is available
‣ Non-blocking socket operation:
  • Pause sending if the socket buffer is full
  • Resume sending when TCP indicates sk_write_space()
  • Use a Linux workqueue to resume sending (see the sketch below)
‣ Lock-free source memory validation on the fly
‣ sendfile() semantics possible
‣ Post work completions onto the CQ
‣ Reap CQEs asynchronously

[Figure: TX path — post_send drives direct SQ processing (process SQ WR, DDP segmentation, MPA framing, CRC, put onto CQ, call CQ handler) into TCP output via tcp_sendmsg()/sendpage(); completions are reaped from the CQ; when a TCP ACK frees send space, the sk_write_space() socket callback has the TX workqueue register and run SQ work to resume]
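The pause/resume scheme hinges on the socket's write-space callback. A minimal sketch, assuming a module-global TX workqueue and a per-QP work item; struct siw_qp and all siw_* names are illustrative:

    /* Sketch: resume a paused SQ when a TCP ACK frees send space.
     * sk_write_space and workqueues are real kernel interfaces;
     * struct siw_qp and the siw_* names are illustrative. */
    #include <linux/workqueue.h>
    #include <net/sock.h>

    struct siw_qp {
        struct work_struct tx_work;        /* deferred SQ processing */
        /* ... SQ state ... */
    };

    static struct workqueue_struct *siw_tx_wq; /* created at module init */

    static void siw_write_space(struct sock *sk)
    {
        struct siw_qp *qp = sk->sk_user_data;  /* set at conn setup */

        /* We are in softirq context: don't run SQ processing here,
         * hand the resume over to the TX workqueue instead. */
        if (qp && sk_stream_wspace(sk) >= sk_stream_min_wspace(sk))
            queue_work(siw_tx_wq, &qp->tx_work);
    }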



Softiwarp RX Path Design
‣ All RX processing done in softirq context:
  • In the sk_data_ready() upcall:
    • Header parsing
    • RQ access
    • Immediate data placement
    • CRC
  • No context switch
  • No extra thread
‣ Lock-free target memory validation on the fly

‣ An inbound RREQ is just posted to the SQ; SQ processing is scheduled to resume later (the RRESP is sent via the TX workqueue)
[Figure: RX path — softirq TCP input (sk_data_ready, tcp_read_sock) drives RX processing via skb_copy_bits(): parse headers, process DDP/RDMAP, place data, do CRC, post RREQs to the SQ, put completions onto the CQ and call the CQ handler; the TX workqueue registers and runs SQ work (do RRESP)]
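A hedged sketch of this softirq read path: sk_data_ready(), tcp_read_sock() and skb_copy_bits() are real kernel interfaces, while the siw_* names and the header handling are illustrative:

    /* Sketch: all RX work driven from the data-ready upcall in softirq
     * context, with no extra thread and no context switch. */
    #include <linux/kernel.h>
    #include <net/tcp.h>

    static int siw_rx_actor(read_descriptor_t *rd, struct sk_buff *skb,
                            unsigned int off, size_t len)
    {
        char hdr[64];   /* illustrative bound for iWARP header size */
        size_t n = min_t(size_t, len, sizeof(hdr));

        if (skb_copy_bits(skb, off, hdr, n))
            return 0;   /* nothing consumed; more data needed */
        /* ... parse MPA/DDP/RDMAP headers, validate target memory,
         *     place payload, update CRC, post completions ... */
        return n;       /* bytes consumed */
    }

    static void siw_data_ready(struct sock *sk)
    {
        read_descriptor_t rd = {
            .count    = 1,
            .arg.data = sk->sk_user_data,  /* per-connection state */
        };

        tcp_read_sock(sk, &rd, siw_rx_actor);
    }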



First Tests: Softiwarp
‣ Non-tuned software stack on both sides
‣ Application-level flow control (ping-pong buffers; sketched below)
‣ SENDs for synchronization
‣ 1 connection

[Figure: ping-pong sequence — the client posts RDMA WRITEs for the payload segments of a chunk, then a SEND ("Chunk Done"); the server, on its RECV completion, processes the chunk data and SENDs a "Chunk ACK", which the client waits for before reusing that chunk's buffer]
[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s07 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); write/read application data: off, MPA CRC32C: off, MTU = 9000]
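For reference, a hedged sketch of the client side of this flow control; struct conn and the post_write()/post_chunk_done()/wait_chunk_ack() helpers are illustrative wrappers around ibv_post_send()/ibv_poll_cq():

    /* Sketch of the client side of the ping-pong flow control.
     * struct conn and the helpers below are illustrative only. */
    struct conn;                                    /* opaque connection */
    void post_write(struct conn *, unsigned int, unsigned int);
    void post_chunk_done(struct conn *, unsigned int);
    void wait_chunk_ack(struct conn *);             /* blocks on RECV WC */

    enum { NBUF = 2 };   /* ping-pong: at most two chunks in flight */

    static void stream_chunks(struct conn *c, unsigned long nchunks,
                              unsigned int nseg)
    {
        for (unsigned long n = 0; n < nchunks; n++) {
            unsigned int slot = n % NBUF;

            if (n >= NBUF)
                wait_chunk_ack(c);       /* server freed this buffer */
            for (unsigned int i = 0; i < nseg; i++)
                post_write(c, slot, i);  /* RDMA WRITE payload (n,i) */
            post_chunk_done(c, slot);    /* SEND used for sync */
        }
    }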



First Tests: Softiwarp

[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s14 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); same ping-pong sequence as above]

‣ Same application-level flow control (ping-pong buffers), plus:
  • 1 core only
  • MPA CRC off
  • MTU = 9000
‣ Sending CPU at its limit




First Tests: Softiwarp + CRC

[Chart: RDMA Write bandwidth test, Softiwarp rev 87, test series s15 — throughput [Mbps] and total CPU load [%] for client and server vs. message size (1–256 KB); same ping-pong sequence as above]

‣ Same application-level flow control (ping-pong buffers), plus:
  • 1 core only
  • MPA CRC ON
  • MTU = 9000
‣ CRC is killing performance
‣ Sending CPU still at its limit
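A hedged sketch of where that cost sits: the MPA CRC32C is computed in software over every FPDU, e.g., with the kernel's crc32c() helper (the function name is illustrative; RFC 5044 seed/finalization details are elided):

    /* Sketch: software MPA CRC32C over a complete FPDU using the
     * kernel's crc32c() helper. This extra pass over every payload
     * byte is what saturates the sending CPU. */
    #include <linux/crc32c.h>

    static u32 siw_mpa_crc(const void *fpdu, unsigned int len)
    {
        /* RFC 5044 seed/finalization details elided in this sketch. */
        return crc32c(~0, fpdu, len);
    }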




First Tests: Softiwarp-Chelsio
[Figure: sequence — the client streams unsignaled WRITEs for each chunk's payload and signals the last WRITE of a chunk (Sig.WRITE), whose work completion (WC WRITE (N)) gates the next chunk]

‣ Test 1: Softiwarp peering Chelsio T3
  • Setup:
    • RNIC sends WRITEs to the Softiwarp target
    • Target just places data w/o application interaction
  • Result:
    • Close to line speed at 8 KB
    • Oops – some issues at larger buffers
‣ Test 2: Softiwarp peering Softiwarp
  • Same setup
  • Result:
    • Maximum bandwidth from 128 KB on
‣ Conclusions:
  • Promising for a first test on a non-tuned stack
  • A software stack may perform well on the client side
  • Further improvement with sendfile() possible



Softiwarp: Work in Progress

Core Functionality                      Status
RDMAP/DDP/MPA                           x
QP/CQ/PD/MR objects                     x
Send                                    x
Receive                                 x
RDMA WRITE                              x
RDMA READ                               x
Connection management (IWCM, TCP)       x
Memory management                       x

Features (incomplete) …                 Status
MPA CRC                                 x
MPA markers                             -
Memory windows                          w
Inline data                             w
Shared receive queue                    -
Fast memory registration                -
Termination messages                    w
Remote invalidation                     -
STag 0                                  -
Resource realloc. (MR/QP/CQ)            -
TCP header alignment                    w
Relative addressing (ZBVA)              w

(x): done, (w): work in progress, (-): not done



Softiwarp Roadmap
‣ Open source very soon
‣ Discuss the current code base in the community
  • Be open to changes/criticism
  • Identify core must-haves that are missing
  • Stability!
  • Invite others to contribute
  • Feed known issues of the OFED core back to the team
  • Don't touch TCP
‣ Start compliance testing (OFA IWG) soon
‣ Investigate a private fast-path user interface option
‣ Start working on kernel client support
‣ Investigate partial offloading of CPU-intensive tasks
  • CRC, TX markers
  • Data placement, …



Feedback: OFED Issues
‣ Late RDMA transition
  • Something that is not possible with RNIC integration becomes possible
  • Very simple to do with Softiwarp; benefits iSER & co.
  • Softiwarp allows a late RDMA mode transition w/o TCP context migration
‣ OFED CM
  • How to coexist with an RNIC if the SW stack shares the link – shall we?
  • Can we exist within OFED w/o full (complex) IWCM support?
  • The current code spends 2000 of its 9000 lines on CM!
‣ Device management
  • A wildcard listen on multiple interfaces must be translated into individual listens on each (port/ipaddr) combination
‣ Zero-based virtual addressing
‣ …



Summary
‣ Software RDMA is useful
‣ Software RDMA is efficient, on the client side at least
‣ RDMA semantics help to use the transport efficiently
‣ Softiwarp helps to grow the RDMA/OFED ecosystem
  • Establish the RDMA communication model
  • Prepare applications to use RDMA
  • Prepare systems for the introduction of RDMA HW
  • Peer with, and thus enable, RDMA HW
‣ Softiwarp is work in progress
  • Please join.

