A Measurement Study of the Linux TCP/IP Stack Performance and ...

for the Linux Operating System. The Linux OS has been a popular choice for server-class systems due to its stability and security features, and it is now used even by large-scale system operators such as Amazon and Google [11], [13]. In this paper, we have focused on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4. Although several performance studies of the TCP protocol stack [1], [6], [9], [10] have been done, this is the first time a thorough comparison of the Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have compared the performance of these two Linux versions along several metrics: bulk data throughput, connection throughput, and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.
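The bulk-data-throughput metric can be illustrated with a minimal loopback microbenchmark. This is only a toy sketch of the idea, not the measurement harness used in the study; the buffer and transfer sizes are illustrative values chosen here, not parameters from the paper.

```python
import socket
import threading
import time

def bulk_throughput(total_bytes=8 * 1024 * 1024, chunk=64 * 1024):
    """Push total_bytes over a loopback TCP connection and report MB/s.

    A stand-in for a bulk-data-throughput benchmark; sizes are
    illustrative, not those used in the study.
    """
    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))          # let the kernel pick a free port
    server.listen(1)
    port = server.getsockname()[1]

    def sender():
        s = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
        s.connect(("127.0.0.1", port))
        payload = b"x" * chunk
        sent = 0
        while sent < total_bytes:
            sent += s.send(payload)        # send() may accept less than chunk
        s.close()                          # closing signals end of transfer

    t = threading.Thread(target=sender)
    t.start()
    conn, _ = server.accept()

    received = 0
    start = time.perf_counter()
    while True:
        data = conn.recv(chunk)
        if not data:                       # sender closed: transfer complete
            break
        received += len(data)
    elapsed = time.perf_counter() - start

    t.join()
    conn.close()
    server.close()
    return received, received / elapsed / 1e6   # bytes received, MB/s
```

On loopback this mostly measures stack and copy overhead rather than the wire, which is precisely what makes kernel-to-kernel comparisons of the sort described here meaningful.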

In most of the experiments, kernel 2.6 performed better than kernel 2.4. Although this is to be expected, we have identified the specific changes in Linux 2.6 which contribute to the improved performance. We also discuss some unexpected results, such as the degraded performance of kernel 2.6 when processing a single connection on an SMP system. We present fine-grained kernel profiling results which explain the performance characteristics observed in the experiments.

The rest of the paper is organised as follows. In Section II, we review previous work in TCP/IP profiling and discuss some approaches for improving protocol processing. Section III discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section IV presents results of performance measurements on uniprocessor systems, while Section V discusses results of performance measurements on multiprocessor systems. In Section VI we discuss the kernel profiling results and the inferences drawn from them. In Section VII we conclude with our main observations.

II. BACKGROUND

Several studies have been done earlier on the performance of TCP/IP stack processing [1], [5], [6], [9], [10]. Copying and checksumming, among others, have usually been identified as expensive operations. Thus, zero-copy networking, integrated checksum and copying, header prediction [6], and jumbo frame sizes [1] are improvements that have been explored earlier. Another approach has been to offload the TCP/IP stack processing to a NIC with dedicated hardware [24].
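Zero-copy networking, one of the improvements cited above, is exposed on Linux through the sendfile(2) system call: the kernel moves file pages directly to the socket, skipping the usual read-into-a-user-buffer, write-back-to-the-kernel round trip. The sketch below demonstrates the mechanism via Python's `socket.sendfile()` (which wraps `os.sendfile` where the platform supports it); the demo structure and payload are illustrative, not from any of the cited systems.

```python
import os
import socket
import tempfile
import threading

def zero_copy_demo(payload=b"A" * 100_000):
    """Round-trip `payload` through a temp file and a sendfile(2) transfer."""
    with tempfile.NamedTemporaryFile(delete=False) as f:
        f.write(payload)
        path = f.name

    server = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    server.bind(("127.0.0.1", 0))
    server.listen(1)
    port = server.getsockname()[1]

    def serve():
        conn, _ = server.accept()
        with open(path, "rb") as src:
            # Kernel copies file pages straight to the socket buffers;
            # the payload never crosses into user space on Linux.
            conn.sendfile(src)
        conn.close()

    t = threading.Thread(target=serve)
    t.start()

    client = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
    client.connect(("127.0.0.1", port))
    chunks = []
    while True:
        data = client.recv(65536)
        if not data:
            break
        chunks.append(data)
    client.close()
    t.join()
    server.close()
    os.unlink(path)
    return b"".join(chunks)
```

The saving is exactly the per-byte copy cost that the profiling studies above flag as dominant: one fewer traversal of the data through user space per transfer.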

Efforts have also been made to exploit the parallelism available in general-purpose machines themselves, by modifying the protocol stack implementation appropriately. Parallelizing approaches usually deal with a trade-off between balancing load across multiple processors and the overhead of maintaining shared data among these processors [3], [4], [19]. Some of the approaches that are known to work well include "processor per message" and "processor per connection" [23]. In the processor-per-message paradigm, each processor executes the whole protocol stack for one message (i.e., packet). With this approach, heavily used connections can be served efficiently; however, the connection state has to be shared between the processors. In the processor-per-connection paradigm, one processor handles all the messages belonging to a particular connection. This eliminates the connection-state sharing problem, but can suffer from uneven distribution of load. Other approaches include "processor per protocol" (each layer of the protocol stack is processed by a particular processor) and "processor per task" (each processor performs a specific task or function within a protocol). Both these approaches suffer from poor caching efficiency.
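The processor-per-connection paradigm reduces to a dispatch rule: hash the connection 4-tuple so that every packet of a connection lands on the same worker, and per-connection TCP state never needs to be shared or locked across processors. The sketch below illustrates only the dispatch idea; the function names and packet representation are invented for this example, not taken from any of the cited implementations.

```python
from collections import defaultdict

def worker_for(src_ip, src_port, dst_ip, dst_port, n_workers):
    """Processor-per-connection dispatch: hashing the 4-tuple pins every
    packet of a connection to one worker, so connection state is private
    to that worker and requires no cross-processor locking."""
    return hash((src_ip, src_port, dst_ip, dst_port)) % n_workers

def dispatch(packets, n_workers=4):
    """Group packets into per-worker queues (illustrative driver loop)."""
    queues = defaultdict(list)
    for pkt in packets:
        w = worker_for(pkt["src_ip"], pkt["src_port"],
                       pkt["dst_ip"], pkt["dst_port"], n_workers)
        queues[w].append(pkt)
    return queues
```

The load-imbalance drawback noted above is visible directly in this sketch: all packets of one heavily used connection map to a single queue, so that worker can saturate while the others sit idle.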

III. IMPROVEMENTS IN LINUX KERNEL 2.6

The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4, with many performance improvements. In this section we discuss some of the changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem.

Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on April 29, 2009 at 00:47 from IEEE Xplore. Restrictions apply.

