A Measurement Study of the Linux TCP/IP Stack Performance and ...
for the Linux Operating System. The Linux OS has been a popular choice for server-class systems due to its stability and security features, and it is now used even by large-scale system operators such as Amazon and Google [11], [13]. In this paper, we focus on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4. Although several performance studies of the TCP protocol stack [1], [6], [9], [10] have been done, this is the first time a thorough comparison of the Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have compared the performance of these two Linux versions along various metrics: bulk data throughput, connection throughput, and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.
In most of the experiments, kernel 2.6 performed better than kernel 2.4. Although this is to be expected, we have identified the specific changes in Linux 2.6 which contribute to the improved performance. We also discuss some unexpected results, such as the degraded performance of kernel 2.6 when processing a single connection on an SMP system. We present fine-grained kernel profiling results which explain the performance characteristics observed in the experiments.
The rest of the paper is organised as follows. In Section II, we review previous work in TCP/IP profiling, and discuss some approaches for improving protocol processing. Section III discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section IV presents results of performance measurements on uniprocessor systems, while Section V discusses results of performance measurements on multiprocessor systems. In Section VI we discuss the kernel profiling results and the inferences drawn from them. In Section VII we conclude with our main observations.
II. BACKGROUND
Several studies have been done earlier on the performance of TCP/IP stack processing [1], [5], [6], [9], [10]. Copying and checksumming, among others, have usually been identified as expensive operations. Thus, zero-copy networking, integrated checksum and copying, header prediction [6], jumbo frame sizes [1], etc. are improvements that have been explored earlier. Another approach has been to offload the TCP/IP stack processing to a NIC with dedicated hardware [24].
Efforts have also been made to exploit the parallelism available in general-purpose machines themselves, by modifying the protocol stack implementation appropriately. Parallelizing approaches usually deal with trade-offs between balancing load across multiple processors and the overhead of maintaining shared data among these processors [3], [4], [19]. Some of the approaches that are known to work well include "processor per message" and "processor per connection" [23]. In the processor-per-message paradigm, each processor executes the whole protocol stack for one message (i.e., packet). With this approach, heavily used connections can be served efficiently; however, the connection state has to be shared between the processors. In the processor-per-connection paradigm, one processor handles all the messages belonging to a particular connection. This eliminates the connection-state sharing problem, but can suffer from uneven distribution of load. Other approaches include "processor per protocol" (each layer of the protocol stack is processed by a particular processor) and "processor per task" (each processor performs a specific task or function within a protocol). Both these approaches suffer from poor caching efficiency.
III. IMPROVEMENTS IN LINUX KERNEL 2.6
The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4, with many performance improvements. In this section we discuss some of the changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on April 29, 2009 at 00:47 from IEEE Xplore. Restrictions apply.