A Measurement Study of the Linux TCP/IP Stack Performance and ...
for the Linux Operating System. The Linux OS has been a popular choice for server-class systems due to its stability and security features, and it is now used even by large-scale system operators such as Amazon and Google [11], [13]. In this paper, we focus on the network stack performance of Linux kernel 2.4 and kernel 2.6. Until recently, kernel 2.4 was the most stable Linux kernel and was used extensively. Kernel 2.6 is the latest stable Linux kernel and is fast replacing kernel 2.4. Although several performance studies of the TCP protocol stack [1], [6], [9], [10] have been done, this is the first time a thorough comparison of the Linux 2.4 and Linux 2.6 TCP/IP stack performance has been carried out. We have compared the performance of these two Linux versions along various metrics: bulk data throughput, connection throughput, and scalability across multiple processors. We also present a fine-grained profiling of resource usage by the TCP/IP stack functions, thereby identifying the bottlenecks.
In most of the experiments, kernel 2.6 performed better than kernel 2.4. Although this is to be expected, we have identified the specific changes in Linux 2.6 which contribute to the improved performance. We also discuss some unexpected results, such as the degraded performance of kernel 2.6 when processing a single connection on an SMP system. We present fine-grained kernel profiling results which explain the performance characteristics observed in the experiments.
The rest of the paper is organised as follows. In Section II, we review previous work in TCP/IP profiling, and discuss some approaches for improving protocol processing. Section III discusses the improvements made in Linux kernel 2.6 which affect the network performance of the system. Section IV presents results of performance measurements on uniprocessor systems, while Section V discusses results of performance measurements on multiprocessor systems. In Section VI we discuss the kernel profiling results and the inferences drawn from them. In Section VII we conclude with our main observations.
II. BACKGROUND
Several studies have been done earlier on the performance of TCP/IP stack processing [1], [5], [6], [9], [10]. Copying and checksumming, among others, have usually been identified as expensive operations. Thus, zero-copy networking, integrated checksum and copying, header prediction [6], jumbo frame sizes [1], etc. are improvements that have been explored earlier. Another approach has been to offload the TCP/IP stack processing to a NIC with dedicated hardware [24].
Efforts have also been made to exploit the parallelism available in general-purpose machines themselves, by modifying the protocol stack implementation appropriately. Parallelizing approaches usually deal with trade-offs between balancing load across multiple processors and the overhead of maintaining shared data among these processors [3], [4], [19]. Some of the approaches that are known to work well include "processor per message" and "processor per connection" [23]. In the processor-per-message paradigm, each processor executes the whole protocol stack for one message (i.e., packet). With this approach, heavily used connections can be served efficiently; however, the connection state has to be shared between the processors. In the processor-per-connection paradigm, one processor handles all the messages belonging to a particular connection. This eliminates the connection-state sharing problem, but can suffer from uneven distribution of load. Other approaches include "processor per protocol" (each layer of the protocol stack is processed by a particular processor) and "processor per task" (each processor performs a specific task or function within a protocol). Both these approaches suffer from poor caching efficiency.
III. IMPROVEMENTS IN LINUX KERNEL 2.6
The Linux kernel 2.6 was a major upgrade from the earlier default kernel 2.4, with many performance improvements. In this section we discuss some of the changes made in kernel 2.6 which can have an impact on the performance of the networking subsystem.
Authorized licensed use limited to: INDIAN INSTITUTE OF TECHNOLOGY BOMBAY. Downloaded on April 29, 2009 at 00:47 from IEEE Xplore. Restrictions apply.