Benefits of I/O Acceleration Technology (I/OAT) in Clusters*

Karthikeyan Vaidyanathan and Dhabaleswar K. Panda
Computer Science and Engineering, The Ohio State University
{vaidyana, panda}@cse.ohio-state.edu

* This research is supported in part by NSF grants #CNS-0403342 and #CNS-0509452; DOE grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; grants from Intel, Mellanox, Cisco Systems, Linux Networx and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Apple, Appro, Dell, Microway, PathScale, IBM, SilverStorm and Sun Microsystems.

Abstract

Packet processing in the TCP/IP stack at multi-Gigabit data rates occupies a significant portion of the system overhead. Though there are several techniques to reduce the packet processing overhead on the sender side, the receiver side continues to remain a bottleneck. I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features particularly designed to reduce the receiver-side packet processing overhead. This paper studies the benefits of the I/OAT technology through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark evaluations show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

1 Introduction

Over the past few years, there has been an incredible growth of highly data-intensive applications in fields such as medical informatics, genomics, e-commerce, data mining and satellite weather image analysis. With current technology trends, the ability to store and share the datasets generated by these applications is also increasing, allowing scientists and institutions to create large dataset repositories and make them available for use by others. At the same time, clusters consisting of commodity off-the-shelf hardware components have become increasingly attractive as platforms for high-performance computation and scalable servers. Based on these two trends, researchers have demonstrated the feasibility and potential of cluster-based servers [14, 10, 18, 19]. Many clients simultaneously request such servers for either raw or processed data.
However, existing servers are becoming increasingly incapable of meeting such sky-rocketing processing demands with high performance and scalability. These servers rely on TCP/IP for data communication and typically use Gigabit Ethernet networks for cost-effective designs. The host-based TCP/IP protocols on such networks have high CPU utilization and low bandwidth, limiting the maximum capacity (in terms of requests handled per unit time). Alternatively, many servers use multiple Gigabit Ethernet networks to cope with the network traffic. However, at multi-Gigabit data rates, packet processing in the TCP/IP stack occupies a significant portion of the system overhead.

Packet processing [12, 13] usually involves manipulating the headers and moving the data through the TCP/IP stack. Though this does not require significant computation, processor time is wasted on delays caused by the latency of memory accesses and data movement operations. To overcome these overheads, researchers have proposed several techniques [9] such as TCP segmentation offload (TSO), jumbo frames, zero-copy data transfer (sendfile()), interrupt coalescing, etc. Unfortunately, many of these techniques are applicable only on the sender side, while the receiver side continues to remain a bottleneck in several cases, resulting in a huge performance gap between the CPU overheads of sending and receiving packets.

Intel's I/O Acceleration Technology (I/OAT) [1, 3, 2, 15] is a set of features that attempts to alleviate the receiver-side packet processing overheads. It adds three features, namely: (i) split headers, (ii) a DMA copy offload engine and (iii) multiple receive queues.

At this point, the following open questions arise:

• What kind of benefits can be expected from the current I/OAT architecture?
• How do these benefits translate to applications?

In this paper, we focus on the above questions. We first analyze the performance of I/OAT based on a detailed suite of micro-benchmarks. Next, we evaluate it in two different application domains:

• A multi-tier data-center environment
• The Parallel Virtual File System (PVFS)

Our micro-benchmark evaluations show that I/OAT reduces the overall CPU utilization significantly, by up to 38%, as compared to traditional communication (non-I/OAT). Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. Also, our results show that I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

The remainder of the paper is organized as follows: Section 2 provides a brief background on socket optimizations and the I/OAT architecture. In Section 3, we discuss the software infrastructures used, in particular the multi-tier data-center environment and the Parallel Virtual File System (PVFS). Section 4 presents micro-benchmark evaluations that study the ideal-case benefits of I/OAT over the native kernel implementation. Sections 5 and 6 present the benefits of I/OAT in a data-center and a PVFS environment, respectively. Section 7 discusses the advantages and disadvantages of I/OAT. We conclude the paper in Section 8.

2 Background

In this section, we present some of the socket optimization techniques in detail. Later, we provide a brief background of the I/OAT architecture and its features.

2.1 Socket Optimizations

As mentioned in Section 1, due to technological developments, significant performance gaps exist between the CPU overheads of sending and receiving packets. In particular, two techniques make the CPU usage on the sender side far lower than the CPU usage on the receiver side.

The first technique, TCP segmentation offload (TSO), allows the kernel to hand the network controller a buffer larger than the maximum transmission unit (MTU). The network controller segments the buffer into individual Ethernet frames and sends them on the wire. In the absence of this technique, the host CPU must break the large buffer into small frames and pass those frames to the network controller, which is more CPU intensive. The second technique, the sendfile() optimization, allows the kernel to perform a zero-copy operation. Using this technique, the kernel does not copy the user buffer to the network buffer; instead, it points to the pinned pages of the user buffer as the data source for transmission.
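To make the sendfile() optimization concrete, the following minimal sketch transmits a file over an already-connected TCP socket using the Linux sendfile(2) system call, avoiding the user-space copy of a read()/send() loop. It is an illustration under our own assumptions (the function name and the caller-supplied socket are ours), not the harness used in this paper.

    /* Minimal sketch: zero-copy file transmission with sendfile(2).
     * "sock" is assumed to be a connected TCP socket; error handling
     * is abbreviated for brevity. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t send_file_zero_copy(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t offset = 0;
        ssize_t total = 0;
        /* The kernel reads from the page cache and hands the pinned pages
         * to the network stack directly; no copy into a user buffer. */
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
            if (n <= 0)
                break;
            total += n;
        }
        close(fd);
        return total;
    }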
However, neither of these techniques is applicable to the receiver side. On the receiver side, the interrupt coalescing technique helps reduce the number of interrupts generated in the host system: it generates one interrupt for multiple packets rather than one interrupt for every single packet. However, this optimization improves performance only when the network is heavily loaded.

2.2 I/O Acceleration Technology (I/OAT) Overview

I/OAT [15] attempts to alleviate the bottlenecks mentioned in Section 1 by providing a set of features, namely: (i) split headers, (ii) asynchronous copy using a DMA engine and (iii) multiple receive queues.

2.2.1 Split Headers

As shown in Figure 1, the optimized TCP/IP stack of I/OAT implements the split-header feature. In TCP/IP-based communication, headers such as the TCP header, IP header and Ethernet header are attached to the application data for transmission. Typically, the network controller transfers the headers and the application data into a single buffer. With the split-header feature, however, the controller partitions the network data into headers and application data and copies them into two separate buffers. This allows the header and data to be optimally aligned. Since the headers are frequently accessed during network processing, this feature also results in better cache utilization by not polluting the CPU's cache with application data while the headers are accessed, thus increasing the locality of the incoming headers. For more information regarding this feature and its benefits, please refer to [15, 16].

[Figure 1. Intel's I/O Acceleration Technology (Courtesy Intel): an optimized TCP/IP stack with enhancements, balanced network processing with network-flow affinity across CPU cores, and enhanced direct memory access with asynchronous low-cost copy.]

2.2.2 Asynchronous Copy Using a DMA Engine

Most of the receive packet processing time [15] is spent copying data from the kernel buffer to the user buffer. In particular, the ratio of useful work done by the CPU to useless overheads, such as CPU stalls for memory accesses or data movement, decreases as we go from 1 Gbps links to 10 Gbps links [17]. Especially for data movement operations, rather than waiting for a memory copy to finish, the host CPU could process other pending packets while the copy is still in progress. I/OAT offloads this copy operation to an additional DMA engine, a dedicated device that performs memory copies. As a result, while the DMA engine performs the data copy, the CPU is free to process other pending packets. For more details regarding this feature and its implementation, please refer to [15, 17, 16].
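The DMA engine itself is programmed inside the kernel and is not exposed to applications through the stock sockets API. As a rough user-space analogue of the overlap it enables, the sketch below hands a memcpy() to a helper thread so that the caller can keep working while the copy proceeds; the async_copy_* names are our own illustration, not an I/OAT interface.

    /* Conceptual analogue of asynchronous copy offload: a helper thread
     * plays the role of the DMA engine. Illustration only; the real
     * I/OAT engine is driven from inside the kernel. */
    #include <pthread.h>
    #include <string.h>

    struct async_copy {
        pthread_t tid;
        void *dst;
        const void *src;
        size_t len;
    };

    static void *copy_worker(void *arg)
    {
        struct async_copy *c = arg;
        memcpy(c->dst, c->src, c->len);   /* stand-in for the DMA transfer */
        return NULL;
    }

    static int async_copy_start(struct async_copy *c, void *dst,
                                const void *src, size_t len)
    {
        c->dst = dst;
        c->src = src;
        c->len = len;
        return pthread_create(&c->tid, NULL, copy_worker, c);
    }

    static void async_copy_wait(struct async_copy *c)
    {
        pthread_join(c->tid, NULL);       /* copy is complete after this */
    }

A caller would invoke async_copy_start(), process other pending packets, and call async_copy_wait() before touching the destination buffer, mirroring how the kernel overlaps packet processing with the DMA copy.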


2.2.3 Multiple Receive Queues

Processing large packets is generally not CPU-intensive, whereas processing small packets can fully occupy the CPU. Even on a multi-CPU system, packet processing occurs on a single CPU (the CPU that handles the controller's interrupt). The multiple receive queue feature distributes packets among different receive queues and allows multiple CPUs to work on different packets at the same time. For more details regarding this feature, please refer to [15, 16]. Unfortunately, this feature is currently disabled on the Linux platform; hence, we could not measure its performance impact in our evaluation.

3 Software Infrastructure

We have carried out the evaluation of I/OAT on two different software infrastructures: a multi-tier data-center environment and the Parallel Virtual File System (PVFS). In this section, we discuss each of these in more detail.

3.1 Multi-Tier Data-Center Environment

More and more people use web interfaces for a wide range of services, and the scalability of these systems is a very important requirement. Upgrading to a single, more powerful server is no longer a feasible approach. In recent years, clusters consisting of low-cost commodity off-the-shelf hardware components have become increasingly attractive for hosting web services. Cluster architectures are characterized by low setup and maintenance costs, and they provide greater flexibility in reallocating resources among the tiers of a data-center to accommodate fluctuating load.

A typical multi-tier data-center [20, 8, 21] has a cluster of nodes, called the edge nodes, as its first tier. These nodes can be thought of as switches (up to layer 7) that provide services such as load balancing, security and caching; the main purpose of this tier is to increase the performance of the inner tiers. The next tiers are usually the web servers. These nodes, apart from serving static content, can fetch dynamic data from other sources and serve that data in a presentable form. The last tier of the data-center is the database tier, which stores persistent data and is usually I/O intensive. Figure 2a shows a typical data-center setup.

[Figure 2. Application Domains: (a) a multi-tier data-center environment (WAN clients, proxy servers in Tier 0, application servers in Tier 1, database servers and storage in Tier 2) and (b) a Parallel Virtual File System (PVFS) environment (compute nodes, I/O server nodes holding data, and a meta-data manager).]

A request from a client is received by the edge or proxy servers. The request is serviced from the cache if possible; otherwise it is forwarded to the web/application servers.
Static requests are serviced by the web servers by simply returning the requested file to the client via the edge server. This content may be cached at the edge server so that subsequent requests for the same static content can be served from the cache. The application tier handles dynamic content: any request that needs some value to be computed, searched, analyzed or stored must use this tier at some stage. The application servers may also need to spend time converting data into presentable formats. The back-end database servers are responsible for storing data persistently and responding to queries. These nodes are connected to a persistent storage system. Queries to the database systems range from a simple search for required data to join/aggregation/select operations on the data.

3.2 Parallel Virtual File System (PVFS)

The Parallel Virtual File System (PVFS) [7] is one of the leading parallel file systems for Linux clusters today. It is designed to meet the increasing I/O demands of parallel applications in cluster systems. Figure 2b shows a typical PVFS environment. As shown in the figure, a group of nodes in the cluster system can be configured as I/O servers and one of them (either an I/O server or a different node) as a meta-data manager. A node can also host computation while serving as an I/O node.

PVFS achieves high performance by striping files across a set of I/O server nodes, allowing parallel accesses to the data. It uses the native file system on the I/O servers to store the individual file stripes. An I/O daemon runs on each I/O node and services requests from the compute nodes, in particular read and write requests; data is transferred directly between the I/O servers and the compute nodes.

A manager daemon runs on the meta-data manager node. It handles meta-data operations involving file permissions, truncation, file stripe characteristics, and so on. Meta-data is also stored on the local file system. The meta-data manager provides a cluster-wide consistent name space to applications; it does not participate in read/write operations.

PVFS supports a set of feature-rich interfaces, including support for both contiguous and non-contiguous accesses to memory and files [11]. PVFS can be used with multiple APIs: a native API, the UNIX/POSIX API, MPI-IO [22], and an array I/O interface called the Multi-Dimensional Block Interface (MDBI). The presence of multiple popular interfaces has contributed to the wide adoption of PVFS in the industry.
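As a small illustration of the MPI-IO path, the sketch below has each MPI rank write its own contiguous 1 MB block of a shared file; with the ROMIO implementation of MPI-IO, such a file can reside on a PVFS volume. The mount point and file name are hypothetical.

    /* MPI-IO sketch: each rank writes one non-overlapping 1 MB block.
     * The PVFS mount path is a hypothetical example. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int block = 1 << 20;        /* 1 MB per rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(block);
        memset(buf, 'a' + (rank % 26), block);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Non-overlapping offsets give independent, parallel accesses,
         * which PVFS serves by striping across its I/O servers. */
        MPI_File_write_at(fh, (MPI_Offset)rank * block, buf, block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }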


4 I/OAT Micro-Benchmark Results

In this section, we compare the ideal-case performance benefits achievable by I/OAT against the native sockets implementation (non-I/OAT) using a set of micro-benchmarks. We use two testbeds for our experiments:

Testbed 1: Two nodes built around SuperMicro X7DB8+ motherboards, which include 64-bit 133 MHz PCI-X interfaces. Each node has two dual-core Intel 3.46 GHz processors with a 2 MB L2 cache. The machines are connected by three Intel PRO/1000 Gigabit adapters with two ports each, through a 24-port Netgear Gigabit Ethernet switch. We use the RedHat AS 4 Linux operating system with kernel version 2.6.9-30.

Testbed 2: A cluster of 44 nodes. Each node has two Intel Xeon 2.66 GHz processors with a 512 KB L2 cache and 2 GB of main memory.

For the experiments in Section 5, we use the nodes in Testbed 2 as clients and the nodes in Testbed 1 as servers. For all other experiments, we use only the nodes in Testbed 1. For experiments within Testbed 1, we create a separate VLAN for each network adapter on one node and a corresponding IP address within the same VLAN on the other node, to ensure an even distribution of network traffic. Throughout our experiments, we define the relative CPU benefit of I/OAT as follows: if a is the % CPU utilization of I/OAT and b is the % CPU utilization of non-I/OAT, the relative CPU benefit of I/OAT is (b − a)/b. For example, if I/OAT occupies 30% CPU and non-I/OAT occupies 60% CPU, the relative CPU benefit of I/OAT is 50%, though the absolute difference in CPU usage is only 30 percentage points.
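Expressed as code, the metric is a one-line helper; the sketch below (our own, for illustration) reproduces the 30%/60% example.

    /* Relative CPU benefit of I/OAT: (b - a) / b, where a and b are the
     * percentage CPU utilizations of I/OAT and non-I/OAT, respectively. */
    #include <stdio.h>

    static double relative_cpu_benefit(double ioat_cpu, double non_ioat_cpu)
    {
        return (non_ioat_cpu - ioat_cpu) / non_ioat_cpu;
    }

    int main(void)
    {
        /* 30% vs 60% CPU -> prints 50% */
        printf("%.0f%%\n", 100.0 * relative_cpu_benefit(30.0, 60.0));
        return 0;
    }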
4.1 Bandwidth and Bi-directional Bandwidth

[Figure 3. Micro-Benchmarks: (a) Bandwidth and (b) Bi-directional Bandwidth — throughput (Mbps) and average CPU utilization versus number of ports (1 to 6) for I/OAT and non-I/OAT.]

Figure 3a shows the bandwidth achieved by I/OAT and non-I/OAT with an increasing number of network ports. We use the standard ttcp benchmark [4] to measure the bandwidth. As the number of ports increases, we expect the bandwidth to increase. As shown in Figure 3a, the bandwidth achieved by I/OAT is similar to that of non-I/OAT as the number of ports increases; the maximum bandwidth achieved is close to 5635 Mbps with six network ports. However, we see a difference in receiver-side CPU utilization. The CPU utilization is lower for I/OAT than for non-I/OAT with three network ports, and this difference grows as the number of ports increases from three to six. For a six-port configuration, non-I/OAT occupies close to 37% of the CPU while I/OAT occupies only 29%; the relative benefit achieved by I/OAT in this case is close to 21%.

In the bi-directional bandwidth test, we use two machines and 2*N threads on each machine, with N threads acting as servers and the other N threads as clients. Each thread on one machine has a connection to exactly one thread on the other machine, and the client threads connect to the server threads on the other machine; thus 2*N connections are established between the two machines. On each connection, the basic bandwidth test is performed using the ttcp benchmark, and the aggregate bandwidth achieved by all threads is reported as the bi-directional bandwidth, shown in Figure 3b. In our experiments, N equals the number of network ports. The maximum bi-directional bandwidth is close to 9600 Mbps. Also, I/OAT shows an improvement in CPU utilization with as few as two ports, and this improvement increases with the number of ports. With six network ports, non-I/OAT occupies close to 90% of the CPU whereas I/OAT occupies only 70%; the relative CPU benefit achieved by I/OAT is close to 22%.
This trend also suggests that with one or two more network ports added to this configuration, non-I/OAT may no longer deliver the best network throughput in comparison with I/OAT, since non-I/OAT may end up occupying 100% of the CPU.

4.2 Multi-Stream Bandwidth

The multi-stream bandwidth test is very similar to the bi-directional bandwidth test above, except that one machine acts only as a server and the other only as a client. We use two machines and N threads on each machine; each thread on one machine has a connection to exactly one thread on the other machine. On each connection, the basic bandwidth test is performed, and the aggregate bandwidth achieved by all threads is reported as the multi-stream bandwidth.


As shown in Figure 4, the bandwidth achieved by I/OAT is similar to that of non-I/OAT as the number of threads increases. However, when the number of threads reaches 120, we see a degradation in the performance of non-I/OAT, whereas I/OAT consistently shows no degradation in network bandwidth. Further, the CPU utilization of non-I/OAT also increases with the number of threads. With 120 threads in the system, non-I/OAT occupies close to 76% of the CPU whereas I/OAT occupies only 52%, a 24 percentage point absolute benefit in CPU utilization; the relative CPU benefit achieved by I/OAT in this case is close to 32%.

[Figure 4. Multi-Stream Bandwidth — throughput (Mbps) and average CPU utilization versus number of threads (6 to 120) for I/OAT and non-I/OAT.]

4.3 Bandwidth and Bi-directional Bandwidth with Socket Optimizations

As mentioned in Section 2, several sender-side optimizations exist to reduce the packet processing overhead and improve network performance. In this experiment, we consider three such existing optimizations: (i) large socket buffer sizes (100 MB), (ii) TCP segmentation offload (TSO) and (iii) jumbo frames. We aim to understand the impact of each of these optimizations on throughput and on receiver-side CPU utilization. Case 1 uses the default socket options without any optimization. In Case 2, we increase the socket buffer size to 100 MB. In Case 3, we additionally enable segmentation offload (TSO) so that the host CPU is relieved from fragmenting large packets. In Case 4, in addition to the previous optimizations, we increase the MTU size to 2048 bytes so that larger packets are sent over the network. Finally, in Case 5, we add the interrupt coalescing feature on top of the sender-side optimizations. Performance numbers with the various socket optimizations are shown in Figure 5.

[Figure 5. Optimizations: (a) Bandwidth and (b) Bi-directional Bandwidth — throughput (Mbps) and relative CPU benefit of I/OAT for Cases 1 through 5.]
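Of these settings, only the socket buffer size (Case 2) is requested from within the application, as sketched below; TSO, the MTU and interrupt coalescing are adapter-level knobs that are typically configured outside the application (for example with ethtool and ifconfig/ip). The sketch is our own illustration, and the kernel may clamp the requested size to system-wide limits such as net.core.rmem_max.

    /* Sketch for Case 2: request large (100 MB) socket buffers on an
     * existing TCP socket. The kernel may silently clamp the value. */
    #include <sys/socket.h>
    #include <stdio.h>

    static void set_large_buffers(int sock)
    {
        int size = 100 * 1024 * 1024;    /* 100 MB, as in Case 2 */

        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            perror("SO_RCVBUF");
    }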
In the bandwidth test, shown in Figure 5a, we observe two interesting trends. First, as we add socket optimizations, the aggregate bandwidth increases. Second, the performance of I/OAT is consistently better than that of non-I/OAT; in Case 5, which includes all socket optimizations, I/OAT achieves close to 5586 Mbps whereas non-I/OAT achieves only 5514 Mbps. More importantly, there is a significant improvement in CPU utilization. As shown in Figure 5a, the relative CPU benefit of I/OAT increases as the socket optimizations are added; in Case 4, I/OAT achieves a 30% relative CPU benefit in comparison with non-I/OAT.

We see similar trends in the bi-directional bandwidth test, shown in Figure 5b. Further, the relative benefits achieved by I/OAT are much higher than in the bandwidth experiment: in Case 4, I/OAT achieves close to a 38% relative CPU benefit as compared to non-I/OAT.

In summary, the micro-benchmark results show a significant improvement in CPU utilization and network performance for I/OAT. Since I/OAT in Linux has two active features, the split header and the DMA copy engine, it is also important to understand the benefits attributable to each of these features individually. In the following sections, we conduct several experiments to show the individual benefits.

4.4 Benefits of the Asynchronous DMA Copy Engine

In this section, we isolate the DMA engine feature of I/OAT and show the benefits of an asynchronous DMA copy engine. We compare the performance of the copy engine with that of traditional CPU-based copy and show its benefits in terms of performance and overlap efficiency. For CPU-based copy, we use the standard memcpy utility.

Figure 6 compares the cost of performing a copy using the CPU and I/OAT's DMA engine. The copy-cache bars denote the performance of CPU-based copy with the source and destination buffers in the cache, and the copy-nocache bars denote the performance with the source and destination buffers not in the cache. The DMA-copy bars denote the total time taken to perform the copy using the copy engine, and the DMA-overhead bars show the startup overhead of initiating the copy using the DMA engine. The Overlap line denotes the percentage of DMA copy time that can be overlapped with other computation. As shown in Figure 6, the performance of the DMA-based copy approach (DMA-copy) is better than that of the CPU-based copy approach (copy-nocache) for message sizes greater than 8 KB.

[Figure 6. Copy Performance: CPU vs DMA — copy time (usecs) and overlap percentage versus message size (1 KB to 64 KB) for copy-cache, copy-nocache, DMA-copy and DMA-overhead.]
Further, we observe that the percentage of overlap increases with increasing message size, reaching up to 93% for a 64 KB message. However, if the source and destination buffers are in the cache, the CPU-based copy performs much better than the DMA-based copy approach. Still, since a DMA-based copy can be overlapped with the processing of other packets, the CPU incurs only the DMA startup overhead, and as observed in Figure 6, this startup overhead is much smaller than the time taken by the CPU-based copy. Thus, the DMA copy engine can be useful even when the source and destination buffers are in the cache.
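The copy-cache/copy-nocache distinction can be approximated with a small timing harness: time memcpy() once with warm caches, then again after sweeping a buffer larger than the 2 MB L2 cache to evict the operands. The sketch below is our own approximation of such a measurement, not the authors' harness.

    /* Sketch: time memcpy() with warm vs (approximately) cold caches.
     * Eviction is approximated by sweeping a buffer larger than L2. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define MSG   (64 * 1024)          /* 64 KB message */
    #define SWEEP (8 * 1024 * 1024)    /* larger than the 2 MB L2 cache */

    static double elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        char *src = malloc(MSG), *dst = malloc(MSG), *sweep = malloc(SWEEP);
        struct timespec t0, t1;

        memset(src, 1, MSG);
        memcpy(dst, src, MSG);                  /* warm the caches */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, MSG);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy-cache:   %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        memset(sweep, 2, SWEEP);                /* evict src and dst */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, MSG);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy-nocache: %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        free(src); free(dst); free(sweep);
        return 0;
    }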


4.5 Breakdown of I/OAT Features

In this section, we break down the two active features of I/OAT and show their benefits in terms of throughput and CPU utilization. We first measure the performance of non-I/OAT, without the DMA engine and split-header features. Next, we measure the performance with the DMA engine feature enabled (denoted I/OAT-DMA in Figure 7) and compare it with the performance of non-I/OAT; we report the difference as the DMA engine benefit. Finally, we measure the performance with both the split-header and DMA engine features enabled (denoted I/OAT-SPLIT in Figure 7) and compare it with the performance of I/OAT with only the DMA engine; we report the difference as the split-header benefit.

We design the experiment in the following way. We use the two nodes from Testbed 1, one as a server and the other as a client. We use two Intel network adapters with two ports each on both the server and the client node. Since we have four network ports on each node, we use four clients and perform a bandwidth test with four server threads. We first measure the bandwidth for small message sizes for all three cases, i.e., non-I/OAT, I/OAT-DMA and I/OAT-SPLIT, and report the relative CPU benefits. As shown in Figure 7a, the DMA engine feature achieves close to a 16% relative CPU benefit compared to the non-I/OAT case. This improvement is seen for all message sizes from 16 KB to 128 KB. However, we do not observe any throughput improvement at these message sizes. Also, the split-header feature of I/OAT does not improve the CPU utilization or the throughput for these message sizes.

As mentioned in Section 2, the split-header feature of I/OAT increases the locality of incoming headers and helps improve cache utilization by not polluting the CPU's cache with application buffers during network processing. To expose these cache pollution effects, we repeat the previous experiment with large message sizes that cannot fit in the system cache. Figure 7b shows the throughput benefits achieved by the I/OAT features for large message sizes. We observe that the split-header feature can achieve up to a 26% throughput benefit when transferring 1 MB of application data. In this case, the server with I/OAT capability receives a total of 4 MB of application data from the four clients; since the cache size is only 2 MB, the application data clearly does not fit in the system cache. We see a large improvement for the split-header feature because the CPU's cache is not polluted with application data.
In addition, we see that the benefit achieved by the split-header feature decreases as the message size increases further.

5 Data-Center Performance Evaluation

In this section, we analyze the performance of a two-tier data-center environment with I/OAT and compare it with non-I/OAT. For all experiments in this section, we use the nodes in Testbed 1 (described in Section 4) for the data-center tiers. For the client nodes, we use the nodes in Testbed 2 for most of the experiments; we notify the reader at the appropriate points in the remaining sections when other nodes are used as clients.

5.1 Evaluation Methodology

As mentioned in Section 2, I/OAT is a server architecture geared toward improving receiver-side performance. In other words, I/OAT can be deployed in a data-center environment so that client requests coming over the Wide Area Network (WAN) seamlessly take advantage of I/OAT and get serviced much faster. Further, the communication between the tiers inside the data-center, such as the proxy tier and the application tier, can be greatly enhanced using I/OAT, thus improving the overall data-center performance and scalability.

We set up a two-tier data-center testbed to determine the performance characteristics of I/OAT and non-I/OAT. The first tier consists of the front-end proxies, for which we use the proxy module of Apache 2.2.0. The second tier consists of the web server module of Apache serving static requests. The two tiers of the data-center reside on a 1 Gigabit Ethernet network, and the clients are connected to the data-center over a 1 Gigabit Ethernet connection.

Typical data-center workloads have a wide range of characteristics. Workloads may vary from high to low temporal locality, following a Zipf-like distribution [6]. Similarly, workloads vary from small documents (e.g., online book stores, browsing sites) to large documents (e.g., download sites). Further, workloads may contain requests for simple cacheable static or time-invariant content, or for more complex dynamic or time-variant content via CGI, PHP and Java servlets with a back-end database. Due to these varying characteristics, we classify the workloads into three categories.


[Figure 10. PVFS Concurrent Read Performance: (a) Six I/O Servers and (b) Five I/O Servers — throughput (MB/s) and relative CPU benefit of I/OAT versus number of clients.]

[Figure 11. PVFS Concurrent Write Performance: (a) Six I/O Servers and (b) Five I/O Servers — throughput (MB/s) and relative CPU benefit of I/OAT versus number of clients.]

[...] 697 MB/s. With I/OAT, the write bandwidth performance increases from 460 MB/s to 750 MB/s with an increasing number of compute nodes, i.e., I/OAT achieves an 8% throughput benefit as compared to non-I/OAT. Further, I/OAT achieves close to a 5% CPU benefit as compared to non-I/OAT using six compute nodes. We see similar trends when we decrease the number of I/O servers, as shown in Figure 11b.

6.2.2 PVFS Read Performance with Multiple Streams

In this experiment, we emulate a scenario where multiple clients access the file system. We increase the number of clients from 1 to 64 and study the behavior of concurrent read performance in PVFS. As shown in Figure 12, the performance of I/OAT is either equal to or greater than that of non-I/OAT. Note that we report the CPU utilization measured on the client side, which also has the I/OAT capability. Because a client with I/OAT can receive data at a much faster rate, the clients can also issue PVFS read requests at a faster rate than with non-I/OAT. As a result, we observe that the CPU utilization on the client side is around 10-12% higher than in the non-I/OAT case.

[Figure 12. Multi-Stream PVFS Read Performance — throughput (MB/s) and client-side average CPU utilization versus number of emulated clients (1 to 64).]

In summary, PVFS with I/OAT shows a 12% improvement for concurrent reads and an 8% improvement for concurrent writes as compared to PVFS with non-I/OAT.
In terms of CPU utilization, PVFS with I/OAT achieves a 15% benefit for concurrent reads and a 7% benefit for concurrent writes as compared to PVFS with non-I/OAT.

7 Discussion

As mentioned earlier, I/OAT helps improve network performance and reduce CPU utilization through features such as the asynchronous copy engine and split headers. The asynchronous copy engine can also be used by user applications to improve the performance of memory operations and the communication performance between two processes within the same node. Since the copy engine can operate in parallel with the host CPU, applications can achieve significant benefits by overlapping useful computation with memory operations. With the introduction of chip-level multiprocessing (CMP), per-core memory bandwidth drops drastically, increasing the need for such asynchronous memory operations as a memory-latency hiding mechanism. Aron and Druschel have proposed soft-timer techniques to reduce receiver-side processing [5]; I/OAT can co-exist with this technique to further reduce receiver-side overheads and improve performance.

On the other hand, I/OAT requires special network adapters to support the split-header and multiple receive queue features. In addition, the host system must have a built-in copy engine to support the asynchronous copy feature. Also, the memory controller uses physical addresses; hence, pages cannot be swapped out during a copy operation. As a result, the physical pages must be pinned before initiating the copy operation. Due to this page-pinning requirement, the usefulness of the copy engine becomes questionable if the pinning cost exceeds the copy cost.
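This trade-off can be sanity-checked from user space: the sketch below times mlock() on a buffer against a plain memcpy() of the same buffer. It is only an illustration of the pinning-versus-copy comparison (the kernel's actual pinning path differs), and mlock() may require raised RLIMIT_MEMLOCK limits or privileges.

    /* Sketch: compare the cost of pinning a buffer (mlock) with the cost
     * of simply copying it. If pinning dominates, copy offload loses. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    static double elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        const size_t len = 1 << 20;            /* 1 MB buffer */
        char *buf = malloc(len), *dst = malloc(len);
        struct timespec t0, t1;

        memset(buf, 1, len);                   /* touch all pages first */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (mlock(buf, len) != 0)              /* pin: may need privileges */
            perror("mlock");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("pin:  %.1f us\n", elapsed_us(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, buf, len);                 /* the copy being offloaded */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy: %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        munlock(buf, len);
        free(buf);
        free(dst);
        return 0;
    }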


8 Concluding Remarks and Future Work

I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features particularly designed to reduce packet processing overheads on the receiver side. This paper studied the benefits of the I/OAT technology through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark results show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

We are currently working on two broad aspects of I/OAT. First, we are looking at modifying applications to realize the true benefits of the I/OAT capability. Second, we are trying to expose an asynchronous memory copy operation to user applications. Though this involves some overhead (such as context switches and user page locking), asynchronous memory copy can help applications use their CPU cycles intelligently.

9 Acknowledgments

We would like to thank Intel Corporation for providing access to I/OAT-based systems and the kernel patch that includes all of the I/OAT features. We would also like to thank Eric Simpson from the I/OAT team at Intel Corporation for providing crucial insights into the implementation details.

References

[1] Accelerating High-Speed Networking with Intel I/O Acceleration Technology. http://www.intel.com/technology/ioacceleration/306517.pdf.
[2] Increasing Network Speeds: Technology@Intel. http://www.intel.com/technology/magazine/communications/intelioat-0305.htm.
[3] Intel I/O Acceleration Technology. http://www.intel.com/technology/ioacceleration/306484.pdf.
[4] USNA. TTCP: A Test of TCP and UDP Performance, August 2001.
[5] Mohit Aron and Peter Druschel. Soft timers: Efficient microsecond software timer support for network processing. ACM Transactions on Computer Systems, 18(3):197–228, 2000.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126–134, 1999.
[7] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta, GA, 2000. USENIX Association.
[8] A. Chandra, W. Gong, and P. Shenoy. Dynamic resource allocation for shared data centers using online measurements. In Proceedings of ACM SIGMETRICS 2003, San Diego, CA, June 2003.
[9] J. Chase, A. Gallatin, and K. Yocum. End system optimizations for high-speed TCP. IEEE Communications Magazine, 39(4):68–74, 2001.
[10] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centres. In Symposium on Operating Systems Principles, 2001.
[11] Avery Ching, Alok Choudhary, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O through PVFS. In Proceedings of the IEEE International Conference on Cluster Computing, 2002.
[12] David D. Clark, Van Jacobson, John Romkey, and H. Salwen. An analysis of TCP processing overhead, 1989.
[13] Annie Foong et al. TCP performance analysis revisited. In IEEE International Symposium on Performance Analysis of Systems and Software, March 2003.
[14] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Symposium on Operating Systems Principles, 1997.
[15] Andrew Grover and Christopher Leech. Accelerating network receive processing. http://linux.inet.hr/files/ols2005/grover-reprint.pdf.
[16] S. Makineni and R. Iyer. Architectural characterization of TCP/IP packet processing on the Pentium M microprocessor. In High Performance Computer Architecture (HPCA-10), 2004.
[17] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong. TCP onloading for data center servers. IEEE Computer, November 2004.
[18] John Reumann, Ashish Mehra, Kang G. Shin, and Dilip Kandlur. Virtual services: A new abstraction for server consolidation. In Proceedings of the USENIX 2000 Technical Conference, June 2000.
[19] Yasushi Saito, Brian N. Bershad, and Henry M. Levy. Manageability, availability and performance in Porcupine: A highly scalable, cluster-based mail service. In Symposium on Operating Systems Principles, 1999.
[20] H. V. Shah, D. B. Minturn, A. Foong, G. L. McAlpine, R. S. Madukkarumukumana, and G. J. Regnier. CSP: A novel system architecture for scalable Internet and communication services. In Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems, pages 61–72, San Francisco, CA, March 2001.
[21] K. Shen, H. Tang, T. Yang, and L. Chu. Integrated resource management for cluster-based Internet services. In Proceedings of the Fifth USENIX Symposium on Operating Systems Design and Implementation (OSDI '02), pages 225–238, Boston, MA, December 2002.
[22] Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high performance. In Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pages 23–32. ACM Press, May 1999.
