Benefits of I/O Acceleration Technology (I/OAT) in Clusters*

Karthikeyan Vaidyanathan and Dhabaleswar K. Panda
Computer Science and Engineering, The Ohio State University
{vaidyana, panda}@cse.ohio-state.edu

* This research is supported in part by NSF grants #CNS-0403342 and #CNS-0509452; DOE grants #DE-FC02-06ER25749 and #DE-FC02-06ER25755; grants from Intel, Mellanox, Cisco Systems, Linux Networx and Sun Microsystems; and equipment donations from Intel, Mellanox, AMD, Apple, Appro, Dell, Microway, PathScale, IBM, SilverStorm and Sun Microsystems.

Abstract

Packet processing in the TCP/IP stack at multi-Gigabit data rates occupies a significant portion of the system overhead. Though there are several techniques to reduce the packet processing overhead on the sender side, the receiver side continues to remain a bottleneck. I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features particularly designed to reduce the receiver-side packet processing overhead. This paper studies the benefits of the I/OAT technology through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark evaluations show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

1 Introduction

Over the past few years, there has been an incredible growth of highly data-intensive applications in fields such as medical informatics, genomics, e-commerce, data mining and satellite weather image analysis. With current technology trends, the ability to store and share the datasets generated by these applications is also increasing, allowing scientists and institutions to create large dataset repositories and make them available for use by others. At the same time, clusters consisting of commodity off-the-shelf hardware components have become increasingly attractive as platforms for high-performance computation and scalable servers. Based on these two trends, researchers have demonstrated the feasibility and potential of cluster-based servers [14, 10, 18, 19]. Many clients simultaneously request such servers for either raw or processed data.
However, existing servers are becoming increasingly incapable of meeting such sky-rocketing processing demands with high performance and scalability. These servers rely on TCP/IP for data communication and typically use Gigabit Ethernet networks for cost-effective designs. The host-based TCP/IP protocols on such networks have high CPU utilization and low bandwidth, limiting the maximum capacity (in terms of requests handled per unit time). Alternatively, many servers use multiple Gigabit Ethernet networks to cope with the network traffic. However, at multi-Gigabit data rates, packet processing in the TCP/IP stack occupies a significant portion of the system overhead.

Packet processing [12, 13] usually involves manipulating the headers and moving the data through the TCP/IP stack. Though this does not require significant computation, processor time is wasted on delays caused by the latency of memory accesses and data movement operations. To overcome these overheads, researchers have proposed several techniques [9] such as TCP segmentation offload (TSO), jumbo frames, zero-copy data transfer (sendfile()), interrupt coalescing, etc. Unfortunately, many of these techniques are applicable only on the sender side, while the receiver side continues to remain a bottleneck in several cases, resulting in a huge performance gap between the CPU overheads of sending and receiving packets.

Intel's I/O Acceleration Technology (I/OAT) [1, 3, 2, 15] is a set of features that attempts to alleviate the receiver-side packet processing overheads. It adds three features, namely: (i) split headers, (ii) a DMA copy offload engine and (iii) multiple receive queues.

At this point, the following open questions arise:

• What kind of benefits can be expected from the current I/OAT architecture?
• How do these benefits translate to applications?

In this paper, we focus on the above questions. We first analyze the performance of I/OAT based on a detailed suite of micro-benchmarks. Next, we evaluate it in two different application domains:

• A multi-tier data-center environment
• The Parallel Virtual File System (PVFS)

Our micro-benchmark evaluations show that I/OAT reduces the overall CPU utilization significantly, by up to 38%, as compared to traditional communication (non-I/OAT). Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. Also, our results show that I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

The remainder of the paper is organized as follows: Section 2 provides a brief background on socket optimizations and the I/OAT architecture. In Section 3, we discuss the software infrastructures used, in particular the multi-tier data-center environment and the Parallel Virtual File System (PVFS). Section 4 presents micro-benchmark evaluations that study the ideal-case benefits of I/OAT over the native kernel implementation. Sections 5 and 6 present the benefits of I/OAT in a data-center and a PVFS environment, respectively. Section 7 discusses the advantages and disadvantages of I/OAT. We conclude the paper in Section 8.

2 Background

In this section, we present some of the socket optimization techniques in detail. Later, we provide a brief background of the I/OAT architecture and its features.

2.1 Socket Optimizations

As mentioned in Section 1, due to technological developments, significant performance gaps exist between the CPU overheads of sending and receiving packets. In particular, two techniques make the CPU usage on the sender side far lower than the CPU usage on the receiver side.

The first technique, TCP segmentation offload (TSO), allows the kernel to hand the network controller a buffer larger than the maximum transmission unit (MTU). The network controller segments the buffer into individual Ethernet frames and sends them on the wire. In the absence of this technique, the host CPU must break the large buffer into small frames and pass those frames to the network controller, which is more CPU intensive. The second technique, the sendfile() optimization, allows the kernel to perform a zero-copy operation. Using this technique, the kernel does not copy the user buffer to the network buffer; instead, it points to the pinned pages of the user buffer as the data source for transmission.
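To make the sendfile() optimization concrete, the following minimal sketch transmits a file over an already-connected TCP socket using the Linux sendfile(2) system call, avoiding the user-space copy of a read()/send() loop. It is an illustration under our own assumptions (the function name and the caller-supplied socket are ours), not the harness used in this paper.

    /* Minimal sketch: zero-copy file transmission with sendfile(2).
     * "sock" is assumed to be a connected TCP socket; error handling
     * is abbreviated for brevity. */
    #include <sys/sendfile.h>
    #include <sys/stat.h>
    #include <fcntl.h>
    #include <unistd.h>

    ssize_t send_file_zero_copy(int sock, const char *path)
    {
        int fd = open(path, O_RDONLY);
        if (fd < 0)
            return -1;

        struct stat st;
        if (fstat(fd, &st) < 0) {
            close(fd);
            return -1;
        }

        off_t offset = 0;
        ssize_t total = 0;
        /* The kernel reads from the page cache and hands the pinned pages
         * to the network stack directly; no copy into a user buffer. */
        while (offset < st.st_size) {
            ssize_t n = sendfile(sock, fd, &offset, st.st_size - offset);
            if (n <= 0)
                break;
            total += n;
        }
        close(fd);
        return total;
    }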
However, neither of these techniques is applicable to the receiver side. On the receiver side, the interrupt coalescing technique helps reduce the number of interrupts generated in the host system: it generates one interrupt for multiple packets rather than one interrupt for every single packet. However, this optimization improves performance only when the network is heavily loaded.

2.2 I/O Acceleration Technology (I/OAT) Overview

I/OAT [15] attempts to alleviate the bottlenecks mentioned in Section 1 by providing a set of features, namely: (i) split headers, (ii) asynchronous copy using a DMA engine and (iii) multiple receive queues.

2.2.1 Split Headers

As shown in Figure 1, the optimized TCP/IP stack of I/OAT implements the split-header feature. In TCP/IP-based communication, headers such as the TCP header, IP header and Ethernet header are attached to the application data for transmission. Typically, the network controller transfers the headers and the application data into a single buffer. With the split-header feature, however, the controller partitions the network data into headers and application data and copies them into two separate buffers. This allows the header and data to be optimally aligned. Since the headers are frequently accessed during network processing, this feature also results in better cache utilization by not polluting the CPU's cache with application data while the headers are accessed, thus increasing the locality of the incoming headers. For more information regarding this feature and its benefits, please refer to [15, 16].

[Figure 1. Intel's I/O Acceleration Technology (Courtesy Intel): an optimized TCP/IP stack with enhancements, balanced network processing with network-flow affinity across CPU cores, and enhanced direct memory access with asynchronous low-cost copy.]

2.2.2 Asynchronous Copy Using a DMA Engine

Most of the receive packet processing time [15] is spent copying data from the kernel buffer to the user buffer. In particular, the ratio of useful work done by the CPU to useless overheads, such as CPU stalls for memory accesses or data movement, decreases as we go from 1 Gbps links to 10 Gbps links [17]. Especially for data movement operations, rather than waiting for a memory copy to finish, the host CPU could process other pending packets while the copy is still in progress. I/OAT offloads this copy operation to an additional DMA engine, a dedicated device that performs memory copies. As a result, while the DMA engine performs the data copy, the CPU is free to process other pending packets. For more details regarding this feature and its implementation, please refer to [15, 17, 16].
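The DMA engine itself is programmed inside the kernel and is not exposed to applications through the stock sockets API. As a rough user-space analogue of the overlap it enables, the sketch below hands a memcpy() to a helper thread so that the caller can keep working while the copy proceeds; the async_copy_* names are our own illustration, not an I/OAT interface.

    /* Conceptual analogue of asynchronous copy offload: a helper thread
     * plays the role of the DMA engine. Illustration only; the real
     * I/OAT engine is driven from inside the kernel. */
    #include <pthread.h>
    #include <string.h>

    struct async_copy {
        pthread_t tid;
        void *dst;
        const void *src;
        size_t len;
    };

    static void *copy_worker(void *arg)
    {
        struct async_copy *c = arg;
        memcpy(c->dst, c->src, c->len);   /* stand-in for the DMA transfer */
        return NULL;
    }

    static int async_copy_start(struct async_copy *c, void *dst,
                                const void *src, size_t len)
    {
        c->dst = dst;
        c->src = src;
        c->len = len;
        return pthread_create(&c->tid, NULL, copy_worker, c);
    }

    static void async_copy_wait(struct async_copy *c)
    {
        pthread_join(c->tid, NULL);       /* copy is complete after this */
    }

A caller would invoke async_copy_start(), process other pending packets, and call async_copy_wait() before touching the destination buffer, mirroring how the kernel overlaps packet processing with the DMA copy.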


2.2.3 Multiple Receive Queues

Processing large packets is generally not CPU-intensive, whereas processing small packets can fully occupy the CPU. Even on a multi-CPU system, packet processing occurs on a single CPU (the CPU that handles the controller's interrupt). The multiple receive queue feature distributes packets among different receive queues and allows multiple CPUs to work on different packets at the same time. For more details regarding this feature, please refer to [15, 16]. Unfortunately, this feature is currently disabled on the Linux platform; hence, we could not measure its performance impact in our evaluation.

3 Software Infrastructure

We have carried out the evaluation of I/OAT on two different software infrastructures: a multi-tier data-center environment and the Parallel Virtual File System (PVFS). In this section, we discuss each of these in more detail.

3.1 Multi-Tier Data-Center Environment

More and more people use web interfaces for a wide range of services, and the scalability of these systems is a very important requirement. Upgrading to a single, more powerful server is no longer a feasible approach. In recent years, clusters consisting of low-cost commodity off-the-shelf hardware components have become increasingly attractive for hosting web services. Cluster architectures are characterized by low setup and maintenance costs, and they provide greater flexibility in reallocating resources among the tiers of a data-center to accommodate fluctuating load.

A typical multi-tier data-center [20, 8, 21] has a cluster of nodes, called the edge nodes, as its first tier. These nodes can be thought of as switches (up to layer 7) that provide services such as load balancing, security and caching; the main purpose of this tier is to increase the performance of the inner tiers. The next tiers are usually the web servers. These nodes, apart from serving static content, can fetch dynamic data from other sources and serve that data in a presentable form. The last tier of the data-center is the database tier, which stores persistent data and is usually I/O intensive. Figure 2a shows a typical data-center setup.

[Figure 2. Application Domains: (a) a multi-tier data-center environment (WAN clients, proxy servers in Tier 0, application servers in Tier 1, database servers and storage in Tier 2) and (b) a Parallel Virtual File System (PVFS) environment (compute nodes, I/O server nodes holding data, and a meta-data manager).]

A request from a client is received by the edge or proxy servers. The request is serviced from the cache if possible; otherwise it is forwarded to the web/application servers.
Static requests are serviced by the web servers by simply returning the requested file to the client via the edge server. This content may be cached at the edge server so that subsequent requests for the same static content can be served from the cache. The application tier handles dynamic content: any request that needs some value to be computed, searched, analyzed or stored must use this tier at some stage. The application servers may also need to spend time converting data into presentable formats. The back-end database servers are responsible for storing data persistently and responding to queries. These nodes are connected to a persistent storage system. Queries to the database systems range from a simple search for required data to join/aggregation/select operations on the data.

3.2 Parallel Virtual File System (PVFS)

The Parallel Virtual File System (PVFS) [7] is one of the leading parallel file systems for Linux clusters today. It is designed to meet the increasing I/O demands of parallel applications in cluster systems. Figure 2b shows a typical PVFS environment. As shown in the figure, a group of nodes in the cluster system can be configured as I/O servers and one of them (either an I/O server or a different node) as a meta-data manager. A node can also host computation while serving as an I/O node.

PVFS achieves high performance by striping files across a set of I/O server nodes, allowing parallel accesses to the data. It uses the native file system on the I/O servers to store the individual file stripes. An I/O daemon runs on each I/O node and services requests from the compute nodes, in particular read and write requests; data is transferred directly between the I/O servers and the compute nodes.

A manager daemon runs on the meta-data manager node. It handles meta-data operations involving file permissions, truncation, file stripe characteristics, and so on. Meta-data is also stored on the local file system. The meta-data manager provides a cluster-wide consistent name space to applications; it does not participate in read/write operations.

PVFS supports a set of feature-rich interfaces, including support for both contiguous and non-contiguous accesses to memory and files [11]. PVFS can be used with multiple APIs: a native API, the UNIX/POSIX API, MPI-IO [22], and an array I/O interface called the Multi-Dimensional Block Interface (MDBI). The presence of multiple popular interfaces has contributed to the wide adoption of PVFS in the industry.
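As a small illustration of the MPI-IO path, the sketch below has each MPI rank write its own contiguous 1 MB block of a shared file; with the ROMIO implementation of MPI-IO, such a file can reside on a PVFS volume. The mount point and file name are hypothetical.

    /* MPI-IO sketch: each rank writes one non-overlapping 1 MB block.
     * The PVFS mount path is a hypothetical example. */
    #include <mpi.h>
    #include <stdlib.h>
    #include <string.h>

    int main(int argc, char **argv)
    {
        int rank;
        const int block = 1 << 20;        /* 1 MB per rank */

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        char *buf = malloc(block);
        memset(buf, 'a' + (rank % 26), block);

        MPI_File fh;
        MPI_File_open(MPI_COMM_WORLD, "/mnt/pvfs/demo.dat",
                      MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);

        /* Non-overlapping offsets give independent, parallel accesses,
         * which PVFS serves by striping across its I/O servers. */
        MPI_File_write_at(fh, (MPI_Offset)rank * block, buf, block,
                          MPI_BYTE, MPI_STATUS_IGNORE);

        MPI_File_close(&fh);
        free(buf);
        MPI_Finalize();
        return 0;
    }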


4 I/OAT Micro-Benchmark Results

In this section, we compare the ideal-case performance benefits achievable by I/OAT against the native sockets implementation (non-I/OAT) using a set of micro-benchmarks. We use two testbeds for our experiments:

Testbed 1: Two nodes built around SuperMicro X7DB8+ motherboards, which include 64-bit 133 MHz PCI-X interfaces. Each node has two dual-core Intel 3.46 GHz processors with a 2 MB L2 cache. The machines are connected by three Intel PRO/1000 Gigabit adapters with two ports each, through a 24-port Netgear Gigabit Ethernet switch. We use the RedHat AS 4 Linux operating system with kernel version 2.6.9-30.

Testbed 2: A cluster of 44 nodes. Each node has two Intel Xeon 2.66 GHz processors with a 512 KB L2 cache and 2 GB of main memory.

For the experiments in Section 5, we use the nodes in Testbed 2 as clients and the nodes in Testbed 1 as servers. For all other experiments, we use only the nodes in Testbed 1. For experiments within Testbed 1, we create a separate VLAN for each network adapter on one node and a corresponding IP address within the same VLAN on the other node, to ensure an even distribution of network traffic. Throughout our experiments, we define the relative CPU benefit of I/OAT as follows: if a is the % CPU utilization of I/OAT and b is the % CPU utilization of non-I/OAT, the relative CPU benefit of I/OAT is (b − a)/b. For example, if I/OAT occupies 30% CPU and non-I/OAT occupies 60% CPU, the relative CPU benefit of I/OAT is 50%, though the absolute difference in CPU usage is only 30 percentage points.
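Expressed as code, the metric is a one-line helper; the sketch below (our own, for illustration) reproduces the 30%/60% example.

    /* Relative CPU benefit of I/OAT: (b - a) / b, where a and b are the
     * percentage CPU utilizations of I/OAT and non-I/OAT, respectively. */
    #include <stdio.h>

    static double relative_cpu_benefit(double ioat_cpu, double non_ioat_cpu)
    {
        return (non_ioat_cpu - ioat_cpu) / non_ioat_cpu;
    }

    int main(void)
    {
        /* 30% vs 60% CPU -> prints 50% */
        printf("%.0f%%\n", 100.0 * relative_cpu_benefit(30.0, 60.0));
        return 0;
    }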
4.1 Bandwidth and Bi-directional Bandwidth

[Figure 3. Micro-Benchmarks: (a) Bandwidth and (b) Bi-directional Bandwidth — throughput (Mbps) and average CPU utilization versus number of ports (1 to 6) for I/OAT and non-I/OAT.]

Figure 3a shows the bandwidth achieved by I/OAT and non-I/OAT with an increasing number of network ports. We use the standard ttcp benchmark [4] to measure the bandwidth. As the number of ports increases, we expect the bandwidth to increase. As shown in Figure 3a, the bandwidth achieved by I/OAT is similar to that of non-I/OAT as the number of ports increases; the maximum bandwidth achieved is close to 5635 Mbps with six network ports. However, we see a difference in receiver-side CPU utilization. The CPU utilization is lower for I/OAT than for non-I/OAT with three network ports, and this difference grows as the number of ports increases from three to six. For a six-port configuration, non-I/OAT occupies close to 37% of the CPU while I/OAT occupies only 29%; the relative benefit achieved by I/OAT in this case is close to 21%.

In the bi-directional bandwidth test, we use two machines and 2*N threads on each machine, with N threads acting as servers and the other N threads as clients. Each thread on one machine has a connection to exactly one thread on the other machine, and the client threads connect to the server threads on the other machine; thus 2*N connections are established between the two machines. On each connection, the basic bandwidth test is performed using the ttcp benchmark, and the aggregate bandwidth achieved by all threads is reported as the bi-directional bandwidth, shown in Figure 3b. In our experiments, N equals the number of network ports. The maximum bi-directional bandwidth is close to 9600 Mbps. Also, I/OAT shows an improvement in CPU utilization with as few as two ports, and this improvement increases with the number of ports. With six network ports, non-I/OAT occupies close to 90% of the CPU whereas I/OAT occupies only 70%; the relative CPU benefit achieved by I/OAT is close to 22%.
This trend also suggests that with one or two more network ports added to this configuration, non-I/OAT may no longer deliver the best network throughput in comparison with I/OAT, since non-I/OAT may end up occupying 100% of the CPU.

4.2 Multi-Stream Bandwidth

The multi-stream bandwidth test is very similar to the bi-directional bandwidth test above, except that one machine acts only as a server and the other only as a client. We use two machines and N threads on each machine; each thread on one machine has a connection to exactly one thread on the other machine. On each connection, the basic bandwidth test is performed, and the aggregate bandwidth achieved by all threads is reported as the multi-stream bandwidth.


As shown in Figure 4, the bandwidth achieved by I/OAT is similar to that of non-I/OAT as the number of threads increases. However, when the number of threads reaches 120, we see a degradation in the performance of non-I/OAT, whereas I/OAT consistently shows no degradation in network bandwidth. Further, the CPU utilization of non-I/OAT also increases with the number of threads. With 120 threads in the system, non-I/OAT occupies close to 76% of the CPU whereas I/OAT occupies only 52%, a 24 percentage point absolute benefit in CPU utilization; the relative CPU benefit achieved by I/OAT in this case is close to 32%.

[Figure 4. Multi-Stream Bandwidth — throughput (Mbps) and average CPU utilization versus number of threads (6 to 120) for I/OAT and non-I/OAT.]

4.3 Bandwidth and Bi-directional Bandwidth with Socket Optimizations

As mentioned in Section 2, several sender-side optimizations exist to reduce the packet processing overhead and improve network performance. In this experiment, we consider three such existing optimizations: (i) large socket buffer sizes (100 MB), (ii) TCP segmentation offload (TSO) and (iii) jumbo frames. We aim to understand the impact of each of these optimizations on throughput and on receiver-side CPU utilization. Case 1 uses the default socket options without any optimization. In Case 2, we increase the socket buffer size to 100 MB. In Case 3, we additionally enable segmentation offload (TSO) so that the host CPU is relieved from fragmenting large packets. In Case 4, in addition to the previous optimizations, we increase the MTU size to 2048 bytes so that larger packets are sent over the network. Finally, in Case 5, we add the interrupt coalescing feature on top of the sender-side optimizations. Performance numbers with the various socket optimizations are shown in Figure 5.

[Figure 5. Optimizations: (a) Bandwidth and (b) Bi-directional Bandwidth — throughput (Mbps) and relative CPU benefit of I/OAT for Cases 1 through 5.]
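Of these settings, only the socket buffer size (Case 2) is requested from within the application, as sketched below; TSO, the MTU and interrupt coalescing are adapter-level knobs that are typically configured outside the application (for example with ethtool and ifconfig/ip). The sketch is our own illustration, and the kernel may clamp the requested size to system-wide limits such as net.core.rmem_max.

    /* Sketch for Case 2: request large (100 MB) socket buffers on an
     * existing TCP socket. The kernel may silently clamp the value. */
    #include <sys/socket.h>
    #include <stdio.h>

    static void set_large_buffers(int sock)
    {
        int size = 100 * 1024 * 1024;    /* 100 MB, as in Case 2 */

        if (setsockopt(sock, SOL_SOCKET, SO_SNDBUF, &size, sizeof(size)) < 0)
            perror("SO_SNDBUF");
        if (setsockopt(sock, SOL_SOCKET, SO_RCVBUF, &size, sizeof(size)) < 0)
            perror("SO_RCVBUF");
    }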
In the bandwidth test, shown in Figure 5a, we observe two interesting trends. First, as we add socket optimizations, the aggregate bandwidth increases. Second, the performance of I/OAT is consistently better than that of non-I/OAT; in Case 5, which includes all socket optimizations, I/OAT achieves close to 5586 Mbps whereas non-I/OAT achieves only 5514 Mbps. More importantly, there is a significant improvement in CPU utilization. As shown in Figure 5a, the relative CPU benefit of I/OAT increases as the socket optimizations are added; in Case 4, I/OAT achieves a 30% relative CPU benefit in comparison with non-I/OAT.

We see similar trends in the bi-directional bandwidth test, shown in Figure 5b. Further, the relative benefits achieved by I/OAT are much higher than in the bandwidth experiment: in Case 4, I/OAT achieves close to a 38% relative CPU benefit as compared to non-I/OAT.

In summary, the micro-benchmark results show a significant improvement in CPU utilization and network performance for I/OAT. Since I/OAT in Linux has two active features, the split header and the DMA copy engine, it is also important to understand the benefits attributable to each of these features individually. In the following sections, we conduct several experiments to show the individual benefits.

4.4 Benefits of the Asynchronous DMA Copy Engine

In this section, we isolate the DMA engine feature of I/OAT and show the benefits of an asynchronous DMA copy engine. We compare the performance of the copy engine with that of traditional CPU-based copy and show its benefits in terms of performance and overlap efficiency. For CPU-based copy, we use the standard memcpy utility.

Figure 6 compares the cost of performing a copy using the CPU and I/OAT's DMA engine. The copy-cache bars denote the performance of CPU-based copy with the source and destination buffers in the cache, and the copy-nocache bars denote the performance with the source and destination buffers not in the cache. The DMA-copy bars denote the total time taken to perform the copy using the copy engine, and the DMA-overhead bars show the startup overhead of initiating the copy using the DMA engine. The Overlap line denotes the percentage of DMA copy time that can be overlapped with other computation. As shown in Figure 6, the performance of the DMA-based copy approach (DMA-copy) is better than that of the CPU-based copy approach (copy-nocache) for message sizes greater than 8 KB.

[Figure 6. Copy Performance: CPU vs DMA — copy time (usecs) and overlap percentage versus message size (1 KB to 64 KB) for copy-cache, copy-nocache, DMA-copy and DMA-overhead.]
Further, we observe that the percentage of overlap increases with increasing message size, reaching up to 93% for a 64 KB message. However, if the source and destination buffers are in the cache, the CPU-based copy performs much better than the DMA-based copy approach. Still, since a DMA-based copy can be overlapped with the processing of other packets, the CPU incurs only the DMA startup overhead, and as observed in Figure 6, this startup overhead is much smaller than the time taken by the CPU-based copy. Thus, the DMA copy engine can be useful even when the source and destination buffers are in the cache.
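The copy-cache/copy-nocache distinction can be approximated with a small timing harness: time memcpy() once with warm caches, then again after sweeping a buffer larger than the 2 MB L2 cache to evict the operands. The sketch below is our own approximation of such a measurement, not the authors' harness.

    /* Sketch: time memcpy() with warm vs (approximately) cold caches.
     * Eviction is approximated by sweeping a buffer larger than L2. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <time.h>

    #define MSG   (64 * 1024)          /* 64 KB message */
    #define SWEEP (8 * 1024 * 1024)    /* larger than the 2 MB L2 cache */

    static double elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        char *src = malloc(MSG), *dst = malloc(MSG), *sweep = malloc(SWEEP);
        struct timespec t0, t1;

        memset(src, 1, MSG);
        memcpy(dst, src, MSG);                  /* warm the caches */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, MSG);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy-cache:   %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        memset(sweep, 2, SWEEP);                /* evict src and dst */
        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, src, MSG);
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy-nocache: %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        free(src); free(dst); free(sweep);
        return 0;
    }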


4.5 Breakdown of I/OAT Features

In this section, we break down the two active features of I/OAT and show their benefits in terms of throughput and CPU utilization. We first measure the performance of non-I/OAT, without the DMA engine and split-header features. Next, we measure the performance with the DMA engine feature enabled (denoted I/OAT-DMA in Figure 7) and compare it with the performance of non-I/OAT; we report the difference as the DMA engine benefit. Finally, we measure the performance with both the split-header and DMA engine features enabled (denoted I/OAT-SPLIT in Figure 7) and compare it with the performance of I/OAT with only the DMA engine; we report the difference as the split-header benefit.

We design the experiment in the following way. We use the two nodes from Testbed 1, one as a server and the other as a client. We use two Intel network adapters with two ports each on both the server and the client node. Since we have four network ports on each node, we use four clients and perform a bandwidth test with four server threads. We first measure the bandwidth for small message sizes for all three cases, i.e., non-I/OAT, I/OAT-DMA and I/OAT-SPLIT, and report the relative CPU benefits. As shown in Figure 7a, the DMA engine feature achieves close to a 16% relative CPU benefit compared to the non-I/OAT case. This improvement is seen for all message sizes from 16 KB to 128 KB. However, we do not observe any throughput improvement at these message sizes. Also, the split-header feature of I/OAT does not improve the CPU utilization or the throughput for these message sizes.

As mentioned in Section 2, the split-header feature of I/OAT increases the locality of incoming headers and helps improve cache utilization by not polluting the CPU's cache with application buffers during network processing. To expose these cache pollution effects, we repeat the previous experiment with large message sizes that cannot fit in the system cache. Figure 7b shows the throughput benefits achieved by the I/OAT features for large message sizes. We observe that the split-header feature can achieve up to a 26% throughput benefit when transferring 1 MB of application data. In this case, the server with I/OAT capability receives a total of 4 MB of application data from the four clients; since the cache size is only 2 MB, the application data clearly does not fit in the system cache. We see a large improvement for the split-header feature because the CPU's cache is not polluted with application data.
In addition, we see that the benefit achieved by the split-header feature decreases as the message size increases further.

5 Data-Center Performance Evaluation

In this section, we analyze the performance of a two-tier data-center environment with I/OAT and compare it with non-I/OAT. For all experiments in this section, we use the nodes in Testbed 1 (described in Section 4) for the data-center tiers. For the client nodes, we use the nodes in Testbed 2 for most of the experiments; we notify the reader at the appropriate points in the remaining sections when other nodes are used as clients.

5.1 Evaluation Methodology

As mentioned in Section 2, I/OAT is a server architecture geared toward improving receiver-side performance. In other words, I/OAT can be deployed in a data-center environment so that client requests coming over the Wide Area Network (WAN) seamlessly take advantage of I/OAT and get serviced much faster. Further, the communication between the tiers inside the data-center, such as the proxy tier and the application tier, can be greatly enhanced using I/OAT, thus improving the overall data-center performance and scalability.

We set up a two-tier data-center testbed to determine the performance characteristics of I/OAT and non-I/OAT. The first tier consists of the front-end proxies, for which we use the proxy module of Apache 2.2.0. The second tier consists of the web server module of Apache serving static requests. The two tiers of the data-center reside on a 1 Gigabit Ethernet network, and the clients are connected to the data-center over a 1 Gigabit Ethernet connection.

Typical data-center workloads have a wide range of characteristics. Workloads may vary from high to low temporal locality, following a Zipf-like distribution [6]. Similarly, workloads vary from small documents (e.g., online book stores, browsing sites) to large documents (e.g., download sites). Further, workloads may contain requests for simple cacheable static or time-invariant content, or for more complex dynamic or time-variant content via CGI, PHP and Java servlets with a back-end database. Due to these varying characteristics, we classify the workloads into three categories.


[Figure 10. PVFS Concurrent Read Performance: (a) Six I/O Servers and (b) Five I/O Servers — throughput (MB/s) and relative CPU benefit of I/OAT versus number of clients.]

[Figure 11. PVFS Concurrent Write Performance: (a) Six I/O Servers and (b) Five I/O Servers — throughput (MB/s) and relative CPU benefit of I/OAT versus number of clients.]

[...] 697 MB/s. With I/OAT, the write bandwidth performance increases from 460 MB/s to 750 MB/s with an increasing number of compute nodes, i.e., I/OAT achieves an 8% throughput benefit as compared to non-I/OAT. Further, I/OAT achieves close to a 5% CPU benefit as compared to non-I/OAT using six compute nodes. We see similar trends when we decrease the number of I/O servers, as shown in Figure 11b.

6.2.2 PVFS Read Performance with Multiple Streams

In this experiment, we emulate a scenario where multiple clients access the file system. We increase the number of clients from 1 to 64 and study the behavior of concurrent read performance in PVFS. As shown in Figure 12, the performance of I/OAT is either equal to or greater than that of non-I/OAT. Note that we report the CPU utilization measured on the client side, which also has the I/OAT capability. Because a client with I/OAT can receive data at a much faster rate, the clients can also issue PVFS read requests at a faster rate than with non-I/OAT. As a result, we observe that the CPU utilization on the client side is around 10-12% higher than in the non-I/OAT case.

[Figure 12. Multi-Stream PVFS Read Performance — throughput (MB/s) and client-side average CPU utilization versus number of emulated clients (1 to 64).]

In summary, PVFS with I/OAT shows a 12% improvement for concurrent reads and an 8% improvement for concurrent writes as compared to PVFS with non-I/OAT.
In terms of CPU utilization, PVFS with I/OAT achieves a 15% benefit for concurrent reads and a 7% benefit for concurrent writes as compared to PVFS with non-I/OAT.

7 Discussion

As mentioned earlier, I/OAT helps improve network performance and reduce CPU utilization through features such as the asynchronous copy engine and split headers. The asynchronous copy engine can also be used by user applications to improve the performance of memory operations and the communication performance between two processes within the same node. Since the copy engine can operate in parallel with the host CPU, applications can achieve significant benefits by overlapping useful computation with memory operations. With the introduction of chip-level multiprocessing (CMP), per-core memory bandwidth drops drastically, increasing the need for such asynchronous memory operations as a memory-latency hiding mechanism. Aron and Druschel have proposed soft-timer techniques to reduce receiver-side processing [5]; I/OAT can co-exist with this technique to further reduce receiver-side overheads and improve performance.

On the other hand, I/OAT requires special network adapters to support the split-header and multiple receive queue features. In addition, the host system must have a built-in copy engine to support the asynchronous copy feature. Also, the memory controller uses physical addresses; hence, pages cannot be swapped out during a copy operation. As a result, the physical pages must be pinned before initiating the copy operation. Due to this page-pinning requirement, the usefulness of the copy engine becomes questionable if the pinning cost exceeds the copy cost.
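This trade-off can be sanity-checked from user space: the sketch below times mlock() on a buffer against a plain memcpy() of the same buffer. It is only an illustration of the pinning-versus-copy comparison (the kernel's actual pinning path differs), and mlock() may require raised RLIMIT_MEMLOCK limits or privileges.

    /* Sketch: compare the cost of pinning a buffer (mlock) with the cost
     * of simply copying it. If pinning dominates, copy offload loses. */
    #include <stdio.h>
    #include <stdlib.h>
    #include <string.h>
    #include <sys/mman.h>
    #include <time.h>

    static double elapsed_us(struct timespec a, struct timespec b)
    {
        return (b.tv_sec - a.tv_sec) * 1e6 + (b.tv_nsec - a.tv_nsec) / 1e3;
    }

    int main(void)
    {
        const size_t len = 1 << 20;            /* 1 MB buffer */
        char *buf = malloc(len), *dst = malloc(len);
        struct timespec t0, t1;

        memset(buf, 1, len);                   /* touch all pages first */

        clock_gettime(CLOCK_MONOTONIC, &t0);
        if (mlock(buf, len) != 0)              /* pin: may need privileges */
            perror("mlock");
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("pin:  %.1f us\n", elapsed_us(t0, t1));

        clock_gettime(CLOCK_MONOTONIC, &t0);
        memcpy(dst, buf, len);                 /* the copy being offloaded */
        clock_gettime(CLOCK_MONOTONIC, &t1);
        printf("copy: %.1f us (%d)\n", elapsed_us(t0, t1), dst[0]);

        munlock(buf, len);
        free(buf);
        free(dst);
        return 0;
    }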


8 Concluding Remarks and Future Work

I/O Acceleration Technology (I/OAT), developed by Intel, is a set of features particularly designed to reduce packet processing overheads on the receiver side. This paper studied the benefits of the I/OAT technology through extensive micro-benchmark evaluations as well as evaluations in two different application domains: (1) a multi-tier data-center environment and (2) the Parallel Virtual File System (PVFS). Our micro-benchmark results show that I/OAT results in 38% lower overall CPU utilization in comparison with traditional communication. Due to this reduced CPU utilization, I/OAT delivers better performance and increased network bandwidth. Our experimental results with data-centers and file systems reveal that I/OAT can improve the total number of transactions processed by 14% and the throughput by 12%, respectively. In addition, I/OAT can sustain a large number of concurrent threads (up to a factor of four as compared to non-I/OAT) in data-center environments, thus increasing the scalability of the servers.

We are currently working on two broad aspects of I/OAT. First, we are looking at modifying applications to realize the true benefits of the I/OAT capability. Second, we are trying to expose an asynchronous memory copy operation to user applications. Though this involves some overhead (such as context switches and user page locking), asynchronous memory copy can help applications use their CPU cycles intelligently.

9 Acknowledgments

We would like to thank Intel Corporation for providing access to I/OAT-based systems and the kernel patch that includes all of the I/OAT features. We would also like to thank Eric Simpson from the I/OAT team at Intel Corporation for providing crucial insights into the implementation details.

References

[1] Accelerating High-Speed Networking with Intel I/O Acceleration Technology. http://www.intel.com/technology/ioacceleration/306517.pdf.
[2] Increasing Network Speeds: Technology@Intel. http://www.intel.com/technology/magazine/communications/intelioat-0305.htm.
[3] Intel I/O Acceleration Technology. http://www.intel.com/technology/ioacceleration/306484.pdf.
[4] USNA. TTCP: A Test of TCP and UDP Performance, August 2001.
[5] Mohit Aron and Peter Druschel. Soft timers: Efficient microsecond software timer support for network processing. ACM Transactions on Computer Systems, 18(3):197–228, 2000.
[6] L. Breslau, P. Cao, L. Fan, G. Phillips, and S. Shenker. Web caching and Zipf-like distributions: Evidence and implications. In INFOCOM (1), pages 126–134, 1999.
[7] Philip H. Carns, Walter B. Ligon III, Robert B. Ross, and Rajeev Thakur. PVFS: A parallel file system for Linux clusters. In Proceedings of the 4th Annual Linux Showcase and Conference, pages 317–327, Atlanta, GA, 2000. USENIX Association.
[8] A. Chandra, W. Gong, and P. Shenoy. Dynamic resource allocation for shared data centers using online measurements. In Proceedings of ACM SIGMETRICS 2003, San Diego, CA, June 2003.
[9] J. Chase, A. Gallatin, and K. Yocum. End system optimizations for high-speed TCP. IEEE Communications Magazine, 39(4):68–74, 2001.
[10] Jeffrey S. Chase, Darrell C. Anderson, Prachi N. Thakar, Amin Vahdat, and Ronald P. Doyle. Managing energy and server resources in hosting centres. In Symposium on Operating Systems Principles, 2001.
[11] Avery Ching, Alok Choudhary, Wei-keng Liao, Robert Ross, and William Gropp. Noncontiguous I/O through PVFS. In Proceedings of the IEEE International Conference on Cluster Computing, 2002.
[12] David D. Clark, Van Jacobson, John Romkey, and H. Salwen. An analysis of TCP processing overhead, 1989.
[13] Annie Foong et al. TCP performance analysis revisited. In IEEE International Symposium on Performance Analysis of Systems and Software, March 2003.
[14] Armando Fox, Steven D. Gribble, Yatin Chawathe, Eric A. Brewer, and Paul Gauthier. Cluster-based scalable network services. In Symposium on Operating Systems Principles, 1997.
[15] Andrew Grover and Christopher Leech. Accelerating network receive processing. http://linux.inet.hr/files/ols2005/grover-reprint.pdf.
[16] S. Makineni and R. Iyer. Architectural characterization of TCP/IP packet processing on the Pentium M microprocessor. In High Performance Computer Architecture (HPCA-10), 2004.
[17] G. Regnier, S. Makineni, R. Illikkal, R. Iyer, D. Minturn, R. Huggahalli, D. Newell, L. Cline, and A. Foong. TCP onloading for data center servers. IEEE Computer, November 2004.
[18] John Reumann, Ashish Mehra, Kang G. Shin, and Dilip Kandlur. Virtual services: A new abstraction for server consolidation. In Proceedings of the USENIX 2000 Technical Conference, June 2000.
[19] Yasushi Saito, Brian N. Bershad, and Henry M. Levy. Manageability, availability and performance in Porcupine: A highly scalable, cluster-based mail service. In Symposium on Operating Systems Principles, 1999.
[20] H. V. Shah, D. B. Minturn, A. Foong, G. L. McAlpine, R. S. Madukkarumukumana, and G. J. Regnier. CSP: A novel system architecture for scalable Internet and communication services. In Proceedings of the 3rd USENIX Symposium on Internet Technologies and Systems, pages 61–72, San Francisco, CA, March 2001.
[21] K. Shen, H. Tang, T. Yang, and L. Chu. Integrated resource management for cluster-based Internet services. In Proceedings of the Fifth USENIX Symposium on Operating Systems Design and Implementation (OSDI '02), pages 225–238, Boston, MA, December 2002.
[22] Rajeev Thakur, William Gropp, and Ewing Lusk. On implementing MPI-IO portably and with high performance. In Proceedings of the 6th Workshop on I/O in Parallel and Distributed Systems, pages 23–32. ACM Press, May 1999.
