2009 17th IEEE Symposium on High Performance Interconnects

FPGA accelerated low-latency market data feed processing

Gareth W. Morris, Celoxica Inc. Email: gareth.morris@celoxica.com
David B. Thomas and Wayne Luk, Imperial College London. Email: {dt10,wl}@doc.ic.ac.uk

Abstract—Modern financial exchanges provide updates to their members on the changing status of the market place by providing streams of messages about events, called a market data feed. Markets are growing busier, and the data rates of these feeds are already in the gigabit range, from which customers must extract and process messages with sub-millisecond latency. This paper presents an FPGA-accelerated approach to market data feed processing, using an FPGA connected directly to the network to parse, optionally decompress, and filter the feed, and then to push the decoded messages directly into the memory of a general-purpose processor. Such a solution offers flexibility, as the FPGA can be reconfigured for new data feed formats, and high throughput with low latency by eliminating the operating system's network stack. This approach is demonstrated using the Celoxica AMDC board, which accepts a pair of redundant data feeds over two gigabit Ethernet ports, parses and filters the data, then pushes relevant messages directly into system memory over the PCIe bus. Tests with an OPRA FAST data feed redistribution system show that the AMDC is able to process up to 3.5M messages per second, 12 times the current real-world rate, while the complete system rebroadcasts at least 99% of packets with a latency of less than 26us. The hardware portion of the design has a constant latency, irrespective of throughput, of 4us.

I. INTRODUCTION

Modern financial instrument exchanges provide updates on the current state of the market place to their members. This is performed by transmitting messages describing change events on a dedicated 'market data feed'. These events include changes such as completed trades, bid/ask prices, and other status information. The event messages are typically aggregated into one large market data feed, containing information about all activities within an exchange or market. Automated trading systems can examine this feed in order to reconstruct the current market state for financial instruments of interest. This can then be used, for example, to perform algorithmic trading, detect instrument arbitrage opportunities, or to re-hedge portfolios. However, these feeds are already in the gigabit range, and members face the problem of parsing this huge volume of data while also supporting a sub-millisecond response to messages of interest.

Existing pure software solutions are no longer able to provide low latency, so there is a need for hardware acceleration of market data feed processing. Field Programmable Gate Arrays (FPGAs) provide a very attractive means of acceleration, as they are a mature technology with low power and space requirements [1]. However, FPGAs have traditionally been seen as difficult to program, requiring applications to be written in hardware-design languages, and needing specialised engineers to develop, maintain, and extend any FPGA-based solutions.

Application Specific Integrated Circuits (ASICs) may provide similar advantages to FPGA technology with respect to power and space. However, ASICs have a far longer design cycle than FPGA-based solutions, and lack the reconfigurability needed to track updates to market data feed specifications. Furthermore, the cost of an ASIC-based design would be considerably higher for the small to medium production runs needed to address the financial industry.

This paper proposes an FPGA-based hardware acceleration architecture for the processing of high-throughput market data feeds, providing a solution able to operate up to the maximum data rate of the network connection, while offering a very low latency path from the network interface to the consuming process, irrespective of network load. Our key contributions are:
• A model for the division of market data feed processing, placing line-rate A-B line arbitrage, filtering, and routing in reconfigurable hardware, while keeping non-line-rate consuming processes in software, for ease of use and for processing ability appropriate to trading algorithms.
• An implementation of this model using the Celoxica AMDC accelerator card, which is able to filter and process a redundant pair of market data feeds at gigabit line rates.
• An evaluation of the AMDC card using the FAST-compressed OPRA market data feed, demonstrating an average latency from packet arrival to message delivery of 4us, at throughputs limited only by the gigabit Ethernet connection.

We first explain the motivation and needs of market data feed processing, in particular the requirements for high message rates and low latency. Our high-level approach to FPGA acceleration of feed processing is then described, followed by a description of a concrete implementation of this approach on the AMDC board and its achieved performance.


II. MOTIVATION

Financial market data feeds are used by exchanges to communicate changes in prices and market conditions to traders. For example, an option exchange will facilitate trades in a large number of quoted options, each of which will have different properties, such as the underlying asset on which the option relies, the strike price of the option, and the expiry date of the option. For each instrument there will be a set of bid orders from traders wishing to buy at a certain price, and ask orders from traders who are willing to sell at a certain price. The job of the exchange is to collect and monitor these bid/ask prices, and to match trades when the bid on a particular instrument meets or exceeds the ask price.

Information about changes in bid/ask levels and about trades that have been completed is broadcast by the exchanges in real time, allowing traders to reconstruct the market place, monitor market conditions, and respond when necessary. Automated trading systems may execute a number of strategies:

Algorithmic Trading: Many trading strategies can be expressed algorithmically, from the very simple "stop-loss" approach, up to complex heuristics that detect patterns in the market.

Instrument Arbitrage Detection: When an asset (or two similar assets) is priced differently in two markets, and this difference can be detected in time, a profit can be made by buying in the cheaper market and selling in the more expensive one.

Portfolio Hedging: It is possible to construct a portfolio with a known level of risk by including specific sets of assets and derivatives. However, as market conditions change, the composition of the portfolio must be adjusted to reflect changes in prices.

In almost all such applications, a key requirement is low latency processing: the market is continually changing, so decisions should be made based on the most up-to-date data. For example, an opportunity for instrument arbitrage only exists until the first trader is able to exploit it, so it is critical to minimise the time between the information becoming available and the execution of the appropriate response.

Market data feeds intended for low latency trading are made available via UDP multicast IP networks, allowing many users to subscribe to a shared source. These feeds are often duplicated, with two streams (A and B) providing the same information, giving redundancy in the case of packet losses between the exchange and the trader. However, exchange members who wish to use the redundancy provided by the A-B feeds must process twice as much data. If the redundancy of the A-B feeds cannot be used, messages can be re-requested from the exchange, which then gives a significant latency disadvantage whilst waiting for the re-requested messages.
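As an illustration of how a conventional software feed handler subscribes to the two multicast streams just mentioned, the sketch below joins a group address on a standard Linux UDP socket. The group addresses, port, and function name are placeholders for illustration, not real OPRA feed parameters or any part of the AMDC software.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdint.h>
#include <string.h>
#include <sys/socket.h>

/* Join one multicast group and bind to its UDP port. The group and port
 * are placeholders; real feed parameters come from the exchange's
 * specification. Returns a socket descriptor, or -1 on error. */
static int join_feed(const char *group, uint16_t port)
{
    int fd = socket(AF_INET, SOCK_DGRAM, 0);
    if (fd < 0)
        return -1;

    struct sockaddr_in addr;
    memset(&addr, 0, sizeof(addr));
    addr.sin_family = AF_INET;
    addr.sin_addr.s_addr = htonl(INADDR_ANY);
    addr.sin_port = htons(port);
    if (bind(fd, (struct sockaddr *)&addr, sizeof(addr)) < 0)
        return -1;

    struct ip_mreq mreq;                      /* IGMP join for the group */
    mreq.imr_multiaddr.s_addr = inet_addr(group);
    mreq.imr_interface.s_addr = htonl(INADDR_ANY);
    if (setsockopt(fd, IPPROTO_IP, IP_ADD_MEMBERSHIP, &mreq, sizeof(mreq)) < 0)
        return -1;
    return fd;
}

/* A subscriber joins both redundant feeds, e.g.
 *   int fd_a = join_feed("233.0.0.1", 12345);  // A feed (placeholder address)
 *   int fd_b = join_feed("233.0.0.2", 12345);  // B feed (placeholder address)
 * and then arbitrates between them as described in Section III. */
```

In the accelerated design described later, this subscription and all subsequent packet handling move into the FPGA, so the host never issues socket calls on the fast path.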
Over time the number of messages produced has drastically increased. Figure 1 shows the increase in message rate for the OPRA data feed. This increase has a number of causes, such as:

Increased range of products: The number and scope of traded instruments have continually increased. For example, options are now quoted at strike price increments of one dollar, rather than five dollars, increasing the number of listed options by five times.

High resolution monitoring: Changes in prices are now reported to a resolution of one penny ($0.01), rather than a nickel ($0.05), resulting in much more frequent updates due to small price movements.

Algorithmic trading: Algorithms generate and cancel trades at a much higher rate than human traders, leading to an increased message rate from the exchange.

Fig. 1. Increase in the message and data rate for the OPRA data feed, 1992-2007 (data rate in Mb/s for the ASCII and FAST encodings, and messages per second).

The result is a message rate approaching 1M messages/sec, a huge computational load for any potential data-stream client. The growth in message rate has also caused the raw data rate of feeds to increase in proportion. The data rate required for the ASCII-encoded OPRA stream has risen from less than 1 Mb/s to almost 600 Mb/s over the last ten years. Such data rates place enormous strain on the networks between exchanges and traders, so data streams are increasingly compressed, using domain-specific encoding schemes.

One such scheme, used on the OPRA feed amongst others, is FAST [2], a binary protocol which replaces the ASCII FIX format. To reduce message length, the format is no longer self-describing, so clients must already know the structure of the messages. A number of compression strategies are also used, such as:
• Variable-length encoding of integers.
• Delta-encoding of values against values transmitted in previous packets.
• Packing of data into seven-bit atoms, using the eighth bit to indicate end-of-field.

This compression reduces the data size by around two-thirds, significantly reducing pressure on networks. However, clients must now decompress the stream, increasing the amount of processing that must be performed before messages can be delivered to the applications that will consume them.
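To make the seven-bit stop-bit packing listed above concrete, the sketch below decodes one unsigned FAST-style integer from a byte stream. It assumes a plain unsigned field with no presence map, template, or delta handling, so it illustrates only the encoding itself rather than a complete FAST decoder.

```c
#include <stddef.h>
#include <stdint.h>

/* Decode one stop-bit integer: each byte carries seven data bits, and the
 * eighth (most significant) bit marks the final byte of the field.
 * Returns the number of bytes consumed, or 0 if the buffer ends before a
 * stop bit is seen. */
static size_t decode_stop_bit_uint(const uint8_t *buf, size_t len, uint64_t *out)
{
    uint64_t value = 0;
    for (size_t i = 0; i < len; i++) {
        value = (value << 7) | (buf[i] & 0x7F); /* accumulate 7 data bits */
        if (buf[i] & 0x80) {                    /* stop bit: field ends here */
            *out = value;
            return i + 1;
        }
    }
    return 0; /* field continues beyond this buffer */
}
```

For example, the two bytes 0x39 0xC5 decode to (0x39 << 7) | 0x45 = 7365, using two bytes instead of the four ASCII digits the FIX encoding would require.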


III. APPROACH

The combination of increased message rates and more complex market data feed formats means that pure software solutions are often unable to keep up. Even when software solutions can deal with average message rates, they may start to develop a backlog during large bursts of trading activity, increasing latency in exactly the situations where it is most important to stay up to date. Our proposal is to use a hardware-accelerated stream processor, while still supporting software trading applications. Our key requirements are:
• High-bandwidth message processing, limited only by incoming network bandwidth.
• Minimal latency between the arrival of packets at the network interface and the delivery of the contained messages to software.
• Support for redundant A-B input streams, with automatic recovery of missing messages.

A. Overview

Figure 2 gives a high-level overview of the problem to be solved. On the left is the market data source, within the exchange. This takes a stream of messages, which may be aggregated from sources within the exchange, and produces a single master stream. This stream may then be compressed, and is sent to two network ports which multicast the streams out, via standard network infrastructure such as bridges and switches, to any interested parties. Eventually the two feeds arrive at the user's market data processing server, which contains the applications wishing to consume a particular subset of the messages contained in the feed.

The first step that must be taken in the data processor machine is to reconstruct the original feed: all networking equipment will occasionally lose packets, and the two redundant feeds are designed to minimise the effect of this. A-B line arbitrage requires any missing data from one stream to be replaced with the correct data from the other stream. This is a non-trivial job, as even though the two streams left the data source machine at the same time, the intervening network infrastructure is likely to introduce skew. The line arbitrage should also make sure to select the earlier of the two copies of a packet to arrive at the two interfaces, to minimise latency.

The second step is to decompress the reconstructed stream, if necessary. This turns the packets into messages that can be consumed in software. Decompression requires intimate knowledge of the stream compression format and the stream payload, and must be updated as standards evolve. The final stage is to filter the stream, identifying which messages are of interest to which applications.

One approach is to use a multi-core processor for this task, assigning parallel threads to the jobs of A-B line arbitrage, decompression, and routing, but this has a number of drawbacks:
1) Software processing relies on OS calls to receive packets from network ports, which introduces significant latency, due to the inefficiency of the OS networking stack and the cost of context switching between user and kernel space.
2) Co-operating threads must pass messages around either using OS-level synchronisation primitives, which require expensive context switches, or using spin-locks, which tie up processors.
3) Many aspects of stream decompression are essentially sequential, so it is difficult to schedule the processing load evenly over multiple cores.

These factors combine to make a pure software solution unacceptable, in particular because there are so many sources of latency, and because the overall latency will vary randomly.

In contrast, we propose placing almost all stages of stream processing directly into hardware. All stages of data-stream handling, right up until the packet is delivered to the application code, are mapped into an FPGA accelerator. This is achieved by connecting the market data ingress ports directly to the FPGA, and by providing the FPGA with direct access to system RAM, allowing the accelerator to push data structures directly into the memory space of threads executing application code. Pushing data directly into RAM removes the need for any calls to OS routines, allowing application threads to detect new messages without any context switching, and so minimising latency.

The use of an FPGA as the main processing element means that incoming Ethernet traffic can be processed at line rate, on every tick of its mother clock. As data is processed at line rate, an FPGA solution can guarantee that no packets are dropped, irrespective of link saturation.

B. Stream Processing

Each data feed arrives over the network as a sequence of distinct packets, with each packet prefixed by layers of headers providing information for the different abstraction layers in the network. In the case of a market data feed, these headers start at the standard network layers, from Ethernet up to UDP, but then extend into the financial protocol itself, where there are encapsulation formats, such as FAST, wrapped around specific payload formats such as the different types of FIX packets.

Figure 3 provides an overview of the processing stack, from the lowest level (Ethernet) through to the feed-specific payload messages. Because there are two data streams, A and B, the entire processing stack must be replicated, up until the point at which the two streams are merged.

Our packet processing approach uses a set of composable packet processing components, which allows packet processors for new types of stream to be rapidly constructed. Packets are initially buffered as byte-wide streams, which are transferred at one byte per cycle (shown in Figure 4). Header extraction components then use a shift register to extract the header for a given type of packet, producing a new stream with the same width as the header. This wide stream contains the entire parsed header in the first cycle, followed by the packet payload in successive cycles.

This method may appear inefficient, because headers are typically many bytes wide, but in practice the header extraction components are connected directly to other components that filter or route packets based on the header, so the wide stream only travels a short distance. Furthermore, unused fields will be optimised away from the design by the FPGA compilation tool. After filtering, only the payload is retained,


the applications via memory.

At the exchange, the messages sent over the A and B feeds are sent at exactly the same time. However, the intervening network infrastructure between exchange and member will introduce delays of variable length, for example due to buffering in the internal buses of switches. For the same reason, packets may also arrive out of sequence, either due to switching algorithms, or due to different packets taking different routes. The result is that the relationship between the A and B feeds will change over time, possibly on a per-packet basis, with sometimes one feed ahead, and sometimes the other.

To reconstruct the stream we maintain a window of message identifiers, which extends from the most recent message identifier we have seen, r, down to the bottom of the window at r - w + 1. The parameter w is chosen to reflect the maximum observed transmission delay. Within this window we maintain a flag associated with each message identifier, indicating whether that message has been forwarded to the application yet. As each message arrives from the A or B channel, the associated flag is checked, and the message is forwarded or dropped depending on whether it has already been seen. When a message identifier i greater than r is encountered, the window is moved forwards; any identifiers that fall out of the window without ever having been seen indicate completely lost messages, which are reported to the application.
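The sketch below shows the window logic just described in C for illustration; on the AMDC it is implemented inside the FPGA. The window size W, the flag bitmap, the struct and function names, and the lost-message callback are assumptions made for the sketch, and initialisation of r to the first observed identifier is omitted.

```c
#include <stdbool.h>
#include <stdint.h>

#define W 1024  /* window size w: chosen to cover the maximum observed skew */

struct arbiter {
    uint64_t r;     /* highest message identifier seen so far */
    bool seen[W];   /* seen[i % W] is set once identifier i has been forwarded */
};

/* Called for every message arriving on either the A or the B feed.
 * Returns true if the message should be forwarded to the application,
 * false if it is a duplicate already delivered from the other feed. */
static bool arbiter_accept(struct arbiter *a, uint64_t seq,
                           void (*report_lost)(uint64_t seq))
{
    if (seq + W <= a->r)                /* below the window: stale duplicate */
        return false;
    while (a->r < seq) {                /* advance the window up to seq */
        a->r++;
        uint64_t leaving = a->r - W;    /* identifier falling out of the window */
        if (a->r >= W && !a->seen[leaving % W])
            report_lost(leaving);       /* never arrived on either feed */
        a->seen[a->r % W] = false;      /* recycle the slot for identifier r */
    }
    if (a->seen[seq % W])               /* already forwarded from the other feed */
        return false;
    a->seen[seq % W] = true;
    return true;
}
```

In hardware the flag bits live in on-chip RAM and the window advance is bounded per packet; the sketch keeps the structure of the algorithm but not those implementation details.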
D. Filtering and Delivery

The eventual consumers of messages are implemented as conventional software threads, executing on the multi-processor host containing the FPGA board. This allows threads to be developed in conventional languages, such as C, rather than requiring esoteric hardware design languages. Threads interact with the hardware accelerator by first using a software API to specify a message filter, indicating which specific assets in the market they are interested in, and what types of messages about those assets they wish to receive. Once the message filter has been set, the threads continuously poll a message queue, waiting for any messages meeting the filter to be delivered.

Polling is traditionally seen as inefficient, but in this situation it is the best way of minimising latency. A traditional notification mechanism, such as an OS-level mutex or event, incurs a context switch on the part of both notifier and notifiee, adding a potentially large and variable amount of latency. However, because messages are pushed directly into the memory space of the thread, a thread polling memory will receive the new message with a latency determined only by the speed of the memory and the cache-coherency protocol.

To further minimise latency, it is necessary to lock each stream client thread to a specific CPU, and to make sure that it is the only thread on that CPU. This means that the OS never moves threads between processors, and that a thread is never pre-empted by another thread. Modern multi-core processors offer many individual CPU cores, so this is an efficient trade-off. All non-feed-processing threads, such as OS processes, are locked onto a single dedicated CPU.

Fig. 5. Broadcast of packets to threads: the FPGA DMAs messages into RAM, an API core broadcasts pointers to the user cores, and the OS is confined to its own core.

As well as the stream client and OS cores, another core is dedicated to run-time management of the hardware accelerator. This is responsible for some aspects of routing, such as multicasting packets out to individual threads. Note that the run-time distributor only delivers pointers to structures that have already been placed in memory: messages do not need to be moved around once they have entered system memory.

Figure 5 gives an overview of the software processing of messages: first, the FPGA accelerator DMAs a copy of the entire message into shared memory; second, the run-time management thread detects the new message and broadcasts a pointer to it to all interested processing threads; finally, each thread receives a pointer to the shared message and can immediately start processing. Since user threads operate on lists of trading symbols, it is simple to partition them across cores and balance the load symmetrically.
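To illustrate the pinning-plus-polling pattern described above, here is a minimal Linux sketch. The shared-memory layout (a slot holding a sequence counter and a message pointer) and the function names are assumptions made for the sketch; they are not the actual AMDC API, and a production version would also order the pointer write against the sequence update.

```c
#define _GNU_SOURCE
#include <sched.h>
#include <stdatomic.h>
#include <stdint.h>

/* Hypothetical slot that the run-time distributor writes into the
 * consumer's memory: the sequence number advances once the message
 * pointer below is valid. */
struct msg_slot {
    _Atomic uint64_t seq;
    const void *msg;        /* points at the DMA'd message in shared RAM */
};

/* Pin the calling thread to one CPU so the OS never migrates or
 * pre-empts it (the design dedicates whole cores to client threads). */
static int pin_to_cpu(int cpu)
{
    cpu_set_t set;
    CPU_ZERO(&set);
    CPU_SET(cpu, &set);
    return sched_setaffinity(0, sizeof(set), &set);
}

/* Busy-poll the slot until a message newer than *last_seen appears.
 * No system call is made on this path, so the delivery latency is set
 * only by the memory system and cache-coherency traffic. */
static const void *poll_next(struct msg_slot *slot, uint64_t *last_seen)
{
    for (;;) {
        uint64_t s = atomic_load_explicit(&slot->seq, memory_order_acquire);
        if (s != *last_seen) {
            *last_seen = s;
            return slot->msg;
        }
    }
}
```

A client thread would call pin_to_cpu() once at start-up, set its message filter through the accelerator's API, and then spin in poll_next(), consuming each message pointer as it is broadcast.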


IV. IMPLEMENTATION AND RESULTS

In order to verify the architecture, an implementation of the scheme proposed in Section III was carried out. The platform used for this implementation was the Celoxica AMDC accelerator card, which uses a Xilinx Virtex-5 LX110T FPGA device as the main processing element. The FPGA packet processing engine was written in Handel-C [3], using the Hyper-Streams programming model [4]. The AMDC card has been measured to draw less than 15 watts of power from its host server.

The Celoxica AMDC card was inserted in a quad-core 2.4GHz AMD Opteron server, running Red Hat Enterprise Linux 5. The server was configured such that one CPU core was locked to polling the FPGA card for incoming messages, with another locked to running a user packet redistribution application. The remaining cores were left unlocked for running OS-related processes.

Figure 6 shows the harness used for these tests. Pre-recorded OPRA FAST v2 data is inserted into the system with the same inter-packet gaps as the original data. Further tests artificially accelerate the data transmission rate by reducing the inter-packet gaps by a constant factor. The original capture had an average of 296,177 messages per second over 60 seconds. Data is streamed into the AMDC card inside the test server, as well as into a separate packet sniffer. The Celoxica AMDC card processes the incoming feed, as discussed in Section III, and the data is made available to the user application. The user application repackages the data and broadcasts it out of the server's internal NIC, using a standard OS socket call.

Fig. 6. Test harness for measuring overall system latency: the feed source drives both a packet sniffer and the test server (Celoxica AMDC accelerator, user application, OS TCP/IP stack, network interface).

Fig. 7. Packets dropped in the software portion of the test environment with increasing throughput (OPRA feed rate in Mbps against kpackets/sec dropped).

Fig. 8. Percentiles of packet processing latency through the test server.

Fig. 9. Test-bench for measuring AMDC latency: a 125MHz counter value is appended to each incoming packet and the AMDC run-time collects the resulting statistics.

On the output of the system, the packet sniffer also records data leaving the test server's NIC. Comparing the difference in arrival time between packets from the feed source and from the test server allows latency across the entire test server to be measured. It also allows packets dropped in the test server to be recorded. In order to determine whether these packets are dropped by the AMDC card or by software processes, the user application records packets dropped by the AMDC card.

As expected, the AMDC did not drop packets until the point where line saturation is reached, at around 450Mbps (3,554,119 messages per second, 12 times the original OPRA rate). However, the combination of user application and OS network stack began to drop packets above six times the original data rate, at 1,777,059 messages per second. This is illustrated in Figure 7: as the throughput increases above six times the original data rate, an increasing number of packets are dropped.

Use of operating system networking stack routines on the output of the system skewed the latency tests. It is estimated that these routines add between 15 and 20us of extra latency, depending on load. Figure 8 shows that 99% of packets have less than 26us of latency passing through the test server.

The results so far have concentrated on a complete redistribution system. Here we consider the latency of the hardware portion of the design residing on the AMDC card, from the time a packet enters the FPGA on the AMDC card until it becomes available in the memory space of the user application.

To gather these results, the test illustrated in Figure 9 was conducted. A counter running at the gigabit Ethernet mother clock of 125MHz (8ns resolution) runs on the FPGA in parallel with the packet processor. The tail of each packet entering the FPGA is appended with the current state of the 125MHz counter. The packet then passes through normal processing on the FPGA and is handed over to the API running on one of the test server's CPU cores over the PCIe bus. When the API hands over a packet to the user application, it strips the packet's counter value and returns it to the FPGA over the PCIe bus. A separate module inside the FPGA subtracts this from the current counter value to calculate the latency, and accumulates the mean latency, which is reported at the end of the test. Notice that the latency of the PCIe bus is included twice in the measurement, once more than necessary. This latency has not been taken account of in the results, though doing so would obviously improve them further.

The results of this test are reproduced in Figure 10. With increasing throughput, the latency of the hardware portion of the design remains roughly constant at around 4us.
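As a quick consistency check on these figures, based only on the numbers reported above rather than on any additional measurement: a 125MHz counter ticks every 1/125,000,000 s = 8ns, so the measured 4us hardware latency corresponds to roughly 4us / 8ns = 500 counter ticks. Similarly, scaling the captured average of 296,177 messages per second by factors of six and twelve gives approximately 1,777,062 and 3,554,124 messages per second, which matches the reported thresholds of 1,777,059 and 3,554,119 to within the rounding of the quoted capture rate.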


Fig. 10. Latency of the AMDC portion of the overall system with increasing throughput: the average AMDC latency remains close to 4us across OPRA feed rates up to 600Mbps.

V. RELATED WORK

An alternative hardware-accelerated market data architecture is the Exegy Ticker Plant [5]. In this approach, incoming market data enters the system via a conventional Ethernet card. Processing is then augmented with an FPGA accelerator on the processor bus. This has the advantage of a higher-throughput bus, although this is not needed even for a saturated gigabit Ethernet link.

Another accelerated feed processing approach is provided by the ActivFeed MPU [6]. This uses an XtremeData FPGA accelerator which is placed in a processor socket of the host system, and communicates using HyperTransport. However, network integration is via an Infiniband bridge, rather than by direct connection to the Ethernet, and processing latencies are quoted as "end-to-end latency surpass[es] 100 us", compared to the 20us latencies reported here.

The idea of packet processing using graphs of composable nodes has been used in a number of previous systems. In the Click system [7] a graph of processing nodes is defined using a C++ interface in software, which is then compiled into an FPGA design. The NetFPGA project (http://www.netfpga.org/) offers a library of packet processing components, which communicate using a common protocol, but requires the user to manually connect together each of the protocol wires between them. In contrast, the approach developed here allows the graph to be described directly in Handel-C [3], so no extra compiler passes are required. The connections between components are also specified as abstract data-paths: all protocol-specific details of the connection are hidden from the programmer.

VI. CONCLUSION

This paper presents a method that allows processing of market data feeds using FPGAs, providing the ability to process extremely large numbers of messages per second, while also minimising the latency between the arrival of network packets and their delivery to their intended target in software. This is achieved by eliminating the operating system networking stack: all message processing and filtering is applied in an FPGA, which is then able to push messages directly into the memory space of software threads via FPGA-initiated DMA. As well as reducing latency due to the OS stack, this also reduces both the programming burden and the performance overhead for software components, as messages are provided as fully decoded memory structures, rather than serialised messages which must be parsed.

This approach has been implemented in the Celoxica AMDC accelerator card, which incorporates two gigabit Ethernet ports and a Xilinx Virtex-5 LX110T FPGA, connected to a host computer over the PCIe bus. Tests performed using the OPRA FAST compressed data feed format have shown that an AMDC-accelerated system can support a message throughput of 3.5 million messages per second, 12 times the current real-world rate, while the complete system rebroadcasts at least 99% of packets with a latency of less than 26us. The hardware portion of the design has a constant latency, irrespective of throughput, of 4us.

Currently the proposed architecture only accelerates incoming market data. In the future, accelerated Ethernet transmission will be examined. This will include uni- and multicast UDP offload, TCP/IP offload, and market order execution.

REFERENCES

[1] T. J. Todman, G. A. Constantinides, S. J. E. Wilton, O. Mencer, W. Luk, and P. Y. K. Cheung, "Reconfigurable computing: architectures and design methods," IEE Proc. Computing and Digital Techniques, vol. 152, no. 2, pp. 193-207, 2004.
[2] K. Houstoun, FIX Adapted for STreaming - FAST Protocol Technical Overview, 2006.
[3] Handel-C Language Reference, Celoxica Ltd., http://www.celoxica.com, 1999.
[4] G. W. Morris and M. Aubury, "Design space exploration of the European option benchmark using Hyperstreams," in Proc. FPL, 2007, pp. 5-10.
[5] S. T. A. Center, "Exegy ticker plant with Infiniband," STAC Report, July 2007.
[6] "ActivFeed MPU: Accelerate your market data," http://www.activfinancial.com/docs/ActivFeedMPU.pdf, 2007.
[7] C. Kulkarni, G. Brebner, and G. Schelle, "Mapping a domain specific language to a platform FPGA," in Proc. IEEE Design Automation Conference, 2004, pp. 924-927.
