…the applications via memory. At the exchange, the messages sent over the A and B feeds are transmitted at exactly the same time. However, the intervening network infrastructure between exchange and member will introduce delays of variable length, for example due to buffering in the internal buses of switches. For the same reason packets may also arrive out of sequence, either due to switching algorithms or due to different packets taking different routes. The result is that the relationship between the A and B feeds will change over time, possibly on a per-packet basis, with sometimes one feed ahead and sometimes the other.

To reconstruct the stream we maintain a window of message identifiers, which extends from the most recent message identifier we have seen, r, down to the bottom of the window, r − w + 1. The parameter w is chosen to reflect the maximum observed transmission delay. Within this window we maintain a flag associated with each message identifier, indicating whether that message has been forwarded to the application yet. As each message arrives from the A or B channel, the associated flag is checked, and the message is forwarded or dropped depending on whether it has already been seen. When a message identifier i greater than r is encountered, the window is moved forwards; any identifier that leaves the window without ever having been forwarded was lost on both feeds, and is reported to the application.
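The window logic is simple enough to express compactly in software. The following C sketch is illustrative only: the paper performs this arbitration in hardware, and the names, the window size W, the flag encoding (a "forwarded" bit per identifier, so a slot recycled with the bit still clear marks a lost message), and the cold-start behaviour are all assumptions.

    #include <inttypes.h>
    #include <stdbool.h>
    #include <stdint.h>
    #include <stdio.h>

    #define W 1024               /* window size w: must cover the maximum
                                    observed inter-feed delay (illustrative) */

    static uint64_t r;           /* most recent message identifier seen */
    static bool forwarded[W];    /* one flag per identifier in [r-w+1, r] */

    static void deliver(uint64_t id)     { (void)id; /* pass to application */ }
    static void report_loss(uint64_t id) { fprintf(stderr, "lost %" PRIu64 "\n", id); }

    /* Called for every message arriving on either the A or the B feed. */
    void arbitrate(uint64_t id)
    {
        if (id + W <= r)
            return;                          /* below the window: stale, drop */

        if (id <= r) {                       /* inside the current window */
            if (!forwarded[id % W]) {
                forwarded[id % W] = true;
                deliver(id);                 /* first copy wins */
            }
            return;                          /* duplicate from the other feed */
        }

        /* id > r: slide the window forward one identifier at a time. Any
           slot recycled with its flag still clear belonged to an identifier
           that left the window unseen, i.e. lost on both feeds. */
        for (uint64_t j = r + 1; j <= id; j++) {
            if (j >= W && !forwarded[j % W])
                report_loss(j - W);
            forwarded[j % W] = false;        /* slot now tracks identifier j */
        }
        r = id;
        forwarded[id % W] = true;
        deliver(id);
    }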
D. Filtering and Delivery

The eventual consumers of messages are implemented as conventional software threads, executing on the multi-processor host containing the FPGA board. This allows threads to be developed in conventional languages, such as C, rather than requiring esoteric hardware design languages. Threads interact with the hardware accelerator by first using a software API to specify a message filter, indicating which specific assets in the market they are interested in, and what types of messages about those assets they wish to receive. Once the message filter has been set, the threads continuously poll a message queue, waiting for any messages meeting the filter to be delivered (a sketch of such a polling loop is given at the end of this section).

Polling is traditionally seen as inefficient, but in this situation it is the best way of minimising latency. A traditional notification mechanism, such as an OS-level mutex or event, incurs a context switch on the part of both notifier and notifyee, adding a potentially large and variable amount of latency. However, because messages are pushed directly into the memory space of the thread, a thread polling memory will receive the new packet with a latency determined only by the speed of the memory and cache-coherency protocol.

To further minimise the latency, it is necessary to lock each stream client thread to a specific CPU, and to make sure that it is the only thread on that CPU (see the affinity sketch at the end of this section). This means that the OS never moves threads between processors, and that a thread is never pre-empted by another thread. Modern multi-core processors offer many individual CPU cores, so this is an efficient trade-off. All non-feed processing threads, such as OS processes, are locked onto a single dedicated CPU.

[Fig. 5. Broadcast of packets to threads.]

As well as the stream client and OS cores, another core is dedicated to run-time management of the hardware accelerator. This is responsible for some aspects of routing, such as multicasting packets out to individual threads. Note that the run-time distributor only delivers pointers to structures that have already been placed in memory: messages do not need to be moved around once they have entered system memory.

Figure 5 gives an overview of the software processing of messages: first, the FPGA accelerator DMAs a copy of the entire message into shared memory; second, the run-time management thread detects the new message and broadcasts a pointer to it to all interested processing threads; finally, each thread receives a pointer to the shared message and can immediately start processing. Since user threads operate on lists of trading symbols, it is simple to track per-core usage and balance the lists symmetrically across cores.

IV. IMPLEMENTATION AND RESULTS

In order to verify the architecture, an implementation of the scheme proposed in Section III was carried out. The platform used for this implementation was the Celoxica AMDC accelerator card, which uses a Xilinx Virtex-5 LX110T FPGA device as the main processing element. The FPGA packet processing engine was written in Handel-C [3], using the Hyper-Streams programming model [4]. The AMDC card has been measured to draw less than 15 watts of power from its host server.

The Celoxica AMDC card was inserted in a quad-core 2.4 GHz AMD Opteron server running Red Hat Enterprise Linux 5. The server was configured such that one CPU core was locked to polling the FPGA card for incoming messages, with another locked to running a user packet-redistribution application. The remaining cores were left unlocked for running OS-related processes.

Figure 6 shows the harness used for these tests. Pre-recorded OPRA FAST v2 data is inserted into the system with the same inter-packet gaps as the original data. Further tests artificially accelerate the data transmission rate by reducing the inter-packet gaps by a constant factor. The original capture had an average of 296,177 messages per second over 60 seconds. Data is streamed into the AMDC card inside the test server, as well as a separate packet sniffer. The Celoxica AMDC card processes the incoming feed, as discussed in Section III, and the data is made available to the user application. The user application repackages the data and broadcasts it out of the server's internal NIC, using a standard OS socket call.
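To make the filter-and-poll interface of Section III-D concrete, the following C sketch shows the shape of a stream client thread. It is a minimal sketch under assumed names: the paper does not publish the driver API, so filter_add, the message layout, and the single-producer/single-consumer pointer ring are all hypothetical.

    #include <stdatomic.h>
    #include <stdint.h>

    /* Hypothetical layout: the FPGA has already DMA'd the decoded message
       into shared memory before a pointer to it is queued. */
    struct msg {
        uint32_t symbol_id;          /* asset the message refers to */
        uint32_t type;               /* trade, quote, ... */
        /* decoded payload follows */
    };

    #define RING_SLOTS 4096

    /* Single-producer (run-time distributor core) / single-consumer
       (this thread) ring of message pointers in shared memory. */
    struct ring {
        _Atomic(struct msg *) slot[RING_SLOTS];
        _Atomic uint64_t head;       /* advanced by the distributor */
        uint64_t tail;               /* private to this consumer */
    };

    extern void filter_add(uint32_t symbol_id, uint32_t type_mask); /* assumed API */
    extern void process(const struct msg *m);                       /* app handler */

    void consumer(struct ring *q, uint32_t my_symbol)
    {
        filter_add(my_symbol, ~0u);  /* all message types for one asset */

        for (;;) {
            /* Busy-poll rather than block: no context switch, so delivery
               latency is set only by memory and cache-coherency speed. */
            if (q->tail == atomic_load_explicit(&q->head, memory_order_acquire))
                continue;

            struct msg *m = atomic_load_explicit(&q->slot[q->tail % RING_SLOTS],
                                                 memory_order_relaxed);
            q->tail++;
            process(m);              /* pointer only: the body is never copied */
        }
    }

Because only pointers cross the ring, the distributor can publish the same message to many such rings at the cost of one store per interested thread.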

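The core pinning described in Section III-D maps directly onto standard Linux thread-affinity calls. A minimal sketch follows; it is Linux-specific, error handling is elided, and the core numbering is an assumption.

    #define _GNU_SOURCE
    #include <pthread.h>
    #include <sched.h>

    /* Pin the calling thread to a single core: the OS will never migrate
       it, and if nothing else is scheduled on that core it is never
       pre-empted by another thread. */
    static int pin_to_core(int core)
    {
        cpu_set_t set;
        CPU_ZERO(&set);
        CPU_SET(core, &set);
        return pthread_setaffinity_np(pthread_self(), sizeof(set), &set);
    }

A stream client would call, say, pin_to_core(3) before entering its polling loop, while OS and housekeeping threads are confined to a single dedicated core by the same mechanism.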