A PlayStation 3 based software correlator for eVLBI
Software correlation on the Cell processor

Jan Wagner, Jouko Ritakari
Metsähovi Radio Observatory
jwagner@kurp.hut.fi
Software correlation (1/2)

Software correlators have been used in production around the world, and are now slowly appearing in the EVN, too.

Hardware correlators:
- power-efficient, high-performance number crunchers with fixed capabilities

Software correlators:
- easily extendable, reconfigurable, scalable
- data via IP protocols (perhaps eVLBI)
- no custom computing hardware; commodity PC cluster
- free existing software correlators can be modified and adapted to one's own requirements

DiFX from Swinburne is a popular, well-written software correlator.
DiFX was used as a basis in Metsähovi development.
Software correlation (2/2)

DiFX was written by A. Deller, Swinburne:
- production quality
- used for VLBI, geo-VLBI, pulsar binning
- uses Intel's optimized math library (IPP)
- cluster architecture (normal MPI) leads to easy scaling to additional stations

Original DiFX runs "only" on Intel.

At the end of 2006, the first version of the IBM Cell Broadband Engine (processor) was released.

IBM Cell beats current Intel in:
- significantly more GFLOPS
- better FLOPS per watt, lower cost
- better memory bus architecture

Porting DiFX to Cell seemed attractive!
The Cell Broadband Engine

IBM Cell is a heterogeneous multi-core processor:
- 1 scaled-down PowerPC core with AltiVec vector unit (PPU)
- 8 special Synergistic Processing Unit vector processors (SPU)

Total computing power: 218 GFLOPS @ 3.2 GHz, 35 W
Cores are on an interconnect bus (~0.3 TB/s)
Constrained by the memory port (~25 GB/s)

Cell is available in the Cell Blade QS20, the PlayStation 3, and various computing boards.
Cell processor figures

- Typically, instructions take ~1 ns, ~5 cycles
- 1024-point FFT: 6 µs with the IBM SDK c2c FFT, 9 µs with the Metsähovi-patched Cell FFTW
- Complex conjugate, multiply and accumulate for 4 complex pairs: only four instructions
- Calculating four sin/cos pairs takes five instructions (for a predetermined quadrant)
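One plausible way to arrive at four operations for the conjugate multiply-accumulate is sketched below. It assumes the real and imaginary parts are kept in separate four-lane vectors, and that each step maps to one fused multiply-add style vector operation. The helpers `madd` and `nmsub` are hypothetical stand-ins written in plain Python, not the actual SPU code used at Metsähovi.

```python
def madd(a, b, c):
    """Lane-wise a*b + c (hypothetical stand-in for one vector multiply-add)."""
    return [x * y + z for x, y, z in zip(a, b, c)]


def nmsub(a, b, c):
    """Lane-wise c - a*b (hypothetical stand-in for one negated multiply-subtract)."""
    return [z - x * y for x, y, z in zip(a, b, c)]


def conj_mac4(acc_re, acc_im, a_re, a_im, b_re, b_im):
    """acc += a * conj(b) for four complex pairs, in four lane-wise steps.

    a * conj(b) = (a_re*b_re + a_im*b_im) + j*(a_im*b_re - a_re*b_im)
    """
    acc_re = madd(a_re, b_re, acc_re)   # re += a_re * b_re
    acc_re = madd(a_im, b_im, acc_re)   # re += a_im * b_im
    acc_im = madd(a_im, b_re, acc_im)   # im += a_im * b_re
    acc_im = nmsub(a_re, b_im, acc_im)  # im -= a_re * b_im
    return acc_re, acc_im


# Four complex pairs, stored as separate real and imaginary lanes.
a_re, a_im = [1.0, 0.0, 2.0, 1.0], [2.0, 1.0, 0.0, -1.0]
b_re, b_im = [3.0, 1.0, 1.0, 1.0], [4.0, 1.0, 1.0, 1.0]
vis_re, vis_im = conj_mac4([0.0] * 4, [0.0] * 4, a_re, a_im, b_re, b_im)
```

Keeping real and imaginary parts in separate vectors avoids any shuffling between lanes, which is why no extra instructions are needed in this formulation.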
Sony PlayStation 3

Metsähovi bought a PS3 in January 2007 for evaluation purposes.
It's a low-cost IBM Cell platform: install Linux and the IBM Cell SDK.
Expecting to need 1/6th as many PS3s as PCs.
Began porting DiFX in January, with pauses.

First stage of porting DiFX to Cell:
- Replaced closed-source Intel IPP in DiFX with platform-independent math functions
- Runs on AMD, Cell, others
- Completed in February 2007
Results of first DiFX port stage

With the Intel IPP maths replaced, DiFX of course got slower.

Original Intel IPP DiFX:
   15 s [1]  Intel Dual Core, 3.2 GHz
   40 s      Intel Pentium 4, 3.0 GHz

Platform-independent DiFX:
  110 s      Intel Pentium 4, 3.0 GHz
  220 s      PS3 Cell PPU, 3.2 GHz

The next port stage was to begin using the Cell SPE cores to get to the real Cell computing power.

[1] Times are for the test data set from the DiFX homepage (4 stations x 40 MB).
Results of ongoing DiFX port

Total wallclock time 48 s, or 36 s without local disk I/O. Intel was 40 s on P4, 15 s on Dual Core.

  Routine               Time   Runs on             Task
  Core::processdata()   21 s   100% PPU            baseline MAC
  Mode::process()        7 s   70% SPU             fringe rotation, FFT, ...
  Mode::unpack()         7 s   either PPU or SPU   raw to float
  Disk reading          13 s   PPU
  Others                 1 s   PPU
Own core test results

- Written 2 weeks ago. Raw data is first streamed to one SPE for unpacking, which streams floats onwards to the processing SPEs.
- Current throughput of core routines per single SPE:

  Routine                                     Throughput   Sample rate
  raw data to float                           84 Gbps      ~2.5 Gsps for 2-bit in, float out
  fringe rotation w/ quadrature oscillator    60 Gbps      960 Mcsps
  1024-point c2c FFT                          10.5 Gbps    170k FFT/s or ~135 Mcsps
  complex MAC                                 95 Gbps      ~3x500 Mcsps
  rotation, FFT, MAC                          7.1 Gbps     ~110 Mcsps (x5..6)
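The processing chain listed above (unpack, fringe rotation, FFT, complex MAC) can be illustrated with a minimal pure-Python sketch. It is only a functional model, not the SPE implementation. The 2-bit level mapping in `LEVELS`, the bit order within a byte, the FFT length and all function names are assumptions made for illustration.

```python
import cmath
import math

# Assumed 2-bit encoding: code 0..3 maps to these levels (real formats differ).
LEVELS = (-3.0, -1.0, 1.0, 3.0)


def unpack_2bit(raw):
    """Expand each byte into four float samples, least significant bit pair first."""
    out = []
    for byte in raw:
        for shift in (0, 2, 4, 6):
            out.append(LEVELS[(byte >> shift) & 0b11])
    return out


def fringe_rotate(samples, phase0, dphase):
    """Multiply samples by a quadrature oscillator exp(-j*(phase0 + n*dphase))."""
    osc = cmath.exp(-1j * phase0)
    step = cmath.exp(-1j * dphase)
    out = []
    for s in samples:
        out.append(s * osc)
        osc *= step  # advance the oscillator by one sample
    return out


def fft(x):
    """Recursive radix-2 complex FFT; len(x) must be a power of two."""
    n = len(x)
    if n == 1:
        return list(x)
    even, odd = fft(x[0::2]), fft(x[1::2])
    half = n // 2
    out = [0j] * n
    for k in range(half):
        t = cmath.exp(-2j * math.pi * k / n) * odd[k]
        out[k] = even[k] + t
        out[k + half] = even[k] - t
    return out


def correlate(raw_a, raw_b, nfft=8, dphase_a=0.0, dphase_b=0.0):
    """Accumulate the cross-power spectrum A * conj(B) over all full FFT blocks."""
    a = fringe_rotate(unpack_2bit(raw_a), 0.0, dphase_a)
    b = fringe_rotate(unpack_2bit(raw_b), 0.0, dphase_b)
    acc = [0j] * nfft
    for start in range(0, min(len(a), len(b)) - nfft + 1, nfft):
        fa = fft(a[start:start + nfft])
        fb = fft(b[start:start + nfft])
        acc = [s + x * y.conjugate() for s, x, y in zip(acc, fa, fb)]
    return acc


# Correlating a stream with itself gives a real, non-negative power spectrum.
raw = bytes([0xE4] * 4)
spectrum = correlate(raw, raw)
```

In a streamed design each of these functions would run on its own block of samples as it arrives, rather than on whole buffers as in this sketch.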
Conclusions and outlook

- Own correlator core:
  - one PS3 could correlate > 1 Gbps in real time, but is LAN-limited
  - a 16-PS3 cluster should handle 10 stations at 1 Gbps in real time
  - still some more programming to be done...
- DiFX:
  - TODO: move baseline processing from PPU onto all SPEs
  - refactor the DiFX correlation code so that it works in streamed processing fashion (much faster!), just like our own core

The final core(s) should be available in 1 month.
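As a rough sanity check of the cluster estimate, the arithmetic can be sketched as follows. It assumes the data is distributed in time slices, so that every node receives data from all stations for its share of the time, and it assumes a 1 Gbps LAN link per PS3. Both are assumptions for illustration, not statements about the final design.

```python
def baseline_count(stations):
    """Number of cross-correlation baselines for a given number of stations."""
    return stations * (stations - 1) // 2


def per_node_input_gbps(stations, rate_gbps, nodes):
    """Average input rate per node if time slices are spread evenly over nodes."""
    return stations * rate_gbps / nodes


stations, rate_gbps, nodes = 10, 1.0, 16
link_gbps = 1.0  # assumed LAN link rate per PS3

baselines = baseline_count(stations)
node_rate = per_node_input_gbps(stations, rate_gbps, nodes)
fits_on_link = node_rate <= link_gbps
```

With these numbers there are 45 baselines, and each node would see 0.625 Gbps on average, which stays below the assumed link rate.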