
Beyond Simple Monte-Carlo: Parallel Computing with QuantLib

Klaus Spanderen
E.ON Global Commodities

November 14, 2013


◮ Symmetric Multi-Processing
◮ Graphical Processing Units
◮ Message Passing Interface
◮ Conclusion


Symmetric Multi-Processing: Overview

◮ Moore's Law: the number of transistors doubles every two years.
◮ Leakage turns out to be the death of CPU scaling.
◮ Multi-core designs help processor makers to manage power dissipation.
◮ Symmetric Multi-Processing has become a mainstream technology.

Herb Sutter: "The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software."


Multi-Processing with QuantLib

Divide and conquer: spawn several independent OS processes.

[Figure: GFlops vs. number of processes; the QuantLib benchmark on a 32 core (plus 32 HT cores) server.]
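As a minimal illustration (not from the talk), the divide-and-conquer idea can be sketched with plain POSIX fork/wait: each child process runs an independent pricing job and shares nothing with its siblings.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

int main() {
    const int nProcesses = 8;
    for (int i = 0; i < nProcesses; ++i) {
        if (fork() == 0) {
            // child i: run an independent QuantLib pricing job here,
            // e.g. the i-th chunk of Monte-Carlo paths, then exit.
            std::exit(0);
        }
    }
    for (int i = 0; i < nProcesses; ++i)
        wait(NULL); // parent collects the children
    return 0;
}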


Multi-Threading: Overview

◮ QuantLib is per se not thread-safe.
◮ Use case one: a really thread-safe QuantLib (see Luigi's talk).
◮ Use case two: multi-threading to speed up single pricings.
  ◮ Joseph Wang is working with Open Multi-Processing (OpenMP) to parallelize several finite difference and Monte-Carlo algorithms.
◮ Use case three: multi-threading to parallelize several pricings, e.g. parallel pricing to calibrate models.
◮ Use case four: use of QuantLib in C#, F#, Java or Scala via the SWIG layer, and multi-threaded unit tests.
◮ Focus on use cases three and four:
  ◮ The situation is not too bad as long as objects are not shared between different threads.


Multi-Threading: Parallel Model Calibration

C++11 version of a parallel model calibration function:

Disposable<Array> CalibrationFunction::values(const Array& params) const {
    model_->setParams(params);

    std::vector<std::future<Real> > errorFcts;
    std::transform(std::begin(instruments_), std::end(instruments_),
                   std::back_inserter(errorFcts),
                   [](decltype(*std::begin(instruments_)) h) {
                       return std::async(std::launch::async,
                           &CalibrationHelper::calibrationError, h.get());
                   });

    Array values(instruments_.size());
    std::transform(std::begin(errorFcts), std::end(errorFcts),
                   values.begin(),
                   [](std::future<Real>& f) { return f.get(); });

    return values;
}


Multi-Threading: Singleton

◮ Riccardo's patch: all singletons are thread-local singletons.

template <class T>
T& Singleton<T>::instance() {
    static boost::thread_specific_ptr<T> tss_instance_;
    if (!tss_instance_.get()) {
        tss_instance_.reset(new T);
    }
    return *tss_instance_;
}

◮ C++11 implementation: Scott Meyers singleton.

template <class T>
T& Singleton<T>::instance() {
    static thread_local T t_;
    return t_;
}
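A minimal usage sketch (not part of the patch itself), assuming the thread-local Singleton above: each worker thread then owns its own Settings instance, so concurrent pricings can use different evaluation dates without interfering.

#include <ql/quantlib.hpp>
#include <thread>
#include <vector>

using namespace QuantLib;

void priceAsOf(const Date& evalDate) {
    // affects only this thread's Settings copy
    Settings::instance().evaluationDate() = evalDate;
    // ... build term structures, instruments and engines as usual ...
}

int main() {
    std::vector<std::thread> workers;
    workers.emplace_back(priceAsOf, Date(14, November, 2013));
    workers.emplace_back(priceAsOf, Date(15, November, 2013));
    for (std::thread& w : workers)
        w.join();
    return 0;
}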


Multi-Threading: Observer Pattern

◮ Main purpose in QuantLib: distributed event handling.
◮ The current implementation is highly optimized for single-threading performance.
◮ In a thread-local environment this would be sufficient, but ...
◮ ... the parallel garbage collector in C#/F#, Java or Scala is by definition not thread-local!
◮ Shuo Chen's article "Where Destructors meet Threads" provides a good solution ...
◮ ... but it is not applicable to QuantLib without a major redesign of the observer pattern.


Multi-Threading: Observer Pattern

A Scala example fails immediately with spurious error messages:

◮ pure virtual function call
◮ segmentation fault

import org.quantlib.{Array => QArray, _}

object ObserverTest {
    def main(args: Array[String]) : Unit = {
        System.loadLibrary("QuantLibJNI");
        val aSimpleQuote = new SimpleQuote(0)

        while (true) {
            (0 until 10).foreach(_ => {
                new QuoteHandle(aSimpleQuote)
                aSimpleQuote.setValue(aSimpleQuote.value + 1)
            })
            System.gc
        }
    }
}


Multi-Threading: Observer Pattern

◮ The observer pattern itself can be made thread-safe using the boost::signals2 library.
◮ The problem remains that an observer must be unregistered from all observables before its destructor is called.
◮ Solution:
  ◮ QuantLib enforces that all observers are instantiated as boost shared pointers.
  ◮ The preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS provides a hook into every destructor call of a shared object.
  ◮ If the shared object is an observer, the thread-safe version of Observer::unregisterWithAll is used to detach the observer from all observables.
◮ Advantage: this solution is backward compatible, e.g. the test suite can now run multi-threaded.


Finite Difference Methods on GPUs: Overview

◮ The performance of finite difference methods is mainly driven by the speed of the underlying sparse linear algebra subsystem.
◮ In QuantLib any finite difference operator can be exported as a boost::numeric::ublas::compressed_matrix.
◮ Boost sparse matrices can be exported in Compressed Sparse Row (CSR) format to high performance libraries.
◮ CUDA sparse matrix libraries:
  ◮ cuSPARSE: basic linear algebra subroutines for sparse matrices.
  ◮ cusp: a general template library for sparse iterative solvers.
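A minimal sketch of the CSR hand-over (not from the talk): a row-major boost::numeric::ublas::compressed_matrix, here a stand-in for an exported QuantLib operator, is walked with the standard uBLAS iterators and copied into the three CSR arrays expected by cuSPARSE or cusp.

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <vector>
#include <cstddef>

typedef boost::numeric::ublas::compressed_matrix<double> SparseMatrix;

// Convert a sparse matrix into CSR arrays (row pointers, column indices, values).
void toCSR(const SparseMatrix& A,
           std::vector<int>& rowPtr,
           std::vector<int>& colInd,
           std::vector<double>& values) {
    rowPtr.assign(A.size1() + 1, 0);
    for (SparseMatrix::const_iterator1 row = A.begin1(); row != A.end1(); ++row)
        for (SparseMatrix::const_iterator2 it = row.begin(); it != row.end(); ++it) {
            colInd.push_back(static_cast<int>(it.index2()));
            values.push_back(*it);
            ++rowPtr[it.index1() + 1];      // count non-zeros per row
        }
    for (std::size_t i = 0; i < A.size1(); ++i)
        rowPtr[i + 1] += rowPtr[i];         // prefix sum gives the row offsets
}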


Sparse Matrix Libraries for GPUs

[Figure: performance charts from NVIDIA (https://developer.nvidia.com/cuSPARSE).]


Sparse Matrix Libraries for GPUs

[Figure: performance charts from NVIDIA.]

Speed-ups are smaller than the "100x" reported for Monte-Carlo.


Example I: Heston-Hull-White Model on GPUs

The SDE is defined by

\begin{align*}
dS_t &= (r_t - q_t) S_t\, dt + \sqrt{\nu_t}\, S_t\, dW_t^S \\
d\nu_t &= \kappa_\nu(\theta_\nu - \nu_t)\, dt + \sigma_\nu \sqrt{\nu_t}\, dW_t^\nu \\
dr_t &= \kappa_r(\theta_{r,t} - r_t)\, dt + \sigma_r\, dW_t^r \\
\rho_{S\nu}\, dt &= dW_t^S\, dW_t^\nu, \qquad
\rho_{Sr}\, dt = dW_t^S\, dW_t^r, \qquad
\rho_{\nu r}\, dt = dW_t^\nu\, dW_t^r
\end{align*}

Feynman-Kac gives the corresponding PDE:

\begin{align*}
\frac{\partial u}{\partial t}
 &= \frac{1}{2} S^2 \nu \frac{\partial^2 u}{\partial S^2}
  + \frac{1}{2}\sigma_\nu^2 \nu \frac{\partial^2 u}{\partial \nu^2}
  + \frac{1}{2}\sigma_r^2 \frac{\partial^2 u}{\partial r^2} \\
 &+ \rho_{S\nu}\sigma_\nu S \nu \frac{\partial^2 u}{\partial S\,\partial \nu}
  + \rho_{Sr}\sigma_r S \sqrt{\nu}\, \frac{\partial^2 u}{\partial S\,\partial r}
  + \rho_{\nu r}\sigma_r \sigma_\nu \sqrt{\nu}\, \frac{\partial^2 u}{\partial \nu\,\partial r} \\
 &+ (r - q) S \frac{\partial u}{\partial S}
  + \kappa_\nu(\theta_\nu - \nu)\frac{\partial u}{\partial \nu}
  + \kappa_r(\theta_{r,t} - r)\frac{\partial u}{\partial r}
  - r u
\end{align*}


Example I: Heston-Hull-White Model on GPUs

◮ Good news: QuantLib can build the sparse matrix.
◮ An operator splitting scheme needs to be ported to the GPU.

void HundsdorferScheme::step(array_type& a, Time t) {
    Array y = a + dt_*map_->apply(a);
    Array y0 = y;

    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = y - theta_*dt_*map_->apply_direction(i, a);
        y = map_->solve_splitting(i, rhs, -theta_*dt_);
    }

    Array yt = y0 + mu_*dt_*map_->apply(y-a);
    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = yt - theta_*dt_*map_->apply_direction(i, y);
        yt = map_->solve_splitting(i, rhs, -theta_*dt_);
    }
    a = yt;
}


Example I: Heston-Hull-White Model on GPUs

[Figure: Heston-Hull-White model, GTX 560 vs. Core i7; speed-up of the GPU (single and double precision) over the CPU for grid sizes (t, x, v, r) from 20x50x20x10 up to 50x400x100x20.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example II: Heston Model on GPUs

[Figure: Heston model, GTX 560 vs. Core i7; speed-up of the GPU (single and double precision) over the CPU for grid sizes (t, x, v) from 50x200x100 up to 100x2000x1000.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example III: Virtual Power Plant

The Kluge model (two OU processes plus a jump diffusion) leads to a three-dimensional partial integro differential equation:

\begin{align*}
rV = \frac{\partial V}{\partial t}
 &+ \frac{\sigma_x^2}{2}\frac{\partial^2 V}{\partial x^2}
  - \alpha x \frac{\partial V}{\partial x}
  - \beta y \frac{\partial V}{\partial y}
  + \frac{\sigma_u^2}{2}\frac{\partial^2 V}{\partial u^2}
  - \kappa u \frac{\partial V}{\partial u}
  + \rho \sigma_x \sigma_u \frac{\partial^2 V}{\partial x\,\partial u} \\
 &+ \lambda \int_{\mathbb{R}} \left( V(x, y+z, u, t) - V(x, y, u, t)\right) \omega(z)\, dz
\end{align*}

Due to the integro part the discretized equation no longer leads to a truly sparse matrix.


Example III: Virtual Power Plant

[Figure: GTX 560 @ 0.8/1.6 GHz vs. Core i5 @ 3.0 GHz; calculation times for grid sizes (x, y, u, s) from 10x10x10x6 up to 100x50x40x6, comparing GPU BiCGStab+Tridiag, GPU BiCGStab+nonsym Bridson, GPU BiCGStab, GPU BiCGStab+Diag, the CPU Douglas scheme and CPU BiCGStab+Tridiag.]


Quasi Monte-Carlo on GPUs: Overview

◮ The Koksma-Hlawka bound is the basis for any QMC method:

\[
\left| \frac{1}{n}\sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(u)\, du \right|
  \le V(f)\, D^*(x_1, \ldots, x_n),
\qquad
D^*(x_1, \ldots, x_n) \ge c\,\frac{(\log n)^d}{n}
\]

◮ The real advantage of QMC shows up only after drawing N ∼ e^d samples, where d is the dimensionality of the problem.
◮ Dimensional reduction of the problem is often the first step.
◮ The Brownian bridge is tailor-made to reduce the number of significant dimensions.


Quasi Monte-Carlo on GPUs: Arithmetic Option Example

[Figure]


Quasi Monte-Carlo on GPUs: Exotic Equity Options

[Figure from: Bernemann et al., "Accelerating Exotic Option Pricing and Model Calibration Using GPUs", in High Performance Computational Finance (WHPCF), IEEE Workshop on, Nov. 2010.]


Quasi Monte-Carlo on GPUs: QuantLib Implementation

◮ CUDA supports Sobol random numbers up to dimension 20,000.
◮ Direction integers are taken from the JoeKuoD7 set.
◮ On comparable hardware the CUDA Sobol generators are approx. 50 times faster than MKL.
◮ Weights and indices of the Brownian bridge will be calculated by QuantLib.
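A minimal CPU-side sketch (not from the talk) of the building blocks named above, using standard QuantLib classes: a Sobol sequence with JoeKuoD7 direction integers is mapped to Gaussians and run through a Brownian bridge. In the GPU setup only the number generation and the bridge transform would be offloaded, while QuantLib still supplies the bridge weights and indices.

#include <ql/quantlib.hpp>
#include <vector>

using namespace QuantLib;

int main() {
    const Size nSteps = 365;                        // time steps per path

    SobolRsg sobol(nSteps, 42, SobolRsg::JoeKuoD7); // low-discrepancy sequence
    InverseCumulativeNormal invGauss;               // uniforms to Gaussians
    BrownianBridge bridge(nSteps);                  // weights/indices from QuantLib

    const std::vector<Real>& u = sobol.nextSequence().value;

    std::vector<Real> gaussians(nSteps), path(nSteps);
    for (Size i = 0; i < nSteps; ++i)
        gaussians[i] = invGauss(u[i]);

    // bridge construction: the leading Sobol dimensions determine the
    // coarse structure of the path and carry most of the variance
    bridge.transform(gaussians.begin(), gaussians.end(), path.begin());

    return 0;
}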


Quasi Monte-Carlo on GPUs: Performance

[Figure: Sobol Brownian bridge, GPU vs. CPU; speed-up as a function of path length (10^1 to 10^4) for single precision/one factor, single precision/four factors and double precision/one factor.]

Comparison GPU (GTX 560 @ 0.8/1.6 GHz) vs. CPU (i5 @ 3.0 GHz).


Quasi Monte-Carlo on GPUs: Scrambled Sobol Sequences

◮ In addition, CUDA supports scrambled Sobol sequences.
◮ Higher order scrambled sequences are a variant of randomized QMC methods.
◮ They achieve better root mean square errors on smooth integrands.
◮ Error analysis is difficult: a shifted (t,m,d)-net does not need to be a (t,m,d)-net.

[Figure: RMSE for a benchmark portfolio of Asian options.]
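A minimal host-side sketch (not from the talk), assuming cuRAND's quasi-random generators: a scrambled 32-bit Sobol generator fills a device buffer with uniforms. Error checking is omitted and the dimension and sample counts are illustrative.

#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const unsigned int dim = 64;   // dimensionality of the problem
    const size_t nSamples = 1024;  // points per dimension

    float* devPoints;
    cudaMalloc(&devPoints, dim * nSamples * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_QUASI_SCRAMBLED_SOBOL32);
    curandSetQuasiRandomGeneratorDimensions(gen, dim);

    // fills devPoints with dim * nSamples uniforms from the scrambled sequence
    curandGenerateUniform(gen, devPoints, dim * nSamples);

    curandDestroyGenerator(gen);
    cudaFree(devPoints);
    return 0;
}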


Message Passing Interface (MPI): Overview

◮ De-facto standard for massively parallel processing (MPP).
◮ MPI is a complementary standard to OpenMP or threading.
◮ Vendors provide high performance/low latency implementations.
◮ The roots of the MPI specification go back to the early 90s, and you will feel the age if you use the C API.
◮ Favour Boost.MPI over the original MPI C++ bindings!
◮ Boost.MPI can build MPI data types for user-defined types using the Boost.Serialization library.
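A minimal Boost.MPI sketch (not from the talk) of the communication pattern used on the following slides: rank 0 computes a value and broadcasts it to all other ranks; for user-defined types Boost.Serialization does the marshalling.

#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;

    double npv = 0.0;
    if (world.rank() == 0)
        npv = 42.0;                        // e.g. the result of a local pricing

    boost::mpi::broadcast(world, npv, 0);  // afterwards every rank holds the value

    std::cout << "rank " << world.rank() << ": " << npv << std::endl;
    return 0;
}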


Message Passing Interface (MPI): Model Calibration

◮ Model calibration can be a very time-consuming task, e.g. the calibration of a Heston or a Heston-Hull-White model using American puts with discrete dividends → FDM pricing.
◮ Minimal approach: introduce an MPICalibrationHelper proxy, which "has a" CalibrationHelper.

class MPICalibrationHelper : public CalibrationHelper {
  public:
    MPICalibrationHelper(
        Integer mpiRankId,
        const Handle<Quote>& volatility,
        const Handle<YieldTermStructure>& termStructure,
        const boost::shared_ptr<CalibrationHelper>& helper);
    ....
  private:
    std::future<Real> modelValueF_;
    const boost::shared_ptr<boost::mpi::communicator> world_;
    ....
};


Message Passing Interface (MPI): Model Calibrationvoid MPICalibrationHelper::update() {if (world_->rank() == mpiRankId_) {modelValueF_ = std::async(std::launch::async,&CalibrationHelper::modelValue, helper_);}CalibrationHelper::update();}Real MPICalibrationHelper::modelValue() const {if (world_->rank() == mpiRankId_) {modelValue_ = modelValueF_.get();}boost::mpi::broadcast(*world_, modelValue_, mpiRankId_);}return modelValue_;int main(int argc, char* argv[]) {boost::mpi::environment env(argc, argv);....}Klaus Spanderen<strong>Beyond</strong> <strong>Simple</strong> <strong>Monte</strong>-<strong>Carlo</strong>: <strong>Parallel</strong> <strong>Computing</strong> <strong>with</strong> <strong>QuantLib</strong>


Message Passing Interface (MPI): Model Calibration

[Figure: parallel Heston-Hull-White calibration on 2x4 cores; speed-up vs. number of processes (1 to 16).]


Conclusion

◮ Often a simple divide-and-conquer approach on the process level is sufficient to "parallelize" QuantLib.
◮ In a multi-threading environment the singleton and observer patterns need to be modified.
◮ Do not share QuantLib objects between different threads.
◮ A working solution exists for languages with a parallel garbage collector.
◮ The finite difference speed-up on GPUs is closer to 10x than 100x.
◮ Scrambled Sobol sequences in conjunction with Brownian bridges improve the convergence rate on GPUs.
◮ Boost.MPI is a convenient library to utilise QuantLib on MPP systems.
