
Beyond Simple Monte-Carlo: Parallel Computing with QuantLib

Klaus Spanderen
E.ON Global Commodities

November 14, 2013


◮ Symmetric Multi-Processing
◮ Graphical Processing Units
◮ Message Passing Interface
◮ Conclusion


Symmetric Multi-Processing: Overview

◮ Moore's Law: the number of transistors doubles every two years.
◮ Leakage turns out to be the death of CPU scaling.
◮ Multi-core designs help processor makers to manage power dissipation.
◮ Symmetric Multi-Processing has become a mainstream technology.

Herb Sutter: "The Free Lunch is Over: A Fundamental Turn Toward Concurrency in Software."


Multi-Processing with QuantLib

Divide and conquer: spawn several independent OS processes.

[Figure: GFlops vs. number of processes; the QuantLib benchmark on a 32 core (plus 32 HT cores) server.]
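As a minimal illustration (not from the talk), the divide-and-conquer idea can be sketched with plain POSIX fork/wait: each child process runs an independent pricing job and shares nothing with its siblings.

#include <sys/wait.h>
#include <unistd.h>
#include <cstdlib>

int main() {
    const int nProcesses = 8;
    for (int i = 0; i < nProcesses; ++i) {
        if (fork() == 0) {
            // child i: run an independent QuantLib pricing job here,
            // e.g. the i-th chunk of Monte-Carlo paths, then exit.
            std::exit(0);
        }
    }
    for (int i = 0; i < nProcesses; ++i)
        wait(NULL); // parent collects the children
    return 0;
}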


Multi-Threading: Overview

◮ QuantLib is per se not thread-safe.
◮ Use case one: a really thread-safe QuantLib (see Luigi's talk).
◮ Use case two: multi-threading to speed up single pricings.
  ◮ Joseph Wang is working with Open Multi-Processing (OpenMP) to parallelize several finite difference and Monte-Carlo algorithms.
◮ Use case three: multi-threading to parallelize several pricings, e.g. parallel pricing to calibrate models.
◮ Use case four: use of QuantLib in C#, F#, Java or Scala via the SWIG layer, and multi-threaded unit tests.
◮ Focus on use cases three and four:
  ◮ The situation is not too bad as long as objects are not shared between different threads.


Multi-Threading: Parallel Model Calibration

C++11 version of a parallel model calibration function:

Disposable<Array> CalibrationFunction::values(const Array& params) const {
    model_->setParams(params);

    std::vector<std::future<Real> > errorFcts;
    std::transform(std::begin(instruments_), std::end(instruments_),
                   std::back_inserter(errorFcts),
                   [](decltype(*std::begin(instruments_)) h) {
                       return std::async(std::launch::async,
                           &CalibrationHelper::calibrationError, h.get());
                   });

    Array values(instruments_.size());
    std::transform(std::begin(errorFcts), std::end(errorFcts),
                   values.begin(),
                   [](std::future<Real>& f) { return f.get(); });

    return values;
}


Multi-Threading: Singleton

◮ Riccardo's patch: all singletons are thread-local singletons.

template <class T>
T& Singleton<T>::instance() {
    static boost::thread_specific_ptr<T> tss_instance_;
    if (!tss_instance_.get()) {
        tss_instance_.reset(new T);
    }
    return *tss_instance_;
}

◮ C++11 implementation: Scott Meyers singleton.

template <class T>
T& Singleton<T>::instance() {
    static thread_local T t_;
    return t_;
}
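A minimal usage sketch (not part of the patch itself), assuming the thread-local Singleton above: each worker thread then owns its own Settings instance, so concurrent pricings can use different evaluation dates without interfering.

#include <ql/quantlib.hpp>
#include <thread>
#include <vector>

using namespace QuantLib;

void priceAsOf(const Date& evalDate) {
    // affects only this thread's Settings copy
    Settings::instance().evaluationDate() = evalDate;
    // ... build term structures, instruments and engines as usual ...
}

int main() {
    std::vector<std::thread> workers;
    workers.emplace_back(priceAsOf, Date(14, November, 2013));
    workers.emplace_back(priceAsOf, Date(15, November, 2013));
    for (std::thread& w : workers)
        w.join();
    return 0;
}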


Multi-Threading: Observer Pattern

◮ Main purpose in QuantLib: distributed event handling.
◮ The current implementation is highly optimized for single-threading performance.
◮ In a thread-local environment this would be sufficient, but ...
◮ ... the parallel garbage collector in C#/F#, Java or Scala is by definition not thread-local!
◮ Shuo Chen's article "Where Destructors meet Threads" provides a good solution ...
◮ ... but it is not applicable to QuantLib without a major redesign of the observer pattern.


Multi-Threading: Observer Pattern

A Scala example fails immediately with spurious error messages:

◮ pure virtual function call
◮ segmentation fault

import org.quantlib.{Array => QArray, _}

object ObserverTest {
    def main(args: Array[String]) : Unit = {
        System.loadLibrary("QuantLibJNI");
        val aSimpleQuote = new SimpleQuote(0)

        while (true) {
            (0 until 10).foreach(_ => {
                new QuoteHandle(aSimpleQuote)
                aSimpleQuote.setValue(aSimpleQuote.value + 1)
            })
            System.gc
        }
    }
}


Multi-Threading: Observer Pattern

◮ The observer pattern itself can be made thread-safe using the boost::signals2 library.
◮ The problem remains that an observer must be unregistered from all observables before its destructor is called.
◮ Solution:
  ◮ QuantLib enforces that all observers are instantiated as boost shared pointers.
  ◮ The preprocessor directive BOOST_SP_ENABLE_DEBUG_HOOKS provides a hook into every destructor call of a shared object.
  ◮ If the shared object is an observer, the thread-safe version of Observer::unregisterWithAll is used to detach the observer from all observables.
◮ Advantage: this solution is backward compatible, e.g. the test suite can now run multi-threaded.


Finite Difference Methods on GPUs: Overview

◮ The performance of finite difference methods is mainly driven by the speed of the underlying sparse linear algebra subsystem.
◮ In QuantLib any finite difference operator can be exported as a boost::numeric::ublas::compressed_matrix.
◮ Boost sparse matrices can be exported in Compressed Sparse Row (CSR) format to high performance libraries.
◮ CUDA sparse matrix libraries:
  ◮ cuSPARSE: basic linear algebra subroutines for sparse matrices.
  ◮ cusp: a general template library for sparse iterative solvers.
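A minimal sketch of the CSR hand-over (not from the talk): a row-major boost::numeric::ublas::compressed_matrix, here a stand-in for an exported QuantLib operator, is walked with the standard uBLAS iterators and copied into the three CSR arrays expected by cuSPARSE or cusp.

#include <boost/numeric/ublas/matrix_sparse.hpp>
#include <vector>
#include <cstddef>

typedef boost::numeric::ublas::compressed_matrix<double> SparseMatrix;

// Convert a sparse matrix into CSR arrays (row pointers, column indices, values).
void toCSR(const SparseMatrix& A,
           std::vector<int>& rowPtr,
           std::vector<int>& colInd,
           std::vector<double>& values) {
    rowPtr.assign(A.size1() + 1, 0);
    for (SparseMatrix::const_iterator1 row = A.begin1(); row != A.end1(); ++row)
        for (SparseMatrix::const_iterator2 it = row.begin(); it != row.end(); ++it) {
            colInd.push_back(static_cast<int>(it.index2()));
            values.push_back(*it);
            ++rowPtr[it.index1() + 1];      // count non-zeros per row
        }
    for (std::size_t i = 0; i < A.size1(); ++i)
        rowPtr[i + 1] += rowPtr[i];         // prefix sum gives the row offsets
}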


Sparse Matrix Libraries for GPUs

[Figure: performance charts from NVIDIA (https://developer.nvidia.com/cuSPARSE).]


Sparse Matrix Libraries for GPUs

[Figure: performance charts from NVIDIA.]

Speed-ups are smaller than the "100x" reported for Monte-Carlo.


Example I: Heston-Hull-White Model on GPUs

The SDE is defined by

\begin{align*}
dS_t &= (r_t - q_t) S_t\, dt + \sqrt{\nu_t}\, S_t\, dW_t^S \\
d\nu_t &= \kappa_\nu(\theta_\nu - \nu_t)\, dt + \sigma_\nu \sqrt{\nu_t}\, dW_t^\nu \\
dr_t &= \kappa_r(\theta_{r,t} - r_t)\, dt + \sigma_r\, dW_t^r \\
\rho_{S\nu}\, dt &= dW_t^S\, dW_t^\nu, \qquad
\rho_{Sr}\, dt = dW_t^S\, dW_t^r, \qquad
\rho_{\nu r}\, dt = dW_t^\nu\, dW_t^r
\end{align*}

Feynman-Kac gives the corresponding PDE:

\begin{align*}
\frac{\partial u}{\partial t}
 &= \frac{1}{2} S^2 \nu \frac{\partial^2 u}{\partial S^2}
  + \frac{1}{2}\sigma_\nu^2 \nu \frac{\partial^2 u}{\partial \nu^2}
  + \frac{1}{2}\sigma_r^2 \frac{\partial^2 u}{\partial r^2} \\
 &+ \rho_{S\nu}\sigma_\nu S \nu \frac{\partial^2 u}{\partial S\,\partial \nu}
  + \rho_{Sr}\sigma_r S \sqrt{\nu}\, \frac{\partial^2 u}{\partial S\,\partial r}
  + \rho_{\nu r}\sigma_r \sigma_\nu \sqrt{\nu}\, \frac{\partial^2 u}{\partial \nu\,\partial r} \\
 &+ (r - q) S \frac{\partial u}{\partial S}
  + \kappa_\nu(\theta_\nu - \nu)\frac{\partial u}{\partial \nu}
  + \kappa_r(\theta_{r,t} - r)\frac{\partial u}{\partial r}
  - r u
\end{align*}


Example I: Heston-Hull-White Model on GPUs

◮ Good news: QuantLib can build the sparse matrix.
◮ An operator splitting scheme needs to be ported to the GPU.

void HundsdorferScheme::step(array_type& a, Time t) {
    Array y = a + dt_*map_->apply(a);
    Array y0 = y;

    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = y - theta_*dt_*map_->apply_direction(i, a);
        y = map_->solve_splitting(i, rhs, -theta_*dt_);
    }

    Array yt = y0 + mu_*dt_*map_->apply(y-a);
    for (Size i=0; i < map_->size(); ++i) {
        Array rhs = yt - theta_*dt_*map_->apply_direction(i, y);
        yt = map_->solve_splitting(i, rhs, -theta_*dt_);
    }
    a = yt;
}


Example I: Heston-Hull-White Model on GPUs

[Figure: Heston-Hull-White model, GTX 560 vs. Core i7; speed-up of the GPU (single and double precision) over the CPU for grid sizes (t, x, v, r) from 20x50x20x10 up to 50x400x100x20.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example II: Heston Model on GPUs

[Figure: Heston model, GTX 560 vs. Core i7; speed-up of the GPU (single and double precision) over the CPU for grid sizes (t, x, v) from 50x200x100 up to 100x2000x1000.]

Speed-ups are much smaller than for Monte-Carlo pricing.


Example III: Virtual Power Plant

The Kluge model (two OU processes plus a jump diffusion) leads to a three-dimensional partial integro differential equation:

\begin{align*}
rV = \frac{\partial V}{\partial t}
 &+ \frac{\sigma_x^2}{2}\frac{\partial^2 V}{\partial x^2}
  - \alpha x \frac{\partial V}{\partial x}
  - \beta y \frac{\partial V}{\partial y}
  + \frac{\sigma_u^2}{2}\frac{\partial^2 V}{\partial u^2}
  - \kappa u \frac{\partial V}{\partial u}
  + \rho \sigma_x \sigma_u \frac{\partial^2 V}{\partial x\,\partial u} \\
 &+ \lambda \int_{\mathbb{R}} \left( V(x, y+z, u, t) - V(x, y, u, t)\right) \omega(z)\, dz
\end{align*}

Due to the integro part the discretized equation no longer leads to a truly sparse matrix.


Example III: Virtual Power Plant

[Figure: GTX 560 @ 0.8/1.6 GHz vs. Core i5 @ 3.0 GHz; calculation times for grid sizes (x, y, u, s) from 10x10x10x6 up to 100x50x40x6, comparing GPU BiCGStab+Tridiag, GPU BiCGStab+nonsym Bridson, GPU BiCGStab, GPU BiCGStab+Diag, the CPU Douglas scheme and CPU BiCGStab+Tridiag.]


Quasi Monte-Carlo on GPUs: Overview

◮ The Koksma-Hlawka bound is the basis for any QMC method:

\[
\left| \frac{1}{n}\sum_{i=1}^{n} f(x_i) - \int_{[0,1]^d} f(u)\, du \right|
  \le V(f)\, D^*(x_1, \ldots, x_n),
\qquad
D^*(x_1, \ldots, x_n) \ge c\,\frac{(\log n)^d}{n}
\]

◮ The real advantage of QMC shows up only after drawing N ∼ e^d samples, where d is the dimensionality of the problem.
◮ Dimensional reduction of the problem is often the first step.
◮ The Brownian bridge is tailor-made to reduce the number of significant dimensions.


Quasi Monte-Carlo on GPUs: Arithmetic Option Example

[Figure]


Quasi Monte-Carlo on GPUs: Exotic Equity Options

[Figure from: Bernemann et al., "Accelerating Exotic Option Pricing and Model Calibration Using GPUs", in High Performance Computational Finance (WHPCF), IEEE Workshop on, Nov. 2010.]


Quasi Monte-Carlo on GPUs: QuantLib Implementation

◮ CUDA supports Sobol random numbers up to dimension 20,000.
◮ Direction integers are taken from the JoeKuoD7 set.
◮ On comparable hardware the CUDA Sobol generators are approx. 50 times faster than MKL.
◮ Weights and indices of the Brownian bridge will be calculated by QuantLib.
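A minimal CPU-side sketch (not from the talk) of the building blocks named above, using standard QuantLib classes: a Sobol sequence with JoeKuoD7 direction integers is mapped to Gaussians and run through a Brownian bridge. In the GPU setup only the number generation and the bridge transform would be offloaded, while QuantLib still supplies the bridge weights and indices.

#include <ql/quantlib.hpp>
#include <vector>

using namespace QuantLib;

int main() {
    const Size nSteps = 365;                        // time steps per path

    SobolRsg sobol(nSteps, 42, SobolRsg::JoeKuoD7); // low-discrepancy sequence
    InverseCumulativeNormal invGauss;               // uniforms to Gaussians
    BrownianBridge bridge(nSteps);                  // weights/indices from QuantLib

    const std::vector<Real>& u = sobol.nextSequence().value;

    std::vector<Real> gaussians(nSteps), path(nSteps);
    for (Size i = 0; i < nSteps; ++i)
        gaussians[i] = invGauss(u[i]);

    // bridge construction: the leading Sobol dimensions determine the
    // coarse structure of the path and carry most of the variance
    bridge.transform(gaussians.begin(), gaussians.end(), path.begin());

    return 0;
}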


Quasi Monte-Carlo on GPUs: Performance

[Figure: Sobol Brownian bridge, GPU vs. CPU; speed-up as a function of path length (10^1 to 10^4) for single precision/one factor, single precision/four factors and double precision/one factor.]

Comparison GPU (GTX 560 @ 0.8/1.6 GHz) vs. CPU (i5 @ 3.0 GHz).


Quasi Monte-Carlo on GPUs: Scrambled Sobol Sequences

◮ In addition, CUDA supports scrambled Sobol sequences.
◮ Higher order scrambled sequences are a variant of randomized QMC methods.
◮ They achieve better root mean square errors on smooth integrands.
◮ Error analysis is difficult: a shifted (t,m,d)-net does not need to be a (t,m,d)-net.

[Figure: RMSE for a benchmark portfolio of Asian options.]
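A minimal host-side sketch (not from the talk), assuming cuRAND's quasi-random generators: a scrambled 32-bit Sobol generator fills a device buffer with uniforms. Error checking is omitted and the dimension and sample counts are illustrative.

#include <curand.h>
#include <cuda_runtime.h>

int main() {
    const unsigned int dim = 64;   // dimensionality of the problem
    const size_t nSamples = 1024;  // points per dimension

    float* devPoints;
    cudaMalloc(&devPoints, dim * nSamples * sizeof(float));

    curandGenerator_t gen;
    curandCreateGenerator(&gen, CURAND_RNG_QUASI_SCRAMBLED_SOBOL32);
    curandSetQuasiRandomGeneratorDimensions(gen, dim);

    // fills devPoints with dim * nSamples uniforms from the scrambled sequence
    curandGenerateUniform(gen, devPoints, dim * nSamples);

    curandDestroyGenerator(gen);
    cudaFree(devPoints);
    return 0;
}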


Message Passing Interface (MPI): Overview

◮ De-facto standard for massively parallel processing (MPP).
◮ MPI is a complementary standard to OpenMP or threading.
◮ Vendors provide high performance/low latency implementations.
◮ The roots of the MPI specification go back to the early 90s, and you will feel the age if you use the C API.
◮ Favour Boost.MPI over the original MPI C++ bindings!
◮ Boost.MPI can build MPI data types for user-defined types using the Boost.Serialization library.
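A minimal Boost.MPI sketch (not from the talk) of the communication pattern used on the following slides: rank 0 computes a value and broadcasts it to all other ranks; for user-defined types Boost.Serialization does the marshalling.

#include <boost/mpi/environment.hpp>
#include <boost/mpi/communicator.hpp>
#include <boost/mpi/collectives.hpp>
#include <iostream>

int main(int argc, char* argv[]) {
    boost::mpi::environment env(argc, argv);
    boost::mpi::communicator world;

    double npv = 0.0;
    if (world.rank() == 0)
        npv = 42.0;                        // e.g. the result of a local pricing

    boost::mpi::broadcast(world, npv, 0);  // afterwards every rank holds the value

    std::cout << "rank " << world.rank() << ": " << npv << std::endl;
    return 0;
}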


Message Passing Interface (MPI): Model Calibration

◮ Model calibration can be a very time-consuming task, e.g. the calibration of a Heston or a Heston-Hull-White model using American puts with discrete dividends → FDM pricing.
◮ Minimal approach: introduce an MPICalibrationHelper proxy, which "has a" CalibrationHelper.

class MPICalibrationHelper : public CalibrationHelper {
  public:
    MPICalibrationHelper(
        Integer mpiRankId,
        const Handle<Quote>& volatility,
        const Handle<YieldTermStructure>& termStructure,
        const boost::shared_ptr<CalibrationHelper>& helper);
    ....
  private:
    std::future<Real> modelValueF_;
    const boost::shared_ptr<boost::mpi::communicator> world_;
    ....
};


Message Passing Interface (MPI): Model Calibrationvoid MPICalibrationHelper::update() {if (world_->rank() == mpiRankId_) {modelValueF_ = std::async(std::launch::async,&CalibrationHelper::modelValue, helper_);}CalibrationHelper::update();}Real MPICalibrationHelper::modelValue() const {if (world_->rank() == mpiRankId_) {modelValue_ = modelValueF_.get();}boost::mpi::broadcast(*world_, modelValue_, mpiRankId_);}return modelValue_;int main(int argc, char* argv[]) {boost::mpi::environment env(argc, argv);....}Klaus Spanderen<strong>Beyond</strong> <strong>Simple</strong> <strong>Monte</strong>-<strong>Carlo</strong>: <strong>Parallel</strong> <strong>Computing</strong> <strong>with</strong> <strong>QuantLib</strong>


Message Passing Interface (MPI): Model Calibration

[Figure: parallel Heston-Hull-White calibration on 2x4 cores; speed-up vs. number of processes (1 to 16).]


Conclusion

◮ Often a simple divide-and-conquer approach on the process level is sufficient to "parallelize" QuantLib.
◮ In a multi-threading environment the singleton and observer patterns need to be modified.
◮ Do not share QuantLib objects between different threads.
◮ A working solution exists for languages with a parallel garbage collector.
◮ The finite difference speed-up on GPUs is closer to 10x than 100x.
◮ Scrambled Sobol sequences in conjunction with Brownian bridges improve the convergence rate on GPUs.
◮ Boost.MPI is a convenient library to utilise QuantLib on MPP systems.
