Actas XXII Jornadas de Paralelismo (JP2011), La Laguna, Tenerife, 7-9 September 2011

Fig. 1. rCUDA architecture on a VMM environment.

When used with VMs, the rCUDA communication protocol employs the virtual network device to connect the front-end and back-end middleware. The network therefore has to be configured so that the VM and the host OS can address IP packets to each other. Fig. 1 shows the rCUDA architecture diagram modified to reflect its usage in VM environments. We were able to successfully test the current implementation of rCUDA in the KVM, VB-OSE, and VMware Server virtualization solutions. However, we were unable to run it in a recent release of the Xen Hypervisor (3.4.3), as we could not gain access to a recent NVIDIA GPU driver that worked properly under the modified kernel for the administrative domain, the ultimate reason being that this driver is not designed to support the Xen environment.

With rCUDA, multiple VMs running in the same physical computer can make concurrent use of all CUDA-compatible devices installed in that computer (as long as there is enough memory on the devices to satisfy the allocations of the different applications). Furthermore, although not addressed in this paper, rCUDA also allows the use of a GPU located in a different physical computer.

In the following section we provide an in-depth analysis of the use of rCUDA to enable GPGPU capabilities within VMs. We believe our proposal is the first work describing a VMM-independent, production-ready CUDA solution for VMs.

V. Experimental Evaluation

In this section we conduct a collection of experiments to evaluate the performance of the rCUDA framework on a VMM environment. The target system consists of two quad-core Intel Xeon E5410 processors running at 2.33 GHz with 8 GB of main memory. An OpenSuse Linux distribution with kernel version 2.6.31 is run on both the host and guest sides. The GPGPU capabilities are provided by two NVIDIA GeForce 9800 GX2 cards featuring a total of 4 NVIDIA G92 GPUs; the driver version is 190.53.

We selected two open-source VMMs for the performance analysis: KVM (userspace qemu-kvm v0.12.3) and VB-OSE 3.1.6, with their VMs configured to use para-virtualized network devices. In addition, for load isolation purposes, each VM was configured to use only one processor core.

Fig. 2. Native vs. KVM and VB-OSE.

All benchmarks employed in our evaluation are part of the CUDA SDK. From the 67 benchmarks in the suite, we selected 10 representative ones of varying computational loads and data sizes, which use different CUDA features: alignedTypes (AT), asyncAPI (AA), bicubicTexture (BT), BlackScholes (BS), boxFilter (BF), clock (CLK), convolutionSeparable (CS), fastWalshTransform (FWT), imageDenoising (ID), and matrixMul (MM). A description of each benchmark can be found in the documentation of the SDK package [15]. The benchmarks were executed with the default options, other than setting the target device.
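As a minimal sketch (not taken from the SDK benchmarks themselves), the following code shows how an application running inside the VM can enumerate the CUDA devices exposed by the runtime and select the target device using only plain C calls; under rCUDA these devices correspond to the GPUs of the host (or of a remote node). The device index chosen here is merely illustrative.

#include <stdio.h>
#include <cuda_runtime.h>

int main(void)
{
    int count = 0;
    if (cudaGetDeviceCount(&count) != cudaSuccess || count == 0) {
        fprintf(stderr, "no CUDA devices visible from this VM\n");
        return 1;
    }

    /* List every device exposed by the runtime (remote GPUs under rCUDA). */
    for (int i = 0; i < count; i++) {
        struct cudaDeviceProp prop;
        cudaGetDeviceProperties(&prop, i);
        printf("device %d: %s, %lu MB\n", i, prop.name,
               (unsigned long)(prop.totalGlobalMem / (1024 * 1024)));
    }

    /* Select the target device; index 0 is only an illustrative choice. */
    cudaSetDevice(0);
    return 0;
}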
In addition, benchmarks requiring OpenGL capabilities for their default executions (BT, BF, and ID) were executed with the -qatest argument in order to perform a "quick auto test", which does not make use of the graphics-oriented API. To make the original benchmark code compatible with rCUDA, which does not support the C for CUDA extensions, the pieces of code using these extensions were rewritten with the plain C API (only 7% of the total effective source lines of code had to be modified); a sketch of this kind of rewrite is shown at the end of this section.

The execution times reported in the following experiments are the minimum of 5 executions, in order to filter out occasional network and CPU noise. They reflect the elapsed time experienced by the user, from the start of the execution of the application until its end. The experiments are presented in two groups: first those involving a single VM, and then those involving several VMs executed concurrently.

A. Single Virtual Machine

We first analyze the performance of the CUDA SDK benchmarks running in a VM using rCUDA, and compare their execution times with those of a native environment, i.e., using the regular CUDA Runtime library in a non-virtualized environment. The results of this experiment are reported in Fig. 2. It would also be interesting to include data for a version of the benchmarks that only uses the CPU. However, since it is difficult to find optimized CPU algorithms performing the same operations as all of our benchmarks, and those included in the SDK package are often naive versions, we cannot present such a comparison. Nevertheless, it is not strictly required for understanding the experiments presented here and, additionally, the convenience of using virtualized remote GPUs instead of the local CPU was previously discussed [11].
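To make the rewrite from the C for CUDA extensions to plain C more concrete, the sketch below shows one possible form of such a change for a kernel launch. The paper does not show the actual modifications, so this is only an assumption: whether the driver API or the runtime's explicit launch entry points were used is not stated, the file and kernel names (vecadd.cubin, vecAdd) are hypothetical, and cuLaunchKernel belongs to CUDA releases newer than the driver of our testbed, where the older cuFuncSetBlockShape/cuLaunchGrid interface would play the same role.

/* Original C for CUDA form (needs nvcc and the language extensions):
 *
 *     vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n);
 *
 * One possible plain C rewrite: the kernel is compiled separately into a
 * module (vecadd.cubin, hypothetical name) and launched through the
 * driver API. Assumes cuInit() has been called and a context is current.
 */
#include <cuda.h>

void launch_vecadd(CUdeviceptr d_a, CUdeviceptr d_b, CUdeviceptr d_c, int n)
{
    CUmodule   mod;
    CUfunction fun;

    cuModuleLoad(&mod, "vecadd.cubin");
    cuModuleGetFunction(&fun, mod, "vecAdd");

    int threads = 256;
    int blocks  = (n + threads - 1) / threads;
    void *args[] = { &d_a, &d_b, &d_c, &n };

    /* Equivalent of vecAdd<<<blocks, threads>>>(d_a, d_b, d_c, n) */
    cuLaunchKernel(fun, blocks, 1, 1, threads, 1, 1,
                   0 /* shared memory */, 0 /* stream */, args, NULL);

    cuModuleUnload(mod);
}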
