06.01.2013 Views

Mixing Graphics and Compute with multiple GPUs, Alina Alt - Nvidia

Mixing Graphics and Compute with multiple GPUs, Alina Alt - Nvidia

Mixing Graphics and Compute with multiple GPUs, Alina Alt - Nvidia

SHOW MORE
SHOW LESS

You also want an ePaper? Increase the reach of your titles

YUMPU automatically turns print PDFs into web optimized ePapers that Google loves.

<strong>Mixing</strong> <strong>Graphics</strong> <strong>and</strong><br />

<strong>Compute</strong> <strong>with</strong> <strong>multiple</strong><br />

<strong>GPUs</strong>, <strong>Alina</strong> <strong>Alt</strong>


Agenda<br />

<strong>Compute</strong> <strong>and</strong> <strong>Graphics</strong> API Interoperability<br />

Interoperability at a system level<br />

Application design considerations<br />

NVIDIA Confidential


<strong>Compute</strong> <strong>and</strong> Visualize the same data<br />

NVIDIA Confidential<br />

Application<br />

<strong>Compute</strong> <strong>Graphics</strong>


<strong>Compute</strong>/<strong>Graphics</strong> interoperability<br />

Set up the objects in the graphics context<br />

Register objects <strong>with</strong> the compute context<br />

Map/Unmap the objects from the compute context<br />

NVIDIA Confidential<br />

CUDA<br />

CUDA Buffer<br />

CUDA Array<br />

Application<br />

Buffer<br />

Texture<br />

OpenGL/DX


Simple OpenGL-CUDA interop sample: Setup<br />

<strong>and</strong> Register of Buffer Objects<br />

GLuint imagePBO;<br />

cuda<strong>Graphics</strong>Resource_t cudaResourceBuf;<br />

//OpenGL buffer creation<br />

glGenBuffers(1, &imagePBO);<br />

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB, imagePBO);<br />

glBufferData(GL_PIXEL_UNPACK_BUFFER_ARB, size, NULL,<br />

GL_DYNAMIC_DRAW);<br />

glBindBuffer(GL_PIXEL_UNPACK_BUFFER_ARB,0);<br />

//Registration <strong>with</strong> CUDA<br />

cuda<strong>Graphics</strong>GLRegisterBuffer(&cudaResourceBuf, imagePBO,<br />

cuda<strong>Graphics</strong>RegisterFlagsNone);<br />

NVIDIA Confidential


Simple OpenGL-CUDA interop sample: Setup<br />

<strong>and</strong> Register of Texture Objects<br />

GLuint imageTex;<br />

cuda<strong>Graphics</strong>Resource_t cudaResourceTex;<br />

//OpenGL texture creation<br />

glGenTextures(1, &imageTex);<br />

glBindTexture(GL_TEXTURE_2D, imageTex);<br />

//set texture parameters here<br />

glTexImage2D(GL_TEXTURE_2D,0, GL_RGBA8UI_EXT, width, height, 0,<br />

GL_RGBA_INTEGER_EXT, GL_UNSIGNED_BYTE, NULL);<br />

glBindTexture(GL_TEXTURE_2D, 0);<br />

//Registration <strong>with</strong> CUDA<br />

cuda<strong>Graphics</strong>GLRegisterImage (&cudaResourceTex, imageTex,<br />

GL_TEXTURE_2D, cuda<strong>Graphics</strong>RegisterFlagsNone);<br />

NVIDIA Confidential


Simple OpenGL-CUDA interop sample<br />

unsigned char *memPtr;<br />

cudaArray *arrayPtr;<br />

while (!done) {<br />

}<br />

cuda<strong>Graphics</strong>MapResources(1, &cudaResourceBuf, cudaStream);<br />

cuda<strong>Graphics</strong>ResourceGetMappedPointer((void **)&memPtr, &size,<br />

doWorkInCUDA(cudaArray, memPtr, cudaStream);<br />

NVIDIA Confidential<br />

cudaResourceBuf);<br />

cuda<strong>Graphics</strong>UnmapResources(1, & cudaResourceBuf, cudaStream);<br />

doWorkInGL(imagePBO, imageTex);


Simple OpenGL-CUDA interop sample<br />

unsigned char *memPtr;<br />

cudaArray *arrayPtr;<br />

while (!done) {<br />

}<br />

cuda<strong>Graphics</strong>MapResources(1, &cudaResourceBuf, cudaStream);<br />

cuda<strong>Graphics</strong>ResourceGetMappedPointer((void **)&memPtr, &size,<br />

doWorkInCUDA(cudaArray, memPtr, cudaStream);<br />

NVIDIA Confidential<br />

cudaResourceBuf);<br />

cuda<strong>Graphics</strong>UnmapResources(1, & cudaResourceBuf, cudaStream);<br />

doWorkInGL(imagePBO, imageTex);<br />

Context switching


Interoperability behavior:single GPU<br />

NVIDIA Confidential<br />

The resource is shared<br />

Context switch is fast <strong>and</strong> independent on<br />

data size<br />

compute<br />

draw<br />

A<br />

A<br />

B<br />

B<br />

CUDA<br />

OpenGL<br />

/DX<br />

time<br />

CUDA<br />

Context<br />

API Interop<br />

GPU<br />

Data<br />

OpenGL<br />

Context


Interoperability behavior: <strong>multiple</strong> <strong>GPUs</strong><br />

Each context owns a copy of the resource<br />

Context switch can be slow <strong>and</strong> is<br />

dependent on data size<br />

Synchronize/<br />

Map<br />

compute<br />

Synchronize/<br />

Unmap<br />

draw<br />

NVIDIA Confidential<br />

A<br />

A<br />

A<br />

A<br />

vsync<br />

B<br />

B CUDA<br />

B<br />

GPU<br />

CUDA<br />

Context<br />

Data<br />

OpenGL<br />

/DX<br />

API Interop<br />

time<br />

GPU<br />

OpenGL<br />

Context<br />

Data


Simple OpenGL-CUDA interop sample<br />

If CUDA is the producer….<br />

Use mapping hint cuda<strong>Graphics</strong>MapFlagsWriteDiscard <strong>with</strong><br />

cuda<strong>Graphics</strong>ResourceSetMapFlags()<br />

unsigned char *memPtr;<br />

cuda<strong>Graphics</strong>ResourceSetMapFlags(cudaResourceBuf,<br />

while (!done) {<br />

}<br />

NVIDIA Confidential<br />

cuda<strong>Graphics</strong>MapFlagsWriteDiscard);<br />

cuda<strong>Graphics</strong>MapResources(1, &cudaResourceBuf, cudaStream);<br />

cuda<strong>Graphics</strong>ResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf);<br />

doWorkInCUDA(memPtr, cudaStream);<br />

cuda<strong>Graphics</strong>UnmapResources(1, &cudaResourceBuf, cudaStream);<br />

doWorkInGL(imagePBO);


Interoperability behavior:single GPU<br />

compute<br />

draw<br />

NVIDIA Confidential<br />

Tasks are still serialized….<br />

A<br />

A<br />

B<br />

B<br />

CUDA<br />

OpenGL<br />

/DX<br />

time<br />

CUDA<br />

Context<br />

API Interop<br />

GPU<br />

Data<br />

OpenGL<br />

Context


Interoperability behavior: <strong>multiple</strong> <strong>GPUs</strong><br />

Tasks are overlapped<br />

Synchronize/<br />

Map<br />

compute<br />

Synchronize/<br />

Unmap<br />

draw<br />

NVIDIA Confidential<br />

A<br />

B<br />

A<br />

B A<br />

CUDA<br />

B<br />

A<br />

vsync<br />

B<br />

A<br />

GPU<br />

CUDA<br />

Context<br />

Data<br />

OpenGL<br />

/DX<br />

API Interop<br />

time<br />

GPU<br />

OpenGL<br />

Context<br />

Data


Simple OpenGL-CUDA interop sample<br />

Similar considerations are applicable when OpenGL is the<br />

producer <strong>and</strong> CUDA is consumer:<br />

NVIDIA Confidential<br />

cuda<strong>Graphics</strong>MapFlagsReadOnly


Application Example:Adobe Premiere Pro<br />

OpenGL plug-in <strong>with</strong>in CUDA application.<br />

NVIDIA Confidential<br />

int count = 1;<br />

while (!done) {<br />

}<br />

SignalWait(oglDone[count]);<br />

doWorkInCUDA(memPtr, NULL);<br />

EventSignal(cudaDone[count]);<br />

count = (count+1)%2;<br />

Main CUDA<br />

thread<br />

mainCtx = wglCreateContext(hDC);<br />

wglMakeCurrent(hDC, mainCtx);<br />

//Register OpenGL objects <strong>with</strong> CUDA<br />

int count = 0;<br />

while (!done) {<br />

}<br />

SignalWait(cudaDone[count]);<br />

cuda<strong>Graphics</strong>UnmapResources(1, &cudaResourceBuf[count], NULL);<br />

doWorkInGL(imagePBO[count]);<br />

cuda<strong>Graphics</strong>MapResources(1, &cudaResourceBuf[count], NULL);<br />

cuda<strong>Graphics</strong>ResourceGetMappedPointer((void **)&memPtr, &size,<br />

cudaResourceBuf[count]);<br />

EventSignal(oglDone[count]);<br />

count = (count+1)%2;<br />

OpenGL<br />

Worker<br />

thread


Application Example:Autodesk Maya<br />

CUDA plug-in <strong>with</strong>in an OpenGL application<br />

NVIDIA Confidential }<br />

SignalWait(setupCompleted);<br />

wglMakeCurrent(hDC,workerCtx);<br />

//Register OpenGL objects <strong>with</strong> CUDA<br />

int count = 1;<br />

while (!done) {<br />

SignalWait(oglDone[count]);<br />

glWaitSync(endGLSync[count]);<br />

cuda<strong>Graphics</strong>MapResources(1, &cudaResourceBuf[count], cudaStream[N]);<br />

cuda<strong>Graphics</strong>ResourceGetMappedPointer((void **)&memPtr, &size, cudaResourceBuf[count]);<br />

doWorkInCUDA(memPtr, cudaStream[N]);<br />

cuda<strong>Graphics</strong>UnmapResources(1, &cudaResourceBuf[count], cudaStream[N]);<br />

cudaStreamSynchronize(cudaStreamN);<br />

EventSignal(cudaDone[count]);<br />

count = (count+1)%2;<br />

CUDA worker<br />

thread N<br />

mainCtx = wglCreateContext(hDC);<br />

workerCtx = wglCreateContextAttrib<br />

wglMakeCurrent(hDC, mainCtx);<br />

//Create OpenGL objects<br />

EventSignal(setupCompleted);<br />

int count = 0;<br />

while (!done) {<br />

}<br />

SignalWait(cudaDone[count]);<br />

doWorkInGL(imagePBO[count]);<br />

(hDC,mainCtx…);<br />

endGLSync[count] = glFenceSync(…);<br />

EventSignal(oglDone[count]);<br />

count = (count+1)%2;<br />

Main OpenGL<br />

thread


API Interoperability hides all the complexity<br />

NVIDIA Confidential


Manual Inetroperability<br />

NVIDIA Confidential<br />

A) B)<br />

GPU 1 GPU 2<br />

CUDA<br />

Context<br />

Buffer<br />

RAM<br />

OpenGL<br />

Context<br />

CUDA memcpy glBufferSubData


Application Design Considerations<br />

Context switching performance varies <strong>with</strong> system configuration<br />

<strong>and</strong> OS.<br />

Provision for multi-GPU environments:<br />

NVIDIA Confidential<br />

No heuristics. Let the user chose the <strong>GPUs</strong>.<br />

Use cudaD3D[9|10|11]GetDevices/cudaGLGetDevices to match CUDA <strong>and</strong><br />

OpenGL device enumerations<br />

Avoid synchronized <strong>GPUs</strong> for CUDA<br />

CUDA-OpenGL interoperability can perform slower if OpenGL<br />

context spans <strong>multiple</strong> GPU<br />

Consider manual interoperability for fine-grained control


Resources<br />

CUDA samples/documentation:<br />

http://developer.nvidia.com/cuda-downloads<br />

OpenGL Insights, Patric Cozzi, Christophe Riccio, 2012. ISBN<br />

1439893764. www.openglinsights.com<br />

NVIDIA Confidential

Hooray! Your file is uploaded and ready to be published.

Saved successfully!

Ooh no, something went wrong!