12.07.2015 Views

S5135-Christoph-Kubisch-and-Pierre-Boudier


GPU-DRIVEN LARGE SCENE RENDERING
NV_COMMAND_LIST
Pierre Boudier, Quadro Software Architect
Christoph Kubisch, Developer Technology Engineer


MOTIVATION
Modern GPUs have a lot of execution units to make use of
• Quadro 4000: 256 cores
• Quadro K4000: 768 cores
• Quadro K4200: 1344 cores
• Quadro M6000: 3072 cores
How to leverage all this power?
• Efficient API usage and rendering algorithms
• APIs reflecting recent hardware designs and capabilities


CHALLENGE OF ISSUING COMMANDS
Issuing drawcalls and state changes can be a real bottleneck
• Excessive work from app and driver on the CPU, GPU idle
Example models (courtesy of PTC):
Triangles                       Parts                        Triangles per part
650,000                         68,000                       ~10
3,700,000                       98,000                       ~37
14,338,275 (triangles/lines)    300,528 (drawcalls)          ~48


ENABLING GPU SCALABILITY
Avoid data redundancy
• Data stored once, referenced multiple times
• Update only once (fewer host to GPU transfers)
Increase GPU workload per job
• Further cuts API calls
• Less CPU work
Minimize CPU/GPU interaction
• Allow GPU to update its own data
• Low API usage when scene is changed little
• E.g. GPU-based culling, matrix updates...
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf


BINDLESS TECHNOLOGY
What is it about?
• Work from native GPU pointers/handles (64-bit pointers & handles)
• Less validation, less CPU cache thrashing
• GPU can use flexible data structures
Bindless Buffers
• Vertex & Global memory since pre-Fermi
Bindless Constants (UBO)
• Support for Fermi and above
Bindless Textures
• Since Kepler
[Diagram labels: graphics pipeline with Element buffer (EBO) indices and Vertex Buffer (VBO) attributes feeding the Vertex Puller (IA); Vertex Shader with Uniform Block; Fragment Shader with Uniform Block and Texture Fetch; all reached through 64-bit addresses in GPU virtual memory]


BINDLESS DRAWING LOOP
    UpdateBuffers();
    glBufferAddressRangeNV(UNIFORM..., 0, addrView, ...);
    // redundancy filters not shown
    foreach (obj in scene) {
      glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);
      glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);
      glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices + obj.mtxOffset, ...);
      // iterate over cached material groups
      foreach ( batch in obj.materialGroups) {
        glBufferAddressRangeNV(UNIFORM, 2, addrMaterials + batch.mtlOffset, ...);
        glMultiDrawElements (...);
      }
    }


NV_COMMAND_LIST – KEY CONCEPTS
Tokenized Rendering (GPU modifiable command buffers):
• Simple state changes and draw commands are encoded into a binary data stream
• Leverages bindless resources
State Objects (pre-validated)
• Macro state (program, blending, fbo-config...) is captured into an object
• Control over when costly validation happens, later reuse of objects is very fast
Compiled Command List (alternative to token buffer)
• Display list like usage, however buffer addresses are referenced, therefore their content (matrices, vertices...) can still be modified.


COMMAND PIPELINE
[Diagram labels: Application; Driver; GPU; OpenGL commands; OpenGL resources; handles (IDs); ID to 64-bit address; 64-bit pointers; push buffer commands (FIFO)]


COMMAND PIPELINE
[Diagram labels: Application; Driver; GPU; OpenGL commands via tokens & state objects; OpenGL resources as 64-bit pointers (bindless); StateObject resolve; push buffer commands (FIFO); fast path through driver via NV_command_list]


TOKENIZED RENDERING
    // bindless scene drawing loop
    foreach (obj in scene) {
      glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);
      glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);
      glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices + obj.mtxOffset, ...);
      foreach ( batch in obj.materialCaches) {
        glBufferAddressRangeNV(UNIFORM, 2, addrMaterials + batch.mtlOffset, ...);
        glMultiDrawElements(...)
      }
    }
All these commands (hundreds of thousands) for the entire scene can be replaced by a single API call!
    glDrawCommandsNV (TRIANGLES, tokenBuffer, offsets[], sizes[], count);
    // {0}, {tokensSize}, 1
Token buffer layout:
    Object:            VBO address, EBO address, UBO matrix address
    Material batches:  UBO material address, Draw (first, count...)
                       UBO material address, Draw (first, count...)
                       ...
    Next object:       ...


TOKENIZED RENDERING
Tokens are tightly packed structs in linear memory
    *CommandNV {
      GLuint header; // glGetCommandHeaderNV(type,…)
      ... command specific payload
    };
DRAW tokens allow mixing strips, lists, fans, loops of same base mode (TRIANGLES, LINES, POINTS) in single dispatch
Token types:
• TERMINATE_SEQUENCE_COMMAND_NV
• NOP_COMMAND_NV
• DRAW_ELEMENTS_COMMAND_NV
• DRAW_ARRAYS_COMMAND_NV
• DRAW_ELEMENTS_STRIP_COMMAND_NV
• DRAW_ARRAYS_STRIP_COMMAND_NV
• DRAW_ELEMENTS_INSTANCED_COMMAND_NV
• DRAW_ARRAYS_INSTANCED_COMMAND_NV
• ELEMENT_ADDRESS_COMMAND_NV
• ATTRIBUTE_ADDRESS_COMMAND_NV
• UNIFORM_ADDRESS_COMMAND_NV
• BLEND_COLOR_COMMAND_NV
• STENCIL_REF_COMMAND_NV
• LINE_WIDTH_COMMAND_NV
• POLYGON_OFFSET_COMMAND_NV
• ALPHA_REF_COMMAND_NV
• VIEWPORT_COMMAND_NV
• SCISSOR_COMMAND_NV
• FRONTFACE_COMMAND_NV


TOKENIZED RENDERING
    // single drawcall, tokens encoded into raw memory buffer!
    glDrawCommandsNV (..., tokenBuffer, offsets[], sizes[], count);
    // {0}, {bufferSize}, 1
Token stream: VBO, EBO, UBO Matrix, UBO Material, Draw, UBO Material, Draw, Draw
    AttributeAddressCommandNV {
      GLuint   header;
      GLuint   index;
      GLuint64 address;
    }
    ElementAddressCommandNV {
      GLuint   header;
      GLuint64 address;
      GLuint   typeSizeInByte;
    }
    UniformAddressCommandNV {
      GLuint   header;
      GLushort index;
      GLushort stage; // glGetStageIndexNV(VERTEX..)
      GLuint64 address;
    }
    DrawElementsCommandNV {
      GLuint header;
      GLuint count;
      GLuint firstIndex;
      GLuint baseVertex;
    }
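
The token structs above can be filled on the CPU without any GL call. The following is a minimal sketch of that packing step, not the presenters' code. Assumptions: the 64-bit address is stored as two 32-bit halves so that each token is 16 bytes without padding, the header values and the stage index are placeholders (real values come from glGetCommandHeaderNV and glGetStageIndexNV), and the names appendToken and encodeObject are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Token layouts modeled on the slide. Assumption of this sketch: the 64-bit
// address is split into two 32-bit halves so every struct is 16 bytes, unpadded.
struct AttributeAddressToken { uint32_t header, index, addressLo, addressHi; };
struct ElementAddressToken   { uint32_t header, addressLo, addressHi, typeSizeInByte; };
struct UniformAddressToken   { uint32_t header; uint16_t index, stage; uint32_t addressLo, addressHi; };
struct DrawElementsToken     { uint32_t header, count, firstIndex, baseVertex; };

static_assert(sizeof(AttributeAddressToken) == 16 && sizeof(ElementAddressToken) == 16 &&
              sizeof(UniformAddressToken) == 16 && sizeof(DrawElementsToken) == 16,
              "tokens must be tightly packed");

// Placeholder values (hypothetical). Real headers come from glGetCommandHeaderNV
// and the stage index from glGetStageIndexNV, both of which need a GL context.
enum : uint32_t { HDR_ATTRIBUTE = 1, HDR_ELEMENT = 2, HDR_UNIFORM = 3, HDR_DRAW_ELEMENTS = 4 };
const uint16_t STAGE_VERTEX = 0;

inline uint32_t lo32(uint64_t a) { return static_cast<uint32_t>(a & 0xffffffffu); }
inline uint32_t hi32(uint64_t a) { return static_cast<uint32_t>(a >> 32); }

// Appends the raw bytes of one token and returns its byte offset in the stream.
template <typename Token>
std::size_t appendToken(std::vector<uint8_t>& stream, const Token& token) {
    const std::size_t offset = stream.size();
    stream.resize(offset + sizeof(Token));
    std::memcpy(stream.data() + offset, &token, sizeof(Token));
    return offset;
}

// Encodes one object with one material batch: VBO, EBO, matrix UBO, material UBO, draw.
std::vector<uint8_t> encodeObject(uint64_t vbo, uint64_t ebo, uint64_t matrixUbo,
                                  uint64_t materialUbo, uint32_t indexCount) {
    std::vector<uint8_t> stream;
    appendToken(stream, AttributeAddressToken{HDR_ATTRIBUTE, 0, lo32(vbo), hi32(vbo)});
    appendToken(stream, ElementAddressToken{HDR_ELEMENT, lo32(ebo), hi32(ebo), 4});
    appendToken(stream, UniformAddressToken{HDR_UNIFORM, 1, STAGE_VERTEX, lo32(matrixUbo), hi32(matrixUbo)});
    appendToken(stream, UniformAddressToken{HDR_UNIFORM, 2, STAGE_VERTEX, lo32(materialUbo), hi32(materialUbo)});
    appendToken(stream, DrawElementsToken{HDR_DRAW_ELEMENTS, indexCount, 0, 0});
    return stream;
}
```

In a real application the resulting bytes would be uploaded into the GL buffer that is passed as tokenBuffer, with offsets {0} and sizes {stream.size()}.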


TOKENIZED RENDERING
What is so great about it?
• It's crazy fast (see later) and tokens are popular in render engines already
• The token buffer is a "regular" GL buffer
  • Can be manipulated by all mechanisms OpenGL offers
  • Can be filled from different CPU threads (which do not require a GL context)
• Expands the possibilities of GPU driving its own work without CPU roundtrip


STATE OBJECTS
StateObject
• Encapsulates majority of state (fbo format, active shader, blend, depth ...), but no bindings! (use bindless textures passed via UBO...)
    glCaptureStateNV ( stateobject, GL_TRIANGLES );
• Less rendertime variability, explicit control over validation time
• Render entire scenes with different shaders/fbos... in one go
• Driver caches state transitions
    // single drawcall, multiple shaders, fbos...
    glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);


STATE OBJECTS
    // single drawcall, multiple shaders, fbos...
    glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);

    // what the call does internally (pseudocode)
    for i < count {
      if (i == 0) set state from states[i];
      else set state transition states[i-1] to states[i]
      if (fbo[i]) glBindFramebuffer( fbo[i] ) // must be compatible to states[i].fbo
      else glBindFramebuffer( states[i].fbo )
      ProcessCommandSequence(... tokenBuffer, offsets[i], sizes[i])
    }
• Can reuse tokens & state with different fbos (e.g. shadow passes)
• Compatibility depends on fbo's drawbuffers, texture formats... but not sizes


STATE OBJECTS
    // single drawcall, multiple shaders, fbos...
    glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);
    // {0,sizeA}, {sizeA, sizeB}, {A,B}, {f,f}, 2
tokenBuffer:
    Sequence A (e.g. triangles): VBO, IBO, Matrix UBO, Material UBO, Draw, Material UBO, Draw, Draw
    Sequence B (lines):          Draw, Draw, Draw
    [0] State Object A, FBO f: Sequence A
    [1] State Object B, FBO f: Sequence B
Within glDrawCommandsStatesNV state set by tokens is inherited across sequences


COMPILED COMMAND LIST
Combine multiple segments into CommandList object
• Tokens provided by system memory
    glListDrawCommandsStatesClientNV( list, segment, void* tokencmds[], sizes[], states[], fbos[], count);
Less flexibility compared to token buffer
• Token content, state and fbo assignments are deep-copied
• List is immutable, needs recompile if pointers/state changes
    glCompileCommandListNV( list );
Allows even faster state transitions
• All key data is known to the driver
[Diagram: compiled command list made of four segments, each with its own StateObject, FBO and token sequence (VBO, IBO, Matrix UBO, Material UBO, Draw...)]


RESULTS
High scene complexity
• No instancing used, true copies
• Each object unique and editable
• 90 000 objects
• Each drawn with triangles & lines
• Raw: 4.8m drawcalls
Standard GL: 2 fps
Commandlist: 20 fps


RENDERING RESEARCH FRAMEWORK
Render test with "Graphicscard" model
• Many low-complexity drawcalls (CPU challenged)
• 110 geometries, 66 materials
• 68 000 parts
• 2500 objects
[Figure labels: same geometry, multiple objects; same geometry (fan), multiple parts]


SCENE STYLES
"Shaded" and "Shaded & Edges"


SCENE DRAWING (GROUPED)
    UpdateBuffers();
    glBindBufferBase (UBO, 0, uboView);
    foreach (obj in scene) {
      // redundancy filter for these (if (used != last)...
      glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize);
      glBindBuffer (ELEMENT, obj.geometry->ibo);
      glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize);
      // iterate over cached material groups
      foreach ( batch in obj.materialGroups) {
        glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize);
        glMultiDrawElements (...);
      }
    }
• ~ 2 500 api drawcalls
• ~11 000 drawcalls
• ~55 triangles per call


SCENE DRAWING (INDIVIDUAL)
    UpdateBuffers();
    glBindBufferBase (UBO, 0, uboView);
    foreach (obj in scene) {
      ...
      // iterate over all parts individually
      foreach ( part in obj.parts) {
        if (part.material != lastMaterial){
          glBindBufferRange (UBO, 2, uboMaterial, part.materialOffset, mtlSize);
        }
        glDrawElements (...);
      }
    }
• ~68 000 drawcalls
• ~10 triangles per call


PERFORMANCE SHADED
Render all objects as triangles
• GROUPED: ~ 300 KB (22 k tokens, ~11 k buffer related, 11 k for drawing)
• INDIVIDUAL: ~ 1 MB (79 k tokens, ~68 k for drawing)

Draw time (timer)
Technique       11k draws               68k draws
                GPU    CPU              GPU    CPU
Core OpenGL     0.4    1.4              3.1    6.7
NV bindless     0.3    0.7 (2 x)        1.9    3.8 (1.7 x)
TOKEN buffer    0.3    ~0  (BIG x)      0.9    ~0  (BIG x)
Preliminary results M6000


PERFORMANCE SHADED
Removing buffer redundancy filtering
• Token stream becomes: Material UBO, Draw, Material UBO, Draw, Material UBO...
• Adds 60 k UBO, and 3.4 k EBO & VBO tokens; total 144 k tokens

Draw time (timer), 68k draws
Technique                 GPU            CPU
Unfiltered Core OpenGL    5.6            11.1
Core OpenGL               3.1 (1.8 x)    6.7 (1.6 x)
Unfiltered TOKEN buffer   1.9 (2.9 x)    ~0  (BIG x)
Preliminary results M6000


PERFORMANCE SHADED & EDGES
For each object render triangles then lines
• Frequent alternation between two state objects (TRIANGLES/LINES) (~5000 times)
• GROUPED: 540 KB (~ 40k tokens)
• INDIVIDUAL: 2 MB (~ 160k tokens)

Draw time (timer)
Technique       11k*2 draws             68k*2 draws
                GPU    CPU              GPU     CPU
Core OpenGL     0.8    2.4              11.5    14.3
NV bindless     0.8    1.4 (1.7 x)      6.5     8.0 (1.7 x)
TOKEN buffer    0.8    0.4 (6 x)        2.1     0.4 (35 x)
Preliminary results M6000


EXAMPLE USE CASES
5 000 shader changes: toggling between two shaders in "shaded & edges"

Timer                  GPU             CPU
NV bindless            12.3            15.1
TOKEN buffer           1.6 (7.7 x)     0.4   (37 x)
Compiled TOKEN list    1.5 (8.2 x)     0.005 (BIG x)

5 000 fbo changes: similar to above but with fbo toggle instead of shader
• Almost no additional cost compared to rendering without fbo changes

Timer                  GPU             CPU
NV bindless            57.0            59.0
TOKEN buffer           1.1 (51 x)      0.9   (65 x)
Compiled TOKEN list    0.8 (71 x)      0.022 (BIG x)
Preliminary results on M6000


TOKEN STREAMING
In case token buffer cannot be reused, fill tokens every frame
• Fill & emit from a single thread or multiple threads
• Pass command buffer pointers to worker threads, which do not require GL contexts
• Handle state objects in GL thread, or pass what is required to generate between threads (GL thread captures state, while worker fills command buffer)
[Diagram: single-threaded: get pointer, generate token stream, emit. Multi-threaded: GL thread hands out pointers and emits, otherwise idle or doing state captures, while worker threads generate token streams]


TOKEN STREAMING
Rendering the model 16 times (176k draws)
• StateObjects are reused, tokens regenerated and submitted in chunks (~22 per frame)
• Framework is "too simple" in terms of per-thread work to show greater scaling

Draw time (timer), 11k*16 draws
Technique                  GPU    CPU
Core OpenGL (1 thread)     22     27
TOKEN 1 worker thread      4.3    3.5 (7.7 x)
TOKEN 2 worker threads     4.3    2.1 (12 x)
TOKEN 3 worker threads     4.3    1.7 (15 x)
Preliminary results M6000


MIGRATION STEPS
• All rendering via FBO (statecapture doesn't support default backbuffer)
• No legacy state use in GLSL
  • Shader driven pipeline, use generic glVertexAttribPointer (not glTexCoordPointer and so on), use custom uniforms, no gl_ModelView...
• No classic uniforms, all in UBO
  • ARB_bindless_texture for texturing
• Bindless Buffers
  • ARB_vertex_attrib_binding combined with NV_vertex_buffer_unified_memory
• Organize for StateObject reuse
  • Can no longer just "glEnable(GL_BLEND)", avoid many state captures per frame


MIGRATION TIPS
Vertex Attributes and bindless VBO
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (slide 11-16)
GLSL
    // classic attributes
    ...
    normal = gl_Normal;
    gl_Position = gl_Vertex;

    // generic attributes
    // ideally share this definition across C and GLSL
    #define VERTEX_POS 0
    #define VERTEX_NORMAL 1
    in layout(location= VERTEX_POS) vec4 attr_Pos;
    in layout(location= VERTEX_NORMAL) vec3 attr_Normal;
    ...
    normal = attr_Normal;
    gl_Position = attr_Pos;


MIGRATION TIPS
UBO Parameter management
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (18-27, 44-47)
Ideally group by frequency of change
    // classic uniforms
    uniform samplerCube viewEnvTex;
    uniform vec4 materialColor;
    uniform sampler2D materialTex;
    ...

    // UBO usage, bindless texture inside UBO, grouped by change
    layout(commandBindableNV) uniform;
    layout(std140, binding=0) uniform view {
      samplerCube viewEnvTex;
    };
    layout(std140, binding=1) uniform material {
      vec4 materialColor;
      sampler2D texMaterialColor;
    };
    ...


LET GPU DO MORE WORK
When data is only referenced, we can:
• Still change vertices, materials, matrices... from CPU
• Perform updates based on additional knowledge on GPU
  • Object data (matrices, materials animation)
  • Geometry data (deformation, skinning, morphing...)
  • Occlusion Culling
  • Level of Detail


TRANSFORM TREE UPDATES
All matrices stored on GPU
• Use ARB_compute_shader for hierarchy updates, send only local matrix changes, evaluate tree
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (29-30)
model courtesy of PTC


OCCLUSION CULLING
Try to create less total workload
• Many occluded parts in the car model (lots of vertices)


GPU CULLING BASICS
GPU friendly processing
• Matrix, bbox and object (matrixIdx + bboxIdx) buffers
• More efficient than occlusion queries, as we test many objects at once
Results
• Readback: GPU to Host
  • GPU can pack bit stream (0,1,0,1,1,1,0,0,0)
• Indirect: GPU to GPU
  • E.g. DrawIndirect's instanceCount to 0 or 1
    buffer cmdBuffer{
      Command cmds[];
    };
    ...
    cmds[obj].instanceCount = visible;
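
Both result paths can be illustrated on the CPU. The sketch below mirrors the GLSL assignment cmds[obj].instanceCount = visible and the bit-stream packing for readback. It is an illustration under assumptions, not the presenters' code: the struct follows the core OpenGL layout used by glDrawElementsIndirect, while the function names applyVisibility and packVisibilityBits and the choice of 32 objects per word are hypothetical.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

// One indirect draw record, laid out as core OpenGL defines it for glDrawElementsIndirect.
struct DrawElementsIndirectCommand {
    uint32_t count;
    uint32_t instanceCount;
    uint32_t firstIndex;
    int32_t  baseVertex;
    uint32_t baseInstance;
};

// Indirect path: an invisible object keeps its record but draws zero instances.
void applyVisibility(std::vector<DrawElementsIndirectCommand>& cmds,
                     const std::vector<uint8_t>& visible) {
    for (std::size_t obj = 0; obj < cmds.size() && obj < visible.size(); ++obj) {
        cmds[obj].instanceCount = visible[obj] ? 1u : 0u;
    }
}

// Readback path: pack one bit per object, 32 objects per word, lowest bit first.
std::vector<uint32_t> packVisibilityBits(const std::vector<uint8_t>& visible) {
    std::vector<uint32_t> bits((visible.size() + 31) / 32, 0u);
    for (std::size_t obj = 0; obj < visible.size(); ++obj) {
        if (visible[obj]) {
            bits[obj / 32] |= 1u << (obj % 32);
        }
    }
    return bits;
}
```

Packing keeps the readback small: one 32-bit word carries the result for 32 objects.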


OCCLUSION CULLING
HiZ occlusion (depth max pyramid)
• Test clip box against depth texels
• Projected size determines depth mip level
Raster occlusion (depth buffer)
• Passing bbox fragments enable object
• Raster gives more accurate results
Both benefit from temporal coherence usage to avoid a dedicated depth pass
http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (slide 49 - 54)
    // rendered without depth or color writes
    // GLSL fragment shader
    // from ARB_shader_image_load_store
    layout(early_fragment_tests) in;
    ...
    void main(){
      visibility[objectID] = 1;
      // could use atomicAdd for coverage
    }


RESULTS VIA READBACK
Use dedicated buffers for readback
• One for GPU processing only (ensures best memory type used)
• N for readbacks (for example 4 to avoid sync points)
    glCopyNamedBufferSubData (gpuresult, readbacks[ frame % N ]...)
• Readback could be mapped persistently via GL_ARB_buffer_storage
Ideally delay access of readback for a few frames
• Avoids need for synchronization, but can introduce visible artefacts
• Read back older frames to give CPU additional knowledge, but use GPU indirect methods for rendering
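
The readbacks[ frame % N ] rotation combined with delayed access can be sketched as plain index arithmetic. This is a minimal illustration; the struct ReadbackRing, its member names and the delay parameter are hypothetical, and the sketch assumes 0 < delay < count so that the slot being read is never the slot being written in the same frame.

```cpp
#include <cstddef>
#include <cstdint>

// Ring of N readback buffers: write this frame's result, read an older one.
struct ReadbackRing {
    std::size_t count;  // N, number of readback buffers
    std::size_t delay;  // frames to wait before reading, assumed 0 < delay < count

    // Slot that receives the GPU result copy for this frame.
    std::size_t writeSlot(uint64_t frame) const { return frame % count; }

    // No result is old enough during the first `delay` frames.
    bool canRead(uint64_t frame) const { return frame >= delay; }

    // Slot holding the result written `delay` frames ago.
    std::size_t readSlot(uint64_t frame) const { return (frame - delay) % count; }
};
```

With N = 4 and a delay of 3 frames, frame 7 copies into slot 3 and the CPU reads slot 0, which was written at frame 4.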


RESULTS VIA COMMANDLIST
Commandlist culling needs several buffers
• Token command stream (input & output): variable size
• Token attributes (input & output): size, offset, object ID
  • Can use negative objectID to encode tokens that must always be added
Algorithm:
• First compute output sizes using object ID and visibility
    output.sizes [ token ] = visible [ objectID ] ? input.sizes[ token ] : 0
• Run a scan operation to compute output offsets
• Build output token stream
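
The three algorithm steps can be sketched as a sequential CPU reference. On the GPU each loop would be a parallel pass; this version only shows the data flow. The struct TokenAttribute and the function cullTokens are hypothetical names, and treating every negative object ID as "always keep" is this sketch's reading of the slide.

```cpp
#include <cstddef>
#include <cstdint>
#include <cstring>
#include <vector>

// Per-token attributes as listed on the slide: offset, size and object ID.
struct TokenAttribute {
    uint32_t offset;    // byte offset of the token in the input stream
    uint32_t size;      // byte size of the token
    int32_t  objectID;  // negative: token is always kept (e.g. shared state)
};

std::vector<uint8_t> cullTokens(const std::vector<uint8_t>& input,
                                const std::vector<TokenAttribute>& attrs,
                                const std::vector<uint8_t>& visible) {
    // Step 1: output size per token, 0 if its object is not visible.
    std::vector<uint32_t> outSizes(attrs.size());
    for (std::size_t t = 0; t < attrs.size(); ++t) {
        const TokenAttribute& a = attrs[t];
        const bool keep = a.objectID < 0 || visible[static_cast<std::size_t>(a.objectID)] != 0;
        outSizes[t] = keep ? a.size : 0u;
    }
    // Step 2: exclusive scan of the sizes gives each token's output offset.
    std::vector<uint32_t> outOffsets(attrs.size());
    uint32_t total = 0;
    for (std::size_t t = 0; t < attrs.size(); ++t) {
        outOffsets[t] = total;
        total += outSizes[t];
    }
    // Step 3: copy the surviving tokens to their compacted positions.
    std::vector<uint8_t> output(total);
    for (std::size_t t = 0; t < attrs.size(); ++t) {
        if (outSizes[t] != 0) {
            std::memcpy(output.data() + outOffsets[t], input.data() + attrs[t].offset, outSizes[t]);
        }
    }
    return output;
}
```

This version compacts the whole stream as one sequence; streams that hold several sequences need the per-sequence offset correction described on the next slide.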


RESULTS VIA COMMANDLIST
Multiple sequences may be stored in the token stream (different stateobjects..)
• The sequence separation is provided by CPU, which we can't alter
Steps on the original token stream (Sequence A, Sequence B):
• Find out which tokens to cull
• Token offsets from scan are global
• Correct output offset based on sequence's start offset
• Insert terminate sequence (TS) when: last token's offset != original offset
[Diagram labels: culled tokens; unused space after the compacted tokens of each sequence; TS]
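
The offset correction can be sketched as follows, again as a sequential CPU reference under assumptions. The global scan offsets are rebased to each sequence's fixed start, and a terminate token is required whenever the compacted end no longer matches the original end. The names Sequence, Relocation and relocateSequence are hypothetical, and the sketch only computes where the terminate token would go rather than writing it.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Sequence {
    uint32_t    start;       // byte offset of the sequence, fixed by the CPU
    uint32_t    size;        // original byte size of the sequence
    std::size_t firstToken;  // index of the sequence's first token
    std::size_t tokenCount;  // number of tokens in the sequence
};

struct Relocation {
    std::vector<uint32_t> offsets;  // output byte offset per token of the sequence
    uint32_t end;                   // first byte after the last kept token
    bool     needsTerminate;        // a terminate token must be written at `end`
};

// keptSizes: per-token output size (0 if culled).
// globalOffsets: exclusive scan of keptSizes over ALL tokens of the stream.
Relocation relocateSequence(const Sequence& seq,
                            const std::vector<uint32_t>& keptSizes,
                            const std::vector<uint32_t>& globalOffsets) {
    Relocation r;
    r.offsets.resize(seq.tokenCount);
    r.end = seq.start;
    const uint32_t base = seq.tokenCount ? globalOffsets[seq.firstToken] : 0u;
    for (std::size_t i = 0; i < seq.tokenCount; ++i) {
        const std::size_t t = seq.firstToken + i;
        // Rebase the global scan result to the sequence's own start offset.
        r.offsets[i] = seq.start + (globalOffsets[t] - base);
        r.end = r.offsets[i] + keptSizes[t];
    }
    // If anything was culled, the sequence ends early and must be terminated.
    r.needsTerminate = (r.end != seq.start + seq.size);
    return r;
}
```

Because the CPU-provided offsets[] and sizes[] stay untouched, the terminate token is what tells the GPU to stop before the unused tail of a sequence.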


RESULTS
Now overcomes deficit of previous methods
http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf

Draw time (timer), 11k draws
Technique                            GPU            CPU
glBindBufferRange                    6.2            3.4
TOKEN native                         6.2            ~0  (BIG x)
CULL Old Readback (stalls pipe)*     3.1 (2 x)      3.1 (1.1 x)
CULL Old Bindless MDI*               3.7 (1.6 x)    0.7 (4.8 x)
CULL NEW TOKEN buffer                2.5 (2.5 x)    0.2 (17 x)
Preliminary results M6000, * taken from slightly different framework

SCENE (no instancing, data replication):
• materials: 138
• objects: 13,032
• geometries: 3,312
• parts: 789,464
• triangles: 29,527,840
• vertices: 27,584,376


DYNAMIC LEVEL OF DETAIL
For example particle LOD
• GPU classifies how to render particles based on screen area, without CPU involved (GL 4.3)
  • Point sprite
  • Simple mesh via enhanced instancing
  • Adaptive tessellation
Tokens allow same for more complex objects
[Diagram labels: token stream (VBO, UBO, DRAW); State A ... TS; State B ... TS]


CONCLUSION
Leverage GPU to full extent
• Modern software approaches (command buffers, stateobjects...) found in many new graphics APIs (DX12, Vulkan...) or extended OpenGL
• Higher fidelity (e.g. multiple scene passes) or interactivity for even larger scenes
• Save CPU time (power/battery, other work...)
GPU can do more than "just" rendering
• Drive decision making (culling, LOD, interactive scientific data brushing...)
• Compute auxiliary data (matrices, materials...)
NV_command_list and NVIDIA's bindless enable workflows beyond core api


THANK YOU
Contact: ckubisch@nvidia.com @pixeljetstream
Sample code
• https://github.com/nvpro-samples
Past presentations
• http://www.slideshare.net/tlorach/opengl-nvidia-commandlistapproaching-zerodriveroverhead
• http://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf
• http://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdf
OpenGL work creation references
• http://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/
• http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/


BACKUP


HIZ CULLING
OpenGL 3.x/4.x
• Depth-Pass
• Create mipmap pyramid, MAX depth
  • GM2xx supports GL_EXT_texture_filter_minmax
• "invisible" vertex shader or compute
  • Compare object's clipspace bbox against z value of depth mip
  • The mip level is chosen by clipspace 2D area
[Figure labels: projected size determines depth mip level; mip texels cover object]
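
The mip selection and depth comparison can be sketched as below. This shows one common heuristic and is not necessarily what the presenters implemented: it picks the first level at which the longer side of the projected rectangle spans at most one texel, so the box touches at most 2x2 texels, and it assumes the usual depth convention where larger values are farther away. The function names and parameters are hypothetical.

```cpp
#include <algorithm>
#include <cmath>

// Chooses the depth pyramid level for a projected bounding rectangle whose
// size is given in pixels of mip level 0. Result is clamped to [0, maxLevel].
int hizMipLevel(float widthPixels, float heightPixels, int maxLevel) {
    const float extent = std::max(std::max(widthPixels, heightPixels), 1.0f);
    const int level = static_cast<int>(std::ceil(std::log2(extent)));
    return std::min(std::max(level, 0), maxLevel);
}

// With a MAX depth pyramid, the object is hidden if even its nearest point
// lies behind the farthest depth stored in the texels it covers.
// Assumes larger depth means farther from the camera.
bool isOccluded(float boxNearestDepth, float maxDepthOfCoveredTexels) {
    return boxNearestDepth > maxDepthOfCoveredTexels;
}
```

Small objects are tested against fine levels and large objects against coarse ones, so each test needs only a few texel fetches regardless of screen coverage.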


RASTER CULLING
OpenGL 4.2+
• Depth-Pass
• Raster "invisible" bounding boxes
  • Disable Color/Depth writes
  • Geometry Shader to create the three visible box sides
  • Depth buffer discards occluded fragments (earlyZ...)
  • Fragment Shader writes output: visible[objindex] = 1
Passing bbox fragments enable object
Algorithm by Evgeny Makarov, NVIDIA
    // GLSL fragment shader
    // from ARB_shader_image_load_store
    layout(early_fragment_tests) in;

    buffer visibilityBuffer{
      int visibility[]; // cleared to 0
    };
    flat in int objectID; // unique per box

    void main(){
      visibility[objectID] = 1;
      // no atomics required (32-bit write)
    }


TEMPORAL COHERENCE
Few changes relative to camera
Draw each object only once
• Render last visible, fully shaded: (last)
• Test all against current depth: (visible)
• Render newly added visible: (~last) & (visible)
  • none, if no spatial changes made
• (last) = (visible)
Algorithm by Markus Tavenrath, NVIDIA
[Figure labels: frame f – 1 and frame f after the camera moved; visible, invisible, last visible and new visible objects; bboxes pass depth (visible) or bboxes occluded]
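
The set expression (~last) & (visible) maps directly onto packed visibility bits. The sketch below assumes one bit per object, 32 objects per word, and equal-length inputs; the function name newlyVisible is hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Returns the objects that pass the current depth test but were not drawn in
// the first pass, i.e. (~last) & (visible), one bit per object.
std::vector<uint32_t> newlyVisible(const std::vector<uint32_t>& last,
                                   const std::vector<uint32_t>& visible) {
    assert(last.size() == visible.size());
    std::vector<uint32_t> added(visible.size());
    for (std::size_t w = 0; w < visible.size(); ++w) {
        added[w] = ~last[w] & visible[w];
    }
    return added;
}
```

After the second pass draws these objects, the visible bits become the last bits for the next frame. If the camera and scene did not change, the result is all zeros and the second pass draws nothing.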
