S5135-Christoph-Kubisch-and-Pierre-Boudier

GPU-DRIVEN LARGE SCENERENDERING NV_COMMAND_LISTPierre Boudier, Quadro Software ArchitectChristoph Kubisch, Developer Technology Engineer

MOTIVATIONModern GPUs have a lot of execution units to make use ofQuadro 4000:Quadro K4000:Quadro K4200:256 cores768 cores1344 coresQuadro M6000: 3072 coresHow to leverage all this power?Efficient API usage and rendering algorithmsAPIs reflecting recent hardware designs and capabilities2

App + driverCHALLENGE OF ISSUING COMMANDSIssuing drawcalls and state changes can be a real bottleneckCPUGPUExcessiveWork fromApp & DriverOn CPU!!GPUidlecourtesy of PTC• 650,000 Triangles• 3,700,000 Triangles• 14,338,275 Triangles/lines• 68,000 Parts• 98,000 Parts• 300,528 drawcalls (parts)• ~ 10 Triangles per part• ~ 37 Triangles per part• ~ 48 Triangles per part3

Avoid data redundancyENABLING GPU SCALABILITYData stored once, referenced multiple timesUpdate only once (less host to gpu transfers)Increase GPU workload per jobFurther cuts API callsLess CPU workMinimize CPU/GPU interactionAllow GPU to update its own dataLow API usage when scene is changed littleE.g. GPU-based culling, matrix updates...http://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdf4

64 bits addressWhat is it about?BINDLESS TECHNOLOGY64-bit pointers& handlesWork from native GPU pointers/handlesLess validation, less CPU cache thrashingElement buffer (EBO)IndicesVertex Puller (IA)GPU can use flexible data structuresBindless BuffersVertex Buffer (VBO)AttributesVertex ShaderVertex & Global memory since pre-FermiBindless Constants (UBO)Support for Fermi and aboveBindless TexturesSince KeplerUniform BlockTexture FetchUniform BlockGPUVirtualMemoryFragment ShaderGraphicsPipeline5

BINDLESS DRAWING LOOPUpdateBuffers();glBufferAddressRangeNV(UNIFORM..., 0, addrView, ...);// redundancy filters not shownforeach (obj in scene) {glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices + obj.mtxOffset, ...);// iterate over cached material groupsforeach ( batch in obj.materialGroups) {glBufferAddressRangeNV(UNIFORM, 2, addrMaterials + batch.mtlOffset, ...);}}glMultiDrawElements (...);6

NV_COMMAND_LIST – KEY CONCEPTSTokenized Rendering (GPU modifiable command buffers):Simple state changes and draw commands are encoded into binary data streamLeverages bindless resourcesState Objects (pre-validated)Macro state (program, blending, fbo-config...) is captured into an objectControl over when costly validation happens, later reuse of objects is very fastCompiled Command List (alternative to token buffer)Display list like usage, however buffer addresses are referenced, therefore theircontent (matrices, vertices...) can still be modified.7

COMMAND PIPELINEApplicationDriverPush BufferCommands (FIFO)OpenGLCommandsOpenGLResources64 bitsPointersGPUHandles(IDs)Id 64 bitsAddr.8

COMMAND PIPELINEApplicationDriverPush BufferCommands (FIFO)OpenGLCommandsviaTokens & StateObjectsOpenGLResources64 bitsPointers(bindless)StateObjectresolveFast path throughdriver viaNV_command_listGPU9

TOKENIZED RENDERING// bindless scene drawing loopforeach (obj in scene) {glBufferAddressRangeNV(VERTEX.., 0, obj.geometry->addrVBO, ...);glBufferAddressRangeNV(ELEMENT..., 0, obj.geometry->addrIBO, ...);glBufferAddressRangeNV(UNIFORM..., 1, addrMatrices + obj.mtxOffset, ...);foreach ( batch in obj.materialCaches) {glBufferAddressRangeNV(UNIFORM, 2, addrMaterials + batch.mtlOffset, ...);glMultiDrawElements(...)}}All these commands (hundreds ofthousands) for the entire scene canbe replaced by a single call to API!glDrawCommandsNV (TRIANGLES,tokenBuffer, offsets[], sizes[], count);// {0}, {tokensSize}, 1Token bufferVBO - addressEBO - addressUBO – matrix addressUBO – material addressDraw – first, count...UBO – material addressDraw – first, count......ObjectMaterialbatchesNextObject...10

TOKENIZED RENDERINGTokens are tightly packed structs in linear memoryDRAW tokens allowmixing strips, lists,fans, loops of samebase mode(TRIANGLES, LINES,POINTS) in singledispatch*CommandNV {GLuint header; // glGetCommandHeaderNV(type,…)... command specific payload}; ELEMENT_ADDRESS_COMMAND_NVTERMINATE_SEQUENCE_COMMAND_NVNOP_COMMAND_NVDRAW_ELEMENTS_COMMAND_NVDRAW_ARRAYS_COMMAND_NVDRAW_ELEMENTS_STRIP_COMMAND_NVDRAW_ARRAYS_STRIP_COMMAND_NVDRAW_ELEMENTS_INSTANCED_COMMAND_NVDRAW_ARRAYS_INSTANCED_COMMAND_NVATTRIBUTE_ADDRESS_COMMAND_NVUNIFORM_ADDRESS_COMMAND_NVBLEND_COLOR_COMMAND_NVSTENCIL_REF_COMMAND_NVLINE_WIDTH_COMMAND_NVPOLYGON_OFFSET_COMMAND_NVALPHA_REF_COMMAND_NVVIEWPORT_COMMAND_NVSCISSOR_COMMAND_NVFRONTFACE_COMMAND_NV11

TOKENIZED RENDERING// single drawcall, tokens encoded into raw memory buffer!glDrawCommandsNV (..., tokenBuffer, offsets[], sizes[], count);// {0}, {bufferSize}, 1VBO EBO UBO Matrix UBO Material Draw UBO Material Draw DrawAttributeAddressCommandNV{GLuint header;ElementAddressCommandNV{GLuint header;UniformAddressCommandNV{GLuint header;DrawElementsCommandNV{Gluint header;}GLuint index;GLuint64 address;}GLuint64 address;GLuint typeSizeInByte;}GLushort index;GLushort stage;// glGetStageIndexNV(VERTEX..)GLuint64 address;}GLuintGLuintGLuintcount;firstIndex;baseVertex;12

What is so great about it?TOKENIZED RENDERINGIt‘s crazy fast (see later) and tokens are popular in render engines alreadyThe tokenbuffer is a „regular“ GL bufferCan be manipulated by all mechanisms OpenGL offersCan be filled from different CPU threads (which do not require a GL context)Expands the possibilities of GPU driving its own work without CPU roundtrip13

STATE OBJECTSStateObjectEncapsulates majority of state (fbo format, active shader, blend, depth ...),but no bindings! (use bindless textures passed via UBO...)glCaptureStateNV ( stateobject, GL_TRIANGLES );Less rendertime variability, explicit control over validation timeRender entire scenes with different shaders/fbos... in one goDriver caches state transitions// single drawcall, multiple shaders, fbos...glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);14

STATE OBJECTS// single drawcall, multiple shaders, fbos...glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);for i < count {if (i == 0) set state from states[i];else set state transition states[i-1] to states[i]if (fbo[i]) glBindFramebuffer( fbo[i] ) // must be compatible to states[i].fboelse glBindFramebuffer( states[i].fbo )}ProcessCommandSequence(... tokenBuffer, offsets[i], sizes[i])Can reuse tokens & state with different fbos (e.g. shadow passes)Compatibilty depends on fbo‘s drawbuffers, texture formats... but not sizes15

STATE OBJECTS// single drawcall, multiple shaders, fbos...glDrawCommandsStatesNV (tokenBuffer, offsets[], sizes[], states[], fbos[], count);// {0,sizeA}, {sizeA, sizeB}, {A,B}, {f,f}, 2tokenBuffer:VBO IBO Matrix UBO Material UBO Draw Material UBO Draw DrawSequence A (e.g. triangles)Draw Draw DrawSequence B (lines)[0]State Object AFBO fVBO IBO Matrix UBO Material UBO Draw Material UBO Draw Draw[1]State Object BFBO fDraw Draw DrawWithin glDrawCommandsStatesNV state set by tokens is inherited across sequences16

COMPILED COMMAND LISTCompiledCommand ListCombine multiple segments into CommandList objectTokens provided by system memoryStateObjectStateObjectStateObjectStateObjectglListDrawCommandsStatesClientNV( list, segment,void* tokencmds[], sizes[], states[], fbos[], count);FBOVBOFBOVBOFBOVBOFBOVBOLess flexibilty compared to token bufferToken content, state and fbo assignments are deep-copiedList is immutable, needs recompile if pointers/state changesglCompileCommandListNV( list );Allows even faster state transitionsAll key data is known to the driverIBOMatrix UBO Material UBO Material UBODraw Draw DrawIBOMatrix UBO Material UBO DrawIBOMatrix UBO Material UBO Material UBODraw DrawIBOMatrix UBO Material UBO Draw Draw Draw17

RESULTSHigh scene complexityNo instancing used, true copiesEach object unique and editable90 000 objectsEach drawn with triangles & linesRaw: 4.8m drawcallsStandard GL: 2 fpsCommandlist: 20 fps18

RENDERING RESEARCH FRAMEWORKRender test with „Graphicscard“ modelMany low-complexity drawcalls (CPUchallenged)Same geometrymultiple objects110 geometries, 66 materials68 000 parts2500 objectsSame geometry(fan) multiple parts19

„Shaded“ and „Shaded & Edges“SCENE STYLES20

SCENE DRAWING (GROUPED)UpdateBuffers();glBindBufferBase (UBO, 0, uboView);foreach (obj in scene) {// redundancy filter for these (if (used != last)...glBindVertexBuffer (0, obj.geometry->vbo, 0, vtxSize);glBindBuffer (ELEMENT, obj.geometry->ibo);glBindBufferRange (UBO, 1, uboMatrices, obj.matrixOffset, maSize);// iterate over cached material groupsforeach ( batch in obj.materialGroups) {glBindBufferRange (UBO, 2, uboMaterial, batch.materialOffset, mtlSize);}}glMultiDrawElements (...);~ 2 500 api drawcalls~11 000 drawcalls~55 triangles per call21

UpdateBuffers();glBindBufferBase (UBO, 0, uboView);foreach (obj in scene) {...SCENE DRAWING (INDIVIDUAL)}// iterate over all parts individuallyforeach ( part in obj.parts) {if (part.material != lastMaterial){glBindBufferRange (UBO, 2, uboMaterial, part.materialOffset, mtlSize);}glDrawElements (...);}~68 000 drawcalls~10 triangles per call22

PERFORMANCE SHADEDRender all objects as trianglesGROUPED: ~ 300 KB ( 22 k tokens, ~11k buffer related, 11 k for drawing)INDIVIDUAL: ~ 1 MB ( 79 k tokens, ~68 k for drawing)TechniqueDraw time 11k draws Draw time 68k drawsTimerGPU CPU GPU CPUCore OpenGL 0.4 1.4 3.1 6.7NV bindless0.30.7 2 x 1.93.8 1.7 xTOKEN buffer 0.3~0 BIG x 0.9~0 BIG xPreliminary results M600023

PERFORMANCE SHADEDRemoving buffer redundancy filteringMaterial UBODraw Material UBO Draw Material UBOadds 60 k UBO, and 3.4k EBO & VBO tokens; total 144 k tokensTechniqueDraw time 68k drawsTimer GPU CPUUnfiltered Core OpenGL 5.6 11.1Core OpenGL 3.1 1.8 x 6.7 1.6 xUnfiltered TOKEN buffer 1.9 2.9 x ~0 BIG xPreliminary results M600024

PERFORMANCE SHADED & EDGESFor each object render triangles then linesFrequent alternation between two state objects(TRIANGLES/LINES) (~5000 times)GROUPED: 540 KB ( ~ 40k tokens)INDIVIDUAL: 2 MB ( ~ 160k tokens)TechniqueDraw time 11k*2 draws Draw time 68k*2 drawsTimerGPU CPU GPU CPUCore OpenGL 0.8 2.4 11.5 14.3NV bindless0.81.4 1.7 x 6.58.0 1.7 xTOKEN buffer 0.80.4 6 x 2.10.4 35 xPreliminary results M600025

EXAMPLE USE CASES5 000 shader changes: toggling between two shaders in „shaded & edges“Timer GPU CPUNV bindless 12.3 15.1TOKEN buffer1.6 7.7 x 0.4 37 xCompiled TOKEN list1.5 8.2 x 0.005 BIG x5 000 fbo changes: similar as above but with fbo toggle instead of shaderAlmost no additional cost compared to rendering without fbo changesTimer GPU CPUNV bindless 57.0 59.0TOKEN buffer1.1 51 x 0.9 65 xCompiled TOKEN list0.8 71 x 0.022 BIG xPreliminary results on M600026

TOKEN STREAMINGIn case token buffer cannot be reused, fill tokens every frameFill & emit from a single thread or multiple threadsPass command buffer pointers to worker threads, that do not require GL contextsHandle state objects in GL thread, or pass what is required to generate betweenthreads (GL thread captures state, while worker fills command buffer)Single-threadedPtrGenerate token streamEmitMulti-threadedGL threadIdle or StateCapturesPtr Ptr Emit EmitWorker threadWorker threadGenerate token streamGenerate token stream27

TOKEN STREAMINGRendering the model 16 times (176k draws)StateObjects are reused, tokens regenerated andsubmitted in chunks (~22 per frame)Framework is „too simple“ in terms of per-threadwork to show greater scalingTechniqueDraw time 11k*16 drawsTimerGPU CPUCore OpenGL (1 thread) 22 27TOKEN 1 worker thread4.33.5 7.7 xTOKEN 2 worker threads 4.32.1 12 xTOKEN 3 worker threads 4.3 1.7 15 xPreliminary results M600028

MIGRATION STEPSAll rendering via FBO (statecapture doesn‘t support default backbuffer)No legacy state use in GLSLShader driven pipeline, use generic glVertexAttribPointer (not glTexCoordPointer and soon), use custom uniforms no gl_ModelView...No classic uniforms, all in UBOARB_bindless_texture for texturingBindless BuffersARB_vertex_attrib_binding combined with NV_vertex_buffer_unified_memoryOrganize for StateObject reuseCan no longer just „glEnable(GL_BLEND)“, avoid many state captures per frame29

MIGRATION TIPSVertex Attributes and bindless VBOhttp://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (slide 11-16)GLSL// classic attributes// generic attributes// ideally share this definition across C and GLSL#define VERTEX_POS 0#define VERTEX_NORMAL 1...normal = gl_Normal;gl_Position = gl_Vertex;in layout(location= VERTEX_POS) vec4 attr_Pos;in layout(location= VERTEX_NORMAL) vec3 attr_Normal;...normal = attr_Normal;gl_Position = attr_Pos;30

UBO Parameter managementMIGRATION TIPShttp://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf ( 18-27, 44-47)Ideally group by frequency of change// classic uniformsuniform samplerCube viewEnvTex;uniform vec4 materialColor;uniform sampler2D materialTex;// UBO usage, bindless texture inside UBO, grouped by changelayout(commandBindableNV) uniform;layout(std140, binding=0) uniform view{samplerCube viewEnvTex;};layout(std140, binding=1) uniform material {vec4 materialColor;sampler2D texMaterialColor;};......31

LET GPU DO MORE WORKWhen data is only referenced, we can:Still change vertices, materials, matrices... from CPUPerform updates based on additional knowledge on GPUObject data (matrices, materials animation)Geoemtry data (deformation, skinning, morphing...)Occlusion CullingLevel of Detail33

TRANSFORM TREE UPDATESAll matrices stored on GPUUse ARB_compute_shader for hierarchy updates, send only local matrixchanges, evaluate treehttp://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf ( 29-30 )model courtesy of PTC34

Try create less total workloadOCCLUSION CULLINGMany occluded parts in the car model (lots of vertices)35

GPU friendly processingResultsGPU CULLING BASICSMatrix, bbox and object (matrixIdx + bboxIdx) buffersMore efficient than occ. queries, as we test many objects at onceReadback: GPU to HostGPU can pack bit streamIndirect: GPU to GPUE.g. DrawIndirect‘s instanceCount to 0 or 10,1,0,1,1,1,0,0,0buffer cmdBuffer{Command cmds[];};...cmds[obj].instanceCount = visible;36

HiZ occlusiondepth maxpyramidOCCLUSION CULLINGdepthbufferRaster OcclusionTest clip box againstdepth texelsProjected sizedeterminesdepth mip levelRaster gives more accurate resultsBoth benefit from temporal coherenceusage to avoid a dedicated depth passhttp://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdf (slide 49 - 54)Passing bboxfragments enableobject// rendered without depth or color writes// GLSL fragment shader// from ARB_shader_image_load_storelayout(early_fragment_tests) in;...void main(){visibility[objectID] = 1;// could use atomicAdd for coverage}37

RESULTS VIA READBACKUse dedicated buffers for readbackOne for GPU processing only (ensures best memory type used)N for readbacks (for example 4 to avoid sync points)glCopyNamedBufferSubData (gpuresult, readbacks[ frame % N ]...)Readback could be mapped persistently via GL_ARB_buffer_storageIdeally delay access of readback for a few framesAvoids need for synchronization, but can introduce visible artefactsReadback older frames to give CPU additional knowledge, but use GPUindirect methods for rendering38

RESULTS VIA COMMANDLISTCommandlist culling needs several buffersToken commandstream (input & output): variable sizeToken attributes (input & output): size, offset, object IDAlgorithm:Can use negative objectID to encode tokens that must always be addedFirst compute output sizes using object ID and visbilityoutput.sizes [ token ] = visible [ objectID ] ? input.sizes[ token ] : 0Run a scan operation to compute output offsetsBuild output tokenstream39

RESULTS VIA COMMANDLISTMultiple squences may be stored in the tokenstream (different stateobjects..)Original token streamSequence ASequence BFind out which tokens to cullculledToken offsets from scan areglobalunusedThe sequence separation is providedby CPU, which we can‘t alterCorrect output offset basedon sequence‘s start offsetunusedInsert terminate sequence when:last token‘s offset != originaloffsetTS40

RESULTSNow overcomes deficit of previous methodshttp://on-demand.gputechconf.com/gtc/2014/presentations/S4379-opengl-44-scene-rendering-techniques.pdfTechniqueDraw time 11k drawsTimer GPU CPUglBindBufferRange 6.2 3.4TOKEN native 6.2 ~0 BIG xCULL Old Readback (stalls pipe)* 3.1 2 x 3.1 1.1 xCULL Old Bindless MDI*3.7 1.6 x 0.7 4.8 xCULL NEW TOKEN buffer 2.5 2.5 x 0.2 17 xPreliminary results M6000, * taken from slightly differnt frameworkNo instancing! (data replication)SCENE:materials: 138objects: 13,032geometries: 3,312parts: 789,464triangles: 29,527,840vertices: 27,584,37641

For example particle LODDYNAMIC LEVEL OF DETAILGPU classifies how to render particles based onscreen area, without CPU envolved (GL 4.3)Point spriteSimple mesh via enhanced instancingAdaptive tessellationTokens allow same for more complex objectsVBO UBO DRAWState ATSState BTS42

Leverage GPU to full extentCONCLUSIONModern software approaches (command buffers, stateobjects...) found in manynew graphics APIs (DX12, Vulkan...) or extended OpenGLHigher fidelity (e.g. multiple scene passes) or interactivity for even larger scenesSave CPU time (power/battery, other work...)GPU can do more than „just“ renderingDrive decision making (culling, LOD, interactive scientific data brushing...)Compute auxiliary data (matrices, materials...)NV_command_list and NVIDIA‘s bindless enable workflows beyond core api43

THANK YOUContact: ckubisch@nvidia.com @pixeljetstreamSample codehttps://github.com/nvpro-samplesPast presentationshttp://www.slideshare.net/tlorach/opengl-nvidia-commandlistapproaching-zerodriveroverheadhttp://on-demand.gputechconf.com/siggraph/2014/presentation/SG4117-OpenGL-Scene-Rendering-Techniques.pdfhttp://on-demand.gputechconf.com/gtc/2013/presentations/S3032-Advanced-Scenegraph-Rendering-Pipeline.pdfOpenGL work creation referenceshttp://rastergrid.com/blog/2010/10/gpu-based-dynamic-geometry-lod/http://rastergrid.com/blog/2010/10/hierarchical-z-map-based-occlusion-culling/44

BACKUP45

HIZ CULLINGOpenGL 3.x/4.xDepth-PassCreate mipmap pyramid, MAX depthGM2xx supports GL_EXT_texture_filter_minmax„invisible“ vertex shader or computeCompare object‘s clipspace bbox against z value of depth mipThe mip level is chosen by clipspace 2D areaProjected size determinesdepth mip level mip texels cover object46

RASTER CULLINGOpenGL 4.2+Depth-PassRaster „invisible“ bounding boxesDisable Color/Depth writesdepthbufferPassing bbox fragmentsenable object// GLSL fragment shader// from ARB_shader_image_load_storelayout(early_fragment_tests) in;Algorithm byEvgeny Makarov, NVIDIAGeometry Shader to create the three visiblebox sidesDepth buffer discards occluded fragments(earlyZ...)Fragment Shader writes output:visible[objindex] = 1buffer visibilityBuffer{int visibility[]; // cleared to 0};flat in int objectID; // unique per boxvoid main(){visibility[objectID] = 1;// no atomics required (32-bit write)}47

TEMPORAL COHERENCEFew changes relative to cameraDraw each object only onceRender last visible, fully shaded(last)frame: f – 1cameraframe: fcameramovedvisibleinvisiblelast visibleTest all against current depth:(visible)Render newly added visible:none, if no spatial changes madebboxes pass depth(visible)bboxes occludedAlgorithm byMarkus Tavenrath, NVIDIA(~last) & (visible)(last) = (visible)new visible48

S5135-Christoph-Kubisch-and-Pierre-Boudier

You also want an ePaper? Increase the reach of your titles

Delete template?

Save as template?