Toward A Multicore Architecture for Real-time Ray-tracing

Core type             Performance (millions rays/sec)       #Tiles  #Cores
                      Ideal              Realistic
                      Ave.   Range       Ave.   Range
1-issue, no SMT       37     16-62       42     19-66       16      128
1-issue, 4 threads    82     33-138      43     13-74       16      115
2-issue, no SMT       34     15-57       26     12-41       10      78
2-issue, 4 threads    82     36-137      41     13-70       8       65

Table 5. Full system performance and area for all scenes. We recommend 1-issue, 4 threads as the best configuration.

bottleneck and not the network.

Full system performance: With the number of memory banks specified, our full system model can make performance projections for different core configurations and scenes, which are shown in Table 5 as an average and range across our five benchmark scenes. Columns two and three show “ideal” performance achieved by naively scaling the performance for one tile to 16 tiles (or as many as will fit), ignoring contention for shared resources. Columns four and five show realistic performance with the 4-DIMM memory. With 1-issue, 4-thread cores, our conservative memory model projects that we can sustain between 13 and 74 million rays/second, and the ideal model suggests 33 to 138 million rays/second.

Full system area analysis: Using die photos of other processors, we built an area model for the tiles and the full chip. For the overheads of multi-threading, we use methodology from academic and industry papers on area scaling [19]. The area for our baseline 1-issue core is derived from a Niagara2 core scaled to 22nm, and we assume a 12% area overhead for multi-threading with 2 threads and an additional 5% per thread. In our area analysis, the additional area of the 4-wide FPU compared to Niagara2’s FPU is ignored. We assume a 110% overhead when going from 1-issue to 2-issue, since in this model we also double the number of load/store ports and hence the L1 cache sizes. The cache size is fixed at 2MB for all designs, and this includes the cross-bar area between the cores and the cache.
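The area assumptions above can be turned into a small calculation. The following Python sketch is illustrative only and is not the authors' area model. It assumes that the multi-threading and dual-issue overheads compose multiplicatively, and it reads "additional 5% per thread" as five percentage points for each thread beyond the second. The function names and the per-tile throughput passed to ideal_mrays are hypothetical. The sketch gives core area relative to the baseline 1-issue, single-threaded core; it does not reproduce the tile and core counts in Table 5, which come from the full tile and chip area model.

```python
"""Relative core area and naive "ideal" scaling: a sketch under stated assumptions."""

BASELINE_AREA = 1.0           # baseline 1-issue, single-threaded core
DUAL_ISSUE_OVERHEAD = 1.10    # 110% overhead going from 1-issue to 2-issue
SMT_2_THREAD_OVERHEAD = 0.12  # 12% overhead for 2 threads
SMT_EXTRA_THREAD = 0.05       # assumed: +5 points per thread beyond the second
MAX_TILES = 16                # one tile is scaled to at most 16 tiles


def relative_core_area(issue_width: int, threads: int) -> float:
    """Return core area relative to the baseline core."""
    if issue_width not in (1, 2):
        raise ValueError("only 1-issue and 2-issue cores are modeled")
    if threads < 1:
        raise ValueError("threads must be at least 1")
    area = BASELINE_AREA
    if issue_width == 2:
        area *= 1.0 + DUAL_ISSUE_OVERHEAD
    if threads >= 2:
        # Assumption: the SMT overhead multiplies the (possibly dual-issue) area.
        area *= 1.0 + SMT_2_THREAD_OVERHEAD + SMT_EXTRA_THREAD * (threads - 2)
    return area


def ideal_mrays(per_tile_mrays: float, tiles_that_fit: int) -> float:
    """Naive "ideal" throughput: one tile's rate times the tile count,
    ignoring contention for shared resources."""
    return per_tile_mrays * min(tiles_that_fit, MAX_TILES)


if __name__ == "__main__":
    for issue, threads in [(1, 1), (1, 4), (2, 1), (2, 4)]:
        area = relative_core_area(issue, threads)
        print(f"{issue}-issue, {threads} thread(s): {area:.2f}x baseline")
    # Hypothetical per-tile rate of 5 million rays/sec on a chip that fits 10 tiles.
    print(ideal_mrays(5.0, 10))
```

Under these assumptions, a 1-issue, 4-thread core costs about 1.22 times the baseline core area, and a 2-issue, 4-thread core about 2.56 times.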
Columns six and seven in Table 5 show the number of tiles and cores of each type that will fit on a 240mm² chip. With single-issue cores, we can easily fit 16 tiles, and about half as many with dual-issue cores.

Summary: Overall, going to dual issue does not provide significantly higher performance, and single-issue, 4-way multithreaded cores seem ideal. Single-issue cores without SMT seem to match the 4-way SMT cores at the chip level, because there is insufficient memory bandwidth to feed the threads. At the chip level, memory bandwidth is the primary bottleneck. While our final performance results do not outright exceed the required 100 million rays/second for every scene, the flexibility and potential for further architectural optimizations show this is a viable system.

6. Related work

Recently, a few application-driven architectures have been proposed. Yeh et al. have proposed ParallAX, an architecture specialized for real-time physics processing [42]. Hughes et al. recently proposed a 64-core CMP for animation and visual effects [10]. Kumar et al. have proposed an architecture called Carbon that can support fine-grained parallelism for large scale chip-multiprocessors [14]. Yang et al. describe a scalable streaming processor targeted at scientific applications [41]. Clark et al. have proposed a technique called liquid SIMD that can abstract away the details of the data-parallel hardware [4]. Sankaralingam et al. developed a methodical approach for characterizing data parallel applications and proposed a set of universal mechanisms for data parallelism [29]. All these designs follow the flow of workload characterization of an existing application driving the design of an architecture. Our work is different in two respects. First, we co-design future challenge applications and the architecture to meet the application performance requirements. Second, we provide a more general purpose architecture and new quantitative tools to support the design process.
Embedded systems use similar hardware/software co-design, but for building specialized processors and at a much smaller scale. Architecture-specific analytical models have been applied to processor pipelines to analyze performance and power [22], [37], [13], [34], [43], [8], [27]. These models have also been used for design space exploration and rapid simulation [12], [11], [16], [3], [6], [24].

7. Discussion and Future work

Building efficient systems requires that the software and hardware be designed for each other. Deviating both hardware and software architectures from existing designs poses a particularly challenging co-design problem. The focus of this paper has been the co-design of the software, architecture, and evaluation strategy for one such system: a ray-tracing based real-time graphics platform that is fundamentally different from today’s Z-buffer based graphics systems.

Our software component (Razor) was designed to allow graphics and scene behavior to be exploited by hardware via fine-grained parallelism, locality in graph traversal, good behavior of secondary rays, and slow growth in memory by virtue of kd-tree duplication. To solve the application/architecture “chicken-and-egg” problem, we implemented Razor on existing hardware, in our case SSE-accelerated x86 cores, and built a prototype system using available technology with effectively eight cores. This prototype allowed us to perform detailed application characterization on real scenes, something impossible in simulation and meaningless without an optimized application.

Closing the development loop with design, evaluation and analysis of software’s behavior on our proposed hardware architecture was accomplished via a novel analytical model that provided intuition both for the architecture and the application. For example, an important application design question was to decide whether to duplicate or share the kd-tree data structure. Our 8-core prototype hardware system does not scale and our simulator for proposed hardware is too slow to run the full application to completion. Only the model could reveal that duplication overhead is manageable, thus relaxing coherence requirements for the architecture. To develop the model, we combined data gathered on the optimized prototype using tools like Pin and Valgrind with simulation-based measurements of cache behavior and instruction frequency. With this co-designed approach, we have shown that the raytracing-based Copernican universe of application and general visibility-centric 3D graphics is feasible. However, this work represents only a first cut at such a system design; there is more to explore in the application, architecture, and evaluation details.

Application: Razor is the first implementation of an aggressive software design incorporating many new ideas, some of which have worked better than others. With further iterative design, algorithm development, and performance tuning, we believe a ten-fold performance improvement in the software is possible. Non-rendering tasks in a game environment and, more generally, irregular applications map well to the architecture and require further exploration.

Architecture: There is potential for further architectural enhancements. First, the length of basic blocks is quite large and hence data-flow ISAs and/or greater SIMD width can provide higher efficiency. Second, ISA specialization, beyond SIMD specialization targeted at shading and texture computations, could provide significant performance improvements. Improving the memory system will be most effective, and 3D-integrated DRAM could significantly increase performance and reduce system power [17].
Physically scaling the architecture by varying the number of tiles, frequency, and voltage to meet power and area budgets provides a rich design space to be explored.

Evaluation: Our analytical model enables accurate performance projections and can even be used for sensitivity studies. In addition, it can be extended to accommodate other Copernican architectures, like Intel Larrabee [30].

Comparison to GPUs and Beyond Ray-tracing: The processor organization in Copernicus is fundamentally different from conventional GPUs, which provide a primitive memory system abstraction while deferring scene geometry management to the CPU. Architecturally, the hardware Z-buffer is replaced with a flexible memory system and a software spatial data structure for visibility tests. This support enables scene management and rendering in one single computational substrate. We believe GPUs are likely to evolve to such a model over time, potentially with a different implementation. For example, secondary rays could be hybridized with Z-buffer rendering. Our system is a particular point in the architecture design space that can support ray tracing as one of potentially several workloads.

8. Conclusions

Modern rendering systems live in a Ptolemaic Z-buffer universe that is beginning to pose several problems in providing significant visual quality improvements. We show that a Copernican universe centered around applications and sophisticated visibility algorithms with ray-tracing is possible and that the architecture and application challenges can be addressed through full system co-design. In this paper, we describe our system, called Copernicus, which includes several co-designed hardware and software innovations. Razor, the software component of Copernicus, is a highly parallel, multi-granular, locality-aware ray tracer.
The hardware architecture is a large-scale tiled multicore processor with private L2 caches, fine-grained ISA specialization tuned to the workload, multi-threading for hiding memory access latency, and limited (cluster-local) cache coherence. This organization represents a unique design point that favors data redundancy and recomputation over synchronization, thus easily scaling to hundreds of cores.

The methodology used for this work is of interest in its own right. We developed a novel evaluation methodology that combines software implementation and analysis on current hardware, architecture simulation of proposed hardware, and analytical performance modeling for the full hardware-software platform. Our results show that if projected improvements in software algorithms are obtained, we can sustain real-time raytracing on a future 240mm² chip at 22nm technology. The mechanisms and the architecture are not strictly limited to ray-tracing, as future systems that must execute irregular applications on large scale single-chip parallel processors are likely to have similar requirements.

Acknowledgment

We thank Paul Gratz, Boris Grot, Simha Sethumadhavan, the Vertical group, and the anonymous reviewers for comments, the Wisconsin Condor project and UW CSL for their assistance, and the Real-Time Graphics and Parallel Systems Group for benchmark scenes and for their prior work on Razor. Many thanks to Mark Hill for several valuable suggestions. Support for this research was provided by NSF CAREER award #0546236 and by Intel Corporation.

References

[1] C. Benthin, I. Wald, M. Scherbaum, and H. Friedrich, “Ray Tracing on the CELL Processor,” in Interactive Ray Tracing, 2006, pp. 15–23.
[2] J. Bigler, A. Stephens, and S. Parker, “Design for parallel interactive ray tracing systems,” in Interactive Ray Tracing, 2006, pp. 187–196.
[3] D. Brooks, P. Bose, V. Srinivasan, M. K. Gschwind, P. G. Emma, and M. G. Rosenfield, “New methodology for early-stage, microarchitecture-level power-performance analysis of microprocessors,” IBM J. Res. Dev., vol. 47, no. 5-6, pp. 653–670, 2003.