12.07.2015 Views

Toward A Multicore Architecture for Real-time Ray-tracing


| Core type | Ideal perf., ave. (millions rays/sec) | Ideal perf., range | Realistic perf., ave. (millions rays/sec) | Realistic perf., range | #Tiles | #Cores |
|---|---|---|---|---|---|---|
| 1-issue, no SMT | 37 | 16-62 | 42 | 19-66 | 16 | 128 |
| 1-issue, 4 threads | 82 | 33-138 | 43 | 13-74 | 16 | 115 |
| 2-issue, no SMT | 34 | 15-57 | 26 | 12-41 | 10 | 78 |
| 2-issue, 4 threads | 82 | 36-137 | 41 | 13-70 | 8 | 65 |

Table 5. Full system performance and area for all scenes. We recommend 1-issue, 4 threads as the best configuration.

bottleneck and not the network.

Full system performance: With the number of memory banks specified, our full system model can make performance projections for different core configurations and scenes, which are shown in Table 5 as an average and range across our five benchmark scenes. Columns two and three show “ideal” performance achieved by naively scaling the performance for one tile to 16 tiles (or as many as will fit), ignoring contention for shared resources. Columns four and five show realistic performance with the 4-DIMM memory. Our conservative memory model projects that we can sustain between 13 and 74 million rays/second, and the ideal model suggests 33 to 138 million rays/second.

Full system area analysis: Using die-photos of other processors we built an area model for the tiles and the full chip. For overheads of multi-threading, we use methodology from academic and industry papers on area scaling [19]. The area for our baseline 1-issue core is derived from a Niagara2 core scaled to 22nm, and we assume a 12% area overhead for multi-threading for 2 threads, and an additional 5% per thread. In our area analysis, the additional area of the 4-wide FPU compared to Niagara2’s FPU is ignored. We assume a 110% overhead when going from 1-issue to 2-issue, since in this model we also double the number of load/store ports and hence the L1 cache sizes.
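The overhead figures in the area analysis can be turned into a small calculation. The sketch below is illustrative only and is not the authors' area model: the function names are hypothetical, "an additional 5% per thread" is read as 5% for each thread beyond the second, and the issue-width and multi-threading overheads are assumed to combine multiplicatively. It models per-core area relative to the baseline 1-issue, single-threaded core, so it is not expected to reproduce the core counts in Table 5.

```python
def smt_area_overhead(threads: int) -> float:
    """Fractional area overhead of multi-threading (assumed reading of the text).

    1 thread: no overhead; 2 threads: 12%; each thread beyond 2: +5%.
    """
    if threads < 1:
        raise ValueError("threads must be >= 1")
    if threads == 1:
        return 0.0
    return 0.12 + 0.05 * (threads - 2)


def relative_core_area(issue_width: int, threads: int) -> float:
    """Core area relative to the baseline 1-issue, single-threaded core.

    Assumption: the 110% overhead for 2-issue and the SMT overhead
    multiply; the text does not say how they combine.
    """
    if issue_width not in (1, 2):
        raise ValueError("only 1-issue and 2-issue cores are described")
    issue_factor = 1.0 if issue_width == 1 else 1.0 + 1.10
    return issue_factor * (1.0 + smt_area_overhead(threads))


if __name__ == "__main__":
    configs = {
        "1-issue, no SMT": (1, 1),
        "1-issue, 4 threads": (1, 4),
        "2-issue, no SMT": (2, 1),
        "2-issue, 4 threads": (2, 4),
    }
    for name, (issue_width, threads) in configs.items():
        area = relative_core_area(issue_width, threads)
        print(f"{name}: {area:.3f}x baseline core area")
```

Under these assumptions a 4-thread core carries a 22% multi-threading overhead, and a 2-issue core is 2.1 times the baseline before any multi-threading is added.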
The cache size is fixed at 2MB for all designs and this includes the cross-bar area between the cores and the cache. Columns 6 and 7 in Table 5 show the number of tiles and cores of each type that will fit on a 240 mm² chip. With single-issue cores, we can easily fit 16 tiles, and about half that with two-issue cores.

Summary: Overall, going to dual issue does not provide significantly higher performance and single-issue, 4-way multithreaded cores seem ideal. Single-issue cores without SMT seem to match the 4-way SMT cores at the chip-level, because of insufficient memory bandwidth to feed the threads. At the chip-level, memory bandwidth is the primary bottleneck. While our final performance results do not outright exceed the required 100 million rays/second for every scene, the flexibility and potential for further architectural optimizations show this is a viable system.

6. Related work

Recently, a few application-driven architectures have been proposed. Yeh et al. have proposed ParallAX, an architecture specialized for real-time physics processing [42]. Hughes et al. recently proposed a 64-core CMP for animation and visual effects [10]. Kumar et al. have proposed an architecture called Carbon that can support fine-grained parallelism for large scale chip-multiprocessors [14]. Yang et al. describe a scalable streaming processor targeted at scientific applications [41]. Clark et al. have proposed a technique called liquid SIMD that can abstract away the details of the data-parallel hardware [4]. Sankaralingam et al. developed a methodical approach for characterizing data parallel applications and proposed a set of universal mechanisms for data parallelism [29]. All these designs follow the flow of workload characterization of an existing application driving the design of an architecture. Our work is different in two respects.
First, we co-design future challenge applications and the architecture to meet the application performance requirements. Second, we provide a more general purpose architecture and new quantitative tools to support the design process. Embedded systems use similar hardware/software co-design, but for building specialized processors and at a much smaller scale. Architecture-specific analytical models have been applied for processor pipelines to analyze performance and power [22], [37], [13], [34], [43], [8], [27]. These models have also been used for design space exploration and rapid simulation [12], [11], [16], [3], [6], [24].

7. Discussion and Future work

Building efficient systems requires that the software and hardware be designed for each other. Deviating both hardware and software architectures from existing designs poses a particularly challenging co-design problem. The focus of this paper has been the co-design of the software, architecture, and evaluation strategy for one such system: a ray-tracing based real-time graphics platform that is fundamentally different from today’s Z-buffer based graphics systems.

Our software component (Razor) was designed to allow graphics and scene behavior to be exploited by hardware via fine-grained parallelism, locality in graph traversal, good behavior of secondary rays, and slow growth in memory by virtue of kd-tree duplication. To solve the application/architecture “chicken-and-egg” problem, we implemented Razor on existing hardware, in our case SSE-accelerated x86 cores, and built a prototype system using available technology with effectively eight cores.
This prototype allowed us to perform detailed application characterization on real scenes, something impossible on simulation and meaningless without an optimized application.

Closing the development loop with design, evaluation and analysis of software’s behavior on our proposed hardware architecture was accomplished via a novel analytical model that provided intuition both for the architecture and the application. For example, an important application design
