ATI Stream Computing OpenCL Programming Guide - CiteSeerX

ATI STREAM COMPUTING
Chapter 1: OpenCL Architecture and the ATI Stream Computing System
Copyright © 2010 Advanced Micro Devices, Inc. All rights reserved.

1.5.3 DMA Transfers

Direct Memory Access (DMA) memory transfers can be executed separately from the command queue using the DMA engine on the GPU compute device. DMA calls are executed immediately, and the order of DMA calls and command queue flushes is guaranteed.

DMA transfers can occur asynchronously. This means that a DMA transfer is executed concurrently with other system or GPU compute device operations. However, data is not guaranteed to be ready until the DMA engine signals that the event or transfer is completed. The application can query the hardware for DMA event completion. If used carefully, DMA transfers are another source of parallelization.

1.6 GPU Compute Device Scheduling

GPU compute devices are very efficient at parallelizing large numbers of work-items in a manner transparent to the application. Each GPU compute device uses a large number of wavefronts to hide memory access latencies by having the resource scheduler switch the active wavefront in a given compute unit whenever the current wavefront is waiting for a memory access to complete. Hiding memory access latencies requires that each work-item contain a large number of ALU operations per memory load/store.

Figure 1.9 shows the timing of a simplified execution of work-items in a single stream core. At time 0, the work-items are queued and waiting for execution. In this example, only four work-items (T0…T3) are scheduled for the compute unit. The hardware limit for the number of active work-items depends on the resource usage (such as the number of active registers used) of the program being executed. An optimally programmed GPU compute device typically has thousands of active work-items.
[Figure 1.9: Simplified Execution Of Work-Items On A Single Stream Core. Rows show work-items T0 to T3; the horizontal axis shows time in cycles from 0 to 80; each work-item is marked as executing, ready (not executing), or stalled.]

At runtime, work-item T0 executes until cycle 20; at this time, a stall occurs due to a memory fetch request. The scheduler then begins execution of the next work-item, T1. Work-item T1 executes until it stalls or completes. New work-items execute, and the process continues until the available number of active work-items is reached. The scheduler then returns to the first work-item, T0.

If the data that work-item T0 is waiting for has returned from memory, T0 continues execution. In the example in Figure 1.9, the data is ready, so T0 continues. Since there were enough work-items and processing element operations to cover the long memory latencies, the stream core does not idle. This method of memory latency hiding helps the GPU compute device achieve maximum performance.

If none of T0 to T3 are runnable, the stream core waits (stalls) until one of T0 to T3 is ready to execute. In the example shown in Figure 1.10, T0 is the first to continue execution.
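The latency-hiding behavior described above can be explored with a small model. The following Python sketch is an illustration only, not AMD's scheduler: the function name simulate_core, the round-robin selection policy, and the 60-cycle memory latency are assumptions introduced here. Only the 20-cycle execution burst and the four work-items (T0 to T3) come from the text. Each work-item runs a fixed number of ALU bursts separated by memory fetches, and the core idles only when no work-item is runnable.

```python
def simulate_core(num_items, alu_cycles, mem_latency, bursts):
    """Toy model of one stream core (hypothetical, for illustration).

    Each work-item runs `bursts` ALU bursts of `alu_cycles` cycles,
    with a memory fetch of `mem_latency` cycles between bursts.
    Returns (total_cycles, busy_cycles).
    """
    ready_at = [0] * num_items        # cycle at which each work-item may run again
    remaining = [bursts] * num_items  # ALU bursts left for each work-item
    now = 0                           # current cycle
    busy = 0                          # cycles spent executing ALU work
    current = 0                       # round-robin pointer (assumed policy)

    while any(remaining):
        # Find the next runnable work-item, starting at the pointer.
        chosen = None
        for offset in range(num_items):
            i = (current + offset) % num_items
            if remaining[i] and ready_at[i] <= now:
                chosen = i
                break

        if chosen is None:
            # No work-item is runnable: the core idles until the
            # earliest outstanding memory fetch returns.
            now = min(ready_at[i] for i in range(num_items) if remaining[i])
            continue

        # Execute one ALU burst, then stall this work-item on a fetch.
        now += alu_cycles
        busy += alu_cycles
        remaining[chosen] -= 1
        ready_at[chosen] = now + mem_latency
        current = (chosen + 1) % num_items

    return now, busy


if __name__ == "__main__":
    # 20-cycle bursts as in Figure 1.9; the 60-cycle latency is assumed.
    for n in (1, 2, 4):
        total, busy = simulate_core(n, alu_cycles=20, mem_latency=60, bursts=4)
        print(f"{n} work-item(s): {busy}/{total} cycles busy")
```

Under these assumed numbers, a single work-item keeps the core busy for 80 of 260 cycles, two work-items for 160 of 280 cycles, and four work-items for all 320 cycles. Four 20-cycle bursts are enough to cover one 60-cycle fetch, which is the situation the text describes when the stream core does not idle.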