ATI STREAM COMPUTING

minimized. Note that while only a single clFinish() is executed at the end of the timing run, the two kernels are always linked using an event to ensure serial execution.

The bandwidth is expressed as “number of input bytes processed.” For high-end graphics cards, the bandwidth of this algorithm should be an order of magnitude higher than that of the CPU, due to the parallelized memory subsystem of the graphics card.

7. The results are then checked against the comparison value. This also establishes that the result is the same on both CPU and GPU, which can serve as the first verification test for newly written kernel code.

8. Note the use of the debug buffer to obtain some runtime variables. Debug buffers also can be used to create short execution traces for each thread, assuming the device has enough memory.

Kernel Code –

9. The code uses four-component vectors (uint4) so the compiler can identify concurrent execution paths as often as possible. On the GPU, this can be used to further optimize memory accesses and distribution across ALUs. On the CPU, it can be used to enable SSE-like execution.

10. The kernel sets up a memory access pattern based on the device. For the CPU, the source buffer is chopped into contiguous buffers: one per thread. Each CPU thread serially walks through its buffer portion, which results in good cache and prefetch behavior for each core.

    On the GPU, each thread walks the source buffer using a stride of the total number of threads. As many threads are executed in parallel, the result is a maximally coalesced memory pattern requested from the memory back-end. For example, if each compute unit has 16 physical processors, 16 uint4 requests are produced in parallel, per clock, for a total of 256 bytes per clock.

11. The kernel code uses a reduction consisting of three stages: __global to __private, __private to __local, which is flushed to __global, and finally __global to __global.
    In the first loop, each thread walks __global memory and reduces all values into a min value in __private memory (typically, a register). This is the bulk of the work, and is mainly bound by __global memory bandwidth. The subsequent reduction stages are brief in comparison.

12. Next, all per-thread minimum values inside the work-group are reduced to a __local value, using an atomic operation. Access to the __local value is serialized; however, the number of these operations is very small compared to the work of the previous reduction stage. The threads within a work-group are synchronized through a local barrier(). The reduced min value is stored in __global memory.

13. After all work-groups are finished, a second kernel reduces all work-group values into a single value in __global memory, using an atomic operation. This is a minor contributor to the overall runtime.

1-28 Chapter 1: OpenCL Architecture and the ATI Stream Computing System
Copyright © 2010 Advanced Micro Devices, Inc. All rights reserved.
Example Code 3 –

//
// Copyright (c) 2010 Advanced Micro Devices, Inc. All rights reserved.
//

#include <stdio.h>
#include <stdlib.h>
#include <time.h>
#include <CL/cl.h>
#include "Timer.h"

#define NDEVS 2

// A parallel min() kernel that works well on CPU and GPU

const char *kernel_source =
"                                                                        \n"
"#pragma OPENCL EXTENSION cl_khr_local_int32_extended_atomics : enable   \n"
"#pragma OPENCL EXTENSION cl_khr_global_int32_extended_atomics : enable  \n"
"                                                                        \n"
"    // 9. The source buffer is accessed as 4-vectors.                   \n"
"                                                                        \n"
"__kernel void minp( __global uint4 *src,                                \n"
"                    __global uint  *gmin,                               \n"
"                    __local  uint  *lmin,                               \n"
"                    __global uint  *dbg,                                \n"
"                    uint            nitems,                             \n"
"                    uint            dev )                               \n"
"{                                                                       \n"
"    // 10. Set up __global memory access pattern.                       \n"
"                                                                        \n"
"    uint count  = ( nitems / 4 ) / get_global_size(0);                  \n"
"    uint idx    = (dev == 0) ? get_global_id(0) * count                 \n"
"                             : get_global_id(0);                        \n"
"    uint stride = (dev == 0) ? 1 : get_global_size(0);                  \n"
"    uint pmin   = (uint) -1;                                            \n"
"                                                                        \n"
"    // 11. First, compute private min, for this work-item.              \n"
"                                                                        \n"
"    for( int n=0; n < count; n++, idx += stride )                       \n"
"    {                                                                   \n"
"        pmin = min( pmin, src[idx].x );                                 \n"
"        pmin = min( pmin, src[idx].y );                                 \n"
"        pmin = min( pmin, src[idx].z );                                 \n"
"        pmin = min( pmin, src[idx].w );                                 \n"
"    }                                                                   \n"
"                                                                        \n"
"    // 12. Reduce min values inside work-group.                         \n"
"                                                                        \n"
"    if( get_local_id(0) == 0 )                                          \n"
"        lmin[0] = (uint) -1;                                            \n"
"                                                                        \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                                     \n"
"                                                                        \n"
"    (void) atom_min( lmin, pmin );                                      \n"
"                                                                        \n"
"    barrier( CLK_LOCAL_MEM_FENCE );                                     \n"
"                                                                        \n"
"    // Write out to __global.                                           \n"
"                                                                        \n"
"    if( get_local_id(0) == 0 )                                          \n"
"        gmin[ get_group_id(0) ] = lmin[0];                              \n"

1.9 Example Programs 1-29