NVIDIA CUDA Compute Unified Device Architecture

    /* Pack the kernel arguments back to back; offset tracks the running byte position. */
    int offset = 0;
    int i;
    cuParamSeti(cuFunction, offset, i);
    offset += sizeof(i);
    float f;
    cuParamSetf(cuFunction, offset, f);
    offset += sizeof(f);
    char data[32];
    cuParamSetv(cuFunction, offset, (void*)data, sizeof(data));
    offset += sizeof(data);
    cuParamSetSize(cuFunction, offset);   /* total size of the argument block */
    cuFuncSetSharedSize(cuFunction, numElements * sizeof(float));
    cuLaunchGrid(cuFunction, gridWidth, gridHeight);

4.5.3.6 Memory Management

Linear memory is allocated with cuMemAlloc() or cuMemAllocPitch() and freed with cuMemFree().

The following code sample allocates an array of 256 floating-point elements in linear memory:

    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, 256 * sizeof(float));

cuMemAllocPitch() is recommended for allocating 2D arrays. It ensures that the allocation is padded appropriately to meet the alignment requirements described in Section 5.1.2.1, and therefore gives the best performance when accessing row addresses or when copying between 2D arrays and other regions of device memory (using cuMemcpy2D()). The returned pitch (or stride) must be used to access array elements. The following code sample allocates a width x height 2D array of floating-point values and shows how to loop over the array elements in device code:

    // host code
    CUdeviceptr devPtr;
    unsigned int pitch;   // size_t in later CUDA releases
    cuMemAllocPitch(&devPtr, &pitch,
                    width * sizeof(float), height, 4);
    cuModuleGetFunction(&cuFunction, cuModule, "myKernel");
    cuFuncSetBlockShape(cuFunction, 512, 1, 1);
    cuParamSeti(cuFunction, 0, devPtr);
    cuParamSetSize(cuFunction, sizeof(devPtr));
    cuLaunchGrid(cuFunction, 100, 1);

    // device code
    // width, height, and pitch are assumed to be visible to the kernel,
    // for example as extra parameters or __constant__ variables.
    __global__ void myKernel(float* devPtr)
    {
        for (int r = 0; r < height; ++r) {
            // Step from row to row by pitch bytes, not by width elements.
            float* row = (float*)((char*)devPtr + r * pitch);
            for (int c = 0; c < width; ++c) {
                float element = row[c];
            }
        }
    }

CUDA arrays are created with cuArrayCreate() and destroyed with cuArrayDestroy().

The following code sample allocates a width x height CUDA array with one 32-bit floating-point component:

    CUDA_ARRAY_DESCRIPTOR desc;
    desc.Format = CU_AD_FORMAT_FLOAT;
    desc.NumChannels = 1;
    desc.Width = width;
    desc.Height = height;
    CUarray cuArray;
    cuArrayCreate(&cuArray, &desc);

The reference manual lists all the functions used to copy memory between linear memory allocated with cuMemAlloc(), linear memory allocated with cuMemAllocPitch(), and CUDA arrays. The following code sample copies the 2D array to the CUDA array allocated in the previous code samples:

    CUDA_MEMCPY2D copyParam;
    memset(&copyParam, 0, sizeof(copyParam));
    copyParam.dstMemoryType = CU_MEMORYTYPE_ARRAY;
    copyParam.dstArray = cuArray;
    copyParam.srcMemoryType = CU_MEMORYTYPE_DEVICE;
    copyParam.srcDevice = devPtr;
    copyParam.srcPitch = pitch;
    copyParam.WidthInBytes = width * sizeof(float);
    copyParam.Height = height;
    cuMemcpy2D(&copyParam);

The following code sample copies a host memory array to device memory:

    float data[256];
    int size = sizeof(data);
    CUdeviceptr devPtr;
    cuMemAlloc(&devPtr, size);
    cuMemcpyHtoD(devPtr, data, size);

4.5.3.7 Stream Management

The following code sample creates two streams:

    CUstream stream[2];
    for (int i = 0; i < 2; ++i)
        cuStreamCreate(&stream[i], 0);

Each of these streams is defined by the following code sample as a sequence of one memory copy from host to device, one kernel launch, and one memory copy from device to host:

    for (int i = 0; i < 2; ++i)
        cuMemcpyHtoDAsync(inputDevPtr + i * size,
                          hostPtr + i * size, size, stream[i]);
    for (int i = 0; i < 2; ++i) {
        cuFuncSetBlockShape(cuFunction, 512, 1, 1);
        int offset = 0;
        cuParamSeti(cuFunction, offset, outputDevPtr);
        offset += sizeof(int);
        cuParamSeti(cuFunction, offset, inputDevPtr);
        offset += sizeof(int);
        cuParamSeti(cuFunction, offset, size);
        offset += sizeof(int);
        cuParamSetSize(cuFunction, offset);
        cuLaunchGridAsync(cuFunction, 100, 1, stream[i]);
    }
    for (int i = 0; i < 2; ++i)
        cuMemcpyDtoHAsync(hostPtr + i * size,
                          outputDevPtr + i * size, size, stream[i]);
    cuCtxSynchronize();

Each stream copies its portion of the input array hostPtr to the array inputDevPtr in device memory, processes inputDevPtr on the device by calling cuFunction, and copies the result outputDevPtr back to the same portion of hostPtr. Processing hostPtr with two streams allows the memory copies of one stream to overlap with the kernel execution of the other stream. For any overlap to occur, hostPtr must point to page-locked host memory:

    float* hostPtr;
    cuMemAllocHost((void**)&hostPtr, 2 * size);

cuCtxSynchronize() is called at the end to make sure that all streams have finished before processing continues. cuStreamSynchronize() can be used to synchronize the host with a specific stream, allowing other streams to continue executing on the device.

4.5.3.8 Event Management

The following code sample creates two events:

    CUevent start, stop;
    cuEventCreate(&start, 0);   /* second argument: flags, 0 for the default */
    cuEventCreate(&stop, 0);

These events can be used to time the code sample of the previous section as follows:

    cuEventRecord(start, 0);
    for (int i = 0; i < 2; ++i)
        cuMemcpyHtoDAsync(inputDevPtr + i * size, hostPtr + i * size,
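As a host-only illustration of the pitch arithmetic used by myKernel in section 4.5.3.6, the following minimal sketch shows why rows must be stepped by the pitch in bytes and not by the row width in elements. The names padded_pitch and pitched_elem are hypothetical helpers introduced here for illustration; they are not part of the CUDA driver API. The real pitch is whatever cuMemAllocPitch() returns, and the round-up rule below is only an assumption about how such padding could be computed.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: round a row width in bytes up to the next
   multiple of `alignment`. This only mimics the idea of row padding;
   the actual value is chosen by cuMemAllocPitch(). */
static size_t padded_pitch(size_t width_bytes, size_t alignment)
{
    assert(alignment > 0);
    return (width_bytes + alignment - 1) / alignment * alignment;
}

/* Hypothetical helper: address of element (r, c) in a pitched 2D float
   array. The row start is found by byte arithmetic on the base pointer,
   the same way myKernel computes `row`. `pitch` must be a multiple of
   sizeof(float) so that every row stays aligned. */
static float *pitched_elem(void *base, size_t pitch, size_t r, size_t c)
{
    assert(pitch % sizeof(float) == 0);
    float *row = (float *)((char *)base + r * pitch);
    return &row[c];
}
```

For example, with an assumed 64-byte alignment, a row of 10 floats (40 bytes) would occupy a 64-byte pitch, so element (1, 3) sits 64 + 3 * sizeof(float) bytes from the base and not 13 * sizeof(float) bytes as it would in a tightly packed array.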
