Zero copy is a feature that was added in version 2.2 of the CUDA Toolkit. It enables GPU threads to directly access host memory. For this purpose, it requires mapped pinned (non-pageable) memory. On integrated GPUs (e.g., mobile GPUs for notebooks), mapped pinned memory is always a performance gain because it avoids superfluous copies as integrated GPU and CPU memory are physically the same. On discrete GPUs, mapped pinned memory is advantageous only in certain cases. Because the data is not cached on the GPU on devices of compute capability 1.x, mapped pinned memory should be read or written only once, and the global loads and stores that read and write the memory should be coalesced. Zero copy can be used in place of streams because kernel-originated data transfers automatically overlap kernel execution without the overhead of setting up and determining the optimal number of streams.
The host code in Zero-copy host code shows how zero copy is typically set up.
float *a_h, *a_map; … cudaGetDeviceProperties(&prop, 0); if (!prop.canMapHostMemory) exit(0); cudaSetDeviceFlags(cudaDeviceMapHost); cudaHostAlloc(&a_h, nBytes, cudaHostAllocMapped); cudaHostGetDevicePointer(&a_map, a_h, 0); kernel<<<gridSize, blockSize>>>(a_map);
In this code, the canMapHostMemory field of the structure returned by cudaGetDeviceProperties() is used to check that the device supports mapping host memory to the device’s address space. Page-locked memory mapping is enabled by calling cudaSetDeviceFlags() with cudaDeviceMapHost. Note that cudaSetDeviceFlags() must be called prior to setting a device or making a CUDA call that requires state (that is, essentially, before a context is created). Page-locked mapped host memory is allocated using cudaHostAlloc(), and the pointer to the mapped device address space is obtained via the function cudaHostGetDevicePointer(). In the code in Zero-copy host code, kernel() can reference the mapped pinned host memory using the pointer a_map in exactly the same was as it would if a_map referred to a location in device memory.