GPU and host memories are typically disjoint, requiring explicit (or implicit, depending on the development platform) data transfer between the two.
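As a rough sketch of what such an explicit transfer looks like with the CUDA runtime API (the buffer size and the kernel-free round trip are only illustrative):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    const int N = 16;                 // illustrative buffer size
    float host_in[N], host_out[N];
    for (int i = 0; i < N; ++i) host_in[i] = (float)i;

    float *dev_buf = NULL;
    cudaMalloc((void**)&dev_buf, N * sizeof(float));   // allocate device memory
    cudaMemcpy(dev_buf, host_in, N * sizeof(float),
               cudaMemcpyHostToDevice);                // host -> device
    // a kernel would normally operate on dev_buf here
    cudaMemcpy(host_out, dev_buf, N * sizeof(float),
               cudaMemcpyDeviceToHost);                // device -> host
    cudaFree(dev_buf);

    printf("last element after round trip: %f\n", host_out[N - 1]);
    return 0;
}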
CUDA (Compute Unified Device Architecture): provides two sets of APIs (a low-level and a higher-level one) and is freely available for Windows, Mac OS X, and Linux operating systems. Although it can be considered too verbose, for example requiring explicit memory transfers between the host and the GPU, it is the basis for the implementation of higher-level third-party APIs and libraries, as explained below.
OpenCL (Open Computing Language): supported by both Nvidia and AMD, and the primary development platform for AMD GPUs. OpenCL's programming model closely matches the one offered by CUDA.
CUDA:
Supports heterogeneous computation, where applications use both the CPU and the GPU: serial portions of an application run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be applied incrementally to existing applications. The CPU and GPU are treated as separate devices with their own memory spaces, which also allows simultaneous computation on the CPU and the GPU without contention for memory resources.
In order to properly utilize a GPU, the program must be decomposed into a large number of threads that can run concurrently. GPU schedulers can execute these threads with minimum switching overhead and under a variety of configurations based on the available device capabilities.
Threads are organized in a hierarchy of two levels, as shown in Figure 6.3. At the lower level, threads are organized in blocks that can be of one, two or three dimensions. Blocks are then organized in grids of one, two, or three dimensions. The sizes of the blocks and grids are limited by the capabilities of the target device.
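For instance, multi-dimensional blocks and grids are described with the dim3 type when launching a kernel (the <<<...>>> launch syntax is explained just below); the kernel name and the 16x16 / 64x64 sizes here are only illustrative:

#include <cuda_runtime.h>

// Hypothetical kernel: each thread would handle one (x, y) element of a tile
__global__ void tile_kernel()
{
}

int main()
{
    dim3 block(16, 16);              // 16x16 = 256 threads per block
    dim3 grid(64, 64);               // 64x64 = 4096 blocks in the grid
    tile_kernel<<<grid, block>>>();  // 2D grid of 2D blocks
    cudaDeviceSynchronize();
    return 0;
}

The chosen block and grid sizes must stay within the limits of the target device (for example, the maximum number of threads per block).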
Every thread in every block runs the same kernel function, and all of them run it at the same time. For example, the launch
hello<<<2,10>>>();
means that hello() will run 20 times in parallel, once per thread (2 blocks of 10 threads each).
An example:
#include <cstdio>

// Kernel: runs on the device, one copy per thread
__global__ void hello()
{
    printf("hello world\n");
}

int main()
{
    hello<<<1,10>>>();        // launch 1 block of 10 threads
    cudaDeviceSynchronize();  // wait for the kernel to finish
    return 0;
}
It will display "hello world" 10 times. cudaDeviceSynchronize() is a barrier: it waits for the kernel to finish before the host continues. The function hello() executes on the device, although it is called from the host.
Each kernel launch has two levels of indexing, grid and block, so inside the kernel every thread knows its block index within the grid and its thread index within the block.
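As a sketch of how those indices are typically used, the kernel below combines them into a global thread index; the launch mirrors the 2-block, 10-thread configuration shown earlier:

#include <cstdio>

__global__ void print_index()
{
    // blockIdx.x: the block's position in the grid
    // threadIdx.x: the thread's position within its block
    // blockDim.x: the number of threads per block
    int global_id = blockIdx.x * blockDim.x + threadIdx.x;
    printf("block %d, thread %d, global %d\n",
           blockIdx.x, threadIdx.x, global_id);
}

int main()
{
    print_index<<<2,10>>>();
    cudaDeviceSynchronize();
    return 0;
}

This prints the global indices 0 through 19, one per thread, in no particular order.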
What's a warp?
On a GPU, the same kernel instruction executes on many processing units at once (SP: stream processor). A collection of SPs under the same controller is an SM: streaming multiprocessor.
One GPU contains multiple SMs, and each SM executes its own thread blocks. In Nvidia terms, one SP is one CUDA core, and Nvidia calls this execution model Single-Instruction, Multiple Threads (SIMT).
I think of one block as living on one SM: a block is assigned to a single SM, although one SM can run several blocks at once.
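If you want to see how a particular device is laid out, the CUDA runtime can report the number of SMs, the warp size, and other limits; a minimal query might look like this (device 0 is just an assumption):

#include <cstdio>
#include <cuda_runtime.h>

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);   // properties of device 0
    printf("SM count:              %d\n", prop.multiProcessorCount);
    printf("warp size:             %d threads\n", prop.warpSize);
    printf("max threads per block: %d\n", prop.maxThreadsPerBlock);
    return 0;
}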
Warp: the threads in a block do not all run concurrently, though. Instead, they are executed in groups called warps (32 threads per warp on Nvidia GPUs).
Threads in a warp may execute as one, but they operate on different data. So, what happens if the result of a conditional operation leads them to different paths? The answer is that all the divergent paths are evaluated (if threads branch into them) in sequence until the paths merge again. The threads that do not follow the path currently being executed are stalled.
See the example below:
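This is only a minimal sketch, using an illustrative even/odd branch on the thread index; within a warp, both paths end up being executed one after the other.

#include <cstdio>

__global__ void divergent()
{
    // Threads in the same warp take different paths depending on their index.
    // The warp executes the even path and then the odd path; threads that are
    // not on the path currently being executed are stalled.
    if (threadIdx.x % 2 == 0)
        printf("thread %d: even path\n", threadIdx.x);
    else
        printf("thread %d: odd path\n", threadIdx.x);
}

int main()
{
    divergent<<<1,32>>>();      // one warp's worth of threads
    cudaDeviceSynchronize();
    return 0;
}

Because even and odd threads are interleaved within the warp, both branches are serialized here, so the divergent section takes roughly twice as long as a branch-free version would.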