Memory in the GPU:
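- global memory: accessible by all work-items; used to transfer data between CPU and GPU
- constant memory: read-only inside a kernel
- local memory: faster, shared by the work-items of one work-group
- private memory: belongs to a single work-item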
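A minimal sketch of how the four memory spaces show up in kernel code. The kernel is kept as a host-side C string, the way it would be passed to clCreateProgramWithSource(). The names scale_and_reverse, kernel_src and local_tile_bytes are made up for illustration, and the kernel assumes the global size is a multiple of the work-group size.

```c
#include <stddef.h>

/* OpenCL C kernel source held as a host string. Kernel and argument
   names are hypothetical, chosen only to label the memory spaces. */
const char *kernel_src =
    "__kernel void scale_and_reverse(\n"
    "    __global const float *in,    /* global: visible to all work-items */\n"
    "    __global float *out,\n"
    "    __constant float *factor,    /* constant: read-only in the kernel */\n"
    "    __local float *tile)         /* local: shared by one work-group */\n"
    "{\n"
    "    size_t gid = get_global_id(0);\n"
    "    size_t lid = get_local_id(0);\n"
    "    size_t lsz = get_local_size(0);\n"
    "    float x = in[gid] * factor[0];   /* x is private to this work-item */\n"
    "    tile[lid] = x;\n"
    "    barrier(CLK_LOCAL_MEM_FENCE);    /* wait until the whole tile is written */\n"
    "    out[gid] = tile[lsz - 1 - lid];  /* read another work-item's value */\n"
    "}\n";

/* Bytes of local memory the host reserves for `tile`: one float per
   work-item. The host passes the size with a NULL pointer, e.g.
   clSetKernelArg(kernel, 3, local_tile_bytes(wg_size), NULL); */
size_t local_tile_bytes(size_t work_group_size) {
    return work_group_size * sizeof(float);
}
```

The barrier is needed because each work-item reads a tile slot written by a different work-item of the same work-group.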
Relation between the OpenCL memory model and the AMD HD6970
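Synchronization:
- in kernel: barrier(), e.g. barrier(CLK_LOCAL_MEM_FENCE); a marker that every work-item of the work-group must reach before any of them continues
- between CPU and GPU: pass CL_TRUE as the blocking flag (e.g. to clEnqueueReadBuffer()), or call clFinish()
- event: clWaitForEvents() blocks the host until the listed events complete
clSetEventCallback(): registers a host callback function that is called when the event reaches the given status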
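Native kernel: a regular C function, enqueued with clEnqueueNativeKernel(), that executes on the host. Its arguments arrive as one packed block and have to be unboxed inside the function.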
Fence operation: makes sure the memory reads/writes issued before the fence are completed before those issued after it, e.g. mem_fence()
Atomic operation: atomic_add(), atomic_xchg()
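atomic_add() and atomic_xchg() both return the old value stored at the address. Kernel code cannot run without a device, so the sketch below uses the C11 <stdatomic.h> functions as a host-side analogue. This is a substitute for illustration, not the OpenCL built-ins themselves, and the host_ names are hypothetical.

```c
#include <stdatomic.h>

/* Host-side analogue of OpenCL atomic_add(p, val):
   adds val to *p atomically and returns the OLD value. */
int host_atomic_add(atomic_int *p, int val) {
    return atomic_fetch_add(p, val);
}

/* Host-side analogue of OpenCL atomic_xchg(p, val):
   stores val in *p atomically and returns the OLD value. */
int host_atomic_xchg(atomic_int *p, int val) {
    return atomic_exchange(p, val);
}
```

Usage: starting from a counter of 5, host_atomic_add(&c, 3) returns 5 and leaves 8 in c; host_atomic_xchg(&c, 1) then returns 8 and leaves 1.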
For constant data, we can use clGetDeviceInfo() to query the limits of the device: CL_DEVICE_MAX_CONSTANT_ARGS (maximum number of __constant arguments) and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (maximum size of a constant buffer, in bytes)
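A small sketch of how the two limits might be used before creating constant buffers. The real queries appear only in the comment, since they need an OpenCL device. The helper constant_args_fit and its parameters are hypothetical.

```c
#include <stddef.h>
#include <stdint.h>

/* The limits would come from the device, for example:
 *   cl_ulong max_bytes; cl_uint max_args;
 *   clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE,
 *                   sizeof(max_bytes), &max_bytes, NULL);
 *   clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_ARGS,
 *                   sizeof(max_args), &max_args, NULL);
 *
 * Returns 1 if `count` constant buffers of the given sizes respect both
 * limits, 0 otherwise. */
int constant_args_fit(const size_t *arg_bytes, unsigned count,
                      uint64_t max_buffer_bytes, unsigned max_args) {
    if (count > max_args)
        return 0;                       /* too many __constant arguments */
    for (unsigned i = 0; i < count; i++) {
        if ((uint64_t)arg_bytes[i] > max_buffer_bytes)
            return 0;                   /* one buffer is too large */
    }
    return 1;
}
```

If the check fails, the data can be placed in global memory instead.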
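A wavefront executes the same instruction on all of its work-items, so a branch that diverges inside a wavefront is very inefficient: the wavefront has to run both paths, with the inactive work-items masked off.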
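A toy cost model of this, not a description of the real hardware scheduler: given the branch condition of each work-item in a wavefront, it counts how many branch paths the wavefront has to step through. WAVEFRONT_SIZE of 64 is the usual AMD width; the function name paths_executed is made up.

```c
#include <stddef.h>

#define WAVEFRONT_SIZE 64   /* usual wavefront width on AMD GPUs */

/* cond[i] != 0 means work-item i takes the `if` path, otherwise the
   `else` path. Returns 1 when all work-items agree (one path runs) and
   2 when they diverge (both paths run, each with some lanes masked). */
int paths_executed(const int *cond, size_t n) {
    int any_true = 0, any_false = 0;
    for (size_t i = 0; i < n; i++) {
        if (cond[i])
            any_true = 1;
        else
            any_false = 1;
    }
    return any_true + any_false;
}
```

A condition such as get_local_id(0) % 2 diverges in every wavefront, while a condition that is constant across each group of 64 consecutive work-items does not.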
Memory access: channel and bank.
Each wavefront should try to keep its accesses on one channel and bank (64KB); this is the most efficient pattern.
Profiler:
AMD:
- Accelerated Parallel Processing Profiler
- Accelerated Parallel Processing Kernel Analyzer
- gDEBugger
- clGetEventProfilingInfo(): get event timing information
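clGetEventProfilingInfo() reports timestamps as 64-bit counters in nanoseconds. A minimal sketch of turning two of them into an elapsed time follows. The real calls are shown only in the comment because they need a device and a queue created with CL_QUEUE_PROFILING_ENABLE. The helper name elapsed_ms is hypothetical.

```c
#include <stdint.h>

/* The two timestamps would be read from a completed event, for example:
 *   cl_ulong start, end;
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START,
 *                           sizeof(start), &start, NULL);
 *   clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END,
 *                           sizeof(end), &end, NULL);
 *
 * Converts the nanosecond interval to milliseconds. */
double elapsed_ms(uint64_t start_ns, uint64_t end_ns) {
    return (double)(end_ns - start_ns) / 1.0e6;
}
```

Usage: elapsed_ms(start, end) after clWaitForEvents() gives the kernel execution time on the device, without the time the command spent waiting in the queue.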