Sunday, January 15, 2017

Study: OpenCL programming

Memory in the GPU:

  • global memory: accessible by all work-items; used to transfer data between CPU and GPU
  • constant memory
  • local memory: faster, shared within a work-group
  • private memory
Relation between the OpenCL memory model and the AMD HD6970

Synchronization:
  • in kernel: barrier() (marker), e.g. barrier(CLK_LOCAL_MEM_FENCE)
  • between CPU and GPU: blocking calls (CL_TRUE), clFinish()
  • event: clWaitForEvents()
event.getProfilingInfo() for debugging
clSetEventCallback(): calls back a host function when the event happens
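
To make the in-kernel barrier from the list above more concrete, here is a minimal sketch in plain C (no OpenCL runtime) that models a work-group summing values in local memory, phase by phase. The function name workgroup_reduce_model and the power-of-two work-group size are assumptions for illustration; the inner loop only stands in for work-items that would really run in parallel.

```c
#include <assert.h>
#include <stddef.h>

/* Sequential model of a work-group tree reduction over "local memory".
 * n is the (assumed) work-group size and must be a power of two.
 * Each pass of the outer loop is one phase; in a real kernel a
 * barrier(CLK_LOCAL_MEM_FENCE) would end every phase. */
float workgroup_reduce_model(float *local_mem, size_t n)
{
    assert(n > 0 && (n & (n - 1)) == 0);   /* power-of-two sizes only */

    for (size_t stride = n / 2; stride > 0; stride /= 2) {
        /* The inner loop plays the role of all work-items in this phase:
         * work-item lid reads slot lid + stride and writes slot lid. */
        for (size_t lid = 0; lid < stride; ++lid) {
            local_mem[lid] += local_mem[lid + stride];
        }
        /* <- barrier(CLK_LOCAL_MEM_FENCE) would go here in the kernel */
    }
    return local_mem[0];   /* the sum ends up in slot 0 */
}
```

Without the barrier, a work-item in the next phase could read a slot that another work-item has not finished writing yet; the sequential model hides that problem because each phase completes before the next one starts.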

Native kernel: executes on the host. unboxing

Fence operation: makes sure memory reads/writes have completed
Atomic operations: atomic_add(), atomic_xchg()
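
Both atomics return the value that was stored before the operation. As a rough host-side analogy (not OpenCL kernel code), the same semantics can be sketched with C11 <stdatomic.h>; the helper names add_and_get_old and exchange_and_get_old are made up for this note.

```c
#include <stdatomic.h>

/* Host-side analogue of atomic_add(): add delta, get the old value back. */
int add_and_get_old(atomic_int *counter, int delta)
{
    return atomic_fetch_add(counter, delta);
}

/* Host-side analogue of atomic_xchg(): store new_value, get the old one back. */
int exchange_and_get_old(atomic_int *slot, int new_value)
{
    return atomic_exchange(slot, new_value);
}
```

The returned old value is what makes patterns like "each work-item claims a unique output slot" possible: every caller of the add sees a different previous count.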

For constant data, we can use clGetDeviceInfo() to query the device limits: CL_DEVICE_MAX_CONSTANT_ARGS (number of constant arguments) and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE (maximum size of a constant buffer).

One wavefront executes all of its work-items in lockstep, so a branch that splits the work-items inside a wavefront is very inefficient.
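
Why a divergent branch hurts can be sketched with a toy cost model in plain C. It assumes the usual lockstep behaviour: if at least one work-item needs a side of the if/else, the whole wavefront steps through that side with the other lanes masked off. The function name wavefront_branch_cost and the cycle counts are illustrative assumptions, not measured numbers.

```c
#include <stddef.h>

/* Cost (in arbitrary cycles) of one if/else executed by a wavefront.
 * takes_then[i] != 0 means lane i takes the "then" side. */
unsigned wavefront_branch_cost(const int *takes_then, size_t lanes,
                               unsigned cost_then, unsigned cost_else)
{
    int any_then = 0, any_else = 0;

    for (size_t i = 0; i < lanes; ++i) {
        if (takes_then[i]) {
            any_then = 1;
        } else {
            any_else = 1;
        }
    }
    /* Every side that at least one lane needs is paid for by the whole
     * wavefront; lanes on the other side just sit idle (masked). */
    return (any_then ? cost_then : 0u) + (any_else ? cost_else : 0u);
}
```

With costs of 10 and 30 cycles, a wavefront where every lane agrees pays 10 or 30, while a single disagreeing lane makes the whole wavefront pay 40.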

Memory access: channel and bank.
One wavefront should try to access one channel and bank (64KB); this is the most efficient pattern.

Profiler:
AMD: 
  • Accelerated Parallel Processing Profiler
  • Accelerated Parallel Processing Kernel Analyzer
  • gDEBugger
  • clGetEventProfilingInfo(): get event timing information
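
The timestamps from clGetEventProfilingInfo() (CL_PROFILING_COMMAND_START and CL_PROFILING_COMMAND_END) are device counter values in nanoseconds, so turning them into a duration is simple arithmetic. A small sketch in plain C, where the helper name elapsed_ms is my own and the two arguments are assumed to have been read from the event already:

```c
#include <assert.h>

/* start_ns / end_ns: values as returned for CL_PROFILING_COMMAND_START
 * and CL_PROFILING_COMMAND_END (both in nanoseconds). */
double elapsed_ms(unsigned long long start_ns, unsigned long long end_ns)
{
    assert(end_ns >= start_ns);                     /* END never precedes START */
    return (double)(end_ns - start_ns) / 1.0e6;     /* ns -> ms */
}
```

Note that the command queue has to be created with CL_QUEUE_PROFILING_ENABLE, otherwise the profiling query fails.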
