Sunday, January 15, 2017

study: OpenCL programming

Memory in the GPU:

  • global memory: all work-items can access it; data transits between CPU and GPU through it
  • constant memory: read-only for the kernel
  • local memory: faster, shared within a work-group
  • private memory: per work-item
Relation between the OpenCL memory model and the AMD HD 6970
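For example, a minimal kernel sketch (the kernel name and arguments are hypothetical) showing where the four address spaces appear in OpenCL C:

__kernel void scale(__global float *out,        /* global memory */
                    __constant float *coeff,    /* constant memory */
                    __local float *tmp)         /* local memory, one copy per work-group */
{
    int gid = get_global_id(0);
    int lid = get_local_id(0);
    float x = out[gid];                         /* x lives in private memory */
    tmp[lid] = x * coeff[0];
    barrier(CLK_LOCAL_MEM_FENCE);               /* sync within the work-group */
    out[gid] = tmp[lid];
}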

Synchronization:
  • in kernel: barrier() is a sync marker, e.g., barrier(CLK_LOCAL_MEM_FENCE)
  • between CPU and GPU: blocking calls (CL_TRUE), clFinish()
  • events: clWaitForEvents()
event.getProfilingInfo() returns event timing information for debugging/profiling.
clSetEventCallback() registers a host function that is called when the event completes.
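A minimal host-side sketch combining both, assuming a command queue created with CL_QUEUE_PROFILING_ENABLE and an already-built kernel (time_kernel and its arguments are illustrative):

#include <stdio.h>
#include <CL/cl.h>

void CL_CALLBACK on_done(cl_event ev, cl_int status, void *user_data)
{
    printf("kernel finished, status %d\n", status);     /* runs on the host */
}

void time_kernel(cl_command_queue queue, cl_kernel kernel, size_t gsize)
{
    cl_event ev;
    clEnqueueNDRangeKernel(queue, kernel, 1, NULL, &gsize, NULL, 0, NULL, &ev);
    clSetEventCallback(ev, CL_COMPLETE, on_done, NULL); /* host callback on completion */
    clWaitForEvents(1, &ev);                            /* block until the kernel finishes */

    cl_ulong t0, t1;
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_START, sizeof(t0), &t0, NULL);
    clGetEventProfilingInfo(ev, CL_PROFILING_COMMAND_END, sizeof(t1), &t1, NULL);
    printf("kernel time: %.3f ms\n", (t1 - t0) * 1e-6);
    clReleaseEvent(ev);
}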

Native kernel: a regular C function that executes on the host rather than the device, enqueued with clEnqueueNativeKernel().
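A minimal sketch, assuming queue targets a device that supports native kernels (host_func and run_native are illustrative names):

#include <stdio.h>
#include <CL/cl.h>

void CL_CALLBACK host_func(void *args)
{
    printf("running on the host, not the GPU\n");
}

void run_native(cl_command_queue queue)
{
    /* no argument block, no memory objects, no wait list */
    clEnqueueNativeKernel(queue, host_func, NULL, 0, 0, NULL, NULL, 0, NULL, NULL);
    clFinish(queue);    /* wait for the host function to run */
}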

Fence operation: makes sure outstanding memory reads/writes are done before execution continues, e.g., mem_fence().
Atomic operations: atomic_add(), atomic_xchg().
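For example, a minimal kernel (illustrative) that counts positive inputs with atomic_add():

__kernel void count_positive(__global const float *in,
                             volatile __global int *counter)
{
    int i = get_global_id(0);
    if (in[i] > 0.0f)
        atomic_add(counter, 1);   /* concurrent increments do not race */
}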

For constant data, we can use clGetDeviceInfo() to query the device limits: CL_DEVICE_MAX_CONSTANT_ARGS and CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE.
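A minimal query sketch, assuming a valid cl_device_id (the function name is hypothetical):

#include <stdio.h>
#include <CL/cl.h>

void print_constant_limits(cl_device_id dev)
{
    cl_uint max_args;
    cl_ulong max_size;
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_ARGS, sizeof(max_args), &max_args, NULL);
    clGetDeviceInfo(dev, CL_DEVICE_MAX_CONSTANT_BUFFER_SIZE, sizeof(max_size), &max_size, NULL);
    printf("constant args: %u, constant buffer: %llu bytes\n",
           max_args, (unsigned long long)max_size);
}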

A wavefront executes the same instruction across all of its work-items, so branch divergence within a wavefront is very inefficient: both paths are executed in sequence.

Memory access: global memory is organized into channels and banks.
A wavefront is most efficient when its accesses fall within a single channel and bank (a 64 KB region).

Profiler:
AMD: 
  • Accelerated Parallel Processing Profiler
  • Accelerated Parallel Processing Kernel Analyzer
  • gDEBugger
  • clGetEventProfilingInfo(): get event timing information

Tuesday, January 10, 2017

study: GPU programming: CUDA, OpenCL

GPU and host memories are typically disjoint, requiring explicit (or implicit, depending on the development platform) data transfer between the two. 

CUDA (Compute Unified Device Architecture): provides two sets of APIs (a low-level and a higher-level one), and it is available freely for Windows, Mac OS X, and Linux operating systems. Although it can be considered too verbose, for example requiring explicit memory transfers between the host and the GPU, it is the basis for the implementation of higher-level third-party APIs and libraries, as explained below.
OpenCL (Open Computing Language): supported by both Nvidia and AMD. It is the primary development platform for AMD GPUs. OpenCL's programming model matches closely the one offered by CUDA.

CUDA:
Supports heterogeneous computation, where applications use both the CPU and GPU. Serial portions of applications are run on the CPU, and parallel portions are offloaded to the GPU. As such, CUDA can be incrementally applied to existing applications. The CPU and GPU are treated as separate devices that have their own memory spaces. This configuration also allows simultaneous computation on the CPU and GPU without contention for memory resources.

In order to properly utilize a GPU, the program must be decomposed into a large number of threads that can run concurrently. GPU schedulers can execute these threads with minimum switching overhead and under a variety of configurations based on the available device capabilities. 

Threads are organized in a hierarchy of two levels, as shown in Figure 6.3. At the lower level, threads are organized in blocks that can be of one, two or three dimensions. Blocks are then organized in grids of one, two, or three dimensions. The sizes of the blocks and grids are limited by the capabilities of the target device.
Every thread in every block runs the same kernel function at the same time.
hello<<<2,10>>>()
means the function hello() will run 20 times in parallel: 2 blocks × 10 threads, one run per thread.
an example: 
#include <cstdio>

__global__ void hello()
{
   printf("hello world\n");     // executes on the device
}

int main()
{
   hello<<<1,10>>>();           // launch 1 block of 10 threads
   cudaDeviceSynchronize();     // wait for the kernel to finish
   return 0;
}
It will display "hello world" 10 times. cudaDeviceSynchronize() is a barrier, waiting for the kernel to finish. The function hello() executes on the device, although it is called by the host.
Each kernel launch has two levels of dimensions, grid and block; this means every thread has a grid (block) index and a thread index within its block.
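A common indexing pattern built from these two indices (a sketch; scale and its parameters are hypothetical, the built-in variables are standard CUDA):

__global__ void scale(float *data, float k, int n)
{
   int i = blockIdx.x * blockDim.x + threadIdx.x;   // global thread index
   if (i < n)                                       // the last block may be partially used
      data[i] *= k;
}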

What's a warp?
In a GPU, the same kernel instruction executes on many processing units (SP: streaming processor). A collection of SPs under one controller is an SM: Streaming Multiprocessor.
One GPU contains multiple SMs, and each SM can run its own kernel. In Nvidia terms, one SP == one CUDA core. Nvidia calls this execution model Single-Instruction, Multiple Threads (SIMT).
My impression: one block runs on one SM.
Warp: the threads in a block do not all run concurrently, though. Instead they are executed in groups called warps (32 threads each on Nvidia hardware).

Threads in a warp may execute as one, but they operate on different data. So, what happens if the result of a conditional operation leads them to different paths? The answer is that all the divergent paths are evaluated (if threads branch into them) in sequence until the paths merge again. The threads that do not follow the path currently being executed are stalled.
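A small sketch of such divergence (the kernel is hypothetical): even and odd threads in a warp take different branches, so the warp executes both paths in sequence:

__global__ void diverge(int *out)
{
   int i = threadIdx.x;
   if (i % 2 == 0)
      out[i] = i * 2;    // runs while the odd lanes are stalled
   else
      out[i] = i + 1;    // runs while the even lanes are stalled
}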



Monday, January 9, 2017

study: OpenCL on Intel GPU: installing the SDK


  • I downloaded the OpenCL SDK for Intel GPUs from https://software.intel.com/en-us/intel-opencl/download It asks for an email to register, then starts downloading intel_sdk_for_opencl_setup_6.3.0.1904.exe (276 MB).
  • Running the exe installs the Intel SDK for OpenCL in the folder intel\opencl SDK\6.3; after installing, it requires a restart. The folder is about 500 MB, but there are no sample files.
  • I think I can copy sample files from the AMD OpenCL package. That package is big, about 522 MB, and it contains:
    • Bolt
    • C++ AMP
    • OpenCL
    • OpenCV
  • In the opencl folder, we open OpenCL2.0SamplesVS13. This solution includes lots of projects; we select HelloWorld to compile, and there are errors. Obviously, the errors come from the projects not knowing the location of the OpenCL header files and libraries, so we add AMDAPPSDKROOT to the environment variables and set it to C:\Program Files (x86)\Intel\OpenCL SDK\6.3
  • Then we recompile the project and it succeeds. We can run it, and it displays "Passed". But not all projects work; some of them are missing the CLUtil.hpp file.
  • After installation, there is a new menu item in Visual Studio: Code Builder. It is for OpenCL development; https://software.intel.com/en-us/intel-opencl/ has several videos that teach you how to use it.

study: Multicore and GPU programming (1)

CPUs employ large on-chip (and sometimes multiple) memory caches, few complex (e.g., pipelined) arithmetic and logical processing units (ALUs), and complex instruction decoding and prediction hardware to avoid stalling while waiting for data to arrive from the main memory.
Instead, GPU designers chose a different path: small on-chip caches with a big collection of simple ALUs capable of parallel operation, since data reuse is typically small for graphics processing and programs are relatively simple. In order to feed the multiple cores on a GPU, designers also dedicated very wide, fast memory buses for fetching data from the GPU’s main memory.
Now it becomes obvious why having CPU and GPU cores share and access the same memory space is an important feature. In principle, this arrangement promises better integration of computing resources and potentially greater performance, but only time will tell.

THE CELL BE PROCESSOR

Cell features a design well ahead of its time: a master-worker, heterogeneous, MIMD machine on a chip.
The hardware was designed for maximum computing efficiency but at the expense of programming ease. The Cell is notorious for being one of the most difficult platforms to program on.

NVIDIA’S KEPLER


The cores in a Kepler GPU are arranged in groups called Streaming Multiprocessors (abbreviated to SMX in Kepler, SM in previous architectures, and SMM in the upcoming Maxwell). Each Kepler SMX contains 192 cores that execute in a SIMD fashion, i.e., they run the same sequence of instructions but on different data. Each SMX can run its own program, though. The total number of SMX blocks is the primary differentiating factor between different chips of the same family. The most powerful chip in the Kepler family is the GTX Titan, with a total of 15 SMXs. One of the SMXs is disabled in order to improve production yields, resulting in a total of 14 · 192 = 2688 cores! The extra SMX is enabled in the version of the chip used in the dual-GPU, GTX Titan Z card, resulting in an astonishing package of 5760 cores! AMD’s dual-GPU offering in the form of the Radeon R9 295X2 card is also brandishing 5632 cores in a shootout that is delighting all high-performance enthusiasts.

AMD’S APUS

What is significant is the unification of the memory spaces of the CPU and GPU cores. This means that there is no communication overhead associated with assigning workload to the GPU cores, nor any delay in getting the results back. This also removes one of the major hassles in GPU programming, which is the explicit (or implicit, based on the middleware available) data transfers that need to take place.
HSA is arguably the way forward, having the capability to assign each task to the computing node most suitable for it, without the penalty of traversing a slow peripheral bus. Sequential tasks are more suitable for the LCU/CPU cores, while data-parallel tasks can take advantage of the high-bandwidth, high-computational throughput of the TCU/GPU cores.




Monday, January 2, 2017

bookmark: H265 vs H264 codec

A Comparison of H.264 and H.265
Function | H.264 | H.265
Coding unit | 16 × 16 macroblock | 64 × 64, 32 × 32, 16 × 16 coding tree unit; 64 × 64, 32 × 32, 16 × 16, 8 × 8 coding unit
Prediction | 16 × 16, 16 × 8, 8 × 16, 8 × 8, 8 × 4, 4 × 8, 4 × 4 | 64 × 64 down to 4 × 4, symmetric/asymmetric
Transform size | 8 × 8, 4 × 4 | 32 × 32, 16 × 16, 8 × 8, 4 × 4
Transform | DCT | DCT, optional DST for 4 × 4
Intra prediction | 9 modes | 35 modes
Luma interpolation | 6-tap filter for 1/2 sample followed by bilinear interpolation for 1/4 sample | 8-tap filter for 1/2 sample, 7-tap filter for 1/4 sample
Chroma interpolation | Bilinear interpolation | 4-tap filter for 1/8 sample
Inter prediction | Motion vector | Advanced motion vector prediction (spatial and temporal)
Entropy coding | CABAC, CAVLC | CABAC
In-loop filtering | Deblocking | Deblocking followed by sample-adaptive offset
Parallel processing | Slices, slice groups | Slices, tiles, wavefronts
H.265 GOP
Different from MPEG-4, a B frame can be used as either a referenced or an unreferenced frame.


Unlike H.264, an H.265 GOP can start without an IDR frame. An IDR frame is independently coded, and frames that follow it in the bitstream will not reference frames prior to it. To implement this, H.265 defines several frame types:
  • CRA: clean random access frame; a new, independently coded frame that starts at a RAP (random access point)
  • BLA: broken-link access frame
  • RASL: random access skipped leading frame
  • RADL: random access decodable leading frame
RASL frames and RADL frames are leading frames because their display (output) order precedes the RAP frame, even though they appear after the RAP frame in decoding order.

RDO (High Complexity) When Compared to No RDO
Mode | Encoding Time | Compression Efficiency | Video Quality
VBR | Longer (especially for low QP) | Lower | Better for every frame (especially for low QP)
CBR | Longer | NA | Better (especially for low QP)