CPUs employ large (and often multi-level) on-chip memory caches, a small number of complex (e.g., pipelined) arithmetic logic units (ALUs), and sophisticated instruction decoding and branch-prediction hardware to avoid stalling while waiting for data to arrive from main memory.
GPU designers chose a different path: small on-chip caches paired with a large collection of simple ALUs capable of operating in parallel, since data reuse is typically low in graphics processing and the programs are relatively simple. To keep the many cores fed, designers also provided very wide, fast memory buses for fetching data from the GPU’s main memory.
Now it becomes obvious why having CPU and GPU cores share and access the same memory space is an important feature. In principle, this arrangement promises tighter integration of computing resources and potentially greater performance, but only time will tell.
THE CELL BE PROCESSOR
Cell features a design that was well ahead of its time: a master-worker, heterogeneous, MIMD machine on a chip, pairing a PowerPC-based master core (the PPE) with eight SIMD worker cores (the SPEs).
The hardware was designed for maximum computing efficiency, but at the expense of programming ease. The Cell is notorious for being one of the most difficult platforms to program for.
NVIDIA’S KEPLER
The cores in a Kepler GPU are arranged in groups called Streaming Multiprocessors (abbreviated to SMX in Kepler, SM in previous architectures, and SMM in the upcoming Maxwell). Each Kepler SMX contains 192 cores that execute in a SIMD fashion, i.e., they run the same sequence of instructions but on different data. Each SMX can run its own program, though. The total number of SMX blocks is the primary differentiating factor between chips of the same family. The most powerful chip in the Kepler family, found in the GTX Titan, carries a total of 15 SMXs. One of them is disabled in order to improve production yields, resulting in 14 · 192 = 2688 cores! The extra SMX is enabled in the version of the chip used in the dual-GPU GTX Titan Z card, whose two chips deliver an astonishing 2 · 15 · 192 = 5760 cores! AMD’s dual-GPU offering, the Radeon R9 295X2 card, brandishes 5632 cores of its own, in a shootout that delights high-performance enthusiasts.
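To make the SIMD execution model concrete, here is a minimal CUDA sketch: every thread of the kernel runs the same instruction sequence on a different array element, and the standard `cudaGetDeviceProperties` call reports how many SMX blocks the installed device has. The 192-cores-per-SMX figure is hard-coded as an assumption, since it applies only to Kepler-class (compute capability 3.x) chips.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Every thread executes the same instruction stream on its own
// array element -- the SIMD-style execution described above.
__global__ void saxpy(int n, float a, const float *x, float *y)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n)
        y[i] = a * x[i] + y[i];
}

int main()
{
    cudaDeviceProp prop;
    cudaGetDeviceProperties(&prop, 0);

    // multiProcessorCount reports the number of SMX blocks.
    // Assumption: 192 cores per SMX, valid only for Kepler.
    const int coresPerSMX = 192;
    printf("%s: %d SMX blocks, ~%d cores\n",
           prop.name, prop.multiProcessorCount,
           prop.multiProcessorCount * coresPerSMX);
    return 0;
}
```

On a 14-SMX GTX Titan this would print 2688 cores, matching the arithmetic above.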
AMD’S APUS
What is significant is the unification of the memory spaces of the CPU and GPU cores. This means that there is no communication overhead associated with assigning work to the GPU cores, nor any delay in getting the results back. It also removes one of the major hassles of GPU programming: the explicit (or implicit, depending on the available middleware) data transfers that would otherwise have to take place.
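On a discrete GPU, that hassle looks like the first half of the following CUDA sketch: separate allocations and explicit `cudaMemcpy` calls in each direction. The second half uses `cudaMallocManaged` (available since CUDA 6) to give both processors a single pointer. Note the difference: on a discrete card the driver still migrates the data behind the scenes, whereas on an APU the CPU and GPU genuinely share physical memory, so the copies disappear in hardware and not just from the source code. The kernel and sizes are illustrative only.

```cuda
#include <cuda_runtime.h>

// Trivial data-parallel kernel used by both halves of the example.
__global__ void scale(int n, float a, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] *= a;
}

int main()
{
    const int n = 1 << 20;
    const int threads = 256, blocks = (n + threads - 1) / threads;

    // Discrete-GPU style: two address spaces, explicit transfers both ways.
    float *h = new float[n];
    for (int i = 0; i < n; ++i) h[i] = 1.0f;
    float *d;
    cudaMalloc(&d, n * sizeof(float));
    cudaMemcpy(d, h, n * sizeof(float), cudaMemcpyHostToDevice);
    scale<<<blocks, threads>>>(n, 2.0f, d);
    cudaMemcpy(h, d, n * sizeof(float), cudaMemcpyDeviceToHost);
    cudaFree(d);

    // Unified style: one pointer visible to both CPU and GPU;
    // no cudaMemcpy appears anywhere in this half.
    float *u;
    cudaMallocManaged(&u, n * sizeof(float));
    for (int i = 0; i < n; ++i) u[i] = 1.0f;   // CPU writes directly
    scale<<<blocks, threads>>>(n, 2.0f, u);     // GPU works on the same buffer
    cudaDeviceSynchronize();                    // CPU may now read u again

    cudaFree(u);
    delete[] h;
    return 0;
}
```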
HSA (Heterogeneous System Architecture) is arguably the way forward, having the capability to assign each task to the computing node most suitable for it, without the penalty of traversing a slow peripheral bus. Sequential tasks are more suitable for the latency-oriented LCU/CPU cores, while data-parallel tasks can take advantage of the high memory bandwidth and high computational throughput of the TCU/GPU cores.
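As a sketch of that division of labor (using the same managed-memory assumption as above, since standard CUDA exposes no HSA task-queue API), the data-parallel pass below is dispatched to the GPU, while the loop that follows stays on the CPU because each of its iterations depends on the previous one; both phases touch the same buffer with no transfers in the code.

```cuda
#include <cstdio>
#include <cuda_runtime.h>

// Data-parallel phase: a natural fit for the throughput (TCU/GPU) cores.
__global__ void square(int n, float *x)
{
    int i = blockIdx.x * blockDim.x + threadIdx.x;
    if (i < n) x[i] = x[i] * x[i];
}

int main()
{
    const int n = 1 << 20;
    float *data;
    cudaMallocManaged(&data, n * sizeof(float));
    for (int i = 0; i < n; ++i) data[i] = i * 0.001f;

    // Throughput-oriented phase on the GPU...
    square<<<(n + 255) / 256, 256>>>(n, data);
    cudaDeviceSynchronize();

    // ...followed by an inherently sequential phase on the CPU
    // (each step depends on the previous one), on the same buffer.
    float running = 0.0f;
    for (int i = 0; i < n; ++i) running = 0.5f * (running + data[i]);

    printf("result: %f\n", running);
    cudaFree(data);
    return 0;
}
```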