The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
September 10, 2008
One of the most exciting developments in parallel programming over the past few years has been the availability and advancement of programmable graphics cards. A high end graphics card costs less than a high end CPU and provides tantalizing peak performance approaching, or exceeding, one teraflop. Since microprocessor peak performance tops out at about 25 gigaflops/core (single precision), this potential, at such low cost, is worth exploring. Harnessing this performance, however, is problematic.
It's important to note that the GPUs powering the graphics cards are designed to do specific jobs very well. They are not designed as general purpose processors, and in fact will do a very poor job on many programs, even highly parallel applications. The key is to determine whether your application can fit into a programming model that maps well onto the GPU. I'm going to discuss the GPU architecture, but I'm going to start with an analogy, and probably stretch the analogy to the breaking point; let's discuss airline travel.
Suppose your job is to transport several dozen large tour groups between London and Seattle, each group with 30-60 members. Your most likely choice is to use jet aircraft for a flight of about 5,000 miles or 7,700 km. Going for the most parallelism, you could use a new Airbus A380 to move 600 people in about 9 hours. One problem you have with these jumbo jets is they don't fit at the main terminal, so you have to take an airport train out to the remote terminal where the plane is parked; the train can only carry so many people at a time, but let's be optimistic and say it will take 90 minutes to move everyone out to the remote gate and load them, and another 90 minutes at the other end for unloading. This gives us a 24 hour round trip. If you regularly fill the plane, this could be a good investment (though, at $300M, a bit more than a GPU). However, if you only have a half-load, it doesn't get them there twice as fast or at half the cost, though it does reduce the load/unload time. Measuring performance as passenger-miles, the performance comes from parallelism (many passengers), not from latency reduction.
Alternatively, you could opt for a few smaller planes, say two to four Boeing 787s. Each can carry about 200-250 passengers at the same speed, so you can move the same number of people at the same rate. You also have the advantage of parking at the main terminal, so you can reduce the passenger load/unload time to about half an hour, so your total round trip time is only 20 hours. Of course, you have to arrange for more crew, more landing slots at the airports, and so forth. Your total capital investment is about the same, but it gives you some flexibility. If you only have 200 people to move, you can leave all but one of the planes behind, saving on fuel and crew costs.
Or, perhaps you could invest in the (future) hypersonic transport, which some believe may be able to travel at speeds of Mach 6, 7-8 times faster than the current subsonic airliners. Assuming it takes time to get up to speed and to slow down for landing, the total flight time might drop from 9 hours to 2.5 hours, or 7 hours round trip. If the capacity of your hypothetical hypersonic transport is 200 passengers, you can transport 600 people in each direction in just 21 hours. Even better, if you only have to transport 400 people, you can get the same work done in two-thirds the time. Of course, you are buying a higher cost transport, probably paying more for fuel, and so on.
Very fast CPUs are more like the hypersonic transport; they are designed to provide extremely high performance for small tasks, but still offer adequate performance for large tasks. Multicore processors are more like the middle option, several smaller, lower-capacity devices, each quite capable, and you can save power by shutting down one or more of them. The GPU is more like the super-jumbo jet; it only gets high performance (passenger-miles) when you have lots of passengers. It doesn't do so well getting just one or a few passengers across the ocean.

So now to GPU architecture. GPUs were originally hardwired for specific tasks; as transistor budgets and demand for flexibility grew, the hardware became more programmable. They still contain special hardware and functional units specific to graphics tasks, but I'll ignore those and view today's GPUs as compute accelerators. A typical design, shown in the figure above, is abstracted from information put out by NVIDIA. I'll use both NVIDIA's terms and more standard computer architecture descriptions of the various parts. The key to the performance is all those thread processors; in the figure, there are eight thread processors in each of 16 multiprocessors, for 128 TPs total. NVIDIA delivers GPUs with up to 30 multiprocessors and 240 TPs. In each clock, each TP can produce a result, giving this design a very high peak performance rating.
Each multiprocessor executes in SIMD mode, meaning each thread processor in that multiprocessor executes the same instruction simultaneously. If one thread processor is executing a floating point add, they all are; if one is doing a fetch, they all are. Moreover, each instruction is quad-clocked, meaning the SIMD width is 32, even though there are only eight thread processors. Unlike classical SIMD machines, there isn't a distinction between the scalar and parallel operations, or mono and poly operations, to borrow terms from C* and Dataparallel C, and Cn for the Clearspeed card. Instead, the model is that of many scalar threads that just happen to be executing in SIMD mode, something NVIDIA calls SIMT execution. Careful orchestration of the 32 threads that execute in SIMD mode is necessary for best performance.
Stretching our analogy, think of each thread processor like a seat in our superjumbo jet, and each multiprocessor as a tour group. Imagine that when the flight attendant serves a meal, the whole tour group must be served at once (synchronously). The whole tour group must watch the same movie on the seat-back screens at the same time, or they all must read books at the same time, though some may be napping at any time. When one person wants to use the restroom, they all must go at once, even if not everyone needs to. Note that this is a long latency operation, which will correspond to the long latency memory.
The multiprocessors themselves execute asynchronously, and without communication. This last point is quite important. In a multicore or multiprocessor system, the cores or processors can communicate through the memory. If one thread stores a value in variable A then sets a FLAG, the hardware will guarantee that another thread on the same or another core or processor will not see the updated FLAG without seeing the updated value for A. The hardware supports a memory model that preserves the store order. No such memory model is supported on the GPU; a program could store a set of values on one multiprocessor and read the same locations on another, but there is no guarantee that the value fetched will be consistent (in the formal sense) with the values stored. Relaxing the memory model allows the hardware to reorder the stores from a multiprocessor, allowing more throughput.
Page: 1 of 3(Digg, Technorati, more)
Platform HPC Workgroup Manager
Platform HPC Workgroup Manager integrates all the cluster productivity tools you need to deploy, run and manage your HPC environment.
Mar 11 | Linux Magazine | CUDA may be the rage, but OpenCL is a standard that has some features you may need. Read more...
Mar 09 | Free Software Magazine | Data-driven computing will need open software. Read more...
Mar 09 | Bio-IT World | Tahoe Informatics founder eyes GPUs, CUDA software. Read more...
Mar 08 | Sporting Life | Formula One engineers differ on benefits of CFD. Read more...
Mar 08 | InfoWorld | AMD offers up 48-core server prize. Read more...
Jan 12 | | In-depth look at vSMP Foundation server virtualization technology, technical implementation, use cases and capabilities. The technical whitepaper provides an architectural overview and details on the three vSMP Foundation products: vSMP Foundation for SMP, vSMP Foundation for Cluster and vSMP Foundation for Cloud.
Jan 18 | | This white paper discusses Gore’s copper cable assemblies, and how they continue to exceed the standards for providing reliable, cost-effective solutions for high-performance computer applications.
Join this online panel discussion for live Q&A with leading industry experts, analysts, and end-users to discuss the latest innovations, best practices, barriers to implementation, and measurable benefits of server virtualization with a particular focus on today's real world solutions.
Learn about scalable fault-tolerant architectures and examples of energy efficient and scalable supercomputing clusters using dual QDR InfiniBand to combine capacity computing with network failover capabilities with the help of programming languages such as MPI and a robust Linux cluster management package.
LIVE@SCO9: The IBM team discusses new innovations in hardware, software and services that help clients better understand their workloads and get insight from their R&D efforts. Technology demonstrations include the soon-to-be-released Power7 HPC processor, the DCS990 system with 2.4 petabytes of storage, the xCAT management tool, secure HPC cloud computing and more. Winners of two HPCwire Readers' and Editors’ Choice Awards! Take the IBM virtual tour at SC09 or more information go online to: http://www-03.ibm.com/systems/deepcomputing/sc09.html