The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
September 10, 2008
One of the most exciting developments in parallel programming over the past few years has been the availability and advancement of programmable graphics cards. A high-end graphics card costs less than a high-end CPU and provides tantalizing peak performance approaching, or exceeding, one teraflop. Since microprocessor peak performance tops out at about 25 gigaflops/core (single precision), this potential, at such low cost, is worth exploring. Harnessing this performance, however, is problematic.
It's important to note that the GPUs powering the graphics cards are designed to do specific jobs very well. They are not designed as general purpose processors, and in fact will do a very poor job on many programs, even highly parallel applications. The key is to determine whether your application can fit into a programming model that maps well onto the GPU. I'm going to discuss the GPU architecture, but I'm going to start with an analogy, and probably stretch the analogy to the breaking point; let's discuss airline travel.
Suppose your job is to transport several dozen large tour groups between London and Seattle, each group with 30-60 members. Your most likely choice is to use jet aircraft for a flight of about 4,800 miles or 7,700 km. Going for the most parallelism, you could use a new Airbus A380 to move 600 people in about 9 hours. One problem you have with these jumbo jets is they don't fit at the main terminal, so you have to take an airport train out to the remote terminal where the plane is parked; the train can only carry so many people at a time, but let's be optimistic and say it will take 90 minutes to move everyone out to the remote gate and load them, and another 90 minutes at the other end for unloading. That is 12 hours each way, which gives us a 24-hour round trip. If you regularly fill the plane, this could be a good investment (though, at $300M, a bit more than a GPU). However, if you only have a half-load, it doesn't get them there twice as fast or at half the cost, though it does reduce the load/unload time. Measuring performance as passenger-miles, the performance comes from parallelism (many passengers), not from latency reduction.
Alternatively, you could opt for a few smaller planes, say two to four Boeing 787s. Each can carry about 200-250 passengers at the same speed, so you can move the same number of people at the same rate. You also have the advantage of parking at the main terminal, so you can reduce the passenger load/unload time to about half an hour, so your total round trip time is only 20 hours. Of course, you have to arrange for more crew, more landing slots at the airports, and so forth. Your total capital investment is about the same, but it gives you some flexibility. If you only have 200 people to move, you can leave all but one of the planes behind, saving on fuel and crew costs.
Or, perhaps you could invest in the (future) hypersonic transport, which some believe may be able to travel at Mach 6, some 7-8 times faster than current subsonic airliners. Allowing time to get up to speed and to slow down for landing, the flight time might drop from 9 hours to 2.5 hours; with the same half hour each for loading and unloading, that is 7 hours round trip. If the capacity of your hypothetical hypersonic transport is 200 passengers, three round trips transport 600 people in each direction in just 21 hours. Even better, if you only have to transport 400 people, you can get the same work done in two-thirds the time. Of course, you are buying a higher cost transport, probably paying more for fuel, and so on.
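The arithmetic behind the three travel options can be checked with a short sketch. The function names are illustrative only, and the half-hour load and unload times for the hypersonic transport are an assumption inferred from the 7-hour round trip quoted above.

```python
from math import ceil

def round_trip_hours(flight_hours, load_hours, unload_hours):
    """One round trip: load, fly, and unload in each direction."""
    return 2 * (flight_hours + load_hours + unload_hours)

def hours_to_move(passengers, capacity, planes, trip_hours):
    """Time to carry `passengers` each way using `planes` aircraft of `capacity` seats."""
    trips_per_plane = ceil(passengers / (capacity * planes))
    return trips_per_plane * trip_hours

a380 = round_trip_hours(9, 1.5, 1.5)          # remote terminal: 90 minutes each end
b787 = round_trip_hours(9, 0.5, 0.5)          # main terminal: half an hour each end
hypersonic = round_trip_hours(2.5, 0.5, 0.5)  # assumed: same half-hour load/unload

print(hours_to_move(600, 600, 1, a380))        # one full A380
print(hours_to_move(300, 600, 1, a380))        # half-load A380: no faster
print(hours_to_move(600, 200, 3, b787))        # three 787s flying together
print(hours_to_move(600, 200, 1, hypersonic))  # three trips of one fast plane
print(hours_to_move(400, 200, 1, hypersonic))  # two trips: two-thirds the time
```

The half-load A380 case takes just as long as the full one, which is the point of the analogy: the big plane delivers throughput only when it is full.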
Very fast CPUs are more like the hypersonic transport; they are designed to provide extremely high performance for small tasks, but still offer adequate performance for large tasks. Multicore processors are more like the middle option, several smaller, lower-capacity devices, each quite capable, and you can save power by shutting down one or more of them. The GPU is more like the super-jumbo jet; it only gets high performance (passenger-miles) when you have lots of passengers. It doesn't do so well getting just one or a few passengers across the ocean.

So now to GPU architecture. GPUs were originally hardwired for specific tasks; as transistor budgets and demand for flexibility grew, the hardware became more programmable. They still contain special hardware and functional units specific to graphics tasks, but I'll ignore those and view today's GPUs as compute accelerators. A typical design, shown in the figure above, is abstracted from information put out by NVIDIA. I'll use both NVIDIA's terms and more standard computer architecture descriptions of the various parts. The key to the performance is all those thread processors; in the figure, there are eight thread processors in each of 16 multiprocessors, for 128 TPs total. NVIDIA delivers GPUs with up to 30 multiprocessors and 240 TPs. In each clock, each TP can produce a result, giving this design a very high peak performance rating.
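As a rough illustration of where the peak rating comes from, the sketch below multiplies out the thread processor counts given above. The clock rate and the number of floating point operations counted per clock are hypothetical parameters, not figures from this article; peak ratings often count a multiply-add as two operations, which is why that factor is left adjustable.

```python
def thread_processors(multiprocessors, tps_per_multiprocessor=8):
    """Total thread processors: eight per multiprocessor, as in the design described."""
    return multiprocessors * tps_per_multiprocessor

def peak_gflops(tps, clock_ghz, flops_per_clock=1):
    """Peak rate if every TP produces `flops_per_clock` results on every clock.

    clock_ghz and flops_per_clock are assumptions supplied by the caller.
    """
    return tps * clock_ghz * flops_per_clock

figure_design = thread_processors(16)   # the 16-multiprocessor design in the figure
largest_design = thread_processors(30)  # the 30-multiprocessor part

print(figure_design, largest_design)
# Hypothetical 1.5 GHz clock, counting a multiply-add as two operations.
print(peak_gflops(largest_design, clock_ghz=1.5, flops_per_clock=2))
```

Whatever the exact clock, the rating assumes every thread processor is busy on every clock, which is the airline equivalent of flying with every seat filled.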
Each multiprocessor executes in SIMD mode, meaning each thread processor in that multiprocessor executes the same instruction simultaneously. If one thread processor is executing a floating point add, they all are; if one is doing a fetch, they all are. Moreover, each instruction is quad-clocked, meaning the SIMD width is 32, even though there are only eight thread processors. Unlike classical SIMD machines, there isn't a distinction between the scalar and parallel operations, or mono and poly operations, to borrow terms from C* and Dataparallel C, and from Cn for the ClearSpeed card. Instead, the model is that of many scalar threads that just happen to be executing in SIMD mode, something NVIDIA calls SIMT execution. Careful orchestration of the 32 threads that execute in SIMD mode is necessary for best performance.
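A toy cost model may help show why the 32 threads need careful orchestration. It assumes, as a simplification not spelled out in the text, that when threads in the same group of 32 take different branches, the hardware runs each taken branch in turn with the non-participating threads idle. The names and instruction counts are made up for illustration.

```python
SIMD_WIDTH = 32  # threads that execute one instruction together
LANES = 8        # thread processors per multiprocessor

def clocks_per_instruction(simd_width=SIMD_WIDTH, lanes=LANES):
    """Quad-clocking: 32 threads pass through 8 thread processors in 4 clocks."""
    return simd_width // lanes

def group_clocks(branch_of_thread, instructions_in_branch):
    """Clocks for one 32-thread group under the assumed serialize-each-branch model.

    branch_of_thread: one branch label per thread.
    instructions_in_branch: instruction count for each branch label.
    """
    taken = set(branch_of_thread)  # every branch some thread takes must be issued
    issued = sum(instructions_in_branch[b] for b in taken)
    return issued * clocks_per_instruction()

cost = {"then": 10, "else": 6}          # hypothetical instruction counts
uniform = ["then"] * 32                  # all threads agree
split = ["then"] * 16 + ["else"] * 16    # half the threads go each way

print(group_clocks(uniform, cost))
print(group_clocks(split, cost))
```

In this model the split group pays for both branches even though each thread needs only one, so keeping the threads of a group on the same path matters more than keeping each individual thread short.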
Stretching our analogy, think of each thread processor like a seat in our superjumbo jet, and each multiprocessor as a tour group. Imagine that when the flight attendant serves a meal, the whole tour group must be served at once (synchronously). The whole tour group must watch the same movie on the seat-back screens at the same time, or they all must read books at the same time, though some may be napping at any time. When one person wants to use the restroom, they all must go at once, even if not everyone needs to. Note that this is a long-latency operation, which corresponds to a long-latency memory access.
The multiprocessors themselves execute asynchronously, and without communication. This last point is quite important. In a multicore or multiprocessor system, the cores or processors can communicate through the memory. If one thread stores a value in variable A then sets a FLAG, the hardware will guarantee that another thread on the same or another core or processor will not see the updated FLAG without seeing the updated value for A. The hardware supports a memory model that preserves the store order. No such memory model is supported on the GPU; a program could store a set of values on one multiprocessor and read the same locations on another, but there is no guarantee that the value fetched will be consistent (in the formal sense) with the values stored. Relaxing the memory model allows the hardware to reorder the stores from a multiprocessor, allowing more throughput.
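The A-then-FLAG example can be made concrete with a small enumeration. This is a deliberately simplified model, not a description of any real hardware: the reader is assumed to sample both locations at a single instant, and "relaxed" simply means the two stores may become visible in either order.

```python
from itertools import permutations

STORES = (("A", 1), ("FLAG", 1))  # program order on the writing thread

def observable_states(preserve_store_order):
    """Return every (FLAG, A) pair a reader could observe in this toy model."""
    orders = [STORES] if preserve_store_order else list(permutations(STORES))
    seen = set()
    for order in orders:
        memory = {"A": 0, "FLAG": 0}
        seen.add((memory["FLAG"], memory["A"]))      # before any store is visible
        for name, value in order:
            memory[name] = value                     # this store becomes visible
            seen.add((memory["FLAG"], memory["A"]))  # reader may sample here
    return seen

ordered = observable_states(preserve_store_order=True)
relaxed = observable_states(preserve_store_order=False)

print(sorted(ordered))
print(sorted(relaxed))
print(sorted(relaxed - ordered))  # states possible only when stores may be reordered
```

The one extra state in the relaxed case is FLAG set while A still holds its old value, which is exactly the outcome the ordered memory model rules out.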