November 14, 2006
The Coprocessor Revival
The idea of using specialized coprocessors to accelerate general-purpose computers for specific applications is at least as old as the attached processors of the late 1970s and early 1980s. Back then, a DEC or IBM minicomputer with a peak speed of less than a megaflop could become a "poor man's supercomputer" by adding a cabinet full of hardware designed for floating-point operations. And don't forget that Intel's original foray into serious FLOPS was when John Palmer convinced Intel to build a "coprocessor," the 8087 chip, to kick up the speed of the 8086 on technical applications. As transistor sizes shrank, vendors found it less of a burden to integrate what once had been prohibitively bulky and expensive hardware for fast arithmetic, and coprocessors faded from view.
Mark Twain once said, "History doesn't repeat itself, but it does rhyme." It's two decades later, and coprocessors are back with a vengeance. This time, the reasons are different: computing is increasingly limited by power consumption, cooling, space and weight; if you know how your workload is different from the general one, you can exploit that difference to get far more computing done within those limits, by applying the right accelerator technology.
The Questions to Ask
All accelerators are good... for the purpose for which they were designed. The old saying "if you give a five-year-old a hammer, everything starts to look like a nail" comes to mind when we see attempts to use accelerators outside their intended range. Some of the things to ask in considering the fitness of an accelerator for a particular purpose are:
The last one is so fundamental to the use of accelerators, it deserves its own section.
Bandwidth: Is This Trip Necessary?
Adding an accelerator is much like adding another node to a computing cluster, in that you weigh the cost of sending data to it against the benefit of the extra processing power. Whether an accelerator fits in a socket on the motherboard or a PCI slot or even an extension chassis, you have to ask: Will it really pay to use this, including the time to move the data there and back? And is an accelerator really better than simply adding another node to the cluster?
It's elementary computer architecture to figure out how much computation per data point you have to have to amortize (or overlap) the time to move the data to and from a processor resource. I like to think about this "grain size" issue in terms of a simple dot product of two vectors of length N. If I have two processors, and can divide the vectors to reside on separate processors and then communicate to get the final sum, how big does N have to be for two processors to be faster than a single one? On a typical cluster with about 2 microseconds of latency between nodes, N can easily be in the thousands of elements. It's the same way with accelerators. Accelerators are for substantial tasks, not small-grain stuff like computing the cosine of a single number. Even if you have an infinitely fast accelerator in a low-latency socket on the motherboard, 10 nanoseconds away, a modern general-purpose chip can do 200 floating-point operations in the time it takes to get operands to the socket and the result back.
Too often I see benchmark specifications for accelerators that clearly don't take into account the time to get the data in and out. So be careful. This is especially true for Fast Fourier Transforms (FFTs), which really don't perform very many floating-point operations per data point. And remember what you're comparing against: the current crop of mainstream microprocessors can produce over 20 GFLOPS of 64-bit arithmetic per socket, so if you don't do your homework (understanding the match between what you are doing and the specific capabilities of an accelerator), you can easily find that your accelerator solution is more like, well, a decelerator.
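You can apply the same due diligence to an FFT offload. A complex transform of length N performs roughly 5 N log2 N flops, so a quick ratio of compute time to transfer time tells you whether the trip pays. The rates below are hypothetical, chosen only to illustrate the method:

```python
import math

def fft_offload_ratio(n, accel_gflops, link_gb_s):
    """Ratio of FFT compute time on an accelerator to the time spent
    moving the data across the link, for a complex single-precision
    transform of length n.

    Flop count uses the standard ~5*n*log2(n) estimate; each complex
    sample is 8 bytes and must travel to the card and back.
    """
    flops = 5.0 * n * math.log2(n)
    bytes_moved = 2 * 8 * n               # out to the card and back
    compute_s = flops / (accel_gflops * 1e9)
    transfer_s = bytes_moved / (link_gb_s * 1e9)
    return compute_s / transfer_s

# Hypothetical numbers: a 50 GFLOPS accelerator behind a 1 GB/s bus.
# Even for a million-point FFT, transfer time dominates compute time
# (the ratio comes out well below 1), so the raw FFT rate of the
# accelerator is not the number that matters.
r = fft_offload_ratio(2**20, accel_gflops=50.0, link_gb_s=1.0)
print(f"compute/transfer = {r}")
```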
But the issue is no longer just speed. It's speed per watt, speed per square foot, and speed per cubic inch. Oh, and speed divided by total cost of ownership. This is where accelerators can sometimes be of great benefit compared to simply adding more general-purpose processors to a system. With a few exceptions, accelerators that are performing the function for which they were designed can far exceed the ratio of performance to power and space that you can achieve with a general-purpose processor.
When you account for all costs, accelerators can also deliver more performance per dollar. Don't forget that many ISVs have licensing fees that charge by the number of nodes. Although we can hope for the day ISVs use a different way to charge for software on clusters, at least accelerators (graphics, network, or whatever) don't raise ISV license fees. And if you pick the right accelerator, they don't even increase software development costs because they can automatically invoke the library routines you already use and thus work transparently.
Let's now assume you've done your due diligence of carefully examining the option of adding more general-purpose processors, and it still looks like you might be better off with accelerators. Consider four general categories of accelerator: Field-Programmable Gate Arrays (FPGAs), Graphics Processing Units (GPUs), games processors, and HPTC coprocessor cards (like those made by ClearSpeed).
Field-Programmable Gate Arrays (FPGAs)
The reason FPGAs are getting interesting lately is that they've finally become capacious enough to hold some substantial kernel operations. They excel at bit twiddling, like the operations you need for cryptography and genetic pattern matching. They do integer calculations very well, often far faster than a general-purpose processor, which can help you accelerate tasks like the complicated indexing used in multidimensional quantum chemistry arrays.
Where they don't do as well is floating-point arithmetic. You can only squeeze a few 32-bit floating-point units into an FPGA, and getting more than one 64-bit multiply-add into the current generation of FPGAs is definitely a shoehorn job. In one recent effort, a researcher spent a month programming an FPGA to run a kernel with 56 double-precision operations. Since those couldn't possibly fit, he went to single precision and was just barely able to create the circuit representing the kernel. If he had had unlimited resources in the FPGA, the 100 MHz part would have done all 56 floating-point operations every clock cycle, yielding 5.6 GFLOPS. That's less than you can get from a low-end Intel, AMD, or POWER processor with a lot less programming effort! Sometimes you hear about FPGAs going much faster than a conventional processor at floating-point work, but the comparison may be between a meticulously programmed FPGA and a cursory compilation of a Fortran or C program for the microprocessor... that's one of David Bailey's famous "Twelve Ways to Fool the Masses" techniques. We can watch these research experiments with interest to see when FPGAs start to become competitive with other ways of doing floating-point calculations, and we know that FPGAs will eventually grow to the point where they accommodate 64-bit IEEE floating point at competitive speeds.
Just because an FPGA is a single chip, don't assume it's inexpensive! An FPGA big enough to be interesting can cost more than an entire server. And while they usually consume less power than a microprocessor, they consume much more than a custom VLSI circuit does for the thing the FPGA is programmed to do... more than ten times as much. So they are intermediate in speed and power consumption between the general programmability of a microprocessor and a handcrafted full-custom chip.
Programming tools for FPGAs have gotten much better in recent years, but the fact remains that you are not writing a program so much as you are laying out a circuit. The tools for circuit design resemble C, but if you really want to exploit the FPGA then it's not as simple as conventional computer programming. You don't want to undertake the job casually or often. Once you've gotten an FPGA programmed, you'll usually keep it programmed that way for a long time.
A common way to add an FPGA to a system is by inserting it into a socket on the motherboard. This is economical for space, but it might be taking up a socket that could instead hold a general-purpose processor. Unless every single PCI expansion slot in the chassis is in use, you might be better off using the socket for a microprocessor and converting an otherwise wasted PCI slot into extra computing power. Also, ask yourself whether your kernel requires access to the main memory of your system, and how an FPGA is going to access that data.
So in short, FPGAs are best for custom, non-floating-point kernel operations that are used repeatedly and intensively in your workload. Their strength is that they enable patient users to create custom circuitry that's fast for particular tasks, without having to actually lay out VLSI and get it fabricated.
Game Processors: Attack of the Killer PlayStations?
Does anyone remember a few years ago when some people at Sandia assembled a Beowulf-like cluster out of Nintendo 64 game consoles? It wasn't exactly a success, but it was one of the first indications of the lure of the impressive raw specifications of mass-produced hardware for video gaming. Look at those peak FLOPS! Look at that internal bandwidth! Surely there's some way to take a pickup to a nearby Wal-Mart and drive away with a few teraflops worth of game platforms that can be repurposed to set new LINPACK records...? I mentioned the Nintendo 64 cluster project to a colleague, and he said,
"Oh, yeah. At Argonne, we tried building a cluster out of Sony PlayStation 2s."
"Really? How did that turn out?"
(After a pause, rolling his eyes to the ceiling) "I'd rather not talk about it."
There are obviously some limits to the Beowulf model that consumer-grade computers plus unlimited armies of grad students equals supercomputing. What does this have to do with accelerators? Well, besides trying to build ensembles of game processors, some projects are underway to use them as accelerators to conventional clusters.
Since so many families have plenty of personal computers capable of playing games, the video game platform developers know they have to offer something spectacularly better at graphics than a PC. Hence, the last few generations of video game hardware have had gigaflops ratings that seem to blow away conventional processors of the same era. And of course, their mass production keeps the price low. What's the catch?
There are several. The first is that what they call a "floating-point operation" in video gaming isn't exactly what you'd want to use to design a turbine blade. Generally, gaming arithmetic is 32-bit precision and only meets the IEEE 754 standard in the sense of putting the exponent and mantissa bits in the usual place. Don't expect correct rounding or exception handling, for example. If you want the equivalent of 64-bit precision, you may have to manage a pair of 32-bit floating-point numbers that overlap their mantissas, like making a fishing pole by duct-taping a pair of yardsticks together.
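The yardstick trick has a respectable pedigree: it's the "double-single" technique, built on error-free transformations such as Knuth's TwoSum. Here is a sketch in Python, whose floats are 64-bit, so this demonstrates the principle rather than actual 32-bit gaming hardware behavior:

```python
def two_sum(a, b):
    """Knuth's error-free transformation: returns (s, e) such that
    s = fl(a + b) and s + e equals a + b exactly in real arithmetic.
    The second word e captures the rounding error the first word
    lost, which is exactly how a pair of narrow floats can act like
    one wide one."""
    s = a + b
    bv = s - a                 # the part of b that made it into s
    av = s - bv                # the part of a that made it into s
    e = (a - av) + (b - bv)    # what each operand lost to rounding
    return s, e

# The addend 2**-60 rounds away entirely when added to 1.0 in a
# single word; the second word of the pair recovers it exactly.
big, small = two_sum(1.0, 2**-60)
print(big, small)  # 1.0 and 2**-60
```

A full double-single (or double-double) arithmetic package chains transformations like this one through every add and multiply, which is why the emulated precision costs many times the speed of native wide arithmetic.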
The other catches all have to do with general programmability. Game platforms don't have anything like the usability and software base we take for granted with mainstream processors. They excel at creating an entertaining game-playing experience, but don't expect to find a free Fortran 95 compiler and MPI libraries for them that you can download off the web! And don't expect them to be bi-endian, that is, able to attach to either little-endian or big-endian hosts. Why should they? They were never designed to function as accelerators in the first place.
So the best fit for the approach of repurposing game processors as accelerators is when there is a wealth of programming talent, no need for 64-bit floating-point arithmetic, and the main goal is novel architecture exploration.
Graphics Processing Units (GPUs)
Now let's look at a very close cousin of the game processors: the graphics processing units (GPUs) available for servers, workstations, and personal computers.
The good news is that graphics processors (GPUs) are actually quite appropriate for certain operations in serious high-performance and technical computing. An obvious one is seismic data processing, which involves data well below 32-bit precision and can tolerate some slightly sloppy math in exchange for blazing fast single-precision speed. Some kinds of medical imaging have traditionally been able to use 32-bit precision, also. There's a technique in Quantum Chromodynamics (QCD) that allows use of 32-bit floats for the small complex matrices used in that science, even though the matrix multiplications are so approximate that the matrices tend to drift away from having unit determinant. Every so often, you renormalize the matrices so the determinant is one, and it keeps the calculation on track. In general, using GPUs means you'd better have plenty of experience with numerical analysis, or know someone who does.
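The renormalization step itself is simple: divide every entry of an n-by-n matrix by the n-th root of its determinant, and the determinant returns to one. Here is an illustrative 2-by-2 sketch in pure Python with made-up drift values; real QCD works with 3-by-3 SU(3) matrices and typically reunitarizes as well:

```python
def det2(m):
    """Determinant of a 2x2 complex matrix [[a, b], [c, d]]."""
    (a, b), (c, d) = m
    return a * d - b * c

def renormalize2(m):
    """Rescale a 2x2 complex matrix so its determinant is one.
    Dividing every entry by det**(1/2) divides the determinant
    by det, since each of the two rows contributes one factor."""
    scale = det2(m) ** (-1.0 / 2)
    return [[scale * x for x in row] for row in m]

# A matrix whose determinant has drifted away from one after many
# approximate low-precision multiplies (illustrative values):
drifted = [[1.01 + 0.02j, 0.10 - 0.01j],
           [-0.10 + 0.01j, 0.99 + 0.02j]]
fixed = renormalize2(drifted)
print(abs(det2(fixed)))  # ~1.0 again
```

Applied every so often, a cheap correction like this keeps an otherwise sloppy 32-bit calculation on track, which is the kind of numerical-analysis insight you need before trusting a GPU with serious work.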
Most GPU-based accelerators do not have much local memory, so think about what your application requires before selecting one as an accelerator. They may only have a few megabytes for a frame buffer, which is much smaller than what many HPTC applications require.
Another thing to watch is the wattage requirements of GPU boards. There's an escalation going on to get ever-higher specifications for polygons per second without regard to the power consumption, and some GPU boards are getting mighty toasty. A single accelerator (which can take up several slots or require an expansion chassis) might use over 200 watts, well beyond the usual 25 watt PCI standard, and thus require its own power supply. So, if your requirements include low power dissipation, note that this is one accelerator technology that does not seem to have a big advantage for flops-per-watt.
We all use GPUs without programming them; in personal computers, they're ubiquitous. They intercept OpenGL library calls and run them instead of burdening the general-purpose CPU with that work. This is the ideal form of acceleration, where you just add the accelerator to your system and performance goes up without any changes to the software. But the OpenGL calls don't have much resemblance to mainstream HPTC operations, so to use them for HPTC you have to set up a programming environment and learn enough to create your own routines for the GPU.
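That interception model is easy to mimic in software. The Python sketch below uses hypothetical names; a real accelerator library would patch the binding at link or load time and forward the call to the device, but the principle is the same: swap in an "accelerated" routine so existing callers pick it up with no source changes.

```python
import math

# Keep a handle to the original routine before interposing on it.
_original_cos = math.cos

def accelerated_cos(x):
    # Stand-in for a device call; a real accelerator runtime would
    # ship the work to the card here instead of calling the host
    # routine.
    return _original_cos(x)

# Interpose. Every caller that already uses math.cos now takes the
# "fast" path unmodified -- the same transparency GPUs achieve by
# intercepting OpenGL entry points.
math.cos = accelerated_cos

print(math.cos(0.0))  # 1.0, via the interposed routine
```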
GPUs have historically been quite a challenge to program, mainly because their architecture and system balances are so different from what people are accustomed to. An emerging trend is to create better software environments with which to exploit GPUs such as those made by NVIDIA and ATI, as well as IBM's Cell processor. This is one of the most exciting things happening in HPTC... a wave of efforts to make it simpler to cope with a heterogeneous node that incorporates special processors suited to various things. Companies, universities, and national laboratories are rapidly creating the infrastructure to support acceleration technologies in general. When the dust finally settles, we may see a standard that unifies accelerator use, similar to what happened when the various versions of message-passing libraries became unified into the MPI standard.
In summary, GPUs have a clear niche within the part of HPTC that can tolerate single-precision arithmetic, modest amounts of local memory, and an unfamiliar programming environment. For those applications, GPUs can offer an order of magnitude more speed than a general-purpose processor.
Accelerators Built for HPTC
Finally, there are accelerators designed specifically for 64-bit and 32-bit IEEE floating-point arithmetic and the needs of the HPTC community. While there are some university projects and startup companies developing such accelerators, the only extant commercial offering has been the ClearSpeed Advance card introduced in mid-2005. The idea of the card is to accelerate HPTC applications the way GPUs accelerate graphics applications: intercept library calls big enough to be worth accelerating, with no changes to the software. Instead of Gouraud-shaded polygons, the cards accelerate routines like those in LAPACK, where the re-use of matrix data easily makes it worth the time to send data to the card and back.
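The arithmetic behind that claim is worth spelling out: an n-by-n matrix multiply performs O(n^3) flops on O(n^2) data, so the flops available per byte shipped across the bus grow linearly with n. A quick sketch:

```python
def gemm_intensity(n):
    """Arithmetic intensity of an n x n double-precision matrix
    multiply: flops performed per byte shipped to and from the card.
    C = A*B does 2*n**3 flops; A and B travel out and C travels
    back, each occupying 8*n*n bytes."""
    flops = 2 * n**3
    bytes_moved = 3 * 8 * n * n
    return flops / bytes_moved   # simplifies to n / 12

# Intensity grows linearly with n, so big matrices amortize the bus:
# a 1200 x 1200 multiply already performs 100 flops for every byte
# moved, which is why LAPACK-style routines are worth the trip while
# low-reuse kernels like FFTs often are not.
print(gemm_intensity(1200))
```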
Right now, a ClearSpeed accelerator is about four times faster at 64-bit floating-point arithmetic than the fastest available x86 processor. Because it relies on parallelism (96 cores per chip) instead of a high clock rate to get that speed, the board consumes only 25 watts. And because the accelerator fits into PCI slots that might otherwise be empty, it neatly converts empty space into HPTC processing capability.
The main drawback of ClearSpeed boards is that they're too new to have a large body of software; the library routines aren't comprehensive yet, so early adopters may have to develop their own kernels just as with GPUs if they don't find their solution in the existing ClearSpeed libraries. Fortunately, ClearSpeed's Cn programming environment is very close to standard C, and since there is a minimum of one gigabyte of memory on the accelerator, you can work with substantial data sets and not depend on the bandwidth to the host memory. The accelerator chips have a system balance similar to that of the early Cray mainframes, where the speed comes from vector operations on explicitly managed memory tiers instead of primary and secondary caches.
For old-timers in the HPTC business like me, it's fun to see high-performance innovations from decades ago rediscovered by a new generation, and given new life with vastly improved hardware technology and new reasons to find them advantageous. For a while, it looked like supercomputing had degenerated from a vast range of innovative designs into the simple exercise of filling rooms with commodity servers. Accelerators have made things interesting again, because the person who knows how to select the right tool for the workload can gain a huge advantage, and be the first to find a scientific result or develop the best engineering design. HPTC users are exceptional people, and they need to make exceptional decisions in their choice of computing tools.
John Gustafson joined ClearSpeed in 2005 after leading high performance computing efforts at Sun Microsystems. He has 32 years experience using and designing compute-intensive systems, including the first matrix algebra accelerator and the first commercial massively-parallel cluster while at Floating Point Systems. His pioneering work on a 1024-processor nCUBE at Sandia National Laboratories created a watershed in parallel computing, for which he received the inaugural Gordon Bell Award. He also has received three R&D 100 Awards for innovative performance models, including the model commonly known as Gustafson's Law or Scaled Speedup. He received his B.S. degree from Caltech and his M.S. and Ph.D. degrees from Iowa State University, all in Applied Mathematics.
John will be discussing accelerator technologies at SC06 on Wednesday, November 15, Room 13, from 4:00 p.m. to 4:30 p.m.