HPCwire

Less is More: Exploiting Single Precision Math in HPC


Some of the most widely used processors for high-performance computing today demonstrate much higher performance for 32-bit floating point arithmetic (single precision) than for 64-bit floating point arithmetic (double precision). These include the AMD Opteron, the Intel Pentium, the IBM PowerPC, and the Cray X1. These architectures demonstrate approximately twice the performance for single precision execution when compared to double precision.

And although not currently widely used in HPC systems, the Cell processor has even greater advantages for 32-bit floating point execution. Its single precision performance is 10 times better than its double precision performance.

At this point you might be thinking -- so what? Everyone knows double precision rules in HPC. And while that's true, the difference in performance between single precision and double precision is a tempting target for people who want to squeeze more computational power out of their hardware.

Apparently it was too tempting to ignore. Jack Dongarra and his fellow researchers at the Innovative Computing Laboratory (ICL) at the University of Tennessee have devised algorithms which use single precision arithmetic to do double precision work. Using this method, they have demonstrated execution speedups that correspond closely with the expected single precision performance characteristics of the processors.

The overall approach of the ICL team was to use single precision math whenever possible, especially for the most compute-intensive parts of the software, and then fall back to double precision only when necessary. Most applications use double precision math for the following reasons:

(1) To minimize the accumulation of round-off error,

(2) To handle ill-conditioned problems that require higher precision,

(3) To accommodate values that exceed the range of the 8-bit exponent defined by the IEEE floating point standard for 32-bit arithmetic, or

(4) To protect critical sections in the code which require higher precision.

But for many calculations these restrictions don't apply, or, if they do, they apply only to a portion of the calculation. According to Dongarra, the types of problems where single precision optimization would be most applicable include linear systems (dense and sparse), large sparse linear systems solved with iterative methods, and eigenvalue problems. These types of calculations apply to a wide range of applications in technical computing.

"The things we're doing relating to linear algebra and eigenvalue problems are really ubiquitous," said Dongarra. "So they touch on all areas of scientific research."

Dongarra said the Cell architecture received the initial interest from the ICL researchers because of the large differential between single and double precision performance. The Cell's double precision hardware attains a very respectable 25 Gigaflops (peak), but its single precision performance is a phenomenal 256 Gigaflops (also peak). The emphasis on 32-bit performance in the Cell architecture was the result of its initial target market -- games. So the focus of the ICL researchers was to come up with algorithms that would really exploit the 32-bit nature of the Cell and the very high performance that it can offer, but still obtain full 64-bit precision.

As has been noted in recent HPCwire coverage of the Cell, this architecture represents a relatively low-cost device with extremely high performance compared to current commodity processors. This is precisely what attracted the ICL team. Dongarra noted that there's a lot of interest in the Cell architecture right now for HPC and he thinks that's going to continue and grow.

Other processors show this performance discrepancy between single and double precision, although not to the extent of the Cell processor. Both the AMD Opteron and Intel Pentium processors have a two-to-one performance differential between single and double precision. In fact, Dongarra's team did their original work on PCs using MATLAB to test their algorithm. MATLAB has the capability to do a computation in both single and double precision, so they could easily compare the precision and performance differences. The researchers found that they could double the performance with their single precision algorithm, which corresponded to the floating point characteristics of the Intel Pentium processor on their machines.

"So even in a high-level programming paradigm, like MATLAB, we were able to extract that factor of two," said Dongarra. "And when we saw the performance, we realized that the Pentium also has this differential between single and double precision, in terms of the performance. It's something that we knew, but really didn't think about exploiting."

To take advantage of this approach systematically, the code changes would have to be done by hand, since the transformation is beyond what compiler technology can do automatically. In some cases, the changes just involve calling the appropriate single precision BLAS (Basic Linear Algebra Subprograms) routine to accomplish the 32-bit arithmetic. Other changes to the logic have to do with determining when the calculation needs to fall back to 64-bit precision.
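
One well-known way to organize such code is mixed-precision iterative refinement for a linear system Ax = b: factor and solve in single precision, then compute residuals and apply corrections in double precision. The article does not spell out the ICL algorithm, so the following Python sketch is an illustration under that assumption rather than the team's implementation. All function names are hypothetical, single precision is emulated by rounding every operation with the standard `struct` module, and a real code would call single precision BLAS routines in place of the hand-written loops.

```python
import struct

def f32(x):
    """Round an IEEE double to the nearest IEEE single (hypothetical helper)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(a):
    """LU factorization with partial pivoting, every operation rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))   # pivot row
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm

def lu_solve_f32(lu, perm, b):
    """Solve with the single precision factors (forward, then back substitution)."""
    n = len(lu)
    y = [f32(b[p]) for p in perm]
    for i in range(n):
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y

def residual(a, x, b):
    """r = b - A x, computed in double precision."""
    return [bi - sum(aij * xj for aij, xj in zip(row, x)) for row, bi in zip(a, b)]

def mixed_precision_solve(a, b, tol=1e-12, max_iter=20):
    """Return (x, converged); if converged is False, fall back to a double precision solve."""
    lu, perm = lu_factor_f32(a)                  # O(n^3) work in single precision
    x = lu_solve_f32(lu, perm, b)                # single precision first guess
    b_norm = max(abs(v) for v in b)
    for _ in range(max_iter):
        r = residual(a, x, b)                    # O(n^2) work in double precision
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, True
        d = lu_solve_f32(lu, perm, r)            # correction from the single precision factors
        x = [xi + di for xi, di in zip(x, d)]    # update kept in double precision
    return x, False

# Small, well-conditioned example system (made up for illustration).
a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
b = [1.0, 2.0, 3.0]
x, converged = mixed_precision_solve(a, b)
print(converged, max(abs(v) for v in residual(a, x, b)))
```

In this sketch the expensive O(n^3) factorization runs entirely in single precision, while the double precision work is limited to O(n^2) residuals and updates, which is why the overall speed can approach single precision performance. For ill-conditioned matrices the loop may fail to converge, and that is the point at which a code would fall back to a full double precision solve.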

"But the coding's pretty simple," said Dongarra. "As long as we have the underlying [BLAS] kernel routines implemented, it's not very complicated or tedious."

The effect on HPC software could be wide ranging. Many scientific computing applications -- everything from designing airplane wings to modeling the earthquake response of a building to measuring the energy levels of a molecule -- use this type of math to perform their calculations. Just doubling the speed of the underlying algorithms in these applications could have a significant effect on overall workload performance.

Even the standard Linpack benchmark could be sped up with this method. In this case, the optimized single precision version would have to be distinguished from the existing benchmark that has been used to measure supercomputer performance for the past 13 years.

"We haven't quite decided how we're going to resolve that," said Dongarra. "We may in fact include it, but put an asterisk next to it, just as they do for baseball."

-----

For more information about the ICL work referenced in this article, visit http://icl.cs.utk.edu/iter-ref/.

