Some of the most widely used processors for high-performance computing today demonstrate much higher performance for 32-bit floating point arithmetic (single precision) than for 64-bit floating point arithmetic (double precision). These include the AMD Opteron, the Intel Pentium, the IBM PowerPC, and the Cray X1. These architectures demonstrate approximately twice the performance for single precision execution when compared to double precision.
And although not currently widely used in HPC systems, the Cell processor has even greater advantages for 32-bit floating point execution. Its single precision performance is 10 times better than its double precision performance.
At this point you might be thinking — so what? Everyone knows double precision rules in HPC. And while that's true, the difference in performance between single precision and double precision is a tempting target for people who want to squeeze more computational power out of their hardware.
Apparently it was too tempting to ignore. Jack Dongarra and his fellow researchers at the Innovative Computing Laboratory (ICL) at the University of Tennessee have devised algorithms which use single precision arithmetic to do double precision work. Using this method, they have demonstrated execution speedups that correspond closely with the expected single precision performance characteristics of the processors.
The ICL team's overall approach was to use single precision math wherever possible, especially in the most compute-intensive parts of the software, and to fall back to double precision only when necessary. Most applications use double precision math for one or more of the following reasons:
(1) To minimize the accumulation of round-off error,
(2) To handle ill-conditioned problems that require higher precision,
(3) To accommodate values outside the range of the 8-bit exponent that the IEEE floating point standard defines for 32-bit arithmetic, or
(4) To protect critical sections of the code that require higher precision.
But for many calculations these restrictions don't apply, or if they do, they apply to only a portion of the calculation. According to Dongarra, the types of problems where single precision optimization would be most applicable include linear systems (dense and sparse), large sparse linear systems solved with iterative methods, and eigenvalue problems. These types of calculations appear in a wide range of technical computing applications.
“The things we're doing relating to linear algebra and eigenvalue problems are really ubiquitous,” said Dongarra. “So they touch on all areas of scientific research.”
Dongarra said the Cell architecture received the initial interest from the ICL researchers because of the large differential between its single and double precision performance. The Cell's double precision hardware attains a very respectable 25 gigaflops (peak), but its single precision performance is a phenomenal 256 gigaflops (also peak). The emphasis on 32-bit performance in the Cell architecture was the result of its initial target market: games. So the focus of the ICL researchers was to come up with algorithms that would really exploit the 32-bit nature of the Cell and the very high performance it can offer, while still obtaining full 64-bit precision.
As has been noted in recent HPCwire coverage of the Cell, this architecture represents a relatively low-cost device with extremely high performance compared to current commodity processors. This is precisely what attracted the ICL team. Dongarra noted that there's a lot of interest in the Cell architecture right now for HPC and he thinks that's going to continue and grow.
Other processors have this performance discrepancy between single precision and double precision, although not to the extent of the Cell processor. Both the AMD Opteron and Intel Pentium processors have a two-to-one differential between single and double precision performance. In fact, Dongarra's team did their original work on PCs, using MATLAB to test their algorithm. MATLAB can perform a computation in either single or double precision, so they could easily compare the differences in precision and performance. The researchers found that they could double the performance with their single precision algorithm, which corresponded to the floating point characteristics of the Intel Pentium processors in their machines.
“So even in a high-level programming paradigm, like MATLAB, we were able to extract that factor of two,” said Dongarra. “And when we saw the performance, we realized that the Pentium also has this differential between single and double precision, in terms of the performance. It's something that we knew, but really didn't think about exploiting.”
To take advantage of this approach systematically, the code changes would have to be made by hand, since the transformation is beyond what current compiler technology can do automatically. In some cases, the changes just involve calling the appropriate single precision BLAS (Basic Linear Algebra Subprograms) routine to perform the 32-bit arithmetic. Other changes to the logic involve determining when the calculation needs to fall back to 64-bit precision.
“But the coding's pretty simple,” said Dongarra. “As long as we have the underlying [BLAS] kernel routines implemented, it's not very complicated or tedious.”
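The article does not spell out the algorithm itself. One well-known scheme that matches the description is mixed precision iterative refinement for a dense linear system: factor the matrix in single precision, then repeatedly compute the residual in double precision and correct the solution with the cheap single precision factors. The sketch below assumes that scheme and is not the ICL team's code. It emulates 32-bit arithmetic in pure Python through the `struct` module, and the names `f32`, `lu_factor_f32`, `lu_solve_f32`, and `solve_mixed`, along with the tolerance and iteration limit, are hypothetical choices for illustration.

```python
import struct

def f32(x):
    """Round an IEEE double to the nearest IEEE single (emulated 32-bit math)."""
    return struct.unpack("f", struct.pack("f", x))[0]

def lu_factor_f32(a):
    """LU factorization with partial pivoting; every result is rounded to single."""
    n = len(a)
    lu = [[f32(v) for v in row] for row in a]
    perm = list(range(n))
    for k in range(n):
        p = max(range(k, n), key=lambda i: abs(lu[i][k]))
        if lu[p][k] == 0.0:
            raise ZeroDivisionError("matrix is singular in single precision")
        lu[k], lu[p] = lu[p], lu[k]
        perm[k], perm[p] = perm[p], perm[k]
        for i in range(k + 1, n):
            lu[i][k] = f32(lu[i][k] / lu[k][k])
            for j in range(k + 1, n):
                lu[i][j] = f32(lu[i][j] - f32(lu[i][k] * lu[k][j]))
    return lu, perm

def lu_solve_f32(lu, perm, b):
    """Forward and back substitution in emulated single precision."""
    n = len(lu)
    y = [f32(b[perm[i]]) for i in range(n)]
    for i in range(n):                      # solve L y = P b (unit diagonal)
        for j in range(i):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
    for i in reversed(range(n)):            # solve U x = y
        for j in range(i + 1, n):
            y[i] = f32(y[i] - f32(lu[i][j] * y[j]))
        y[i] = f32(y[i] / lu[i][i])
    return y

def solve_mixed(a, b, tol=1e-12, max_iter=30):
    """Solve a x = b; returns (x, refinement_steps). tol and max_iter are assumptions."""
    n = len(a)
    lu, perm = lu_factor_f32(a)             # expensive O(n^3) step, single precision
    x = lu_solve_f32(lu, perm, b)           # initial single precision solution
    b_norm = max(abs(v) for v in b)
    for step in range(max_iter):
        # Residual in double precision: only O(n^2) work per step.
        r = [b[i] - sum(a[i][j] * x[j] for j in range(n)) for i in range(n)]
        if max(abs(v) for v in r) <= tol * b_norm:
            return x, step
        d = lu_solve_f32(lu, perm, r)       # correction from the single precision factors
        x = [xi + di for xi, di in zip(x, d)]   # update accumulated in double
    raise ArithmeticError("no convergence; fall back to a full double precision solve")

a = [[4.0, 1.0, 2.0], [1.0, 5.0, 1.0], [2.0, 1.0, 6.0]]
x_true = [1.0 / 3.0, -2.0 / 7.0, 0.9]
b = [sum(a[i][j] * x_true[j] for j in range(3)) for i in range(3)]
x, iters = solve_mixed(a, b)
print("refinement steps:", iters)
print("max error:", max(abs(xi - ti) for xi, ti in zip(x, x_true)))
```

In production code the two single precision helpers would be replaced by calls to tuned single precision BLAS and LAPACK kernels, which is where any speedup would come from; the pure Python loops here only show the control flow, including the point at which a solver would give up and fall back to double precision.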
The effect on HPC software could be wide ranging. Many scientific computing applications — everything from designing airplane wings to modeling the earthquake response of a building to measuring the energy levels of a molecule — use this type of math to perform their calculations. Just doubling the speed of the underlying algorithms in these applications could have a significant effect on overall workload performance.
Even the standard Linpack benchmark could be sped up with this method. In this case, the optimized single precision version would have to be distinguished from the existing benchmark that has been used to measure supercomputer performance for the past 13 years.
“We haven't quite decided how we're going to resolve that,” said Dongarra. “We may in fact include it, but put an asterisk next to it, just as they do for baseball.”
For more information about the ICL work referenced in this article, visit http://icl.cs.utk.edu/iter-ref/.