HPCwire

Leading HPC
Solution Providers


























HPCwire >> Features

Algorithm Leadership


Imagine working as a clerk in an electronics store, and a customer comes up to you and says, "I'm very interested in one of your new 8.1 surround sound systems, but it has to be able to play eight-track tapes. I have a lot invested in those tapes, all stacked up in my garage. I'm not buying any of these new audio systems until they play my eight-track tapes in full surround sound."

It sounds too ridiculous to actually happen, but I hear the equivalent almost every day in high performance computing: some customer demands that the most current and advanced hardware runs algorithms that date from the Jimmy Carter administration, and runs them well.

Discarding Dated Algorithms

There have been some amazing developments in computer design (like low-cost vector floating point hardware), but they come with a catch: you'll have to change your idea about "best" algorithm if you really want to stay ahead in this game. In all the talk about how supercomputing helps us compete in the world economy, we often ignore one area: We cling to programming ideas that are decades out of date, and that intellectual inertia holds us back far more than the hardware does. It's not just that we keep legacy software a long time; it's that we hold onto legacy algorithms long past the time when they were a good match for their hardware. Discarding dated algorithms may be one of the most promising ways to advance our productivity.

The thing that makes algorithms go out-of-date is when the relative costs of computing resources changes dramatically. Imagine that the year is 1967. Single processors are very expensive, costing hundreds of dollars per hour. Programmers and users are relatively cheap; they cost a few dollars per hour. Memory is very expensive; you can spend a million dollars for a megabyte of RAM. Memory access is cheap... so cheap that the time it takes is hidden by the time the computer grinds away doing arithmetic. In that era, programming sophisticated algorithms to save memory and FLOPS made a lot of sense. Compare this situation to today, four decades later:

  • Scientists, engineers, and programmers now usually cost much more per hour than the cost per hour of the computer they are using.

  • A megabyte of RAM now costs pennies.

  • A processor now takes hundreds of times longer to read a number from main memory than it does to perform arithmetic on it with fifteen-digit accuracy.

  • Most of the cost in a computer design is in the bandwidth at every level of the memory hierarchy; the actual processing elements have become relatively cheap.

We can point to some specific places where it may be time to change our ideas about those classic "best" algorithms we learned in our computer science courses.

Sparse Equations: Go Ahead, Multiply by Zero

I recently saw a presentation by a researcher struggling to solve a sparse system of linear equations. He assumed that the many weeks he was spending to operate only on the nonzero entries in his system would lead to higher performance. Like all sparse system programmers, his intent was to spare the computer from having to multiply by zero values and add that zero result, an obvious waste of a perfectly good arithmetic unit. To his chagrin and frustration, he discovered he had made his program slower, because now he was spending much more time "pointer chasing"... and he complained bitterly that computer vendors did not know how to build computers correctly because it should not take long to get an index out of memory and use that index to look up the value.

Evidence is mounting that sparse solvers deserve reexamination. A paper published just this month by a prominent HPC researcher studied the use of FPGAs to reduce the time to solve sparse systems. A system of 4000 symmetric equations, 2 percent nonzero elements, took 67.3 seconds to solve on a dual-Xeon server using a sparse solver. After programming the FPGAs in VHDL and using other tools, the author was able to reduce the time to 33.8 seconds. Clever... but I noticed he did not mention how long it would have taken to simply solve it with the Linpack solver on that server, ignoring the sparsity and simply treating it as a dense, nonsymmetric system: Less than three seconds!

Let that soak in for a minute. Even with the unnecessary arithmetic on all those zero elements, even ignoring the symmetry, the "wasteful" dense method is over an order of magnitude faster. The researcher spent thousands of dollars of his time to force-fit an old algorithm into a modern architecture, and wound up with a solution that is over ten times slower and hundreds of times more costly to implement. We are simply optimizing the wrong things. So go ahead, multiply by zero. Complicating the control flow to save scalar operations no longer pays.

"Wasting" Memory and Processors

Ah, you say; but what about all that memory you wasted by not exploiting sparsity? Well, a full dense solver that size takes 128 megabytes, it's true. That much memory costs, hmm... about $20 lately, and it would be hard to find a dual-Xeon server with so little memory that you couldn't just run it. We may know that memory is cheap and getting cheaper, but it's hard to break the habit of seeking algorithms that are memory-stingy.

When I see people wringing their hands about not being able to keep every processor in a highly parallel array busy, I ask them if they're keeping every megabyte in their laptop busy. Probably not. Does that bother them? (Does it bother them to drive to work alone in a car that has seating for four or six people?) Should it? Of course not. If not, then why do researchers continue to plot "efficiency" curves that show the utilization of processors? The efficiency we should be tracking is that of the human using the system, not the hardware.

Moore's Law isn't getting us higher single-processor performance anymore. That stalled out a few years back. However, it continues inexorably to give us twice as many processors and twice as much memory every couple of years. It's inevitable that we will eventually stop making such a big deal about using only a fraction of the hardware, and that's going to free us to use much more productive algorithms.

Linpack "Stunt Machines"

People often say disparaging things about the TOP500 because it's based on dense matrix solver performance, whereas their real-world algorithms use sparse methods.

Many of us learned to express physics problems as nearest-neighbor interactions. If you express the problem that way, most of the matrix that represents the problem has entries that are zero. The classic algorithm courses teach us do this, and then use a sparse solver as mentioned above to get an elegant simulation of partial differential equations in action.

However, this brings up another way we may need to change: Find a way to express the physics that gives rise to less sparse systems. With higher-order approximations, or implicit formulations, or using an update operator that advances several time steps instead of a single time step, you may find you can get faster and more accurate physics simulations. It may be handy to express each variable as depending only on its nearest neighbors, but ultimately, every variable will affect every other variable. Hence, there should be an algorithm that expresses the problem as a completely dense system to solve.

In other words, instead of complaining that vendors are building Linpack "stunt machines," maybe we should look at ways to make our algorithms look more like Linpack. It is not as far-fetched (or as difficult) as it might first sound. Boundary element methods, Green's functions, implicit methods, and high-order polynomial approximations all start to make a lot of sense in an era when gigaflops of computing power are so cheap that you probably use them to run your screen saver.

Unstructured Grids: Assume the Horse is a Sphere

The same people who like sparse methods often say, "We need an architecture that's good at unstructured grids. All our geometries are complicated, not like benchmark examples." They may not realize how much performance they are holding hostage by insisting that a modern computer painlessly gather and scatter memory randomly as needed, because that's what their decades-old algorithms require.

Just as with sparse matrices, it may be time to think the unthinkable: Waste a lot of the hardware, and make it up on the speed of a simpler structured approach that fits the new architectures. There's an old joke that when a physicist is asked how to make a race horse go faster, he replies "First, assume that the horse is a perfect sphere." But maybe there's something to that: If you assume a horse is embedded in a perfect cube, and put horse data where the horse is and nothing where the horse isn't, you might get a computer to simulate a horse much faster than if you do all the work to store data only where the horse actually occupies space. In other words, we may be trending toward methods that use uniform rectilinear grids wherever possible to perform physical simulations, even if it means a total waste of processors and memory locations in areas where nothing is happening. Do unstructured grid methods still make sense, in an era when megabytes cost pennies and processor cores just tens of dollars?

Two Algorithm Leaders: BAE Systems and Toyota

Some HPC users "get it." BAE Systems decided that the best way to stay competitive was to undertake a sweeping rewrite of their application software to adapt to the new realities of hardware system balance. As they do this, they're explicitly building in flexibility to deal with successive generations of changes after this one.

I worry about people at aerospace companies (and some national laboratories) who tell me, "We need eight bytes of memory bandwidth for every flop, and all memory shared, or we're not buying a new system." They might have to wait even longer than the eight-track tape customer will. In contrast, BAE Systems is planning for an era when heterogeneous, hybrid computing is everywhere, memory bandwidth per flop is low and heading lower, multicore processors are universal, and so on. Instead of complaining about the technological limits we're experiencing, they're embracing them and designing around them.

Toyota is another example of an industry that "gets it." It is common practice to use high-productivity environments like MATLAB and Mathematica to prototype algorithms, and once the prototype works, convert it to C or Fortran for production use. Most of Toyota's computer-aided automotive design is done with personal systems running MATLAB, but they never bother to create a tuned, compiled version specific to their problems. Toyota's engineers quickly design complicated systems in the Simulink interactive graphical environment and test them out, essentially running a slow and inefficient interpretive language. To HPC veterans, that seems like a horrifyingly wasteful practice, but it's actually an example of excellent algorithm leadership: Their overall productivity is extremely high because their approach optimizes the engineer's utilization, not that of the computer.

We may now be in an era when engineers can prototype software at a very high level, and then keep it that way. Users and high-level application programmers have always had the universal language of mathematics, and the domain-specific jargon of each field. MATLAB, Mathematica, Maple and similar environments make people much more productive by letting them express their problems in a human-centric way. Toyota's approach places the performance burden exactly where it should be: on the low-level programmers who build those environments, and on the hardware designers who are then free to craft ways to accelerate their most-used kernel operations.

Summary

Those eight-track tape players are gone, folks. They're not coming back, and neither are high performance computers "balanced" for the algorithms of decades ago. Leadership in high performance computing means the ability and willingness to let go of those old methods and seek new ones that exploit things that are now dirt cheap (floating point arithmetic, operations on regular arrays, memory) and avoid wasting things that are now very expensive: data motion, scalar control logic... and people's time.

-----

John Gustafson joined ClearSpeed in 2005 after leading high performance computing efforts at Sun Microsystems. He has 32 years experience using and designing compute-intensive systems, including the first matrix algebra accelerator and the first commercial massively-parallel cluster while at Floating Point Systems. His pioneering work on a 1024-processor nCUBE at Sandia National Laboratories created a watershed in parallel computing, for which he received the inaugural Gordon Bell Award. He also has received three R&D 100 Awards for innovative performance models, including the model commonly known as Gustafson's Law or Scaled Speedup. He received his B.S. degree from Caltech and his MS and PhD degrees from Iowa State University, all in Applied Mathematics.


Article Tools

  • Print This Article
  • Contact the Author

Share & Save Options

Discussion

There are 0 discussion items posted.  

Sponsored Links



Top Headlines

Lustre to Battle Corruption

Oct 07 | GCN.com | Sun Microsystems has been busy building a lot more intelligence into Lustre, a file system used for large-scale cluster computing. Read more...

Oracle and HP's Database Machine Predicated on Voltaire

Oct 06 | The Register | Does the HP Oracle Database Machine represent InfiniBand's big chance to break out its HPC niche? Read more...

3D Imaging Spreads to Fashion and Beyond

Oct 06 | BusinessWeek | A body scan can save a lot of time in the fitting room, and fields from medicine to architecture are adopting 3D computing applications. Read more...

Structural Engineers and Computer Scientists Hope to Integrate Disciplines to 'Revolutionize Building Construction'

Oct 03 | UCSD News | Despite the evolution of computer science over the past 30 years, structural engineering -- hindered by a reluctance to adapt to digital innovations -- has remained relatively unchanged as a discipline. Read more...

Credit Crisis Spreads a Pall Over Silicon Valley

Oct 02 | New York Times | Silcon Valley is starting to feel the effects of the credit crunch. Read more...

Featured Whitepapers

Panasas® Tiered Parity™ Architecture

Sep 04 | | Disk drives are approximately 250 times denser today than a decade ago. This is good news for users who are creating, manipulating and storing more data than ever before. It gives them an opportunity to derive more value from their stored data and lowers the capital acquisition and operating expense associated with that data.

Multimedia

Video White Paper: Architecting a Better Network Storage Solution

BlueArc's Titan architecture represents an evolutionary step in file servers by creating a hardware-based file system that can scale bandwidth, IOPS, and overall data capacity well beyond conventional software-based devices. With its ability to virtualize a massive storage pool of up to four usable petabytes of tiered storage, Titan can scale with growing data requirements, offering a competitive advantage for businesses, researchers, or other enterprises seeking to better manage data growth while still ensuring optimal performance.

High Performance on Wall Street

Newsletters

Stay informed! Subscribe to HPCWire email Newsletters.

Get updates and insights on the High Productivity Computing industry delivered driectly to your inbox.






Featured Events

LCI Workshop
SIFMA
HP-CAST
2008 Virtualization Conference & Expo
Symposium 2009

HPC Job Bank