April 06, 2007
Imagine working as a clerk in an electronics store, and a customer comes up to you and says, "I'm very interested in one of your new 8.1 surround sound systems, but it has to be able to play eight-track tapes. I have a lot invested in those tapes, all stacked up in my garage. I'm not buying any of these new audio systems until they play my eight-track tapes in full surround sound."
It sounds too ridiculous to actually happen, but I hear the equivalent almost every day in high performance computing: some customer demands that the most current and advanced hardware run algorithms that date from the Jimmy Carter administration, and run them well.
Discarding Dated Algorithms
There have been some amazing developments in computer design (like low-cost vector floating point hardware), but they come with a catch: you'll have to change your idea of the "best" algorithm if you really want to stay ahead in this game. In all the talk about how supercomputing helps us compete in the world economy, we often ignore one area: we cling to programming ideas that are decades out of date, and that intellectual inertia holds us back far more than the hardware does. It's not just that we keep legacy software a long time; it's that we hold onto legacy algorithms long past the time when they were a good match for their hardware. Discarding dated algorithms may be one of the most promising ways to advance our productivity.
Algorithms go out of date when the relative costs of computing resources change dramatically. Imagine that the year is 1967. Single processors are very expensive, costing hundreds of dollars per hour. Programmers and users are relatively cheap; they cost a few dollars per hour. Memory is very expensive; you can spend a million dollars for a megabyte of RAM. Memory access is cheap... so cheap that the time it takes is hidden by the time the computer grinds away doing arithmetic. In that era, programming sophisticated algorithms to save memory and FLOPS made a lot of sense. Compare this situation to today, four decades later: processors, memory, and floating point arithmetic are now dirt cheap, while the expensive resources are data motion and, above all, people's time.
We can point to some specific places where it may be time to change our ideas about those classic "best" algorithms we learned in our computer science courses.
Sparse Equations: Go Ahead, Multiply by Zero
I recently saw a presentation by a researcher struggling to solve a sparse system of linear equations. He assumed that the many weeks he was spending to operate only on the nonzero entries in his system would lead to higher performance. Like all sparse system programmers, his intent was to spare the computer from having to multiply by zero values and add the zero result, an obvious waste of a perfectly good arithmetic unit. To his chagrin and frustration, he discovered he had made his program slower, because now he was spending much more time "pointer chasing"... and he complained bitterly that computer vendors did not know how to build computers correctly, because surely it should not take long to fetch an index from memory and use it to look up a value.
Evidence is mounting that sparse solvers deserve reexamination. A paper published just this month by a prominent HPC researcher studied the use of FPGAs to reduce the time to solve sparse systems. A system of 4,000 symmetric equations with 2 percent nonzero elements took 67.3 seconds to solve on a dual-Xeon server using a sparse solver. After programming the FPGAs in VHDL and using other tools, the author was able to reduce the time to 33.8 seconds. Clever... but I noticed he did not mention how long it would have taken to simply solve it with the Linpack solver on that server, ignoring the sparsity and treating it as a dense, nonsymmetric system: less than three seconds!
Let that soak in for a minute. Even with the unnecessary arithmetic on all those zero elements, even ignoring the symmetry, the "wasteful" dense method is over an order of magnitude faster. The researcher spent thousands of dollars of his time to force-fit an old algorithm into a modern architecture, and wound up with a solution that is over ten times slower and hundreds of times more costly to implement. We are simply optimizing the wrong things. So go ahead, multiply by zero. Complicating the control flow to save scalar operations no longer pays.
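To make the comparison concrete, here is a minimal sketch of both approaches in Python with NumPy and SciPy (my choice of tools purely for illustration; the study itself used VHDL and a dual-Xeon server, not this code). It builds a synthetic system of the size described above and times a general-purpose sparse solve against a plain dense solve. The synthetic matrix and whatever timings you see on your machine are illustrative only and will not reproduce the paper's numbers.

# A minimal sketch, not the FPGA study's code: time a general sparse solver
# against a dense LAPACK solve on a synthetic system of the size described
# above -- 4,000 equations with roughly 2 percent of the entries nonzero.
import time

import numpy as np
import scipy.sparse as sp
import scipy.sparse.linalg as spla

n = 4000
density = 0.02

# Random sparse system; the diagonal is boosted so the matrix is well conditioned.
rng = np.random.default_rng(0)
A_sparse = sp.random(n, n, density=density, random_state=rng, format="csr")
A_sparse = A_sparse + sp.identity(n, format="csr") * n
b = rng.standard_normal(n)

# Sparse direct solve; fill-in during factorization can make this slow.
t0 = time.perf_counter()
x_sparse = spla.spsolve(A_sparse.tocsc(), b)
t_sparse = time.perf_counter() - t0

# "Wasteful" dense solve: expand to a full matrix and hand it to LAPACK,
# multiplying by all those zeros along the way.
A_dense = A_sparse.toarray()
t0 = time.perf_counter()
x_dense = np.linalg.solve(A_dense, b)
t_dense = time.perf_counter() - t0

print(f"sparse solve: {t_sparse:.2f} s, dense solve: {t_dense:.2f} s")
print("answers agree:", np.allclose(x_sparse, x_dense))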
"Wasting" Memory and Processors
Ah, you say; but what about all that memory you wasted by not exploiting sparsity? Well, a full dense matrix of that size takes 128 megabytes (4,000 × 4,000 entries at 8 bytes each), it's true. That much memory costs, hmm... about $20 lately, and it would be hard to find a dual-Xeon server with so little memory that the problem wouldn't fit. We may know that memory is cheap and getting cheaper, but it's hard to break the habit of seeking algorithms that are memory-stingy.
When I see people wringing their hands about not being able to keep every processor in a highly parallel array busy, I ask them if they're keeping every megabyte in their laptop busy. Probably not. Does that bother them? (Does it bother them to drive to work alone in a car that has seating for four or six people?) Should it? Of course not. If not, then why do researchers continue to plot "efficiency" curves that show the utilization of processors? The efficiency we should be tracking is that of the human using the system, not the hardware.
Moore's Law isn't getting us higher single-processor performance anymore. That stalled out a few years back. However, it continues inexorably to give us twice as many processors and twice as much memory every couple of years. It's inevitable that we will eventually stop making such a big deal about using only a fraction of the hardware, and that's going to free us to use much more productive algorithms.
Linpack "Stunt Machines"
People often say disparaging things about the TOP500 because it's based on dense matrix solver performance, whereas their real-world algorithms use sparse methods.
Many of us learned to express physics problems as nearest-neighbor interactions. If you express the problem that way, most of the matrix that represents the problem has entries that are zero. The classic algorithm courses teach us to do this, and then to use a sparse solver, as mentioned above, to get an elegant simulation of partial differential equations in action.
However, this brings up another way we may need to change: Find a way to express the physics that gives rise to less sparse systems. With higher-order approximations, or implicit formulations, or using an update operator that advances several time steps instead of a single time step, you may find you can get faster and more accurate physics simulations. It may be handy to express each variable as depending only on its nearest neighbors, but ultimately, every variable will affect every other variable. Hence, there should be an algorithm that expresses the problem as a completely dense system to solve.
In other words, instead of complaining that vendors are building Linpack "stunt machines," maybe we should look at ways to make our algorithms look more like Linpack. It is not as far-fetched (or as difficult) as it might first sound. Boundary element methods, Green's functions, implicit methods, and high-order polynomial approximations all start to make a lot of sense in an era when gigaflops of computing power are so cheap that you probably use them to run your screen saver.
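Here is a tiny illustration of the multi-step idea, sketched in Python/NumPy (my own example, not anything from a particular code): the one-step operator of an explicit 1-D heat equation is tridiagonal and very sparse, but its k-th power, which advances k time steps in a single application, fills in as k grows, so it can be applied as an ordinary dense matrix-vector product.

# A minimal sketch of the "multi-step update operator" idea, assuming a simple
# explicit 1-D heat equation with zero boundaries. The one-step operator A is
# tridiagonal; A**k advances k time steps at once and is effectively dense.
import numpy as np

n = 256          # grid points
k = 256          # time steps folded into one dense operator
r = 0.25         # dt / dx^2, kept below 0.5 for stability

# One-step explicit update: u_new = u + r * (u_left - 2*u + u_right)
A = np.eye(n) + r * (np.diag(np.ones(n - 1), -1) - 2 * np.eye(n) + np.diag(np.ones(n - 1), 1))

# Dense k-step propagator.
A_k = np.linalg.matrix_power(A, k)

u = np.zeros(n)
u[n // 2] = 1.0   # initial heat spike in the middle

u_step_by_step = u.copy()
for _ in range(k):
    u_step_by_step = A @ u_step_by_step     # k sparse-structured steps

u_dense = A_k @ u                           # one dense step, same answer

print("max difference:", np.abs(u_step_by_step - u_dense).max())
print("nonzero fraction of A:   ", np.count_nonzero(A) / A.size)
print("nonzero fraction of A^k: ", np.count_nonzero(A_k) / A_k.size)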
Unstructured Grids: Assume the Horse is a Sphere
The same people who like sparse methods often say, "We need an architecture that's good at unstructured grids. All our geometries are complicated, not like benchmark examples." They may not realize how much performance they are holding hostage by insisting that a modern computer painlessly gather and scatter memory randomly as needed, because that's what their decades-old algorithms require.
Just as with sparse matrices, it may be time to think the unthinkable: Waste a lot of the hardware, and make it up on the speed of a simpler structured approach that fits the new architectures. There's an old joke that when a physicist is asked how to make a race horse go faster, he replies "First, assume that the horse is a perfect sphere." But maybe there's something to that: If you assume a horse is embedded in a perfect cube, and put horse data where the horse is and nothing where the horse isn't, you might get a computer to simulate a horse much faster than if you do all the work to store data only where the horse actually occupies space. In other words, we may be trending toward methods that use uniform rectilinear grids wherever possible to perform physical simulations, even if it means a total waste of processors and memory locations in areas where nothing is happening. Do unstructured grid methods still make sense, in an era when megabytes cost pennies and processor cores just tens of dollars?
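A minimal sketch of the "horse in a cube" idea, again in Python/NumPy and purely illustrative (the ellipsoidal "horse" and all parameters are made up): allocate a full uniform grid, mark which cells the object actually occupies, and sweep a simple stencil over the entire rectilinear array. The cells outside the mask are pure waste, but every update is a regular, vectorizable operation with no gathering or scattering.

# Put horse data where the horse is and nothing where it isn't, then sweep
# the whole uniform grid with a regular stencil -- wasting the empty cells.
import numpy as np

nx, ny, nz = 128, 128, 128

# Hypothetical occupancy mask: an ellipsoidal "horse" embedded in the cube.
x, y, z = np.meshgrid(np.linspace(-1, 1, nx),
                      np.linspace(-1, 1, ny),
                      np.linspace(-1, 1, nz), indexing="ij")
inside = (x / 0.8) ** 2 + (y / 0.4) ** 2 + (z / 0.3) ** 2 <= 1.0

field = np.where(inside, 1.0, 0.0)   # data where the horse is, zeros elsewhere

def diffuse(u, steps=10, alpha=0.1):
    # Jacobi-style diffusion sweep over the *entire* rectilinear grid.
    for _ in range(steps):
        lap = (np.roll(u, 1, 0) + np.roll(u, -1, 0) +
               np.roll(u, 1, 1) + np.roll(u, -1, 1) +
               np.roll(u, 1, 2) + np.roll(u, -1, 2) - 6 * u)
        u = u + alpha * lap
        u[~inside] = 0.0             # the "wasted" cells simply stay empty
    return u

field = diffuse(field)
print("occupied fraction of the cube:", inside.mean())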
Two Algorithm Leaders: BAE Systems and Toyota
Some HPC users "get it." BAE Systems decided that the best way to stay competitive was to undertake a sweeping rewrite of their application software to adapt to the new realities of hardware system balance. As they do this, they're explicitly building in flexibility to deal with successive generations of changes after this one.
I worry about people at aerospace companies (and some national laboratories) who tell me, "We need eight bytes of memory bandwidth for every flop, and all memory shared, or we're not buying a new system." They might have to wait even longer than the eight-track tape customer will. In contrast, BAE Systems is planning for an era when heterogeneous, hybrid computing is everywhere, memory bandwidth per flop is low and heading lower, multicore processors are universal, and so on. Instead of complaining about the technological limits we're experiencing, they're embracing them and designing around them.
Toyota is another example of a company that "gets it." It is common practice to use high-productivity environments like MATLAB and Mathematica to prototype algorithms and, once the prototype works, convert it to C or Fortran for production use. Most of Toyota's computer-aided automotive design is done with personal systems running MATLAB, but they never bother to create a tuned, compiled version specific to their problems. Toyota's engineers quickly design complicated systems in the Simulink interactive graphical environment and test them out, essentially running a slow and inefficient interpretive language. To HPC veterans, that seems like a horrifyingly wasteful practice, but it's actually an example of excellent algorithm leadership: their overall productivity is extremely high because their approach optimizes the engineer's utilization, not the computer's.
We may now be in an era when engineers can prototype software at a very high level, and then keep it that way. Users and high-level application programmers have always had the universal language of mathematics, and the domain-specific jargon of each field. MATLAB, Mathematica, Maple and similar environments make people much more productive by letting them express their problems in a human-centric way. Toyota's approach places the performance burden exactly where it should be: on the low-level programmers who build those environments, and on the hardware designers who are then free to craft ways to accelerate their most-used kernel operations.
Those eight-track tape players are gone, folks. They're not coming back, and neither are high performance computers "balanced" for the algorithms of decades ago. Leadership in high performance computing means the ability and willingness to let go of those old methods and seek new ones that exploit things that are now dirt cheap (floating point arithmetic, operations on regular arrays, memory) and avoid wasting things that are now very expensive: data motion, scalar control logic... and people's time.
John Gustafson joined ClearSpeed in 2005 after leading high performance computing efforts at Sun Microsystems. He has 32 years experience using and designing compute-intensive systems, including the first matrix algebra accelerator and the first commercial massively-parallel cluster while at Floating Point Systems. His pioneering work on a 1024-processor nCUBE at Sandia National Laboratories created a watershed in parallel computing, for which he received the inaugural Gordon Bell Award. He also has received three R&D 100 Awards for innovative performance models, including the model commonly known as Gustafson's Law or Scaled Speedup. He received his B.S. degree from Caltech and his MS and PhD degrees from Iowa State University, all in Applied Mathematics.