The Leading Source for Global News and Information Covering the Ecosystem of High Productivity Computing
June 16, 2006
"Software bugs are part of the mathematical fabric of the universe. It is impossible with a capital 'I' to detect or anticipate all bugs."
So says Ben Liblit, an assistant professor of computer sciences at the University of Wisconsin-Madison. An article describing his work appears in this week's issue of HPCwire.
Liblit's method to detect software misbehavior enlists people with real applications to help attack bugs in their natural habitat. He does this by allowing users to define the nature of the bugs themselves -- crashing, hanging, invalid output, etc. -- and then instrumenting the application code accordingly so that it can capture the error condition as it occurs. The results are then gathered and analyzed to help identify the bugs and correct the code.
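To make the idea concrete, here is a minimal sketch in Python. It is not Liblit's actual tooling; the class `RunReport`, the toy function `divide_all`, the predicate names and the scoring rule are all assumptions made up for illustration. Each run records which instrumented conditions were ever true, the run is labeled as failed or not, and the conditions are then ranked by how much more often they show up in failing runs.

```python
from collections import Counter


class RunReport:
    """Predicates observed true during one run, plus the run's outcome."""

    def __init__(self):
        self.true_predicates = set()
        self.failed = False

    def observe(self, name, condition):
        # Instrumentation point: remember the predicate if it ever held.
        if condition:
            self.true_predicates.add(name)
        return condition


def divide_all(values, divisor, report):
    """Toy application (hypothetical) with two instrumented predicates."""
    report.observe("divisor == 0", divisor == 0)
    report.observe("values is empty", len(values) == 0)
    try:
        return [v / divisor for v in values]
    except ZeroDivisionError:
        report.failed = True  # the user-defined "bug" here is a crash
        return None


def collect_reports(inputs):
    """Run the toy application once per input and keep each run's report."""
    reports = []
    for values, divisor in inputs:
        report = RunReport()
        divide_all(values, divisor, report)
        reports.append(report)
    return reports


def rank_predicates(reports):
    """Score each predicate: share of failing runs where it was true
    minus share of successful runs where it was true."""
    failed = [r for r in reports if r.failed]
    passed = [r for r in reports if not r.failed]
    in_failed = Counter(p for r in failed for p in r.true_predicates)
    in_passed = Counter(p for r in passed for p in r.true_predicates)
    scores = {}
    for p in set(in_failed) | set(in_passed):
        f = in_failed[p] / len(failed) if failed else 0.0
        s = in_passed[p] / len(passed) if passed else 0.0
        scores[p] = f - s
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)


reports = collect_reports([([1, 2], 2), ([], 3), ([4], 0), ([5, 6], 0), ([], 1)])
ranking = rank_predicates(reports)
print(ranking[0])  # ('divisor == 0', 1.0)
```

In this toy setup the predicate that tracks the crash floats to the top, while the harmless one ("values is empty") sinks. A real deployment would gather reports from many users and would need a more careful statistical treatment than this simple difference of frequencies.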
Today Liblit's work is being used by the open source community as a way to do more rigorous post-deployment debugging on a variety of applications. Apparently it has also attracted the attention of IBM and Microsoft.
And me as well. I recently contacted Liblit to get his perspective on why software continues to be such a problematic piece of the information technology puzzle. In high performance computing, we tend to focus on the challenges of injecting parallelism into our code, but HPC also shares the larger problem of overall software quality. And as HPC applications become more complex in order to address multifaceted problems, the challenge to develop quality software will increase.
Liblit illustrates the basic limitation of software using the "halting problem," which can be described as follows: Given a program and its initial input, determine whether the program eventually halts or runs forever. Seventy years ago, Alan Turing mathematically proved that no algorithm can solve the halting problem for all programs and inputs. Essentially, if you try to write a program that tells you whether other programs hang, that program cannot be both always correct and guaranteed to return an answer itself. This may seem like just an inconvenient factoid for computer scientists, but it reveals a fundamental problem for anyone who develops software.
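The heart of Turing's argument can be sketched in a few lines of Python. This is an illustration of the idea, not a formal proof, and the names `make_contrarian` and `claims_everything_loops` are hypothetical. Take any candidate checker `halts(program)` and build a program that asks the checker about itself, then does the opposite of the prediction.

```python
def make_contrarian(halts):
    """Build a program that does the opposite of whatever `halts` predicts
    about that very program."""

    def contrarian():
        if halts(contrarian):
            while True:  # predicted to halt, so loop forever
                pass
        return "halted"  # predicted to loop forever, so halt at once

    return contrarian


def claims_everything_loops(program):
    # A deliberately naive stand-in for a checker: it always answers
    # "this program never halts".
    return False


contrarian = make_contrarian(claims_everything_loops)
print(contrarian())  # halted
```

The checker said the program would never halt, and it halted. A checker that answered "halts" instead would be wrong in the other direction, since the contrarian program would then loop forever (which is why that case is not run here). Whatever answer a checker gives about its own contrarian program, that answer is wrong.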
"Mathematically it is impossible to take a non-trivial piece of code and prove that it never hangs," says Liblit. "It's not that we haven't been smart enough to figure out how to do it; we're smart enough to have figured out that it can't be done!"
Liblit goes on to characterize software as a chaotic system, with extreme sensitivity to initial conditions. That means it's very hard to predict how it is going to behave during execution. And that's why, despite all sorts of software testing methodologies that are being used today, bugs continue to inhabit our production code.
This got me thinking about the nature of the hardware-software dichotomy, which seems to be especially noticeable in high performance computing, but exists across the entire IT industry. And that leads to the question: Why is hardware advancing so rapidly and software not? While processors increase in performance every year, the code running on them is not much better than it was ten years ago. There is no Moore's Law for software.
This is not to suggest that hardware doesn't fail. But hardware failures mostly involve physical breakdowns -- crashing disks, dropping bits, etc. The Mean Time Between Failures (MTBF) characteristic is usually well accounted for during system design. For example, Google's cluster management software expects servers to malfunction on a regular basis and can reroute search engine processing rather transparently. These types of problems are manageable because they're predictable.
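A bit of back-of-the-envelope arithmetic shows why this kind of failure is predictable enough to plan for. The figures below (10,000 servers, a 30,000-hour MTBF per server) are made-up assumptions for illustration, not numbers from Google or any vendor, and the calculation assumes independent servers with a constant failure rate.

```python
def expected_failures_per_day(servers, mtbf_hours):
    """Average number of server failures per day across the whole cluster."""
    return servers * 24.0 / mtbf_hours


def cluster_mtbf_hours(servers, mtbf_hours):
    """Average time between failures somewhere in the cluster."""
    return mtbf_hours / servers


# Hypothetical cluster: 10,000 servers, each with a 30,000-hour MTBF.
print(expected_failures_per_day(10_000, 30_000))  # 8.0
print(cluster_mtbf_hours(10_000, 30_000))         # 3.0
```

Under these assumptions, a server dies about every three hours. That sounds alarming, but it is a number a system designer can budget for, which is exactly what makes it manageable.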
Hardware logic errors are rarer, but they do occur. For example, the famous Pentium floating-point-divide bug of 1994 precipitated a chip recall. But why aren't these types of problems seen more frequently? There may be a few things at work here. One is that there's so much more software logic than hardware logic in the world. For every microprocessor, like the Pentium, there are thousands or tens of thousands of applications. And the software developers who wrote those applications probably didn't perform the level of testing that Intel applied to its Pentium chip design.
Another difference is that many applications are more complex than a typical CPU -- in some cases, much more complex. On my PC at work, the Windows XP OS and some of the associated applications are regularly updated with patches, presumably to fix software problems. To its credit, XP is much more stable than its predecessors in terms of crash frequency, but new bugs are being discovered weekly. This is not too surprising. XP, along with the applications on a typical PC workstation, represents tens of millions of lines of source code.
Don't make the mistake of thinking processors are getting more complex because the transistor count is going up. Today, the increase in transistors mostly has to do with adding cores and increasing cache size. These don't add logic complexity. The new "Montecito" Itanium microprocessor contains about 1.7 billion transistors, but only about 20 million or so are in the CPU logic. In fact, the move to multi-core should actually make the hardware simpler, since each core is expected to do proportionately less work.
Software is heading in the other direction. As users demand more features and functionality from their applications, the code gets ever more complex. Windows NT 3.1 had around 6 million lines of source code; Windows XP contains over 40 million lines. But as programs become more complex, they also become more susceptible to bugs. The public perception is that the hardware makers are heroes, while the software developers have let us down.
Even within the industry, there seems to be a perception that hardware and software are symmetrical elements of a computing system. The expectation is that both technologies should be able to advance in concert. But the symmetry is an illusion. Processors have become multi-core as part of a well-defined technology roadmap. Meanwhile, the corresponding move to application parallelism has become a crisis. Software seems to be much more resistant to engineering than hardware.
"I don't know that we're doing a very good job of communicating that to the public, and maybe to software engineers," says Liblit. "I don't think software engineers appreciate the near impossibility of doing their job right."
But it's not hopeless. Software is getting more robust. Again, just look at XP. Applications don't have to be perfect to be useful. The text editor program I'm using to compose this article occasionally goes a little nutty and adds a bunch of blank characters at the end of the file. I just delete them and go on.
But some users can't afford to be so forgiving. If your application is managing a stock portfolio for thousands of investors or controlling a nuclear warhead, losing track of data can have serious consequences. Code for mission-critical systems must be held to a higher standard -- safety-critical code, even more so. Productivity is one thing, but when someone's money or life is at stake, buggy software is not an option. Software engineering advancements are truly needed. Are any solutions emerging? The answer to that will have to wait for a future article.
-----
As always, comments about HPCwire are welcomed and encouraged. Write to me, Michael Feldman, at editor@hpcwire.com.
Posted by Michael Feldman - June 16 @ 12:00AM
Michael Feldman is the editor of HPCwire.