June 16, 2006
"Software bugs are part of the mathematical fabric of the universe. It is impossible with a capital 'I' to detect or anticipate all bugs."
So says Ben Liblit, an assistant professor of computer sciences at the University of Wisconsin-Madison. An article describing his work appears in this week's issue of HPCwire.
Liblit's method for detecting software misbehavior enlists people with real applications to help attack bugs in their natural habitat. He does this by allowing users to define the nature of the bugs themselves -- crashing, hanging, invalid output, etc. -- and then instrumenting the application code accordingly so that it can capture the error condition as it occurs. The results are then gathered and analyzed to help identify the bugs and correct the code.
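The flavor of this approach can be sketched in a few lines of Python. This is a toy illustration only: the predicate names, the planted bug, and the ranking heuristic below are all invented for the example, not taken from Liblit's actual tools. The idea is that instrumentation records which simple predicates were observed true during each run, and predicates whose truth correlates strongly with failing runs point toward the bug.

```python
import random
from collections import Counter

random.seed(0)  # deterministic for the example

def run_app(x):
    """Toy instrumented application with a planted bug: it fails when x < 0."""
    observed = {"x < 0": x < 0, "x == 0": x == 0, "x > 100": x > 100}
    crashed = x < 0  # the planted bug
    return observed, crashed

# Tally how often each predicate was true in failing vs. passing runs.
true_in_failing = Counter()
true_in_passing = Counter()
for _ in range(1000):
    obs, crashed = run_app(random.randint(-50, 150))
    tally = true_in_failing if crashed else true_in_passing
    for pred, was_true in obs.items():
        if was_true:
            tally[pred] += 1

def failure_rate(pred):
    """Fraction of the runs where this predicate was true that failed."""
    f, p = true_in_failing[pred], true_in_passing[pred]
    return f / (f + p) if f + p else 0.0

ranked = sorted(true_in_failing | true_in_passing, key=failure_rate, reverse=True)
print(ranked[0])  # "x < 0" surfaces as the top suspect
```

In a real deployment the "runs" come from thousands of users' machines, each reporting a lightweight sample of predicate observations, which is what makes post-deployment debugging at scale feasible.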
Today Liblit's work is being used by the open source community as a way to do more rigorous post-deployment debugging on a variety of applications. Apparently it has also attracted the attention of IBM and Microsoft.
And me as well. I recently contacted Liblit to get his perspective on why software continues to be such a problematic piece of the information technology puzzle. In high performance computing, we tend to focus on the challenges of injecting parallelism into our code, but HPC also shares the larger problem of overall software quality. And as HPC applications become more complex in order to address multifaceted problems, the challenge to develop quality software will increase.
Liblit illustrates the basic limitation of software using the "halting problem," which can be described as follows: Given a program and its initial input, determine whether the program ever halts or continues to run forever. Seventy years ago, Alan Turing mathematically proved that an algorithm to solve the halting problem cannot exist. Essentially what he was saying was that if you were to try to write a program that would tell you whether other programs hang or not, there is no way that such a program, itself, is guaranteed not to hang. This may seem like just an inconvenient factoid for computer scientists, but it reveals a fundamental problem for anyone who develops software.
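Turing's diagonal argument can be made concrete in a few lines of Python. The sketch below, with invented names, shows why no claimed halting-decider can be right: from any such decider we can build a "contrary" program that does the opposite of whatever the decider predicts about it.

```python
def make_contrary(halts):
    """Given any claimed decider halts(program, input), build a program
    that defeats it on the diagonal case contrary(contrary)."""
    def contrary(p):
        if halts(p, p):       # decider claims p(p) halts...
            while True:       # ...so loop forever instead
                pass
        return "halted"       # decider claims p(p) loops, so halt at once
    return contrary

# Any concrete decider is wrong about its own contrary program.
always_no = lambda program, data: False   # claims no program ever halts
c = make_contrary(always_no)
print(c(c))  # halts immediately, contradicting the decider's "never halts"
```

A decider that answered True instead would send `contrary(contrary)` into an infinite loop, contradicting it in the other direction; since every possible decider fails on some input, none can exist.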
"Mathematically it is impossible to take a non-trivial piece of code and prove that it never hangs," says Liblit. "It's not that we haven't been smart enough to figure out how to do it; we're smart enough to have figured out that it can't be done!"
Liblit goes on to characterize software as a chaotic system, with extreme sensitivity to initial conditions. That means it's very hard to predict how it is going to behave during execution. And that's why, despite all sorts of software testing methodologies that are being used today, bugs continue to inhabit our production code.
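A loose illustration of that sensitivity, using a cryptographic hash as a stand-in for a long chain of branching computation: change one byte of the input and the output is unrecognizably different. The choice of SHA-256 here is just a convenient example of software whose output cannot be usefully predicted from a "nearby" input.

```python
import hashlib

a = hashlib.sha256(b"the quick brown fox").hexdigest()
b = hashlib.sha256(b"the quick brown fpx").hexdigest()  # one byte differs

# Count how many of the 64 hex digits match position-for-position.
same = sum(x == y for x, y in zip(a, b))
print(f"{same}/64 positions agree")
```

Real applications are rarely this extreme at every step, but a single off-by-one or unexpected input value can likewise send execution down a path the tests never exercised.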
This got me to thinking about the nature of the hardware-software dichotomy, which seems to be especially noticeable in high performance computing, but exists across the entire IT industry. And that leads to the question: Why is hardware advancing so rapidly and software not? As processors increase in performance every year, the code running on them is not much better than it was ten years ago. There is no Moore's Law for software.
This is not to suggest that hardware doesn't fail. But hardware failures mostly involve physical breakdowns -- crashing disks, dropping bits, etc. The Mean Time Between Failure (MTBF) characteristic is usually well accounted for during system design. For example, Google's cluster management software expects servers to malfunction on a regular basis and can reroute search engine processing rather transparently. These types of problems are manageable because they're predictable.
Hardware logic errors are more rare, but they do occur. For example, the famous Pentium floating-point-divide bug of 1994 precipitated a chip recall. But why aren't these types of problems seen more frequently? There may be a few things at work here. One is that there's so much more software logic than hardware logic in the world. For every microprocessor, like the Pentium, there are thousands or tens of thousands of applications. And the software developers that wrote those applications probably didn't perform the level of testing that Intel applied to its Pentium chip design.
Another difference is that many applications are more complex than a typical CPU -- in some cases, much more complex. On my PC at work, the Windows XP OS and some of the associated applications are regularly updated with patches, presumably to fix software problems. To its credit, XP is much more stable than its predecessors as far as crash frequency, but new bugs are being discovered weekly. This is not too surprising. XP along with the applications on a typical PC workstation represent tens of millions of lines of source code.
Don't make the mistake of thinking processors are getting more complex because the transistor count is going up. Today, the increase in transistors mostly has to do with adding cores and increasing cache size. These don't add logic complexity. The new "Montecito" Itanium microprocessor contains about 1.7 billion transistors, but only about 20 million or so are in the CPU logic. In fact, the move to multi-core should actually make the hardware simpler, since each core is expected to do proportionately less work.
Software is heading in the other direction. As users demand more features and functionality from their applications, the code gets ever more complex. Windows NT 3.1 had around 6 million lines of source code; Windows XP contains over 40 million lines. But as programs become more complex, they also become more susceptible to bugs. The public perception is that the hardware makers are heroes, while the software developers have let us down.
Even within the industry, there seems to be a perception that hardware and software are symmetrical elements of a computing system. The expectation is that both technologies should be able to advance in concert. But the symmetry is an illusion. Processors have become multi-core as part of a well-defined technology roadmap. Meanwhile, the corresponding move to application parallelism has become a crisis. Software seems to be much more resistant to engineering than hardware.
"I don't know that we're doing a very good job of communicating that to the public, and maybe to software engineers," says Liblit. "I don't think software engineers appreciate the near impossibility of doing their job right."
But it's not hopeless. Software is getting more robust. Again, just look at XP. Applications don't have to be perfect to be useful. The text editor program I'm using to compose this article occasionally goes a little nutty and adds a bunch of blank characters at the end of the file. I just delete them and go on.
But some users can't afford to be so forgiving. If your application is managing a stock portfolio for thousands of investors or controlling a nuclear warhead, losing track of data can have serious consequences. Code for mission-critical systems must be held to a higher standard -- safety-critical code, even more so. Productivity is one thing, but when someone's money or life is at stake, buggy software is not an option. Software engineering advancements are truly needed. Are any solutions emerging? The answer to that will have to wait for a future article.
As always, comments about HPCwire are welcomed and encouraged. Write to me, Michael Feldman, at firstname.lastname@example.org.
Posted by Michael Feldman - June 15, 2006 @ 9:00 PM, Pacific Daylight Time
Michael Feldman is the editor of HPCwire.