FEATURES AND COMMENTARY
San Jose, CA - It is common to all computers, regardless of make, operating system or age. It lurks everywhere from the humble home PC to spacecraft that navigate the Martian atmosphere.
It’s the bug, glitch or bomb: the system crash that brings down the whole works, frustrating users, raising costs and reinforcing doubts about the overall reliability of technology in the 21st century.
After years of working alone on the problem, high-powered minds from academia, government and the private sector are putting aside competitive concerns and banding together to work on creating bulletproof computer systems. “We’re willing to put a lot of information on the table and share it on the theory that high tide raises all ships,” said Richard DeMillo, Hewlett-Packard Co.’s chief technology officer. “If we can bring up the level of dependability in the industry, then we all win.”
The High Dependability Computing Consortium, announced in December, held its first workshop this month to chart a course for stomping out bugs that affect everything from air traffic control to office networks. Initially funded with a $500,000 NASA grant and led by Carnegie Mellon University, the group includes industry heavyweights such as HP, IBM Corp. and Microsoft Corp. After 2½ days of discussions, participants agreed reliable computing must be a priority.
“High dependability is something that cuts across a lot of what companies do,” said David Garlan, a computer science professor at Carnegie Mellon. “It’s not their main line of business, but they need it. In some sense, sharing that is less threatening than sharing some proprietary feature.”
In the past, most discussion of fail-safe systems focused on national defense and nuclear power plants. But with the explosion of e-commerce and networked business databases, the entire high-tech industry is taking notice, said Garlan. The economic costs of computer crashes and downtime are staggering, though companies aren’t eager to reveal the scope of the problem.
“Most view reliability issues as dirty laundry,” said Dale Way, who led Y2K research for the Institute of Electrical and Electronics Engineers. “A lot of organizations have methods of reacting when things fail and keeping it in-house … It’s serious and expensive, but it’s buried inside the cost structure of organizations.”
In June 1999, a 22-hour outage on the Internet auction site eBay cost the company $3.9 million in lost fees. Sometimes, more than dollars are at stake. In December, officials at San Francisco International Airport halted installation of a new ground radar system after tests showed it was tracking airplanes that weren’t there. And national pride took a hit in 1999 when human and software glitches doomed both of NASA’s Mars spacecraft just as they were about to begin their missions at the Red Planet. One cost $125 million; the other $165 million.
“Both people in government and industry know we could do much better,” said Microsoft researcher Jim Gray, who leads the company’s San Francisco lab. “We’re doing the best we can with the technology we have, but there must be a better way.”
Computers and software as aware as HAL in the movie “2001: A Space Odyssey” or as reliable as the post office won’t be available anytime soon, but some obvious steps can be taken in the near future to improve technology. The biggest problem is that high quality is not always designed into software from the start. Garlan believes software developers should think and work more like engineers, who incorporate the lessons of failed bridges and buildings into future designs.
“All the things that you would find in a mechanical system would apply to software systems,” DeMillo said. “The appeal here is to build that set of engineering principles and build that culture of learning from failure and making sure it doesn’t happen again.” Beyond learning from spilled milk, computer system developers should try to understand the ecosystems in which their programs operate, and how humans can thwart the software, DeMillo said.
In 1996, America Online went offline for more than 19 hours when new hardware was being added to the system while new software was being downloaded. When the upgrade was finished, nothing worked. “This was probably the first time in the Internet world anyway that we had seen this particular combination of events,” DeMillo said. “You now know that you don’t change four or five things at the same time.”
Consortium partners also will develop ways to simulate how software works before it is deployed, much like how airplanes are tested by computers before they take flight. New techniques also are being explored that will give computers the ability to heal themselves. “Self-evolving systems could adapt on the fly and have the mechanism so that you can be sure the changes you make aren’t going to bring the whole thing down,” Garlan noted.
Sometimes, it’s more cost-effective to bite the bullet and rewrite code that has been recycled over the years, Garlan said. Both Apple’s and Microsoft’s next-generation operating systems, scheduled to be released this year, are abandoning old code in favor of programming that originated in the business world and has been proven to work well. That’s a good start, said Gary Chapman, director of the 21st Century Project at the University of Texas at Austin. “It’s surprising that there hasn’t been more of a sense of outrage and calls for reform on the part of consumers,” he said. “There’s a kind of rumbling out there that people are pretty annoyed with computers these days.”