Hancock, N.H. - Computer users have developed an unusual immunity to product failure, blithely accepting system freeze-ups, lost data and inexplicable behavior. Such tolerance is hard to understand in the modern cutthroat consumer market: Imagine the furor if a new line of automobiles routinely experienced engine failures the first week off the showroom floor.
As consumer products increasingly rely on VLSI system chips and complex operating systems that are more prone than ever to failure, that tolerance might erode.
"My digital camera deadlocks every now and then and forgets to retract the lens; my Palm Pilot often crashes when I sync it up with Outlook instead of the Palm Desktop. These are minicomputing systems, which are being bitten by the issues of their larger counterparts," observed Armando Fox, who heads the Recovery-Oriented Computing (ROC) initiative at Stanford University. The joint UC Berkeley-Stanford project is attempting to address the high failure rate of modern computer systems.
Rather than expecting complex hardware and software systems to be near-perfect, ROC takes system failure as a given and then looks for the most graceful way to recover from glitches. So instead of eliminating system failure, ROC tries to anticipate bugs at the system level and contain the damage they cause.
"Our perspective is that we can get to a high number of nines of availability by focusing on fast recovery-ROC-rather than exclusively trying to reduce failures, as we have in the past," said David Patterson, who heads the companion initiative at UC Berkeley. Patterson is best-known for evolving the reduced instruction-set computer (RISC) concept at Berkeley in the 1980s.
The public's tolerance of computer failure has been acquired over many decades in which hardware and software have inexorably increased in speed and capability while the level of unreliability has remained constant. That's no accident, in Patterson's view, since the computer industry has focused on speed to the exclusion of other relevant concerns.
"Since VLSI is a fast-moving technology, if the microprocessor is late, it is relatively slower. Doubling performance every 18 months (Moore's Law] works out to 4 percent per month, so every month you are late you slow down relatively 4 percent," he said. With that kind of pressure to produce high-performance systems quickly, it is no surprise that a high level of error has been tolerated. The assumption has been that the next generation of systems will be better-engineered and errors will be eliminated.
While perfection has not been the top priority of system builders, there has been a more recent push to field more failure-resistant systems, said George Candea, a Stanford researcher on the ROC initiative.
"Software engineers have tried to do this over the past few decades and brought about great improvements, but these improvements cannot keep up the pace with feature growth. Market pressure increases the number of lines of code in software products at a higher rate than any of these tools can eliminate bugs," Candea observed.
So the view of the ROC project is that glitches and errors are simply a fact of life for computer systems, and the best strategy is to build special software components that limit their damage. The ultimate goal, availability, is to maximize the time a computing resource is up and running properly. The conventional approach has been to focus on increasing the mean time to failure; ROC engineering instead looks at how to reduce the mean time to recovery. Indeed, a system can fail and then recover without the computer operator's even being aware that something has gone wrong.
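The arithmetic behind Patterson's "nines" comment is the standard steady-state availability formula: availability = MTTF / (MTTF + MTTR), where MTTF is mean time to failure and MTTR is mean time to recovery. A quick back-of-the-envelope sketch in Python (the failure and recovery figures are hypothetical, chosen only for illustration) shows that shrinking recovery time buys exactly as many nines as stretching the time between failures:

```python
def availability(mttf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability: the fraction of time the system is up."""
    return mttf_hours / (mttf_hours + mttr_hours)

# Hypothetical numbers: a system that fails once a month (720 hours)
# and takes 1 hour to recover.
baseline        = availability(720, 1.0)     # ~0.998613  (roughly two nines)
faster_recovery = availability(720, 0.01)    # ~0.999986  (four nines)
fewer_failures  = availability(72_000, 1.0)  # ~0.999986  (four nines)

print(f"baseline:         {baseline:.6f}")
print(f"100x faster MTTR: {faster_recovery:.6f}")
print(f"100x longer MTTF: {fewer_failures:.6f}")
```

Either lever yields four nines here, which is the ROC argument in miniature: if reducing failures has hit diminishing returns, recovery time is the term left to attack.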
Focusing on the mean time to recovery also opens up many strategies that appear to be paying off in the effort to keep systems functioning 24/7. For example, one approach is to partition a system into smaller units that can undergo a "micro-reboot" if some failure occurs in that module. This operation would not affect the other modules, whereas in a traditional architecture the entire system would have to be shut down and rebooted. Added software monitors the partitioned units of the system for failure modes.
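To make the micro-reboot idea concrete, here is a minimal sketch assuming a hypothetical supervisor watching independently restartable modules; the class and method names are invented for illustration and are not the ROC project's actual code:

```python
class Module:
    """A hypothetical, independently restartable unit of the system."""

    def __init__(self, name: str):
        self.name = name
        self.healthy = True

    def start(self) -> None:
        self.healthy = True
        print(f"[{self.name}] restarted")

    def stop(self) -> None:
        print(f"[{self.name}] stopped")


class Supervisor:
    """Watches the partitioned modules and micro-reboots only the one
    that failed; the rest of the system keeps running untouched."""

    def __init__(self, modules: list[Module]):
        self.modules = modules

    def check_and_recover(self) -> None:
        for module in self.modules:
            if not module.healthy:
                module.stop()    # reboot just this partition
                module.start()   # the other modules never go down


modules = [Module("auth"), Module("cart"), Module("search")]
modules[1].healthy = False                 # simulate a failure in one module
Supervisor(modules).check_and_recover()    # only "cart" is rebooted
```

The essential property is the loop's granularity: recovery touches one partition rather than the whole system, so the visible outage shrinks from a full reboot to a single module restart.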
Another counterintuitive strategy is "crash-only software," in which the system is forced into failure mode as soon as possible so that a problem can be cured quickly.
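The logic of crash-only design is that if crashing is the only way to stop and recovering is the only way to start, the recovery path gets exercised on every startup rather than only in emergencies. A minimal sketch of that discipline, with invented names, might look like this:

```python
import os

class CrashOnlyStore:
    """Sketch of a crash-only key-value store: no graceful shutdown
    exists, so startup always assumes the last run ended in a crash."""

    def __init__(self, log_path: str):
        self.log_path = log_path
        self.data: dict[str, str] = {}
        self._recover()                     # recovery IS initialization

    def _recover(self) -> None:
        # Replay the append-only log to rebuild consistent state.
        # Because this runs on every start, the recovery path is
        # tested as often as the normal startup path.
        if os.path.exists(self.log_path):
            with open(self.log_path) as log:
                for line in log:
                    key, _, value = line.rstrip("\n").partition("=")
                    self.data[key] = value

    def put(self, key: str, value: str) -> None:
        # Append to the log before updating memory, so a crash at
        # any point leaves a state that _recover() can rebuild.
        with open(self.log_path, "a") as log:
            log.write(f"{key}={value}\n")
        self.data[key] = value

    # Deliberately absent: no close()/shutdown() method. Stopping the
    # component means killing the process; restart equals recovery.
```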
Candea has been developing what he calls a "software fuse," a monitoring system that looks at the data entering and exiting a module. The monitor uses predetermined data limits and flags a problem if data is outside the limits. The computer operator can set the limits and monitor how often they are violated.
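The article doesn't include Candea's implementation, but based on that description a software fuse can be sketched as a wrapper that checks values crossing a module boundary against operator-set limits and counts violations (all names here are illustrative):

```python
class SoftwareFuse:
    """Flags and counts values that cross a module boundary outside
    the limits the operator configured."""

    def __init__(self, name: str, low: float, high: float):
        self.name = name
        self.low, self.high = low, high   # operator-settable limits
        self.violations = 0               # operator can monitor this count

    def check(self, value: float) -> float:
        if not (self.low <= value <= self.high):
            self.violations += 1
            raise ValueError(
                f"fuse '{self.name}' tripped: {value} outside "
                f"[{self.low}, {self.high}]")
        return value


# Guard data entering a hypothetical pricing module:
fuse = SoftwareFuse("order-total", low=0.0, high=10_000.0)
total = fuse.check(42.50)       # in range: passes through unchanged
# fuse.check(-5.0)              # would trip the fuse and flag a problem
```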
In addition to the fuse concept, a complete monitoring system includes database guards and what is known as a software cop; together these form a predictability harness that keeps the system within normal operating territory.
"The fuses, guards and cops are provided by the software vendor together with the product he ships, in the form of a predictability harness. In some sense, this provides insurance for the vendor that his software won't be exposed to an unusual environment," Candea said. "Operators using the software can then selectively enable/disable any of the three components of the predictability harness and can also extend them as needed."
Of course, all of this monitoring and recovery software exacts a penalty in the overall speed of a system, but the ROC movement accepts that as inevitable, characterizing current systems as "fast and flaky." The alternative would be to perfect current systems so that they could run at top speed without crashing.
"Computer science as I learned it was that things should be perfect, and you should keep working until the thing is perfect," said Patterson. In reality, however, "hardware breaks, software has bugs, operators make mistakes.
Deal with it."
Patterson believes that this philosophy should have been adopted right at the beginning of electronic computer technology. "We draw the analogy to bridge builders, whose railroad bridges failed regularly until they started including margins of safety into their designs," he said. But an exclusive focus on pushing up performance has meant that the critical work on fail-safe systems was never performed.
Increasing the reliability of computer systems was one motivation behind Patterson's work on RISC architecture. "A related issue was bugs in large microcode, which were prevalent in the minicomputers of the day. It struck me that it would be very expensive to include a repair mechanism for microcode of microprocessors, which you needed in minicomputers, so it made sense to have a simpler instruction set so that you needed very little microcode," he explained.
After the RISC approach caught on, Patterson became more involved in reliability issues, designing blade-style computers in which the CPU sat inside the disk storage unit and large clusters were connected by redundant Ethernet links. The goal was to reduce the cost and push up the performance of large computing resources.
"The feedback we got from the people we talked to in industry is that the real problem of servers wasn't better cost/performance, it was the difficulty of making things work properly and cost of ownership," he recalled. In fact, while the cost of computing resources has plummeted, the cost of maintaining computer systems has skyrocketed, although those costs are usually hidden inside corporations' IT staff budgets.
"We also had personal experience running a storage-oriented cluster for the San Francisco Museum," Patterson said. " Our painful experience started us on this road [to ROC design]."