News & Analysis
Chip, heal thyself
Richard Goering
8/28/2006 9:00 AM EDT
Austin and Bertacco are co-authors of two papers that describe some initial research into self-healing chips. The first of these, given at the International Symposium on High-Performance Computer Architecture (HPCA) in February, includes authors from the University of Texas at Austin and discusses a defect-tolerant chip multiprocessor (CMP) switch architecture.
That paper contributes a high-level modeling approach for silicon failures and describes a CMP switch router architecture that incorporates system-level checking and recovery, component-level fault diagnosis and spare-part reconfiguration. This "Bulletproof" switch design claims to be more robust and less costly than existing approaches, including triple modular redundancy (TMR) and error-correcting codes.
The defect-tolerant switch design, aimed at multicore ICs, detects data-corrupting errors through cyclic redundancy checkers at the switch's output channels. Recovery logic is added to the input buffers. To detect errors that cause functional incorrectness, the design uses buffer checker units, extra routing-logic units and an extra switch arbiter. The area overhead is only 10 percent, according to the paper.
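The detect-and-recover idea described above can be sketched in a few lines of Python. This is only an illustration under stated assumptions, not the paper's circuit: the function names, the retry limit and the use of CRC-32 from Python's zlib module are all hypothetical choices. A copy of each packet stays in the input buffer, a checker at the output recomputes the CRC, and a mismatch triggers a resend from the buffered copy.

```python
import zlib


def send_through_switch(payload, crossbar, max_retries=3):
    """Send payload through a (possibly faulty) crossbar with a CRC check at the output.

    Hypothetical model: `crossbar` is any callable that takes bytes and
    returns bytes. Returns (received_bytes, attempts_used).
    """
    buffered = bytes(payload)        # copy retained in the input buffer
    expected = zlib.crc32(buffered)  # checksum computed before the crossing
    for attempt in range(1, max_retries + 1):
        received = crossbar(buffered)
        if zlib.crc32(received) == expected:  # checker at the output channel
            return received, attempt
        # Mismatch: the data was corrupted in flight, so resend from the buffer.
    raise RuntimeError("persistent corruption; diagnosis and repair would be needed")


def make_flaky_crossbar():
    """Return a crossbar that flips one bit on its first use only (assumed fault)."""
    calls = {"count": 0}

    def crossbar(data):
        calls["count"] += 1
        if calls["count"] == 1:
            return bytes([data[0] ^ 0x01]) + data[1:]
        return data

    return crossbar


received, attempts = send_through_switch(b"flit", make_flaky_crossbar())
```

In this sketch the first crossing is corrupted and caught, and the second succeeds. A permanent defect would exhaust the retries, which is where the diagnosis and spare-part reconfiguration described in the paper would take over.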
A second paper, to be given at the Architectural Support for Programming Languages and Operating Systems (Asplos) conference in San Jose, Calif., in October, discusses more recent work. It outlines a very specific solution for very long instruction word (VLIW) architectures that uses their natural redundancy to facilitate repair.
The paper introduces the Bulletproof pipeline, described as "the first ultralow-cost mechanism to protect a microprocessor pipeline and on-chip memory system from silicon defects." This goal is achieved through online, built-in self-test (BIST) techniques combined with system-level checkpointing. For a four-wide VLIW processor with 32 kbytes of instruction and data cache, the approach claims to achieve 89 percent silicon defect coverage with only a 5.8 percent cost in area, along with a 4 to 18 percent performance degradation after a defect is found.
The approach uses a microarchitectural checkpointing technique to create "epochs" of execution during which BIST is used to validate the integrity of the underlying hardware. If a defect is found, Austin noted, this approach makes it possible to "roll back" time to the last point where there were no defects. Recovery to a correct state is accomplished by flushing the pipeline and copying the architectural registers from a backup register file.
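A toy model may help make the epoch idea concrete. The Python sketch below is an assumption-laden illustration rather than the Bulletproof implementation: the class and method names are hypothetical, the self-test is a stand-in callable, and memory state is ignored. Results of an epoch become the new checkpoint only if the self-test passes; otherwise the registers are restored from the backup copy.

```python
class EpochMachine:
    """Toy register machine with one checkpoint per epoch (hypothetical model)."""

    def __init__(self, registers):
        self.registers = dict(registers)  # working architectural registers
        self.backup = dict(registers)     # backup register file (last good state)

    def run_epoch(self, instructions, self_test):
        """Execute one epoch, then commit or roll back based on the self-test.

        `instructions` is a list of callables that mutate the register dict.
        `self_test` is a callable returning True if the hardware tested clean.
        Returns True if the epoch was committed, False if it was rolled back.
        """
        for instruction in instructions:
            instruction(self.registers)   # results stay tentative for now
        if self_test():
            self.backup = dict(self.registers)  # hardware was good: checkpoint
            return True
        self.registers = dict(self.backup)      # defect found: roll back
        return False


def increment_r0(registers):
    registers["r0"] += 1


machine = EpochMachine({"r0": 0})
committed = machine.run_epoch([increment_r0], self_test=lambda: True)
rolled_back = not machine.run_epoch([increment_r0] * 5, self_test=lambda: False)
```

After the second epoch fails its self-test, `r0` returns to the value committed at the end of the first epoch, which is the "roll back" in time that Austin described.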
Faulty components are then removed from future operations, and the system is kept running in a degraded-performance mode. Faulty functional units such as ALUs, multipliers and decoders are disabled from future use. Faulty register file entries are repaired using a replacement register. And faulty cache lines are excluded using a 2-bit register in the least recently used logic.
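The cost of running in degraded mode can be illustrated with a rough Python sketch. Everything here is a simplifying assumption made for illustration: the unit names, the idea that work divides evenly across identical units, and the operation counts are hypothetical and are not taken from the paper.

```python
class FunctionalUnitPool:
    """Pool of identical functional units, some of which may be disabled."""

    def __init__(self, unit_names):
        self.healthy = list(unit_names)
        self.disabled = []

    def disable(self, name):
        """Remove a unit found to be faulty from all future scheduling."""
        self.healthy.remove(name)
        self.disabled.append(name)

    def cycles_for(self, num_ops):
        """Cycles needed to issue num_ops independent operations.

        Assumes each healthy unit accepts one operation per cycle.
        """
        if not self.healthy:
            raise RuntimeError("no healthy unit left; no redundancy to fall back on")
        return -(-num_ops // len(self.healthy))  # ceiling division


pool = FunctionalUnitPool(["alu0", "alu1", "alu2", "alu3"])
cycles_before = pool.cycles_for(8)  # four healthy ALUs
pool.disable("alu2")                # defect found in one ALU
cycles_after = pool.cycles_for(8)   # three healthy ALUs
```

With these made-up numbers, eight operations take two cycles on four units and three cycles on three. The slowdown in this toy case is far larger than the 4 to 18 percent reported in the paper, since real programs rarely keep every unit busy on every cycle.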
To make this work, the design must include enough redundancy to allow the disabling of faulty functional units. "It's a cost/performance trade-off," Austin said. "If you don't provide redundancy, you'll have very slow, expensive recoveries." But unlike traditional techniques, the Bulletproof approach doesn't require redundancy to detect errors, he noted.
Still some limitations
Limitations and trade-offs cited in the Asplos paper provide fertile ground for the new three-year research project. One is the performance degradation that occurs after error recovery and repair. Designers can "overprovision" components that are highly critical for maintaining system performance, Austin said.
Another limitation is that the current Bulletproof VLIW pipeline doesn't handle transient errors such as single-event upsets. The researchers are working on a new solution that detects these kinds of faults. They are also working to expand the solution beyond VLIW architectures.
Perhaps the main concern, however, is increasing the defect coverage well beyond 89 percent. Austin said he'd like to see it rise to "two-, three- or four-nines of coverage," meaning 99 percent, 99.9 percent or 99.99 percent. And this must be done while sticking with a 5 to 10 percent area overhead, he noted.
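The arithmetic behind the "nines" shorthand is easy to check. In the Python sketch below, the function names are hypothetical, and the only figures taken from the article are the 89 percent coverage and the two-to-four-nines targets. It converts a count of nines into a percentage and computes how much the share of uncovered defects would have to shrink.

```python
from decimal import Decimal


def nines_to_percent(nines):
    """Coverage percentage for a given number of nines (2 -> 99, 3 -> 99.9)."""
    return Decimal(100) - Decimal(10) ** (2 - nines)


def uncovered_reduction(current_percent, target_percent):
    """Factor by which the uncovered share of defects must shrink."""
    return (Decimal(100) - current_percent) / (Decimal(100) - target_percent)


targets = {n: nines_to_percent(n) for n in (2, 3, 4)}
factor_for_four_nines = uncovered_reduction(Decimal(89), nines_to_percent(4))
```

Going from 89 percent to four nines means cutting the uncovered share from 11 percent to 0.01 percent, a factor of 1,100. Two nines, by comparison, requires only an elevenfold reduction.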
"Much of the work we're going to do during the next couple of years involves trying to get the coverage up, up, up and keep the costs down," Austin said.
There's an educational angle, too. "A future challenge in resilient systems is understanding the effects of physical phenomena on abstract descriptions of your machine," Austin said. "That's really an open question. I hope that once we build some physical models, we can allow architects and designers to better understand how to address these problems."

