Santa Cruz, Calif. -- You wouldn't want a chip in a car, airplane or medical device to suddenly fail, but with reliability challenges that worsen at 45 nanometers and below, that's a real possibility. It's why Semiconductor Research Corp. and the National Science Foundation are funding a groundbreaking research effort into "self-healing" chips that can detect and repair defects in the field.
The three-year research project will fund work by two principal investigators from the University of Michigan--Todd Austin, associate professor of electrical engineering, and Valeria Bertacco, assistant professor of electrical engineering. The two have already published research about defect-tolerant architectures that involve minimal area and performance trade-offs, in contrast to the large sacrifices required by today's modular redundancy approaches. During the research project, the investigators will work to boost defect coverage and to extend the new architectures to a wide variety of chips.
While techniques such as design-for-manufacturability (DFM) and restricted design rules will help maintain nanometer yields, some chips will still fail days, months or years after they've been deployed. Culprits include electromigration, hot-carrier degradation, undetected manufacturing defects, unpredicted process variations and thin, vulnerable gate oxides.
"With much larger chips and much smaller geometries, we're going to have chips in which not all the transistors are going to work," said Bill Joyner, director of CAD and test at the Semiconductor Research Corp. (SRC). "We're looking for research that will give us chips and systems that are going to work, in spite of the fact that components are going to fail."
Self-healing chips, said Austin, may extend Moore's Law for another process generation or two. "There are so few atoms forming the transistors that any amount of variation can cause them to be too weak or too slow," he said. "By building self-healing into the system, you can tolerate these types of things, and give yourself the opportunity to extend the life span of CMOS silicon a little further than it would otherwise."
Chips that can recover from failures are essential, said Bertacco. "Unless there is new technology that makes it possible to overcome failures, pretty soon we'll be producing chips that last a very short time," she said.
SRC members include some of the largest U.S. semiconductor suppliers, and they've shown a strong interest in self-healing chips, according to Joyner. SRC member Intel Corp. is "definitely" interested, said Shih-Lien Lu, senior staff research scientist at Intel.
Lu said there's a need for chips that can detect errors, recover from them and, ultimately, repair themselves. Intel has investigated several methods of detection and recovery, he said, but has not extensively looked into self-healing or repair. "One thing we like about self-healing is that it's not only for memory, but also logic," said Lu. "And it's also about repair in the field, not just manufacturing test."
Self-healing chips are several years away, but they represent "a big issue for the future of technology," said Mary Olsson, research vice president for design and engineering at Gartner Dataquest. Like restricted design rules (RDRs), she noted, self-healing chips could reduce the need for certain types of DFM or IC layout tools. If self-healing chips really take off, there may be less need for RDRs as well, she said.
New approach to fault tolerance
Fault-tolerant architectures are nothing new, but thus far they've been restricted to high-end computing systems, said the University of Michigan's Bertacco. The main approach, she said, is triple modular redundancy (TMR), where there are three copies of the system. "This is very expensive technology because it will require a 200 percent overhead in area," she said. "In contrast, the solution we're trying to propose is much lower cost and can thus be applied to a much broader range of systems."
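To make the cost of TMR concrete, here is a minimal Python sketch of the general technique, not of the Michigan architecture. Three copies of a module compute the same result and a bitwise majority voter masks a fault in any one copy. The names `majority_vote`, `tmr_run`, `increment` and the `fault_masks` parameter are hypothetical and exist only for this illustration.

```python
def majority_vote(a: int, b: int, c: int) -> int:
    """Bitwise majority: each output bit takes the value held by at least two inputs."""
    return (a & b) | (a & c) | (b & c)


def tmr_run(module, x, fault_masks=(0, 0, 0)):
    """Run three copies of `module` on x and vote on the results.

    Each entry in fault_masks is XORed into one copy's output to mimic
    a defect that flips bits in that copy (an assumption of this sketch).
    """
    outputs = [module(x) ^ mask for mask in fault_masks]
    return majority_vote(*outputs)


def increment(x: int) -> int:
    """Toy 8-bit module standing in for real logic."""
    return (x + 1) & 0xFF


# One faulty copy is outvoted by the two good copies.
single_fault = tmr_run(increment, 41, fault_masks=(0, 0b100, 0))

# Two copies failing on the same bit defeat the voter.
double_fault = tmr_run(increment, 41, fault_masks=(0b100, 0b100, 0))
```

The sketch tolerates any fault confined to one copy, but it pays for three copies of the logic to do the work of one, which is the 200 percent area overhead Bertacco describes, before counting the voter itself.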
The initial University of Michigan work is with microprocessors, but investigators plan to extend the research to a broad range of chips, Bertacco said. The three-year project, funded at $100,000 per year, will also involve the creation of high-level defect models, she said. System designers and architects can use such models to evaluate a system's need for resiliency.
Part of the work, noted Austin, is the development of a "simulation infrastructure" that can model potential silicon failures. He said the investigators have taken tools from Cadence Design Systems and Synopsys and added a capability to "inject" faults into a system model. The model can then be used to evaluate the integrity of a design.
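The idea of injecting faults into a system model can be illustrated with a toy example. The Python sketch below is not the investigators' infrastructure and does not use any Cadence or Synopsys tool. It assumes a hypothetical gate-level netlist for a 1-bit full adder and the classic stuck-at fault model, in which one net is forced to 0 or 1. It then counts how many injected faults a given set of input vectors exposes at the outputs.

```python
from itertools import product

# Hypothetical toy netlist: a 1-bit full adder, gates listed in topological order.
NETLIST = [
    ("x1", "xor", ("a", "b")),
    ("sum", "xor", ("x1", "cin")),
    ("a1", "and", ("a", "b")),
    ("a2", "and", ("x1", "cin")),
    ("cout", "or", ("a1", "a2")),
]
OPS = {
    "xor": lambda p, q: p ^ q,
    "and": lambda p, q: p & q,
    "or": lambda p, q: p | q,
}
INPUTS = ("a", "b", "cin")
OUTPUTS = ("sum", "cout")


def simulate(vector, fault=None):
    """Evaluate the netlist. `fault` is (net_name, stuck_value) or None."""
    nets = dict(vector)
    if fault and fault[0] in nets:          # stuck-at fault on a primary input
        nets[fault[0]] = fault[1]
    for name, op, (p, q) in NETLIST:
        nets[name] = OPS[op](nets[p], nets[q])
        if fault and fault[0] == name:      # stuck-at fault on a gate output
            nets[name] = fault[1]
    return tuple(nets[o] for o in OUTPUTS)


def fault_coverage(vectors):
    """Return (detected, total) over all single stuck-at faults."""
    all_nets = list(INPUTS) + [gate[0] for gate in NETLIST]
    faults = [(net, value) for net in all_nets for value in (0, 1)]
    detected = sum(
        any(simulate(v, f) != simulate(v) for v in vectors) for f in faults
    )
    return detected, len(faults)


all_zero = [dict(zip(INPUTS, (0, 0, 0)))]
exhaustive = [dict(zip(INPUTS, bits)) for bits in product((0, 1), repeat=3)]

weak = fault_coverage(all_zero)      # only stuck-at-1 faults are visible
full = fault_coverage(exhaustive)    # every single stuck-at fault is visible
```

A fault counts as detected when at least one vector makes the faulty outputs differ from the fault-free ones. Real silicon failure models are far richer than single stuck-at faults, so this is only a sketch of the evaluation loop such an infrastructure might run.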