Top makers of fault resilient PCs (computers that are about 99.99% reliable) have for many years endeavored to build computers that would satisfy the stringent requirements of telcos. These are true fault tolerant machines with five nines or greater availability (at least 99.999% reliable), succumbing to only about five minutes of downtime per year.
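The downtime figures follow directly from the availability percentages. As a quick check, here is a minimal Python sketch of the arithmetic; the function name and the 365.25-day year are assumptions made for illustration, not anything the vendors publish.

```python
# Convert an availability percentage into expected downtime per year.
# Assumes a 365.25-day year; the function name is illustrative only.
MINUTES_PER_YEAR = 365.25 * 24 * 60  # 525,960 minutes


def downtime_minutes_per_year(availability_percent):
    """Return the minutes per year a system is expected to be down."""
    return (1.0 - availability_percent / 100.0) * MINUTES_PER_YEAR


if __name__ == "__main__":
    for label, pct in (("fault resilient", 99.99), ("fault tolerant", 99.999)):
        minutes = downtime_minutes_per_year(pct)
        print(f"{label}: {pct}% available -> {minutes:.1f} min/year down")
```

Under these assumptions a 99.99% system is down for a little under an hour per year, while a five nines system is down for only about five minutes.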
Each PC vendor has devised its own technique to achieve true fault tolerance.
RAAC (Milwaukee, WI 414-277-1889) led the way, years ago, with their dual PC Automatic Processor Switchover (APS) system, where the PC chassis houses two computers with advanced alarm circuitry, one being treated as a spare. RAAC found that CPU failure can be triggered by a failure of the CPU itself, by the failure of a firmware card, or by a software application error that hangs the system until reset. APS can step in and switch to the hot standby CPU if the primary CPU fails.
The new NEBS-tested Centellis CO 88520 Cluster-in-a-Box (San Jose, CA 408-369-6000) is a SPARC/Solaris CompactPCI platform for the central office that uses 2N CPU and I/O cluster nodes to achieve fault tolerance.
Radisys Communication Platforms Division (Hillsboro, OR 800-950-0044) uses a patented form of directed checkpointing. Motorola Computer Group's (Tempe, AZ 800-759-1107) cPCI chassis achieves such fault tolerance that it was selected by Excel Switching Corp. to be a component in its ONE Architecture Expandable Switching System (EXS).
Another vendor (San Luis Obispo, CA 805-541-0488) joined the ranks of companies building cPCI systems with five nines (99.999%) of availability with its High Availability System, which features a dual redundant processor architecture.
The dividing line between fault resilient and true fault tolerant computers is this: although fault resilient systems have good cooling, redundant hot-swappable power supplies and RAID storage, they are still vulnerable to the failure of the single board computer (SBC) plugged into their passive backplane. If the computer chips, memory or support circuitry fail, or the software causes the computer to hang, the system goes down. The only way around this weakness thus far has been to have two completely separate systems, each including its own SBC and plug-in CT resource boards. In many cases these telephony boards are very expensive, perhaps as much as 10 times the cost of the SBC.
I-Bus (San Diego, CA 888-307-7892) brings forth its new Fault Resilient Backplane to solve this problem. This industrial-grade backplane option allows a system to be built with two SBCs and one set of shared application boards. The backplane also provides circuitry to share one floppy drive and one CD-ROM drive between the two SBCs. Aside from the two SBC slots, there is one dedicated ISA slot for the I-Bus System Sentinel Monitor and Alarm Board, and 15 shared PCI slots made possible by three Digital 21150 PCI-PCI bridges.
In normal operation, the primary SBC is powered on and is connected via the PCI bus to the application boards while the secondary SBC is powered off. The System Monitor constantly watches system parameters such as voltage, temperature, redundant power supplies, and the operation of the fans. If the System Monitor detects that something's out of whack, it raises an alarm so that maintenance personnel can take action to prevent a system failure.
The primary SBC stays in communication with the System Monitor board by resetting watchdog timers (WDTs) at periodic intervals. Typically, low-level software running on the SBC will be set up to check application parameters, and only reset the watchdog timer if this internal quality-control test is passed. If the SBC hardware or software hangs up, or if the software detects that something seems to be going wrong, the WDT will not be reset.
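The "reset only if the self-test passes" pattern can be illustrated with a small software model. This is a sketch, not I-Bus code: the real WDT lives in hardware on the System Monitor board, and the names `WatchdogTimer` and `service_watchdog`, along with the injectable clock, are hypothetical.

```python
import time


class WatchdogTimer:
    """Software stand-in for a hardware WDT (hypothetical model).

    The timer expires unless reset() is called within `timeout` seconds.
    """

    def __init__(self, timeout, clock=time.monotonic):
        self.timeout = timeout
        self._clock = clock          # injectable so the model can be tested
        self._last_reset = clock()

    def reset(self):
        """Restart the countdown (the SBC's periodic 'I am alive' signal)."""
        self._last_reset = self._clock()

    def expired(self):
        """True once more than `timeout` seconds have passed since reset()."""
        return self._clock() - self._last_reset > self.timeout


def service_watchdog(wdt, health_checks):
    """Reset the WDT only if every application health check passes.

    If any check fails, or if this function is never called because the
    software has hung, the timer is left to run out and raise the alarm.
    """
    if all(check() for check in health_checks):
        wdt.reset()
        return True
    return False
```

In use, the low-level software would call `service_watchdog` from its main loop with a list of zero-argument callables, one per application parameter it wants to verify. A hang and a failed self-test look identical to the monitor: in both cases the reset simply stops arriving.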
When a WDT alarm occurs, the System Monitor generates a signal that tells the standby SBC backplane to turn off power to the primary SBC and turn on the power to the secondary SBC. Electronic switches disconnect the failing primary SBC from the PCI bus and then connect the secondary SBC to the PCI bus. The secondary SBC will boot up and take control of the shared application boards.
This switchover process lets the system recover from an SBC hardware or software failure in the time it normally takes for the system to reboot.
Since the failed SBC is immediately powered down, maintenance personnel, alerted by the WDT alarm, can replace the failed board while the system keeps running. Once the repair or replacement has been made, the system continues to run on the secondary SBC until another WDT alarm occurs, whereupon the backplane will turn on the primary SBC, turn off the secondary, and connect the shared bus back to the primary.
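The switchover sequence described above amounts to a two-state machine that flips on every WDT alarm. The following Python sketch models that behavior under stated assumptions: the class and attribute names are invented for illustration, and the real logic is implemented in backplane hardware rather than software.

```python
class FaultResilientBackplane:
    """Toy model of the cold-standby switchover (names are hypothetical)."""

    def __init__(self):
        self.active = "primary"
        self.powered = {"primary": True, "secondary": False}
        self.bus_owner = "primary"   # which SBC is connected to the PCI bus
        self.log = []

    def standby(self):
        """Return the name of the SBC that is not currently active."""
        return "secondary" if self.active == "primary" else "primary"

    def on_wdt_alarm(self):
        """Carry out the switchover steps in the order the article gives."""
        failed, takeover = self.active, self.standby()

        # 1. Power off the failing SBC and power on the standby SBC.
        self.powered[failed] = False
        self.powered[takeover] = True
        self.log.append(f"power: {failed} off, {takeover} on")

        # 2. Disconnect the failing SBC from the bus, then connect the other.
        self.bus_owner = None
        self.bus_owner = takeover
        self.log.append(f"bus: {failed} disconnected, {takeover} connected")

        # 3. The newly powered SBC boots and takes over the shared boards.
        self.active = takeover
        self.log.append(f"{takeover} boots and takes control")
        return takeover


if __name__ == "__main__":
    bp = FaultResilientBackplane()
    bp.on_wdt_alarm()   # primary fails, secondary takes over
    bp.on_wdt_alarm()   # later the secondary fails, repaired primary returns
    print("\n".join(bp.log))
```

Note that the model has no notion of "failing back" after a repair: as in the article, the replaced board simply waits, powered off, until the next alarm makes it the active SBC again.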
An optional feature of the backplane allows both SBCs to remain powered on all the time. A separate switch turns off power to a specific SBC to allow replacement. With both SBCs running, the standby SBC can communicate with the other one (via a serial port, for example) and thus be kept up to date on the state of the application boards and the running applications' status. If an SBC should fail, the system disconnects the failed SBC from the shared bus and then connects the other, a process which can be performed faster than a reboot. This event, reminiscent of a classical switch-over, requires support to be programmed into whatever software is running at the time.
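The software support mentioned here is left to the application developer. One plausible shape for it, sketched below in Python, is for the active SBC to send periodic state snapshots over the serial link and for the standby to keep the most recent valid one. Everything in this sketch is an assumption: the line-delimited JSON format, the sequence numbers, and the names `encode_checkpoint` and `StandbyMirror` are illustrative, and the serial transport itself is omitted.

```python
import json


def encode_checkpoint(seq, state):
    """Serialize one state snapshot as a single newline-terminated line.

    The framing (one JSON object per line) is an assumption chosen because
    it is easy to resynchronize on a byte stream such as a serial port.
    """
    line = json.dumps({"seq": seq, "state": state}, sort_keys=True)
    return (line + "\n").encode("ascii")


class StandbyMirror:
    """Held by the standby SBC: remembers the newest valid snapshot."""

    def __init__(self):
        self.seq = -1
        self.state = {}

    def feed(self, line):
        """Apply one received line; return True if the state was updated."""
        try:
            msg = json.loads(line.decode("ascii"))
            seq, state = msg["seq"], msg["state"]
        except (ValueError, KeyError, TypeError):
            return False             # corrupted or malformed line: ignore it
        if not isinstance(seq, int) or seq <= self.seq:
            return False             # stale or duplicate snapshot: ignore it
        self.seq, self.state = seq, state
        return True
```

When the switchover signal arrives, the standby would resume from `StandbyMirror.state` instead of starting cold. How much state is worth mirroring, and how often, depends entirely on the application.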
The new FRBP-18A15 Fault Resilient Backplane is available as an option in the following I-Bus systems: the IFTA+, the TR6, the Atlas, and the Titan.