Backplanes and motherboards responsible for delivering and distributing power to multiple-card systems must be immune to individual card failures that could potentially jeopardize reliable system operation. While many precautions are taken by backplane designers to avoid this mode of failure, particular attention must be given to system card design in order to isolate failures to that card alone. Failures allowed to propagate into adjacent cards or the backplane can easily bring an entire system down. A method must be employed to cordon off faults at the source in order to maintain system uptime.
Additionally, the fault must generate an alert so service personnel may make repairs as required. Intelligent power management must be designed on the system cards to control, monitor and report the health of the power subsystems. It must also contain the means of recording faults and generating alerts whereby service can be called upon to take action.
Presented herein is the comprehensive design of a power system that converts the -48V bus to the multiple voltages typically required by communication equipment such as network processors, DSPs, and miscellaneous ASICs. Included is a review of reliability concerns followed by the design of both power management and power conversion blocks, focusing on areas where difficulties often arise.
Reliability Concerns (Reference 1)
There are many sources and mechanisms of system failure. Faults that cause a system failure are classified by origin or duration. A fault characterized by origin may be caused by incorrect design, environmental factors, physical defects, or incorrect use (e.g., operator error). Incorrect usage and component mortality are typically the most common causes of failure (see Table 1). The duration of the fault can be transient or permanent. A permanent fault generates errors over a period of time coincident with the system's life span. A transient fault generates errors significantly shorter in duration than the system's recovery requirements. If a transient fault occurs repeatedly, then its detection is desirable. High-availability systems move through a typical sequence of events during system failure and recovery as shown in Figure 1. This sequence consists of fault detection, diagnosis, confinement, retrying or masking, compensation, repair and reintegration. The terminology varies but the concepts are the same.
Fault detection uses a combination of testing, monitoring and result comparison from redundant operations and occurs either off-line or on-line. Diagnosis determines what caused the fault and provides information about the failure location and/or properties. Confinement isolates the faulty component from the rest of the system and prevents further propagation of a fault and its effects. Retry and masking techniques ensure that only correct information gets passed on within the system in spite of a failed component. Fault compensation occurs when the system provides additional responses to compensate for the output of the faulty component. Repair and reintegration of the failed component into the system, without interruption, complete the sequence.
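The recovery sequence described above can be written down as an ordered list of stages. The following Python sketch is illustrative only; the names RecoveryStage and next_stage are hypothetical and do not come from any particular product or library.

```python
from enum import IntEnum
from typing import Optional


class RecoveryStage(IntEnum):
    """Failure-handling stages in the order given in the text (hypothetical names)."""
    DETECTION = 1
    DIAGNOSIS = 2
    CONFINEMENT = 3
    RETRY_OR_MASKING = 4
    COMPENSATION = 5
    REPAIR_AND_REINTEGRATION = 6


def next_stage(stage: RecoveryStage) -> Optional[RecoveryStage]:
    """Return the stage that follows `stage`, or None once the sequence is complete."""
    if stage is RecoveryStage.REPAIR_AND_REINTEGRATION:
        return None
    return RecoveryStage(stage + 1)
```

Modeling the stages as an ordered type makes it straightforward to log which stage a card has reached when a fault is recorded.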
Table 1: System Card Failure Modes
Figure 1: Sequence of steps in failure identification and reduction needed to achieve reliable operation
A New Approach for Improving System Reliability
A new approach for achieving higher levels of reliability is to manage all parts of the system's power chain and standardize on a consistent and modular architecture. In this manner, all the various cards and platforms a manufacturer designs and produces can benefit from the data derived from each failure. This approach is now possible using integrated circuits that combine both hardware and software solutions, with data output on an industry-standard bus. The stringent reliability requirements call for devices capable of monitoring voltage, current and temperature on the individual cards in addition to managing soft-start, hot swap, reset control, supply sequencing/tracking and voltage control. The devices also include status monitoring and reporting, fault diagnostic recording, environmental monitoring, and Active DC Output Control (ADOC) power management. This confines problems to an individual card, which can then be safely disabled and replaced before failure without causing system downtime. Further, analysis of failures can be used to refine the card design.
An example of the power management chain in such a system card using Point-of-Load (POL) architecture is shown in Figure 2. POL is displacing the traditional distributed power architecture, in which power is distributed across the board from four or more isolated step-down (buck) DC-DC converters. Instead, the POL architecture uses a single -48V isolated DC-DC converter that is hot swapped into a -48V supply and outputs a quasi-regulated intermediate voltage (+5, +8 or +12V). The intermediate voltage is then bussed to single or multiple non-isolated POL DC-DC converters, switching regulators or LDOs to regulate and control the supply voltage at the load. This concept is being introduced in new products designed for systems such as NEBS Blade Servers and AdvancedTCA platforms, which require efficient power management to help manufacturers of data-communications equipment achieve higher system reliability. Increasing the number of cards with the same architecture also increases the statistical sample size, allowing identification and elimination of failures with very low rates of occurrence. Starting with the hot swap circuit function, each block and its role in preventing downtime is explained in the following sections.
Figure 2: A system card design with the power management functions necessary to produce a reliable backplane power management design.
These include Hot Swap, Supply Cascading/Sequencing and Tracking, Environmental Monitoring and Reset Control. This POL power architecture replaces traditional -48V distributed power designs. It turns on each power partition of the card design, allows time for the power-on reset (POR) of each section, and actively controls DC levels at the load for better accuracy. As component power requirements change, the power management device can be programmed in-system using the I2C bus. The power supply is bussed from a main power bus and down-converted at the load.
Primary-Side DataCom Power Hot-Swap Controller
Perhaps the primary challenge facing communications engineers is to maintain system operation during system card hot swapping. This means the hot-swap function, which was historically focused on power transitioning of individual system cards during insertion and removal without powering down the system to allow for easy servicing, must also prevent disruption of other system cards when it malfunctions.
Any board or circuit connected to the -48V supply must not cause any disturbance to the bus. A 'Hot-Swap' controller (SMH4804) is used to:
1. Permit live card insertion by soft-starting the -48V live insertion current.
2. Shut down the -48V power on the native board when an overcurrent or other fault jeopardizes the bus or the native card.
3. Permit orderly power-on/off sequencing of the DC-DC converters. This includes primary-to-secondary voltage isolation using opto-isolators or another non-galvanic device with the required primary-to-secondary breakdown voltage rating.
The most basic implementation of the hot swap function must provide card insertion detection and -48V soft-start current limiting as shown in Figure 3. It should also provide advanced fault detection capable of monitoring the primary-side voltage for over/under voltage (OV/UV) conditions as well as the current into the system card. A hot-swap controller for data communications applications requiring isolated DC-DC converters must also address the increasing power requirements and complexity of POL power systems. Programmable analog technology gives board designers a large degree of flexibility without the need for excessive external components, which affect reliability. Also, most traditional fault-sensing techniques are prone to inadvertent activation during unusual events, such as initial powering of the card or insertion of other cards into the rack. The new devices are designed to extend system operation by ignoring spurious events and reacting only to actual faults. These components also allow a system card to be inserted into a live backplane without disrupting system operation.
Upon card insertion, the hot-swap controller monitors the input voltage ensuring it is within its valid range and checks the pin detect inputs for proper card insertion. Programmable delay times ensure power is not applied during contact bounce. The device applies power to the isolated DC-DC converter by driving an external MOSFET with a programmable slew rate to limit inrush current (Figure 3).
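The insertion checks described above can be expressed as a simple gating condition. The sketch below is a minimal illustration in Python, not the controller's actual logic; the threshold values, the debounce time and the names InsertionLimits and may_apply_power are assumptions chosen for the example.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class InsertionLimits:
    # All values are illustrative placeholders, not taken from any datasheet.
    uv_threshold_v: float = 36.0   # minimum acceptable bus magnitude
    ov_threshold_v: float = 72.0   # maximum acceptable bus magnitude
    debounce_ms: float = 50.0      # time the contacts must remain stable


def may_apply_power(bus_magnitude_v: float,
                    pins_detected: bool,
                    stable_ms: float,
                    limits: InsertionLimits = InsertionLimits()) -> bool:
    """True only when the voltage is in range, the card is seated, and bounce has settled."""
    in_range = limits.uv_threshold_v <= bus_magnitude_v <= limits.ov_threshold_v
    debounced = stable_ms >= limits.debounce_ms
    return in_range and pins_detected and debounced
```

Only when all three conditions hold would the controller begin ramping the MOSFET gate; any single failing condition keeps power off the card.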
In-rush current limiting and current regulation allow large capacitances to be charged at a fixed current for a defined period of time. Once the input voltage to the DC-DC converter has stabilized, the DC-DC converter is enabled. The hot swap controller block in Figure 2, detailed in Figure 4, monitors card operation and turns off the card in the event a fault is detected on the -48V side, such as over-current or loss of regulation of the primary supply voltage. A forced shutdown input allows the card to be turned off in the event a fault is detected on the secondary side. In either case, a fault occurring on this card is isolated from the rest of the system to prevent it from propagating to other system cards. Over-current or circuit breaker functions include selectable quick-trip current values and duty cycles. Furthermore, a programmable non-volatile circuit breaker can be used to prevent power from being reapplied to a card that has previously had an over-current fault. In the example, the hot swap device controls one converter, enabling it after a programmed time period.
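The non-volatile circuit breaker behavior can be illustrated with a short Python sketch. This is a simplified model under stated assumptions: the class name LatchingBreaker is hypothetical, the trip current is arbitrary, and a plain dictionary stands in for the device's non-volatile memory.

```python
class LatchingBreaker:
    """Over-current breaker whose tripped state persists until explicitly cleared."""

    def __init__(self, trip_current_a: float, store: dict):
        self.trip_current_a = trip_current_a
        # `store` stands in for non-volatile memory (an assumption for this sketch).
        self.store = store

    def sample(self, current_a: float) -> bool:
        """Record a trip if the limit is exceeded; return True if power may stay on."""
        if current_a > self.trip_current_a:
            self.store["tripped"] = True
        return not self.store.get("tripped", False)

    def clear(self) -> None:
        """Service action: reset the latch after the fault has been investigated."""
        self.store["tripped"] = False


nvm = {}                                   # survives a simulated power cycle
breaker = LatchingBreaker(2.0, nvm)
breaker.sample(3.0)                        # over-current trips the latch
after_cycle = LatchingBreaker(2.0, nvm)    # card re-inserted, same stored state
print(after_cycle.sample(1.0))             # power stays off until clear() is called
```

Because the tripped flag lives in the persistent store rather than in the breaker object, re-inserting the card does not reapply power until the latch is deliberately cleared.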
A sequence timer input can be enabled, allowing forced shutdown of the -48V switched source in the event a fault is detected on the secondary side of the DC-DC converter. Communications between the secondary side and the hot swap controller are essential in power managed designs. Programming of the device is accomplished through a standardized I2C programming interface, which allows the designer to optimize the various parameters for a particular system card. This standardized interface allows the device to be programmed in-system, eliminating the need for external programming. This solution outputs an industry standard Hex file, which can be used with third party programmers, ATE equipment or in-system programmers to allow development without removing the device. Programmable software design tools provide near instant customizable options for varying power management requirements of system cards thereby improving time-to-market.
Figure 3: -48V Hot Swap Power-On Waveforms using a programmable hot swap controller.
Ch 1 (1V/Div) = 3.3V DC-DC converter output (Yellow trace) Ch 2 (5V/Div) = PG # output (Blue trace)
Ch 3 (20V/Div) = Switched 48V supply voltage (Purple trace) Ch 4 (1A/Div) = Input current (Green)
The scope plot on the bottom shows the complete hot swap function from connection to output voltage. The top plot shows the inrush current during turn-on, which is limited to 300mA. Without the controller, inrush current transients can easily exceed 10A and load the -48V supply.
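As a rough check on the 300mA limit, the time needed to charge the converter's input capacitance at a fixed current follows from t = C x V / I. The Python sketch below uses the 48V and 300mA figures from the text; the 100uF capacitance is an assumed value for illustration only, and the function name charge_time_s is hypothetical.

```python
def charge_time_s(capacitance_f: float, voltage_v: float, current_a: float) -> float:
    """Time to charge a capacitor to voltage_v at constant current: t = C * V / I."""
    if current_a <= 0:
        raise ValueError("current must be positive")
    return capacitance_f * voltage_v / current_a


# 100 uF is an assumed input capacitance; 48 V and 300 mA come from the text.
t = charge_time_s(100e-6, 48.0, 0.3)
print(f"{t * 1e3:.1f} ms")
```

Under these assumptions the charge time is on the order of 16 ms, which shows how a current limit trades a short, controlled ramp for the brief multi-ampere spike that would otherwise disturb the bus.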
Figure 4: Primary-side distributed power Hot-Swap controller and sequencer. Items in blue are user programmable. Depending on the device used, up to 4 isolated supplies can be sequenced from the primary side.