Software Considerations For Host Processor Hot Swap

Mark Huth

4/3/2000 12:00 AM EDT

Compact PCI hardware now provides the capability to switch control between redundant host system processors. To fully take advantage of the hardware capability, operating systems, device drivers, and applications software must be configured to handle the implications of a host processor switch.

The ability to replace I/O boards in a system without shutting off power (hot swap) provides a tremendous boost to the maintainability and availability of a system. It simplifies the replacement of failed boards, minimizes system down time, and eliminates the need to reboot a system after board replacement. Extending hot swap capability to system boards and providing redundant system boards adds a further benefit: the system can tolerate both system software and system board failures. If the active system board fails, the replacement board simply gets swapped in, and the system continues operation with minimal interruption.

Compact PCI systems already support the hot swapping of non-system cards, power supplies, and peripheral components. Both the hardware needs and the board software algorithms required for the hot swapping of non-system slots are well described by the PICMG Compact PCI Hot Swap Extensions standard (www.picmg.org). To allow system processor slots to hot swap, several facilities must work in concert. For one, the hardware must allow the Compact PCI bus domains to have their control transferred from one processor to another without disrupting the bus operation. The software must also allow the transfer, and must do so at all levels of the system, from the system controller through each of the system I/O boards.

The hardware requirements have been solved. System chassis such as the Motorola Computer Group CPX8216 and CPX8221 have the hardware necessary to allow this bus takeover. However, successfully performing a domain takeover requires some adjustment to the system software.

Controlling the Bus

Before swapping host controllers, you must either first halt activity on the system bus, or ensure that the post-swap activity will not cause failure of the new host. You can halt bus activity by changing the functional configuration of boards in the system or by using slot control signals as defined by the High-Availability Hot Swap Standard. It is important to understand the effect of these on a board in order to apply them properly.

Adjusting each function's configuration space is one way to stop bus activity. Whether a board presents a bridge or a device as the single PCI load in the slot, you have control over that function's PCI mastering and target response capabilities. Specifically, you can use the master enable bit in the command register to disable origination of PCI bus cycles by a slot. Disabling a function's bus mastering capability, however, may result in overruns or underruns for that function. System software must therefore account for those possibilities anytime a function or bus has traffic suspended for longer than a few microseconds.
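As a minimal sketch of the command register approach, the following Python model clears the bus master enable bit for one function. The command register offset (0x04) and the bus master enable bit (bit 2) come from the standard PCI configuration header; the `FakeConfigSpace` class and the `set_bus_master` helper are hypothetical stand-ins for whatever configuration access routines the host operating system actually provides.

```python
import struct

PCI_COMMAND_OFFSET = 0x04    # command register offset in the PCI configuration header
PCI_COMMAND_MASTER = 0x0004  # bit 2: bus master enable


class FakeConfigSpace:
    """Hypothetical stand-in for one function's 256-byte configuration space."""

    def __init__(self):
        self.data = bytearray(256)

    def read16(self, offset):
        return struct.unpack_from("<H", self.data, offset)[0]

    def write16(self, offset, value):
        struct.pack_into("<H", self.data, offset, value & 0xFFFF)


def set_bus_master(cfg, enable):
    """Set or clear the bus master enable bit; return the previous command value."""
    old = cfg.read16(PCI_COMMAND_OFFSET)
    if enable:
        new = old | PCI_COMMAND_MASTER
    else:
        new = old & ~PCI_COMMAND_MASTER
    cfg.write16(PCI_COMMAND_OFFSET, new)
    return old


cfg = FakeConfigSpace()
cfg.write16(PCI_COMMAND_OFFSET, 0x0007)  # I/O space, memory space, and bus master enabled
previous = set_bus_master(cfg, False)    # the function may still respond as a target
quiesced = cfg.read16(PCI_COMMAND_OFFSET)
```

Returning the previous command value lets the caller restore the original setting once the switchover is complete. The sketch does nothing about overruns or underruns; those remain the responsibility of the system software.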

You can also stop bus activity by using one of two slot control signals specified by the High-Availability Hot Swap Standard: BdSel or PCIReset. The choice of signal has significant impact on how the system recovers following a hot swap.

BdSel
Negating the BdSel signal removes back-end power from a Compact PCI board. This moves a board into the H0 state of the hardware connection as described in the standard. In this state, the board is effectively powered off. Only early power, which is used to stabilize the connection to the PCI bus signals in the floating condition, remains active. The time to enter the H0 state after BdSel is negated for a slot is unspecified; it is determined by the hardware implementation of the slot payload.

Negating BdSel to a slot has the disadvantage of requiring that the board go through a power-up sequence prior to returning to service.

PCIReset
Asserting the PCIReset signal to a slot causes the PCI interface for that slot to reset and float its electrical connections for the duration of the reset. PCIReset will propagate onto the board's PCI bus in accordance with the PCI specification, and may reset the entire board or only the PCI bus, depending on hardware implementation. The time from signal assertion until the PCI interface is reset and floating is not specified and will be determined by the hardware implementation.

Negating the reset allows the board to progress to the H2/S0 state. When the new host releases the board from reset, the normal PICMG hot swap enumeration process begins. This process allows a device driver to be configured and PCI resource allocations to be made for the I/O board.

Using PCIReset to halt bus activity allows the board to maintain its power, so volatile memory should not be lost. Whether the board software can recover from PCIReset without complete initialization, however, is a matter for its software designer to determine.

Processor Hot Swap Classifications

Once the bus traffic is quieted, host processor hot swap can proceed. Processor hot swaps can be classified along two orthogonal criteria: the relationship of the two processors during the switchover, and the maintenance of state within the payload and its associated driver.

There are two possibilities for processor relationship during a hot swap: cooperative and pre-emptive. If both system processors are capable of participating in the bus domain switchover, then the switchover is considered a cooperative switchover. Otherwise, the switchover is considered to be pre-emptive.

In a cooperative switchover the claiming processor notifies the current domain owner of the intent to switch and waits for the owner's consent before claiming the bus domain. A pre-emptive switchover is initiated in the same manner. However, if the claiming system processor determines that the time allotted for the cooperative switchover has elapsed prior to receiving the current owner's consent, the bus domain is forcibly switched to the new processor.
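The request, wait, and fall-back logic described above can be sketched as follows. This is an illustration only: the `request_switchover` function, the use of a `threading.Event` to model the owner's consent, and the 50 ms timeout are all assumptions, not part of any standard or product.

```python
import threading


def request_switchover(consent, notify_owner, timeout_s):
    """Return 'cooperative' if the owner consents in time, else 'pre-emptive'."""
    notify_owner()               # tell the current domain owner a switch is wanted
    if consent.wait(timeout_s):  # owner consented within the allotted time
        return "cooperative"
    return "pre-emptive"         # allotted time elapsed: claim the domain anyway


# A healthy owner consents as soon as it is notified.
healthy = threading.Event()
result_ok = request_switchover(healthy, healthy.set, timeout_s=0.05)

# A hung owner never notices the request.
hung = threading.Event()
result_hung = request_switchover(hung, lambda: None, timeout_s=0.05)
```

Both kinds of switchover start the same way; only the outcome of the wait decides which kind it turns out to be.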

Cooperative switchovers are desired where possible. Certain types of software faults, however, can cause the current owner to not notice a simple request. To maximize the probability that the current owner will take notice, even in the face of software faults, the switchover request should trigger an interrupt.

A cooperative switchover procedure will attempt to notify all I/O functions of the switchover and allow them to halt bus activity before proceeding. Intelligent I/O functions may be allowed to complete checkpoint transfers. Additionally, the current domain owner may attempt to complete state checkpointing of drivers and other items before consenting to the takeover.

By performing the notifications and checkpointing, the switchover procedure is most likely to preserve the system state and halt bus activity. Preserving the system state and halting the bus maximizes the probability of a clean takeover and the subsequent recovery and continuation of the system function.

Certain hardware or software faults may interfere with a cooperative takeover. For example, the checkpoint link between processors may have failed, preventing a clean checkpoint from being established. Another possibility is that the current owner may have established an interrupt-inhibited environment, causing it to fail to recognize the takeover request. Other types of software or hardware faults may have similar effects. The result is a pre-emptive switchover.

A pre-emptive switchover is simply any switchover that did not satisfy the conditions for a cooperative switchover. In a pre-emptive switchover, the most recent checkpoint may be stale, the I/O functions may not have been notified of the change, the bus may not have halted, or any combination of the foregoing conditions may be in effect.

Payload and Driver State
There are three levels of domain switchover related to payload and driver state maintenance, designated cold, warm, and hot. In the cold switchover, the I/O devices and their associated new drivers do not maintain any state from before the switchover. In the warm switchover, I/O devices maintain at least some state from before the switchover and will be notified in some manner that a switch has occurred. In the hot switchover, the I/O devices are unaware that a switch has occurred.

Cold switchovers are accomplished by either using the PCIReset for each board, or by using the BdSel. Because of this, cold vs. warm/hot strategies can be mixed on a slot-by-slot basis. Following the cold switchover, boards are sequenced through the normal I/O hot swap sequences, allowing the standard enumeration procedures to work.
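Mixing strategies slot by slot can be pictured as a simple plan table. The sketch below is hypothetical: the slot numbers, the `METHODS` table, and the helper names are invented for illustration, and the per-method properties are a summary of the descriptions in this article rather than values taken from the standard.

```python
# Assumed properties of each way of quieting a slot, summarized from the text.
METHODS = {
    "bdsel":          {"cold": True,  "keeps_power": False},
    "pci_reset":      {"cold": True,  "keeps_power": True},
    "master_disable": {"cold": False, "keeps_power": True},  # used for warm switchover
}

# Hypothetical plan for one chassis: slot number -> method.
slot_plan = {2: "pci_reset", 3: "master_disable", 4: "bdsel", 5: "master_disable"}


def slots_needing_enumeration(plan):
    """Slots taken through a cold switchover re-enter the normal hot swap enumeration."""
    return sorted(slot for slot, method in plan.items() if METHODS[method]["cold"])


def slots_needing_power_up(plan):
    """Slots that lose back-end power must go through a power-up sequence."""
    return sorted(slot for slot, method in plan.items()
                  if not METHODS[method]["keeps_power"])


cold_slots = slots_needing_enumeration(slot_plan)
powered_down_slots = slots_needing_power_up(slot_plan)
```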

Since there is no state maintained across a cold switchover, very little needs to be done beyond the standard I/O Hot Swap steps. Only the protocols for causing the processor to swap need be added. Additionally, the lack of state maintenance means that there is little advantage of a cooperative switchover vs. a pre-emptive switchover. While applications may benefit from a cooperative switchover, the non-system payload gains no benefit from cooperation.

Warm switchovers are accomplished by disabling the I/O payload's bus mastering capabilities following the bus exchange. The primary mechanism for disabling the bus master capabilities is the PCI configuration header command register. Additional mechanisms, such as device CSRs, may be available on a device-dependent basis. The primary requirement for warm switchover is that the device and its driver can communicate about the device's state and its usage of system resources. This communication must be possible without the I/O device requiring bus mastership.

A communication and potential reconfiguration of PCI resources takes place before the new driver permits the payload to again perform bus master operation. The mastership hiatus permits any necessary PCI reconfiguration to occur. Resources such as bus numbers, PCI memory and I/O space allocations, and DMA buffer allocations are done anew by the new device driver. Device-to-driver communications protocols can be resynchronized, and then bus mastership capabilities can be re-enabled.
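The ordering of these steps matters more than their details, so a small model may help. In the sketch below the device is just a dictionary, and the field names (`bus_master`, `resources`, `in_sync`, `calls`) and the `warm_switchover` function are hypothetical; a real driver would touch configuration space and device registers instead.

```python
def warm_switchover(device, allocate_resources):
    """Run the warm switchover steps in order on a device model; return a step log."""
    steps = []
    device["bus_master"] = False                # mastership hiatus begins
    steps.append("bus master disabled")
    device["resources"] = allocate_resources()  # new driver allocates PCI resources anew
    steps.append("resources reallocated")
    device["in_sync"] = True                    # device-to-driver protocol resynchronized
    steps.append("protocol resynchronized")
    device["bus_master"] = True                 # only now may the payload master the bus
    steps.append("bus master enabled")
    return steps


device = {"bus_master": True, "resources": {"bar0": 0x80000000},
          "in_sync": False, "calls": 12}
steps = warm_switchover(device, lambda: {"bar0": 0x90000000})
```

The payload state in this model (the `calls` field) is untouched by the sequence; only the resources and the protocol state change.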

Cooperative switchovers have an advantage over pre-emptive switchovers in the warm switchover mode. Cooperative switchovers allow extant device status to be checkpointed to the new system processor and the device to halt activity prior to switchover. Devices may thereby avoid unexpected over/underruns.

Warm switchovers have the advantage over cold switchovers of enabling system continuation without interrupting payload states. This is quite desirable in systems where the payload intelligence is a large part of the system intelligence, such as call switching or cellular applications. In these applications, the existing calls can be maintained.

Warm switchovers maintain state with little support from the host operating systems, since the burden of managing the switchover falls on the device intelligence and its associated driver. However, this is also the drawback to warm switchover. The protocols and checkpointing required to reallocate resources and resynchronize driver and payload may be quite complex. It is unlikely that standard payload downloads will be capable of such operations.

Hot switchovers are accomplished by quickly switching a domain into an identically configured system processor. The I/O devices then resume operation without reconfiguration. While the devices may be notified of the switchover as an aid to recovering from potential under/overruns, basic operation of the device payload remains undisturbed.

In a hot switchover, cooperative switchovers have an advantage over pre-emptive switchovers. Cooperative switchovers allow extant device status to be checkpointed to the new system processor and the device to halt activity prior to switchover. Devices may thereby avoid unexpected over/underruns.

To perform a successful hot switchover, the new system processor must maintain a resource configuration identical to that of the original system processor. This requires careful checkpointing of system resource allocations such as PCI bus numbers, PCI I/O and memory space address, and DMA buffer physical addresses. Most operating systems will need modification to support this form of system processor switchover. Additionally, the system processor device drivers must be capable of configuration and checkpointing without access to real hardware.

The primary advantage of a hot switchover is that it may be implemented without modification to the payload devices' downloads. Only the drivers for the host processor require modifications. These drivers typically implement simple backplane packet interfaces, rather than the complex protocols of the I/O devices, and will deal only with status, service control and encapsulated data packets. In an environment where complex protocols acquired from third parties run on the payload devices, and the source code is not available, the hot switchover may be a necessity.

Processor Hot Swap System Resource Management

The two processor relationships and three driver maintenance levels yield six possible implementations for processor hot swap, as shown in Figure 1. Each implementation must go through a sequence of configuring the system, making the switchover, and reconfiguring the system. The sequences for each implementation are given in the Figure 1 links. Following the sequence is not all there is to implementing a successful hot swap. You may also need to carefully manage system resources.

[Figure 1 links: cold cooperative, warm cooperative, hot cooperative, hot pre-emptive, warm pre-emptive, cold pre-emptive]

Figure 1: The six examples of possible domain switchover sequences for a given system are application, device, and driver dependent. Detection of when a switchover should be performed is not considered in these sequences. The examples assume that the drivers, operating systems, and payloads have the requisite capabilities to handle each class of switchover.

Cold and warm domain switchovers require little in the way of special resource management. This is because they allow PCI reconfiguration between the switchover and I/O resumption. The same cannot be said for hot switchovers. Because the device I/O is allowed to continue without reconfiguration, every resource related to I/O operations must be carefully managed in a hot switchover. These resources include, but may not be limited to, PCI bus numbers, PCI I/O space, PCI memory-mapped I/O space, PCI prefetchable memory space, PCI interrupts, and DMA physical buffer and control addresses. Additionally, device driver configuration must be managed in the absence of physical hardware.
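The classification and its consequences for resource management can be summarized in a few lines. The enum and function names below are invented for this sketch; the properties they encode are the ones stated in the text.

```python
from enum import Enum
from itertools import product


class Relationship(Enum):
    COOPERATIVE = "cooperative"
    PREEMPTIVE = "pre-emptive"


class StateLevel(Enum):
    COLD = "cold"
    WARM = "warm"
    HOT = "hot"


def devices_keep_state(level):
    """Warm and hot switchovers preserve payload state; cold does not."""
    return level in (StateLevel.WARM, StateLevel.HOT)


def needs_identical_resources(level):
    """Only a hot switchover resumes I/O without PCI reconfiguration."""
    return level is StateLevel.HOT


# Two relationships times three state levels gives the six implementations.
combinations = list(product(Relationship, StateLevel))
```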

PCI Resources
Hot switchovers require considerable resource management. The obvious management need is for the collective set of PCI resources. These resources must be identical on both processors participating in the hot switchover, yet most operating systems supporting PCI Hot Swap have dynamic allocation mechanisms. For example, PCI bus numbers are allocated as PCI-to-PCI bridges are encountered in the enumeration process. Typically, bus numbers for I/O hot swap are allocated in blocks to allow for subordinate bridges. The CPX8216 chassis, for instance, contains two domain bridges. After a small allocation to allow for PMC bridges on the system processor, the remaining bus numbers are divided equally between the two domains.
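A worked example of the block split: PCI bus numbers are 8 bits wide, so 256 are available in total. The size of the local reservation is not given here, so the value of 16 below is purely an assumption, as is the `split_bus_numbers` helper itself.

```python
def split_bus_numbers(total=256, reserved_local=16, domains=2):
    """Return {domain_index: (first_bus, last_bus)} after a local reservation.

    reserved_local is an assumed figure standing in for the small allocation
    set aside for PMC bridges on the system processor.
    """
    per_domain = (total - reserved_local) // domains
    blocks = {}
    for d in range(domains):
        first = reserved_local + d * per_domain
        blocks[d] = (first, first + per_domain - 1)
    return blocks


blocks = split_bus_numbers()
```

With these assumed numbers, each of the two domains receives 120 bus numbers: 16 through 135 for the first and 136 through 255 for the second.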

Typically, operating systems enumerate the PCI bus either automatically through the receipt of the ENUM signal or on demand by the system management interface. In either case, the results may not be identical each time. When configuring for the hot switchover of the system processor, the system not owning the domain must have a means of tracking the allocations made by the owning domain, as it cannot make its own allocations and have them match.

The key for performing bus number allocations for hot switchover is to make sure that the domain bridges have identical allocations based on the domain rather than based upon the PCI BDF (bus, device, function) triple. This is a requirement that is not accommodated by most currently available operating systems, which generally just allocate in order as bridges are discovered, and the discovery process normally proceeds based on the BDF triple.
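The difference between the two allocation policies shows up as soon as the two hosts discover the domain bridges in a different order. The sketch below contrasts them; the domain names, the block boundaries, and both function names are hypothetical.

```python
# Hypothetical fixed table: domain name -> (first bus, last bus).
DOMAIN_BUS_BLOCKS = {"domain_a": (16, 135), "domain_b": (136, 255)}


def assign_by_discovery_order(bridges, first_bus=16, block=120):
    """Typical behaviour: hand out blocks in the order bridges are found."""
    return {bridge: (first_bus + i * block, first_bus + (i + 1) * block - 1)
            for i, bridge in enumerate(bridges)}


def assign_by_domain(bridges):
    """Allocation keyed on the domain, independent of discovery order."""
    return {bridge: DOMAIN_BUS_BLOCKS[bridge] for bridge in bridges}


host1_sees = ["domain_a", "domain_b"]
host2_sees = ["domain_b", "domain_a"]  # assumed: the other host finds them in reverse

order_based_match = (assign_by_discovery_order(host1_sees)
                     == assign_by_discovery_order(host2_sees))
domain_based_match = assign_by_domain(host1_sees) == assign_by_domain(host2_sees)
```

Only the domain-keyed policy gives both hosts the same answer, which is what a hot switchover needs.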

PCI I/O and Memory Allocations
PCI-to-PCI bridges used as domain bridges currently have only one window for each of the three PCI address spaces: I/O, memory-mapped I/O, and prefetchable memory. This single window means that the available address pool for each must be divided among the domain bridges. The current recommendation is to expose the entire resource pool through each domain bridge window. The effect of dynamically changing the window size to accommodate insertion and extraction is undetermined, and dependent on the bridge implementation.

When subordinate allocations are made for devices downstream of the domain bridges, the same allocation must be made in the other host's virtual resource pool. This may be done by checkpointing the allocations to the other processor as they are made. This requirement is not yet accommodated by most available operating systems, as normal strategy is to only make allocations when physical hardware is discovered. The operating system concept of resource allocation must be extended to apply to virtual devices not yet physically present.
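One way to picture checkpointed allocation is an allocator with two entry points: one that makes a decision, and one that records a decision made on the other host. The `WindowAllocator` class below is a hypothetical model with a simple bump allocation policy; the addresses, sizes, and slot names are invented for illustration.

```python
class WindowAllocator:
    """Bump allocator over one bridge window (hypothetical model)."""

    def __init__(self, base, size):
        self.limit = base + size
        self.next = base
        self.allocations = {}

    def allocate(self, device, size, align=0x1000):
        """Decide an address for a device that is physically present."""
        addr = (self.next + align - 1) & ~(align - 1)
        if addr + size > self.limit:
            raise MemoryError("window exhausted")
        self.allocations[device] = (addr, size)
        self.next = addr + size
        return addr

    def record(self, device, addr, size):
        """Apply an allocation decided on the other host, without probing hardware."""
        self.allocations[device] = (addr, size)
        self.next = max(self.next, addr + size)


active = WindowAllocator(0x8000_0000, 0x1000_0000)
standby = WindowAllocator(0x8000_0000, 0x1000_0000)

for device, size in [("slot3", 0x2000), ("slot5", 0x100)]:
    addr = active.allocate(device, size)
    standby.record(device, addr, size)  # stands in for the checkpoint message
```

The standby never allocates on its own while it does not own the domain; it only replays what the owner decided, so the two resource pools stay identical.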

PCI interrupts are allocated according to the hardware wiring for a given chassis. When an interrupt is allocated in the system currently owning a domain, the logically equivalent interrupt lines must be configured on the non-owning processor.

DMA Buffers
Because I/O devices may have pending DMA requests at the time of domain hot switchover, it is necessary that the physical addresses used for DMA by the active domain be similarly allocated in the standby domain. This requirement is not normally met by current operating systems.

Additionally, to manage allocations for multiple domains, the available DMA memory pool must either be divided into segments, one for each domain, or an MP-safe allocation algorithm must be used to allow the two processors to communicate their allocations as they occur. In either case, the DMA allocations must be checkpointed to the standby system and device drivers.

The physical addresses of the DMA pools must be the same on both system processors, even if the amount of memory on the two processors differs. If virtual addresses are used in any of the packet or control data exchanged between the device and the driver, then the virtual address of such structures must also be identical on the active and standby systems.
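The segmented-pool option can be sketched as a pure function of the pool base, the pool size, and the list of domains. If both system processors evaluate it with the same inputs, they arrive at the same physical addresses regardless of how much memory each has installed. The base address, pool size, and names below are assumptions chosen for the example.

```python
def partition_dma_pool(base, size, domains):
    """Split one physical DMA pool into equal, non-overlapping per-domain segments."""
    segment = size // len(domains)
    return {name: (base + i * segment, base + (i + 1) * segment - 1)
            for i, name in enumerate(domains)}


DMA_BASE = 0x0100_0000  # assumed physical base, identical on both processors
DMA_SIZE = 0x0040_0000  # assumed 4 MiB pool
DOMAINS = ["domain_a", "domain_b"]

active_layout = partition_dma_pool(DMA_BASE, DMA_SIZE, DOMAINS)
standby_layout = partition_dma_pool(DMA_BASE, DMA_SIZE, DOMAINS)
```

Individual buffer allocations inside each segment would still have to be checkpointed to the standby; the static split only removes contention between the domains.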

Device Drivers
Previous requirements for resource allocations have implied that device drivers have extended capabilities when used in hot switchover configurations. They must be able to accept allocation information via the system checkpoint protocols to ensure that the active and standby drivers can maintain a mirrored device model.

Additionally, device drivers must configure without real hardware, and have the capability of acquiring or releasing the physical hardware upon command from the system. The drivers may also have to be extended to discover which buffers are in use at the time of hot switchover, as that information may not have made the last valid checkpoint.
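A driver with these extended capabilities might expose an interface along the following lines. The `MirroredDriver` class and its method names are hypothetical; the point is only that configuration arrives through checkpoints, while ownership of the physical hardware is a separate step performed on command.

```python
class MirroredDriver:
    """Hypothetical driver model that can be configured without hardware."""

    def __init__(self, name):
        self.name = name
        self.allocations = {}
        self.hardware = None  # no physical device attached yet

    def apply_checkpoint(self, allocations):
        """Accept allocation information from the system checkpoint protocol."""
        self.allocations.update(allocations)

    def acquire(self, hardware):
        """Take over the physical hardware on command from the system."""
        self.hardware = hardware

    def release(self):
        """Give up the physical hardware and return it to the caller."""
        hardware, self.hardware = self.hardware, None
        return hardware

    def owns_hardware(self):
        return self.hardware is not None


active = MirroredDriver("active")
standby = MirroredDriver("standby")
board = object()  # stands in for the real device

active.acquire(board)
update = {"dma_ring": 0x0100_0000, "irq": 5}  # invented checkpoint contents
active.apply_checkpoint(update)
standby.apply_checkpoint(update)              # mirrored while still without hardware

standby.acquire(active.release())             # the switchover itself
```

The sketch leaves out the harder problem of discovering which buffers were in use after the last valid checkpoint.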

Backplane Communication Issues

Intelligent I/O devices and the system processor device drivers generally communicate via a shared memory packet interface. Warm or hot switchovers demand that these protocols have certain features. Protocols intended for warm switchover use must allow for suspension and reconfiguration of communication addresses following the warm switchover. The reconfiguration must be accomplished without requiring the I/O device to access the domain bus.

All protocols intended for use in domain switchover configurations should protect against lost or corrupted packets. Packet errors may occur any time over a complex PCI bus configuration and are often difficult to localize. While the system may be aware that an error has occurred, the exact location of the error may be difficult to determine, especially before the erroneous data can be used. Pre-emptive takeovers increase the likelihood of packet errors.
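A common way to protect against lost or corrupted packets is to carry a sequence number and a checksum in every frame. The sketch below uses an invented frame layout (a 32-bit sequence number followed by a CRC-32 of the payload) and invented function names; it is one possible approach, not the protocol of any particular product.

```python
import struct
import zlib

HEADER = struct.Struct("<II")  # sequence number, CRC-32 of the payload


def encode(seq, payload):
    """Prefix the payload with its sequence number and checksum."""
    return HEADER.pack(seq, zlib.crc32(payload)) + payload


def decode(frame, expected_seq):
    """Return (status, payload); status is 'ok', 'corrupt', or 'out_of_sequence'."""
    seq, crc = HEADER.unpack_from(frame)
    payload = frame[HEADER.size:]
    if zlib.crc32(payload) != crc:
        return ("corrupt", None)
    if seq != expected_seq:
        return ("out_of_sequence", None)  # a packet was lost or repeated
    return ("ok", payload)


frame = encode(7, b"status")
good = decode(frame, expected_seq=7)
damaged = decode(frame[:-1] + bytes([frame[-1] ^ 0xFF]), expected_seq=7)
skipped = decode(frame, expected_seq=6)
```

A receiver that sees "out_of_sequence" after a switchover knows that at least one packet went missing and can ask for retransmission or resynchronization.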

Future Considerations

This set of software considerations for processor hot swap or domain takeover is a good step in the right direction, but should not be considered comprehensive. Ongoing implementations of high availability systems will undoubtedly uncover new considerations. Nonetheless, following these recommendations will help in creating a system with host hot swap capability, allowing it to handle many types of system processor faults.

The cost is that system and application software may require significant changes. The extent of the modifications needed depends on the type of switchover you choose. Given that some options require changes to I/O drivers and applications software, your choices for switchover type may be limited by the extent to which you have control of that software.

About the Author

Mark is currently the Systems Architect for High Availability Real-Time Operating Systems at the Motorola Computer Group. He has been at Motorola for ten years. Previously, Mark held a variety of software and hardware development and management positions primarily related to networking and communications systems. Mark graduated from Bucknell University with a BSEE in 1975.

