The concept of "five nines" (99.999 percent) availability and reliability is the benchmark for the telecommunications industry, where systems and services are expected to be available every second, minute and hour. Five nines translates to just over five minutes of "unavailable" systems in a year, or less than a second a day. After all, when was the last time you couldn't get a dial tone when you picked up your telephone?
But when it comes to making a call on a cellular phone or connecting to the Internet, you can't count on reliability in quite the same way.
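The five-nines figures quoted above can be checked with a few lines of arithmetic. The sketch below assumes a 365-day year; the function name is chosen here purely for illustration.

```python
# Downtime budget implied by an availability target.
# Assumption: a non-leap, 365-day year.
SECONDS_PER_DAY = 24 * 60 * 60
DAYS_PER_YEAR = 365


def downtime_seconds(availability_percent: float, period_seconds: float) -> float:
    """Return the downtime, in seconds, allowed over the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_seconds


per_year = downtime_seconds(99.999, DAYS_PER_YEAR * SECONDS_PER_DAY)
per_day = downtime_seconds(99.999, SECONDS_PER_DAY)

print(f"per year: {per_year / 60:.2f} minutes")  # about 5.26 minutes
print(f"per day:  {per_day:.3f} seconds")        # about 0.864 seconds
```

The same function shows how quickly the budget shrinks with each added nine: at 99.99 percent the yearly allowance is roughly 52.6 minutes, ten times the five-nines figure.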
With the convergence of the data computing and telecom worlds, businesses and their customers no longer tolerate waiting for connections or losing transactions. Availability doesn't mean rebooting to a redundant component; customers want ongoing service availability. Redundancy in both hardware and software is a necessary design feature of high-availability systems. However, because modern telecommunication and data systems are based upon computer platforms comprising both hardware and software components, simply throwing redundancy at the problem does not translate to 99.999 percent uptime.
Consider CompactPCI, which is rapidly becoming the de facto standard platform for next-generation, highly available telecommunications systems. A typical CompactPCI chassis contains redundant host CPUs, usually one active and the other in standby mode. It can have any number of I/O cards with either one-for-one redundancy (2N), or one redundant resource for many of the same type (N+1). Software can also be made redundant by running the same applications on separate CPUs. Sounds easy so far. But how do we get from having component redundancy to providing service availability?
To begin with, a failure must be detected automatically, the failed components identified and isolated, and the redundant components brought on line quickly and seamlessly. In many instances, those systems are not readily accessible. And if the fault management is handled automatically, the system administrators and technical staff must also be notified automatically, by e-mail or page.
A key feature of the CompactPCI chassis is hot swap, that is, extracting and inserting boards while the system is on. You'd like the software to automatically detect, configure and manage the new board without any user intervention. A CompactPCI chassis can have any number of operating systems; in fact, it is the norm to have different operating systems running on the host CPUs vs. the I/O cards. The newly inserted board could be completely different from the one it replaced, but it still needs automatic detection and integration into the rest of the system.
Recovery policies are a defined set of rules the system follows when a specific event occurs. What if you want to change your policies? Suppose the marketing group decides a new feature must be added to the system in order to make corporate revenue targets for the next quarter. The good news is that the new feature can be added via a software change; the bad news is that you have 5,000 systems to modify across the country and it must be done by the end of the week.
What if your business skyrockets and your systems must be scaled up immediately? It would be nice to have a nonproprietary solution based on existing industry standards, thus enabling new systems to be fielded at a faster rate than is possible with highly custom and system-specific solutions. How do you maintain availability if switchover has to happen during a session? Not only does the hardware have to be up and running, but the applications also need seamless switchover with preservation of session-state data.
Service availability software solution
Management of redundant components for service availability requires six elements:
- A highly flexible and extensible rules-based configuration management engine and database.
- Event detection and multitiered, policy-based fault management.
- Remote management and monitoring via a Web-based user interface.
- The ability to automatically upgrade software components of remotely located systems over that same Internet connection.
- A communication and data transfer mechanism for automated updates to configuration management.
- Checkpointing of current operating-state data between software applications.
The configuration management has to maintain current operating-state information for both hardware components and software applications. A key factor in providing five-nines uptime is application-level checkpointing. This enables rapid switchover when failures are detected, so that end users are unaware any failure has occurred.
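Application-level checkpointing can be pictured with a small in-process sketch. Everything here is an assumption made for illustration: the class names, the session model and the copy-on-every-change strategy are hypothetical and do not represent any particular product's API. A real system would ship the checkpoint to a standby on another CPU or chassis rather than copy it in memory.

```python
import copy
from dataclasses import dataclass, field


@dataclass
class Replica:
    """One copy of an application holding per-session state (illustrative)."""
    name: str
    sessions: dict = field(default_factory=dict)

    def handle(self, session_id: str, event: str) -> None:
        # Record the event in the session's history.
        self.sessions.setdefault(session_id, []).append(event)


class CheckpointedPair:
    """An active/standby pair in which every state change is checkpointed."""

    def __init__(self, active: Replica, standby: Replica):
        self.active = active
        self.standby = standby

    def handle(self, session_id: str, event: str) -> None:
        self.active.handle(session_id, event)
        # Checkpoint: copy the changed session state to the standby.
        self.standby.sessions[session_id] = copy.deepcopy(
            self.active.sessions[session_id]
        )

    def switchover(self) -> None:
        # Promote the standby, which already holds the last checkpoint.
        self.active, self.standby = self.standby, self.active


pair = CheckpointedPair(Replica("cpu-a"), Replica("cpu-b"))
pair.handle("call-42", "setup")
pair.handle("call-42", "connected")
pair.switchover()  # simulate a failure of cpu-a
pair.handle("call-42", "teardown")
print(pair.active.name, pair.active.sessions["call-42"])
```

Because the standby already holds the session history at the moment of switchover, the call in this example continues on cpu-b without losing its earlier events.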
To solve such problems, we developed and deployed a high-availability solution with the architectural flexibility and the total-system approach necessary to make five-nines service availability a reality. When a planned or unplanned event interferes with proper operation of a component, management software initiates and manages an intelligent policy-based switchover to a standby component. After switchover, the software initiates appropriate recovery and switchback operations to restore the system to its full capabilities. It also provides the dynamic configuration management needed for all active elements within the system.
Components publish their own capabilities to the built-in configuration manager via industry-standard XML. Fixed, transient and hot-swappable components are all remotely monitored and controlled through an application programming interface (API), customizable Web-based user interfaces or Simple Network Management Protocol (SNMP) network-management consoles.
The same solution is extensible outside of a single chassis using built-in cluster management. A cluster manager establishes connections to each client and monitors its status on either a single LAN or across network boundaries using IP addresses. A role manager designates and manages responsibilities of the clients and the manager.
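One common way for a cluster manager to monitor client status is a heartbeat timeout. The sketch below is a generic illustration of that technique, not the cluster manager described here: the class, its methods and the three-second timeout are all hypothetical, and timestamps are passed in explicitly so the example stays deterministic.

```python
class ClusterManager:
    """Tracks the last heartbeat seen from each client, keyed by IP address."""

    def __init__(self, timeout_s: float):
        self.timeout_s = timeout_s
        self.last_seen = {}

    def heartbeat(self, client_addr: str, now: float) -> None:
        # Record the time at which this client last reported in.
        self.last_seen[client_addr] = now

    def failed_clients(self, now: float) -> list:
        # A client is considered failed once it has been silent
        # for longer than the timeout.
        return sorted(
            addr
            for addr, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )


manager = ClusterManager(timeout_s=3.0)
manager.heartbeat("10.0.0.1", now=0.0)
manager.heartbeat("10.0.0.2", now=0.0)
manager.heartbeat("10.0.0.1", now=2.5)  # only the first client reports again
failed = manager.failed_clients(now=4.0)
print(failed)  # ['10.0.0.2']
```

In a fuller design, the list of failed clients would be handed to the role manager, which would then reassign the responsibilities those clients held.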
In the model we use, a library of APIs enables you to create "high-availability"-aware applications. The APIs allow applications to publish, subscribe, heartbeat, checkpoint, write to or read from the in-memory database, and implement other capabilities to ensure continued system operation. A cross-platform architecture supports a variety of general-purpose and real-time operating systems.
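The publish and subscribe calls mentioned above follow a familiar pattern, which the following minimal sketch illustrates. The `EventBus` class, its method names and the topic strings are assumptions invented for this example; the actual API library is not shown in this article.

```python
from collections import defaultdict


class EventBus:
    """A minimal in-process publish/subscribe hub (hypothetical API)."""

    def __init__(self):
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback) -> None:
        # Register a callback to be invoked for every event on this topic.
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, payload: dict) -> int:
        # Deliver the payload to each subscriber; return the delivery count.
        callbacks = list(self._subscribers.get(topic, []))
        for callback in callbacks:
            callback(payload)
        return len(callbacks)


bus = EventBus()
received = []
bus.subscribe("board.inserted", received.append)

delivered = bus.publish("board.inserted", {"slot": 4})
ignored = bus.publish("board.removed", {"slot": 2})  # no subscribers
print(delivered, ignored, received)
```

An application that cares about hot-swap events would subscribe once at startup and react to each published event, without needing to know which component detected the board.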
So far we have the redundancy management solution. Now, what do you have to do to integrate the five-nines system everybody wants?
In our approach, the software has a "solution pack" architecture that gives the user both ease and control in configuration and implementation. Integration with specialized components in the telecom world (routers and switches, for example) still requires development, but it is simplified. Changes are kept together, both physically (in one directory) and functionally, in one or more solution packs.
In our model, the minimal requirement for any solution pack is a properly configured XML file named pack.xml, placed in a directory with the same name as the pack under the "solution" subdirectory.
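The directory convention just described lends itself to a simple sanity check. The sketch below assumes a layout of solution/<pack name>/pack.xml and verifies only that each pack.xml is well-formed XML; the contents required for a "properly configured" file are not specified in this article, so they are not checked. The function name and the sample pack names are hypothetical.

```python
import tempfile
import xml.etree.ElementTree as ET
from pathlib import Path


def find_valid_packs(root: Path) -> list:
    """Return the names of packs under root/solution with a well-formed pack.xml."""
    packs = []
    solution_dir = root / "solution"
    if not solution_dir.is_dir():
        return packs
    for pack_dir in sorted(p for p in solution_dir.iterdir() if p.is_dir()):
        manifest = pack_dir / "pack.xml"
        if not manifest.is_file():
            continue  # a pack directory without pack.xml is not a valid pack
        try:
            ET.parse(str(manifest))
        except ET.ParseError:
            continue  # malformed XML: skip this pack
        packs.append(pack_dir.name)
    return packs


# Build a throwaway layout with one valid and two invalid packs.
with tempfile.TemporaryDirectory() as tmp:
    root = Path(tmp)
    (root / "solution" / "router").mkdir(parents=True)
    (root / "solution" / "router" / "pack.xml").write_text("<pack/>")
    (root / "solution" / "empty").mkdir()
    (root / "solution" / "broken").mkdir()
    (root / "solution" / "broken" / "pack.xml").write_text("<pack>")
    found = find_valid_packs(root)

print(found)  # ['router']
```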
The primary area of customization is in setting policies for fault management. We use a fully automated, device-based fault management system. It incorporates fault detection, diagnosis and user-configurable policy-based fault recovery and switchover actions.
Fault detectors gather data from various sources including collectors, events, applications and even other detectors. Detectors use this information to make predetermined decisions about the system. If the detector finds a problem condition, it invokes a policy.
When a fault detector runs, it applies a rule to determine the status of the information it watches. This rule is contained in an XML file. If a value watched by a detector violates the rule, the detector fires, triggering policies, which are also XML files.
The policy can invoke any number of actions, such as sending an e-mail message, starting a program, stopping a program, sending an HTML message or writing a message to an error log. These policies can be created and modified quickly using simple XML scripts.
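The detector-rule-policy chain can be sketched in a few lines. The XML element and attribute names below are invented for this example, since the real schema is not given here, and the action handlers are stand-ins that return strings rather than actually sending mail or writing logs.

```python
import xml.etree.ElementTree as ET

# Hypothetical detector rule: fire when temperature_c exceeds 70.
RULE_XML = """
<detector name="cpu-temp">
  <rule metric="temperature_c" max="70"/>
  <policy ref="overheat"/>
</detector>
"""

# Hypothetical policy: a list of actions to run when the detector fires.
POLICY_XML = """
<policy name="overheat">
  <action type="log" message="CPU temperature over limit"/>
  <action type="email" to="admin@example.com"/>
</policy>
"""


def rule_violated(detector: ET.Element, sample: dict) -> bool:
    """Return True if the watched value exceeds the rule's maximum."""
    rule = detector.find("rule")
    return sample[rule.get("metric")] > float(rule.get("max"))


def run_policy(policy: ET.Element, handlers: dict) -> list:
    """Invoke the handler for each action, in document order."""
    return [
        handlers[action.get("type")](action.attrib)
        for action in policy.findall("action")
    ]


detector = ET.fromstring(RULE_XML)
policy = ET.fromstring(POLICY_XML)
policies = {policy.get("name"): policy}

# Stand-in handlers; a real system would send mail, write logs and so on.
handlers = {
    "log": lambda attrs: f"LOG: {attrs['message']}",
    "email": lambda attrs: f"EMAIL to {attrs['to']}",
}

actions = []
sample = {"temperature_c": 82.5}
if rule_violated(detector, sample):
    actions = run_policy(policies[detector.find("policy").get("ref")], handlers)
print(actions)
```

Keeping rules and policies in separate XML documents, as in this sketch, means a threshold or a notification address can be changed by editing a file rather than rebuilding the application.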
See related chart