United Business Media EE Times



Bringing High Availability to the Masses

A standards-based approach makes high availability much more realistic.

By Fred Rehhauser


Lately, there has been a lot of talk about high availability in the computer industry. There is no doubt it's an important factor in customer satisfaction for almost every sector in the compute world. The only problem is everyone has their own definition for what the term implies. Loosely, high availability refers to the amount of time a certain computer or network of computer resources is available to perform useful work. But even this interpretation has different meanings for different applications.

High Availability: Many Meanings

It helps to think of high availability as a continuum. At one end of the spectrum are PCs - including those in the home and those on the office desktop. We are all too familiar with how often a PC needs to be rebooted. Even as processor speeds move into the gigahertz range, the reliability and availability of a PC seem to remain as miserable as ever. Although this affects productivity, no one is willing to pay more for a PC to obtain higher availability, so everyone continues to endure the situation.

As you move further up the compute chain, however, high-availability requirements become much more demanding and, consequently, better defined. For example, enterprise computing resources that support mission-critical applications, such as ERP and accounting, are typically expected to deliver 99.5 percent uptime. This works out to be less than one hour of downtime per week, whether for upgrades, unplanned failures, or planned maintenance.

If that sounds stringent, consider the Internet. As customers rapidly become accustomed to immediate Internet access, their willingness to tolerate temporary downtime is quickly approaching zero. The growing dominance of mission-critical and business-to-business (B2B) applications will only accelerate these expectations. For this reason, the Internet industry is moving briskly toward the adoption of extremely aggressive high-availability guidelines approaching 99.99 percent availability or what is referred to as "high nines." This equates to approximately an hour of downtime per year. The goal here is to achieve "web-tone" availability similar to that found with the dial tone in the telco industry. In fact, probably sooner rather than later, the expectation of web reliability will converge with these telco-type expectations.

The telco industry itself resides at the high end of the high-availability spectrum. When it comes to telephone access, businesses and consumers alike expect and get essentially non-stop operation. When a phone is picked up, there is always a dial tone - anytime, anywhere and under any condition. Here, the so-called "five nines" or 99.999 percent high availability is the norm, translating into a paltry five minutes of downtime per year.
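The downtime figures attached to each availability level follow from simple arithmetic: the unavailable fraction multiplied by the length of the period. The sketch below is illustrative only; the function and constant names are made up for this example, and a 365-day year is assumed.

```python
MINUTES_PER_WEEK = 7 * 24 * 60      # 10,080
MINUTES_PER_YEAR = 365 * 24 * 60    # 525,600 (assumes a non-leap year)


def downtime_minutes(availability_percent, period_minutes):
    """Return the downtime allowed, in minutes, over the given period."""
    unavailable_fraction = 1.0 - availability_percent / 100.0
    return unavailable_fraction * period_minutes


if __name__ == "__main__":
    for pct in (99.5, 99.9, 99.99, 99.999):
        per_week = downtime_minutes(pct, MINUTES_PER_WEEK)
        per_year = downtime_minutes(pct, MINUTES_PER_YEAR)
        print(f"{pct:7.3f}%  {per_week:7.2f} min/week  {per_year:8.2f} min/year")
```

At 99.5 percent the allowance is about 50 minutes per week; at 99.99 percent it is about 53 minutes per year; and at 99.999 percent it is a little over five minutes per year.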

Standards-based approach emerging

Until now, the telco industry built its own proprietary computer systems to meet its demanding standard, rolling its own chips, boards, firmware and software. However, the rapid deregulation and globalization of the telco industry over the last few years, coupled with the Internet explosion, have made it more difficult to keep current, placing incredible demands on the internal infrastructures of public and private networks. In this new environment, time to market means everything in the pursuit of the rate of return necessary to keep pace with the dot-com age. Adding to the difficulties is the proprietary, out-of-date hardware and software within these systems - legacy systems that were never intended to support such an exponential rise in users and new services.

This has brought about a revolution in compute platforms within the telco industry. As a result, third-party modular, open-standard hardware and software are now available that deliver the levels of reliability, availability, and serviceability the telco industry demands. The sheer amount of development around an open standard not only ensures greater reliability but also makes the components much more cost-effective.

Why should this major shift in the telco world be of interest to the larger compute universe? Simply because the emergence of an affordable, yet extremely reliable framework means that telco-like high availability can now be adopted in other computing environments. Not only will this type of affordable, open-standards platform speed next-generation telco solutions to market, it will also provide the means for the highest levels of availability to quickly migrate into the Internet, enterprise, business, and even home spaces. Such a platform could even make telco-like availability possible for applications like automated highway control, tele-surgery, streaming video, set-top boxes, home networking - including everything from computers to home-security systems - and myriad other killer apps yet to be dreamed up.

CompactPCI as foundation

A standards-based approach is now feasible because there finally exists a cost-effective and open board-level standard capable of supporting the demands of high availability. Basically, for the rack-mount systems used in telco applications, the choice of board-level standards for high-availability systems narrows down to PCI, VME, and CompactPCI (cPCI).

The PCI bus is the industry standard for millions of desktop systems. Unfortunately, it doesn't provide the higher levels of reliability or uptime needed in a high-availability system. Further complicating the issue, there is no easy way to cool this type of board, and it incorporates edge connectors that are notorious both for being somewhat unreliable and for making board replacement difficult. In its favor, however, the PCI standard leverages the tremendous advantages of the enormous PC industry, making it very cost-effective and extremely flexible given its wealth of robust device drivers and proven, inexpensive silicon.

The VME standard, on the other hand, was specifically developed for industrial applications, a place where high availability has long been a major concern. This means it offers superior reliability, was designed specifically for cooling, and is easily installed or removed. It is, however, a proprietary standard, meaning it's not only expensive, but also limited in what it offers and supports.

Figure 1 - A platform to stand on
The CompactPCI and telecom platform architectures are standardizing product development.

To address the limitations of these two established standards, a consortium of over 400 computer suppliers and manufacturers worked together to find a solution. The result was the creation of the cPCI specification.

This standard deliberately merges the performance, scalability, and reliability of VME with the cost efficiency and flexibility of the PCI standard. Enticed by this combination, network, telecom, and service provider manufacturers are embracing the approach. The benefits of cPCI include a standard form factor, PCI compatibility, and high performance, with 132 MB/sec in 32-bit mode and 264 MB/sec in 64-bit mode. It is also scalable and expandable, extremely reliable, and designed for superior cooling (see Figure 1).
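The 132 MB/sec and 264 MB/sec figures are theoretical peaks that can be reproduced from the bus width and clock rate. The sketch below assumes a 33 MHz bus clock, which the article does not state, and one transfer per clock cycle; the names are invented for illustration.

```python
# Assumption: 33 MHz bus clock (not stated in the article).
PCI_CLOCK_HZ = 33_000_000


def peak_bandwidth_mb_per_sec(bus_width_bits, clock_hz=PCI_CLOCK_HZ):
    """Theoretical peak, assuming one full-width transfer on every clock cycle."""
    bytes_per_transfer = bus_width_bits // 8
    return bytes_per_transfer * clock_hz / 1_000_000


if __name__ == "__main__":
    for width in (32, 64):
        print(f"{width}-bit: {peak_bandwidth_mb_per_sec(width):.0f} MB/sec")
```

Sustained throughput on a real bus is lower, since arbitration and addressing consume cycles that carry no data.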

System-level perspective critical to high-availability success

For all its strengths, the cPCI standard can't, in and of itself, make a high-availability system. Achieving aggressive levels of high availability demands more than simply building upon a robust hardware platform. The entire system, from operating system to application, must be carefully scrutinized and optimized to support availability. Every aspect needs to be deliberately crafted to ensure that any and all failures are kept to a minimum on both the hardware and software fronts. If and when failures do occur, mechanisms need to be in place so that alternate resources can be immediately accessed and put to use, ensuring continuous service to the end-user.

Because a high-availability architecture must encompass all levels of the system - from the hardware to the application and management software - it makes for an extremely complex structure. The best way to illustrate how a standards-based approach works is to discuss in detail a specific high-availability architecture - in this case, the CP2000 high-availability architecture for the telco industry from Sun Microsystems, Inc.

Earlier this year, Sun announced its new CP2000 series of cPCI-based high-availability products. These products include open-standard, board-level platforms along with the technology found in Sun's high-end enterprise servers. More than just a hardware solution, the CP2000 platform combines the Solaris 8 operating environment, the Chorus 4.0 real-time operating system, and various implementations of Java, giving telcos the same level of functionality they had with earlier proprietary, monolithic operating systems. Since the platform is based on industry-wide standards and technologies, it offers increased scalability and performance, and a lowered cost of ownership. In addition, it supports the 99.999-percent high-availability needs of the telco industry.

Extensions to the standard

The CP2000 program includes enhancement and extension of the basic cPCI architecture to support specific, high-availability requirements from a systems point of view. The development allows for the introduction of a standard method of communication between satellite cards in a rack and the system controller.

Figure 2 - Hot fun in the summer sun
Sun's PCI hot-swap framework includes a variety of techniques to handle single failures.

Consequently, CP2000 cPCI cards can be installed in either system controller or satellite slots. The improved resilience of the cPCI shelf to single failures is managed via a variety of techniques (see Figure 2): comprehensive driver support for hot-swap, which gives device drivers a clear API to the hot-swap features; an IPMI bus with complete I/O card support, which offers an out-of-band communication link between the system controller and the satellite cards; a dual host, or alternate system controller, which adds a second system controller to the PCI bus and so removes a key single point of failure, the system controller itself; and application-specific or card-specific failover techniques.

Operating system requirements

In keeping with the system-level approach to high availability, the CP2000 includes specially optimized operating system capabilities that directly support high-availability requirements. The CP2000 architecture uses the Solaris 64-bit operating environment. It supports high-availability clustering, as well as accepted telecom and networking protocols from both Sun and various third parties. The Solaris hot-swap framework has been extended with cPCI-specific drivers and optimized backplane communications.

The CP2000 program also includes the real-time ChorusOS operating system. Real-time capabilities are crucial for intelligent applications such as voice processing and data-link control. This applies to a variety of applications including cellular base stations, base station controllers, public switches and PBXs, access networks, cross-connect switches, and voicemail systems. The ChorusOS and Solaris operating environment constitute an integrated program with common application programming interfaces, management functions, and Java technology-enabled capabilities for dynamic delivery of IP services. As a result, applications developed using the APIs can be ported from one environment to the other, reducing development costs and time to market on services.

The system controller runs either the Solaris operating environment or ChorusOS. In addition, users can choose either the Solaris operating environment or ChorusOS for satellite cards, or they can create a heterogeneous system that runs the Solaris operating environment on the system controller and ChorusOS on the satellite cards. The ChorusOS satellite cards can communicate with Solaris-based system controller cards over the CP2000 backplane network, which provides a transparent link between heterogeneous environments. This operating system interworking paves the way for end-to-end integration of equipment across the network.

High availability out to the application

Moving to the next software layer, high-level volume management middleware is implemented in the CP2000 to support data storage and database requirements for data mirroring within a high-availability system. The volume manager also provides high-level disk management, enabling disk concatenation to create a single large partition or disk striping for improved performance. The manager is responsible for associating a particular set of data with a service. This ensures that in the event of a failover, access to and ownership of the data are transferred to the new node.
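The difference between concatenation and striping comes down to how a logical block number is mapped onto physical disks. The following is a generic sketch of the two mappings, not the CP2000 volume manager's actual layout; the function names and parameters are hypothetical.

```python
def concatenated_location(block, disk_sizes):
    """Concatenation: fill the first disk completely, then the next, and so on."""
    for disk, size in enumerate(disk_sizes):
        if block < size:
            return disk, block
        block -= size
    raise IndexError("block is beyond the end of the volume")


def striped_location(block, disks, stripe_blocks):
    """Striping: rotate across disks in fixed-size stripe units."""
    stripe_index, within_stripe = divmod(block, stripe_blocks)
    disk = stripe_index % disks
    offset = (stripe_index // disks) * stripe_blocks + within_stripe
    return disk, offset


if __name__ == "__main__":
    # Two disks; consecutive stripe units of 4 blocks alternate between them.
    for block in (0, 4, 9):
        print(block, striped_location(block, disks=2, stripe_blocks=4))
    print(5, concatenated_location(5, disk_sizes=[4, 4]))
```

Striping spreads consecutive stripe units over several disks so that large transfers can proceed in parallel, which is where the performance gain comes from; concatenation only adds capacity.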

At an even higher level, the CP2000 will include a framework that consists of various monitors with a management interface. These monitors track the availability of the hardware and applications. When the manager is informed of a failure, it will initiate the failover sequence. This framework resides at the data services layer, with utilities that are responsible for monitoring the state of the system and initiating failovers when necessary.
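The monitor-and-manager arrangement can be illustrated with a heartbeat-based sketch. This is a generic pattern under stated assumptions, not the CP2000 framework's interface: the class names, the heartbeat timeout, and the two-node active/standby layout are all hypothetical.

```python
class Monitor:
    """Tracks the most recent heartbeat from one node or application."""

    def __init__(self, timeout_s):
        self.timeout_s = timeout_s
        self.last_beat = None

    def heartbeat(self, now):
        self.last_beat = now

    def is_healthy(self, now):
        return self.last_beat is not None and (now - self.last_beat) <= self.timeout_s


class FailoverManager:
    """Polls the monitors and promotes the standby when the active node fails."""

    def __init__(self, active, standby, monitors):
        self.active = active
        self.standby = standby
        self.monitors = monitors  # dict: node name -> Monitor

    def poll(self, now):
        """Return True if a failover was initiated during this poll."""
        if self.monitors[self.active].is_healthy(now):
            return False
        if not self.monitors[self.standby].is_healthy(now):
            return False  # no healthy standby to fail over to
        self.active, self.standby = self.standby, self.active
        return True


if __name__ == "__main__":
    monitors = {"node-a": Monitor(3.0), "node-b": Monitor(3.0)}
    manager = FailoverManager("node-a", "node-b", monitors)
    monitors["node-a"].heartbeat(0.0)
    monitors["node-b"].heartbeat(0.0)
    monitors["node-b"].heartbeat(4.0)  # node-a has gone silent
    print(manager.poll(5.0), manager.active)
```

Timestamps are passed in explicitly so the logic can be exercised without waiting on a real clock; a production monitor would read the system time and would also have to guard against both nodes believing they are active.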

Finally, on the application level, availability must be carefully managed. After all, this is the level about which the end-user cares most. If the desired application is not available, it matters little that the hardware is still functional. It is imperative, therefore, that applications be monitored independently of the underlying hardware and system services. Otherwise, application failure cannot be detected. For this reason, a truly comprehensive high-availability solution like the CP2000 provides the ability to monitor not only hardware and system services, but the applications as well. Moreover, it implements a checkpoint, failover and restart mechanism for applications running within a shelf.
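A checkpoint-and-restart cycle can be sketched in a few lines. The example below is a toy illustration, not the CP2000 mechanism: the class names are invented, and an in-memory object stands in for storage that would survive a node failure.

```python
import json


class CheckpointStore:
    """Stand-in for storage that survives a node failure (hypothetical)."""

    def __init__(self):
        self._blob = None

    def save(self, state):
        self._blob = json.dumps(state)

    def load(self):
        return None if self._blob is None else json.loads(self._blob)


class CallCounter:
    """Toy application whose only state is the number of calls handled."""

    def __init__(self, store):
        self.store = store
        saved = store.load()
        # On restart, resume from the last checkpoint if one exists.
        self.calls = saved["calls"] if saved else 0

    def handle_call(self):
        self.calls += 1

    def checkpoint(self):
        self.store.save({"calls": self.calls})


if __name__ == "__main__":
    store = CheckpointStore()
    app = CallCounter(store)
    for _ in range(3):
        app.handle_call()
    app.checkpoint()
    app.handle_call()              # not yet checkpointed when the failure hits
    restarted = CallCounter(store)  # the replacement instance after failover
    print(restarted.calls)
```

The restarted instance resumes from the last checkpoint, so any work done after that checkpoint is lost. How often to checkpoint is therefore a trade-off between overhead and the amount of state at risk.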

System management crucial

Residing at the top of the CP2000 high-availability structure is the system management framework. A key piece, it is the primary way operators and service personnel interface with the system. This component is so critical that, if the system is not adequately managed, the effort that went into the underlying technology ceases to matter.

The system management level provides the high-availability environment with crucial capabilities.

Foremost, it enables the system to be updated, extended or otherwise modified without requiring that the entire system be brought down. In fact, the system management itself is reconfigurable on the fly.

The system management software also supports scalability. Any healthy telecom system will constantly be evolving, with nodes and services being added while old hardware and applications are phased out. Using the system management software, the telco provider can easily add and subtract nodes with the CP2000 architecture without bringing down the whole system. In a large environment, this software supports multiple administrative domains spanning geographies and time zones, allowing management teams to work concurrently and independently.

Flexible, remote system management is also important so that the entire network can be tailored to meet specific needs. To that end, the system management environment is easily customized, providing centralized or distributed management based on the desired management hierarchy. Systems can then be grouped by location, server role, and administrative responsibility, among other criteria. Even views can be customized to present information in a tailored format.

In line with the industry-standard approach, the CP2000's system management software is constructed on industry-accepted standards such as the Java programming language and the SNMP management protocol. SNMP is widely used by most network and systems management products for communicating status and alarm information. Standard interfaces and protocols simplify the integration of third-party system management tools. Using these tools in conjunction with the basic CP2000 management framework, telecom system managers can access a single interface to manage an entire heterogeneous enterprise, while very effectively administrating the core elements.

To further extend the capabilities of the system management software, a robust development environment is available. Organizations can use the tools provided in the development environment to build new modules to monitor applications. The system management software is also secured: a complete security model is used to authenticate system managers, protect system management information from unauthorized access, and ensure data integrity.

Supporting redundancy

The CP2000 is designed to support the highest levels of redundancy, an understood requirement for achieving extremely high levels of availability. Leveraging the high-availability capabilities of the underlying hardware and software components, the CP2000 architecture has redundancy at multiple levels.

Optional component redundancy is one example. The system may be configured with multiple system boards, processors, memory banks, network and I/O controllers, and redundant power and cooling. The CP2000 also supports hot-swappable I/O cards, focusing on the software issues involved, as they are critical to maintaining high availability. The CP2000's solution is to provide a way to make applications hot-swap aware so that failures can be handled seamlessly in software and failed hardware can be simply "swapped out" while the system remains online.
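The payoff from component redundancy can be estimated with the standard reliability formulas for parallel and series arrangements. The sketch below assumes that components fail independently and that switching to a redundant unit is instantaneous, neither of which the article claims; the function names are illustrative.

```python
def parallel_availability(component_availability, copies):
    """Availability of N redundant copies; the group is up if any one copy is up.

    Assumes independent failures and instantaneous switchover.
    """
    unavailability = (1.0 - component_availability) ** copies
    return 1.0 - unavailability


def series_availability(availabilities):
    """Availability of a chain in which every element must be up."""
    total = 1.0
    for availability in availabilities:
        total *= availability
    return total


if __name__ == "__main__":
    single = 0.99
    for copies in (1, 2, 3):
        print(copies, f"{parallel_availability(single, copies):.6f}")
```

Under these assumptions, duplicating a 99 percent component yields 99.99 percent, and each further copy adds two more nines. The series formula shows the other side: a chain of components is less available than its weakest link, which is why redundancy is applied at multiple levels.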

The CP2000 also supports alternate pathing, where disk and network operations are automatically redirected to a predefined alternate path should a failure occur. This permits I/O cards to be serviced without disruption of systems. Alternate pathing makes dynamic reconfiguration possible, allowing operators to change a system's hardware resources while the system is up and running and without requiring a system reboot. The combination of alternate pathing and dynamic reconfiguration enables administrators to perform online repair and reconfiguration of servers, increasing application or service-level availability.
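Alternate pathing amounts to retrying a failed operation on a predefined second route and then staying on that route. The sketch below shows the idea only; the classes, the controller names, and the exception are hypothetical and do not represent the Solaris alternate pathing interface.

```python
class PathFailed(Exception):
    """Raised when an I/O path is offline (hypothetical)."""


class Path:
    """One route to a device, for example through a particular I/O controller."""

    def __init__(self, name):
        self.name = name
        self.online = True

    def send(self, data):
        if not self.online:
            raise PathFailed(self.name)
        return f"{self.name}:{data}"


class AlternatePathDevice:
    """Sends over the active path and switches to the alternate on failure."""

    def __init__(self, primary, alternate):
        self.active = primary
        self.alternate = alternate

    def send(self, data):
        try:
            return self.active.send(data)
        except PathFailed:
            # Redirect to the predefined alternate; the failed path becomes
            # the new alternate and can be serviced while the system runs.
            self.active, self.alternate = self.alternate, self.active
            return self.active.send(data)


if __name__ == "__main__":
    primary, alternate = Path("ctrl0"), Path("ctrl1")
    device = AlternatePathDevice(primary, alternate)
    print(device.send("block-1"))
    primary.online = False  # simulate a controller failure
    print(device.send("block-2"))
```

The caller sees the same `send` call succeed before and after the failure. If both paths are offline, the second attempt raises `PathFailed` and the error reaches the caller.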

High availability now available

As Sun's CP2000 program demonstrates, the only way to achieve the most aggressive levels of high availability is by taking into account all the layers from hardware to system-level management. The beauty of a standards-based approach is that best-of-breed capabilities can be combined from different vendors to develop extremely robust yet affordable high-availability systems.

The telco industry is already putting in place next-generation systems that are built around standard hardware and software. Meanwhile, larger service providers are seriously considering adopting higher levels of availability as their customers pressure them for better quality of service. As the standards-based approach proliferates - which it will - look to see the most stringent levels of high availability become commonplace throughout the compute universe.


Fred Rehhauser's experience includes more than 20 years in senior or executive management positions at Sun Microsystems' Microelectronics, Chorus Systems, Force Computers, and Motorola's Computer Group. He is the author of numerous papers on bus architecture and bus technology and holds several patents in the computer field.

To voice an opinion on this or any other article in Integrated System Design, please e-mail your comments to sdean@cmp.com.


Copyright © 2000 Integrated System Design Magazine

All materials on this site Copyright © 2009 TechInsights, a Division of United Business Media LLC All rights reserved.