Understanding Service Availability--An Industry in Transition

John Fryer, Service Availability Forum

7/22/2009 2:49 AM EDT

Today the dependability of the communications infrastructure and the applications flowing over these networks is more important than ever. As new technologies enable powerful new services, users rapidly become dependent on these services to conduct their personal and professional lives. The communications and enterprise computing industries are under enormous pressure to revamp networks and applications to accommodate explosive growth and emerging technologies. As communications equipment is deployed into packet-based multiservice networks, the dependability and availability of services must not be compromised.

This next phase of network evolution must happen quickly; service providers are already offering new services that challenge current network capabilities. Users expect new, innovative distributed services to be delivered on demand and without interruption. These demands of the emerging communications environment can be met rapidly through the adoption of open industry standards. Widespread adoption, in turn, depends on educating application developers on how best to use open specifications in their work.

A Paradigm Shift
The transformation of networks--the migration from discrete-service architectures to converged networks that are IP-based, service-oriented and transport-agnostic--is not lost on those who supply telecom equipment to network operators. Over the last decade, the industry has continued to shift from a vertical to a horizontal model, driven by ever-shorter development cycles and constant pressure to reduce development costs. Equipment manufacturers must still build communications equipment, and enterprises must still develop applications, that achieve the highest possible levels of availability and dependability. Service providers, in turn, must rapidly deploy new services and vouch for their availability and integrity in order to compete for users and meet customer service level agreements.

What is catalyzing this shift is the emergence of key open specifications that clearly delineate the functional layers of a highly available system. This standardization of functional layers--hardware, operating system, middleware and application services--makes it far easier for systems designers to develop highly available, deployment-ready systems using commercial off-the-shelf (COTS) building blocks (see Figure 1). The emergence of multiple COTS suppliers for each building block is helping create a viable and vibrant ecosystem that provides compelling alternatives for building systems. As a result, development organizations can focus their precious, often shrinking, resources on the activities that differentiate them from competitors--applications and services.


What is Service Availability?
A key requirement of a highly available system is that it provides uninterrupted service even in the event of hardware or software failures. Examples are found in the communications industry, where "carrier class" is synonymous with high availability, and in the defense industry, where "mission critical" systems are essential in an increasingly high-tech environment. Historically, Network Equipment Providers (NEPs) have designed and built such systems from the ground up, using specialized, in-house expertise developed over decades.

Traditional definitions of high availability have roots in hardware systems, where redundancy of equipment was the primary mechanism for achieving uptime over a specific period. As software has come to dominate the landscape, the probability of failure is often much higher for applications than it is for hardware, so these concepts have been extended to encompass an overall view of Service Availability, where downtime, irrespective of its cause, is an exceptionally rare event. Services and applications should always be available, whether during abnormal system operation, scheduled maintenance, or software upgrade.

The key principles of Service Availability extend beyond reacting to a failure. They also encompass system monitoring, so that preventative action may be taken before a critical situation occurs. Key elements include redundancy, fault prediction and avoidance, stateful and seamless recovery from failures, and a short mean time to repair. This matters because, correct system design and exhaustive testing aside, today's complex systems can often interact in ways not envisioned by their designers.

Many systems providers have invested a significant amount of time and resources in developing software services, often referred to as high availability middleware, that are essential to building platforms and systems with service availability approaching FIVE-NINES or better. Availability is normally measured as a number of "NINES", which translates into the amount of downtime permitted per day or per year. Applications with high service availability generally fall into the FIVE-NINES (99.999 percent) or higher category, which translates into about 5.26 minutes of downtime per year, or about 0.86 seconds per day. This level of engineering is why, in many circumstances, phone service may still be available even if there are power failures.
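The arithmetic behind the "NINES" can be checked with a short calculation. The sketch below is illustrative only; the function name and the assumption of a 365-day year are ours, not part of any SA Forum specification.

```python
SECONDS_PER_DAY = 24 * 60 * 60
SECONDS_PER_YEAR = 365 * SECONDS_PER_DAY  # assumption: non-leap year


def allowed_downtime_seconds(nines: int, period_seconds: float) -> float:
    """Maximum downtime in seconds over a period for an availability of N nines."""
    unavailability = 10.0 ** (-nines)  # e.g. 5 nines -> 0.00001 (99.999% available)
    return unavailability * period_seconds


if __name__ == "__main__":
    # Print the downtime budget for three to six nines.
    for nines in range(3, 7):
        per_year_min = allowed_downtime_seconds(nines, SECONDS_PER_YEAR) / 60
        per_day_s = allowed_downtime_seconds(nines, SECONDS_PER_DAY)
        print(f"{nines} nines: {per_year_min:.2f} min/year, {per_day_s:.3f} s/day")
```

For FIVE-NINES this works out to roughly 5.26 minutes per year, or 0.864 seconds per day. Each additional nine cuts the downtime budget by a factor of ten.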

Figure 2 below shows the characteristics of an available system.


As we move up the scale, we start to see the typical characteristics of highly available systems. These include system monitoring functions such as heartbeats, redundancy of components, and alerts when failures occur. All of these aspects generally represent a focus on increasing the "Mean Time Between Failures" (MTBF) through quality design and implementation and the use of redundancy.
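As an illustration of heartbeat monitoring, the following minimal sketch flags a component as failed when no heartbeat has arrived within a timeout. The class name, the component names and the injectable clock are hypothetical choices for this example; this is not an SA Forum API.

```python
import time
from typing import Callable, Dict, List


class HeartbeatMonitor:
    """Tracks the last heartbeat per component and reports those that went silent."""

    def __init__(self, timeout_s: float, clock: Callable[[], float] = time.monotonic):
        self.timeout_s = timeout_s
        self.clock = clock  # injectable so the monitor can be tested deterministically
        self.last_seen: Dict[str, float] = {}

    def beat(self, component: str) -> None:
        """Record a heartbeat from a component at the current time."""
        self.last_seen[component] = self.clock()

    def failed_components(self) -> List[str]:
        """Return components whose last heartbeat is older than the timeout."""
        now = self.clock()
        return sorted(
            name for name, seen in self.last_seen.items()
            if now - seen > self.timeout_s
        )
```

In a real system the list returned by `failed_components()` would feed the alerting and failover logic; here it simply shows how a missed heartbeat becomes a detectable event.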

Redundancy is really a crossover point, where the focus shifts from MTBF (lengthening the time between failures) to "Mean Time To Repair" (MTTR), that is, minimizing the time it takes a system to recover from a failure. This is the real focus of delivering service availability.
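The effect of repair time can be made concrete with the standard steady-state relationship, availability = MTBF / (MTBF + MTTR), where MTTR is the mean time to repair. The sketch below uses made-up figures purely for illustration.

```python
def availability(mtbf_hours: float, mttr_hours: float) -> float:
    """Steady-state availability from mean time between failures and mean time to repair."""
    if mtbf_hours <= 0 or mttr_hours < 0:
        raise ValueError("MTBF must be positive and MTTR non-negative")
    return mtbf_hours / (mtbf_hours + mttr_hours)


if __name__ == "__main__":
    # Hypothetical system that fails on average once every 10,000 hours.
    for mttr in (1.0, 0.1, 0.01):
        print(f"MTTR {mttr:>5} h -> availability {availability(10_000, mttr):.7f}")
```

With the failure rate held constant, cutting the repair time from one hour to six minutes moves this hypothetical system from roughly four nines to roughly five, which is why recovery speed dominates at the high end of the scale.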

Beyond the redundancy of components, availability engineering moves into diagnosis and correction. Many systems operate autonomously, so diagnostic and corrective options must be managed remotely. Generally, systems may be diagnosed while online, but corrective actions, such as reconfiguration and software downloads, require at a minimum a hard system reboot.

Many modern enterprise-class systems go a stage further by implementing policy management capabilities. This enables systems to operate at FOUR-NINES and higher levels of availability. The policies enable automated actions to be taken based on events and trends that occur within a system. This includes running automated diagnostics and implementing dynamic reconfigurations without taking a system out of operation. As data is collected about the health of the system, it becomes possible to predict potential faults and take automated corrective actions before critical events impact service availability. A classic example is monitoring memory usage and detecting that available memory is decreasing over time while the system is in a steady state. The corrective action may be to switch to a backup and restart the original application, or blade in a system, before bringing it back into service and then restarting the backup. Full memory is available again for a period of time, a critical failure is avoided and, most importantly, service availability is maintained. Logging information can be transmitted to the support team, who can then diagnose the problem and update the system.
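The memory example can be sketched as a simple trend check: fit a straight line to recent free-memory samples, estimate how long until memory runs out, and trigger a switchover if that point falls within a planning horizon. All names, units and thresholds below are hypothetical, and a production policy engine would likely use more robust statistics.

```python
from typing import Optional, Sequence


def slope_per_sample(values: Sequence[float]) -> float:
    """Least-squares slope of evenly spaced samples (change per sample interval)."""
    n = len(values)
    if n < 2:
        raise ValueError("need at least two samples")
    mean_x = (n - 1) / 2
    mean_y = sum(values) / n
    num = sum((i - mean_x) * (v - mean_y) for i, v in enumerate(values))
    den = sum((i - mean_x) ** 2 for i in range(n))
    return num / den


def samples_until_exhaustion(free_mb: Sequence[float]) -> Optional[float]:
    """Estimated sample intervals until free memory hits zero, or None if not shrinking."""
    slope = slope_per_sample(free_mb)
    if slope >= 0:
        return None  # free memory is stable or growing
    return free_mb[-1] / -slope


def should_switch_over(free_mb: Sequence[float], horizon_samples: float) -> bool:
    """Policy: act if exhaustion is predicted within the given horizon."""
    remaining = samples_until_exhaustion(free_mb)
    return remaining is not None and remaining <= horizon_samples
```

For example, with samples of 1000, 990, 980, 970 and 960 MB free, the fitted trend loses 10 MB per interval, so exhaustion is predicted 96 intervals out; a policy with a 100-interval horizon would switch to the backup, while one with a 50-interval horizon would keep watching.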

In the final stage, systems are able to self-heal, which goes beyond the concept of corrective actions. If we take the example of telecommunications or networking, it is only when something happens within a network that systems must react and preserve service availability at the network level. This could be, for example, the failure of a critical functional blade within a system, such as one maintaining a routing table. In such situations, it is possible for a system to reach a point of exhaustion where memory and CPU cycles become depleted due to the loads imposed by applications. Throttling back applications and redistributing processing load, so that the system returns smoothly to its preferred steady-state performance, are characteristics of truly service-available systems operating at FIVE and often SIX NINES. Enabling the design and development of such systems is the objective of the Service Availability Forum.
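One way to picture redistribution and throttling after a blade failure is the sketch below. It spreads the failed blade's load across the survivors in proportion to their spare capacity and reports whatever cannot be placed as load to throttle. The blade names, the load units and the proportional policy are assumptions made for illustration, not a description of any particular product.

```python
from typing import Dict, Tuple


def redistribute(
    load: Dict[str, float], capacity: Dict[str, float], failed: str
) -> Tuple[Dict[str, float], float]:
    """Move the failed blade's load to survivors; return (new loads, load to throttle)."""
    survivors = {name: load[name] for name in load if name != failed}
    # Spare capacity per surviving blade (never negative).
    spare = {name: max(capacity[name] - survivors[name], 0.0) for name in survivors}
    total_spare = sum(spare.values())
    orphaned = load[failed]
    placed = min(orphaned, total_spare)  # cannot place more than the spare capacity
    if total_spare > 0:
        for name in survivors:
            survivors[name] += placed * spare[name] / total_spare
    return survivors, orphaned - placed
```

If blades a, b and c each have capacity 100 and carry loads of 40, 60 and 50, the failure of c can be absorbed completely. If a and b were already at 90 and 95, only 15 units could be placed and the remaining 35 would have to be throttled until the failed blade returns to service.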

The Service Availability Forum
The Service Availability Forum is a consortium that develops, publishes, educates on and promotes open specifications for carrier-grade and mission-critical systems. SA Forum specifications enable COTS ecosystems for highly available platforms, streamline development, and accelerate time to market. Since its inception, the SA Forum has centered its efforts on producing key specifications to address the requirements of availability, reliability and dependability for a broad range of applications.

To date, the SA Forum has provided a rich set of API specifications that address two areas: a hardware abstraction layer, the Hardware Platform Interface (HPI), and an application abstraction layer, the Application Interface Specification (AIS). In concert, these specifications allow for portability and management of service availability middleware, as well as of the applications that comply with them.

As implementations of these specifications are increasingly accepted in the marketplace, the SA Forum is accelerating its effort to educate the communications and computing industries on how to develop applications that can achieve high service availability.


Application Webcast Series
With a robust catalog of specifications, the SA Forum has expanded its focus on educating application developers on how best to leverage open specifications for their work. Enabling real-world, successful applications based on the specifications will accelerate industry acceptance and spur growth in the COTS ecosystem.

The first Webcast in the series educates developers on the principles of service availability. Future topics will include implementation examples and a Developer FAQ. The goal of this valuable educational resource is to help developers navigate high-availability and service-availability issues so they can focus their time and effort on developing applications. For more information on the SA Forum and Service Availability and to download the first Webcast, visit: SAForum.

About the Author
John Fryer is a member of the board of directors and marketing chair for the Service Availability Forum. He is also the Director of Advanced Technology Marketing for the Embedded Computing division of Emerson Network Power. John is responsible for determining market trends, future customer requirements and driving industry software initiatives around high availability. He has a strong technical background with more than 25 years of experience in the communications industry in a variety of marketing and engineering positions. John holds a B.Sc. with Honors from the University of Nottingham, England. John can be contacted at: John.Fryer@emerson.com.

