United Business Media EE Times





 

Carrier Grade Linux: What You Need to Know
Interest is building around the use of Linux in telecom equipment designs. To accelerate acceptance, the OSDL is developing a set of specs that will turn Linux into a carrier-grade OS. Here's a look at these efforts.







CommsDesign


While interest is high, acceptance of the Linux operating system (OS) in telecom equipment designs has been modest, at best. One of the main reasons for the reluctance is that existing editions of the Linux OS are not optimized to meet the carrier-grade requirements of modern telecom carriers.

Now the Open Source Development Lab (OSDL) is tackling this issue head on. Through its Carrier Grade Linux Working Group (CGLWG), OSDL is drafting a set of requirements that will turn Linux into a more attractive option for telecom equipment developers. Let's take a look at the key efforts under way and the key aspects of the emerging spec that should concern communication design engineers.

The Overview
The OSDL CGLWG defines three main types of applications that carrier-grade Linux will support — gateways, signaling servers, and management.

Gateway applications provide bridging services between different technologies or administrative domains. Gateway applications are characterized by supporting a large number of connections in real-time over a large number of interfaces, with the requirement of not losing any frames or packets. An example of a gateway application is a media gateway, which converts conventional voice circuits using TDM to IP packets for transmission over an IP-switched network.

Signaling server applications, which include SS7 products, handle control services for calls, such as routing, session control, and status. Signaling server applications are characterized by sub-millisecond real-time requirements and large numbers of simultaneous connections (10,000 or more). An example signaling server application would include control processing for a rack of line cards.

Management applications handle traditional service and billing operations, as well as network management. Management applications are characterized by a much less stringent requirement for real-time, as well as by additional database and communication-oriented requirements. A typical management application might handle visitor and home location registers for mobile access, and authorization for customer access to billable services.

Categories Defined
To build an effective specification for the applications above, the CGLWG is working on seven categories: standards, platform, availability, serviceability, tools, performance, and security. Work within each of these categories is broken down into three priority levels: level 1 defines first release requirements, level 2 defines second release requirements, and level 3 defines future release requirements.

To understand how the CGLWG is going to turn Linux into a carrier-grade solution, it's important for designers to understand the work going on in each category. Below, we'll detail the work underway in each one, looking at the three priority levels. Let's start with the standards category.

Category 1: Standards
CGLWG has recognized the importance of standards in the Linux space and is working on a set of specifications that promote portability, ease of programming, and software availability for telecom developers looking to implement Linux in an equipment design. Additionally, the working group is hoping to develop standards that make it easier to avoid problems in coding, and thus improve the reliability of the system.

In the standards category, the working group's first priority is to require Linux Standard Base (LSB) compliance, some IPv6 (including IPSECv6 and MIPv6) compliance, SNMP support, and POSIX interface compliance in the areas of timers, signals, message queues, semaphores, event logging, and threads.

The idea of LSB is to create a standard for Linux software that will promote interoperability between multiple software products. More information on this work can be found at www.linuxbase.org.

Simple Network Management Protocol (SNMP) support in carrier-grade Linux will include agent support for all three SNMP versions (SNMPv1, SNMPv2, and SNMPv3). The POSIX interface, on the other hand, will be based on the IEEE 1003.1, 1003.1b, and 1003.25 standards.

The IPv6 RFCs required by CGL are too numerous to list here. To learn more about them, visit www.osdl.org/projects/cgl.

Once the LSB, SNMP, POSIX, and IPv6 work is well underway, the working group will turn to priority 2, which adds support for the Stream Control Transmission Protocol (SCTP) and further IPv6 RFC compliance. SCTP is a next-generation, multi-link, end-to-end transport layer that provides reliable transmission of messages over potentially unreliable connectionless packet services such as IP.

In future releases (priority 3), the standards category will also include requirements for service availability middleware. These specs are expected to follow the efforts of the Service Availability Forum (www.saforum.org) to define standard interfaces, but may branch off to other requirements depending on the openness and availability of the Service Availability Forum's specification.

Category 2: Platform Requirements
When defining the platform, the main goal is to include support for hot-swap, remote boot, diskless operation, and console-less operation. The carrier-grade Linux definition of hot swap also includes the concept of hot insert (adding cards to the system not originally in place at boot time), hot remove (not replacing cards that are removed), and maintaining device identities across hot swaps and system boots.

Once these standards are set, the working group will turn its attention to the second priority, which adds network console and automatic alternate boot selection. Network console refers to the ability to have the system console for a CGL system appear on a remote system. Alternate boot selection refers to the ability of the system to select, on its own, an alternate kernel to boot if it detects too many reboots within a specified period of time.
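
The reboot-counting logic behind alternate boot selection can be sketched in a few lines of Python. This is only an illustration: the class name BootSelector, the kernel names, and the threshold and window values are assumptions, not part of the carrier-grade Linux spec.

```python
from collections import deque


class BootSelector:
    """Pick an alternate kernel when reboots cluster inside a time window.

    Hypothetical helper: names and defaults are illustrative only.
    """

    def __init__(self, kernels, max_reboots=3, window_s=300.0):
        self.kernels = list(kernels)      # ordered, primary kernel first
        self.max_reboots = max_reboots    # boots tolerated inside the window
        self.window_s = window_s          # sliding window length in seconds
        self.boot_times = deque()
        self.index = 0                    # index of the kernel currently in use

    def record_boot(self, now):
        """Record a boot at time `now` (seconds) and return the kernel to load."""
        self.boot_times.append(now)
        # Forget boots that have fallen out of the sliding window.
        while self.boot_times and now - self.boot_times[0] > self.window_s:
            self.boot_times.popleft()
        too_many = len(self.boot_times) > self.max_reboots
        if too_many and self.index < len(self.kernels) - 1:
            self.index += 1               # fall back to the next kernel
            self.boot_times.clear()       # give the alternate a fresh start
        return self.kernels[self.index]
```

With these assumed defaults, four boots inside five minutes switch the system to the second kernel in the list, while the same number of boots spread over a longer period leaves the primary kernel in place.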

There are currently no priority 3 platform requirements for carrier-grade Linux under development by CGLWG.

Category 3: Availability
Clearly, to operate in a telecom environment, high availability is a must. Recognizing this need, the CGLWG is working on a set of specifications for building high availability into the Linux OS.

The first priority here is driver hardening, watchdog timer support, application heartbeat monitor support, Ethernet bonding and aggregation, RAID 1 support, journaling file system support, and HA disk and volume management support.

Driver hardening refers to a process that includes code reviews, panic removal, and the addition of boundary condition checking and fault containment code for selected drivers. This hardening improves the uptime of the kernel and consequently the system. Carrier-grade Linux requires that drivers for networking, disk access, and logical storage devices (LVMs) be hardened.

Watchdog timers are hardware devices with the capability to reset (reboot) the system should the watchdog not be periodically reset by software. Carrier-grade Linux-based systems must support watchdog devices.
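
The contract between software and a watchdog can be modeled in software. The sketch below is a simulation in Python, not a driver: SoftWatchdog and FakeClock are hypothetical names, and a real system would rely on the hardware device rather than on a polled expired() check.

```python
class FakeClock:
    """Manually advanced clock so the example is deterministic."""

    def __init__(self):
        self.now = 0.0

    def __call__(self):
        return self.now

    def advance(self, seconds):
        self.now += seconds


class SoftWatchdog:
    """Model of a watchdog: it expires unless kicked within timeout_s."""

    def __init__(self, timeout_s, clock):
        self.timeout_s = timeout_s
        self.clock = clock            # any callable returning seconds
        self.last_kick = clock()

    def kick(self):
        """Called periodically by healthy software to postpone the reset."""
        self.last_kick = self.clock()

    def expired(self):
        """True once the timeout has elapsed without a kick (system would reset)."""
        return self.clock() - self.last_kick >= self.timeout_s
```

In a real program the clock argument could be time.monotonic. The fake clock is used here only so that the behavior can be checked without waiting.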

Heartbeat monitors are software versions of watchdog timers, with the added difference that a heartbeat monitor may simply restart an application instead of resetting the entire system. The carrier-grade Linux specs will provide an API for registering a monitor with the heartbeat service.
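
As a rough illustration of what registering with a heartbeat service might look like, here is a minimal Python sketch. The HeartbeatService class and its register, beat, and check methods are invented for this example. The actual API defined by the spec may look quite different.

```python
class HeartbeatService:
    """Restart applications that stop sending heartbeats (illustrative only)."""

    def __init__(self, clock):
        self.clock = clock    # callable returning the current time in seconds
        self._apps = {}       # name -> [interval_s, last_beat, restart_callback]

    def register(self, name, interval_s, restart):
        """Register an application and the callback used to restart it."""
        self._apps[name] = [interval_s, self.clock(), restart]

    def beat(self, name):
        """Called by a healthy application to signal that it is alive."""
        self._apps[name][1] = self.clock()

    def check(self):
        """Restart every application whose heartbeat is overdue."""
        now = self.clock()
        restarted = []
        for name, entry in self._apps.items():
            interval_s, last_beat, restart = entry
            if now - last_beat > interval_s:
                restart(name)      # restart the application, not the system
                entry[1] = now     # the restarted app gets a fresh interval
                restarted.append(name)
        return restarted
```

Note that only the silent application is restarted. The rest of the system keeps running, which is the key difference from a hardware watchdog.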

Ethernet bonding and aggregation refers to the ability to bind multiple Ethernet devices into a single logical device with the ability to detect and reroute traffic around a failed link. Aggregation is the ability to utilize all active links simultaneously to achieve higher throughput. Not all switches are able to handle Ethernet aggregation.
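
The difference between the two behaviors can be shown with a toy model in Python. The Bond class below is a hypothetical simulation of link selection only, and the mode names are labels chosen for this sketch. It does not configure any real network device.

```python
class Bond:
    """Toy model of a bonded logical link built from several physical links."""

    def __init__(self, links, mode="active-backup"):
        self.links = list(links)
        self.up = {link: True for link in self.links}
        self.mode = mode          # "active-backup" or "aggregate"
        self._next = 0            # round-robin counter for aggregation

    def set_link(self, link, is_up):
        """Record a link failure or recovery."""
        self.up[link] = is_up

    def pick(self):
        """Return the physical link that carries the next frame."""
        active = [link for link in self.links if self.up[link]]
        if not active:
            raise RuntimeError("no active link in bond")
        if self.mode == "active-backup":
            return active[0]      # first healthy link carries all traffic
        link = active[self._next % len(active)]   # spread frames over links
        self._next += 1
        return link
```

In bonding mode, traffic moves to the second link only when the first one fails. In aggregation mode, frames alternate across all healthy links, and a failed link simply drops out of the rotation.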

The carrier-grade Linux spec will also require high-availability file system features including RAID 1 (disk mirroring), journaled file systems such as JFS, XFS, ext3 and ReiserFS, and online logical volume management features such as changing volume definitions and sizes without taking the volume out of service.

Once the above features are defined, the working group's second priority is to add watchdog timer pre-interrupt (for platforms that support this feature) and software live upgrade features to the carrier-grade Linux spec.

The watchdog timer pre-interrupt will call an interrupt routine that may be able to handle the problem before the actual watchdog timer expires and resets the system. Software live upgrade refers to the ability to upgrade system and application software in a running system with a minimal impact on the availability of services. Minimal impact is specified as 30 to 60 seconds or less depending on hardware parameters and the software being upgraded.

The third priority here will be the addition of application fail-over support, data checkpointing services, multi-node volume management, and support for cluster file systems.

Category 4: Serviceability
On the serviceability front, the CGLWG is currently trying to define requirements for resource monitoring, kernel crash dump and analysis features, structured kernel messages, dynamic kernel probing, hardware error logging, and remote access to the event log.

The carrier-grade Linux specification will provide a publish-and-subscribe resource monitoring API for tracking kernel gauges. Kernel gauges provided by the spec currently include Ethernet traffic, free memory pages, processes created, number of zombies (terminated but still present processes and threads), and current kernel load. New gauges are easy to add to a system for tracking other information, and gauges may be self-discovered by user-level software by utilizing the unique gauge string identifier.
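
A publish-and-subscribe gauge interface of this kind could be sketched as follows in Python. The GaugeBus class and the gauge identifier strings are assumptions made for illustration and are not taken from the spec.

```python
from collections import defaultdict


class GaugeBus:
    """Minimal publish-and-subscribe hub for named gauges (illustrative)."""

    def __init__(self):
        self._values = {}                       # gauge id -> latest value
        self._subscribers = defaultdict(list)   # gauge id -> callbacks

    def publish(self, gauge_id, value):
        """Record a new gauge value and notify interested subscribers."""
        self._values[gauge_id] = value
        for callback in self._subscribers[gauge_id]:
            callback(gauge_id, value)

    def subscribe(self, gauge_id, callback):
        """Ask to be called whenever the named gauge is published."""
        self._subscribers[gauge_id].append(callback)

    def discover(self):
        """Let user-level software list every gauge published so far."""
        return sorted(self._values)
```

A monitoring tool might call discover() to find the available gauge identifiers and then subscribe only to the ones it cares about, for example a hypothetical "mem.free_pages" gauge.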

Monitor options in the specification include high and low watermarks for system resources, alarm events, error conditions, rate monitors, and "leaky bucket" monitors. Using this API, system developers can keep track of system resources to help diagnose problems and provide remedial action before the system becomes unusable.
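
Two of these monitor types are easy to picture in code. The Python sketch below shows one plausible reading of a watermark monitor and a "leaky bucket" monitor. The class names, parameters, and thresholds are assumptions rather than definitions from the spec.

```python
class WatermarkMonitor:
    """Report when a gauge leaves the band between low and high watermarks."""

    def __init__(self, low, high):
        self.low = low
        self.high = high

    def check(self, value):
        if value < self.low:
            return "low"
        if value > self.high:
            return "high"
        return None               # value is inside the normal band


class LeakyBucket:
    """Tolerate occasional errors but flag a sustained burst of them."""

    def __init__(self, capacity, leak_per_s):
        self.capacity = capacity      # events the bucket can hold
        self.leak_per_s = leak_per_s  # events forgiven per second
        self.level = 0.0
        self.last = None

    def event(self, now):
        """Record one event at time `now`; return True if the bucket overflows."""
        if self.last is not None:
            drained = (now - self.last) * self.leak_per_s
            self.level = max(0.0, self.level - drained)
        self.last = now
        self.level += 1.0
        return self.level > self.capacity
```

The leaky bucket ignores errors that arrive slowly, because the bucket drains between them, but it raises an alarm when errors arrive faster than the leak rate for long enough to overflow the bucket.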

The new Linux specification will also provide structured messages for kernel errors, which provide more information to allow higher-level software to diagnose the problem and its severity, and to take remedial action.

Dynamic kernel probing is a technology that allows designers to examine data structures and establish break points without the requirement that debugging code be compiled into the kernel itself. Kernel probing allows developers to observe system behavior easily and quickly at the debugger level in a running system, without having to compile the modules under observation with debug capabilities beforehand, or indeed even having to decide in advance which modules may need to be probed. Dynamic kernel probing also allows system maintainers to examine a deployed system "in situ" without needing to bring it down.

Hardware error logging refers to the ability to have hardware error traps (interrupts) be logged by the event logging facility. Remote access refers to the ability of a central facility to access the error logs of managed systems.

After the above requirements are set, priority 2 will add fast boot requirements, consistent system device enumeration, online diagnostics, and forced unmount of file systems.

Consistent device enumeration refers to a capability by which applications can itemize all devices in the system and register these devices so they receive hot-swap events (insertions, removals). All devices in the system shall maintain consistent identification across removals, insertions, and reboots. Note that most of this capability is required in the priority 1 platform requirements for hot swap. The additional priority 2 requirement here is the API by which applications can access the consistent device identities.
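
One way to picture such an API is a registry that ties each identity to a physical location, so that a card pulled from a slot and reinserted keeps its identity. The Python sketch below assumes exactly that. The DeviceRegistry class, the slot names, and the identity format are hypothetical.

```python
class DeviceRegistry:
    """Keep device identities stable across removals and insertions."""

    def __init__(self):
        self._ids = {}          # physical location -> stable identity
        self._present = set()   # locations currently populated
        self._listeners = []    # callbacks for hot-swap events

    def subscribe(self, callback):
        """Register an application callback for insert/remove events."""
        self._listeners.append(callback)

    def _notify(self, kind, dev_id):
        for callback in self._listeners:
            callback(kind, dev_id)

    def insert(self, location):
        """Handle a hot insert; reuse the identity if the slot was seen before."""
        dev_id = self._ids.setdefault(location, f"dev{len(self._ids)}")
        self._present.add(location)
        self._notify("insert", dev_id)
        return dev_id

    def remove(self, location):
        """Handle a hot remove; the identity is remembered for later."""
        dev_id = self._ids[location]
        self._present.discard(location)
        self._notify("remove", dev_id)

    def list_devices(self):
        """Itemize the identities of all devices currently in the system."""
        return sorted(self._ids[loc] for loc in self._present)
```

Because the mapping from location to identity is never discarded, a real implementation would also need to persist it so that identities survive a reboot. That part is left out of this sketch.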

Through the priority 2 development work, the carrier-grade Linux spec will allow applications to register with an online diagnostic framework and perform diagnostics on registered devices. Devices will be able to add themselves to the diagnostic framework at any time.

Forced unmount functions will also be provided. Forced unmount refers to the capability to force an unmount of a mounted file system, even if there are open files or process working directories on it. Pending requests on the file system will be terminated and return an error indication and the file system will be left in a clean state (if possible).

Priority 3 in the serviceability category adds panic handler enhancements to allow more flexible responses to a panic, and kernel dump generation for a live system. System panics currently halt Linux-based systems. The carrier-grade Linux spec will add the capability to have the panic add an event to a (persistent) error log and then provide the option to reboot, power off, or power cycle the system when a panic occurs.

Kernel dump generation of a live system refers to taking a snapshot memory and resources dump of a live system without unduly perturbing the execution of the system.

Category 5: Tools
Priority 1 in the tools category will focus on the development of debugger support for threaded programs, kernel debugger support, and a kernel crash dump analysis tool.

Threaded program debugging support includes the ability to apply standard debug commands to any thread in a threaded program. Some commands specific to threaded programs will be added, such as notification when a new thread is created, listing of the threads in a process, the ability to switch between threads, the ability to apply debug commands to a list of threads, and the ability to hide/un-hide a thread from the thread list.

CGL kernel debugger support requires, in addition to standard debug facilities, the ability to break into a running kernel (either at boot time or during normal operation), to debug an SMP kernel, and to support remote debugging via a serial port.

Priority 2 in the tools category will add support for fault injection testing, kernel tracing and kernel profiling.

Fault injection testing is a method by which software faults can be injected into the kernel to test its ability to recover from errors. It removes the need to create elaborate tests that reproduce the same fault conditions through normal kernel operations, allowing for easier and more complete testing of seldom-used paths in the kernel.
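
The same idea can be demonstrated at user level. In the Python sketch below, a named injection point is placed in a read routine so that a test can force an I/O error and exercise the recovery path. The FaultInjector class, the "disk.read" point name, and the mirror fallback are all assumptions made for this example and say nothing about how the kernel facility is implemented.

```python
class FaultInjector:
    """Raise an armed exception when code passes a named injection point."""

    def __init__(self):
        self._armed = {}    # point name -> exception to raise

    def arm(self, point, exc):
        self._armed[point] = exc

    def disarm(self, point):
        self._armed.pop(point, None)

    def check(self, point):
        """Called from production code; a no-op unless the point is armed."""
        exc = self._armed.get(point)
        if exc is not None:
            raise exc


faults = FaultInjector()


def read_block(device, block_no):
    faults.check("disk.read")    # injection point on the normal read path
    return device[block_no]


def read_with_fallback(device, mirror, block_no):
    """Seldom-used recovery path: fall back to the mirror on an I/O error."""
    try:
        return read_block(device, block_no)
    except IOError:
        return mirror[block_no]
```

Arming "disk.read" with an IOError makes read_with_fallback take its recovery branch on demand, without having to provoke a genuine disk failure.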

Kernel tracing provides a set of trace points in the kernel for most significant kernel operations, and an interface for enabling and filtering them, as well as a graphical analysis tool. Tracing is a very powerful diagnostic technique for analyzing system activity, especially for debugging problems caused by interactions between programs and the kernel.
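
A stripped-down model of trace points with an enable and filter interface might look like this in Python. The Tracer class and the trace point names are illustrative assumptions. The graphical analysis tool is outside the scope of the sketch.

```python
class Tracer:
    """Collect events from named trace points that have been enabled."""

    def __init__(self):
        self.enabled = set()    # names of trace points currently switched on
        self.events = []        # recorded (name, fields) tuples, in order

    def enable(self, name):
        self.enabled.add(name)

    def disable(self, name):
        self.enabled.discard(name)

    def trace(self, name, **fields):
        """Trace point hook: records the event only if the point is enabled."""
        if name in self.enabled:
            self.events.append((name, fields))

    def filter(self, **match):
        """Return recorded events whose fields equal all given values."""
        return [
            event for event in self.events
            if all(event[1].get(key) == value for key, value in match.items())
        ]
```

Disabled trace points cost only a set lookup, which reflects the usual design goal that tracing should be cheap when it is switched off.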

Profiling refers to the ability to build a picture of how the kernel spends its time, either by using the tracing facility (above), or by sampling kernel execution via interrupts and recording the kernel activities that are interrupted.
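
The sampling approach has a close user-level analogue. The Python sketch below uses the standard signal module's profiling interval timer on a Unix system to interrupt a running function at regular intervals of CPU time and count which function was executing. The profile and spin function names and the 5-ms interval are choices made for this example, and the code profiles a Python program, not the kernel.

```python
import collections
import signal
import time


def profile(func, interval_s=0.005):
    """Run func() and count which function each timer interrupt lands in."""
    samples = collections.Counter()

    def on_sample(signum, frame):
        if frame is not None:
            samples[frame.f_code.co_name] += 1   # record interrupted function

    old = signal.signal(signal.SIGPROF, on_sample)
    signal.setitimer(signal.ITIMER_PROF, interval_s, interval_s)
    try:
        func()
    finally:
        signal.setitimer(signal.ITIMER_PROF, 0)  # stop sampling
        signal.signal(signal.SIGPROF, old if old is not None else signal.SIG_DFL)
    return samples


def spin():
    """Burn roughly 0.2 s of CPU time so there is something to sample."""
    deadline = time.process_time() + 0.2
    while time.process_time() < deadline:
        pass
```

Calling profile(spin) returns a counter in which spin should dominate. The exact sample counts vary from run to run, which is inherent to sampling.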

Priority 3 will add debugging support for following a fork system call into the child process.

Category 6: Performance
In the performance category, the priority 1 requirements include millisecond real-time response (worst-case latencies of less than 10 ms), a pre-emptible kernel, RAID 0 (striping), application pre-loading, and a scaling analysis and report to identify scaling bottlenecks.
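
The latency requirement is stated as a worst case, not an average, which changes how measurements should be judged. A minimal Python sketch of such a check follows. The function names and the sample data format are assumptions, and only the 10-ms figure comes from the requirement above.

```python
def worst_case_latency_ms(wakeups):
    """Largest delay between scheduled and actual wake-up times.

    wakeups: iterable of (scheduled_ms, actual_ms) pairs.
    """
    return max(actual - scheduled for scheduled, actual in wakeups)


def meets_budget(wakeups, budget_ms=10.0):
    """True only if every single wake-up stayed under the latency budget."""
    return worst_case_latency_ms(wakeups) < budget_ms
```

A single late wake-up is enough to fail the check, no matter how good the typical latency looks.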

Priority 2 in this category will add remediation for problems found in the scaling analysis performed during the Priority 1 phase, virtual memory system improvements (page pinning and flushing APIs), SMP CPU process affinity and support for multiple scheduler policies. Priority 3, on the other hand, will bring support for a self-resizing file system for transient data.
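
Process affinity can already be tried from user space on current Linux systems. The Python sketch below uses os.sched_getaffinity and os.sched_setaffinity, which the standard library exposes on Linux only. The pinned_to helper is a hypothetical convenience wrapper and not something defined by the spec.

```python
import contextlib
import os


@contextlib.contextmanager
def pinned_to(cpu, pid=0):
    """Temporarily restrict a process (default: the caller) to one CPU."""
    original = os.sched_getaffinity(pid)      # remember the allowed CPU set
    os.sched_setaffinity(pid, {cpu})          # pin to the requested CPU
    try:
        yield
    finally:
        os.sched_setaffinity(pid, original)   # restore the previous set
```

Inside a "with pinned_to(cpu):" block the scheduler keeps the process on the chosen CPU, which can help cache locality for latency-sensitive work. The previous affinity is restored on exit.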

Category 7: Security
The current version (V1.0) of the carrier-grade Linux specification does not have any requirements for security, but this will be addressed in a future version of the document. Security is recognized by the CGLWG as a key component of high availability.

More on the Spec
Above we provided a brief look at some of the key development work being conducted by the CGLWG. For more information on the carrier-grade Linux specification, visit www.osdl.org.

About the Author
John Mehaffey is a high availability architect at John is the author of the Linux appendix for the PICMG Hot Swap Infrastructure Interface Specification (PICMG 2.12), and has been active in the PICMG 2.13 subcommittee working on the Redundant System Slot Specification. John can be reached by email via mehaf@mvista.com.











All materials on this site Copyright © 2008 TechInsights, a Division of United Business Media LLC All rights reserved.