In many distributed computer systems, large numbers of computers or processing nodes control various facets of the system. Typical of such systems are those in many telecom environments and in new, highly decentralized Internet data centers consisting of literally hundreds of server blades.
As the capabilities of computer software and hardware systems increase, so does the critical nature of the tasks that those systems are expected to perform. Whether the system contains tens, hundreds or thousands of nodes, designers must address the monumental need for fault detection and notification in such systems.
Among the many changes required is a shift away from current non-scalable, heartbeat-based fault detection and notification solutions, which range from low-level, in-hardware interrupt and polling systems to expensive, complex software hard-coded into an application.
A new, highly scalable, lightweight approach that can be used in small-footprint environments makes use of a virtual ring detection, notification and fault-tolerant information transfer (NIFTI) mechanism. This mechanism allows monitoring of hundreds to thousands of nodes and real-time reporting of node "health."
Important to net-centric distributed computing environments, a NIFTI mechanism could be installed in every node in a way that is both independent of, and transparent to, applications and the OS. Consider, for example, a DSL provider that has many nodes distributed across multiple locations. With the NIFTI virtual ring mechanism, the DSL provider can respond proactively to faults, before performance or access problems prompt angry customers to call. NIFTI also eliminates the need for the DSL provider's programmers to build custom fault detection and notification into the application, saving time, money and resources.
The NIFTI approach differs from typical fault detection and notification systems, which are expensive and become less effective as the node count increases. In the typical approaches, either a central processor pings each node individually, which doubles the load on the network with bi-directional traffic and reduces available bandwidth, or the nodes send periodic heartbeats to the central processor. In either case, an increase in the number of nodes can overwhelm the central processor or the network, causing heartbeats to be lost to overload.
In contrast, the new NIFTI approach has each node monitor only its local neighbor. The central processor receives notification only in the event of a fault or failure. The virtual ring can be of any size, scales to large numbers of nodes and can span large geographical areas. The default transport protocol is TCP/IP, but NIFTI functionality is easily ported to other transports.
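The difference in load on the central processor can be made concrete with a rough count. The sketch below is only an illustration under simplifying assumptions that are not part of NIFTI itself: one message per check, no retries, and a fault-free monitoring interval. The function name and scheme labels are hypothetical.

```python
def central_messages_per_interval(n_nodes: int, scheme: str) -> int:
    """Messages the central processor sends or receives in one fault-free interval.

    Assumes one message per check and no retries (illustrative only).
    """
    if scheme == "central_ping":
        return 2 * n_nodes  # one ping out plus one reply back for every node
    if scheme == "heartbeat":
        return n_nodes      # one heartbeat in from every node
    if scheme == "ring":
        return 0            # neighbors check each other; the center hears only about faults
    raise ValueError(f"unknown scheme: {scheme}")


if __name__ == "__main__":
    for n in (10, 100, 1000):
        counts = {s: central_messages_per_interval(n, s)
                  for s in ("central_ping", "heartbeat", "ring")}
        print(n, counts)
```

In the ring case the neighbor checks still generate traffic, but that traffic is spread across the nodes instead of converging on one processor.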
The NIFTI fault detector and notifier system consists of two primary components: the Peer Node and the Central Unit (CU). The system's Peer Nodes form the virtual ring. The Peer Nodes are ordered on the ring according to their IP addresses, or by other unique IDs when other transports are used. At the node level, the NIFTI approach has each Peer Node monitor its upstream neighbor. When a Peer Node detects a fault in that neighbor, it sends an error report to the server, or CU.
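The ring ordering and neighbor monitoring described above might be sketched as follows. This is a minimal illustration, not the NIFTI implementation: the function names are hypothetical, "upstream" is assumed to mean the preceding node in sorted order (wrapping around at the start), all addresses are assumed to be of the same IP family, and the liveness check is passed in as a plain callable instead of a real network probe.

```python
import ipaddress


def build_ring(addresses):
    """Order Peer Nodes numerically by IP address (not as strings)."""
    return sorted(addresses, key=ipaddress.ip_address)


def upstream_of(ring, node):
    """Return the neighbor that `node` monitors: its predecessor on the ring."""
    i = ring.index(node)
    return ring[i - 1]  # index -1 wraps around to the last node


def fault_reports(ring, is_alive):
    """Each live node checks its upstream neighbor.

    Returns (reporter, failed_node) pairs that would be sent to the CU.
    """
    reports = []
    for node in ring:
        if not is_alive(node):
            continue  # a crashed node cannot report anything
        target = upstream_of(ring, node)
        if target != node and not is_alive(target):
            reports.append((node, target))
    return reports


if __name__ == "__main__":
    ring = build_ring(["10.0.0.10", "10.0.0.2", "10.0.0.1"])
    print(ring)
    # Simulate a crash of 10.0.0.2; only its downstream neighbor reports it.
    print(fault_reports(ring, lambda n: n != "10.0.0.2"))
```

Sorting through `ipaddress.ip_address` matters because a plain string sort would place 10.0.0.10 before 10.0.0.2.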
At an application level, monitoring can be extended to various applications on the same node. This occurs while the Peer Node neighbor monitors the whole node for a crash. This hierarchy allows for scalable distributed fault detection and centralized fault management for easy system maintenance and recovery.
Typically, a system incorporating NIFTI contains two or more CUs to balance the system load and avoid a single point of failure. The reliable, in-memory, distributed and synchronized CU database contains the current system state and should be replicated on redundant CUs to maintain data consistency. CUs report system faults to a Fault Notifier, which allows users to subscribe to, and be notified of, fault reports from the CU database. Various kinds of filters can be applied to fault reports, ranging from specific-node faults to time-specified faults to rules-based faults.
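One way to picture the Fault Notifier's subscription and filtering is a small publish/subscribe sketch. Everything named here (`FaultReport`, `FaultNotifier`, `by_node`, `in_window`) is hypothetical and chosen for illustration; the actual NIFTI interfaces are not described in this article.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass(frozen=True)
class FaultReport:
    node: str         # address or ID of the failed node
    timestamp: float  # seconds since some agreed epoch
    kind: str         # e.g. "node_crash" or "app_fault" (illustrative labels)


Filter = Callable[[FaultReport], bool]
Handler = Callable[[FaultReport], None]


class FaultNotifier:
    """Delivers each published report to every subscriber whose filter accepts it."""

    def __init__(self) -> None:
        self._subscriptions: List[Tuple[Filter, Handler]] = []

    def subscribe(self, flt: Filter, handler: Handler) -> None:
        self._subscriptions.append((flt, handler))

    def publish(self, report: FaultReport) -> int:
        delivered = 0
        for flt, handler in self._subscriptions:
            if flt(report):
                handler(report)
                delivered += 1
        return delivered  # number of subscribers notified


def by_node(node: str) -> Filter:
    """Specific-node filter."""
    return lambda r: r.node == node


def in_window(start: float, end: float) -> Filter:
    """Time-specified filter: start <= timestamp < end."""
    return lambda r: start <= r.timestamp < end


if __name__ == "__main__":
    notifier = FaultNotifier()
    notifier.subscribe(by_node("10.0.0.2"), lambda r: print("node filter:", r))
    # A rules-based filter is just an arbitrary predicate over the report.
    notifier.subscribe(lambda r: r.kind == "node_crash" and r.timestamp > 100,
                       lambda r: print("rule filter:", r))
    notifier.publish(FaultReport("10.0.0.2", 150.0, "node_crash"))
```

Treating every filter as a predicate keeps the three filter types mentioned above interchangeable from the notifier's point of view.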
NIFTI supports dynamic setup and configuration change. Peer Nodes can join and leave the virtual ring at any time, and CUs join and leave the CU group at any time. When a new CU is added to the system, the distributed database automatically synchronizes all of the CU databases, and CUs communicate changes in their membership to the Peer Nodes efficiently.
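A useful property of a sorted ring is that a join or leave touches only the monitoring relationships next to the change. The sketch below illustrates that idea for Peer Nodes only. It is an assumption-laden illustration, not NIFTI's protocol: the `VirtualRing` class and its methods are hypothetical, "upstream" is again taken to mean the predecessor in IP order, and CU membership and database synchronization are not modeled.

```python
import bisect
import ipaddress


class VirtualRing:
    """Sorted ring of Peer Nodes; each node monitors its predecessor."""

    def __init__(self):
        self._nodes = []  # ipaddress objects, kept in ascending order

    def upstream(self, addr):
        i = self._nodes.index(ipaddress.ip_address(addr))
        return str(self._nodes[i - 1])  # index -1 wraps to the last node

    def join(self, addr):
        """Add a node; return (monitor, new_target) pairs that changed."""
        ip = ipaddress.ip_address(addr)
        i = bisect.bisect_left(self._nodes, ip)
        if i < len(self._nodes) and self._nodes[i] == ip:
            return []  # already a member
        self._nodes.insert(i, ip)
        if len(self._nodes) == 1:
            return []  # a lone node has no neighbor to monitor
        successor = self._nodes[(i + 1) % len(self._nodes)]
        return [(str(ip), self.upstream(addr)), (str(successor), str(ip))]

    def leave(self, addr):
        """Remove a node; return (monitor, new_target) pairs that changed."""
        i = self._nodes.index(ipaddress.ip_address(addr))  # ValueError if absent
        self._nodes.pop(i)
        if len(self._nodes) < 2:
            return []
        successor = str(self._nodes[i % len(self._nodes)])
        return [(successor, self.upstream(successor))]


if __name__ == "__main__":
    ring = VirtualRing()
    for a in ("10.0.0.1", "10.0.0.10", "10.0.0.2"):
        print("join", a, ring.join(a))
    print("leave 10.0.0.2", ring.leave("10.0.0.2"))
```

In this model a join changes at most two monitoring targets and a leave changes one, regardless of how many nodes are on the ring.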
See related chart