The openness of networked communications puts greater pressure on designers to protect systems from internal and external threats. How do you protect the network at all levels? One aspect is, of course, what is commonly referred to as network security - in the IP world, security protocols such as IPsec, SSL, and SSH, and related protocols like IKE and RADIUS, are finding their way into an increasing number of embedded devices - devices whose function is not part of the network infrastructure itself. These protocols do a good job of protecting a system from unauthorized access, and that is important. But secure access is only half the problem as the world of network-connected devices grows.
What about problems that arise from usage that is unintended but, from a security point of view, technically legal? Will the device fail, or operate in an unsafe manner that harms its networked environment? Is it being used in ways the designers did not envision? How can the managers of such devices detect these situations? And when such situations arise, can the device software be upgraded to fix or mitigate the problem?
The whole issue becomes one of developing robust network-connected devices and systems that can deal with a potentially unpredictable environment, where the emphasis is on correct and safe operation.
While simple or relatively simple networked devices need not implement the full set of high-availability (HA) strategies used in infrastructure equipment, they still need to be able to: a) manage faults so that failure information can be retrieved, and b) fail in a "safe" manner. Only some aspects of HA fault management are important here; full repair or fault-tolerance capabilities to ensure minimal interruption of service are not strictly needed.
Of the HA concepts that are needed, the most notable is component "isolation". Isolation means that the chief components of a system, including the operating system, are separated both logically and physically (via an MMU or memory protection/partitioning). Thus if a serious error condition occurs, the error can be contained and localized within the affected component. This also enables a better job of fault identification. Opportunities for dynamic repair are improved as well; while repair is not strictly required here, proper reporting or registration of the fault is. Graceful failure and reboot could be an option for many devices. But mere isolation of components is not enough. It is a tenet of HA philosophy that error detection and handling cannot be part of the potentially affected component. Error handling therefore must reside outside the component - i.e. a centralized error-handling system, often associated with the operating system, is critical.
Another important facet of fault management is fault detection, and a key aspect of fault detection in HA is "health monitoring", sometimes called "supervision". Some failure modes do not generate an observable error condition - the subsystem or component is still "running" but not executing correctly. Health monitoring is a technique wherein all subsystems or components are monitored for correct operation.
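The supervision idea can be sketched in a few lines of C. In this illustrative example (the names hm_register, hm_heartbeat, and hm_check are hypothetical, not taken from any particular RTOS), each component periodically reports a heartbeat, and a supervisor flags any component whose last heartbeat is older than its allowed deadline - catching the "still running but not executing correctly" case.

```c
/* Minimal health-monitoring sketch; all names are illustrative. */
#include <assert.h>

#define MAX_COMPONENTS 8

typedef struct {
    const char *name;
    unsigned    deadline_ticks;   /* max allowed gap between heartbeats */
    unsigned    last_beat;        /* tick of most recent heartbeat      */
    int         healthy;
} hm_entry;

static hm_entry hm_table[MAX_COMPONENTS];
static int      hm_count;

/* Register a component with the supervisor; returns its id. */
int hm_register(const char *name, unsigned deadline_ticks)
{
    hm_entry *e = &hm_table[hm_count];
    e->name = name;
    e->deadline_ticks = deadline_ticks;
    e->last_beat = 0;
    e->healthy = 1;
    return hm_count++;
}

/* Called periodically by each supervised component. */
void hm_heartbeat(int id, unsigned now)
{
    hm_table[id].last_beat = now;
}

/* Supervisor scan: returns the number of unhealthy components at 'now'. */
int hm_check(unsigned now)
{
    int bad = 0;
    for (int i = 0; i < hm_count; i++) {
        hm_table[i].healthy =
            (now - hm_table[i].last_beat) <= hm_table[i].deadline_ticks;
        if (!hm_table[i].healthy)
            bad++;
    }
    return bad;
}
```

In a real system the supervisor would run as its own task and report missed heartbeats to the fault manager rather than just counting them.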
Designers of networked embedded devices/systems cannot be expected to anticipate everything that their systems may encounter in the field. It therefore becomes necessary to devise strategies for performing newly designed and developed tests and analyses on those systems "on-demand", or dynamically, as new and unanticipated situations are encountered. We are not talking about simple on-line access to already "designed-in" diagnostics or tests. Rather, we are talking about a bolder proposal - downloading and executing new programs that can peek and poke around the system, while it is in operation, to examine and gather data on its internal operation and/or history of operations. This may seem risky, but with proper attention to design it can be achieved. Of course, loading new programs into the system would be done with appropriate security authentication and access controls.
If bugs, or problems in the device/system's adaptation to its environment, are uncovered, then the device software must be upgraded, as physical repair options (manual replacement or maintenance) may be impractical or very costly. It is not enough merely to download new software into the device from the network; the software must also be stored in local persistent storage, so that it becomes a permanent part of the application and survives reboot operations. And it is highly desirable to be able to do this at the component level - i.e. to replace a single loadable component rather than the whole operating environment. It is not strictly required that such operations be done without taking the system down or "out of service" for a time, but it is desirable.
Dynamically loading temporary tests requires that the test software ultimately be deleted or "killed". From an operating system viewpoint, this means that the processes or tasks within the test's domain are terminated. It is additionally necessary, and extremely important, to return, reclaim, or clean up all other resources that are owned or have been allocated by the subsystem. Otherwise, the system could suffer resource leaks and other failures over time. Often the operating system provides some help with this, and such support in particular should be looked for as a beneficial RTOS feature.
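The reclamation requirement can be illustrated with a simple ownership table. In this sketch (the names rt_alloc and rt_kill_component are hypothetical), every allocation is tagged with the id of the component that made it, so that killing a component frees everything it still owns - the kind of bookkeeping an RTOS with per-process resource tracking would do for you.

```c
/* Sketch of per-component resource tracking, so that killing a
 * dynamically loaded component reclaims all of its allocations.
 * Function names are illustrative, not from any particular RTOS. */
#include <assert.h>
#include <stdlib.h>

#define MAX_RES 32

typedef struct {
    void *ptr;    /* NULL means the slot is free */
    int   owner;  /* id of the owning component  */
} rt_rec;

static rt_rec rt_table[MAX_RES];
static int    rt_live;   /* number of live allocations */

/* Allocate memory on behalf of component 'owner'. */
void *rt_alloc(int owner, size_t n)
{
    void *p = malloc(n);
    if (p == NULL)
        return NULL;
    for (int i = 0; i < MAX_RES; i++) {
        if (rt_table[i].ptr == NULL) {
            rt_table[i].ptr = p;
            rt_table[i].owner = owner;
            rt_live++;
            return p;
        }
    }
    free(p);          /* table full: fail rather than leak untracked memory */
    return NULL;
}

/* Free every resource still owned by 'owner'; returns count reclaimed. */
int rt_kill_component(int owner)
{
    int n = 0;
    for (int i = 0; i < MAX_RES; i++) {
        if (rt_table[i].ptr != NULL && rt_table[i].owner == owner) {
            free(rt_table[i].ptr);
            rt_table[i].ptr = NULL;
            rt_live--;
            n++;
        }
    }
    return n;
}
```

A real implementation would track more than heap memory - file descriptors, signals, timers, and communication endpoints all need the same owner-tagged cleanup.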
Due to the wide variety of faults and their differing nature, error handling - or more precisely, fault management - must not be handled at the lowest level, i.e. in the code that detects the error. Processing error returns from system entities (such as an RTOS) in application code is generally a bad idea. Rather, it is better to "throw" the error to a centralized fault-management entity, in order to a) help contain the fault, and b) effect consistent recovery and/or reporting policies. It is good practice to associate an error handler or fault manager with the system as a whole, as well as with each defined application component. It is especially important that the fault manager NOT be part of the application domain itself, as we must assume that the whole recovery domain may be corrupted. Another benefit of this approach is that separating error/fault-handling code from the application into a centralized place makes for simpler code modules, and simplicity is a virtue in practically any design. In essence, the fault-management component is part of the operating environment, and any help that the operating system or RTOS itself can provide in this regard is of great benefit.
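The "throw, don't handle locally" pattern can be sketched as follows. In this hypothetical example (fm_throw and the policy table are illustrative, not a real RTOS API), the faulting component reports only what happened; the decision about what to do lives entirely in the central manager, outside the application domain.

```c
/* Sketch of a centralized fault manager: components report faults,
 * policy is decided centrally.  All names are illustrative. */
#include <assert.h>

typedef enum { FAULT_MINOR, FAULT_MAJOR, FAULT_FATAL } fm_severity;
typedef enum { FM_IGNORE, FM_RESTART, FM_REBOOT }      fm_action;

/* Counters stand in for the real recovery actions. */
static int fm_restarts;
static int fm_reboots;

/* Central policy: map severity to an action.  The faulting component
 * never decides its own recovery - it only reports. */
fm_action fm_throw(int component_id, fm_severity sev)
{
    (void)component_id;   /* a real manager would log and contain per component */
    switch (sev) {
    case FAULT_MINOR:
        return FM_IGNORE;             /* log only */
    case FAULT_MAJOR:
        fm_restarts++;                /* restart just the isolated component */
        return FM_RESTART;
    default:
        fm_reboots++;                 /* last resort: graceful reboot */
        return FM_REBOOT;
    }
}
```

Because the policy is in one place, changing the recovery strategy (say, escalating repeated restarts of the same component to a reboot) touches the manager only, not every call site in the application.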
One way to ensure that a dynamically loaded component is properly isolated from other system components is to make it a separately or independently linked entity. This means it is compiled and linked as a complete program, with no external unresolved references, and it may use the full power of the linker to define all appropriate text and data sections in a single file. The operating environment should therefore support loading of separate programs. Having all the location information in a standard object-module format makes it easy to apply the proper physical isolation, i.e. MMU protection of the loadable module.
Another requirement derived from any dynamic component-loading feature is dynamic configurability of those components. If a newly loaded component is to be properly and logically isolated from the rest of the system, it must be able to dynamically bind or configure itself to the run-time environment through standard operating system or operating environment interfaces, so that it can establish communication and operation with the rest of the system. Static configuration of a software component should be avoided wherever possible. Dynamic loading of individual components, while not strictly a requirement here, is often a very useful fault-recovery or repair aid.
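One common way to achieve such run-time binding is a name-based service registry, sketched below. In this illustrative example (svc_register, svc_lookup, and the "diag.double" service name are all hypothetical), a newly loaded component registers its entry points by name at start-up, and peers resolve them at run time instead of relying on any static linkage between components.

```c
/* Sketch of name-based dynamic binding between loaded components.
 * All names are illustrative, not from any particular RTOS. */
#include <assert.h>
#include <stddef.h>
#include <string.h>

#define MAX_SVC 8

typedef int (*svc_fn)(int);

typedef struct {
    const char *name;
    svc_fn      fn;
} svc_entry;

static svc_entry svc_table[MAX_SVC];
static int       svc_count;

/* Called by a component at load time to publish an entry point. */
int svc_register(const char *name, svc_fn fn)
{
    if (svc_count == MAX_SVC)
        return -1;
    svc_table[svc_count].name = name;
    svc_table[svc_count].fn = fn;
    return svc_count++;
}

/* Called by peers to resolve a service at run time. */
svc_fn svc_lookup(const char *name)
{
    for (int i = 0; i < svc_count; i++)
        if (strcmp(svc_table[i].name, name) == 0)
            return svc_table[i].fn;
    return NULL;   /* caller must handle the service being absent */
}

/* Example entry point a loaded diagnostic component might export. */
static int double_it(int x) { return 2 * x; }
```

Because lookups can fail gracefully, a component can also be unloaded and replaced without its peers holding stale static references - which is exactly the property dynamic upgrade and repair depend on.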
Dynamic configurability of individual software components usually needs direct support from the operating environment, as it often includes communication issues and run-time library or shared library issues.
Michael Christofferson is Product Marketing Manager at Enea Embedded Technology, San Diego, Calif.
See related chart