Design Article
Designing secure, reliable systems
Brian Doherty, Engineering manager, Integrity RTOS, Green Hills Software, Inc., Santa Barbara, Calif.
8/1/2003 8:16 AM EDT
The rapid technological advancement in the embedded software industry brings exponentially increasing complexity to its products, as devices combine more functions on the same processor. As a result, reliability issues are more important than ever, as the likelihood of unwanted code interaction increases, and as the consequences of system failure become more dramatic. In addition, the growing popularity of connecting devices to a global network has increased the danger posed by those who would violate systems for nefarious purposes. The result is that the issues of security and reliability, which once were of primary concern only to the military/aerospace and telecommunications industries, are now becoming paramount in smaller grade devices. This has increased the feature set required of real-time operating systems (RTOS), as they are called upon not just to parcel out time amongst tasks, but also to give programmers the basis they need to design the secure, reliable systems of the 21st century.
In the 1980s and early 1990s, embedded devices were much simpler than they are today. The cell phone was simply a telephone, with a basic function set implemented via simple software. In automobiles, the most advanced electronics were often in the radio. Even after more tasks were computerized, they were broken up across several processors, each responsible for a single function, such as automatic transmission control or anti-lock braking. A "home gateway" was nothing more complicated than a modem. Correspondingly, the RTOS conceived in that time are simple, combining a hardware abstraction layer with a thread scheduler and set of primitives. They contain no facilities to assist developers in designing complex systems.
The problems of today are much different. Imagine that you are designing a next-generation cell phone, which adds to the basic phone an address book, a calendar, and an Internet browser with the ability to download custom Java applets. How would you keep hackers from breaking into the phone and disrupting its service or stealing the customer's personal information? How would you keep the untrusted Java applets that customers run on the phone from violating security? Similarly, imagine that you are programming a single microcontroller to handle all of the automated tasks in an automobile -- internal functions, heads-up display, navigation system, drive-by-wire and Internet access. How do you ensure that a bug in the heads-up display system does not disrupt drive-by-wire and cause an accident? How do you ensure that no one can break into the car via its network access and cause an accident that way? Lastly, consider the modern home gateway, with its combination of a DSL link, wired and wireless switching, and phone service. How do you prevent unauthorized access to the gateway's software and data from the Internet? How do you isolate damage from a bug in one component so that other components continue functioning normally?
The new complexity of embedded devices has brought these issues from their traditional purview in high-end applications to even the smallest products. In addition, the highly competitive nature of the industry has magnified their importance, since the slim profit margins for consumer devices mean that the slightest difference in reliability can determine market success or failure. Even more importantly, many devices literally have their users' lives in their hands. With the array of medical devices on the market, along with consumer systems such as drive-by-wire and anti-lock brakes in automobiles, more embedded devices than ever before have failure modes that are fatal to their users.
The final responsibility for resolving these issues lies with the system designer. However, there are features of the underlying OS that are critical to accomplishing this goal. These features are all related to one core concept -- partitioning, which is the creation of multiple virtual machines on the same physical machine, such that the virtual machines are completely separated, except where the system designer has allowed them to interact. The ability of the RTOS to partition components such that they are entirely unable to affect one another without permission is the crucial building block that a system designer needs in order to:
1) Isolate the part of the system that is exposed to the network, so that even if a rogue agent is able to insert an arbitrary piece of code into that component, they are not able to crash/deny service to the system or its components, or steal data from the system.
2) Isolate critical parts of the system from each other, so that a bug in one component only affects that component, and does not interfere with other components.
3) Isolate the kernel from the user components of the system, so that a user bug cannot crash the kernel, and so that when a user bug does crash a component, corrective action can be taken without system restart.
Enabling an RTOS to accomplish this partitioning requires the addition of several features that act together to defeat attempts to break down the wall between components. Several of these features are found in many of the RTOS on the market today:
- Real-time scheduling. This is obviously the defining feature of an RTOS. It allows the CPU time to be partitioned such that tasks receive hard guarantees about scheduling.
- Memory protection. This is the most obvious feature that helps isolate tasks from one another. It allows the system to be separated into address spaces, each of which is unable to access the memory of the kernel or other address spaces, which means that tasks in these address spaces cannot corrupt their code or data. Without memory protection, there is no intra-system protection from programming errors, or from the introduction of malicious code into the system.
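To make the memory-protection idea concrete, here is a minimal sketch in C of the check that protection hardware performs on every access. It is a software model only: real systems enforce this in the MMU or MPU, and the names used here (region_t, address_space_t, access_allowed) are hypothetical and not taken from any particular RTOS.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical model: one contiguous range of memory granted to an address space. */
typedef struct {
    uintptr_t base;      /* first byte of the region */
    size_t    size;      /* length of the region in bytes */
    bool      writable;  /* false means read-only */
} region_t;

/* Hypothetical model: an address space is simply the set of regions it may touch. */
typedef struct {
    const region_t *regions;
    size_t          count;
} address_space_t;

/* Returns true only if [addr, addr + len) lies entirely inside a single
 * region of `as` and, for writes, that region is writable. Anything else
 * (kernel memory, another address space, a straddling access) is refused. */
bool access_allowed(const address_space_t *as, uintptr_t addr,
                    size_t len, bool is_write)
{
    assert(as != NULL);
    for (size_t i = 0; i < as->count; i++) {
        const region_t *r = &as->regions[i];
        if (addr < r->base)
            continue;                      /* starts below this region */
        uintptr_t off = addr - r->base;
        if (off >= r->size || len > r->size - off)
            continue;                      /* starts or ends past the region */
        if (is_write && !r->writable)
            continue;                      /* write to read-only memory */
        return true;
    }
    return false;
}
```

The bounds test is written as `len > r->size - off` rather than `addr + len > base + size` so that a hostile, very large length cannot wrap around and slip past the check.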
However, these two features are not sufficient to accomplish true separation of system components. For example, consider the Unix "fork bomb" that simply creates copies of the current task ad infinitum. All of the memory in the system and the slots in the task table are consumed in the creation of these tasks, and all of the CPU time is consumed as they run. Most RTOS similarly permit a task to hog CPU time and system resources, denying service to other tasks in the system. However, some RTOS, such as Green Hills' INTEGRITY, implement the additional functionality needed to prevent this:
- Guaranteed resource availability in the time domain. Many RTOS implement a simple priority-based scheduling scheme, where the tasks at the highest priority level share the CPU equally, and where a task can create another task of equal or lesser priority. However, one task can create confederates at the same priority level and so increase its share of the CPU, thereby starving other tasks and disrupting the performance of their duties. In order to guarantee the availability of the CPU, the RTOS must also support weighting, such that each task has an entitlement to a certain percentage of the CPU and such that a task must surrender a portion of this entitlement if it wishes to create another task.
- Guaranteed resource availability in the space domain. Whenever there is a shared resource between address spaces, one address space can potentially consume it completely and deny access to the other address spaces. In order to eliminate this possibility, each address space must have its own pool of resources. The system designer must assign each address space a pool of physical pages at boot time for the use of creating kernel objects, creating mapping tables, and facilitating any other memory requirement, and ensure that only memory from this pool is needed to perform these operations. Consideration of this property must permeate the kernel's design from its very beginnings in order to avoid introducing hidden resource sharing that could cause problems later.
- No locking or disabling of interrupts in the kernel. Many RTOS use these to synchronize access to critical data structures. However, locks can create hidden dependencies between tasks that cause unpredictable results in the field, and disabling interrupts can disrupt the system's real-time guarantees. Instead, data structures must be protected via a scheme where kernel calls clean up their state and transfer control within a deterministic amount of time if a task switch is needed.
Once this code partitioning is achieved, the system designer can set up message-based interfaces between the address spaces that tightly control data flow between them, and can build a framework within which address spaces can provide services to each other but still be sufficiently isolated to prevent the effects of bugs and security violations from flowing across the address space boundary. In addition, because the kernel is isolated as well, it can be used as a failsafe agent to detect failures and restart portions of the system in a short amount of time.
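A message-based interface of this kind might look like the following C sketch: a bounded mailbox that copies fixed-size messages between partitions instead of sharing pointers. The names and sizes (mailbox_t, MSG_MAX_PAYLOAD, MAILBOX_SLOTS) are assumptions made for illustration. A real kernel would also handle blocking, access rights and concurrency, all of which are omitted here.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

#define MSG_MAX_PAYLOAD 32  /* assumed upper bound on message size */
#define MAILBOX_SLOTS   4   /* assumed fixed queue depth */

typedef struct {
    size_t        len;
    unsigned char payload[MSG_MAX_PAYLOAD];
} message_t;

/* Fixed-capacity ring buffer; its storage is sized up front and never grows. */
typedef struct {
    message_t slots[MAILBOX_SLOTS];
    size_t    head;   /* index of the oldest queued message */
    size_t    count;  /* number of queued messages */
} mailbox_t;

/* Copy a message into the mailbox. Oversized messages and sends to a
 * full mailbox are rejected, so a misbehaving sender cannot overflow
 * the receiver's buffers or consume unbounded memory. */
bool mailbox_send(mailbox_t *mb, const void *data, size_t len)
{
    assert(mb != NULL);
    if (data == NULL || len > MSG_MAX_PAYLOAD)
        return false;
    if (mb->count == MAILBOX_SLOTS)
        return false;
    message_t *m = &mb->slots[(mb->head + mb->count) % MAILBOX_SLOTS];
    memcpy(m->payload, data, len);
    m->len = len;
    mb->count++;
    return true;
}

/* Copy the oldest message out to the receiver; false if none is queued. */
bool mailbox_receive(mailbox_t *mb, message_t *out)
{
    assert(mb != NULL && out != NULL);
    if (mb->count == 0)
        return false;
    *out = mb->slots[mb->head];
    mb->head = (mb->head + 1) % MAILBOX_SLOTS;
    mb->count--;
    return true;
}
```

Because data crosses the boundary only by copy and only within fixed limits, the receiver never dereferences a pointer supplied by the sender, and the worst a faulty sender can do is fill its own mailbox.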
In conclusion, as embedded systems become both more complex and more accessible to untrusted parties, security and reliability issues are becoming dominant for even the smallest devices. Only with the proper RTOS support can embedded designers produce systems that will satisfy the requirements of the future.


