Today's embedded network infrastructure controls crucial services: telephone switching systems connecting data and voice for an entire region of a country, power plants distributing power over geographical areas, and satellite monitoring systems for controlling fleets of weather satellites. Reliability and 100-percent availability must be the highest priorities for such systems. But the system software controlling them is so complex and has so many possible configurations that during the development stage, it is almost impossible to predict real-world conditions and events that might cause a system to malfunction.
Developers try to anticipate problems that may occur once a system is deployed, and incorporate procedures to deal with them. Before a system ships, most bugs have been identified and fixed. But as embedded systems become more complex, developers face the difficult decision of how much testing is feasible when weighed against economic factors such as time-to-market, where further testing means delays that can cost market share.
Debugging in the field carries risks that debugging during development does not, and it requires different methodologies. Engineers must be able to view the system and gather complex data from a remote location over a potentially unreliable network, and then use that data to diagnose the problem, devise a solution, and incorporate the solution into the fielded system with certainty that none of these actions will bring down the system or cause an unacceptable interruption of service. For this reason, it is risky to use traditional debugging techniques, such as breakpoints, for in-field debugging: breakpoints can cause a number of undesirable effects, including a system freeze or reboot, which would be disastrous in a system where 100-percent availability is crucial.
Tracepoints are markers that identify points in the program where bugs are suspected; they do not halt or slow the target, and they do not communicate with the host system while collecting data. As additional protection, any strategy for diagnosing fielded systems must also include access protection to ensure that engineers cannot inadvertently perform intrusive actions that interfere with system operation.
Breakpoints are put in a program to stop execution at a point where the engineer has reason to believe a problem might exist, or where data of interest is to be viewed and analyzed. Setting a breakpoint requires the debugger to modify the machine code: the instruction at the point chosen by the engineer is replaced with a new instruction that detours the program to a holding place in memory.
When the breakpoint is hit, all code execution stops. Interrupting the flow of instruction execution can have significant side effects. For example, it might prevent the system from responding to an event it must handle in order to continue operating, such as the setting of a timer, the processing of an interrupt, or the arrival of data that needs to be processed. The system cannot process data if it is paused while the engineer examines values in memory.
In the lab, under controlled test conditions, breakpoints are not a problem, but in the field, where real-time operation is crucial, they can be. Debugging in the field requires techniques that gather information without stopping the system.
There are several ways in which a breakpoint can cause a system to stall. One example: causing a watchdog timer to falsely indicate a system failure. A watchdog timer is a hardware safeguard that is used to determine if the system has failed. If the timer is reset on schedule, the system knows that everything is working. But if the application fails to reset the timer before it expires, the watchdog interprets this as a failure and the system typically reboots. This is what it is supposed to do. However, a breakpoint can interfere with the instructions that reset the timer, allowing it to expire and causing the system to reboot. Watchdog reset cycles can be set for mere fractions of a second, and any interruption is likely to trigger a reboot.
Another way a breakpoint can cause system failure is by stopping the flow of communications in the system. Modern systems often do not run on a single processor, but instead use distributed architectures composed of multiple CPUs that constantly exchange messages. Breakpoints can delay the flow of data, either internally between components or from one computer to another.
Internally, when a program does not respond to messages from other system components, the system must assume that something has gone wrong and take corrective action, usually a reboot. In some cases, the entire system might have to be restarted in order to reboot the unresponsive processor.
When breakpoints are used with a network that is slow or undependable, the result can be disastrous. If the engineer is monitoring a fielded system over a network connection and halts a task with a breakpoint in order to determine the value of a variable, the system can freeze if there is a network failure while the task is halted.
Like a breakpoint, a tracepoint detours instruction execution, but it introduces a delay measured in tens of clock cycles, billionths of a second, rather than the tens of seconds a breakpoint can consume, and it does not stop program execution. Ten or 20 billionths of a second are usually insignificant in terms of program operation. In those few billionths of a second, the tracepoint gathers data values from registers and/or memory, stores them, and returns control to the program, which continues execution without ever stopping. If a program is so timing-critical that a delay of a few tens of nanoseconds matters, even a tracepoint could cause a problem, but such programs are rare, so this is highly unlikely.
Tracepoint software includes a rich definition language with the ability to set conditional tracepoints, enable or disable other tracepoints, and perform call stack traces. Conditional tracepoints collect data only if a line of code is executed and the associated condition evaluates to true. A conditional tracepoint that is triggered but whose condition evaluates to false consumes only a small number of clock cycles.
For example, a conditional tracepoint on a Compaq AlphaStation DS10 with a 617-MHz Alpha processor and 512 megabytes of RAM that is running the Tru64 UNIX 5.1 operating system consumes only 15 nanoseconds when the tracepoint's condition evaluates to false. In addition, by combining conditional tracepoints with the ability to enable or disable other tracepoints, engineers can create a suite of tracepoints that starts collecting data only when a condition evaluates to true and stops collecting when a different condition is true. Using this method the amount of data collected is drastically reduced, thereby minimizing the time needed to sift through and analyze the diagnostic information.
Call stack traces provide engineers with the ability to track down elusive software bugs that occur sporadically in functions that can be called from various code paths. The engineer can simply set a tracepoint at the location where the error becomes noticeable, program the tracepoint to perform a call stack trace, and then let the system run for a while before analyzing the data.
But it isn't enough to simply choose debugging tools that include tracepoint capability. Features that support and enhance tracepoint technology are equally important. These include passive and active mode analysis, event analysis, self-monitoring tracepoints, a persistent buffer, configurable buffers, and permanent tracepoints.
Choose software that has a passive mode for viewing the system and using tracepoints to collect data, but which protects the integrity of the system by preventing the person examining it from modifying it in any way. In passive mode, engineers cannot halt the target, set breakpoints, write to registers or memory, send signals, or perform any other action that could adversely affect the system's ability to function. Attempts to perform any such action result in an error message. Passive mode limits the engineer to gathering the information needed to determine what the problem is.
An event analyzer is a valuable tool that displays the occurrence of various events in a system on a timeline, and can be used while in passive mode to determine what the system was doing when a problem occurred. Events appear as icons on a graphical representation of whatever timeline the programmer wishes to select. Timer interrupts, arrival of data packets, completion of message transfers, context switches, and activation of various tasks are among the events the analyzer will track.
Choose software that implements self-monitoring tracepoints that automatically disable themselves if they are taking too long to collect data or are being triggered too often. Once a tracepoint has disabled itself, it no longer consumes CPU resources.
Also look for a solution with a persistent buffer that collects and stores data on the target until it can be uploaded to the host, whether or not an engineer is actually online. The buffer should be in non-volatile memory to ensure that data remains available across host sessions and system restarts.
You should be able to set the tracepoints and "walk away" while the tracepoints collect information on the running system. This ensures that when you log back in to the system from the host, the data will still be there whether it's been one day or one week, regardless of whether the system has been interrupted for any reason during the interim. The software allows you to configure the size and policy of the debugging buffer based on the unique requirements of the system.
Infinite trace is an advanced buffer strategy that continuously transfers data from the tracepoint buffer to another local machine equipped with large memory resources. Infinite trace technology should be combined with a dynamic buffer upload, which configures the storage machine to automatically upload to the host at a specified interval.
Software should provide permanent tracepoints that continually supply diagnostic data about the running system, so the system can be monitored from the host at all times. Also, tracepoints should not interact with the host while collecting data; this guarantees that the target system will never get bogged down waiting for a response from the host over an unreliable network connection. It should be possible to set a single tracepoint in the debugger that collects an unlimited number of values. Once the collected data is uploaded to the host, it should be tightly integrated with the IDE to facilitate the use of tools such as the debugger and the event analyzer.