Distributed systems come in every form imaginable, from SoCs to massive fault-tolerant clusters. Regardless of a distributed system's scope and purpose, one thing holds true: software programs running on the system's various processors must exchange messages to accomplish some, or even all, of their tasks. In order to produce the correct results, these interactions must typically occur in a defined order and within strict time constraints.
Conventional debug tools are ill-equipped to provide insight into such interactions. By halting only the program being debugged and not the whole system, a source debugger can change the order in which the system's operations occur. This phenomenon -- often called the probe effect -- can temporarily mask race conditions and introduce "errors" that occur only when debugging is performed.
High-quality tools for source debugging, execution profiling, and memory analysis are just as important in a distributed system as in a conventional uni-processor system. But they're only useful once you determine which component, or set of components, to fix. To do that, one must first understand how the system behaves as a whole.
One step is to determine which nodes are exchanging messages, and in what order. That means identifying which processes or threads are involved in each inter-node transaction and tracing the execution path from one node to another, even when the nodes are based on different processor architectures. Without such knowledge, the causes of timing conflicts and other performance bottlenecks may appear to be located in one part of the system when, in fact, they are located somewhere else.
What's needed is a tool that can consolidate multi-node activity into a single context: a system profiler. Just as a debugger lets you trace the flow of control from one thread to another within a single program, this visualization tool lets you "see" how the various components in a system interact, whether they all run on the same processor or are spread across many heterogeneous processors.
It is, in effect, a software logic analyzer: if something goes wrong, the tool helps pinpoint when the event occurred, which software components were involved, what those components were doing, and, importantly, how to interpret the event.
To provide insight into a distributed system, a system profiler must first provide an accurate ordering of system events -- a sometimes tricky proposition. For instance, if two events occur, one on node A and the other on node B, how does the tool ascertain which occurred first? If all nodes in a distributed system were governed by a global clock, determining the answer would be fairly straightforward. In reality, however, each processor in a distributed system is typically governed by its own clock, making a system-wide ordering of events difficult to establish.
To address this problem, all clocks in the system could acquire their initial state from the same master clock. That way, the combined events of all nodes could, in theory, be time-stamped and ordered for analysis.
Unfortunately, the variable latencies of a typical system bus can cause each clock to receive its initial setting at a different time. Even if the latency problem is addressed, the various clocks, once set, can still drift apart.
Since information based on multiple physical clocks is unreliable, system profiling can instead rely on "logical clocks" to establish a consistent ordering of events.
To understand this concept, it's important to remember that, when diagnosing distributed behavior, you don't have to be concerned with every event on every node. The most important events for diagnosing problems occur when nodes are exchanging messages. With a properly implemented messaging scheme, these messages can also act as synchronization points, allowing the creation of logical clocks that, instead of keeping physical time, simply provide a correct ordering of inter-node events.
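One well-known scheme of this kind is the Lamport logical clock. The sketch below is illustrative C, not tied to any particular RTOS: each node keeps a simple counter, stamps outgoing messages with it, and advances it past any stamp it receives, so that a receive is always ordered after the corresponding send -- no synchronized physical clocks required.

```c
#include <stdint.h>

/* Each node keeps its own logical clock -- just a counter. */
typedef struct {
    uint64_t ticks;
} logical_clock_t;

/* Advance the clock for a purely local event. */
static uint64_t clock_local_event(logical_clock_t *c) {
    return ++c->ticks;
}

/* Stamp an outgoing message with the sender's clock value. */
static uint64_t clock_on_send(logical_clock_t *c) {
    return ++c->ticks;
}

/* On receive, merge the sender's stamp into the local clock. The receive
 * event is thereby guaranteed to be ordered after the matching send. */
static uint64_t clock_on_receive(logical_clock_t *c, uint64_t msg_stamp) {
    c->ticks = (msg_stamp > c->ticks ? msg_stamp : c->ticks) + 1;
    return c->ticks;
}
```

Two events can then be ordered by comparing their stamps; ties between unrelated events on different nodes can be broken arbitrarily (by node ID, say), which is all a profiler needs to lay out a coherent timeline.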
For instance, consider synchronous message passing, which our RTOS uses as its fundamental means of interprocess communication (IPC). With this form of message passing, a thread that sends a message to another thread will block until the target thread receives the message, processes it, and sends a reply. If any thread becomes ready to receive a message without any messages pending, it will also block until another thread sends it a message. This blocking, which synchronizes the execution of the communicating threads, occurs whether the threads reside on the same node or on different nodes.
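As a concrete illustration, here is a minimal sketch of that pattern using a QNX Neutrino-style API (MsgSend/MsgReceive/MsgReply). The request and reply types are hypothetical, and the exact calls will differ on other RTOSs; the blocking semantics are the point.

```c
#include <sys/neutrino.h>   /* QNX Neutrino-style message passing (illustrative) */

/* Hypothetical request/reply payloads. */
typedef struct { int op; int arg; }     req_t;
typedef struct { int status; int val; } rep_t;

/* Client thread: SEND-blocked, then REPLY-blocked, until the server replies. */
void client(int coid)
{
    req_t req = { .op = 1, .arg = 42 };
    rep_t rep;

    if (MsgSend(coid, &req, sizeof req, &rep, sizeof rep) == -1) {
        /* handle send failure */
    }
}

/* Server thread: RECEIVE-blocked until a message arrives. */
void server(int chid)
{
    req_t req;
    rep_t rep;

    for (;;) {
        int rcvid = MsgReceive(chid, &req, sizeof req, NULL);
        if (rcvid == -1)
            continue;

        rep.status = 0;          /* ... process req, fill in rep ... */
        rep.val    = req.arg;

        MsgReply(rcvid, 0, &rep, sizeof rep);   /* unblocks the client */
    }
}
```

Because the client cannot run again until the reply arrives, the three calls form a natural synchronization point between the two threads -- precisely what a system profiler can exploit.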
Because each synchronous send/receive/reply sequence can be assigned a unique identifier by the operating system, a system profiler can detect when a blocking transaction from one node (a send) has been received by another node (a receive) and then fully serviced (a reply). Using this information, the system profiler can reconstruct an accurate sequence of inter-node events.
Message-passing operations can also be assigned a local timestamp. Consequently, the system profiler can determine the order of thread interactions in two ways. If the interactions are local, the tool can use the fine-grained timestamps assigned to messages and other events. If the interactions span nodes, the tool can determine event ordering with the logical-clock method: it uses the unique identifier assigned to each message transaction and relies on the causal nature of synchronous messaging, whereby a receive can occur only in response to a previous send. The send always happens first, even when the timestamps suggest otherwise.
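A sketch of how those two rules might be combined, using a hypothetical trace-record layout (an instrumented kernel's actual trace format would differ):

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical trace record -- not an actual kernel trace format. */
typedef enum { EV_SEND, EV_RECEIVE, EV_REPLY } msg_event_kind_t;

typedef struct {
    uint32_t         node;      /* node that logged the event            */
    uint64_t         local_ts;  /* timestamp from that node's own clock  */
    uint64_t         txn_id;    /* kernel-assigned send/receive/reply ID */
    msg_event_kind_t kind;
} msg_trace_event_t;

/* Returns true if event a is known to have happened before event b. */
static bool happened_before(const msg_trace_event_t *a, const msg_trace_event_t *b)
{
    /* Same node: the local timestamps are directly comparable. */
    if (a->node == b->node)
        return a->local_ts < b->local_ts;

    /* Different nodes, same transaction: causality decides. A receive can
     * only follow a send, and a reply can only follow a receive --
     * regardless of what the timestamps say. */
    if (a->txn_id == b->txn_id)
        return a->kind < b->kind;   /* EV_SEND < EV_RECEIVE < EV_REPLY */

    /* Otherwise the two events are not directly ordered; the profiler
     * falls back to chaining through intermediate transactions. */
    return false;
}
```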
A good system profiler is non-intrusive: it provides insight without requiring code modifications and has little or no effect on system behavior. Properly implemented, it will let you diagnose a live system without interrupting or degrading the services provided by that system -- a real boon for high-end routers, 911 dispatch systems, and other applications that must remain continuously available.
To achieve this "non-intrusiveness," a system profiler can use a technique called trace analysis. Unlike conventional debug methods that rely on breakpoints and other overhead-intensive techniques, trace analysis uses fast, selective logging of system events, including messages, kernel calls, and interrupts. User-written code doesn't have to be modified, since this event logging can be performed by an instrumented kernel.
Once the number of events inside a buffer reaches a high-water mark, a data capture utility either writes the events to a storage location (e.g. flash memory) for offline analysis or passes them directly to the system profiler for real-time manipulation.
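A minimal sketch of that buffering scheme, with illustrative sizes and field names (the real trace format and capture interface would be kernel-specific):

```c
#include <stddef.h>
#include <stdint.h>

#define BUF_ENTRIES      4096
#define HIGH_WATER_MARK  3072   /* hand off well before the buffer can overflow */

typedef struct {
    uint64_t timestamp;      /* local, fine-grained timestamp                */
    uint32_t event_class;    /* kernel call, interrupt, message, thread, ... */
    uint32_t data;           /* event-specific payload                       */
} trace_entry_t;

typedef struct {
    trace_entry_t entries[BUF_ENTRIES];
    size_t        count;
    /* invoked at the high-water mark: the capture utility either writes the
     * batch to storage (e.g. flash) or streams it to the system profiler */
    void        (*flush)(const trace_entry_t *batch, size_t n);
} trace_buffer_t;

static void trace_log(trace_buffer_t *buf, trace_entry_t ev)
{
    buf->entries[buf->count++] = ev;

    if (buf->count >= HIGH_WATER_MARK) {
        buf->flush(buf->entries, buf->count);
        buf->count = 0;
    }
}
```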
A well-designed instrumented kernel can run at virtually the same speed as the standard kernel. Performance is affected only when events are being collected, but even then, an instrumented kernel can provide a variety of mechanisms to ensure minimal intrusion.
For instance, the kernel can allow the developer to trigger event logging only when certain conditions occur. It can also provide user-definable filters so that only events of interest are collected during a logging session. And, if the kernel is fully pre-emptible, time-critical operations can preempt the event-logging process and, as a result, continue to meet their hard deadlines.
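The sketch below shows, in illustrative form, how such triggering and filtering might look on the logging path; the event classes and function names are hypothetical, not those of any particular kernel.

```c
#include <stdbool.h>
#include <stdint.h>

/* Illustrative event classes a developer might filter on. */
enum { CLASS_KERNEL_CALL = 1u << 0,
       CLASS_INTERRUPT   = 1u << 1,
       CLASS_MESSAGE     = 1u << 2,
       CLASS_THREAD      = 1u << 3 };

typedef struct {
    bool     armed;        /* has the trigger condition occurred yet? */
    uint32_t class_mask;   /* which event classes to keep once armed  */
} trace_filter_t;

/* Consulted for every candidate event; returns true only if the event
 * should actually be written to the trace buffer. */
static bool trace_should_log(const trace_filter_t *f, uint32_t event_class)
{
    if (!f->armed)
        return false;                        /* trigger hasn't fired yet */
    return (f->class_mask & event_class) != 0;
}

/* A trigger -- say, a watchdog noticing a missed deadline -- simply arms
 * the filter, after which only the selected classes are collected. */
static void trace_trigger(trace_filter_t *f, uint32_t classes_of_interest)
{
    f->armed      = true;
    f->class_mask = classes_of_interest;
}
```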
Of course, it's still possible that the overhead of event logging will change how the system behaves. To help the developer determine whether this is occurring, an instrumented kernel should be capable of logging all types of events, including any events generated by an event-logging operation.
Problems in a distributed system can be subtle, sporadic, or both. Thus, an effective system profiling environment allows the developer to trigger event logging for any potential condition, at any level in the system. To achieve this flexibility, the environment must allow any diagnostic tool to act as the trigger for virtually any other tool.
Thus, tool-to-tool integration becomes critical. For instance, if a process, thread, or routine exceeds its defined boundary condition for CPU usage, an application profiler that monitors code execution could trigger a system profiling session to help the developer understand what was happening in the system when the condition occurred. Conversely, the system profiler could itself act as the trigger: if it detects an event sequence typically associated with poor performance (e.g. a process receives several messages within a short timespan), it could launch an application profiler session, allowing the developer to quickly locate which process, and which function within that process, is creating the bottleneck.
As another example, let's say that data from the system profiler suggests a problem is arising from an interaction between three processes, each running on a different node. The system profiler could then be configured to trigger a debug session when the condition recurs.
The debugger would dynamically attach to the three processes and, ideally, present the multiple debug sessions in a single integrated view, making it easy for the developer to trace the execution path from one CPU to another. To achieve this level of integration, the tools must plug into a common framework where information about each tool's capabilities can be shared with all other components.
Even if a distributed system is performing acceptably, it may still be a candidate for system profiling. By visually analyzing system behavior, developers often uncover hidden inefficiencies that, when corrected, allow for surprising increases in performance. A system that is running well but at apparent capacity may, in fact, have the headroom to support significantly more features or services. Thus, a system profiler shouldn't be used only when a distributed system "goes south," but should be employed throughout the design and integration process to ensure the system delivers everything it is capable of doing.