Doing design and debug on real-time distributed applications

Bob Kindel, Real-Time Innovations

1/31/2008 5:15 AM EST

Real-time system designers and embedded software developers are very familiar with the tools and techniques for designing, developing and debugging standalone or loosely coupled embedded systems. UML may be used at the design stage, an IDE during development and debuggers and logic analyzers (amongst other tools) at the integration and debug phases.

However, as connectivity between embedded systems becomes the norm, what used to be a few connected nodes, with clear functional separation between the applications on each node, is now often tens or hundreds of nodes with logical applications spread across them.

In fact, such distributed systems are becoming increasingly heterogeneous in terms of both operating systems and executing processors, and tight connectivity between real-time and enterprise systems is now common.

This article will identify the issues of real-time distributed system development and discuss how development platforms and tools have to evolve to address this challenging new environment.

The idea of a 'platform' for development has long pervaded the real-time embedded design space as a means to define the application development environment separately from the underlying (and often very complex) real-time hardware, protocol stacks and device drivers.

Much as the OS evolved to provide the fundamental building blocks of standalone system-development platforms, real-time middleware has evolved to address the distributed-systems development challenges of real-time network performance, scalability and heterogeneous processor and operating system support.

And as has already happened in the evolution of the standard real-time operating system, new tools are becoming available to support development, debug and maintenance of the target environment: in this case, real-time applications in large distributed systems.

The Distributed-System Development Platform
From the individual application developer's perspective, there are three basic capabilities which must be provided by an application development platform when a logical application spans multiple networked computers:

1. Communication between threads of execution
2. Synchronization of events
3. Controlled latency and efficient use of the network resources

Communication and synchronization are fairly obvious distributed platform service requirements and are analogous to the services provided by an OS. However, for distributed applications they have to run transparently across a network infrastructure of heterogeneous operating systems and processors, with all that implies in terms of byte ordering and data representation formats.
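As a minimal sketch of the byte-ordering concern, the following Python serializes a sample with an explicitly declared wire layout so that every node decodes the same bytes regardless of its native endianness. The field layout and the function names are assumptions made for illustration; this is not the encoding used by DDS or any particular middleware.

```python
import struct

# Hypothetical wire layout (an assumption for illustration):
# uint32 sensor id, uint64 timestamp in microseconds, float64 value.
# "!" selects network (big-endian) byte order with no padding.
WIRE_FORMAT = "!IQd"

def encode_sample(sensor_id: int, timestamp_us: int, value: float) -> bytes:
    """Serialize a sample identically on any host, whatever its native endianness."""
    return struct.pack(WIRE_FORMAT, sensor_id, timestamp_us, value)

def decode_sample(payload: bytes) -> tuple:
    """Inverse of encode_sample; raises struct.error on a wrong-sized payload."""
    return struct.unpack(WIRE_FORMAT, payload)

payload = encode_sample(7, 1_201_774_500_000_000, 21.5)
```

In practice the middleware generates this kind of marshalling code for you, which is exactly why the application developer does not have to think about it.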

The platform should ideally use a mechanism that does not require the developer to have an explicit understanding of the location of the intended receiver of a message or synchronizing thread, so that the network can be treated as a single target system from an application development perspective.
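To illustrate what location transparency means for application code, here is a toy, in-process publish-subscribe bus. The `TopicBus` class and the topic name are hypothetical; a real middleware would perform discovery and network delivery behind the same kind of interface.

```python
from collections import defaultdict
from typing import Any, Callable

class TopicBus:
    """In-process stand-in for a publish-subscribe middleware (hypothetical)."""

    def __init__(self) -> None:
        self._subscribers = defaultdict(list)

    def subscribe(self, topic: str, callback: Callable[[Any], None]) -> None:
        self._subscribers[topic].append(callback)

    def publish(self, topic: str, sample: Any) -> int:
        """Deliver to every subscriber of the topic; return how many received it."""
        callbacks = self._subscribers.get(topic, [])
        for callback in callbacks:
            callback(sample)
        return len(callbacks)

bus = TopicBus()
received = []
bus.subscribe("engine/temperature", received.append)
delivered = bus.publish("engine/temperature", 88.5)
```

The publisher names only the topic, never the receiver or its node, so subscribers can be moved or added without touching the sending code.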

Typically a user will use a commercial or home-grown middleware to provide these key capabilities. There are several middleware solutions which support this approach, such as JMS (Java Message Service) and DDS (Data Distribution Service), the latter a standard from the Object Management Group (OMG).

Figure 1. DDS provides a framework for providing controlled latency and efficient use of target network resources.

But only solutions such as DDS (Figure 1, above) explicitly address the third point: controlled latency and efficient use of (target) network resources, which is a critical issue in real-time applications. DDS provides messaging and synchronization similar to JMS, but additionally incorporates a mechanism called Quality of Service (QoS).

QoS brings to the application level the means to explicitly define the level of service (priority, performance, reliability etc) required between an originator of a message or synchronization request, and the recipient.

DDS treats the target network somewhat like a state machine, recognizing that real-time systems are data driven and it's the arrival, movement, transition and consumption of data that fundamentally defines the operation of a real-time system.

Some data is critical and needs to be obtained and processed within controlled or fixed latencies, most especially across the network. Other data needs to be persisted for defined periods of time so that it can be used in computation; still other data may need to be reliably delivered but is less time critical. QoS facilitates all these requirements and more.
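The persistence requirement can be pictured as a bounded cache on the sending side. The sketch below keeps the last few samples for a limited lifespan so that a reader joining late can still catch up. `HistoryCache`, its parameters and the explicit `now` argument are assumptions chosen to keep the example deterministic; they are not a middleware API.

```python
from collections import deque

class HistoryCache:
    """Keeps recent samples so a late-joining reader can catch up (hypothetical sketch)."""

    def __init__(self, depth: int, lifespan_s: float) -> None:
        self._samples = deque(maxlen=depth)   # oldest samples fall off automatically
        self._lifespan_s = lifespan_s

    def write(self, value, now: float) -> None:
        self._samples.append((now, value))

    def replay(self, now: float) -> list:
        """Return stored values that are still within their lifespan, oldest first."""
        return [v for (t, v) in self._samples if now - t <= self._lifespan_s]

cache = HistoryCache(depth=3, lifespan_s=10.0)
for t, value in [(0.0, "a"), (4.0, "b"), (8.0, "c"), (9.0, "d")]:
    cache.write(value, now=t)
late_joiner_view = cache.replay(now=15.0)
```

Here the depth limit discards the oldest sample and the lifespan limit hides the one written at t=4.0, so a reader arriving at t=15.0 sees only the two most recent values.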

Perhaps the greatest advantage of using middleware isn't often appreciated until late in the application development process: defining interfaces in a rich middleware format makes it much easier to integrate, debug and maintain a system. Good middleware allows you to completely specify the data interaction through quality of service, which forms a "contract" for the application.

DDS, for example, allows a data source to specify not only the data type but also whether the data is sent with a "send once" or "retry until" semantic, how big a history to store for late arriving receivers, the priority of this source as compared to others, the minimum rate at which the data will be sent, as well as many, many more possibilities.

By setting these explicitly, many of the soft issues that creep up in integration can be addressed quickly by matching the promised behavior to the requested behavior. DDS middleware will even provide warnings at runtime when contracts aren't met.
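The matching of promised to requested behavior can be sketched as a simple compatibility check. The example below models only three policies and follows the usual offered-versus-requested rules: the offered reliability must be at least as strong as requested, the ownership kinds must be equal, and the offered deadline period must not be longer than the requested one. The `Qos` class and `qos_mismatches` function are hypothetical names, not part of any DDS API.

```python
from dataclasses import dataclass

# Ordering used for the reliability comparison: a higher number is a stronger offer.
RELIABILITY_RANK = {"BEST_EFFORT": 0, "RELIABLE": 1}

@dataclass(frozen=True)
class Qos:
    reliability: str      # "BEST_EFFORT" or "RELIABLE"
    ownership: str        # "SHARED" or "EXCLUSIVE"
    deadline_s: float     # maximum period between samples, in seconds

def qos_mismatches(offered: Qos, requested: Qos) -> list:
    """Return the names of policies where the writer's offer fails the reader's request."""
    problems = []
    if RELIABILITY_RANK[offered.reliability] < RELIABILITY_RANK[requested.reliability]:
        problems.append("reliability")
    if offered.ownership != requested.ownership:
        problems.append("ownership")
    if offered.deadline_s > requested.deadline_s:
        problems.append("deadline")
    return problems

writer = Qos(reliability="RELIABLE", ownership="SHARED", deadline_s=0.1)
reader = Qos(reliability="RELIABLE", ownership="EXCLUSIVE", deadline_s=0.5)
mismatches = qos_mismatches(writer, reader)
```

A check of this kind is what lets a tool or the middleware itself report, by policy name, why a given writer and reader never connected.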

The Distributed System Tools Challenges
A development platform isn't complete until it has the tools to support the environment throughout the application lifecycle. Ask any support or sustaining engineer and they will tell you that they need three things: good documentation, great tools, and code written to expose the state and event parameters as easily as possible.

Provided that a clear interface definition language between the networked application nodes is used, current toolchains that operate on a single node are still quite useful for running down memory, code-correctness and performance problems and, in some cases, can be used for white-box testing.

The new challenge for developers is the isolation, identification and correction of the problems that are exhibited at the integration stage, when individual distributed subcomponents are connected and the networked subcomponents start, for the first time, to execute as a large integrated application.

Most engineers are familiar with debugging within a single-board environment, and will have developed a high degree of debug competence in fixing "hard faults", i.e. faults that halt or crash the process.

These are relatively easy to debug because you can normally work backwards from the state of the crash or, if you are really lucky, you can get it to crash in a debugger and you are home free!

The nastiest hard faults to debug are normally multithreading related, so it should come as no surprise that as we move to larger, more complex distributed systems you will see more and more of these types of faults; every node will have its own thread(s) of execution, potentially working at the same time on the same data received from across the distributed system architecture.

Distributed systems are also much more likely to be subject to numerous types of "soft faults". In these cases, no application crashes, but the warning lights are flashing and the distributed application either performs poorly or not at all.

There are numerous types of soft faults, but many of them come down to the synchronization of data generation and processing across many machines. One example is the effect of a single dropped message: if that message is one sample of a periodic data update it might not be a big deal, but if it is a transitional event or command, you could suddenly have the system in an unexpected state.

Moreover, you may not be able to detect this until some time after the initial fault occurred, leading to a debugging nightmare. This is just one type of soft fault. Many others occur regularly: high latencies (either sustained or periodic) that cause control loops to lose stability, self-reinforcing data dropouts, unexpectedly blocking applications, systems that work in the lab but fail when scaled up, and data mismatches between what is provided and what is expected.
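One common way to catch the dropped-message case at the moment it happens, rather than long afterwards, is to tag each message with a per-source sequence number and check for gaps on receipt. The sketch below assumes such numbering exists; `GapDetector` and the source name are hypothetical.

```python
class GapDetector:
    """Tracks per-source sequence numbers and reports missing ones (hypothetical)."""

    def __init__(self) -> None:
        self._next_expected = {}

    def observe(self, source: str, seq: int) -> list:
        """Return the sequence numbers skipped since the last sample from source."""
        expected = self._next_expected.get(source, seq)
        missing = list(range(expected, seq)) if seq > expected else []
        # Never move the expectation backwards on duplicates or reordering.
        self._next_expected[source] = max(expected, seq + 1)
        return missing

detector = GapDetector()
gaps = [detector.observe("valve_cmd", seq) for seq in (1, 2, 5, 6)]
```

The third observation reports sequence numbers 3 and 4 as missing, which pins the fault to the point of loss instead of to the later symptom.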

Thus for distributed systems, it is vital to be able to get at the state and event information without stopping or significantly slowing the system.
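A minimal sketch of this idea is a bounded in-memory trace: the application records state and events at negligible cost, old entries are overwritten rather than blocking the writer, and a tool takes a snapshot whenever it needs one. The `EventRecorder` class below is a hypothetical illustration, not a feature of any product named in this article.

```python
import threading
from collections import deque

class EventRecorder:
    """Bounded in-memory trace: recording is cheap and old events are overwritten."""

    def __init__(self, capacity: int) -> None:
        self._events = deque(maxlen=capacity)
        self._lock = threading.Lock()
        self.dropped = 0   # count of events overwritten to make room for newer ones

    def record(self, timestamp: float, name: str, value) -> None:
        with self._lock:
            if len(self._events) == self._events.maxlen:
                self.dropped += 1
            self._events.append((timestamp, name, value))

    def snapshot(self) -> list:
        """Copy the current trace; the lock is held only for the duration of the copy."""
        with self._lock:
            return list(self._events)

recorder = EventRecorder(capacity=2)
recorder.record(0.001, "state", "INIT")
recorder.record(0.002, "state", "RUNNING")
recorder.record(0.003, "queue_depth", 17)
trace = recorder.snapshot()
```

Keeping an overwrite counter alongside the trace tells the reader of the snapshot how much history was lost, which matters when interpreting the data afterwards.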

New Tools for Distributed App Development
Starting with the basics: the first thing that you need is a tool that allows you to generate common data types across all your boards and a process that keeps them in synchronization. If you are using middleware you will normally write your data types in a meta-language (IDL, XML, XDR) and autogenerate the code that handles the data types.

Some systems will allow you to create new types on the fly, but beware that this is potentially a source of error since it will be much harder to verify the usage contract on data if the programmer doesn't know its details.
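One simple way to check that nodes agree on a data type is to compare a fingerprint of the type definition. The sketch below hashes a type name and its ordered fields; the function name, the canonical string format and the example type are all assumptions for illustration, and real middleware uses its own type-matching mechanisms.

```python
import hashlib

def type_fingerprint(type_name: str, fields: list) -> str:
    """Hash a type's name and ordered (field, type) pairs into a short identifier."""
    canonical = type_name + ";" + ";".join(f"{name}:{ftype}" for name, ftype in fields)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()[:16]

node_a = type_fingerprint("SensorSample", [("id", "uint32"), ("value", "float64")])
node_b = type_fingerprint("SensorSample", [("id", "uint32"), ("value", "float32")])
types_in_sync = node_a == node_b
```

Here the two nodes disagree on the width of one field, so the fingerprints differ and the mismatch can be flagged before any data is exchanged.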

Fig 2. Given an IDL file that defines the data types, tools like 'rtiddsgen' can generate code that handles those types. Extensions to rtiddsgen can be used to generate data types that are also compatible with CORBA.

The next tool you need allows you to design the applications and specify the data and QoS requirements. This class of tool should ideally be used to design as many of the applications as possible so that the QoS contract between senders and receivers is met at design time (much easier than debugging and fixing it later).

In an ideal world, this tool should integrate with your normal design methodology. For instance, UML users may wish to consider Sparx UML. This tool has interface description components for middleware such as DDS to make it easier to initially set these up.

Once your applications are deployed you need to make sure that the communications are happening as intended, QoS parameters are set properly and the system is running! One of the first questions you will need to answer at integration is "are these distributed application functions talking properly?".

With an appropriate middleware interrogation tool such as RTI Analyzer, you can determine that the middleware has "hooked up" the two applications and you can make sure that the designers of the two application functions actually met the specification.

Fig 3: RTI Analyzer is a system level debugging tool that finds RTI Data Distribution Service objects in a running system, organizes them, and shows you their communication parameters. Correlating this information with your system design can quickly expose performance and reliability issues.

Such a tool also needs to show you which objects are exchanging data or, more importantly, which are not exchanging data and why not. You can truly appreciate these tools when you have three different subcontractors (or even just free-willed developers) each building part of a distributed application and it comes time to integrate. The root cause of most configuration issues can be found quickly, accurately and with a minimum of debate.

Fig 4: RTI Analyzer showing the QoS mismatch error in 'Ownership' between a DataReader and DataWriter.

Three use-cases for debugging
You now have great up-front design, good interfaces that people are following and yet it still isn't working. This is where distributed system-wide state and event analysis becomes key. Typically there are three use cases during the debugging:

Use Case #1. Monitoring of overall distributed system health. In this case you might want to see the high-level behavior of most of the applications in the system. Tools such as RTView from SL Corporation allow you to build one or many Control Panel GUIs or Data Report views by listening to data put out by the middleware as well as your application.

By selectively instrumenting key variables in your application, this can be a great first step in isolating system issues and ensuring that your system is running properly. When taking advantage of data-centric middleware implementations such as DDS, tools like RTView can generate displays without detailed information about the source of the data.

Merely knowing that the data exists, in what format it is available (as defined by your data meta-language) and how it is made available (QoS) facilitates rapid assimilation of the information needed for such useful system overview displays.

Typically the applications leveraging this sort of tool have many different data sources, primarily at low time resolution, that need to be combined and displayed together to create a meaningful perspective of the system's health.

Tools like these are often deployed as part of the maintenance environment for the distributed system and as such include easy-to-use GUI builders that allow end-user-oriented displays of system data and health to be generated.
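A system-health view of this kind often reduces to a question such as "which sources have gone quiet?". The sketch below flags any source whose latest sample is older than a multiple of its expected period. The function, the source names and the `slack` factor are assumptions for illustration and are not taken from RTView.

```python
def stale_sources(last_seen: dict, expected_period_s: dict, now: float, slack: float = 2.0) -> list:
    """Return sources whose most recent sample is older than slack * expected period."""
    return sorted(
        name for name, seen_at in last_seen.items()
        if now - seen_at > slack * expected_period_s[name]
    )

last_seen = {"gps": 99.0, "battery": 40.0, "motor_temp": 99.8}
expected_period_s = {"gps": 1.0, "battery": 10.0, "motor_temp": 0.5}
unhealthy = stale_sources(last_seen, expected_period_s, now=100.0)
```

Only the battery source is reported, since it has been silent for 60 seconds against an expected 10-second period.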

Fig 5: RTView provides virtual Instruments for user views of the key distributed data

Use Case #2. Getting into the guts of a faulty application. Once you've isolated which nodes are having a problem with the system health tool, you may need to get more detailed and higher time resolution data from a few selected applications and their interaction across the network. Tools such as RTI Scope provide this functionality by allowing the user to look at the different data streams into and out of an application graphically, in real-time, without pre-configuration.

Think of RTI Scope as an oscilloscope for the data coming out of an application from anywhere in the network, complete with negative time triggering, multiple plot types (vs time, x vs y), derived signals and the ability to save out the data for post processing. RTI Scope still operates at the defined data level, but is designed to capture fewer data sources, in a minimally intrusive manner.

It is ideal for capturing data that runs out of bounds, or is delivered outside of its required throughput or performance objectives. Its full knowledge of the underlying middleware implementation means that it can 'discover' the data sources and recipients and connect to them across the network, leveraging the middleware to pull the data through for local analysis and visualization.
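Negative time triggering, as on a hardware oscilloscope, means keeping a rolling window of recent samples so that the data leading up to a trigger is available once the trigger fires. The sketch below shows the general technique with a hypothetical function and a simple threshold trigger; it is not a description of how RTI Scope is implemented.

```python
from collections import deque

def capture_around_trigger(samples, threshold: float, pre: int, post: int) -> list:
    """Return up to `pre` samples before the first value above threshold, the trigger
    sample itself, and up to `post` samples after it. Empty list if no trigger."""
    history = deque(maxlen=pre)      # rolling window of the most recent samples
    iterator = iter(samples)
    for value in iterator:
        if value > threshold:
            captured = list(history) + [value]
            # Pull at most `post` further samples from the same stream.
            for _, later in zip(range(post), iterator):
                captured.append(later)
            return captured
        history.append(value)
    return []

window = capture_around_trigger([1, 2, 3, 9, 4, 5, 6], threshold=8, pre=2, post=2)
```

The captured window contains the two samples before the out-of-bounds value as well as the two after it, which is usually where the cause of the excursion is visible.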

Fig 6: RTI Scope showing DDS Topic Data plotted against time with an "oscilloscope-like" display.

Use Case #3. Network Analysis. Sometimes the middleware is attempting to perform the service requested of it by the application, but the underlying network implementation itself is not behaving as expected. Perhaps the router is dropping packets, a wireless hop is providing lower bandwidth than needed, a node periodically drops off the network for a second or two, or any one of a number of other problems has occurred.

Drilling down to the wire
At this point you are left with no choice but to drill down to the wire and see what's happening. You reach for your protocol analyzer and it gives you all the UDP or other packet information you need. But it's meaningless unless you can correlate it back up to the application.

Well-constructed distributed middleware includes a standardized on-the-wire protocol; DDS, for example, uses the open standard RTPS (Real-Time Publish Subscribe). As you'd expect, such a platform includes the ability to monitor the wire traffic and pull out the associated middleware packets, dissecting them for correlation back to the application layer. RTI can help here too with a dedicated Protocol Analyzer, capable of providing a real-time display of all "on-the-wire" activity.
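As a small illustration of correlating raw packets with the middleware, the sketch below recognizes an RTPS message at the start of a UDP payload and extracts its header fields. It assumes the 20-byte RTPS 2.x header layout (the four characters "RTPS", a two-byte protocol version, a two-byte vendor id and a 12-byte GUID prefix). The function name and the sample payload are hypothetical, and the payload is synthetic rather than captured traffic.

```python
import struct
from typing import Optional

RTPS_HEADER = struct.Struct("!4sBB2s12s")   # magic, major, minor, vendor id, GUID prefix

def parse_rtps_header(udp_payload: bytes) -> Optional[dict]:
    """Return header fields if the payload starts with an RTPS header, else None."""
    if len(udp_payload) < RTPS_HEADER.size:
        return None
    magic, major, minor, vendor, prefix = RTPS_HEADER.unpack_from(udp_payload)
    if magic != b"RTPS":
        return None
    return {
        "version": (major, minor),
        "vendor_id": vendor.hex(),
        "guid_prefix": prefix.hex(),
    }

# Synthetic payload built for the example, not a captured packet.
sample_payload = b"RTPS" + bytes([2, 1]) + bytes([0x01, 0x01]) + bytes(range(12))
header = parse_rtps_header(sample_payload)
```

The GUID prefix identifies the sending participant, which is the hook a protocol analyzer needs to tie a packet on the wire back to a specific application.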

Fig 7. RTI Protocol Analyzer allows you to see the 'on wire' traffic.

As we have seen, the development of real-time applications operating across a large and complex network requires an innovative approach to deliver an effective tools strategy in the face of the multiple challenges posed by such a distributed environment. Without such a coherent and integrated strategy, both system performance and project development times can be severely compromised.

The fundamental requirements for an effective tools strategy are essentially two-fold: the ability to define and support a consistent and predictable real-time environment across heterogeneous operating systems, processors and network topologies; and a fully integrated toolchain that provides comprehensive debug information at each stage (design, code, integration, debugging and maintenance) across the distributed system architecture on which the application runs.

Dr. Bob Kindel is Vice President of Engineering Services at Real-Time Innovations, Inc. He joined RTI in 2000 as an applications engineer with a strong background in control systems and distributed network engineering. He is an expert at the design and debugging of complex distributed applications and spent two years focused on embedded and network-system debugging. His past consulting work has included customer training, system design and integration debugging. He can be reached at Bob.Kindel@rti.com.

