As in-car computers become readily available, consumers will demand that they be robust enough to deliver a myriad of functions with the same level of reliability they have come to expect in other automotive devices. Developing the operating system for in-car computers capable of delivering the flexibility of today's PCs while meeting the automotive industry's stringent demands for longevity and reliability is no small order.
Achieving reliability in automotive applications starts with developing a comprehensive product specification that provides a "stake in the ground" for the features that will be in the product and their intended operation. The specification must establish the constraints, performance expectations and integration requirements for the final product. Because many design issues cannot be anticipated until development is under way, even a great specification will typically capture only 80 percent of the design issues. But it will invariably boost reliability because most problems will be understood before development begins.
One of the most difficult parts of the specification is the design of a user interface that works the way the consumer expects it to. The best way to specify great usability design is to put the design in consumers' hands early and often while a usability specialist observes their actions.
A common and simple method is to ask subjects to simulate using the interface by working with pictures of the device. For later stages of development, it is helpful to use prototype devices (based on previous hardware versions or PC-based emulations) to start verifying designs. Having users accomplish short tasks on these devices in a lab setting goes a long way toward identifying opportunities to improve designs and boost eventual reliability. Sometimes, lab simulations are a challenge to set up, and simple, two-dimensional images of the interface aren't practical, as, for example, with voice recognition systems required for hands-free operation of in-car computers.
Beyond ensuring that the device will function the way consumers expect, developers should consider designing three levels of reliability features into their applications and devices: preventive, corrective and diagnostic.
Preventive features are designed to keep reliability problems from occurring. For example, we used software installation management features to allow automotive manufacturers to specify which (if any) applications can be installed on Windows CE for automotive-based devices in their cars. This gives them the ability to manage the reliability of their onboard devices more effectively. Car makers can set the software installation management controls to allow installation of any application, only Windows CE applications, or only applications approved by that car maker.
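The three policy levels can be expressed as a simple installation gate. The following is an illustrative Python sketch, not the actual Windows CE for Automotive API; the names `InstallPolicy` and `may_install` are invented for the example:

```python
from enum import Enum

class InstallPolicy(Enum):
    """Hypothetical installation-management policies a car maker might set."""
    ALLOW_ANY = 1        # any application may be installed
    PLATFORM_ONLY = 2    # only platform (e.g. Windows CE) applications
    MAKER_APPROVED = 3   # only applications the car maker has approved

def may_install(policy, is_platform_app, is_maker_approved):
    """Return True if an application passes the configured policy."""
    if policy is InstallPolicy.ALLOW_ANY:
        return True
    if policy is InstallPolicy.PLATFORM_ONLY:
        return is_platform_app
    return is_maker_approved
```

Centralizing the decision in one gate is what lets the car maker tighten or loosen the policy without touching the installer itself.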
Other examples of preventive features include trusted modules and protected memory spaces. Trusted modules can protect key system APIs by putting them off-limits to installed applications that have not been certified as trustworthy. Protected memory spaces require each application process to run in its own protected virtual memory region. This prevents a process from trampling on the memory of another process, which could cause system error and failure.

Corrective features, as the name implies, kick in to stop problems once they have occurred and been detected. For example, in the event that a critical process is not running properly, process monitoring can terminate and restart the routine, ensuring greater reliability.
Process monitoring can also stop memory leaks. If a process exceeds a preset memory bound, process monitoring can terminate it or deny it additional memory. If a process hangs in the system without responding, the process can be terminated to keep it from slowing the rest of the system.

Diagnostic features greatly improve reliability by providing additional ways to monitor errant system functions. For example, with event logging, any event in the system can be logged to a persistent data store that will survive cold boots as well as warm boots. This information can be recalled later for analysis.
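The corrective behaviors described above (restarting a dead process, terminating one that exceeds its memory bound or stops responding) can be sketched as a watchdog loop. This is a hypothetical illustration, not the platform's actual process monitor; `get_memory_kb` and `heartbeat_ok` stand in for real platform hooks, and the event list doubles as a crude event log:

```python
import subprocess
import time

MEMORY_LIMIT_KB = 64 * 1024          # illustrative per-process memory bound

def monitor(cmd, get_memory_kb, heartbeat_ok, cycles=3, interval=0.5):
    """Watchdog: keep `cmd` running, restarting it if it dies and
    terminating (then restarting) it if it leaks memory or hangs."""
    proc = subprocess.Popen(cmd)
    events = []                       # crude event log for later analysis
    for _ in range(cycles):
        time.sleep(interval)
        if proc.poll() is not None:                   # process died
            events.append("restarted")
            proc = subprocess.Popen(cmd)
        elif get_memory_kb(proc.pid) > MEMORY_LIMIT_KB:
            proc.kill()                               # leaky: terminate
            events.append("killed: memory bound exceeded")
            proc = subprocess.Popen(cmd)
        elif not heartbeat_ok(proc.pid):
            proc.kill()                               # hung: terminate
            events.append("killed: not responding")
            proc = subprocess.Popen(cmd)
    proc.kill()
    return events
```

A real monitor would run indefinitely and hook into the platform's memory accounting; the fixed cycle count here just keeps the sketch testable.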
Once the design specification, with its usability design and reliability features, is completed, it is time to construct the product. Various methods that you can employ during construction, including code reviews, daily builds, schedule tracking and issue tracking, can add up to a major difference in the ultimate reliability of your software or device.
Continual code reviews are a key way to build in reliability during construction. As the product is built, developers should review each of the modules, looking for design and development flaws to correct. Testers, meanwhile, should collect information on how each module operates, so that they can determine how best to verify the code. During these code reviews, each module is mapped back to its specified feature to ensure that all of the intended features are implemented.
It is also important to have an independent build machine that compiles the source code and creates the system components. The build machine should synchronize the code tree daily and construct a clean build. The build of the components is then published to the team once it's available so that the testers can download and verify what is ready for verification. Every day the build machine gives the team feedback on how well the components integrate with each other so that product integrity is maintained.
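The daily-build loop amounts to running the same ordered steps every day and reporting the first failure. A minimal sketch, assuming each step is an external command whose exit code signals success (the step names are illustrative):

```python
import subprocess

def daily_build(steps):
    """Run each build step (e.g. sync, compile, package, publish) in
    order; stop at the first failure and report which step broke, so
    the team gets daily feedback on how well the components integrate."""
    for name, cmd in steps:
        result = subprocess.run(cmd, capture_output=True)
        if result.returncode != 0:
            return f"FAILED at {name}"
    return "build OK"
```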
Automotive software and devices don't exist in isolation, but rather are integral parts of larger automotive projects. Meeting the scheduled delivery deadlines of those larger projects is crucial. As the project progresses during construction, developers need to keep a handle on its overall status and to stay within the limits of project milestones. Progress should be checked weekly and any anomalies should be managed rapidly.
There can be a tremendous number of issues identified, debugged and fixed over the course of a development project. An issue tracking system is important to monitor all of this.
To track each of our issues in the construction of Windows CE for Automotive, for example, we use an issue tracking system with these procedures:
1. Whoever identifies the issue writes it up in the tracking system, detailing the sequence of steps required to reproduce the problem.
2. The issue is ranked for severity on a scale of 1 to 4. For example, crashing the system would be severity 1, while painting a misplaced pixel might be severity 4.
3. The issue is ranked for priority, also on a scale of 1 to 4. If the function in question is one the consumer is likely to use frequently, the issue is priority 1. Infrequently used functions receive a lower priority.
4. The issue is assigned for investigation. A person in one of the functional groups is identified and assigned the task of resolving the issue. Once assigned an issue, the individual may come up with better steps to reproduce the problem, determine how a feature should be implemented or debug and fix the issue.
5. The issue is resolved and closed. Once the issue is resolved, it is assigned back to the person who originated it. That person inspects and confirms the resolution. If the resolution checks out, the issue is then closed.

Our issue tracking system is very important for the Windows CE for Automotive team. On heavy testing and development days, 200 to 300 issues can be identified and 100 to 200 issues can be resolved.
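The five-step lifecycle can be modeled as a small record type. This is an illustrative sketch, not the team's actual tracking system; the field and method names are invented:

```python
from dataclasses import dataclass

@dataclass
class Issue:
    """Minimal issue record mirroring the five-step procedure above."""
    title: str
    repro_steps: list     # step 1: how to reproduce the problem
    severity: int         # step 2: 1 = system crash ... 4 = cosmetic
    priority: int         # step 3: 1 = frequently used function
    assignee: str = ""
    status: str = "open"

    def assign(self, person):      # step 4: assign for investigation
        self.assignee = person

    def resolve(self):             # step 5: back to the originator
        self.status = "resolved"

    def close(self):               # step 5: originator confirms the fix
        if self.status != "resolved":
            raise ValueError("only resolved issues can be closed")
        self.status = "closed"
```

The guard in `close` captures the key discipline: nobody closes an issue except the person who originated it, and only after confirming the resolution.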
Once you've implemented all the features, you must test the entire product to ensure that functionality and performance are correct. We use testing strategies that include component testing, hardware verification, beta testing, and several forms of integration verification.
Your first testing task is to ensure that all components are operating properly. Are all the functions implemented? Are the functions correct? Do all of the interface combinations work? Is the component performing fast enough? You can answer these questions by developing a test harness that is aimed at particular subsystems such as speech, audio, drivers and others. When you are developing a product that is closely mated to hardware, much of the early debugging of the overall device needs to include hardware verification that considers the hardware as a potential source of problems.
It is very important that the hardware be verified functionally before beginning most other tests.
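A component test harness of the kind described above can be as simple as a table of named checks run against one subsystem. The subsystem and check names below are invented for illustration:

```python
def component_harness(component, checks):
    """Run a battery of named checks against one subsystem and report
    pass/fail for each: are the functions implemented, are they
    correct, do the interface combinations work?"""
    return {name: bool(check(component)) for name, check in checks.items()}
```

In practice each subsystem (speech, audio, drivers) gets its own table of checks, and the harness is rerun on every daily build.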
We also consider a range of test cases to be essential for verifying in-car computing device functionality:
- Power management. Does the system move from power state to power state properly? This is a key problem for devices installed in the car.
- Ignition on/off testing. Does the system start/stop properly? Are all of the drivers properly initialized and uninitialized? When testing devices built on Windows CE for Automotive, we use an automated system with which we can sweep through ignition on/off cycle times from as short as 10 milliseconds up to 10 seconds.
- I/O testing. Do the devices perform properly for all edge cases? Do the devices work between ignition on/off cycles?
- In-car testing. No matter how much bench testing is performed, more issues are discovered once the device is in the car.
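The automated ignition sweep, for instance, can be sketched as a generator of dwell times driven against a device-under-test hook. This is illustrative only; `cycle_device` stands in for whatever actually toggles ignition power on the test rig:

```python
def ignition_sweep(start_ms=10, stop_ms=10_000, factor=10):
    """Yield ignition on/off dwell times in milliseconds, sweeping
    from very fast cycling (10 ms) up to slow cycling (10 s)."""
    dwell = start_ms
    while dwell <= stop_ms:
        yield dwell
        dwell *= factor

def run_sweep(cycle_device, **sweep_args):
    """Cycle the device once per dwell time; collect the dwell times
    at which it failed to start or stop cleanly."""
    return [d for d in ignition_sweep(**sweep_args) if not cycle_device(d)]
```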
We use component testing and hardware verification repeatedly to ensure that we haven't regressed on reliability as we've added new features. But they're just the first steps in the verification process.
Test the integration
Once you believe you have good hardware and software components, you are ready to test the integration. You can use several strategies to do this, including scenario-driven testing, ad hoc testing, random walk testing and memory consumption tests.

Scenario-driven testing, as the name implies, relies on testing the product against documented scenarios that simulate the way consumers will actually use the product. For example, you might select a contact from the software's address book and then use the contact's phone number to dial the phone. In the case of Windows CE for Automotive, we have documented thousands of scenarios, and all of them are tested before the product ships.
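One way to sketch a scenario runner: each step pairs an action with a check of its expected outcome. The address-book example below is stubbed with plain Python objects; none of these names are the actual test infrastructure:

```python
def run_scenario(name, steps):
    """Run one documented scenario. Each step is a tuple of
    (description, action, check): perform the action, then verify
    that its result matches the expected outcome."""
    for description, action, check in steps:
        if not check(action()):
            return f"{name}: FAILED at '{description}'"
    return f"{name}: PASSED"
```

Keeping scenarios as data rather than code is what makes it practical to accumulate, document and rerun thousands of them before each release.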
With ad hoc testing, the tester strays from the scripts of the scenarios and attempts scenarios that may be out of the ordinary. Often, it is best to perform ad hoc testing on an area of the system with which you have little experience, because you may not know how it is supposed to work. We also conduct ad hoc tests by installing devices in the cars of the team members. Everyone has his or her favorite way of using a product and this creates a fairly diverse set of ad hoc tests.
Random walk testing, which tests the completely random and often ineffectual ways a consumer might interact with a product, is also called "monkey testing." That's based on the idea of letting a monkey loose on your system for hours at a time. Would it come up with the unified field theory or would it smash your system to bits? The way we enable random walk or monkey testing is to create a program that randomly executes accelerated user actions: press a key, say a recognized voice command word, etc. Then it does this for hours on end, executing millions of command sequences.
Usually, in the early stages, we are able to crash the system pretty regularly with this sort of testing regimen. However, over time the monkey stays up for longer periods, giving us confidence in our reliability (if you can survive a monkey...). Just so you know, the versions of Windows CE for Automotive that we release to our manufacturing partners all can withstand "monkeys" that stay operational for weeks on end.
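A minimal monkey tester might look like the following. The action vocabulary is invented, and a real rig drives the device's actual inputs; seeding the random generator is one common way to make any crash reproducible:

```python
import random

# Illustrative action vocabulary; a real rig would drive key presses,
# touch input and voice commands on the device itself.
ACTIONS = ["press_key", "say_command", "tap_screen", "rotate_knob"]

def monkey_test(handle_action, iterations, seed=0):
    """Fire random user actions at the system under test; return how
    many actions it survived before the handler raised (crashed)."""
    rng = random.Random(seed)   # seeded so any crash is reproducible
    for count in range(iterations):
        try:
            handle_action(rng.choice(ACTIONS))
        except Exception:
            return count        # crashed after `count` successful actions
    return iterations
```

The survival count gives exactly the trend described above: early on the monkey dies quickly, and a maturing system survives longer and longer runs.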
Memory consumption testing addresses the simplest problem with most computer programs: they consume more resources than they should. This usually shows up as a system component that is slowly leaking memory and eventually takes over all memory, ultimately locking up the system. For system reliability, it's very important that you catch and fix memory consumption issues. Early in the device verification process, memory consumption problems are fairly easy to identify and eliminate. Later, as testing continues and the obvious bugs are removed, the remaining memory consumption bugs are among the most difficult to detect and correct. One way to identify leaks is to perform repeated testing and observe the memory consumption trend of each process over time.
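Trend observation can be automated with a simple heuristic: flag any process whose memory samples rise on every reading across a test run. A sketch, assuming samples are collected per process at fixed intervals (the process names here are invented):

```python
def leak_suspects(samples, min_growth_kb=0):
    """Given per-process memory samples over time ({pid: [kb, ...]}),
    flag processes whose usage rises on every sample: the classic
    signature of a slow leak."""
    return [
        pid
        for pid, series in samples.items()
        if len(series) > 1
        and all(b - a > min_growth_kb for a, b in zip(series, series[1:]))
    ]
```

Raising `min_growth_kb` filters out jitter from normal allocation churn so only sustained growth is flagged.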
In many ways, the ultimate testing is beta testing: getting real devices into the hands of real consumers in real settings. Equally important as getting the device to consumers is getting their impressions in their own words. By this time, you should have a good understanding of your product's reliability and shouldn't encounter surprises. Beta testing is valuable in confirming the importance of various features and verifying the user design for a last time before you release the product.
Bruce Johnson, Development Manager, Automotive Business Unit, Microsoft Corp., Redmond, Wash.