Somewhere a developer is kicking himself for not "kicking the dog."
In 1994, the Clementine Deep Space Program Science Experiment was lost in space not because of a hardware failure but because of a software crash compounded by missing code for a run-of-the-mill watchdog timer, as software expert Jack Ganssle will describe in his DESIGN West session, Mars Ate My Spacecraft.
In his session, on Wednesday, April 24 at 11:45am, Ganssle will describe high-profile embedded-systems disasters and extract lessons all software engineers should heed in future projects. Schadenfreude is not the goal: Ganssle wants embedded systems software developers to practice what hardware developers have done for years out of necessity, namely architect right the first time.
[Register for DESIGN West 2013, April 22-25 at the San Jose McEnery Convention Center. Options range from an All-Access Pass, which includes the Black Hat (security) conference session, to free expo admission.]
Hardware designers spend most of their time in the design stage, carefully creating nearly perfect architectures before implementation--mainly because hardware modifications are so costly. Software developers, on the other hand, often jump right in and write poorly architected code, and then spend half their time debugging it. The result: a disproportionate number of system failures are caused by software, as evidenced by several billion-dollar failures in space.
The Clementine mission to fly from the moon to a nearby asteroid was a failure, because the overworked embedded software team did not write the code to use their watchdog timer.
"My session will be recount some of the world's most famous embedded-system software failures--the rationale being that hardware failures are often presented to young engineers so that those mistakes are not made again, but in the software world failures are often quietly sweep under rug," said Ganssle. "As the title of my session indicates, my examples will be drawn from failed spacecraft. The results were enormous wastes of money, and yet this information has largely been buried. My point is that instead of burying their mistakes, wise engineers need to share them widely so that others can learn from their mistakes, rather than continuing to make all the same mistakes over again ourselves."
A prime example Ganssle will describe in detail is the Clementine Deep Space Program Science Experiment, which was lost in space in 1994. The mission failed during its second phase, when it was scheduled to travel from the Moon into deep space, where it would fly by the asteroid Geographos. After heading toward the asteroid, the spacecraft went silent for 20 minutes because of a software crash. When it finally came back online, all of its fuel had been wasted by firing its thrusters for 11 minutes straight. The mission consequently had to be scrubbed, resulting in an enormous waste of funds and resources.
"The thing that fascinates me about the failure of Clementine is that it could have been saved by a simple watch-dog timer, which was available in the hardware, but the development schedule had been so compressed that the programmers never had time to write the code to turn it on," said Ganssle.
As a result of that failure, Clementine's software engineers urged programmers on other ongoing space missions to add the code to use their watchdog timers. Unfortunately, the Near Earth Asteroid Rendezvous (NEAR) mission ran into exactly the same problem in 1998, because its programmers did not heed the warning. As a result, 29 kilograms of reserve fuel was dumped when its thrusters fired in error. A watchdog timer could have avoided that problem, which demonstrates just how difficult it is to learn from other programmers' mistakes.
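To make the watchdog idea concrete, here is a minimal simulation sketch. It is not the flight software of Clementine or NEAR, and every name and number in it (`Watchdog`, `run_mission`, the tick counts) is a hypothetical chosen for illustration. Healthy software "kicks" the watchdog on every loop; if the software hangs and stops kicking, the timer expires and forces a reset that, in this toy model, also shuts the thrusters off.

```python
class Watchdog:
    """Simulated watchdog: counts down once per tick, fires a reset on expiry."""

    def __init__(self, timeout_ticks, on_expire):
        self.timeout_ticks = timeout_ticks
        self.on_expire = on_expire      # hypothetical recovery callback
        self.remaining = timeout_ticks
        self.enabled = False            # off until software turns it on

    def enable(self):
        self.enabled = True
        self.remaining = self.timeout_ticks

    def kick(self):
        # Healthy software calls this regularly to restart the countdown.
        self.remaining = self.timeout_ticks

    def tick(self):
        # Stands in for the hardware clock advancing the timer.
        if not self.enabled:
            return
        self.remaining -= 1
        if self.remaining <= 0:
            self.on_expire()
            self.remaining = self.timeout_ticks


def run_mission(total_ticks, hang_at, watchdog_enabled, timeout_ticks=5):
    """Return the number of ticks during which thrusters fired in error."""
    state = {"hung": False, "thrusting": False}

    def reset():
        # Assumed recovery action: restart the software, thrusters off.
        state["hung"] = False
        state["thrusting"] = False

    dog = Watchdog(timeout_ticks, reset)
    if watchdog_enabled:
        dog.enable()

    wasted = 0
    for t in range(total_ticks):
        if t == hang_at:
            # Simulated crash: software hangs with thrusters stuck on.
            state["hung"] = True
            state["thrusting"] = True
        if not state["hung"]:
            dog.kick()                  # only running software can kick
        if state["thrusting"]:
            wasted += 1
        dog.tick()
    return wasted
```

With these made-up numbers, `run_mission(100, 10, False)` counts 90 ticks of erroneous thruster firing, while `run_mission(100, 10, True)` counts only 4: the damage is bounded by the watchdog timeout rather than by how long the software stays hung.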
"In the embedded world we are too focused on just fixing bugs and moving on," said Ganssle. "After my review of the last 40 years of embedded software development, the number one lesson we have to learn is that if you focus on fixing bugs, you will never get a quality product. Quality needs to be addressed at the beginning--top down. We need to think long and hard about our architecture, review our designs, then write a program that is pretty close to perfect before we even start testing it."
Software engineering needs to use the same design principles as hardware engineering: predict what is going to happen, design something that will meet those predictions, and then measure the results to make sure the design did indeed meet them.
"We need to close that feedback loop--rather than just quickly write a program that only does a part of the job, then deal with the bugs," said Ganssle.
Session info: Mars Ate My Spacecraft on Wednesday, April 24, 11:45 AM - 12:30 PM.
Conference Home Page: Design West 2013 (April 22-25, San Jose, Calif.)