DESIGN West: Mars ate my spacecraft
R Colin Johnson
3/12/2013 1:20 PM EDT
Somewhere a developer is kicking himself for not "kicking the dog."
In 1994, the Clementine Deep Space Program Science Experiment was lost in space not because of a hardware failure but because of a software failure, compounded by missing code for a run-of-the-mill watchdog timer, as software expert Jack Ganssle will describe in his DESIGN West session, "Mars Ate My Spacecraft."
In his session, on Wednesday, April 24 at 11:45 a.m., Ganssle will describe high-profile embedded-systems disasters and extract lessons all software engineers should heed in future projects. Schadenfreude is not the goal: Ganssle wants embedded-systems software developers to practice what hardware developers have done for years out of necessity, namely getting the architecture right the first time.
[DESIGN West 2013 runs April 22-25 at the San Jose McEnery Convention Center. Registration options range from an All-Access Pass, which includes Black Hat (security) conference sessions, to free Expo admission.]
Hardware designers spend most of their time in the design stage, carefully creating nearly perfect architectures before implementation--mainly because hardware modifications are so costly. Software developers, on the other hand, often jump right in and write poorly architected code, and then spend half their time debugging it. The result: a disproportionate number of system failures are caused by software, as evidenced by several billion-dollar failures in space.
"My session will be recount some of the world's most famous embedded-system software failures--the rationale being that hardware failures are often presented to young engineers so that those mistakes are not made again, but in the software world failures are often quietly sweep under rug," said Ganssle. "As the title of my session indicates, my examples will be drawn from failed spacecraft. The results were enormous wastes of money, and yet this information has largely been buried. My point is that instead of burying their mistakes, wise engineers need to share them widely so that others can learn from their mistakes, rather than continuing to make all the same mistakes over again ourselves."
Clementine
A prime example Ganssle will describe in detail is the Clementine Deep Space Program Science Experiment, which was lost in space in 1994. The mission failed during its second phase, when it was scheduled to travel from the Moon into deep space to fly by the asteroid Geographos. After the spacecraft had set out toward the asteroid, a software crash left it silent for 20 minutes. When it finally came back online, all of its fuel had been wasted by firing its thrusters for 11 minutes straight. The mission consequently had to be scrubbed, resulting in an enormous waste of funds and resources.
"The thing that fascinates me about the failure of Clementine is that it could have been saved by a simple watch-dog timer, which was available in the hardware, but the development schedule had been so compressed that the programmers never had time to write the code to turn it on," said Ganssle.
As a result of that failure, Clementine's software engineers went to the programmers of other ongoing space missions to encourage them to add the code to use their watchdog timers. Unfortunately, the Near Earth Asteroid Rendezvous (NEAR) mission ran into exactly the same problem in 1998, because its programmers did not heed the warning. As a result, 29 kilograms of reserve fuel was dumped when its thrusters fired in error, a problem that a watchdog timer could have prevented. The episode demonstrates just how difficult it is to learn from other programmers' mistakes.
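To make "kicking the dog" concrete, here is a minimal C sketch. It models the watchdog in software as a down-counter. The type and function names (`sim_watchdog_t`, `sim_wdt_enable`, `sim_wdt_kick`, `sim_wdt_tick`, `ticks_spent_hung`) are hypothetical and are not any vendor's API, and the sketch makes no claim about how Clementine's or NEAR's flight software was actually structured. A simulated crash stops the control loop at a chosen tick, and the sketch counts how long the loop stays hung with and without the watchdog enabled.

```c
#include <stdbool.h>
#include <stdint.h>

/* Hypothetical software model of a hardware watchdog timer: a down-counter
   that the "hardware" decrements every tick and that forces a reset when it
   reaches zero. Real parts expose this through vendor-specific registers. */
typedef struct {
    bool     enabled;  /* the watchdog does nothing until software enables it */
    uint32_t reload;   /* timeout in ticks; must be greater than zero */
    uint32_t counter;  /* ticks left before a forced reset */
    uint32_t resets;   /* how many resets the watchdog has forced */
} sim_watchdog_t;

/* The step Ganssle says Clementine's software never performed: turning it on. */
static void sim_wdt_enable(sim_watchdog_t *w, uint32_t timeout_ticks) {
    w->enabled = true;
    w->reload  = timeout_ticks;
    w->counter = timeout_ticks;
}

/* "Kicking the dog": healthy code reloads the counter on every pass. */
static void sim_wdt_kick(sim_watchdog_t *w) {
    w->counter = w->reload;
}

/* Advance the watchdog by one tick; returns true if it forced a reset. */
static bool sim_wdt_tick(sim_watchdog_t *w) {
    if (!w->enabled) {
        return false;
    }
    if (w->counter > 0) {
        w->counter--;
    }
    if (w->counter == 0) {
        w->resets++;
        w->counter = w->reload; /* on real hardware: reboot into a known-safe state */
        return true;
    }
    return false;
}

/* Run a toy control loop for `total` ticks. The loop kicks the watchdog every
   tick until a simulated software crash at tick `hang_at`, after which it stops
   executing. Returns how many ticks the loop spent hung, i.e. how long anything
   it was supposed to supervise (in this toy model, a thruster) went unchecked. */
static uint32_t ticks_spent_hung(bool use_watchdog, uint32_t total,
                                 uint32_t hang_at, uint32_t timeout) {
    sim_watchdog_t w = {0};
    if (use_watchdog) {
        sim_wdt_enable(&w, timeout);
    }
    bool hung = false;
    uint32_t hung_ticks = 0;
    for (uint32_t t = 0; t < total; t++) {
        if (t == hang_at) {
            hung = true;        /* the crash: the loop stops running */
        }
        if (hung) {
            hung_ticks++;
        } else {
            sim_wdt_kick(&w);
        }
        if (sim_wdt_tick(&w)) {
            hung = false;       /* the forced reset restarts the loop */
        }
    }
    return hung_ticks;
}
```

With the watchdog never enabled, `ticks_spent_hung(false, 1000, 100, 10)` returns 900: the loop stays hung for the rest of the run. With it enabled, `ticks_spent_hung(true, 1000, 100, 10)` returns 9, because the counter runs out within the 10-tick timeout that began at the last kick. The design point is that the hardware bounds the damage only if software both enables the watchdog and kicks it from code that stops running when the system goes wrong.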
"In the embedded world we are too focused on just fixing bugs and moving on," said Ganssle. "After my review of the last 40 years of embedded software development, the number one lesson we have to learn is that if you focus on fixing bugs, you will never get a quality product. Quality needs to be addressed at the beginning--top down. We need to think long and hard about our architecture, review our designs, then write a program that is pretty close to perfect before we even start testing it."
Software engineering needs to use the same design principles as hardware engineering: predict what is going to happen, design something that will meet those predictions, and then measure the results to make sure they indeed meet those predictions.
"We need to close that feedback loop--rather than just quickly write a program that only does a part of the job, then deal with the bugs," said Ganssle.
Session info: Mars Ate My Spacecraft on Wednesday, April 24, 11:45 AM - 12:30 PM.
Conference Home Page: Design West 2013 (April 22-25, San Jose, Calif.)
[Image caption: The Clementine mission to fly from the Moon to a nearby asteroid was a failure because the overworked embedded software team did not write the code to use their watchdog timer.]
DrQuine
3/12/2013 8:59 PM EDT
It is human nature to believe that we won't repeat a "mistake" made by someone else. A key lesson to me is that (as a bare minimum), every one of these lessons should be added to the readiness checklist for new projects. While I'm not in the aerospace arena, I find it is also very helpful to record the root causes of each programming error that I experience. This not only increases my awareness of the likely errors (so I avoid making them) but also gives some hints of where to look for trouble. (Years ago, I found that my software problems were most often associated with mismatched global variables.)
sixtysixscrews
3/13/2013 7:46 PM EDT
I had a story I told about auto repair:
'Guy drags in his car and tells you the fuel pump is bad. You replace the fuel pump; the car still doesn't work. You ask the guy if he wants you to fix the car.' End of story? No... you, as the storyteller, are an arrogant s**t and should have done your diagnostic work regardless of what the owner said.
This applies to specifications passed on to software/hardware engineers by project manglers: question first, then develop. Doing otherwise risks the mission and your career/company.
wb
AndyKunzHH
3/15/2013 2:24 PM EDT
Clementine didn't fail because of firmware, software, or hardware. It failed because of management. PERIOD. The comment, "... but the development schedule had been so compressed that the programmers never had time to write the code to turn it on" tells the root cause. Schedules are not set often enough by the people with a) the most at stake (the engineers, who will be fired for failing), and b) the people with the most sense as to a reasonable estimate (again, the engineers).
Mike.Nemeth_#1
3/15/2013 3:00 PM EDT
This lesson is less for the programmer and more for the project managers running these developments. It's just too easy for short-sighted managers, with the clock ticking, to measure progress by lines of code written rather than by time spent developing a quality architecture. These clueless folks fear that all the code won't get written if their people spend their time designing instead of writing actual code.
AlPothoof
3/18/2013 5:56 PM EDT
I firmly agree with Jack and always have; software jocks get accused of being cowboys with their code and pushing the "churn and burn" model.
And yet one of our clients is sending some of us to Test Driven Development training. The talk there is all about "the tests embody the requirements" and "emergent design."
Two ends of the spectrum.
ooferwog
3/21/2013 5:00 AM EDT
But in his article, isn't Jack doing this very thing? Accusing software jocks of being cowboys with their code and pushing the 'churn and burn' model (even whilst flatly contradicting himself in his own article as to the problem's true cause?)
On the one hand Jack ascribes the problem as being that of programmers' failure to come up with the perfect design from outset (dream on, bro), but then states that the problem was a "compressed schedule" which, by its very nature, forestalls any such luxury! And who came up with this compressed schedule, anyway? The programmers?
I don't mean to rain on your parade, but Jack's head is evidently planted firmly where the sun don't shine.
You firmly agree with Jack and always have? Surely you jest. This entire article is a comedy of errors. The man is clueless.
-Tex
ooferwog
3/21/2013 8:26 PM EDT
Little Jack Horner sat in the corner, eating his Christmas pie. He put in his thumb and pulled out a plum, and said "What a good boy am I!"
In this nonsensical rhyme, Jack Horner concludes he is a 'good boy' based on what evidence? He pulled out a plum. Why would this act signify anything of the sort to anyone with any sense? It is a non sequitur, exactly the same sort of non sequitur Mr. Jack Ganssle draws in his planned presentation. Nonsensical. He even states the actual problem and then draws from it a conclusion having nothing whatsoever to do with his example, and everything to do with project mismanagement. With managers like this on the job, is it any wonder Clementine failed?
Furthermore, the solution Ganssle proposes would not have succeeded in Clementine's development environment regardless. Why not? There simply wasn't time. Does Ganssle think that all this careful planning beforehand would have sidestepped Clementine's failure, when the reality was that this planning would never have occurred in the first place? At what point in Clementine's software development effort would there have been time, when Ganssle himself states that there wasn't enough time even to 'turn on the code' (whatever that means)?
Finally, in his proposed solution, Ganssle ignores *what else* has gone on meanwhile in his forty years' overview. What about unit testing? Testing soon and often? Making use of reusable and thoroughly tested software components and libraries? Best practices for robust, reliable software development? Redundancy? A whole host of techniques and technologies which have been developed over these past forty years, tested in the Real World and deployed in same with huge success? No, design near-perfect software from the outset (how will he know it's near-perfect?), then write the code. No mention of testing?
Why is this noob giving a presentation at Design West?
-Tex
john_e_k
4/4/2013 5:33 PM EDT
Because management made him do it?
Shenal
4/11/2013 2:09 AM EDT
Haha, exactly. That's hilarious. Well said, John.
Rick.Bosma
4/23/2013 8:05 AM EDT
Mr. Tex,
Out of respect for Mr. Ganssle, I'd like to point out that he's been doing embedded-system work since the first microprocessors rolled off the line. Perhaps you could extend him the common courtesy of at least Googling him to find out he is a highly respected expert who was around long before the term "noob" became part of Gen X vocabulary. You may find you'd actually like to attend his lecture or his excellent course.
VectorForce
4/22/2013 1:37 PM EDT
Clementine was developed to space-qualify technologies for NRL, which realized that it could also be used to obtain better data on the Moon as a secondary mission. It was completely successful in these missions, establishing a "cheaper, faster, better" baseline.
Having completed its missions, but still in good condition, it was given a further mission to Geographos. To say that its failure to accomplish that additional mission means it was a failure is inaccurate. However, we can still learn lessons from that failure.