The steps represent what any highly available, reliable system has to go through. After 30 years of designing and implementing software that runs major manufacturing plants, software that has to
1. tolerate any single component failure (hardware/software/network).
2. tolerate multiple failures at multiple levels
3. run on a 365x24 basis (6-sigma is not good enough)
I understand something of this. Part of the process is also planning for failure. :-) So, I'd like to see some treatment of this issue. Planning and designing for reliability is one thing, but if you don't also plan and design for inevitable component failures (components being hardware, software, or environmental) then you are only halfway there... :-)
As you say, you have to ensure you can withstand failure and have the reliability architecture to enable your mission goals. This is where the Failure Modes, Effects and Criticality Analysis (FMECA) is very important, as it allows you to determine the effect on the system of a component failure.
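To make the criticality part of an FMECA a little more concrete, here is a minimal Python sketch that ranks failure modes by a criticality number. It assumes the MIL-STD-1629A style formulation (conditional effect probability x failure mode ratio x part failure rate x operating time); the component names, failure rates, ratios, and the 15-year duration are all hypothetical values chosen for illustration, not figures from a real analysis.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    component: str
    mode: str
    failure_rate_per_hour: float  # part failure rate (lambda_p), hypothetical
    mode_ratio: float             # alpha: share of part failures in this mode
    effect_probability: float     # beta: chance the failure causes the effect

    def criticality(self, mission_hours: float) -> float:
        """Failure mode criticality number: beta * alpha * lambda_p * t."""
        return (self.effect_probability * self.mode_ratio
                * self.failure_rate_per_hour * mission_hours)


def rank_by_criticality(modes, mission_hours):
    """Return the failure modes sorted from most to least critical."""
    return sorted(modes, key=lambda m: m.criticality(mission_hours),
                  reverse=True)


MISSION_HOURS = 15 * 8760  # illustrative mission duration in hours

# Hypothetical worksheet entries, for illustration only.
modes = [
    FailureMode("dc_dc_converter", "output short", 50e-9, 0.3, 1.0),
    FailureMode("dc_dc_converter", "output drift", 50e-9, 0.7, 0.1),
    FailureMode("fpga", "configuration upset", 200e-9, 0.5, 0.5),
]

ranked = rank_by_criticality(modes, MISSION_HOURS)
for m in ranked:
    print(f"{m.component:16s} {m.mode:20s} {m.criticality(MISSION_HOURS):.2e}")
```

The ranking is what matters here: the modes at the top of the list are the ones where redundancy or extra protection is most likely to be worth the cost.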
Typically our systems have to function for 17 years with very little downtime, as a telecommunications satellite is a business. This means you need a high probability of success, which is determined by the FIT rates after the FMECA and by the reliability architecture.
Should I write more about this area?
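As a rough illustration of how a FIT rate turns into a probability of success, the sketch below assumes a constant failure rate (the exponential model), where 1 FIT is one failure per 10^9 device-hours. The FIT values, the function names, and the idealised 1-out-of-2 active redundancy (perfect switching, no common-cause failures) are assumptions made for the example, not numbers from a real satellite.

```python
import math

HOURS_PER_YEAR = 8760


def reliability(fit: float, years: float) -> float:
    """Probability of surviving `years`, assuming a constant failure rate.

    1 FIT = 1 failure per 1e9 device-hours, so lambda = fit * 1e-9 per hour.
    """
    failure_rate_per_hour = fit * 1e-9
    return math.exp(-failure_rate_per_hour * years * HOURS_PER_YEAR)


def series(*reliabilities: float) -> float:
    """All blocks must work: multiply the individual reliabilities."""
    return math.prod(reliabilities)


def parallel_1oo2(r: float) -> float:
    """Two identical units, either one is enough (idealised redundancy)."""
    return 1.0 - (1.0 - r) ** 2


# Hypothetical figures for a 17-year life.
unit = reliability(fit=1000, years=17)        # single unit on its own
redundant = parallel_1oo2(unit)               # same unit, duplicated
chain = series(redundant, reliability(fit=200, years=17))

print(f"single unit:    {unit:.4f}")
print(f"redundant pair: {redundant:.4f}")
print(f"pair + series:  {chain:.4f}")
```

Even this simplified model shows why the reliability architecture matters as much as the raw FIT rates: duplicating a unit raises the probability of success noticeably, while every extra block in series pulls it back down.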
I am not from the "space telecom/satellite industry", but to some extent I understand the rigor behind the multiple phases of prototyping, reliability calculation, and testing, as the equipment you design can't afford a failure that causes loss of function and can't come back for a "repair or replacement" :). It is pretty interesting to know the 'Model Philosophy' behind your job.
I am from an industrial background, and the design philosophy differs a bit. Money is a factor here, and so is time to market. Hence things are done in a shorter time, with fewer prototype passes, but certainly without compromising the "needed" quality. Again, there are certain applications such as "Functional Safety" where the failure of a system could cause a hazard to human lives or loss of property...there we also practice similar kinds of rigor...we do HAZOP, FMEDA etc. to calculate the probability of failures and follow standards (such as IEC 61508) to avoid "systematic faults", design the necessary diagnostics, or build in redundancy to maintain the required "Safety Integrity Level"...we also need to go through assessment by independent agencies (such as TÜV) and approvals. You might have heard about Functional Safety (IEC 61508).
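For readers who have not met these numbers before, here is a small Python sketch of two quantities that typically come out of an FMEDA: the safe failure fraction and a simplified average probability of failure on demand for a single-channel (1oo1) function, mapped onto the IEC 61508 low-demand SIL bands. The failure rates and the one-year proof test interval are hypothetical, the PFD formula is the common simplified approximation (dangerous undetected rate x proof test interval / 2), and a real SIL claim also depends on architectural constraints and systematic capability, which this sketch ignores.

```python
def safe_failure_fraction(lambda_safe: float, lambda_dd: float,
                          lambda_du: float) -> float:
    """Share of failures that are safe or dangerous-but-detected."""
    total = lambda_safe + lambda_dd + lambda_du
    return (lambda_safe + lambda_dd) / total


def pfd_avg_1oo1(lambda_du: float, proof_test_interval_hours: float) -> float:
    """Simplified average PFD for one channel: lambda_DU * TI / 2."""
    return lambda_du * proof_test_interval_hours / 2.0


def sil_low_demand(pfd_avg: float):
    """Map a PFDavg onto the low-demand SIL bands, or None if outside them."""
    bands = [(4, 1e-5, 1e-4), (3, 1e-4, 1e-3), (2, 1e-3, 1e-2), (1, 1e-2, 1e-1)]
    for sil, low, high in bands:
        if low <= pfd_avg < high:
            return sil
    return None  # either worse than SIL 1 or beyond the SIL 4 band


# Hypothetical FMEDA results, in failures per hour.
LAMBDA_SAFE = 5e-7
LAMBDA_DD = 3e-7   # dangerous, detected by diagnostics
LAMBDA_DU = 2e-7   # dangerous, undetected
PROOF_TEST_INTERVAL = 8760  # one year, in hours

sff = safe_failure_fraction(LAMBDA_SAFE, LAMBDA_DD, LAMBDA_DU)
pfd = pfd_avg_1oo1(LAMBDA_DU, PROOF_TEST_INTERVAL)
print(f"SFF = {sff:.0%}, PFDavg = {pfd:.2e}, band = SIL {sil_low_demand(pfd)}")
```

Better diagnostics move failures from the undetected to the detected column, which raises the safe failure fraction and lowers the PFD; that is the calculation behind "design the necessary diagnostics".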
Hi Adam: Thanks for posting this. It's certainly interesting to see how a prototyping strategy shifts when you are dealing with very expensive products. I recently participated in a team building exercise where we were challenged to build an "egg protection device" using only plastic straws, duct tape, and balloons. Each team had to "purchase" supplies (our attempt to buy up the entire inventory of straws didn't fly) to make their protection device. Under the rules of engagement, if more than one egg in its protective shroud survived the drop off a ten-foot ladder, the team with the lowest BOM would be declared the winner. It turned out that our team had the only surviving egg--and the highest BOM of all! That's because we opted for a strategy of "screw the cost, we need to get something out there that works". I'm wondering if other readers identify with this approach? Or is it a ridiculous strategy that only works when the money you are dealing with is play money?
It is a big challenge, as both the non-recurring cost and the recurring cost need to be controlled. But as you say, quality is key. We find we do a lot of BOM rationalisation across the design and, if possible, across other projects too. There is also a lot of justification as to whether each component is really needed.
@Adam: very nice blog and a really interesting topic!!
I completely agree that building real models is essential in order to reduce costs when a hardware design is going to be implemented.
Of course, in the vertical markets I usually work in, reliability requirements are not as strict as yours; so I usually jump from a bread-board --or dev-kit plus breadboard-- prototype to pre-series production...
@kfield, the operational life is 15 years; the additional 2 years are for the testing and qualification campaign on the satellite and any storage while it waits for the launcher to become available.
Last week the most advanced telecommunications satellite ever was launched. The payload processor (the heart of the system and what makes it so advanced) was developed by my group here in the UK, and it took up a large portion of my life over the last few years. I have a blog for Max on it that is just getting approval, so hopefully he should have it in the next few days or so.