Engineering Investigations
Comment
The Gen
I worked for a company for 12 years designing a range of fibre optic switches. ...
m.cheah
The solution seems to be that SW guys need to learn more HW skills and vice ...
It always worked before, so you broke it
Tim Fyock
5/18/2011 1:22 PM EDT
I work as a software developer (BS in Computer Engineering) for a wire saw company. Several of the wire saws for years had a random problem of wire running off the reels, but it was considered a minor problem. A new saw model that was being built was bigger and had multiple cutting yokes. According to the schedule, we had a month; until sales told the customer we could deliver the saw in two weeks (the senior project engineer said we needed two months). The mechanical/hardware people handed over the saw a few days before the shipping date to software where all we had to do was load our usual software (modified), test, and ship to a happy customer.
Then the fun began. Each cutting yoke had unique problems and all of them had run-off problems. Mechanical and hardware assemblers said it was obviously software. Being the only embedded software engineer on this saw, all their eyes fixed on me as the bottle neck in their finely tuned organization. During the meeting, it was decided that everyone had done their job using the same mechanical/hardware as other saws before and did not have these new problems, so it was time to own my job. In a few hours of testing, I came to the conclusion that the same software running on several parallel yokes but displaying different problems was indicative of hardware malfunctions or miss-wires (with several hundred wire screws, this had happened before).
My logic was not accepted, and it was seen as trying to blame others for my buggy software (again, everyone did their job and I was the FNG who replaced an uber engineer, who over the years could do backflips on water while juggling anvils). Over the next few days, I was called to frequent meetings to give updates. After a few rejections by the technicians of my idea to check the wiring, I grabbed a screw driver and after a couple of eyeball scans looking for differences between each yoke's hardware, found several miss-wires in each yoke. Now the software was running uniformly on all the yokes. Upper and middle managers left their meeting room and asked me on the shop floor what I had fixed in my software; I said it was a just a few mis-wires. They looked at each other with even bigger scrunchy faces and ran back into their meeting room (no atta boys).
This still left the reel run off problem that now plagued the saw. Again, after several meetings, I still needed to fix my software problems and ship that saw. By this time management called back the uber engineer and another engineer to get their HMI/data-logging program to work "even better" (no, not buggy at all) and I contributed a few minor "improvements" they missed while interfacing the hardware, but I digress. I was under the microscope to fix the reel runoff problem that I had obviously made worse with my new software and my manager made the casual comment that younger engineers work harder. I was bouncing between the shop floor and my small office/file room making changes and not progressing.
On my office desk I cobbled together (out of replaced parts) a test saw to test my software (management promised engineering their own production saw to develop SW on, but it never transpired). At first, software changes took 20 minutes to load and verify (walking to the shop with an eeprom). After a few months of begging for a debug eprom and an IDE, I was now able to make software changes in seconds. In my office the software worked correctly on the test saw but randomly failed on the production saw. During even more meetings I showed that the problem resided in a single line of code that looked at the ADC to detect wire speeds, but this was questioned by sales--why did this LOC now fail when it has been working fine for five years in all the other saw models? Running the IDE on the production saw would prove my point. Earlier, I had requested a better laptop to run the IDE on the production saw (engineering had an older Pentium laptop while management/sales had P4/Core). Alas, I used the old laptop which resulted in a lot of crashes due to IDE lag times. This impressed everyone when crashes now spewed wire everywhere. The technician (who wired the saw, dropped an oscilloscope, and later broke a laptop) kept moving my laptop while running tests every time it was in his way. Eventually, the IDE showed that the ADCs, even though receiving the correct stable voltages, had a sizable ripple that prevented zero speed from being detected by the LOC I predicted was the one causing the wire to runoff the reel.
Because the test saw did not have this problem, it was a simple process of ceteris paribus that indicated the 18' lengths of display cables caused the ADC's levels to ripple. I shortened the display cables to 6'' and the saw functioned correctly. Upon showing the technician the problem, he commented that customers were not going to walk around and open the cabinet to use the saw displays and it did not look professional. My ennui kicked in and remembered the story he told months earlier of another difficult saw project he worked on that bankrupted this company thus leading to the present management team. After consulting the embedded board manufacturer, another engineer and I designed a display buffer that fixed the problem. The saw shipped, my job description no longer existed in the company, and when parting, my manager said I needed to "own my job." They then hired a son’s young friend to do a "totally different position." He didn't last. Ironically years later, most of those special people were purged. The remaining engineer wrote a recommend stating my HW/SW solutions were still being used in their current saws.
Then the fun began. Each cutting yoke had unique problems and all of them had run-off problems. Mechanical and hardware assemblers said it was obviously software. Being the only embedded software engineer on this saw, all their eyes fixed on me as the bottle neck in their finely tuned organization. During the meeting, it was decided that everyone had done their job using the same mechanical/hardware as other saws before and did not have these new problems, so it was time to own my job. In a few hours of testing, I came to the conclusion that the same software running on several parallel yokes but displaying different problems was indicative of hardware malfunctions or miss-wires (with several hundred wire screws, this had happened before).
My logic was not accepted, and it was seen as trying to blame others for my buggy software (again, everyone did their job and I was the FNG who replaced an uber engineer, who over the years could do backflips on water while juggling anvils). Over the next few days, I was called to frequent meetings to give updates. After a few rejections by the technicians of my idea to check the wiring, I grabbed a screw driver and after a couple of eyeball scans looking for differences between each yoke's hardware, found several miss-wires in each yoke. Now the software was running uniformly on all the yokes. Upper and middle managers left their meeting room and asked me on the shop floor what I had fixed in my software; I said it was a just a few mis-wires. They looked at each other with even bigger scrunchy faces and ran back into their meeting room (no atta boys).
This still left the reel run off problem that now plagued the saw. Again, after several meetings, I still needed to fix my software problems and ship that saw. By this time management called back the uber engineer and another engineer to get their HMI/data-logging program to work "even better" (no, not buggy at all) and I contributed a few minor "improvements" they missed while interfacing the hardware, but I digress. I was under the microscope to fix the reel runoff problem that I had obviously made worse with my new software and my manager made the casual comment that younger engineers work harder. I was bouncing between the shop floor and my small office/file room making changes and not progressing.
On my office desk I cobbled together (out of replaced parts) a test saw to test my software (management promised engineering their own production saw to develop SW on, but it never transpired). At first, software changes took 20 minutes to load and verify (walking to the shop with an eeprom). After a few months of begging for a debug eprom and an IDE, I was now able to make software changes in seconds. In my office the software worked correctly on the test saw but randomly failed on the production saw. During even more meetings I showed that the problem resided in a single line of code that looked at the ADC to detect wire speeds, but this was questioned by sales--why did this LOC now fail when it has been working fine for five years in all the other saw models? Running the IDE on the production saw would prove my point. Earlier, I had requested a better laptop to run the IDE on the production saw (engineering had an older Pentium laptop while management/sales had P4/Core). Alas, I used the old laptop which resulted in a lot of crashes due to IDE lag times. This impressed everyone when crashes now spewed wire everywhere. The technician (who wired the saw, dropped an oscilloscope, and later broke a laptop) kept moving my laptop while running tests every time it was in his way. Eventually, the IDE showed that the ADCs, even though receiving the correct stable voltages, had a sizable ripple that prevented zero speed from being detected by the LOC I predicted was the one causing the wire to runoff the reel.
Because the test saw did not have this problem, it was a simple process of ceteris paribus that indicated the 18' lengths of display cables caused the ADC's levels to ripple. I shortened the display cables to 6'' and the saw functioned correctly. Upon showing the technician the problem, he commented that customers were not going to walk around and open the cabinet to use the saw displays and it did not look professional. My ennui kicked in and remembered the story he told months earlier of another difficult saw project he worked on that bankrupted this company thus leading to the present management team. After consulting the embedded board manufacturer, another engineer and I designed a display buffer that fixed the problem. The saw shipped, my job description no longer existed in the company, and when parting, my manager said I needed to "own my job." They then hired a son’s young friend to do a "totally different position." He didn't last. Ironically years later, most of those special people were purged. The remaining engineer wrote a recommend stating my HW/SW solutions were still being used in their current saws.
Navigate to related information


Dave33
5/18/2011 4:30 PM EDT
I've written and debugged firmware for many years and I can really relate to this story. Solving problems like those described in the story take time and a very good engineer. Unfortunately there's no way to measure the difficulty of a problem so there's no way to measure the value of the solution. Managers usually just look at how long you took and draw their own conclusions.
Sign in to Reply
jnissen
5/18/2011 6:20 PM EDT
Been there done that! Don't care to repeat it!
Sign in to Reply
Frank Eory
5/18/2011 6:35 PM EDT
Interesting that in a new system -- new hardware and new software -- what ultimately turned out to be a signal integrity problem was blamed on software. "Mechanical and hardware assemblers said it was obviously software." And management simply took their word for it, without any data to back up that claim?
Sign in to Reply
cdhmanning
5/22/2011 11:14 PM EDT
There are two reasons for that:
1) It is relatively easy to see how hardware works and much harder to see how software works. That means people will always tend to believe the problem is where they can't see it (ie. in the software).
2) It is also a matter of hope. Software errors can be fixed on short deadlines, but not hardware errors. Thus management/sales people always hope it is a software problem. When a product is on stop ship they will readily believe anything that gives them hope that a solution is at hand.
Even if it is not really a software problem, there is often a software solution. Whenever this happens it becomes a software problem.
I once had to fix a lubrication problem in software. Without that, many idetms would have had to be recalled and scrapped costing the company a few hundred thousand dollars.
On another occasion a mechanical stop was repositioned. This could have cost approx $100k and 4 months to fix mechanically meaning the company would have lost a few million in sales. I fiddled for a day and found a software solution.
At the end of the day, all that matters is that a solution is found. The customer doesn't care what is broken or who's fault it is.
Just work together to find a solution. Everyone is trying to earn money for the same company.
Sign in to Reply
DutchUncle
5/23/2011 8:41 AM EDT
The trouble is, when you find a software WORKAROUND (emphasis important) to a hardware PROBLEM, and all that people remember is "we had to rev the software to fix it", the software side loses respect - including at review time.
Sign in to Reply
cdhmanning
5/23/2011 2:48 PM EDT
Why should the software guys lose respect?
Make sure that it does not get framed that way. Make sure that the software guys get the credit for turning around a multi-million dollar loss.
If you want recognition in a company then the best way to do that is to make an association between what you do and the performance of the company. Demonstrate **value**.
In your performance review, point out the value of what you did.
Sign in to Reply
J Kosin
5/27/2011 10:35 AM EDT
The software guy looses respect, because he caused the product to be late in the first place.
It doesn't matter if it was a HARDWARE problem, we get the short end of the stick; because we fixed the problem after everyone said the product was ready to ship.
Sign in to Reply
Tombo
5/19/2011 9:27 AM EDT
I deal with this everyday
Sign in to Reply
t.alex
5/29/2011 9:06 AM EDT
It is important the sw guy pick up some hardware debugging to prove themselves 'innocent' :)
Sign in to Reply
zeeglen
5/20/2011 2:41 PM EDT
This reminds me of a time when a watchdog timer would sporadically time out when a processor took too long to execute a command/response sequence on a control bus. This was a new product in both HW and SW, and the late night sessions in the labs were not burdened by "it's a HW fault / no it's a SW fault". We just admitted that nobody knew yet where the fault was and we had to work together to find it.
Finally, using an analog scope (digital scopes had not been invented yet) the SW guy and myself saw an event whiz by that the timeout occurred within the 1 second allocated time of the hardware. I took another look at the hardware, a long-chain ripple counter and realized that the guy who designed this had done the stage count based on a complete cycle at the final stage. He forgot that the timeout actually occurred on the rising edge HALFWAY through the cycle.
Solution: knife and green wire.
Lesson learned: Never assume HW or SW. Test, test, test....
Sign in to Reply
t.alex
5/29/2011 9:07 AM EDT
Isn't it true most of the time the sw guy do the test and debugging instead of the hardware?
Sign in to Reply
zeeglen
5/29/2011 12:53 PM EDT
Depends on what is being tested and debugged.
Sign in to Reply
jimfordbroadcom
5/20/2011 3:02 PM EDT
Well, I've worked with enough boneheaded software types who didn't know anything about hardware to say that you are the exception, Tim. Embedded engineers who know both are worth their weight in gold, even if boneheaded managers don't realize it. I totally agree with zeeglen; forget the finger-pointing and just figure out what's not working. Sometimes it's both H/W and S/W!
Sign in to Reply
J Kosin
5/27/2011 10:38 AM EDT
Yes, sometimes it is both. But we mustnot start pointing fingers before exploring the problem and what the cause is. I recently spent two days troubleshooting a problem and have worked with the HW engineer to try and determine the cause. So far, we have elimiated SW as being a possible cause to the issue.
Sign in to Reply
DutchUncle
5/20/2011 4:10 PM EDT
Sometimes this is all in the software domain. Reviewing DOS driver code while writing new interface code on an ISA plug-in board, I found some code that seemed to have a race condition and asked about it. Since this had been written by one of the original product group, now a technical leader, and had been working for years, my question was disdained and dismissed out of hand. A mere few months later, intractable customer problems were blamed on my interface code. I finally traced it back the the same DOS driver code. Turned out this was the first customer PC system we had seen at whatever speed it was (500 MHz?) and it was the first time that the PC was fast enough to catch the race condition. Now it was still my fault for not having pushed harder when I found the problem before.
Sign in to Reply
J Kosin
5/27/2011 10:40 AM EDT
Unfortunately, it is a loose-loose situation. Sorry to bring you bad news.
Sign in to Reply
Alan60
5/20/2011 5:29 PM EDT
Been there, done it, got the company golf shirt. Years ago we had a hydraulic problem, excess pressure when braking. The redesign fix was messy, even for hydraulics. The $1M machine was down and people were not happy. I suggested a SW workaround to control deceleration, which we did and were soon back up running. However, word got around that I had to fix the SW to get the machine running. Unfortunately now I have to think twice before helping others.
Sign in to Reply
WKetel
5/20/2011 8:40 PM EDT
Nobody has an exclusive right to make the mistakes, that is for certain. The problems arise when either the product definition is fuzzy, or the designer makes a mistake, or the builders don't follow the drawings. The funniest one that I experienced was a fairly simple circuit that I had designed and then built. I carefully selected the resistors to all be in the middle of where they needed to be. But when the production model was built, it was miles out of specifications. The problem eventually was traced to the arrogant detailers who did the circuit drawing. They corrected the "Mistake" made by "That dumb engineer", and added the "K" that I had neglected to put on several low value resistors. So instead of 22 ohms and 470 ohms there was 22,000 ohms and 470,000 ohms. The shortcoming of my designs are that they don't work as intended unless they are built as designed. This was reported to my manager at my next performance review, when the problem with that product was brought up. The response was "OH".
Sign in to Reply
Bob Lacovara
5/23/2011 1:27 PM EDT
Mr. Fyock, welcome to the monkey house. Your best strategy would have been to quit, although for sure that is not always a feasible solution. Since idiotic management and moronic sales types have a tendency to butt into engineering (software and hardware) you might want to consider this advice that I was once offered: "the time to start looking for a job is the day you land your new one." Sounds cynical? Wait until you've been in the field for, let's see, 11 minus 74, ah, 37 years, and consider it again. Well, in any event, we've all been there...
Sign in to Reply
GarySXT
5/26/2011 1:51 PM EDT
In the groups I manage, when there is a bug both hardware and software are considered guilty until proven innocent. Hardware and software engineers are expected to work as a team. As mentioned above software often fixes hardware problems. Also hardware debugging techniques are sometimes helpful in tracking down a software bug. Management should not allow the different disciplines to blame each other.
Sign in to Reply
J Kosin
5/27/2011 10:58 AM EDT
I look at it a bit differently.
Without hardware, software is useless and without software, hardware is pretty lame.
It is the cooperation of both the hardware and software that makes a well designed system; which is our goal ultimately for software and hardware.
Sign in to Reply
LiketoBike
5/27/2011 2:26 PM EDT
Software and hardware are often trained by management to blame each other :-)
(As in: when the "real" mission to create and ship and SUCCEED [as a team] is derailed into the "fake" mission to grab credit and avoid fault, this blame-game usually results...)
Sign in to Reply
jsandjs
6/9/2011 7:03 PM EDT
This is the second story about the hardware induced noise I read here. Since noise mixed in a signal is so common and most of the time it is random, problems happen here and there somehow give a hint, even not strong and clear.
A specific software in general will hardly produce random troubles here and there.
Sign in to Reply
DutchUncle
6/10/2011 8:29 AM EDT
Fresh real-life example: A few weeks ago I released code that worked exactly as expected on prototype hardware. When used on the next board-spin of prototypes, it didn't work, which of course meant it was a software fault. Yesterday it was proven that not only was the circuit changed, but a wrong part had been used on the dozen proto builds, so it wasn't even the proper changed circuit. Put in the right part, and the software works again. Gee, what a concept.
Sign in to Reply
m.cheah
11/2/2012 8:35 PM EDT
The solution seems to be that SW guys need to learn more HW skills and vice versa.
It's the ones that are at ease straddling disciplines that bridge the divide and I've certainly seen this in action.
Sign in to Reply
The Gen
6/16/2013 4:28 AM EDT
I worked for a company for 12 years designing a range of fibre optic switches. (check out the spelling of fibre!)
My bit was the opto-mechanics. I can sympathise deeply with the sentiments in these posts. Guys, I feel your pain. One of the posts earlier mentioned that it is obvious when mechanics isn't working, but not so software and electronics. In the fibre switch, the mechanics was small and delicate, and small mechanical disturbances (thermal, vibration, electrical noise) could make a difference but it wasn't obvious by observation. When there was a problem the blame was assigned thus:-
1. it's a mechanical problem, the mirror is in the wrong place (it was once many years previously but not on the five plus subsequent problems)
2. Right, the mechanics is OK, must be an electrical problem
3. Right the electrics is OK, must be a software problem
Usually the 'head in charge would decide on what the root cause was and waste time and money trying to fix something that wasn't broken. The three disciplines had enough respect for each other to get on and sort the problem (eventually when the management "help" had evaporated when the solution was looking intractable)
Oh, and I relate to the story above about the saw project that was estimated to take 2 months scheduled to be done in one month and quoted to the customer as 2 weeks.
I don't work for that company any more :-)
Sign in to Reply