One of the many tasks I have had as a hardware and software engineer is troubleshooting the unintended consequences of design decisions made many years earlier. I was working on a VMEBus-based system that ran the pSOS real-time operating system. The system used an FDDI (Fiber Distributed Data Interface) network interface, and because of its unique configuration, we thought it would be a good idea to run some sort of BIT (Built-In Test) on it.
Nine years later, the manufacturer of the CPU board we used decided to replace it with a new model. All of a sudden, our FDDI BIT started failing. This was mighty suspicious to me because I recalled that the vendor actually supplied the BIT, which was built into the circuit card's firmware.
As far as I knew, the BIT Test Executive that our company wrote would just start the test, wait for a period of time, and then read the status register to see if the test had passed. I checked it myself under the "watchful eye" of a VMEBus analyzer. I started the test, saw the executive report a failure, and immediately read the status register by hand, only to see that the test had PASSED!!!
At this point, I smelled a rat. I changed the VMEBus analyzer over to asynchronous mode and ran the BIT, swapping between the new and the old CPU boards. First, I noticed that the new CPU executed a VMEBus read in 60% of the time that the old one took. Then, I noticed that the BIT Test Executive was continuously reading the status register. I finally got my hands on the software and confirmed my suspicions: the BIT Test Executive was running a read loop, reading the status register a fixed number of times and then comparing the status bit with a "pass" condition. Since the new CPU was faster, it completed its "n" reads before the FDDI card had completed its BIT.
I guess you could call this simply a matter of lazy programming: you have to read the register anyway, so why not just do that "n" times? A better way would be to "WAIT(TICKS)", using the built-in timing in pSOS, so that the delay is tied to real time and not to how fast the CPU can read the bus. Even decrementing a CPU register would be more reliable, since that interval is based on CPU clock speed and not interface speed, although it would still have to be recalibrated for a CPU with a different clock. However you do it, DOCUMENT, DOCUMENT, DOCUMENT!!! That way, some poor sot like me won't have to work WEEKS to find it!!!
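To make the difference concrete, here is a minimal sketch in C that contrasts the two approaches. It is not the original BIT Test Executive code, which I can't reproduce here. The status bits (BIT_DONE, BIT_PASS), the function names, and the simulated card are all hypothetical, and the timing uses the POSIX clock_gettime call on a host machine as a stand-in for the pSOS tick timer. It also assumes the card exposes a "done" bit, which the real FDDI card may or may not have had.

```c
#define _POSIX_C_SOURCE 199309L
#include <assert.h>
#include <stdint.h>
#include <time.h>

#define BIT_DONE 0x01u /* hypothetical "self-test finished" bit */
#define BIT_PASS 0x02u /* hypothetical "self-test passed" bit */

typedef enum { BIT_RESULT_PASS, BIT_RESULT_FAIL, BIT_RESULT_TIMEOUT } bit_result_t;

/* Reads the status register. On real hardware this would be a volatile
 * read across the bus; here it is a callback so the logic can be simulated. */
typedef uint32_t (*read_status_fn)(void *ctx);

/* Monotonic wall-clock time in seconds (stand-in for an RTOS tick counter). */
double now_seconds(void)
{
    struct timespec ts;
    clock_gettime(CLOCK_MONOTONIC, &ts);
    return (double)ts.tv_sec + (double)ts.tv_nsec / 1e9;
}

/* Fragile: the "delay" is n bus reads, so a faster CPU waits for less time. */
bit_result_t wait_bit_by_count(read_status_fn read_status, void *ctx,
                               unsigned long n_reads)
{
    uint32_t status = 0;
    assert(read_status != NULL);
    for (unsigned long i = 0; i < n_reads; i++) {
        status = read_status(ctx);
    }
    return (status & BIT_PASS) ? BIT_RESULT_PASS : BIT_RESULT_FAIL;
}

/* Sturdier: poll until the card reports completion or a real-time deadline
 * expires, so the result no longer depends on how fast the CPU reads. */
bit_result_t wait_bit_by_deadline(read_status_fn read_status, void *ctx,
                                  double timeout_s)
{
    assert(read_status != NULL);
    const double deadline = now_seconds() + timeout_s;
    for (;;) {
        uint32_t status = read_status(ctx);
        if (status & BIT_DONE) {
            return (status & BIT_PASS) ? BIT_RESULT_PASS : BIT_RESULT_FAIL;
        }
        if (now_seconds() >= deadline) {
            return BIT_RESULT_TIMEOUT;
        }
    }
}

/* Simulated card: reports "done and passed" once done_at has been reached. */
typedef struct { double done_at; } fake_card_t;

uint32_t fake_read_status(void *ctx)
{
    const fake_card_t *card = ctx;
    return (now_seconds() >= card->done_at) ? (BIT_DONE | BIT_PASS) : 0u;
}
```

With a simulated card that needs 50 ms to finish, wait_bit_by_count called with a small read count reports a failure, just as the faster CPU board did, while wait_bit_by_deadline with a one-second timeout reports a pass. On a real RTOS you would sleep for a few ticks between polls rather than spin, and you would write the chosen timeout down where the next engineer can find it.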