An analog engineer and a digital engineer join forces, use their respective skills, and pull a few bunnies out of a hat to troubleshoot a system with which they are completely unfamiliar. Our sales department had just accepted a new challenge on behalf of Engineering. They promised a customer that yes, of course, we can repair a telecom product that we have never seen before and for which we have no systems, no test fixtures, and no schematics. (The OEM no longer supported this product.)
Engineering was once again expected to shake our rattles, do our magic voodoo dance, and pull bunnies out of hats. About fifteen of these backplane-pluggable boards showed up in my office for initial evaluation and perusal of their inner workings. They had a proprietary SIMM (single in-line memory module), which on several units turned out to be bad. Temporarily substituting memory modules taken from other cards, ones with obvious smoke damage, brought these units back to life when powered while lying flat on the bench. (Remember, there was no test chassis available.) They would then boot and talk to us over their RS232 ports.
These modules were populated with four SRAMs and four flash memories. Each flash and SRAM pair shared an 8-bit-wide data bus, and each pair of SRAMs was enabled together with the same chip select. I proposed to the boss that we build a small test fixture that would take the DUT memory module, run SRAM tests, and if necessary reprogram the flash.
A digital/software colleague three cubes away was assigned to work with me on this project. He had previously designed and laid out a PCB that used a surface-mount PIC microcontroller as a universal I/O for our current and future test fixtures. It turned out that it had just enough I/O lines to handle the address and data buses on the DUT memory module, with two spares, as long as I tied the four separate DUT data buses together into two pairs on the fixture. So we decided to use it.
I ordered the necessary SIMM connector and a plated-through-hole protoboard, along with some ribbon cable and IDC header sockets to connect to the PIC board. It was somewhat annoying that the 72-pin SIMM connector was spaced at a .05-inch pitch, so the protoboard also had to be this pitch. Its tiny .025-inch-diameter holes did not accept .025-inch-square pins, so wire-wrap was impossible. (Now I know where that old adage, "Can't fit a square peg into a round hole," came from.)
I had to solder ribbon cable directly to the protoboard and string short 30AWG wires to the SIMM connector. As long as the stranded ribbon wires were not overly tinned (to keep the strands together), they actually fit into the protoboard holes.
Endeavor brings back cuss words long since forgotten
Another annoyance was that the SIMM connector had plastic retaining tabs that quickly wore out from repeated insertions of memory modules. The maker had designed them for maybe a single SIMM replacement over the lifetime of the product. We wanted to plug DUTs in and out constantly.
Fortunately I had used socket pin strips in the protoboard for the SIMM connector in anticipation of eventually needing to replace it easily. I subsequently found a connector with metal retaining tabs. This particular feature does not show up in vendors’ online part descriptions. I had to look at the mechanical drawing of each of many to find "W/ Metal Latch."
The first test of the fixture went well. My colleague coded a walking-ones SRAM test that immediately identified bad SRAM chips on a couple of the DUT (Device Under Test) boards. We replaced them and now they booted, but with the disconcerting message "RAM is BAD." Due to availability we had used 12 nsec SRAMs in place of the original 20 nsec SRAMs, so speed was probably not the issue. Hmmm, maybe we needed to improve the test algorithm.
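For readers who have not met it, a walking-ones test can be sketched roughly as follows. This is a minimal illustration in Python, not the PIC firmware my colleague wrote. The `FakeSram` class, its `stuck_low_mask` fault, and the function names are hypothetical stand-ins for the fixture's real bus accesses.

```python
class FakeSram:
    """Hypothetical stand-in for one 8-bit SRAM on the fixture."""

    def __init__(self, size, stuck_low_mask=0x00):
        self.cells = [0] * size
        self.stuck_low_mask = stuck_low_mask  # bits that always read back as 0

    def write(self, addr, value):
        self.cells[addr] = value & 0xFF

    def read(self, addr):
        return self.cells[addr] & ~self.stuck_low_mask & 0xFF


def walking_ones(width=8):
    """Yield 0x01, 0x02, ... 0x80: one bit high, all others low."""
    for bit in range(width):
        yield 1 << bit


def check_data_bus(sram, addr=0):
    """Write each walking-ones pattern; return (written, read) mismatches."""
    failures = []
    for pattern in walking_ones():
        sram.write(addr, pattern)
        got = sram.read(addr)
        if got != pattern:
            failures.append((pattern, got))
    return failures


good = FakeSram(size=16)
bad = FakeSram(size=16, stuck_low_mask=0x08)  # simulate data bit 3 stuck low

print(check_data_bus(good))  # no mismatches expected
print(check_data_bus(bad))   # the 0x08 pattern should read back as 0x00
```

A test like this exercises each data line high on its own, so it catches stuck or open bits, but it never drives several lines high at once.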
Then we got brave and copied about five different versions of firmware from the flash of the good memory modules and tried to re-write the new firmware into a module, which semi-booted at first but complained about a "missing application loader." After the firmware re-load it would no longer even talk to us over its RS232 port. Somehow a 'known good' firmware load messed it up. My colleague verified that the firmware in the good and bad modules was identical. So why did one boot and not the other? Speed?
My colleague continued writing his code and progressed to a walking-zeros test. Strange things began to happen. On several known good memory modules two SRAMs with their data buses tied together consistently failed in the same way: When 7F was written, FF got read back. It only failed on one pair of SRAMs. The other SRAM pair always worked properly.
Had I connected a wire wrong on the fixture? We put a scope on the fixture and verified that yes, when he wrote 7F that is what came back from the DUT SRAM and the fixture. Clearly his PIC microcontroller was reading a definite logic 0 as a logic 1, but only on bit 7 of that data bus. Yet the walking-ones test had worked and bit 7 was correctly read as a logic 0 during that test.
Since I was not familiar with his PCB layout or the PIC chip, I asked him to send me his KiCad board layout file. I already knew there were no power/ground planes, but I had not expected to see that some of his ground pin connections snaked in and out in roundabout paths when they should have all been joined together under the PIC chip.
Some of his Vdd connections were not even connected to the Vdd copper, but instead depended on connections within the chip. His decoupling capacitor was an inch away, adding two inches of trace inductance. I smelled analog problems here, possibly due to the power routing. One way to find out if a suspect actually is the cause of a problem is to eliminate it. I used an approach that had been successful before, which was to add power planes and more decoupling. Here is a photo of the end result, done by one of our highly skilled production soldering experts:
Two squares of single-sided copper-clad form the mini-power planes. Decoupling 0805 chip capacitors standing on their ends are just the right size to AC-couple the planes together. (Somehow this sounds like an oxymoron.) The PIC cannot complain about poor power etch routing. All its power and ground pins are now tied together.
Unfortunately this did not help. But it did eliminate the power suspect. I still smelled an analog problem.
This was further confirmed when we ran some tests to see if any other byte patterns caused bit 7 to falsely read a one when it was really a zero. Turned out there were many patterns that did this. If as few as three lower-order bits were ones, the PIC would read bit 7 as a one when it was really a zero. It didn't seem to matter which lower-order bits; all it took was three or more set to one. With enough of them HI they seemed to bleed into bit 7. Was it analog voltage summation?
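The observed behavior can be captured in a toy model, which also shows why the walking-ones test passed while the walking-zeros test failed. The threshold of three set bits comes from the bench results above. Everything else, including the function names, is an illustrative assumption and not a description of what actually happens inside the PIC.

```python
def faulty_read(value):
    """Toy model of the fault: bit 7 reads as 1 whenever three or more
    of the lower seven bits are 1 (threshold taken from the bench results)."""
    lower_ones = bin(value & 0x7F).count("1")
    if lower_ones >= 3:
        return value | 0x80
    return value


def walking_ones():
    """0x01, 0x02, ... 0x80: a single bit high."""
    return [1 << bit for bit in range(8)]


def walking_zeros():
    """0xFE, 0xFD, ... 0x7F: a single bit low."""
    return [0xFF ^ (1 << bit) for bit in range(8)]


def mismatches(patterns):
    """Return (written, read) pairs where the modeled read is wrong."""
    return [(p, faulty_read(p)) for p in patterns if faulty_read(p) != p]


print(mismatches(walking_ones()))
print([(hex(w), hex(r)) for w, r in mismatches(walking_zeros())])
```

In this model a walking-ones pattern never has more than one bit high, so bit 7 always reads correctly. Of the walking-zeros patterns, only 7F has bit 7 low, and it has all seven lower bits high, so it is the one pattern that reads back wrong.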
Then it hit me. My colleague's PIC was running at 3.3V. My memory module DUT was powered at 5V. My colleague had previously assured me that his PIC inputs were 5 volt tolerant -- the data sheet said so. I took a closer look at the data sheet. On the first page it did say "5.5V Tolerant Inputs (digital-only pins)." So if the inputs are configured as digital, they should be 5V tolerant, right?
Some 146 pages into the data sheet was a bit (no pun intended) more detail: Any inputs that could be configured as either analog or digital are NOT 5V tolerant. They have clamp diodes to 3.3V Vdd. All eight bits of the problem data bus and one bit of the other data bus went to such inputs. Yes, it was an analog problem -- the 5V ones were overdriving the inputs and adding voltage-wise. I invented a couple of new cuss words.
This explained the problem with the one flash we had overwritten that would no longer boot. All the firmware images we had copied previously were garbage. I had to heat up the soldering iron again, hack into the test fixture, and carefully cut ribbon cables to add a couple of 74LVC245 bus transceivers with 5V tolerant inputs. My knowledge of PIC microcontrollers and my expletive vocabulary both improved considerably.
But it solved the problem and we could now identify bad SRAM devices and re-write the bad flash. The "RAM is BAD" message turned into "RAM is OK" after a flash re-write. Possibly the flash had logged the previous SRAM failures.
Success was achieved by a pair of engineers, one digital and one analog, each with his own skill set, working together to solve the problem.