An analog engineer and a digital engineer join forces, use their respective skills, and pull a few bunnies out of a hat to troubleshoot a system with which they are completely unfamiliar.
My colleague continued writing his code and progressed to a walking-zeros test. Strange things began to happen. On several known-good memory modules, two SRAMs with their data buses tied together consistently failed in the same way: when 7F was written, FF was read back. Only one pair of SRAMs failed. The other SRAM pair always worked properly.
Had I connected a wire wrong on the fixture? We put a scope on the fixture and verified that when he wrote 7F, 7F is what came back from the DUT SRAM and the fixture. Clearly his PIC microcontroller was reading a definite logic 0 as a logic 1, but only on bit 7 of that data bus. Yet the walking-ones test had passed, and bit 7 was correctly read as a logic 0 during that test.
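To make the difference between the two tests concrete, here is a minimal host-side sketch of the patterns they put on an 8-bit bus. It is not my colleague's PIC code; the function names and the 8-bit width constant are assumptions made for illustration. It lists every pattern in which bit 7 is driven low and counts how many of the other seven bits are high at that moment.

```python
WIDTH = 8  # assumed data bus width


def walking_ones(width=WIDTH):
    """One bit high at a time: 0x01, 0x02, ... 0x80."""
    return [1 << bit for bit in range(width)]


def walking_zeros(width=WIDTH):
    """One bit low at a time: 0xFE, 0xFD, ... 0x7F."""
    all_ones = (1 << width) - 1
    return [all_ones ^ (1 << bit) for bit in range(width)]


def ones_below_msb(pattern, width=WIDTH):
    """Count the high bits below the most significant bit."""
    lower_mask = (1 << (width - 1)) - 1
    return bin(pattern & lower_mask).count("1")


if __name__ == "__main__":
    tests = (("walking ones", walking_ones()),
             ("walking zeros", walking_zeros()))
    for name, patterns in tests:
        for pattern in patterns:
            if not pattern & 0x80:  # bit 7 is driven low
                print(f"{name}: {pattern:02X} has bit 7 low with "
                      f"{ones_below_msb(pattern)} other bit(s) high")
```

In the walking-ones test, bit 7 is low in seven patterns and each has only one other bit high. In the walking-zeros test, bit 7 is low only in 7F, with all seven other bits high. The two tests therefore check bit 7 under very different bus conditions.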
Since I was not familiar with his PCB layout or the PIC chip, I asked him to send me his KiCAD board layout file. I already knew there were no power/ground planes, but I had not expected to see that some of his ground pin connections snaked in and out in roundabout paths when they should have all been joined together under the PIC chip.
Some of his Vdd connections were not even connected to the Vdd copper, but instead depended on connections within the chip. His decoupling capacitor was an inch away, adding two inches of trace inductance. I smelled analog problems here, possibly due to the power routing. One way to find out whether a suspect actually is the cause of a problem is to eliminate it. I used an approach that had been successful before, which was to add power planes and more decoupling. Here is a photo of the end result, done by one of our highly skilled production soldering experts:
Two squares of single-sided copper-clad form the mini power planes. Decoupling 0805 chip capacitors standing on their ends are just the right size to AC-couple the planes together. (Somehow this sounds like an oxymoron.) The PIC can no longer complain about poor power etch routing: all its power and ground pins are now tied together.
Unfortunately this did not help. But it did eliminate the power suspect. I still smelled an analog problem.
This was further confirmed when we ran some tests to see if any other byte patterns caused bit 7 to falsely read a one when it was really a zero. It turned out there were many patterns that did this. If as few as three lower-order bits were ones, the PIC would read bit 7 as a one when it was really a zero. It didn't seem to matter which lower-order bits; all it took was three or more set to one. With enough of them HI, they seemed to bleed into bit 7. Was it analog voltage summation?
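The observed behavior can be written down as a simple model and swept across all 256 byte values. The sketch below is only that, a model of the symptom as described above and not a simulation of the PIC input circuitry. The function names are made up for illustration, and the exact threshold of three is an assumption taken from what we saw on the bench.

```python
THRESHOLD = 3  # assumed: ones among bits 0-6 needed to corrupt bit 7


def popcount_low7(value):
    """Number of ones among bits 0 through 6."""
    return bin(value & 0x7F).count("1")


def modeled_read(written, threshold=THRESHOLD):
    """Return the byte the PIC would report under this symptom model."""
    bit7_is_low = not (written & 0x80)
    if bit7_is_low and popcount_low7(written) >= threshold:
        return written | 0x80  # bit 7 falsely reads as a one
    return written


def failing_patterns(threshold=THRESHOLD):
    """All written bytes whose modeled readback differs."""
    return [w for w in range(256) if modeled_read(w, threshold) != w]


if __name__ == "__main__":
    bad = failing_patterns()
    print(f"{len(bad)} of 256 patterns misread under this model")
    for written in (0x01, 0x40, 0x03, 0x07, 0x7F, 0xFE):
        print(f"wrote {written:02X}, model reads {modeled_read(written):02X}")
```

Under this model every walking-ones pattern reads back correctly, and 7F is the only walking-zeros pattern that fails, which matches what the two tests reported.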
Then it hit me. My colleague's PIC was running at 3.3V. My memory module DUT was powered at 5V. My colleague had previously assured me that his PIC inputs were 5V tolerant -- the data sheet said so. I took a closer look at the data sheet. On the first page it did say "5.5V Tolerant Inputs (digital-only pins)." So if the inputs are configured as digital, they should be 5V tolerant, right?
Some 146 pages into the data sheet was a bit (no pun intended) more detail: Any inputs that could be configured as either analog or digital are NOT 5V tolerant. They have clamp diodes to 3.3V Vdd. All eight bits of the problem data bus and one bit of the other data bus went to such inputs. Yes, it was an analog problem -- the 5V ones were overdriving the inputs and adding voltage-wise. I invented a couple of new cuss words.
This also explained why the one flash we had overwritten would no longer boot: all the firmware images we had copied previously had been read through the same inputs and were garbage. I had to heat up the soldering iron again, hack into the test fixture, and carefully cut ribbon cables to add a couple of 74LVC245 bus transceivers with 5V tolerant inputs. My knowledge of PIC microcontrollers and my expletive vocabulary both improved considerably.
But it solved the problem and we could now identify bad SRAM devices and re-write the bad flash. The "RAM is BAD" message turned into "RAM is OK" after a flash re-write. Possibly the flash had logged the previous SRAM failures.
Success was achieved by a pair of engineers, one digital and one analog, each with his own skill set, working together to solve the problem.