troubleshooting a finicky EPROM memory module, engineers learn to
thoroughly assess ALL factors of a failure-inducing operating
Of the several weird technical problems I’ve had the pleasure of
solving over my three-decade career (so far), one in particular sticks
in my mind. I started out in the Product Development group of an
aggressive, young telecom equipment manufacturer, whose flagship
product was rapidly gaining prominence in the PABX market. The system
was a dishwasher-sized box with about a dozen or so large circuit
boards slid in and interconnected via a backplane. The main CPU executed
out of EPROM memory (2716 generation), which was arrayed on large
multi-chip modules that sat on the CPU board as well as on an expansion
board in the next slot over. Before I was given the opportunity to actually
design products, I spent a few years cutting my teeth on investigating
both production and field problems of this system, and making design
improvements where warranted.
Around the time that single-rail (+5V) EPROM technology was
supplanting 3-rail (+12V, +5V, -5V), I was given charge of qualifying a
particular manufacturer’s single-rail devices. These were supposedly
equivalent or better speed, lower power, cheaper, and to boot, drop-in
pin compatible. They had the potential to save the company many
thousands of dollars per month, and although there was no looming
obsolescence cloud at the time, there was much eagerness to get these
into the product stream quickly. We programmed up a bunch of devices
that the vendor sampled to us and populated a few memory modules.
Everything looked good in the development chassis in our lab: lower
supply current, clean signal transitions, solid CPU read access. The CPU
booted up on the bench, no problem.
So we stuck our “perfectly working” sample memory module into a real
production unit, set it up for environmental chamber testing and then…
WHAT? It’s not starting up?!! LEDs are flashing a memory checksum error
code? What’s going on? We haven’t even turned up the heat yet!! What
followed was an intensive electrical investigation: supply noise,
signal quality, voltage tolerance, heat/cold sensitivity, slot to slot
EM induction, chip date codes, etc. Nothing we could do within reason
would reproduce the problem in our development chassis; however, as soon
as we put the modules into a production unit, BINGO, memory checksum
error. A week had quickly slipped by with no headway, and I was in
constant contact with the EPROM manufacturer, whose App Eng was
Finally, out of desperation, we hauled a production system off the line
into the lab. Swap the new boards in; still failing! OK, let’s check
some signal timing. Put the memory carrier card out on an extender
board and hook up an analyzer. What!! It’s working now! Is it a timing
problem? Earlier measurements told us no, in fact, timing got better as
the new chips access faster than the three-rail units they replaced.
Are they too fast now and the extender and analyzer load is fixing the
timing? Removed the analyzer; still working OK. Take the card off the
extender and put it back into the card slot; failures are back. Usually, the
exact opposite behavior is seen since the extender degrades signals a
bit, even at the very low bus speed of the system. Is there something
about the combination of backplane and the particular boards in the
This is the worst kind of problem a system can inflict upon its
creators. To rule out backplane differences, we moved all the cards from
the production chassis to the open-chassis test frame/backplane in the
lab, and powered it from the production system power supply. Aha!
Failures stopped. Must be the backplane; it’s the only thing different.
Ran all kinds of tests and probed backplane signals on the test frame;
everything looked good and worked without fault. What is it about that
production backplane, I wondered as I put my notepad down on top of the
test frame. HEY WAIT! Now the test frame is failing? What happened?
Retrace my steps; check my notepad and try… WHAT!! It’s working fine
again! All I did was… put my notepad down on the frame like this and…
Holy $#*^@! It fails when my notepad is on top of the frame! I can
reproduce the failure 100% of the time with nothing more than a
cardboard and paper notepad!
Needless to say, the investigation proceeded to a solution very swiftly
from that moment.
The problem was clearly not caused by an EM field since the paper
notepad had no metal other than the staples in the binding. I tried
metal objects too, but the failure would occur only when the object was a
large sheet of anything opaque. And that was the big clue. It was an OPTICAL
effect! What is it that could react that way? There were no optical
sensors in the system as this is a telephone switch.
Wait a minute… we didn’t put stickers on the windows of the EPROMs
like production units have. Install stickers, and presto! The module
fails everywhere, 100% of the time. In fact, the CPU can’t even start
booting. OK, now that we have reproduced the problem, what is the
analysis? Turns out that when the EPROM die was in the dark, the chip
select inputs exhibited a high leakage current, high enough to
overwhelm the Vol of the unbuffered 4000 series CMOS logic gate outputs
that drove them. 4000 series could barely drive an LSTTL Vil level at
5V Vdd, and the EPROM CE inputs were exhibiting a leakage that
approached that of a standard TTL input.
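For readers who want to see the arithmetic behind that failure, here is a minimal back-of-envelope sketch. It is not the analysis we actually ran. It assumes the CMOS output's pull-down behaves like a fixed resistance (optimistic, since a real MOSFET's current saturates), and it plugs in ballpark figures that are assumptions for illustration only: a driver that sinks about 0.5 mA at 0.4 V, an LSTTL-like input that sources about 0.4 mA when held low, and a standard-TTL-like input that sources about 1.6 mA. The function and variable names are hypothetical.

```python
# Back-of-envelope DC check: can a weak CMOS output hold an input low?
# All numeric values below are illustrative assumptions, not measurements
# from the system described in the article.

V_IL_TARGET = 0.4  # volts; the "healthy" low level quoted in the article


def on_resistance(v_ol, i_ol):
    """Effective pull-down resistance implied by a sink rating (V_OL at I_OL)."""
    return v_ol / i_ol


def low_level(input_current, r_on):
    """Voltage at the input pin when the pull-down must sink input_current."""
    return input_current * r_on


def check_loads(loads, r_on, v_il=V_IL_TARGET):
    """Return {name: (low-level volts, meets v_il?)} for each load."""
    results = {}
    for name, current in loads.items():
        volts = low_level(current, r_on)
        results[name] = (volts, volts <= v_il)
    return results


# Assumed driver: sinks about 0.5 mA at 0.4 V with a 5 V supply.
R_ON = on_resistance(0.4, 0.5e-3)

# Assumed currents sourced by the driven pin while it is held low.
LOADS = {
    "lsttl_like": 0.4e-3,         # about what the driver can just manage
    "standard_ttl_like": 1.6e-3,  # stand-in for the leaky CE input in the dark
}

RESULTS = check_loads(LOADS, R_ON)

if __name__ == "__main__":
    for name, (volts, ok) in RESULTS.items():
        print(f"{name}: {volts:.2f} V -> {'OK' if ok else 'too high'}")
```

With these assumed numbers, the LSTTL-like load settles at roughly 0.32 V, just inside the 0.4 V target, while the standard-TTL-like load lands near 1.28 V, far too high to register as a valid low. That is the shape of the failure: the gate was never strong enough to fight a TTL-class input current.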
When exposed to light, the leakage diminished and allowed the CE inputs
to be driven to a healthy Vil level (<0.4V). The lab frame was all
open and exposed to the strong lighting that prevailed, allowing the
memory system to operate correctly. In the system frame, the metal
enclosure blocked most of the ambient lighting and hence provoked the
problem. The system could still sometimes boot far enough to detect
and report memory errors only because, by chance, the startup
code resided in a memory device that was close to the card edge. That device saw
the most light leakage through the space between cards and worked well
enough to allow the CPU to run diagnostics. We basically had to
restrict the use of this particular manufacturer’s memory chips from
Not long after that, the company launched an effort to develop the
next generation CPU system. I had the honour of designing the CPU and
memory system for it, using the 2764 generation of EPROM and a
combination of LSTTL and HC CMOS. I never ran into this exact problem
again, but it certainly opened my mind up to unexpected possibilities
when facing a perplexing problem.
Rick Hille started out in his early teens building guitar amplifiers and effects and learning to service hi-fi equipment. He has since honed 30+ years of technology industry experience in various roles in telecom equipment design, video desktop and surveillance systems, and network server appliances. He is a graduate of Ryerson Polytechnical Institute, and continues to serve the technology industry as a Hardware Designer.