Software engineers are perplexed when a wave of customers starts reporting server outages
Back in the late 90s I was an architect for an Audio Conferencing
server, and we were completing the next major software release. This
release had extensive architectural improvements on our MSC card – the
heart of the conferencing system. Testing went well, so we rolled out
the software upgrade.
Months down the road, our Technical Support center received a call from
one of our larger Fortune 100 customers reporting a P1 outage. Their
conferencing system had restarted for no apparent reason that
morning. We examined the system logs: one moment the system was
running normally, and the next moment we saw a Watchdog timeout and
reset. No other data was in the log.
A few days later another customer called with the same symptom: a
P1 outage. Then a few days later, yet a third customer called, and
now Executive Management voiced concern and wanted daily updates. I
could feel their breath on the back of my neck. What do you tell them
when you have no data?
No one on the development teams had any idea why these systems were
restarting; the logs were useless. The floor was messy from people
pulling out their hair with no inkling of the cause.
I asked the Support Staff when these systems had last been upgraded.
They checked and said “a long time ago.” I asked how long, and found
that the first customer had been upgraded 6 months previously.
“Hmmm, that’s odd that a system runs stably for 6 months then fails”, I thought.
“What about the second restart?” Same thing: 6 months.
“The third customer?” Also 6 months.
“Aha! A pattern is forming.”
Further analysis showed that the systems had been upgraded 198 days
prior to their restarts. We dubbed this the “198-day failure” problem.
I pulled together the rest of the system architects to brainstorm what
would cause a 198-day delayed failure. No one had any thoughts, until
I remembered a conversation that I had with the MSC developer. She had
implemented a 32-bit (4 Giga-count) clock counter word. I remember
asking her whether we would need to worry about wrap-around. Her
response was “2^32 is such a big number, you’ll probably restart the
system for other reasons before you would encounter a wrap-around.”
I was skeptical then, but now I was even more skeptical, so I did the calculations:
2^32 counts × 4 ms/count = 17,179,869.184 secs
17,179,869.184 secs × 1 hour / 3,600 secs = 4,772 hours
4,772 hours × 1 day / 24 hours = 198.84 days
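For readers who want to check the arithmetic, here is a minimal sketch in C; the 4 ms tick period and the 32-bit counter width come from the story above, while the variable names are my own:

#include <stdio.h>

int main(void)
{
    /* 32-bit tick counter, one tick every 4 ms (per the article) */
    const double tick_seconds = 0.004;
    const double counts = 4294967296.0;       /* 2^32 */

    double seconds = counts * tick_seconds;   /* 17,179,869.184 s */
    double hours   = seconds / 3600.0;        /* ~4,772 hours     */
    double days    = hours / 24.0;            /* ~198.84 days     */

    printf("wraparound after %.3f s = %.1f h = %.2f days\n",
           seconds, hours, days);
    return 0;
}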
My fears were realized, and so was the probable root cause of the
restarts: the timer on the MSC was wrapping around, which destabilized
the firmware and triggered the watchdog restart. But we had to prove
this theory, and management would probably say “waiting 198 days isn’t
good for business.”
So we wrote a special test build of the MSC firmware which preloaded
the counter to 0xFFF24460. We loaded the firmware, restarted our test
system and got on a conference call. Nervously we waited. True to
form, after one hour the system restarted. The logs showed the same
signature: all okay, followed by a Watchdog timeout and restart.
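The article doesn’t say how that preload value was chosen, but it is consistent with a one-hour window: here is a sketch, under that assumption, of how such a value can be derived (the function name is mine; the 4 ms tick comes from the story):

#include <stdio.h>
#include <stdint.h>
#include <inttypes.h>

/* Pick how long until wraparound, convert to ticks, and subtract from 2^32. */
static uint32_t preload_for_seconds(uint32_t seconds_to_wrap, uint32_t tick_ms)
{
    uint32_t ticks_to_wrap = (seconds_to_wrap * 1000u) / tick_ms;  /* 3600 s / 4 ms = 900,000 ticks */
    return 0u - ticks_to_wrap;  /* unsigned math wraps mod 2^32: 2^32 - 900,000 = 0xFFF24460 */
}

int main(void)
{
    printf("preload = 0x%08" PRIX32 "\n", preload_for_seconds(3600, 4));  /* prints 0xFFF24460 */
    return 0;
}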
Needless to say, after our Y4G situation, our company was quite prepared for the Y2K scenario.
Glenn Inn is one of the founding architects for the MeetingPlace
Conferencing System. He was a principal designer of the hardware and
firmware components for MeetingPlace’s Audio Conferencing engine. He now
works as a visionary for the Office of the CTO, Voice Technology Group
at Cisco Systems. He holds a commercial pilot airplane and private pilot
This was a problem back when nobody thought you'd ever need more than 2G of RAM. Many OSes, including Solaris and Windows, were not tested with addresses greater than 2G. The problem was that the address became negative, so when adding pointers, things got a bit wacky. All these OSes are fixed today, I'm sure, as 4G isn't that unreal any more. More than 4G is handled through virtual memory, of course (you actually don't need a 64-bit CPU).
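As a rough illustration of the commenter's point (not taken from any particular OS), an address above the 2 GB mark stored in a signed 32-bit integer reads as negative, so comparisons and offset math on it misbehave:

#include <stdio.h>
#include <stdint.h>

int main(void)
{
    /* Hypothetical example: an address just past 2 GB */
    uint32_t addr      = 0x80000010u;
    int32_t  as_signed = (int32_t)addr;  /* implementation-defined; typically -2147483632 */

    printf("unsigned: %u  signed: %d\n", addr, as_signed);

    /* Code that assumes addresses are positive now takes the wrong branch. */
    if (as_signed < 0)
        printf("address looks 'negative' to signed 32-bit code\n");
    return 0;
}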
I'm reminded of the Therac-25 (?) back in the 80s. One of the flaws mentioned was an overflow in an 8-bit cycle counter for tuning the radiation beam. When the counter rolled over, it just continued to focus finer and finer. It caused several fatal burns before the machine was pulled and the manufacturer ended up in court.
Hi from ZÜRICH, SWITZERLAND.
"WHAT CAN GO WRONG WILL". IF IT WORKS DO NOT FIX IT.
LET US KNEEL ON THE EARTH THIS SUNDAY AND:-
LIFT UP YOUR EYES & HEART,
CROSS YOUR FINGERS,
TO SAINT MURPHY.
AND WHEN ALL FAILS USE AN "IRISH-SCREWDRIVER".
NAMELY A 1 KILO HAMMER
Yours "without wax"
This article brought back lots of old memories, having worked on a system where we had a counter that the documentation said was 64 bits but turned out to be 32 bits, which caused our systems to reboot every 540 days. Needless to say, tracking that one down was a joy.
This reminds me of an article I read a few years ago about a flight of 12 F-22 fighters losing their brains when crossing the International Date Line on a flight between Hawaii and Okinawa. And, back in the early days of the PC era, most PCs shipped with 64K or 128K of RAM. (Yes, that's "K".) The box was capable of holding up to 640K of RAM total, but quite a few software applications crashed if the computer had more than 512K installed due to a roll-over error. Never take anything for granted when dealing with mins and maxes.
It's really mind-boggling to think of what my hourly rate would be if I were paid for the actual time that I spent troubleshooting a problem to get to a fix. The challenge is to avoid going down a rathole.