Software engineers are perplexed when a wave of customers starts reporting server outages
Back in the late 90s I was an architect for an Audio Conferencing
server, and we were completing the next major software release. This
release had extensive architectural improvements to our MSC card – the
heart of the conferencing system. Testing went well, so we rolled out
the software upgrade.
Months down the road, our Technical Support center received a call from
one of our larger, Fortune 100 customers reporting a P1 outage. Their
conferencing system had up and restarted for no apparent reason that
morning. We examined the system logs: one moment the system was
running normally, and the next moment we saw a Watchdog timeout and
reset. No other data was in the log.
A few days later another customer called with the same symptom: a
P1 outage. Then a few days later, yet a third customer called – and
now Executive Management voiced concern and wanted daily updates. I
could feel their breath on the back of my neck. What do you tell them
when you have no data?
No one in the development teams had any idea why these systems were
restarting. Logs were useless. The floor was messy from people pulling
out their hair, and we had no inkling of a cause.
I asked the Support Staff when these systems were last upgraded. They
checked and said “a long time ago.” I asked how long, and found that
the first customer had been upgraded 6 months previously.
“Hmmm, that’s odd that a system runs stably for 6 months and then fails,” I thought.
“What about the 2nd restart?” Same thing, 6 months.
“The third customer?” Also 6 months.
“Aha! A pattern is forming.”
Further analysis showed that the systems had been upgraded 198 days
prior to their restarts. We dubbed this the “198-day failure” problem.
I pulled together the rest of the system architects to brainstorm what
would cause a 198-day delayed failure. No one had any thoughts, until
I remembered a conversation that I had with the MSC developer. She had
implemented a 32-bit (4 Giga-count) clock counter word. I remember
asking her whether we would need to worry about wrap-around. Her
response was “2^32 is such a big number, you’ll probably restart the
system for other reasons before you would encounter a wrap-around.”
I was skeptical then, but now I was even more skeptical, so I did the calculations:
2^32 counts x 4 ms/count = 17,179,869.184 secs
17,179,869.184 secs x 1 hour / 3600 secs = 4,772 hours
4,772 hours x 1 day / 24 hours = 198.84 days
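For the skeptics, a few lines of C reproduce the arithmetic. The 4 ms tick and the 32-bit counter are the real numbers from the MSC; the little program is only my back-of-the-envelope check, not MSC code:

    #include <stdio.h>

    int main(void)
    {
        const double counts  = 4294967296.0;   /* 2^32 ticks in a 32-bit counter */
        const double tick_ms = 4.0;            /* one tick every 4 ms            */

        double seconds = counts * tick_ms / 1000.0;   /* 17,179,869.184 s */
        double hours   = seconds / 3600.0;            /* ~4,772 hours     */
        double days    = hours / 24.0;                /* ~198.84 days     */

        printf("32-bit counter wraps after %.2f days\n", days);
        return 0;
    }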
My fears were realized, and so was the probable root cause of the
restarts. The timer on the MSC was wrapping around, causing the firmware
to go unstable and hence the watchdog restart. But we had to prove
this theory, and management would probably say “waiting 198 days isn’t
good for business.”
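We never published the precise MSC defect, but the textbook way a wrapped tick counter destabilizes firmware is a timestamp comparison that is not wrap-safe. The sketch below is a hypothetical illustration of that pattern, with ticks_now() as an invented stand-in for the 4 ms counter; it is not the actual MSC code:

    #include <stdint.h>
    #include <stdbool.h>
    #include <stdio.h>

    /* Hypothetical stand-in for the MSC's free-running 32-bit tick counter (4 ms per tick). */
    static uint32_t fake_ticks;
    static uint32_t ticks_now(void) { return fake_ticks; }

    /* Wrap-unsafe: compares absolute counter values. A deadline whose addition
     * wrapped past zero looks already expired, and a deadline set just below 2^32
     * looks unreachable once the counter itself rolls over. Timers fire early or
     * never, tasks stall, and eventually the watchdog resets the card. */
    static bool deadline_reached_broken(uint32_t deadline)
    {
        return ticks_now() >= deadline;
    }

    /* Wrap-safe: unsigned subtraction is modulo 2^32, so the elapsed count is
     * correct across a wrap as long as the interval is under half the range. */
    static bool deadline_reached_safe(uint32_t start, uint32_t interval)
    {
        return (uint32_t)(ticks_now() - start) >= interval;
    }

    int main(void)
    {
        uint32_t start    = 0xFFFFFF00u;   /* about 1 second before the wrap             */
        uint32_t deadline = start + 2000u; /* 8 seconds out; the addition wraps to 0x6D0 */
        fake_ticks        = start + 100u;  /* only 0.4 seconds have actually elapsed     */

        printf("broken: %d\n", deadline_reached_broken(deadline));   /* 1: fires ~7.6 s early */
        printf("safe:   %d\n", deadline_reached_safe(start, 2000u)); /* 0: correctly waits    */
        return 0;
    }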
So we wrote a special test build of the MSC firmware which preloaded
the counter to 0xFFF24460. We loaded the firmware, restarted our test
system and got on a conference call. Nervously we waited, and true
to form, after one hour the system restarted. The logs showed the same
signature: all okay, followed by a Watchdog timeout and restart.
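In hindsight the preload value explains the one-hour wait: 2^32 - 0xFFF24460 = 0xDBBA0 = 900,000 counts, and 900,000 counts x 4 ms/count = 3,600 seconds; the counter had been preloaded exactly one hour short of the wrap.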
Needless to say, after our Y4G situation, our company was quite prepared for the Y2K scenario.
Glenn Inn is one of the founding architects for the MeetingPlace
Conferencing System. He was a principal designer of the hardware and
firmware components for MeetingPlace’s Audio Conferencing engine. He now
works as a visionary for the Office of the CTO, Voice Technology Group
at Cisco Systems. He holds a commercial pilot airplane and private pilot