In the 1980s I worked for a computer company that manufactured multi-user time-shared computers. In those days, a system consisted of several 10"x12" circuit boards filled with hundreds of TTL and CMOS chips.
I helped design and build a laptop-based system to keep track of circuit boards and parts in the stockroom. (There were thousands of different boards and parts in inventory.) The system used a laptop computer with a bar code scanner and software we developed for the laptop. An uploader program was also developed for a mainframe computer that would receive the data from the laptop for processing. To take inventory, everything in the stockroom would be scanned with the bar code scanner plugged into the laptop. The data would then be sent from the laptop to the mainframe using the uploader program.
The development process went well. We did several tests bar-coding data into the laptop and uploading it to the mainframe. We also did several dry runs where we bar-coded and uploaded most of the inventory -- a task that took several hours for each dry run.
Finally, the big day came to take inventory. Things started out well. We scanned the entire inventory into the laptop, connected to the mainframe, and started the uploader program. After about 45 minutes, the uploader program froze up. At first I thought this was a glitch, so I restarted the uploader program. Again, the program froze up after about 45 minutes. We had uploaded large amounts of data to the mainframe many times during testing with no problem, so why would the program freeze up now and why after about 45 minutes? At this point, several megabytes and several million packets had been sent to the mainframe.
After more investigation, I found that the uploader stalled on the exact same data packet each time. What was so special about this particular packet? Was there something about the data in this packet that caused the mainframe to stall? More investigation showed that the data did not matter -- the system always froze on the same data packet no matter what the data stream was. What was so special about this particular packet in the data stream?
After much more investigation, the problem turned out to be the I/O processor (IOP) in the mainframe. As data packets are sent to the mainframe, the IOP stores the data in its own buffer until the mainframe can find a buffer in main memory and empty the IOP buffer. Each time the upload process began, the mainframe memory was empty (no other processes were running during the upload), so the operating system (OS) was able to find a memory buffer and empty the IOP buffer before the next packet arrived from the laptop. After about 45 minutes, all the free memory in the mainframe was gone. From then on the OS, a virtual memory system, had to find a buffer to flush to disk to make room for each new packet from the IOP, which took more time.
It turns out that the IOP buffer was the same size as the data packets, so before the IOP could accept a new packet from the laptop, the previous packet in the IOP memory had to be moved into system memory. (Remember, it was the 1980s, when memory was more expensive and less plentiful.) Because of a bug in the IOP firmware, the IOP did not send a "not ready for data" signal to the laptop, so the next packet from the laptop overwrote the packet already in IOP memory. This caused the IOP to lock up (another bug), stopping the entire upload process. Each time the upload process was restarted, the OS would empty memory, giving the uploader the same amount of free memory, so the process froze on the same packet every time.
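The failure mode can be illustrated with a small simulation. This is only a sketch under made-up assumptions, not a model of the real IOP: the function names, the count of 1000 free buffers, and the timing values are all hypothetical. It models a one-packet buffer with no "not ready" signal, and a host that drains the buffer quickly until free memory runs out and slowly afterward.

```python
def first_overrun(num_packets, packet_interval, drain_time):
    """Return the index of the first packet that arrives while the
    one-packet buffer is still occupied, or None if that never happens.

    drain_time(i) is the time the host needs to empty the buffer
    after packet i arrives (hypothetical time units).
    """
    buffer_free_at = 0.0
    for i in range(num_packets):
        arrival = i * packet_interval
        if arrival < buffer_free_at:
            # No flow control: this packet overwrites the previous one.
            return i
        buffer_free_at = arrival + drain_time(i)
    return None


FREE_BUFFERS = 1000  # hypothetical: packets that fit in free main memory


def drain_time(i):
    # Fast while free memory remains; slow once the OS must first
    # flush a buffer to disk. Both values are illustrative only.
    return 0.5 if i < FREE_BUFFERS else 5.0


# The same starting conditions give the same failing packet on every run.
print(first_overrun(5000, 1.0, drain_time))
print(first_overrun(5000, 1.0, drain_time))
```

Because the amount of free memory at the start is identical on every run, the simulated overrun lands on the same packet index each time, regardless of what the packets contain.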
When I talked to the engineers who designed the system and the IOP, they mentioned that they had not expected large amounts of data to be sent to the system at high speed over a sustained period of time. These were terminal-based systems, designed for people to enter data from a keyboard, not to receive uploads from a laptop. The engineers simply had not expected anyone to do this. It's interesting how the intended use of a system can drive design decisions that cause it to fail when it's used in unexpected ways, or can keep you from thinking about failure modes that could occur.
The company decided that fixing the IOP hardware and upgrading all the field systems would be too expensive, since customers didn't normally upload data from a laptop. To fix the problem, the laptop put a small time delay between data packets, providing enough time for the mainframe to empty the IOP buffer before the next packet arrived. This was not the best solution, but it worked. This is an example of the software person writing software to get around a hardware problem. This occurs more often than you might think. Anyone remember the old serial port fix for the 8086 processors in the early PCs? You had to put in small time delays between IN and OUT instructions in the comm port code to get it to work with some UARTs.
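The inter-packet delay workaround might look something like the sketch below. It is not the original laptop code: `send_paced`, its parameters, and the delay value are hypothetical, and the `send` callable stands in for whatever routine actually wrote a packet to the link.

```python
import time


def send_paced(packets, send, delay_s, sleep=time.sleep):
    """Send each packet, pausing delay_s seconds between consecutive
    packets so the receiver can empty its one-packet buffer.

    `send` is any callable that transmits one packet (hypothetical).
    `sleep` is injectable so the pacing can be tested without waiting.
    Returns the number of packets sent.
    """
    sent = 0
    for index, packet in enumerate(packets):
        if index > 0:
            sleep(delay_s)  # wait before every packet except the first
        send(packet)
        sent += 1
    return sent


# Usage: collect "sent" packets in a list instead of a real serial link.
received = []
count = send_paced([b"pkt1", b"pkt2", b"pkt3"], received.append, delay_s=0.01)
print(count, received)
```

A fixed delay slows every upload, including the early part where the receiver could keep up, which is the price of working around missing flow control from the sending side.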
About author Frank Rose: "I have over 30 years of experience as a software developer (both application software and embedded software) and an analog/digital circuit designer. Today, I work as a digital designer developing FPGA programs using VHDL and designing analog/digital circuits for L-3 Power Paragon in Anaheim, CA."