This paper gives an overview of the processor reorder buffer timeout and a methodology for debugging the system issues behind it. The debug methods and tools suggested here should help reduce the time spent debugging these issues. The process is to gather information about the failure until the cause is identified, and then to put preventive steps in place to eliminate the failure.
The processor reorder buffer (ROB) timeout is not new, yet debug engineers often spend a lot of time on system issues that surface as a processor ROB timeout. The purpose of this paper is to give context and guidance to help hardware and software engineers troubleshoot these issues.
Typically, processors indicate a ROB timeout by asserting the IERR# signal. However, an IERR# assertion does not indicate a ROB timeout only. It means that the processor has experienced an internal error, which may also result from issues such as an error condition in the cache unit or on the internal bus.
For processors that support the Intel Quick Path Interconnect interface, there are no longer IERR# or MCERR# signals from the processor. Instead, they have been replaced by the CATERR# signal pin, which indicates that the processor has experienced a catastrophic error condition.
If the Machine Check capability of the processor is enabled, this event can also be recorded in the Machine Check Status register. The processor ROB timeout is only one of the Machine Check events that can be recorded. This paper focuses only on the processor ROB timeout error condition and provides guidance on debugging this Machine Check event.
Processor ROB timeout
First, let’s examine the meaning of a processor ROB timeout. Figure 1 shows an example of the P6 processor micro-architecture with the Advanced Transfer Cache enhancement.
Fig 1: Intel P6 processor micro-architecture with Advanced Transfer Cache enhancement
As the above figure shows, the processor consists of several blocks:
Bus unit, which interacts with the system bus, known as the Front Side Bus on earlier processors and Intel Quick Path Interconnect on more recent processors;
Second Level Cache unit, which interacts with the Fetch/Decode unit and the First Level Cache unit;
First Level Cache unit, which interacts with the Execution Out-of-Order Core unit;
Execution Out-of-Order Core unit, which handles out-of-order execution;
Retirement unit, which is responsible for retiring processor instructions in order;
Branch Prediction unit, which provides branch prediction hints to the processor.
Although the processor supports out-of-order execution, the Retirement unit retires instructions in program order. In-order retirement is required to ensure correct program execution. The ROB timer is reset each time a micro-instruction retires. During normal operation, the processor retires instructions before the ROB timer times out.
When the ROB timer expires, it usually indicates a problem in the system hardware, the software, or both. This document discusses some examples of ROB timeout events.
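The interaction between in-order retirement and the ROB timer can be illustrated with a small Python model. This is only a sketch under simplifying assumptions: one retirement per cycle, a timer counted in cycles, and a hypothetical function name and interface. The real timer width and clocking are processor specific and are not modeled here.

```python
from collections import deque

def run_until_timeout(completion_cycle, timeout_cycles):
    """Model in-order retirement guarded by a watchdog timer.

    completion_cycle lists, in program order, the cycle at which each
    instruction finishes executing (None means it never finishes).
    Returns (retired_count, cycle_of_timeout), where cycle_of_timeout is
    None if every instruction retired.
    """
    rob = deque(completion_cycle)
    timer = 0      # cycles since the last retirement
    retired = 0
    cycle = 0
    while rob:
        head = rob[0]  # only the oldest instruction may retire
        if head is not None and head <= cycle:
            rob.popleft()
            retired += 1
            timer = 0  # the timer is reset on every retirement
        else:
            timer += 1
            if timer >= timeout_cycles:
                return retired, cycle  # watchdog expired: ROB timeout
        cycle += 1
    return retired, None
```

In this model, run_until_timeout([3, 1, 2], 10) retires all three instructions even though the younger ones finish first, while run_until_timeout([1, None, 2], 5) returns (1, 6): the second instruction never completes, so the third cannot retire either and the timer expires.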
Machine Check status
As described earlier, the ROB timeout is a type of Machine Check event, so it is recommended that Machine-Check Architecture events be enabled in the system to capture information related to the event. (See the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3A: System Programming Guide, Part 1, for details on Machine-Check Architecture initialization.)
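Once Machine Check reporting is enabled, the status registers can be read back for inspection. The Python sketch below shows one possible way to locate and read them. The MSR numbers follow the architectural layout described in the Software Developer's Manual (IA32_MCG_CAP at 0x179, bank registers starting at 0x400 in groups of four). The use of the Linux msr driver (/dev/cpu/N/msr, which requires root access and the msr module) and all function names are assumptions made for illustration. Model-specific layouts can differ, so the MSR tables for the specific processor should be checked.

```python
import os
import struct

IA32_MCG_CAP = 0x179   # bits [7:0] give the number of reporting banks
IA32_MC0_CTL = 0x400   # architectural layout: four MSRs per bank

def bank_count(mcg_cap):
    """Extract the bank count from a raw IA32_MCG_CAP value."""
    return mcg_cap & 0xFF

def status_msr(bank):
    """MSR address of IA32_MCi_STATUS (order per bank: CTL, STATUS, ADDR, MISC)."""
    return IA32_MC0_CTL + 4 * bank + 1

def read_msr(cpu, msr):
    """Read one 64-bit MSR through the Linux msr driver (assumed available)."""
    fd = os.open(f"/dev/cpu/{cpu}/msr", os.O_RDONLY)
    try:
        return struct.unpack("<Q", os.pread(fd, 8, msr))[0]
    finally:
        os.close(fd)

def read_all_status(cpu=0):
    """Return {bank: raw IA32_MCi_STATUS value} for one logical processor."""
    banks = bank_count(read_msr(cpu, IA32_MCG_CAP))
    return {b: read_msr(cpu, status_msr(b)) for b in range(banks)}
```

The address helpers are kept separate from the register access so that they can also be used with values captured by other tools, such as a BIOS error log or a hardware debugger.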
A processor ROB timeout is reported in bits [15:0] of the MCi_STATUS register (the MCACOD field) as an internal timer error condition, with MCACOD == 0x400. Bit 38 of MC0_STATUS is also set to report a BINIT# (Bus Init) timeout condition. This processor signature is referenced in Tables E-1 and E-3 of the Intel 64 and IA-32 Architectures Software Developer's Manual, Volume 3B: System Programming Guide, Part 2.
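The signature described above can be checked mechanically against a captured status value. The following Python sketch decodes the two fields named in the text. The function and constant names are hypothetical, and the VAL flag in bit 63 comes from the general MCi_STATUS layout in the Software Developer's Manual rather than from this paper. Bit 38 is model specific, so its meaning should be confirmed for the processor under debug.

```python
MCACOD_MASK = 0xFFFF            # MCACOD field, bits [15:0]
MCACOD_INTERNAL_TIMER = 0x0400  # internal timer error
TIMEOUT_BINIT_BIT = 38          # model specific: BINIT# timeout condition
VAL_BIT = 63                    # register contents are valid (SDM layout)

def decode_rob_timeout(status):
    """Decode the ROB timeout related fields of a raw MCi_STATUS value."""
    mcacod = status & MCACOD_MASK
    return {
        "valid": bool((status >> VAL_BIT) & 1),
        "mcacod": mcacod,
        "internal_timer": mcacod == MCACOD_INTERNAL_TIMER,
        "timeout_binit": bool((status >> TIMEOUT_BINIT_BIT) & 1),
    }

def looks_like_rob_timeout(status):
    """True when a valid entry carries both indications described in the text."""
    d = decode_rob_timeout(status)
    return d["valid"] and d["internal_timer"] and d["timeout_binit"]
```

For example, a value with bit 63, bit 38, and MCACOD 0x400 set is flagged, while the same value without bit 63 is not, because the register contents are not marked valid.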
Causes and examples
When the next instruction to be retired is a read operation and that read does not complete before the ROB timer expires, a ROB timeout is reported. This means that a memory read, an I/O read, a memory-mapped I/O read, or a configuration read can cause this error.
When a thread issues a read operation, it may be able to continue executing other instructions that do not depend on the result of the read. At some point, the thread needs the read result and stalls waiting for the completion. If that completion never arrives, the system is headed for a ROB timeout event.
Several conditions can prevent a read from completing before the ROB timer expires. Some of the more interesting ones include a device that never responds to a device status read, a device that responds only partially, and a device that returns an error result to other parts of the system but never actually completes the read.
Typically, writes are posted and should not, by themselves, cause a ROB timeout. However, an issue downstream may consume all the transaction resources, causing the functional unit to push back on the system bus so that the processing unit sees no forward progress. The ROB timer then eventually expires, because the pending write instruction cannot be retired within the ROB timeout interval.
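The write case can be illustrated with a toy Python model of a bounded posted-write queue. It is a sketch only: the queue depth, the one-write-per-cycle rate, the drains callback, and the function name are all assumptions chosen to show the back-pressure effect, not a description of any particular chipset.

```python
def simulate_posted_writes(num_writes, queue_depth, drains, timeout_cycles):
    """Model writes posted into a bounded queue that a downstream agent drains.

    drains(cycle) returns True when the downstream agent accepts one entry
    in that cycle. A write retires as soon as it is posted. Returns
    ("ok", posted) or ("rob_timeout", posted), where posted is the number
    of writes accepted before the run ended.
    """
    queued = 0    # entries currently held in the posted-write queue
    posted = 0    # writes accepted so far (and therefore retired)
    stalled = 0   # consecutive cycles in which nothing retired
    cycle = 0
    while posted < num_writes:
        if queued > 0 and drains(cycle):
            queued -= 1
        if queued < queue_depth:
            queued += 1
            posted += 1
            stalled = 0
        else:
            # queue full: back-pressure, the pending write cannot retire
            stalled += 1
            if stalled >= timeout_cycles:
                return "rob_timeout", posted
        cycle += 1
    return "ok", posted
```

With a downstream agent that always drains, all writes are posted. If the agent stops draining, for example with drains=lambda c: c < 10, the queue absorbs a few more writes and then fills, and the next write stalls until the timeout is reached. This matches the observation that the failing instruction is often not the one that caused the downstream problem.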