On-Board Failure Logging (OBFL)
Author: Ashish Nagar, Cypress Semiconductor Corp. (firstname.lastname@example.org)
Boards can fail at customer sites or in the field for many reasons, and reconstructing the exact failure at the failure site remains a challenge for developers. To assist in troubleshooting, all board environment variables and failure messages should be logged and stored at the time of a board failure so that the root cause can be determined later. This article describes a structured approach to adding on-board failure logging (OBFL) capabilities to facilitate reproduction of failures on field-return boards. It highlights the importance of OBFL, categorizes failure log data by defining OBFL records, reviews the organization of failure logs in persistent memory, and outlines the system software support needed for storing and retrieving OBFL data.
A board with OBFL is configured to store failure-related data to non-volatile memory, which can be retrieved and displayed for failure analysis at a later point in time. These failure logs help in the post-mortem of the board.
Implementing an OBFL system feature requires a combination of hardware and software. On the hardware side, it involves a) identifying on-board OBFL resources (for example, temperature sensor, memory, interrupt source, board_ID, etc.) that provide board failure information and b) providing on-board non-volatile memory to retain failure information even in case of board or system failure. OBFL software is needed to configure and store board variables as OBFL records in non-volatile memory during healthy board operation and during any board failure event. The OBFL software also needs the intelligence to analyze multiple error events, records, and historical failure records to deliver a narrowed-down estimate of the cause of failure. Such analysis can save considerable troubleshooting effort, since otherwise a potentially large number of OBFL records would need to be examined manually by the failure analysis engineer.
Figure 1.0 – OBFL enabled System Architecture
Figure 1.0 shows the layered architecture of an OBFL-enabled embedded system. The OBFL layer, which resides between the application layer and the operating system, either accesses hardware directly or uses operating system APIs to communicate with hardware. The APIs provided by the OBFL layer are called by the application layer and perform three major classes of tasks:
OBFL Resource and Configuration: This sub-module provides APIs to obtain the runtime values of OBFL variables from predefined OBFL Resources. Application software calls these APIs in the interrupt handler. They are also called when an OBFL Resource encounters an error condition and needs to record values. The Resource Manager provides an API that application software calls periodically to collect OBFL variables from OBFL resources. This sub-module also provides APIs for configuring OBFL Resources.
OBFL Display: This sub-module provides multiple APIs to retrieve specified OBFL data. These APIs access the non-volatile memory to retrieve OBFL records and present the stored data in a variety of formats. They are linked to the command line utility of the application layer so that, during troubleshooting, the failure analysis engineer can enter commands to understand the sequence of events that happened before the board crash or system failure.
OBFL Record Keeper: This sub-module is responsible for organizing OBFL records, which comprise multiple baseline, event logging, and message logging records. Every entry in these records must be time stamped, and the time stamp serves as the key when storing data.
Defining the onboard failure logging record (data) is crucial for any system as this step decides which parameters will be captured and logged to the non-volatile memory to assist debugging of the board at a later time. All OBFL Records are time stamped as this greatly helps to understand the sequence of events that happened before the Board failure. A detailed, planned and thoughtful definition of the OBFL record helps to reduce the response time to determine the root-cause of a failure. Figure 2 shows how an OBFL record is further divided into three categories:
(1) Baseline Record
(2) Event Logging Record
(3) Message Logging record
Figure 2.0 – OBFL Record in Non-Volatile Memory
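The three record categories and the time-stamp requirement can be modeled in a few lines. The following is a minimal Python sketch for illustration only, not the implementation behind this article: the names `RecordCategory`, `ObflEntry`, and `RecordKeeper` are hypothetical, the store is an in-memory list rather than non-volatile memory, and firmware on a real board would more likely express the same idea in C.

```python
import bisect
from dataclasses import dataclass, field
from enum import Enum


class RecordCategory(Enum):
    """The three OBFL record categories described in the text."""
    BASELINE = 1
    EVENT = 2
    MESSAGE = 3


@dataclass(order=True)
class ObflEntry:
    """One time-stamped entry; only the timestamp takes part in ordering."""
    timestamp: float
    category: RecordCategory = field(compare=False)
    payload: dict = field(compare=False, default_factory=dict)


class RecordKeeper:
    """Hypothetical in-memory stand-in for the OBFL Record Keeper."""

    def __init__(self):
        self._entries = []  # always kept sorted by timestamp

    def add(self, entry):
        # insort keeps the list ordered, so time acts as the key
        bisect.insort(self._entries, entry)

    def by_category(self, category):
        return [e for e in self._entries if e.category is category]

    def all_entries(self):
        return list(self._entries)
```

Keeping entries ordered by time at insertion means later display and analysis steps can read the sequence of events directly, without sorting.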
The OBFL Baseline record is created independently of any board failures. The OBFL system should always have one initial Baseline record and a minimum of one recent baseline record. The initial Baseline is created immediately after the first successful board bring-up and OBFL configuration of the system in the field. The first recent baseline record is created from the initial baseline record. Recent baseline records are created to capture recent values of OBFL resources. Subsequent baselines are created after every successful board reset. OBFL should also make provision for archiving old baselines. Any stable Baseline may be marked as a “Golden” baseline to use as a reference baseline during failure debugging. In the absence of a Golden baseline record, the Initial baseline record can be used as a reference.
The baseline record captures specific hardware and software configuration details and saves them in non-volatile memory. The hardware section of the baseline record includes board configuration details such as chassis number, slot number of the card, serial number, daughter card identification details, and FPGA and ASIC revision numbers. This section should also store the make, serial number, and configuration details of all on-board memory such as SRAM, SDRAM, and DDR. The BIOS version, firmware version, OS details, and application software version should be stored under the software baseline record. This record is very helpful in narrowing down board failures caused by recent hardware or software upgrades.
The third section of the baseline record stores values of board environment variables. For each board environment variable, the record stores the recent ‘N’ values, the maximum actual value, and the minimum permissible value. Environment variables include board power section parameters such as voltage, current, and temperature readings coming from one or more sensors situated on the board.
The value of each board environment variable is periodically collected, stored as a recent value, and compared with the maximum permissible value. If the collected value exceeds the maximum permissible value, an Environment Error Event logging record is updated with the current time stamp. In addition, boards typically have multiple voltage sources and temperature sensors. The temperature should be recorded periodically (e.g., once every 30 minutes), while voltage data can be collected less frequently (e.g., once every 60 minutes).
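The baseline lifecycle described above (initial, recent, archived, and Golden baselines) could be sketched as follows. This is an illustrative Python model under stated assumptions: `Baseline`, `BaselineStore`, and the `max_recent` retention limit are hypothetical names and values that do not come from the article.

```python
from dataclasses import dataclass, field


@dataclass
class Baseline:
    timestamp: float
    values: dict
    golden: bool = False


@dataclass
class BaselineStore:
    """Hypothetical holder for initial, recent, and archived baselines."""
    initial: Baseline
    recent: list = field(default_factory=list)
    archive: list = field(default_factory=list)
    max_recent: int = 2  # assumed retention limit, not from the article

    def __post_init__(self):
        # The first recent baseline is derived from the initial baseline.
        if not self.recent:
            self.recent.append(
                Baseline(self.initial.timestamp, dict(self.initial.values)))

    def on_successful_reset(self, timestamp, values):
        """Create a new recent baseline; archive the oldest beyond the limit."""
        self.recent.append(Baseline(timestamp, dict(values)))
        while len(self.recent) > self.max_recent:
            self.archive.append(self.recent.pop(0))

    def mark_golden(self, baseline):
        baseline.golden = True

    def reference(self):
        """Return the newest Golden baseline, else fall back to the initial one."""
        pool = [self.initial, *self.recent, *self.archive]
        golden = [b for b in pool if b.golden]
        if golden:
            return max(golden, key=lambda b: b.timestamp)
        return self.initial
```

In this sketch old baselines are moved to the archive rather than deleted, so they stay available for historical comparison during failure debugging.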
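The periodic collect, store, and compare step might look like the sketch below. It is a simplified Python model, not target firmware: `EnvVariable` and the dictionary layout of the event entry are hypothetical, the history length of 4 is arbitrary, and the two interval constants simply restate the example periods given above.

```python
from collections import deque

TEMPERATURE_PERIOD_S = 30 * 60  # example interval from the text
VOLTAGE_PERIOD_S = 60 * 60      # example interval from the text


class EnvVariable:
    """Hypothetical tracker for one board environment variable."""

    def __init__(self, name, max_permissible, history_len=4):
        self.name = name
        self.max_permissible = max_permissible
        self.recent = deque(maxlen=history_len)  # the recent 'N' values
        self.max_seen = None                     # maximum actual value

    def sample(self, value, timestamp, event_log):
        """Store a reading; log an environment error event if it is too high."""
        self.recent.append(value)
        if self.max_seen is None or value > self.max_seen:
            self.max_seen = value
        exceeded = value > self.max_permissible
        if exceeded:
            event_log.append({
                "timestamp": timestamp,
                "type": "environment_error",
                "resource": self.name,
                "value": value,
                "limit": self.max_permissible,
            })
        return exceeded
```

A bounded `deque` drops the oldest reading automatically, which mirrors keeping only the most recent ‘N’ values in a fixed-size record.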
Event Logging Records
OBFL Event Logging records store hardware failure event indications such as board crashes due to on-board memory failures, system resets, exceptions, interrupts, or board environment errors. OBFL Event logging is divided into the following categories:
Memory Errors: Typically, SDRAM or DDR errors are categorized into correctable errors and uncorrectable fatal errors. Correctable errors are single-bit errors that additional ECC hardware logic in the memory detects and corrects. Though this type of error is nonfatal, it can impact system performance, so logging this event helps to debug system performance errors. Additionally, logging correctable errors helps in generating warning events about the possibility of fatal errors in the future. Multi-bit memory errors, which are uncorrectable fatal errors, are logged in the event log record with the failing address location, expected data, and other memory details.
Temperature Errors: The multiple temperature sensors attached to a board provide periodic values of the surrounding temperature. A temperature error is logged in the Event record each time the temperature reading exceeds the permissible temperature range defined in the baseline record. Event records should be updated with the Temperature sensor ID, Temp error, and appropriate permissible limits.
Voltage Errors: The power section of the board generates various voltages required for different ASICs. Any component failure in the power section may change the intended voltage. The board should provide features to measure voltage automatically, and voltage readings should be taken at periodic intervals. Any deviation of the actual voltage from the permissible voltage range defined in the baseline record should be reported to the Event logging record.
Bus Errors: The peripheral bus or IO Bus found on any typical board is used to connect the CPU to peripheral devices. The noise on this bus may cause address phase or data phase errors. The master or target devices on a peripheral bus are designed to detect these errors. If a device detects address errors, it can assert a System Error and, in the case of data errors, it can assert a parity error. The internal bus on any SOC may also generate a system error or data error due to noise on the bus. The system error logic connects the system error and parity error to CPU interrupt pins. An event record should be updated with error details and the device id for failure analysis.
Interrupts: An interrupt error is an event originated by external hardware sources, internal peripherals, ASIC interrupts, or software interrupts notifying the CPU of an error. The interrupt handler for all errors should record the interrupt number and source of the interrupt to the OBFL Event record.
Reset: A board reset may occur due to system failure or intentional manual reset. Every board reset should be recorded with a time stamp. ASIC resets asserted to overcome the error should be recorded as well.
OBFL event: OBFL-supported critical operations such as initial and golden baseline creation, OBFL logging disable and enable, and log file deletion are important events and should be time stamped and stored as OBFL events.
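The event categories above, together with the idea of raising a warning from accumulated correctable memory errors, can be illustrated with a small Python sketch. Everything named here is hypothetical: `EventCategory`, `EventLog`, and in particular the `correctable_warn_threshold` policy are assumptions, since the article does not specify when a warning event should be generated.

```python
from enum import Enum


class EventCategory(Enum):
    MEMORY = "memory"
    TEMPERATURE = "temperature"
    VOLTAGE = "voltage"
    BUS = "bus"
    INTERRUPT = "interrupt"
    RESET = "reset"
    OBFL = "obfl"


class EventLog:
    """Hypothetical event logging record with a correctable-error warning."""

    def __init__(self, correctable_warn_threshold=3):
        self.events = []
        self.correctable_count = 0
        self.warn_threshold = correctable_warn_threshold  # assumed policy

    def record(self, timestamp, category, **details):
        """Append one time-stamped event with category-specific details."""
        event = {"timestamp": timestamp, "category": category, **details}
        self.events.append(event)
        return event

    def record_ecc(self, timestamp, address, correctable):
        """Log a memory error; warn once when correctable errors accumulate."""
        self.record(timestamp, EventCategory.MEMORY,
                    address=address, fatal=not correctable)
        if correctable:
            self.correctable_count += 1
            if self.correctable_count == self.warn_threshold:
                self.record(timestamp, EventCategory.MEMORY,
                            warning="correctable error count reached threshold")
```

An interrupt handler would call `record` with `EventCategory.INTERRUPT` plus the interrupt number and source, and a reset path would call it with `EventCategory.RESET`.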
Message Logging Record: This OBFL record provides detailed time stamped message logging of all failures triggered by system software, including system alarms, alert notifications, warning messages, and system error details. Message logging can be divided into several different levels:
Alarm (Level-1): System alarms that need immediate attention should be logged in this category.
Error (Level-2): In case of errors seen by system software, the complete stack trace, processor register dump, and ASIC register dump should be logged as message logging. Memory leak errors should be reported with requested memory details. Any firmware failure should generate a debug message and be logged in this record.
Debug (Level-3): This is debug-mode logging which, when enabled, provides software module trace information for function entry and exit. This can be helpful in narrowing down the failure region. In case of failure, if diagnostic software is triggered, all messages from that function should be stored under the debug help category.
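The three message levels could be represented as shown below. This is a hedged Python sketch: `MessageLevel` and `MessageLog` are hypothetical names, and the rule that Debug messages are dropped unless debug logging is enabled is one reading of "when enabled" in the text above.

```python
from enum import IntEnum


class MessageLevel(IntEnum):
    ALARM = 1  # Level-1: needs immediate attention
    ERROR = 2  # Level-2: stack trace, register dumps, firmware failures
    DEBUG = 3  # Level-3: function entry/exit trace, only when enabled


class MessageLog:
    """Hypothetical message logging record with an optional debug level."""

    def __init__(self, debug_enabled=False):
        self.debug_enabled = debug_enabled
        self.entries = []

    def log(self, level, timestamp, text, **details):
        """Store a message; return False if it was filtered out."""
        if level == MessageLevel.DEBUG and not self.debug_enabled:
            return False
        self.entries.append({"timestamp": timestamp, "level": level,
                             "text": text, "details": details})
        return True

    def at_or_above(self, level):
        """Entries at least as severe as level (a lower number is more severe)."""
        return [e for e in self.entries if e["level"] <= level]
```

Keeping Debug tracing off by default saves non-volatile memory space for the Alarm and Error entries that matter most after a failure.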
OBFL Software Support
This section describes the system OBFL software architecture. OBFL system software support is needed to organize OBFL records in memory and to support easy retrieval of failure logging information. As shown in Figure 3, the OBFL software architecture is broadly divided into the following blocks:
Figure 3.0 – OBFL Software Architecture
OBFL Command Line: Support for a command line interface (CLI) to the Display Manager facilitates retrieval of OBFL records. CLI support also allows for configuring OBFL Resources and managing OBFL baseline, event logging, and message logging records.
OBFL Configuration API: The OBFL software provides APIs to assign a unique OBFL resource ID to each source of each board environment variable. The software provides configuration support so that each resource ID can be configured. The temperature sensors around the board are configured with a permissible temperature range. Each voltage source should be configured with maximum and minimum voltage and current requirements.
OBFL Record and File Management: This block functions as the OBFL record keeper. It provides features such as automatic creation of baseline records and archiving of old baselines. API support is provided for manually creating and deleting baseline files. Support functions for clearing unwanted message log files help to make space for newer logs. This block also provides features such as OBFL Enable, Disable, Display All, and specific event or message logging.
OBFL Resource Manager: This software block provides APIs to obtain the runtime values of OBFL variables from predefined OBFL Resources. Application software calls these APIs in the interrupt handler when an OBFL Resource error condition is encountered.
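A configuration API of this kind might be sketched as follows. The Python model below is illustrative only: `ResourceConfig` and its `register`, `configure`, and `in_range` methods are hypothetical names, not an existing OBFL API, and the limits used in the usage example are arbitrary.

```python
import itertools


class ResourceConfig:
    """Hypothetical registry that assigns unique OBFL resource IDs."""

    def __init__(self):
        self._next_id = itertools.count(1)
        self._resources = {}

    def register(self, kind, description):
        """Assign a unique resource ID to one source of an environment variable."""
        rid = next(self._next_id)
        self._resources[rid] = {"kind": kind, "description": description,
                                "limits": None}
        return rid

    def configure(self, rid, minimum, maximum):
        """Set the permissible range for a previously registered resource."""
        if minimum > maximum:
            raise ValueError("minimum must not exceed maximum")
        self._resources[rid]["limits"] = (minimum, maximum)

    def in_range(self, rid, value):
        """Check a reading against the configured permissible range."""
        limits = self._resources[rid]["limits"]
        if limits is None:
            raise LookupError(f"resource {rid} has not been configured")
        low, high = limits
        return low <= value <= high
```

For example, a temperature sensor and a 3.3 V rail would each be registered once and then configured with their own range:

```python
cfg = ResourceConfig()
temp_id = cfg.register("temperature", "sensor near CPU")
rail_id = cfg.register("voltage", "3.3 V rail")
cfg.configure(temp_id, -5.0, 85.0)      # arbitrary example limits
cfg.configure(rail_id, 3.135, 3.465)    # arbitrary example limits
```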
OBFL Display: This is a very important feature of OBFL which shows board failure OBFL records stored in non-volatile memory.
Using Resource ID/category: On entering an OBFL command, the display shows the latest ‘N’ historical readings in a tabular format, each with its time stamp. It also shows how many times each error was reported by the resource, the magnitude of the error, and the configuration details of the resource. Also shown are the commands that operate on the OBFL resource category, such as OBFL temperature and OBFL voltage. Entering OBFL Display lists all OBFL resources under the specified category.
Example of OBFL Display Feature using Resource Category:
Using Start / end time: This command displays Event logging records and/or message logging records whose time stamps fall within the specified time range. The command format for this data retrieval is OBFL DISPLAY ALL. This displays all time-sequenced information from the Event record and message record database. Entering OBFL event will display all events that happened between the time limits.
Using debug option: This feature provides an intelligent debug report. Upon execution of the command, it finds the latest failures in the Event or Message log to determine the failure category, takes note of the associated time stamps, and locates any warnings or other error or message log categories associated with them. This feature also compares historical and archived data for similar error and message categories, and highlights any discrepancies found by comparing the latest baseline file to the initial baseline and Golden baseline files.
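Because every record is time stamped, retrieval by start and end time reduces to a range lookup over time-ordered records. The following Python sketch shows one way to do it; `records_between` and `display_all` are hypothetical helper names, the tuple layout is an assumption, and the output format is not the format of the actual OBFL display.

```python
import bisect


def records_between(records, start, end):
    """Return records with start <= timestamp <= end.

    records is a list of (timestamp, category, text) tuples that is
    already sorted by timestamp.
    """
    stamps = [r[0] for r in records]
    low = bisect.bisect_left(stamps, start)
    high = bisect.bisect_right(stamps, end)
    return records[low:high]


def display_all(records, start, end):
    """Format the selected records as one text line each, in time order."""
    return [f"{ts:>10.1f}  {category:<8} {text}"
            for ts, category, text in records_between(records, start, end)]
```

Binary search keeps the lookup cheap even when the log holds many records, which matters when the display runs on the board itself.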
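Two steps of such a debug report, finding entries logged shortly before the latest failure and comparing the latest baseline against a reference baseline, can be sketched in Python as follows. This is an assumption-laden illustration: `related_entries`, `baseline_discrepancies`, and the fixed look-back window `window_s` are hypothetical, and the article does not define how entries are associated with a failure.

```python
def baseline_discrepancies(latest, reference):
    """Return {key: (reference_value, latest_value)} for every differing key."""
    keys = sorted(set(latest) | set(reference))
    return {k: (reference.get(k), latest.get(k))
            for k in keys if reference.get(k) != latest.get(k)}


def related_entries(log, window_s):
    """Return the latest entry and the entries logged within window_s before it.

    log is a list of dicts that each carry a numeric "timestamp" key.
    """
    if not log:
        return None, []
    latest = max(log, key=lambda e: e["timestamp"])
    related = [e for e in log
               if e is not latest
               and 0 <= latest["timestamp"] - e["timestamp"] <= window_s]
    return latest, related
```

Comparing against the Golden or initial baseline in this way points directly at configuration items, such as a firmware version, that changed before the failure appeared.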
Example of OBFL Display Debug feature:
OBFL Periodic Task: This block provides an API for OBFL Resource management and is called periodically by application software to collect OBFL variables from OBFL resources.
NV Memory Store/recall and RTC: This is explained in the section below along with NV hardware functionality.
All OBFL records, along with their time stamps, are kept in non-volatile memory for later failure analysis. Because the memory is non-volatile, it retains failure logs even if the board loses power. Several memory technologies, such as Flash, EEPROM, FeRAM, and NVSRAM, offer non-volatile storage. NVSRAM memories are preferred because each cell consists of an SRAM cell and a corresponding non-volatile cell, so NVSRAM read and write operations provide the same high-speed access as SRAM.
During normal operation, the VCC connected to the NVSRAM supplies the power. The AUTOSTORE operation provided by the NVSRAM uses a capacitor connected to the memory to supply power during a power failure or a board crash. This enables the system to store the latest possible OBFL data before complete board failure. During power-up of the board, memory contents in the non-volatile region are automatically transferred to the SRAM part of the memory.
NVSRAM also provides Soft Store and Recall features. These enable OBFL software to store OBFL data in the non-volatile region of the memory whenever required. The Soft Recall feature may be used by OBFL software to extract data at the failure site. NVSRAM memory uses the Real Time Clock (RTC), since every record entry should be time stamped so that the sequence of failure events can be determined.
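How time-stamped records might be laid out in a fixed-size non-volatile region can be illustrated with a small model. In the Python sketch below a `bytearray` stands in for the memory array. The 8-byte record layout, the `NvRing` class, and the overwrite-oldest policy are all assumptions made for illustration, and the store and recall operations of a real device are not modeled.

```python
import struct

# Assumed layout: 32-bit timestamp, 8-bit category, 8-bit code, 16-bit value
RECORD_FMT = "<IBBH"
RECORD_SIZE = struct.calcsize(RECORD_FMT)  # 8 bytes


class NvRing:
    """Hypothetical ring of fixed-size OBFL records in a byte region."""

    def __init__(self, capacity_records):
        self.mem = bytearray(capacity_records * RECORD_SIZE)
        self.capacity = capacity_records
        self.head = 0   # next slot to write
        self.count = 0  # number of valid records

    def store(self, timestamp, category, code, value):
        """Write one record, overwriting the oldest when the region is full."""
        struct.pack_into(RECORD_FMT, self.mem, self.head * RECORD_SIZE,
                         timestamp, category, code, value)
        self.head = (self.head + 1) % self.capacity
        self.count = min(self.count + 1, self.capacity)

    def recall(self):
        """Return the stored records as tuples, oldest first."""
        start = (self.head - self.count) % self.capacity
        records = []
        for i in range(self.count):
            slot = (start + i) % self.capacity
            records.append(
                struct.unpack_from(RECORD_FMT, self.mem, slot * RECORD_SIZE))
        return records
```

Fixed-size records keep each write short and predictable, which suits logging from an interrupt handler or in the moments before power is lost.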