Strategies for Handling Application Errors

When an application error occurs, the device can end up in one of a few error states:

  • The device could end up in a state where the firmware is stuck and the watchdog is not being refreshed, triggering a watchdog interrupt.

  • The system could enter a state where a reset occurs due to one or more power supply issues.

  • The system could encounter faults in the program execution due to program errors, erroneous execution, or access to secured resources when in the wrong system state.

The behavior of user applications when any of these kinds of errors occurs is important for the stability of end products. The RSL15 device provides a variety of information and resources for determining the causes of system errors, and we can recommend some top-level best practices for handling errors. However, the details of how your user application handles errors and error recovery depend on your application use cases. The following subsections discuss general best practices, and the information that the system provides for handling and debugging watchdog interrupts, resets due to any cause, and faults.

Best Practices in Error Handling

Best practices for error handling require user applications to include:

  1. A handler for the watchdog timer interrupt

  2. Handlers for the hard fault and any other faults that may occur in the user application

  3. Checks of the reset status registers (ACS_RESET_STATUS and RESET_DIG_STATUS) at startup to confirm the cause of the most recent system reset

If any of these error modes is detected, a device needs to perform a set of recovery actions. Possible steps include:

  1. Checking the available status information to determine what can be done

  2. Triggering a hardware reset of specific system blocks or even the whole device, if the system state indicates that a hardware reset is needed

  3. Setting timeout flags to trigger an early exit from blocking loops and application level error handling

  4. Signaling an external device in the system using a GPIO or interface for error handling at the extended system level
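Step 3 above can be sketched as a bounded polling loop that also checks an application-level timeout flag. This is a minimal sketch, not RSL15 SDK code: app_timeout_flag, wait_result_t, and wait_for_flag are hypothetical names introduced for illustration, and we assume the flag is set from an error path such as the watchdog interrupt handler.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical flag, set by an error path such as a watchdog interrupt handler. */
static volatile bool app_timeout_flag = false;

typedef enum { WAIT_OK, WAIT_TIMEOUT } wait_result_t;

/* Poll *done until it becomes true. Leave early if the timeout flag is set
 * or after max_polls iterations, so the loop can never block forever. */
static wait_result_t wait_for_flag(const volatile bool *done, uint32_t max_polls)
{
    assert(done != NULL);
    for (uint32_t i = 0; i < max_polls; i++) {
        if (app_timeout_flag) {
            return WAIT_TIMEOUT;   /* error handling requested an early exit */
        }
        if (*done) {
            return WAIT_OK;
        }
    }
    return WAIT_TIMEOUT;           /* poll budget exhausted */
}
```

A caller that receives WAIT_TIMEOUT can then fall through to its application-level error handling instead of remaining in the loop.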

If these steps do not result in system recovery, more drastic steps can be taken to ensure that the application does not waste the system’s battery life.

For example, a user application could put the RSL15 device into Sleep Mode with a wakeup into an alternate error handling state. When using this kind of handling, best practice is for the error handling state to:

  • Be different from any other Sleep Mode and Wakeup implemented by the application.

  • Use the lowest possible power state.

  • Perform more rigorous checks of the device before switching to the normal state of operation.

This kind of error handling routine allows the application to retain limited state information, and can extend the overall system's battery life by keeping the RSL15 device in a lower power mode.

NOTE: Debugging the causes of watchdog interrupts, resets, and faults uses the same strategies and information as handling these errors.

Watchdog Interrupts

Expiry of the watchdog timer is often the first sign that application execution has failed. The watchdog timer interrupt triggers when the watchdog has not been refreshed within a defined number of system operating cycles. If another watchdog interrupt were to occur before the watchdog is refreshed, the system would reset with the RESET_DIG_STATUS register indicating that a watchdog reset had occurred.

When a user application handles a watchdog interrupt, the system state has not been reset, which typically simplifies the identification of the causes of errors and improves debugging. Wherever possible, we recommend that an application use this interrupt to evaluate the state of the RSL15 device and to proceed appropriately. To assist in this handling, the watchdog timer interrupt handler can use:

  • Application state variables

  • Device register settings and status bits

  • The system context stored onto the stack frame, including the core processor registers R0 to R3, R12, link register (LR), program counter (PC), and processor status register (PSR).
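The stacked system context listed above can be copied into a snapshot for later inspection. The following is a minimal sketch: stacked_frame_t, last_error_frame, and capture_stacked_frame are hypothetical names, and the code assumes the basic eight-word frame (no floating-point context) and that the caller already holds a pointer to the start of that frame. On the device, that pointer is typically obtained in a short assembly shim at handler entry, which is not shown here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Basic exception stack frame, in the order the core pushes it
 * (lowest address first). */
typedef struct {
    uint32_t r0;
    uint32_t r1;
    uint32_t r2;
    uint32_t r3;
    uint32_t r12;
    uint32_t lr;   /* link register at the point of the exception */
    uint32_t pc;   /* address of the interrupted instruction */
    uint32_t psr;  /* processor status register */
} stacked_frame_t;

/* Hypothetical snapshot; an application might place this in retained memory. */
static stacked_frame_t last_error_frame;

/* Copy the eight stacked words starting at sp into the snapshot. */
static void capture_stacked_frame(const uint32_t *sp)
{
    assert(sp != NULL);
    last_error_frame.r0  = sp[0];
    last_error_frame.r1  = sp[1];
    last_error_frame.r2  = sp[2];
    last_error_frame.r3  = sp[3];
    last_error_frame.r12 = sp[4];
    last_error_frame.lr  = sp[5];
    last_error_frame.pc  = sp[6];
    last_error_frame.psr = sp[7];
}
```

The stacked PC is usually the most useful field, because it identifies the code that was executing when the watchdog interrupt was taken.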

Resets

If a reset has occurred, the reset status registers (ACS_RESET_STATUS and RESET_DIG_STATUS) indicate which events or system state triggered the reset. These registers and the possible reset causes are discussed in more detail in Resets from the RSL15 Hardware Reference.

NOTE: We recommend clearing all reset status flags in these registers at the start of application execution (after the reset source has been determined), to allow future executions to determine the cause of a reset or resets. To clear the status bits that indicate the source of a reset, the RESET_DIG_STATUS register must be cleared before the ACS_RESET_STATUS register.
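The read-then-clear sequence in this note can be sketched as follows. This is an illustration of the ordering only, under stated assumptions: reset_regs_t, reset_cause_t, and read_and_clear_reset_cause are hypothetical names, the registers are reached through caller-supplied pointers, and the value that clears each register is passed in by the caller because the actual clear mechanism is defined in the RSL15 Hardware Reference, not here.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* Hypothetical description of the two reset status registers. */
typedef struct {
    volatile uint32_t *acs_reset_status;  /* points at ACS_RESET_STATUS */
    volatile uint32_t *reset_dig_status;  /* points at RESET_DIG_STATUS */
    uint32_t acs_clear_value;             /* assumed value that clears the ACS flags */
    uint32_t dig_clear_value;             /* assumed value that clears the digital flags */
} reset_regs_t;

/* Raw register contents captured before clearing. */
typedef struct {
    uint32_t acs;
    uint32_t dig;
} reset_cause_t;

/* Capture both status registers, then clear RESET_DIG_STATUS before
 * ACS_RESET_STATUS, as the note above requires. */
static reset_cause_t read_and_clear_reset_cause(const reset_regs_t *regs)
{
    assert(regs != NULL);
    reset_cause_t cause;
    cause.acs = *regs->acs_reset_status;
    cause.dig = *regs->reset_dig_status;

    *regs->reset_dig_status = regs->dig_clear_value;  /* digital status first */
    *regs->acs_reset_status = regs->acs_clear_value;  /* then ACS status */
    return cause;
}
```

Calling a routine like this once, early in startup, leaves the application with a record of the most recent reset cause and clean flags for the next reset.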

Faults and Lockup

If the Arm Cortex-M33 processor encounters a fault condition, it will enter into a fault handler. Faults that can be detected include:

  • Bus faults indicating an error in physically accessing a requested memory location

  • Memory management faults indicating an error in memory management, such as issues during exception stacking, and accesses to memory that exists but cannot be accessed in the current system state (instruction or data access violations)

  • Usage faults indicating an error in the code, such as:

    • Division by zero

    • Stack overflows

    • Unaligned data or code accesses

    • Invalid instructions

  • Secure faults where the application violates the security requirements defined for the current processor state

The Arm Cortex-M33 processor is required to handle all faults, with faults being promoted to be handled by the Hard Fault handler if a specific fault handler is unavailable. A fault is escalated to the hard fault handler if:

  • A fault handler causes the same kind of fault as the one it is servicing. This escalation to HardFault occurs because a fault handler cannot preempt itself; it must have the same priority as the current execution priority level.

  • A fault handler causes a fault with the same or lower priority as the fault it is servicing. This is because the handler for the new fault cannot preempt the currently executing fault handler.

  • An exception handler causes a fault for which the priority is the same as, or lower than, the currently executing exception.

  • A fault occurs and the handler for that fault is not enabled.

If a fault occurs that cannot be handled (including faults in the NMI handler, or faults that occur while handling a hard fault), the Arm Cortex-M33 processor enters into a lockup state. The processor remains in this state until the core is reset or is halted by a debugger. In the lockup state, the program counter (PC) is forced to 0xEFFFFFFE.

If you encounter a fault, there are several items that can be used to figure out why the fault has occurred. The fault handling provides:

  • The Configurable Fault Status Register (CFSR) that indicates the causes of bus, memory management, and usage faults

  • The Bus Fault Address Register (BFAR) and Memory Management Address Register (MMAR), which provide the address of the access that caused the fault when a bus fault or memory management fault occurs at a known address (each is only valid when the corresponding bit in CFSR is set).

  • The Secure Fault Status Register (SFSR) that indicates the causes of secure faults

  • The Hard Fault Status Register (HFSR), which indicates if a hard fault has been triggered directly due to a debug event or failed vector fetch, or if it was triggered by a fault that has been promoted to a hard fault.

  • The system context stored onto the stack frame, including the core processor registers R0 to R3, R12, link register (LR), program counter (PC), and processor status register (PSR).
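A fault handler can reduce the status registers listed above to a short summary before deciding how to recover. The sketch below decodes CFSR and HFSR values passed in as plain integers, using the architectural bit positions for the Arm Cortex-M33 processor. The macro names, fault_summary_t, and summarize_fault are hypothetical; in CMSIS-based projects the values are typically read from SCB->CFSR and SCB->HFSR.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <stdint.h>

/* CFSR sub-fields and selected bits. */
#define CFSR_MMFSR_MASK  0x000000FFu   /* memory management fault status */
#define CFSR_BFSR_MASK   0x0000FF00u   /* bus fault status */
#define CFSR_UFSR_MASK   0xFFFF0000u   /* usage fault status */
#define CFSR_MMARVALID   (1u << 7)     /* MMAR holds a valid address */
#define CFSR_BFARVALID   (1u << 15)    /* BFAR holds a valid address */
#define CFSR_DIVBYZERO   (1u << 25)    /* division by zero */

/* HFSR bits. */
#define HFSR_VECTTBL     (1u << 1)     /* failed vector fetch */
#define HFSR_FORCED      (1u << 30)    /* escalated from another fault */
#define HFSR_DEBUGEVT    (1u << 31)    /* debug event */

typedef struct {
    bool mem_manage;     /* any memory management fault bit set */
    bool bus;            /* any bus fault bit set */
    bool usage;          /* any usage fault bit set */
    bool mmar_valid;     /* fault address in MMAR can be trusted */
    bool bfar_valid;     /* fault address in BFAR can be trusted */
    bool escalated;      /* hard fault was promoted from another fault */
    bool vector_fetch;   /* hard fault caused by a failed vector fetch */
    bool debug_event;    /* hard fault caused by a debug event */
} fault_summary_t;

/* Fill *out with a summary of the given CFSR and HFSR values. */
static void summarize_fault(uint32_t cfsr, uint32_t hfsr, fault_summary_t *out)
{
    assert(out != NULL);
    out->mem_manage   = (cfsr & CFSR_MMFSR_MASK) != 0u;
    out->bus          = (cfsr & CFSR_BFSR_MASK) != 0u;
    out->usage        = (cfsr & CFSR_UFSR_MASK) != 0u;
    out->mmar_valid   = (cfsr & CFSR_MMARVALID) != 0u;
    out->bfar_valid   = (cfsr & CFSR_BFARVALID) != 0u;
    out->escalated    = (hfsr & HFSR_FORCED) != 0u;
    out->vector_fetch = (hfsr & HFSR_VECTTBL) != 0u;
    out->debug_event  = (hfsr & HFSR_DEBUGEVT) != 0u;
}
```

For example, a division by zero that reaches the hard fault handler because the usage fault handler is not enabled would be reported with usage and escalated both set.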