Everything eventually fails. This is a reality that experienced engineers understand. While some systems can be designed to work perfectly (with a high level of confidence) over their intended life cycles, which range from a few thousandths of a second to 30 years, the probability of failure grows over time due to mechanical stress and external interference. What matters most in understanding the risks and functional safety of such products is how a device fails and how that failure affects the operation of the device.
Because no system can, within practical limits, be designed to work perfectly forever, it is important to understand both the effect of a failure and its root cause.
For analyzing the effect and the root cause of a failure, the accepted method is failure mode and effects analysis (FMEA). This method takes a systematic approach in which an effect is first identified, followed by the potential root causes that drive the effect. The method is very useful when applied to a system in the context of the system’s intended function. Using FMEA to analyze components designed without knowledge of the system in which they are used is inadvisable because the system-level effect is not known.
The steps for performing an FMEA are as follows:
- Identify and list in a table each process the IP performs, making sure the processes identified are simple and detailed so that they are manageable.
- Brainstorm potential ways each process can fail.
- List the potential effects of each failure.
- Assign severity rankings for each failure based on the consequences of the failure.
- Assign occurrence rankings for each failure based on the probability of a potential failure.
- Assign detection rankings for each failure based on the probability of detecting a failure before the potential effect occurs.
- Calculate the risk priority number for each failure as the product of severity, occurrence, and detection.
- Develop an action plan with mitigation tactics for the failures with the highest risk priority numbers.
- Implement the tactics.
- Reevaluate the risk priority number metrics to determine if the system is acceptable.
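The ranking and calculation steps above can be illustrated with a minimal sketch. It assumes a 1 to 10 scale for each ranking, which is a common convention but is not specified here, and the `FailureMode` class, the `rank_by_rpn` helper, and the worksheet entries are hypothetical names and values chosen only for illustration.

```python
from dataclasses import dataclass


@dataclass
class FailureMode:
    """One row of a hypothetical FMEA worksheet."""
    process: str
    failure: str
    effect: str
    severity: int    # assumed scale: 1 (negligible) to 10 (most severe)
    occurrence: int  # assumed scale: 1 (unlikely) to 10 (almost certain)
    detection: int   # assumed scale: 1 (always detected) to 10 (undetectable)

    def __post_init__(self):
        # Reject rankings outside the assumed 1..10 scale.
        for name in ("severity", "occurrence", "detection"):
            value = getattr(self, name)
            if not 1 <= value <= 10:
                raise ValueError(f"{name} must be in 1..10, got {value}")

    @property
    def rpn(self) -> int:
        # Risk priority number: product of the three rankings.
        return self.severity * self.occurrence * self.detection


def rank_by_rpn(modes):
    """Return the failure modes sorted from highest to lowest RPN."""
    return sorted(modes, key=lambda m: m.rpn, reverse=True)


# Illustrative entries only; the rankings are invented for this example.
worksheet = [
    FailureMode("FIFO write", "pointer overflow", "data corrupted", 8, 3, 4),
    FailureMode("Clock gating", "enable stuck low", "block halts", 6, 2, 2),
    FailureMode("Register read", "bit flip", "wrong value returned", 7, 4, 6),
]

ranked = rank_by_rpn(worksheet)
for mode in ranked:
    print(f"{mode.rpn:4d}  {mode.process}: {mode.failure}")
```

Sorting by the product puts the entries that the action plan would address first at the top of the list. The output is only as meaningful as the rankings entered, and those rankings depend on system-level context.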
As is apparent from the steps outlined, asking a designer to perform this analysis is overwhelming because the designer knows only the context of the IP's operation and has limited, if any, operational context for the IP's application in the target system. The engineer might begin making assumptions, a process that is both time-consuming and frustrating. Even a functional safety engineer with years of experience might not be able to complete this exercise with any meaningful outcome that is usable for system integration. Also, a key tenet in functional safety is the notion that no dangerous fault is more unsafe than another dangerous fault. Consequently, using the ranking and rating methods prescribed in an FMEA is meaningless.
The primary focus for a design is to meet the technical requirements set out by the architecture. In functional safety, a key requirement for the detection of a random hardware fault is diagnostic coverage.