
Failure Mode and Effects Analysis (FMEA): A Practitioner's Guide to Proactive Risk Management


By Allan Ung | Founder & Principal Consultant, Operational Excellence Consulting


[Image: A clear mountain stream flowing through a sunlit woodland forest, illustrating the Japanese quality proverb that prevention upstream eliminates the need for costly correction downstream.]
"If you can get clean water upstream, it is needless to purify the water downstream." — A common Japanese saying. In quality management, FMEA is the discipline of going upstream: finding and eliminating failure modes before they reach your customer.

Allan Ung is the Founder and Principal Consultant of Operational Excellence Consulting, a Singapore-based firm established in 2009. With over 30 years of experience leading operational excellence and quality transformation across manufacturing, technology, and global operations—including senior roles at IBM, Microsoft, and Underwriters Laboratories—Allan brings deep shopfloor expertise to every learning room he enters. A Certified Management Consultant (CMC, Japan), Lean Six Sigma Black Belt, TPM Instructor, TWI Master Trainer, and former Singapore Quality Award National Assessor, he has facilitated structured problem-solving and Lean programmes for organisations including the Ministry of Education, Tokyo Electron, Panasonic, Micron, Lam Research, Toyota Tsusho, NileDutch, Sika Group, and NEC.


Every quality failure that makes the news — the Takata airbag inflators linked to at least 20 deaths and a $24 billion recall, the Samsung Galaxy Note 7 batteries that caught fire mid-flight, the Space Shuttle Columbia foam strike that engineers had flagged but failed to act on — shares a common, painful thread: the failure mode was knowable before it happened.


The question is never whether failure is possible. In any sufficiently complex product or process, it always is. The question is whether your organization has a structured method for finding those failure modes before they escape to your customers, your regulators, or the headlines.


That method is Failure Mode and Effects Analysis — FMEA.


This guide covers what FMEA is, how it works, where it is most powerfully applied, and — critically — where it goes wrong even when teams try to do it right. If you have encountered FMEA as a checkbox exercise buried in an ISO audit, this article will show you what it looks like when it is done seriously.


What Is FMEA?


Failure Mode and Effects Analysis is a structured, team-based method for systematically identifying the ways in which a product or process can fail, estimating the risk associated with each failure mode, and prioritizing actions to reduce that risk before failures occur.

The operative word is before. FMEA is a preventive tool, not a corrective one. It belongs at the design and planning stages of a process or product — not in the post-mortem after a customer complaint arrives.


At its core, FMEA asks three disciplined questions about every potential failure:

  1. How severe would the consequence be? (Severity)

  2. How likely is this failure to occur? (Occurrence)

  3. How well can our current controls detect it before it reaches the customer? (Detection)


The answers to these three questions are scored, multiplied together to produce a Risk Priority Number (RPN), and used to prioritize where corrective action is most urgent.


This is deceptively simple. The power comes from doing it systematically, cross-functionally, and with intellectual honesty about what can go wrong.


A Brief History: From the U.S. Navy to the Shop Floor


FMEA's origins trace back to the early 1950s, when the U.S. Navy Bureau of Aeronautics developed it as a reliability engineering technique for military aircraft and weapon systems. NASA adopted and formalized the methodology through the 1960s and 1970s — notably, it was the kind of structured risk assessment that critics later argued was insufficiently applied in the Challenger and Columbia shuttle programs.


The Department of Defense codified the methodology in MIL-STD-1629 (1974), later revised as MIL-STD-1629A (1980). Ford Motor Company then brought FMEA into mainstream manufacturing in the 1980s, and the automotive industry collectively developed the standards that many practitioners know today through the Automotive Industry Action Group (AIAG) manuals, including the joint AIAG-VDA FMEA Handbook published in 2019.


Engineers across semiconductors, medical devices, aerospace, electronics, and increasingly service industries have adopted and adapted the methodology. It is now referenced in standards including ISO 9001:2015's risk-based thinking requirements, IATF 16949 for automotive suppliers, and AS9145 for aerospace. It is also integral to the Quality Maintenance pillar of Total Productive Maintenance (TPM).


The methodology has evolved, but the underlying logic has not changed in 70 years: find the failure modes before the failure finds your customer.


The Business Case: The 1-10-100 Rule


One of the most important frameworks for understanding why FMEA matters financially is the 1-10-100 Rule, sometimes called the "Rule of 10."


The principle, well-established in total quality management literature, states that:

  • $1 spent on prevention avoids...

  • $10 spent on correction (rework, scrap, re-inspection), which avoids...

  • $100 spent on failure costs (warranty claims, recalls, litigation, lost customers, brand damage)


This is not merely a theoretical ratio. The Takata airbag recall ultimately cost an estimated $24 billion by 2016. Samsung's Note 7 recall cost $5.3 billion and required the recall of 2.5 million devices within two months of launch. In both cases, the failure modes — inflator design flaws and battery cell compression — were physically discoverable through rigorous pre-launch FMEA work.


FMEA is not a quality department exercise. It is a financial risk management decision made at the design stage, when the cost of change is still low.


This is why the most common FMEA question in leadership briefings is not "how do we run FMEA?" but rather: "why does it always seem we have plenty of time to fix our problems, but never enough time to prevent them in the first place?"


Two Types of FMEA: Design and Process


FMEA comes in two primary variants, each targeting a different phase and source of risk.


Design FMEA (DFMEA)


DFMEA focuses on the product itself — the possibility of product malfunction, reduced product life, and safety or regulatory failures that arise from design decisions. It examines:


  • Material properties and their behaviour under stress

  • Geometric tolerances and their interactions

  • Interfaces between components or sub-systems

  • Engineering noise factors: operating environments, user behaviour profiles, degradation over time, and system-level interactions


A DFMEA is the designer's structured way of asking: in what ways could this product fail to do what the customer needs it to do, and what are the consequences?


The airbag inflator is a classic DFMEA scenario: the design engineer must anticipate not just whether the inflator deploys, but whether excessive deployment force could itself cause harm — which is precisely the failure mode that made the Takata recall the largest in U.S. automotive safety history.


Process FMEA (PFMEA)


PFMEA focuses on the manufacturing or service delivery process, examining the ways in which process steps can fail and produce non-conforming output. It examines six primary sources of process variation, often remembered as the 6Ms:


  • Man (human factors, training, attention)

  • Method (procedures, work instructions, sequencing)

  • Material (incoming quality, component variation)

  • Machine (equipment capability, calibration, maintenance)

  • Measurement (inspection systems, gauge accuracy)

  • Mother Nature / Environment (temperature, humidity, vibration)


A PFMEA for airbag assembly, for example, would examine whether an operator might fail to install the inflator correctly on the assembly line such that it does not engage during impact — a process failure distinct from the design failure, but equally catastrophic in consequence.


In practice, DFMEA and PFMEA are complementary documents. The output of DFMEA informs the critical characteristics that PFMEA must control. Together they form a continuous thread of risk awareness from design intent through production delivery.


The FMEA Scoring System: Severity, Occurrence, and Detection


The analytical core of FMEA is the three-dimensional risk model. Each dimension is scored on a 1–10 scale, and the three scores are multiplied to produce the RPN.


Severity (S): How Bad Is the Effect?


Severity scores the worst-case consequence of a failure mode on the customer — internal or external. The scale runs from 1 (no perceivable effect) to 10 (hazardous failure without warning that endangers safety or violates regulatory requirements).


A score of 8 or above typically signals that design changes are mandatory regardless of occurrence or detection ratings — because some failure effects are simply unacceptable at any frequency.

| Score | Effect |
|---|---|
| 10 | Hazardous, without warning — safety or regulatory failure |
| 9 | Hazardous, with warning — illegal or injurious |
| 8 | Very high — product or service rendered unfit for use |
| 7 | High — extreme customer dissatisfaction |
| 6 | Moderate — partial malfunction |
| 5 | Low — complaint-generating performance loss |
| 4 | Very low — minor performance loss |
| 3 | Minor — nuisance, no performance loss |
| 2 | Very minor — unnoticed, minimal effect |
| 1 | None — unnoticed, no effect |


Critical principle: Severity scores a failure effect, not the failure mode itself. Changing a control or detection method does not change the severity — only a design change that eliminates or mitigates the failure effect can do that.


Occurrence (O): How Likely Is the Cause?


Occurrence estimates the frequency with which a specific failure cause will occur during the intended life of the design or process, ideally informed by historical data, process capability indices (Ppk), and field experience.

| Score | Likelihood |
|---|---|
| 10 | Very high: persistent failures (Ppk < 0.55) |
| 7–9 | High: frequent failures |
| 4–6 | Moderate: occasional failures |
| 2–3 | Low: relatively few failures |
| 1 | Remote: failure is unlikely (Ppk ≥ 1.67) |


The occurrence score targets the cause of the failure mode, not the failure mode itself. Reducing occurrence means eliminating or controlling the cause — through process redesign, mistake-proofing (Poka-Yoke), statistical process control, or supplier qualification.


Detection (D): How Well Can We Catch It?


Detection scores the ability of current controls to identify a failure mode or its cause before it reaches the customer. Critically, the scale is inverted: a score of 10 means the defect is essentially undetectable, and a score of 1 means detection is virtually certain.

Score

Detection Capability

10

Almost impossible — no detection system exists

7–9

Low probability of detection

4–6

Moderate — sampling or manual inspection

2–3

High — systematic detection methods in place

1

Almost certain — automatic detection prevents non-conforming output


A common error is over-relying on high detection scores to justify inaction on high-severity or high-occurrence failure modes. Detection controls are downstream safeguards — they catch failures after they occur. Prevention through reduced occurrence is always preferable.


The Risk Priority Number (RPN)


RPN = Severity (S) × Occurrence (O) × Detection (D)


The RPN ranges from 1 (minimal risk) to 1,000 (maximum risk across all three dimensions). It is used to rank failure modes and prioritize the allocation of corrective action resources toward the "vital few" that represent the greatest actual risk.
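
To make the arithmetic concrete, here is a minimal sketch in Python (the function name and range check are illustrative choices, not part of any FMEA standard):

```python
def rpn(severity: int, occurrence: int, detection: int) -> int:
    """Risk Priority Number: S x O x D, each scored on a 1-10 scale."""
    for name, score in [("severity", severity),
                        ("occurrence", occurrence),
                        ("detection", detection)]:
        if not 1 <= score <= 10:
            raise ValueError(f"{name} must be 1-10, got {score}")
    return severity * occurrence * detection

print(rpn(1, 1, 1))     # 1     (minimal risk)
print(rpn(10, 10, 10))  # 1000  (maximum risk)
```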


A worked example from a purchase requisition process:


A Process FMEA for a PR submission workflow identifies the following failure mode:


  • Process step: Engineer completes purchase requisition form

  • Failure mode: Form filled out incorrectly (actual cost of goods not entered in required fields)

  • Failure effect: PR rejected; delivery of goods delayed

  • Severity: 5 (moderate — operational delay, no safety risk)

  • Cause: Engineer not trained on correct requisition process

  • Occurrence: 4 (occasional — happens perhaps once per month)

  • Current control: Purchasing department manually verifies all incoming PRs for accuracy

  • Detection: 3 (systematic — verification catches most errors)

  • Initial RPN: 5 × 4 × 3 = 60


Recommended actions: conduct PR training for all engineers (webinar series, recorded for onboarding); develop a digital requisition system that prevents submission of incomplete forms.


After training was delivered to 75% of engineers, the occurrence score dropped from 4 to 2.


Revised RPN: 5 × 2 × 3 = 30 — a 50% risk reduction.


This example illustrates something important: FMEA is not a one-time analysis. It is a living document that is updated as causes are addressed and controls improve.
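
As a rough sketch of that living-document update, the purchase requisition example can be replayed in a few lines; the dictionary keys below are hypothetical, but the scores come straight from the example:

```python
# Purchase-requisition failure mode, scored as in the worked example.
pr_failure = {
    "mode": "PR form filled out incorrectly",
    "severity": 5,    # moderate: operational delay, no safety risk
    "occurrence": 4,  # occasional: roughly once per month
    "detection": 3,   # purchasing manually verifies all incoming PRs
}

initial = pr_failure["severity"] * pr_failure["occurrence"] * pr_failure["detection"]

# Training addresses the cause, so the team re-scores occurrence.
# Severity stays at 5: only a design change can move it.
pr_failure["occurrence"] = 2
revised = pr_failure["severity"] * pr_failure["occurrence"] * pr_failure["detection"]

print(initial, revised)                        # 60 30
print(f"{(initial - revised) / initial:.0%}")  # 50% risk reduction
```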


Important Limitations of RPN: What the Number Does Not Tell You


Experienced FMEA practitioners know that the RPN has limitations that must be understood to avoid dangerous conclusions.


RPN does not reflect absolute risk thresholds. A failure mode with RPN 60 from Severity 10 × Occurrence 2 × Detection 3 is categorically more urgent than one with RPN 120 from Severity 2 × Occurrence 6 × Detection 10 — even though the second has double the RPN. High-severity failure modes (8 and above) demand action regardless of their RPN score.


RPN comparisons are meaningful within a single FMEA, not across FMEAs. Scoring tables are calibrated to organizational context. A cross-FMEA comparison of raw RPN figures is not meaningful.


Never stop at the RPN calculation. One of the most consistent reasons FMEAs fail is that teams invest hours in scoring discussions and then fail to generate, assign, and follow up on recommended actions. The RPN table without an action plan is an expensive documentation exercise. The action plan is the point.
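
To see how severity-first ranking plays out, here is a sketch over a hypothetical worksheet; the rows and field layout are invented for illustration, and the severity-8 threshold is the one discussed in the severity section above:

```python
# Hypothetical worksheet rows: (failure mode, S, O, D)
rows = [
    ("Inflator over-pressurizes on deployment", 10, 2, 3),   # RPN  60
    ("Cosmetic label smudged",                   2, 6, 10),  # RPN 120
    ("Connector seated incorrectly",             8, 3, 4),   # RPN  96
]

def priority_key(row):
    _, s, o, d = row
    # High-severity items (8+) rank ahead of everything else,
    # then items sort by descending RPN within each group.
    return (s < 8, -(s * o * d))

for name, s, o, d in sorted(rows, key=priority_key):
    flag = "  <- unconditional action (severity >= 8)" if s >= 8 else ""
    print(f"RPN {s * o * d:4d}  S={s:2d}  {name}{flag}")
```

Run it and the severity-2 failure mode lands last despite carrying the highest raw RPN, which is exactly the point.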


FMEA vs. Root Cause Analysis: Understanding the Relationship


A question that frequently arises in quality training programmes is: what is the difference between FMEA and Root Cause Analysis (RCA)?

The distinction is timing and direction.


FMEA is prospective. It identifies and addresses potential failure modes before adverse events occur, by systematically working through what could go wrong and putting controls in place.


RCA is retrospective. It investigates failures after they have occurred, using tools such as 5 Whys and fishbone (Ishikawa) diagrams to identify and eliminate the systemic root causes of actual failures.


The two tools are complementary rather than competing. In practice, the relationship is bidirectional:


  • FMEA informs 8D problem solving: When a quality escape occurs and an 8D is initiated, the FMEA document serves as a pre-brainstormed database of possible causes. Rather than starting from a blank fishbone diagram, the 8D team can review the relevant FMEA for potential causes that were already identified and cross-reference them against the actual failure — significantly accelerating the investigation.


  • 8D and RCA data improves FMEA: Actual field failures, customer complaints, and completed 8D corrective actions are primary inputs to FMEA updates. When a new failure mode is discovered in the field, it is added to the FMEA as a confirmed failure mode, the occurrence score is revised, and the control effectiveness is re-evaluated. This is how the FMEA becomes more accurate and predictive over time.


This feedback loop between reactive tools (8D, RCA) and proactive tools (FMEA) is a hallmark of mature quality management systems. In quality improvement programmes with clients across high-complexity manufacturing environments — semiconductor equipment, industrial coatings, automotive components — FMEA quality consistently improves fastest when it is formally linked to the 8D close-out process.


The 10-Step FMEA Procedure


A rigorous FMEA follows a defined sequence. Skipping or rushing any step is a leading cause of low-quality output.


Step 1: Define the scope and assemble the team


FMEA is a team tool, not an individual exercise. The team should include the process owner, process engineer, quality engineer, a representative from the customer (internal or external), and — for DFMEA — a design engineer. The team should not be led by the process owner alone; an independent facilitator improves objectivity significantly.


Step 2: Develop a process map or block diagram


Before failure modes can be identified, the process or design must be understood at the step level. A process map shows the logical sequence of activities; a block diagram shows how system components relate. The FMEA is built structure-by-structure against this map. Attempting to conduct FMEA without a current, accurate process map is one of the most common errors.


Step 3: Identify potential failure modes for each step or function


For each process step or design function, the team systematically asks: in what ways could this step fail to perform its intended function? Multiple failure modes can exist for a single step. This brainstorming phase benefits from diverse team perspectives — operators, engineers, and quality staff often identify different failure modes for the same step.


Step 4: Determine the potential effects of each failure mode


For each failure mode, identify the consequences for the next process step, the end product, and ultimately the customer. Effects are the "what" that motivates the severity score.


Step 5: Assign severity scores


Score the worst-case effect using the severity table. This score does not change unless the design is changed to eliminate or mitigate the effect itself.


Step 6: Identify potential causes


For each failure mode, identify the specific, correctable causes — the "why" that, if eliminated or controlled, prevents the failure mode from occurring. Causes should be stated specifically enough to suggest a corrective action. "Operator error" is not a useful cause statement; "operator not trained on step 4 verification procedure" is.


Step 7: Assign occurrence scores


Score the likelihood of each cause using historical data where available, supplemented by team experience and process capability data (Ppk).
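
Where measurement data exists, process capability can anchor the occurrence debate in evidence. A minimal sketch, assuming known spec limits and a small sample of measurements; only the two Ppk anchor points (below 0.55, at or above 1.67) come from the occurrence table earlier in this article, and the bands between them should come from your organization's calibrated scoring table:

```python
import statistics

def ppk(samples, lsl, usl):
    """Overall process capability: the smaller distance from the mean
    to a spec limit, in units of three (overall) standard deviations."""
    mean = statistics.mean(samples)
    sigma = statistics.stdev(samples)  # sample standard deviation
    return min((usl - mean) / (3 * sigma), (mean - lsl) / (3 * sigma))

measurements = [9.98, 10.02, 10.01, 9.97, 10.03, 10.00, 9.99, 10.02]
capability = ppk(measurements, lsl=9.90, usl=10.10)

if capability < 0.55:
    print(f"Ppk={capability:.2f}: occurrence score 10 (persistent failures)")
elif capability >= 1.67:
    print(f"Ppk={capability:.2f}: occurrence score 1 (failure unlikely)")
else:
    print(f"Ppk={capability:.2f}: intermediate; score from the team's calibrated table")
```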


Step 8: Identify current controls and assign detection scores


List the existing prevention or detection controls for each cause or failure mode, and score their effectiveness using the detection table.


Step 9: Calculate the RPN and prioritize


Multiply S × O × D for each failure mode-cause combination. Rank the results and identify the high-priority items for corrective action. Always prioritize by severity first, then by RPN.


Step 10: Develop, assign, and track recommended actions


For each high-priority failure mode, define specific actions to reduce occurrence (eliminate or control the cause), improve detection, or — in high-severity cases — redesign to reduce the severity of the effect. Assign a named owner and a target completion date. Update the FMEA with actions taken and revised RPN scores.
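
As a sketch of what a tracked Step 10 record might contain (every field name here is an assumption for illustration), the essentials are a named owner, a target date, and a revised RPN that stays pending until the action is verified:

```python
from dataclasses import dataclass
from datetime import date

@dataclass
class RecommendedAction:
    failure_mode: str
    action: str
    owner: str            # a named person, not a department
    target_date: date
    severity: int         # unchanged unless the design itself changes
    occurrence_before: int
    detection_before: int
    occurrence_after: int | None = None  # re-scored after completion
    detection_after: int | None = None

    def rpn_before(self) -> int:
        return self.severity * self.occurrence_before * self.detection_before

    def rpn_after(self) -> int | None:
        if self.occurrence_after is None or self.detection_after is None:
            return None  # action not yet completed and verified
        return self.severity * self.occurrence_after * self.detection_after

item = RecommendedAction(
    failure_mode="PR form filled out incorrectly",
    action="Deliver PR training webinar; record it for onboarding",
    owner="Lead engineer (a named individual)",
    target_date=date(2026, 3, 31),
    severity=5, occurrence_before=4, detection_before=3,
)
print(item.rpn_before(), item.rpn_after())  # 60 None (pending verification)
```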


When to Use FMEA


FMEA is most valuable when deployed early — before the design is locked or the process goes live. Specific trigger points include:


  • When new products, services, or processes are being designed from scratch

  • When existing designs or processes are being changed significantly, even if the change seems minor

  • When carry-over designs from one product generation are being applied to a new application with different operating conditions

  • After system or process functions have been defined but before hardware is specified or production tooling is ordered

  • Early in any process improvement investigation, as part of understanding current risk before redesigning the process


The worst time to conduct FMEA for the first time is after a customer complaint, a warranty claim, or a regulatory audit. By that point, prevention has already failed and the organization is in correction mode — paying a 10x premium for every dollar it did not spend upstream.


Why FMEAs Fail: The Ten Most Common Errors


Practical experience running FMEA workshops with cross-functional teams across manufacturing, semiconductor, and service environments reveals consistent failure patterns that organizations repeat.


1. One person completes the FMEA alone. FMEA is explicitly a team tool. A lone engineer completing the form without cross-functional input produces an incomplete risk picture and generates no organizational buy-in for the actions.


2. The scoring system is not calibrated to the organization's context. Generic scoring tables produce generic results. Severity, occurrence, and detection criteria should be customized to reflect what "severe" and "likely" mean in your specific operating environment, industry, and customer base.


3. The most knowledgeable people are excluded — or allowed to dominate. The design or process expert must be in the room, but should not be allowed to shut down discussion of failure modes they are emotionally invested in. Effective facilitation manages this dynamic.


4. Team members are not trained in FMEA methodology. Without a shared understanding of the definitions and scoring logic, FMEA sessions become scoring debates that generate frustration rather than insight. A brief training investment before the session pays for itself in session quality.


5. The team rushes through failure mode identification. The quality of an FMEA is determined entirely by the completeness of its failure mode list. Rushing this phase to reach the scoring columns is the single biggest source of FMEA gaps.


6. Every failure mode gets the same effect. "Customer dissatisfied" is not a failure effect; it is a category. Effects must be specific enough to score meaningfully and to generate meaningful actions.


7. The FMEA stops at RPN calculation. The RPN table without recommended actions and owners is a risk register, not a risk management system. The action plan is the entire point of the exercise.


8. High-severity failure modes are not treated as unconditional action items. Failure modes with a severity of 8 or above require action regardless of RPN. Teams that let a high-severity, low-occurrence, low-detection combination produce a "comfortable" RPN and walk away have made a dangerous mistake.


9. The FMEA is not re-evaluated after actions are completed. Updated RPNs after corrective actions are the evidence of risk reduction. Skipping this step means the organization has no documented proof that the actions worked.


10. The FMEA is treated as a static document. A living FMEA is updated whenever the process changes, new failure modes are discovered in the field, or completed 8D investigations identify causes that were not previously captured. An FMEA that has not been touched since its initial creation is an outdated risk picture.


FMEA as an Organizational Knowledge Asset


Beyond its immediate risk management function, a well-maintained FMEA serves as the organization's institutional memory for quality risk.


It documents the reasoning and experience of engineers — what was considered, what was designed in as a control, what was accepted as residual risk and why. For new engineers and team members, the FMEA is an invaluable resource that prevents re-learning past failures from scratch. For future design and process decisions, it is the baseline from which risk comparison begins.


This is particularly important in industries with high staff turnover or long product life cycles. The institutional knowledge embedded in a living FMEA does not leave the organization when experienced engineers retire or move on.


FMEA's Place in the Quality Excellence System


FMEA does not operate in isolation. It is one of several interlocking tools that together constitute a proactive quality management system.


Within a Total Quality Management (TQM) framework, FMEA is the primary mechanism for delivering on the principle that quality must be designed in, not inspected in. It translates the customer focus and process orientation of TQM into specific, actionable risk assessments.


FMEA connects naturally with:


  • Total Quality Process (TQP): The operating system within which FMEA operates as the primary prevention discipline.

  • Root Cause Analysis (RCA): RCA feeds field failure data back into FMEA updates; FMEA pre-populates the cause hypothesis space for RCA investigations.

  • 8D Problem Solving: FMEA and 8D share cause-and-effect documentation that is formally cross-referenced in mature quality systems. The root cause analysis phase of an 8D investigation (D4) benefits directly from the FMEA's pre-existing cause database.

  • Poka-Yoke (Mistake-Proofing): The failure causes identified in PFMEA are the design brief for mistake-proofing solutions. FMEA tells you what needs to be prevented; Poka-Yoke tells you how to engineer it out.

  • Cost of Quality (COQ): FMEA is a prevention cost investment that directly reduces internal and external failure costs. The COQ framework provides the financial language to justify FMEA resources to leadership.

  • Advanced Product Quality Planning (APQP): In automotive and aerospace applications, FMEA is one of the mandatory core tools within the APQP framework, alongside the Control Plan, Measurement System Analysis (MSA), and Statistical Process Control (SPC).


Conclusion: Prevention Is a Leadership Decision


FMEA is not difficult to understand. The scoring system is straightforward, the logic is intuitive, and the methodology is well-documented. What makes FMEA hard in practice is not the tool — it is the organizational commitment to use it seriously, early, and continuously.


The organizations that do FMEA well share a common leadership posture: they have decided that finding and preventing failure modes is worth the engineering hours it takes. They understand the 1-10-100 Rule not as a slogan but as a financial reality that plays out in every warranty claim, every re-inspection run, every product recall.


The Japanese proverb that has guided quality practitioners for decades captures it simply: "If you can get clean water upstream, it is needless to purify the water downstream."


FMEA is the structured practice of going upstream — deliberately, systematically, and before the cost of failure makes the choice for you.


About the Author


Allan Ung, Founder & Principal Consultant, Operational Excellence Consulting (Singapore)

Allan Ung is the Founder and Principal Consultant of Operational Excellence Consulting, a Singapore-based management training and consulting firm established in 2009. With over 30 years of experience leading operational excellence and quality transformation in manufacturing-intensive environments, Allan's expertise spans Lean Thinking, Total Quality Management (TQM), TPM, TWI, ISO systems, and structured problem solving.


He is a Certified Management Consultant (CMC, Japan), Lean Six Sigma Black Belt, TPM Instructor (Japan Institute of Plant Maintenance), TWI Master Trainer, ISO 9001 Lead Auditor, and former Singapore Quality Award National Assessor.


During his tenure with Singapore's National Productivity Board (now Enterprise Singapore), Allan pioneered Cost of Quality and Total Quality Process initiatives that enabled companies in the electrical and fabricated metals industries to reduce quality costs by up to 50 percent. In senior regional and global roles at IBM, Microsoft, and Underwriters Laboratories, he led Lean deployment, quality system strengthening, and cross-border operational transformation.


Allan's FMEA and problem-solving training programmes have been deployed by organisations including Ministry of Education, Tokyo Electron, Panasonic, Micron, Lam Research, NileDutch, Sika, Toyota Tsusho, and Nippon Paint — spanning semiconductor equipment, industrial manufacturing, automotive supply chains, and public sector service delivery.


He holds a Bachelor of Engineering (Mechanical Engineering) from the National University of Singapore and completed advanced consultancy training in Japan as a Colombo Plan scholar.


His philosophy: "Manufacturing excellence is achieved through disciplined systems, capable leadership, and sustained execution on the shopfloor."


His practitioner-led toolkits are used by managers and organisations across Asia, Europe, and North America to build quality capability and drive sustained operational improvement.


👉 Learn more at: www.oeconsulting.com.sg


Further Learning Resources


Operational Excellence Consulting offers a full catalogue of facilitation-ready training presentations and practitioner toolkits covering FMEA, problem solving, and Operational Excellence. These resources are developed from real workshops and transformation projects, helping leaders and teams embed proven frameworks and achieve sustainable improvement.


  • FMEA Training Presentation — The complete practitioner's toolkit for facilitating cross-functional FMEA sessions, including standardized scoring tables, worked examples, and a professional-grade Excel RPN calculator.

  • 8D Problem Solving Toolkit — The structured methodology for resolving quality escapes that FMEA was designed to prevent.

  • Root Cause Analysis (RCA) — Investigate failures deeply to feed accurate occurrence and control data back into your living FMEA.

  • Poka-Yoke (Mistake-Proofing) — The engineering discipline that eliminates the causes your FMEA identifies.

  • Cost of Quality (COQ) — The financial framework that gives FMEA investment its business case.

  • Total Quality Management (TQM) — The management system within which FMEA operates as a core risk prevention discipline.


👉 Explore the full library at: www.oeconsulting.com.sg
