J.H. Berk and Associates
Fault-Tree-Driven, Disciplined Failure Analysis Approach
Berk and Associates, Upland, California
One of the things that makes continuous improvement efforts simultaneously
stimulating and frustrating is what often seems to be a constant stream of
problems. Strong problem solving
skills are essential to successful continuous improvement activities.
Without these skills one is doomed to solving the same problems
repeatedly. This paper
presents a methodology for identifying and eliminating problem root causes, and
specifically, the root causes of complex systems failures.
The discussion begins with systems failure and systems failure analysis definitions. A systems failure occurs when a system does not meet its
requirements. A laser failing to
designate its target, an aerial refueling system failing to transfer fuel at the
proper flow rate, a blood chemistry analyzer failing to provide accurate test
results, a munition that detonates prematurely, and other similar conditions are
all systems failures. A systems
failure analysis is an investigation to determine the underlying reasons for the
nonconformance to system requirements. A
systems failure analysis is performed to identify nonconformance root causes and
to recommend appropriate corrective actions.
Figure 1. The Systems Failure Analysis
Process. This approach assures root
cause identification and effective corrective action implementation.
Figure 1 shows our recommended systems failure analysis approach.
Systems failure analysis begins with a clear understanding of the failure
(i.e., a definition of the problem). Once
this has been accomplished, all potential failure causes are identified using
fault tree analysis. The process
then objectively evaluates each of the potential failure causes using several
techniques, including "what's different" analysis, pedigree analysis,
failed hardware analysis, and designed experiments. These techniques help in converging on the causes of failure
among many identified potential causes. Once
the failure causes have been identified, the approach outlined herein develops a
range of corrective actions and then selects and tracks optimum corrective
actions.
Fault Tree Analysis: Identifying All
Potential Failure Causes
When confronted with a systems failure, there is often a natural tendency to begin
disassembling hardware to search for the cause.
This is a poor approach. Failed
hardware can reveal valuable information and safeguards are necessary to prevent
losing that information from careless teardown procedures.
One must know what to look for prior to disassembling failed hardware.
This is where fault tree analysis enters the picture.
Fault tree analysis is a graphical technique that identifies all potential failure
causes. The approach was developed
in the early 1960s by Bell Laboratories working with the U.S. Air Force and
Boeing on the Minuteman missile development program.
When developing this system, Boeing and the Air Force were concerned
about inadvertently launching a nuclear missile. The Air Force needed a technique that could analyze the
missile, its launch system, the crew, and all other aspects of the complete
weapon system to identify all potential causes of an inadvertent launch.
Bell Laboratories developed the fault tree technique for this purpose.
A fault tree starts with a top undesired event, which is the system failure mode
for which one is attempting to identify all potential causes.
The analysis then continues to sequentially develop all potential causes.
We'll examine a simple example to see how this is done, but first, let's consider
fault tree analysis symbology. Figure
2 shows the symbols used by the fault tree.
There are two categories of symbols:
events and gates. Let's
first consider the four different symbols for events.
The rectangle is called a command event, and it represents a condition
that is induced by the events immediately below it (we'll see how shortly). The circle represents a basic failure event (these are
typically component failures, such as a resistor failing open, or a structural
member cracking). The house
represents a normally occurring event (for example, if electrical power is
normally present on a power line, the house would be used to represent this
event). The last event symbol is
the diamond (it looks like a rectangle with the corners removed), which can
represent either a human error or an undeveloped event.
A human error might be a pilot's failure to extend the landing gear when
landing an aircraft, a technician's failure to properly adjust a variable
resistor, or a crew member inadvertently depressing a self-destruct button on a
missile control console. An
undeveloped event is one that requires no further development.
Usually command events considered extremely unlikely are designated as
undeveloped events to show that they have been considered and eliminated as
possible failure causes. Fault tree
events are linked by gates to show the relationships between the events.
There are two types of gates: "and"
gates, and "or" gates. The
"and" gate signifies that all events beneath it must occur
simultaneously to result in the event above it.
The "or" gate means that if any of the events beneath it occur,
the event above it will result.
Figure 2. Fault Tree Symbology.
Different symbols represent events and logic gates.
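The gate logic described above can be sketched in a few lines of code. This is only an illustration of the "and" and "or" rules, not part of the original methodology; the Event class, the occurs function, and the event names A and B are hypothetical names chosen for this sketch.

```python
from dataclasses import dataclass, field
from typing import List, Set


@dataclass
class Event:
    name: str
    gate: str = ""  # "and" or "or" for command events; empty for leaf events
    children: List["Event"] = field(default_factory=list)


def occurs(event: Event, active: Set[str]) -> bool:
    """Return True if the event occurs, given the names of the active leaf events."""
    if not event.children:
        # Leaf event (basic failure, human error, normal, or undeveloped event).
        return event.name in active
    results = [occurs(child, active) for child in event.children]
    if event.gate == "and":
        return all(results)  # every event beneath the gate must occur
    return any(results)      # "or" gate: any one event beneath it is enough


# Hypothetical two-input gates used only to show the difference.
both_required = Event("Command event (and)", "and", [Event("A"), Event("B")])
either_suffices = Event("Command event (or)", "or", [Event("A"), Event("B")])

print(occurs(both_required, {"A"}))    # False
print(occurs(either_suffices, {"A"}))  # True
```

With only event A active, the "and" gate does not produce the command event above it, while the "or" gate does.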
The best approach for developing the fault tree is to assemble a team consisting of
personnel with a good understanding of how the system is supposed to operate and
associated support functions. The
team should typically include an engineer, a quality engineer, a manufacturing
engineer, an assembly technician, and perhaps others, depending on the nature of
the failure.
Let's now examine how all of the above comes together to generate a fault tree
analysis. We'll consider a simple systems failure analysis.
Suppose we have a system with a light bulb that screws into a socket, and
the light bulb illuminates when someone turns a switch on.
Figure 3 shows a schematic for this system.
One day, we flip the switch and the light bulb does not come on.
Figure 3. Light Bulb Wiring Schematic.
This is the system for which we'll prepare a fault tree analysis.
The first step is to define the problem. The
problem here is that the light bulb does not illuminate.
This becomes the top undesired event in the fault tree for this system
failure, and Figure 4 shows it in a command event (the rectangle symbol).
Top undesired events are always shown in a command event symbol, as they
will be commanded to occur by events in the tree below.
Figure 4. Indicator Light Fault Tree
Analysis. This simple fault tree
develops potential causes for an indicator light system failing to illuminate.
The next step is to look for the immediately adjacent causes that can induce
the top undesired event. This is a
critically important concept. A
common shortcoming is to jump around in the system, and start listing things
like a power loss in the building, a failed switch, and perhaps other events,
but the fault tree requires discipline. One
has to look for the internal or immediately adjacent causes.
An approach for doing this is to imagine yourself as the light bulb,
screwed into the socket, and ask "what can happen in me or right next to
me to prevent me from illuminating?" If one considers only these conditions, the answers are:
An open light bulb filament
Contaminated terminals in the socket
A light bulb that's not fully screwed into the socket
No electrical energy from the socket
We show these events immediately below the top undesired event and determine which
symbol is appropriate for each. The
open filament is a basic component failure, so it goes in a circle symbol. Contaminated terminals in the socket could be caused by a
variety of conditions, but for the purposes of this analysis we won't fully
develop these, and we'll put contaminated terminals in an undeveloped event
symbol (the diamond). Not fully
screwing the bulb into the socket is a human error, so it goes into a human
error symbol (also a diamond). Finally,
no energy from the socket is a condition that will be commanded to occur if
other events occur elsewhere in the system.
This event becomes a command event, and it goes into a rectangle.
The above events are all of the internal or immediately adjacent conditions that can
cause the light bulb to fail to illuminate, and this nearly completes the first
tier of the fault tree for this undesired event.
To complete this tier, we have to link these internal and immediately
adjacent events to the command event above them.
Either an "and" gate or an "or" gate will be used.
The question is: Will any of the events below the top undesired event result
in the top undesired event, or are all events below the top undesired event
required to result in the top undesired event?
In this analysis, any of the events below the top undesired event will
result in the light bulb failing to illuminate, so the "or" gate is
used.
The fault tree analysis continues by developing the potential causes for the next
command event, which in this case is the event that appears below the top
undesired event on the fault tree's first tier:
No electrical energy available in the socket.
We now need to identify all conditions internal to and immediately
adjacent to the socket. In this
case, the socket can be disconnected from the wiring, or the wiring can have no
power delivered to it, or the wiring could have a short circuit, or the wiring
could break open. The socket being
disconnected from the wiring would probably be the result of a human error, so
this event is shown in a diamond. The
wiring having a short circuit is a basic component failure (the wiring
insulation fails), so it is shown in a circle.
The same is true for the wiring failing open.
No power to the wiring is commanded to occur by conditions elsewhere in
the system, so it is shown as a command event.
Any of these conditions can cause the command event immediately above
them (no electrical energy in the socket), so an "or" gate is used to
show the relationship.
We continue the analysis by identifying all internal or immediately adjacent
conditions to the wiring. The
internal wiring conditions have already been addressed (the wiring failing open
or having a short circuit), so only the immediately adjacent conditions need be
shown. This brings us to the next
element in the circuit, which is the switch.
The switch can fail open (this is a basic switch component failure, so
it's shown in a circle). The switch
can have no power delivered to it (this is a command event).
Finally, the system operator might forget to turn the switch on (this is
a human error). Any of these
conditions can induce the "no power to wiring" command event shown
immediately above these events, so again, an "or" gate is used.
The same kinds of wiring failures are shown for the wiring leading from the
electrical power source to the switch (these are similar to the types of
failures we developed for the wiring earlier).
The only other condition that can cause no power to this set of wiring is
no power from the power source. Since
there is no additional information about the power source, it is shown as an
undeveloped event. Any of the
conditions can induce no power on the wiring, so an "or" gate is used.
At this point, the fault tree logic for our simple example is completed.
What does this mean? With
the data available, the fault tree started with a definition of the failure
(which became the top undesired event in the fault tree) and systematically
developed all potential causes of the failure.
It is important to note that the fault tree logic development started at
the point the failure appeared (in this case, a light bulb that failed to
illuminate), and then progressed through the system in a disciplined and
systematic manner. The fault tree
logic followed the system design. Systematically
working from one point to the next when constructing a fault tree forces the
analyst to consider each part of the system and all system interfaces (such as
where the switch interfaced with a human being).
This is key to successful fault tree construction, and in taking
advantage of the fault tree's ability to help identify all potential failure
causes.
Before leaving the fault tree, there's one more task, and that's assigning a unique
number to each of the basic events, human errors, normal events, and undeveloped
events. These will be used for
tracking purposes in a management tool called the failure mode assessment and
assignment matrix, which will be explained in the next section.
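As a rough illustration, the indicator light fault tree developed above can be written as a nested data structure and its leaf events numbered automatically. This is a sketch under stated assumptions: the leaf and command helpers, the event wording, and the depth-first numbering order are ours, and the numbers actually used in Figure 4 may differ.

```python
def leaf(desc, symbol):
    """A leaf event: basic failure, human error, or undeveloped event."""
    return {"desc": desc, "symbol": symbol, "gate": None, "children": []}


def command(desc, gate, children):
    """A command event induced by the events beneath its gate."""
    return {"desc": desc, "symbol": "command", "gate": gate, "children": children}


tree = command("Light bulb does not illuminate", "or", [
    leaf("Open light bulb filament", "basic"),
    leaf("Contaminated socket terminals", "undeveloped"),
    leaf("Bulb not fully screwed into socket", "human error"),
    command("No electrical energy in socket", "or", [
        leaf("Socket disconnected from wiring", "human error"),
        leaf("Wiring short circuit (switch to socket)", "basic"),
        leaf("Wiring open circuit (switch to socket)", "basic"),
        command("No power to wiring", "or", [
            leaf("Switch fails open", "basic"),
            leaf("Operator does not turn switch on", "human error"),
            command("No power to switch", "or", [
                leaf("Wiring short circuit (source to switch)", "basic"),
                leaf("Wiring open circuit (source to switch)", "basic"),
                leaf("No power from power source", "undeveloped"),
            ]),
        ]),
    ]),
])


def number_leaves(node, rows=None):
    """Walk the tree depth-first and give each leaf event a unique number."""
    if rows is None:
        rows = []
    if not node["children"]:
        rows.append((len(rows) + 1, node["desc"], node["symbol"]))
    for child in node["children"]:
        number_leaves(child, rows)
    return rows


numbered = number_leaves(tree)
for number, desc, symbol in numbered:
    print(number, desc, "(" + symbol + ")")
```

Command events are not numbered here because they are fully explained by the events beneath them; only the eleven leaf events need to be tracked individually.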
Failure Mode Assessment and Assignment Matrix
After completing the fault tree, the next step is to prepare the failure mode
assessment and assignment matrix (the FMA&A).
As Figure 5 shows, the FMA&A is a four-column matrix that identifies
the fault tree event number, the fault tree event description, an assessment of
the likelihood of each event, and what needs to be done to evaluate each event.
The FMA&A becomes a table based on the outputs of the fault tree that
lists each potential failure mode. Figure
5 shows an FMA&A prepared for the indicator light fault tree analysis
developed in the preceding pages. The
FMA&A in Figure 5 shows what actions are required for evaluating each
indicator light potential failure cause, and it provides a means of keeping
track of the status of these actions.
bulb for open filament. Rodriguez; 16 March 1993.
socket for contaminants. Perform FTIR analysis on any contaminants observed in socket. Rodriguez; 16 March 1993.
Bulb Not Fully Screwed In: bulb in socket to determine if properly installed. Smith; 14 March 1993.
Disconnected From Wiring: wiring and perform continuity test. Ashoggi; 16 March 1993.
wiring and perform continuity test. Ashoggi; 16 March 1993.
wiring and perform continuity test. Ashoggi; 16 March 1993.
Does Not Activate Switch: operator and check switch function. Rodriguez; 16 March 1993.
switch function. Rodriguez; 16 March 1993.
wiring and perform continuity test. Ashoggi; 16 March 1993.
wiring and perform continuity test. Ashoggi; 16 March 1993.
Power From Power Source: power supply with multimeter. Ashoggi; 14 March 1993.
Figure 5. Failure Mode Assessment and
Assignment Matrix for the Indicator Light Failure Analysis. The FMA&A matrix identifies potential failure causes,
likelihood assessments, and assigned investigatory actions.
The third column in the FMA&A, the likelihood assessment, lists the failure
analysis team's assessment of each potential failure cause being the actual
cause of the failure. Usually,
failure analysis teams list each hypothesized failure cause as likely, unlikely,
or unknown. When the FMA&A
matrix is first prepared, most of the entries in this column should be listed as
unknown, since at this point no work beyond the fault tree construction and
initiation of the FMA&A has been started.
The last FMA&A column (the assignment column) lists the assignments agreed to by
the failure analysis team members to evaluate whether each hypothesized failure
mode actually caused the observed failure.
This ties back to our earlier discussion in which we advised against
tearing a system apart immediately after a failure without knowing what to look
for. The fault tree and FMA&A
provide this information. These
analysis tools reveal to the analysts what to look for when disassembling the
system, as well as when conducting other activities to evaluate the likelihood
of each potential failure cause. The
assignment column of the FMA&A defines the actions necessary to look for and
determine if each hypothesized failure cause did or did not contribute to the
failure. We recommend that failure
analysis team members also indicate in the fourth FMA&A column who has
responsibility for each assignment and required completion dates for these
assignments. The assignment column
is also used for general comments describing progress in evaluating each event,
significant findings, and other information.
A review of this column should provide a general indication of the status of an
ongoing failure analysis.
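For teams that keep the FMA&A electronically, the four columns map naturally onto a small record type. The following is a minimal sketch, not a prescribed format: the FmaaRow class, the status_summary and open_items helpers, and the sample entries (including the "unlikely" update) are hypothetical and serve only to show how the matrix can report analysis status.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass
class FmaaRow:
    event_number: int
    description: str
    likelihood: str = "unknown"  # "likely", "unlikely", or "unknown"
    assignment: str = ""         # action, responsible person, due date, comments


def status_summary(rows):
    """Count how many potential causes fall under each likelihood assessment."""
    return Counter(row.likelihood for row in rows)


def open_items(rows):
    """Event numbers that have not yet been assessed as likely or unlikely."""
    return [row.event_number for row in rows if row.likelihood == "unknown"]


# Illustrative rows; every entry starts as "unknown" when the matrix is first prepared.
rows = [
    FmaaRow(1, "Open light bulb filament", assignment="Inspect filament (hypothetical wording)"),
    FmaaRow(2, "Contaminated socket terminals"),
    FmaaRow(3, "Bulb not fully screwed into socket"),
]

# Hypothetical update recorded during a failure analysis meeting.
rows[0].likelihood = "unlikely"

print(status_summary(rows))
print(open_items(rows))
```

Keeping the likelihood column restricted to three values makes it easy to see at a glance how many potential causes still need evaluation.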
Many people find it effective to update the FMA&A during failure analysis
meetings instead of taking notes, editing the FMA&A after the meeting, and
distributing the FMA&A at some later date.
Most word processing packages include a tables feature, and if possible,
a computer should be used during the meeting to update the failure analysis
status in real time. In this
manner, an updated FMA&A can be printed immediately at the end of each
failure analysis meeting (this helps to keep the failure analysis team members
focused and sustain failure analysis momentum).
After the failure analysis team has completed the above steps, the team has a fault
tree that identifies each potential failure cause and an FMA&A that provides
a management tool for evaluating each potential cause. At this point, it becomes necessary to turn to the family of
supporting technologies shown in Figure 1 to complete the FMA&A, and in so
doing, converge on the true failure causes.
Each of these is discussed below.
The "what's different?" analysis is a technique that identifies changes
that might have induced the failure. The
basic premise of this analysis is that the system has been performing
satisfactorily until the failure occurred; therefore, something must have
changed to induce the failure. Potential
changes include system design, manufacturing processes, suppliers, operators,
hardware lots, and perhaps other factors.
The "what's different?" analysis will almost certainly identify changes.
As changes are identified they should be evaluated against the potential
failure causes identified in the FMA&A.
One has to do this in a systematic manner.
Changes are always being introduced, and when a change is discovered, it
doesn't necessarily mean that it caused the failure.
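One systematic way to evaluate changes against the FMA&A is to pair each change made before the failure with the potential failure causes it could affect. The sketch below assumes a simple keyword match between a change's affected component and the event description; the changes_to_evaluate function, the record fields, the dates, and the sample changes are all hypothetical.

```python
from datetime import date


def changes_to_evaluate(changes, fmaa_events, failure_date):
    """Pair each change made on or before the failure date with the FMA&A events it touches."""
    pairs = []
    for change in changes:
        if change["date"] > failure_date:
            continue  # a change made after the failure cannot have induced it
        for number, description in fmaa_events.items():
            # Crude assumption: a change is relevant if its component is named in the event.
            if change["component"].lower() in description.lower():
                pairs.append((number, change["summary"]))
    return pairs


# Hypothetical FMA&A events and change records for illustration only.
fmaa_events = {1: "Open light bulb filament", 8: "Switch fails open"}
changes = [
    {"date": date(1993, 3, 1), "component": "switch", "summary": "New switch supplier"},
    {"date": date(1993, 3, 20), "component": "bulb", "summary": "New bulb lot"},
]

print(changes_to_evaluate(changes, fmaa_events, failure_date=date(1993, 3, 10)))
# [(8, 'New switch supplier')]
```

A match only flags a change for evaluation; as noted above, discovering a change does not by itself show that the change caused the failure.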
Design changes can be identified by talking to the engineers assigned to the system.
Procurement specialists should be asked to talk to suppliers, as changes
may have occurred in procured components or subassemblies.
We've found that it also makes sense to talk to the people responsible
for maintaining the engineering drawings (this function is normally called
configuration management or document control).
The people responsible for maintaining the engineering design package
normally keep records of all design changes, and they can frequently identify
changes the design engineers may not remember.
Manufacturing process changes are sometimes more difficult to identify.
Assembly technicians, their supervisors, and inspectors may be able to
provide information on process changes. The
manufacturing engineers and quality engineers assigned to the system may have
information on process changes. Most
companies have written work instructions, and sometimes records are kept when
these documents are changed, so the work instruction history should also be
researched. Tooling changes can be
similarly identified. Companies
often keep tooling release records, so these, too, should be reviewed.
Here's another tip: Look
around the work area for tools not identified in the assembly instructions or
otherwise authorized. Sometimes
manufacturing personnel use their own tools, and these might be changed without
If the system in which the failure occurred was manufactured in a facility that
uses statistical process control, information on process changes may be more readily available.
Ordinarily, process changes are noted on statistical process control
charts.
Another area to search for potential changes is the system operating environment.
Sometimes a subtle environment change is enough to induce a failure.
It makes sense to evaluate the environment in which the failed system
operated against the environmental capabilities of the system.
Even though the system may have been operating in its specified
environment when the failure occurred, it is not unusual to discover that as a
result of a design oversight, one or more of the system's components or
subassemblies are being asked to operate outside of their rated environment.
For example, power supplies in enclosed electronic cabinets may be
operating in temperatures higher than the designer intended.
Before leaving the "what's different?" analysis, a few caveats are in order.
Recall that at the beginning of this discussion we identified a basic
premise, which was that a change occurred to induce the failure. This
may not be the case. Sometimes
nothing has changed and the failure resulted from normal statistical variation.
There's another possibility, and that is that the failure may have been
occurring all along, but it was not previously detected.
Pedigree analysis examines all paperwork related to the components and subassemblies
identified in the fault tree and the FMA&A.
Due to normal quality assurance and other record keeping requirements,
most companies have a fairly extensive paper trail. Pedigree analysis involves studying this paperwork to
determine if it shows that the components and subassemblies identified in the
fault tree meet requirements. This
paperwork can include test data, inspection data, raw material data sheets, and
The data described above should be examined for evidence showing nonconformances or
other inconsistencies. It's not
uncommon for test or inspection sheets to show that a part or subassembly did
not meet requirements but was accepted in error.
One has to be on guard for unrelated findings.
We recommend only examining the data for the parts or subassemblies
identified by the fault tree analysis (and consequently, the FMA&A).
Since the fault tree identifies all potential causes of the observed
failure, it isn't necessary to explore other data.
If a nonconformance is indicated in the data sheets, it should be
addressed, but the nonconformance has to be compared to the
fault-tree-identified potential failure cause.
Nonconforming conditions may not have caused the failure (we've found
that when performing pedigree analysis, one often finds other problems).
These need to be corrected, but they may not be related to the systems
failure being analyzed. Pedigree
analysis should also review prior quality records for the system and all
components and subassemblies identified by the fault tree as potential failure
causes to identify any prior similar failures.
In many cases, specific systems failure modes will not be new, and
reviewing the findings of previous analyses will support the failure analysis
Designed experiments can include tests designed to induce a failure or special tests
based on analysis of variance or other statistical techniques.
Tests designed to induce a failure can often be used to evaluate a
hypothesized failure cause. By configuring the hardware with the hypothesized failure
cause present, one can determine if the previously-observed failure mode results
(i.e., the one that initiated the systems failure analysis).
The disadvantage of this type of test is that it tends to require
absolute failure causes (in other words, the hypothesized failure cause must
induce the observed failure mode all of the time).
In many cases, hypothesized failure modes can induce the observed failure
mode, but they may not do so all of the time.
Another disadvantage of this type of test is that it can typically only
evaluate one hypothesized failure cause at a time.
In so doing, the effects of other contributory conditions often cannot be
evaluated.
Taguchi tests (a specialized analysis of variance technique) are useful for evaluating
the effects of several variables and their interactions simultaneously with a
minimum number of test samples. Frequently,
when evaluating all of the hypothesized failure causes in the FMA&A, one
eliminates most of the causes but several remain. In many instances, more than one of the remaining
hypothesized failure causes contributes to the failure.
Because the Taguchi test technique evaluates several variables
simultaneously, it can be used for assessing the relative impact of each of the
remaining potential failure causes.
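To give a feel for how several variables can be evaluated with few samples, the sketch below uses a four-run, two-level orthogonal array (commonly called an L4) for three remaining hypothesized causes. It is a simplification: it computes only mean-difference main effects rather than a full analysis of variance, it does not separate interactions from main effects, and the response values are hypothetical.

```python
# L4 orthogonal array: 4 runs, 3 two-level factors (0 = cause absent, 1 = cause present).
# Every pair of columns contains each level combination exactly once.
L4 = [
    (0, 0, 0),
    (0, 1, 1),
    (1, 0, 1),
    (1, 1, 0),
]


def main_effects(array, responses):
    """Mean response with each factor present minus mean response with it absent."""
    effects = []
    for factor in range(len(array[0])):
        present = [r for run, r in zip(array, responses) if run[factor] == 1]
        absent = [r for run, r in zip(array, responses) if run[factor] == 0]
        effects.append(sum(present) / len(present) - sum(absent) / len(absent))
    return effects


# Hypothetical responses (for example, failures observed per run), one per array row.
responses = [2.0, 4.0, 8.0, 10.0]

print(main_effects(L4, responses))  # [6.0, 2.0, 0.0]
```

In this made-up data set the first hypothesized cause dominates, the second contributes modestly, and the third shows no effect, which is the kind of relative ranking the technique is meant to provide.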
We want to again emphasize deferring failed hardware analysis until after all
potential failure modes have been identified and the failure analyst knows
exactly what to look for in the failed hardware.
The fault tree analysis and FMA&A allow development of a logical
hardware teardown and inspection process.
We recommend using this information to prepare written disassembly
instructions and an inspection data sheet.
Photodocumenting the disassembly process and all hardware as the analysis
progresses also makes sense. The
analyst may later become aware of additional potential failure causes, and
photodocumenting the teardown process allows one to re-examine the hardware as
it was removed from the system. There
are several tools available to assist the hardware analysis.
These fall into several categories:
magnification equipment, material analysis equipment, general measuring
equipment (used for evaluating compliance to engineering design requirements),
x-ray systems, and other specialty equipment.
Optical microscopes and scanning electron microscopes are two magnification tools that
have gained wide acceptance. Optical
microscopes are available in most companies, and they permit greatly magnified
observations of suspect components. Sometimes
tiny defects or witness marks that cannot be seen with the naked eye are visible
under magnification (these defects or other marks often support or refute
hypothesized failure causes). If greater magnification is required, the scanning electron
microscope has proven to be a valuable tool.
Scanning electron microscopes bounce electrons off the surface being
examined to produce an image, and they can magnify images up to one million
times their actual size. Scanning
electron microscopes are generally available in larger companies, and there are
many commercial metallurgical and other failure analysis laboratories that offer
scanning electron microscopy services.
The most common material analysis tools are energy dispersive x-ray analysis,
Fourier transform infrared spectroscopy, and other forms of spectroscopy.
The operating principles of these tools and their limitations are beyond
the scope of this article, but the analyst should recognize that such tools are
available (perhaps not in your organization, but certainly in others
specializing in this business). Materials analysis tools help to evaluate if the correct
materials were used, or if contaminants are present.
General measuring equipment consists of the standard rules, micrometers, gages, hardness
testers, scales, optical comparators, and coordinate measuring machines
available in almost every organization's quality assurance department.
This is where quality assurance failure analysis team members can lend
tremendous support. For most systems failures, the fault tree analysis and FMA&A
will identify many components, which, if nonconforming, could have caused the
failure. The quality assurance
organization can support the failure analysis effort by inspecting each of the
components hypothesized as potential failure causes to determine compliance with
the engineering design.
X-ray analysis is frequently useful for determining if subsurface or other hidden
defects exist. X-rays can be used
for identifying weld defects, internal structural flaws, or the relationship of
components inside closed structures. This
technique is particularly valuable for observing the relationships among
components in closed structures prior to beginning the disassembly process.
Sometimes the disassembly process disturbs these relationships
(especially if the structure was not designed to be disassembled), and x-rays
can reveal information not otherwise available.
X-ray services (including portable x-rays) are usually available through
commercial failure analysis laboratories.
Videotaping or filming moving machinery is another specialized technique that can reveal
interactions not readily apparent when the machinery is stationary.
Many videotape players and film projectors allow the recording to be shown in slow
motion, which also helps to evaluate potential failure modes.
All of the failed hardware should be identified and bonded so that it is available
at a later time if the need arises. Sometimes
new potential failure modes are identified as the failure analysis progresses
and it might become necessary to re-evaluate hardware previously examined.
We recommend keeping failed hardware until corrective action has been
implemented and its effectiveness in eliminating failures is confirmed.
Toward Failure Analysis Completion
As the potential failure causes identified by the fault tree analysis and the FMA&A
are evaluated, most of the potential failure causes are typically
ruled out. In most cases, a few
potential causes remain. One
approach is to perform additional specialized testing as described earlier to
converge on the actual cause or causes of failure.
Another approach is to implement a set of corrective actions that
addresses each of the remaining unconfirmed potential failure causes.
Either approach is acceptable, although one should take precautions to
assure that the selected corrective actions do not induce other system problems.
In most cases, failure analysis teams will find a confirmed failure cause
during their systematic evaluation of each potential failure cause in the FMA&A.
The natural tendency is to conclude the failure analysis as soon as a
confirmed failure cause is found without continuing to evaluate the remaining
potential failure causes contained in the FMA&A.
We advise against concluding the failure analysis until all
potential causes are evaluated. The
reason for this is that multiple failure causes frequently exist.
In a recent circuit card failure analysis, the failure analysis team
performed a fault tree analysis and identified 87 potential failure causes. In systematically evaluating each of these, the team found
that six of the potential causes were actually present. Any one of these failures was sufficient to induce the
failure mode exhibited by the circuit card.
If the team had stopped when they confirmed the first cause, the other five
causes would have remained, and the circuit card failures would have continued to
occur.
Selecting, and Implementing Corrective Action
There are usually multiple corrective action options.
The most preferable corrective actions are those that eliminate failure
root causes through redesign. In so
doing, these corrective actions totally eliminate reliance on people to perform
in a specific manner such that the problem is eliminated (a practice often
referred to as making a product, process, or service "idiot proof").
The least preferable corrective actions are those that do not eliminate
the problem at its root cause, but instead rely on people to perform special
actions to guard against the problem recurring.
The corrective action order of precedence (from the most
preferable to the least preferable category of problem solutions) is described
below.
Upgrades To Eliminate Or Mitigate The Problem.
This category of corrective actions modifies the product, process, or
service to eliminate the features that induced the problem.
When a product, process, or service is modified to eliminate the
feature inducing the problem, the problem is attacked at its root cause.
In many instances, a problem is only a problem because the product,
process, or service does not meet a specification.
For example, the problem may be that a process has a yield of only 94
percent, when a yield of 95 percent is required by the customer.
One solution is to negotiate a relaxation of the requirement such
that a yield of 94 percent becomes acceptable.
Obviously, such an approach is not entirely responsive to the
customer's requirements, and as such, it tends to depart from the TQM
philosophy of focusing on customer needs and expectations.
Nonetheless, under certain circumstances requirements relaxation may
be the only practical approach, and as such, where requirements relaxation
makes sense it should be considered.
In many instances, problems can be eliminated by providing training
to customers, assemblers, or other personnel to control the circumstances
that could induce a problem. One
company we worked with experienced multiple instances of hydraulic fluid
leakage on an aerial refueling system during system checkout prior to
delivering the product to the customer.
Application of the systems failure analysis process described in this
paper revealed that hydraulic assembly personnel were unfamiliar with proper
hydraulic line installation procedures.
After training the hydraulic assembly personnel on proper hydraulic
fitting and line installation techniques, the leakage rate during system
checkout dropped significantly. Training
might appear less desirable than redesigning the
system so that it becomes insensitive to installation technique, but
neither the company producing the system nor the customer considered
redesigning the aerial refueling system feasible in this situation.
Testing or Inspection.
Under certain circumstances, falling back on sorting good product
from bad through additional testing or inspection may be the most expedient
solution. Again, this is
counter to the TQM prevention (rather than detection) philosophy, and for
that reason, additional testing or inspection is usually not an acceptable
way of doing business other than in unusual circumstances.
To illustrate the concept, we recall an example in which a munitions
manufacturer procured a large number of detonators and found during
acceptance testing that some of them were defective.
The defect was induced by concavity in the detonator output surface,
which resulted in a failure to reliably initiate an explosion (in this case,
that's what the device was supposed to do).
The munitions manufacturer had to meet a tight delivery schedule with
its customer, the U.S. Air Force. The
solution to this problem was to inspect the entire group of detonators for
surface flatness, and to only use those that were acceptably flat.
That was the short term solution, and it was selected to allow
continuing production until the problem was eliminated at its root cause.
The long term solution (which eliminated the root cause) resulted
from the munitions manufacturer working with the detonator manufacturer to
modify the detonator production process.
The process modification eliminated the process features that induced
the concavity.
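The short term screening step in the detonator example amounts to a simple sort against an acceptance limit. The following is a minimal sketch, assuming each detonator has a measured concavity value and that an acceptance limit has been set. The serial numbers, measurements, units, and limit are hypothetical and are not taken from the actual program.

```python
from typing import Dict, List, Tuple


def screen_lot(concavity_by_serial: Dict[str, float],
               max_concavity: float) -> Tuple[List[str], List[str]]:
    """Split a lot into accepted and rejected serial numbers.

    A unit is accepted only if its measured concavity does not exceed
    the limit. Both the measurements and the limit are hypothetical.
    """
    accepted = []
    rejected = []
    for serial, concavity in sorted(concavity_by_serial.items()):
        if concavity <= max_concavity:
            accepted.append(serial)
        else:
            rejected.append(serial)
    return accepted, rejected


# Hypothetical measurements (arbitrary units) for a four-unit lot.
lot = {"D-001": 0.001, "D-002": 0.006, "D-003": 0.002, "D-004": 0.009}
accepted, rejected = screen_lot(lot, max_concavity=0.003)
print("use:", accepted)
print("set aside:", rejected)
```

Screening of this kind only sorts good product from bad. It leaves the process features that induce the defect in place, which is why it serves as an interim measure while the root cause is being eliminated.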
Cautions or Warnings.
Another category of solutions includes incorporating cautions or
warnings on products or in related documentation to prevent problems. This is not a preferred solution, but there may be
circumstances that require such a solution.
We've probably all seen high voltage warnings on electrical
equipment. Wouldn't it be
better to simply insulate the high voltage areas such that it became
impossible to contact a high voltage surface?
The answer, obviously, is yes, but doing so may not always be
practical: technicians may require access to the system when it is energized, and it
may not be possible or practical to isolate or insulate high voltage
surfaces inside an electrical device. Again,
if it's practical to eliminate the problem through a design change, then
that is the most preferable solution.
Operational or Process Actions.
The last category, and the least preferable from a long term
perspective, is to rely on special operational or process steps as a problem
solution. When the U.S. Army procured Beretta 9mm service handguns, the
initial shipments experienced structural failures after firing only a few
thousand rounds. The Army's
interim solution was a special operational action, which required replacing
the handguns' slides after firing a specified number of rounds.
The Army's long term solution was to eliminate the metallurgical
deficiency that allowed the failures to occur.
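The order of precedence above can be captured as an ordered enumeration so that candidate corrective actions are ranked consistently. The following is a minimal sketch and not part of the original methodology: the category names are labels chosen here to match the paragraphs above, the `feasible` flag stands in for whatever economic or technical evaluation a team actually performs, and the leak check option is a hypothetical addition to the aerial refueling example.

```python
from dataclasses import dataclass
from enum import IntEnum
from typing import Iterable, List


class CorrectiveActionCategory(IntEnum):
    """Categories in order of precedence; a lower value is more preferable."""
    UPGRADE = 1
    REQUIREMENTS_RELAXATION = 2
    TRAINING = 3
    TESTING_OR_INSPECTION = 4
    CAUTIONS_OR_WARNINGS = 5
    OPERATIONAL_OR_PROCESS_ACTION = 6


@dataclass(frozen=True)
class CorrectiveActionOption:
    description: str
    category: CorrectiveActionCategory
    feasible: bool  # assumed outcome of a separate feasibility evaluation


def rank_options(options: Iterable[CorrectiveActionOption]) -> List[CorrectiveActionOption]:
    """Return the feasible options, most preferable category first."""
    feasible = [option for option in options if option.feasible]
    return sorted(feasible, key=lambda option: option.category)


options = [
    CorrectiveActionOption("Train assemblers on hydraulic line installation",
                           CorrectiveActionCategory.TRAINING, True),
    CorrectiveActionOption("Redesign to be insensitive to installation technique",
                           CorrectiveActionCategory.UPGRADE, False),
    CorrectiveActionOption("Add a leak check at system checkout (hypothetical)",
                           CorrectiveActionCategory.TESTING_OR_INSPECTION, True),
]
ranked = rank_options(options)
print(ranked[0].description)
```

With the redesign marked infeasible, as it was in the aerial refueling example, training becomes the highest ranked remaining option.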
Although it may sometimes be necessary to implement less preferable solutions (due to
economic or other reasons), in all cases the long term solution should migrate
toward a product or process change that eliminates the root cause of the
problem.
Once corrective actions have been identified, evaluated, and selected, the final
steps consist of implementing the corrective action and evaluating its
effectiveness. This may seem so
obvious as to almost be insulting, but our experience has shown that in many
instances these last two crucial actions are not accomplished.
This is particularly true in larger organizations, where it is sometimes
easier to make assumptions about other people implementing corrective action.
We recommend assigning specific actions to assure corrective action
implementation, and monitoring the situation to assure that the problem has in
fact been eliminated.
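One way to keep these last two steps from being assumed away is to record each corrective action with a named owner and an explicit status. The following is a minimal sketch under stated assumptions: the field names, the status wording, and the observation counts are hypothetical, and a real program would apply its own criteria for deciding when a problem counts as eliminated.

```python
from dataclasses import dataclass


@dataclass
class CorrectiveActionItem:
    description: str
    owner: str                     # person assigned to implement the action
    implemented: bool = False
    units_observed_after: int = 0  # units checked since implementation
    failures_after: int = 0        # recurrences seen in those units

    def status(self) -> str:
        """Report where the action stands in the implement-then-evaluate cycle."""
        if not self.implemented:
            return "open"
        if self.units_observed_after == 0:
            return "implemented, not yet evaluated"
        if self.failures_after > 0:
            return "implemented, problem recurring"
        return "implemented, no recurrence observed"


item = CorrectiveActionItem("Train hydraulic assembly personnel",
                            owner="assembly supervisor")
print(item.status())            # not yet implemented
item.implemented = True
print(item.status())            # implemented, but effectiveness unknown
item.units_observed_after = 25  # hypothetical number of checkouts observed
print(item.status())
```

Keeping "implemented" and "evaluated" as separate states makes it visible when an action has been put in place but nobody has yet confirmed that the failure stopped recurring.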
For many companies that manufacture complex systems, a more sophisticated approach
is required to arrive at the root causes of systems failures.
The approach presented in this paper offers sophisticated techniques for
identifying potential failure causes and evaluating corrective actions.
Fault tree analysis is used to define all potential causes of failure and
the FMA&A is used to guide the systematic evaluation of each of these
potential causes. Several
supporting technologies ("what's different" analysis, pedigree
analysis, designed experiments, and hardware analysis) are used to assist in
this systematic evaluation. Once the failure causes have been identified, corrective
action options should be developed, evaluated, implemented, and evaluated again
after implementation to assure the failure has been eliminated.
The systems failure analysis process outlined in this paper provides a
systematic approach for converging on the most likely failure causes and
implementing effective corrective actions.