The life cycle of electronic and other types of instrumentation
commonly follows the well-known bathtub reliability curve. The name
comes from the curve’s shape, which resembles a bathtub. The bathtub
curve can be divided into three periods or phases: the infant mortality
period, the useful life period, and the wear-out period. These periods are
illustrated in a graph of failure or hazard rate h(t) versus time (t) in Figure
2-1. For some devices, the failure rate is measured per count, operation,
mile, or revolution rather than per unit of time. An
example is an electromechanical relay, for which the failure rate is
stated in failures per mechanical operation and failures per electrical
operation.
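As an illustration of the three periods, a bathtub-shaped hazard rate can be sketched as the sum of three Weibull hazard terms: one with a shape parameter below 1 (falling rate, infant mortality), one with a shape parameter of 1 (constant rate, useful life), and one with a shape parameter above 1 (rising rate, wear-out). The Weibull form, the function names, and every parameter value below are assumptions chosen for illustration; they are not taken from Figure 2-1.

```python
def weibull_hazard(t, shape, scale):
    """Weibull hazard rate h(t) = (shape/scale) * (t/scale)**(shape - 1), for t > 0."""
    return (shape / scale) * (t / scale) ** (shape - 1)


def bathtub_hazard(t):
    """Illustrative bathtub hazard rate (failures per hour) at t > 0 hours."""
    # Assumed parameters, picked only to produce a bathtub shape.
    infant_mortality = weibull_hazard(t, shape=0.5, scale=2.0e6)  # decreasing
    useful_life = weibull_hazard(t, shape=1.0, scale=1.0e5)       # constant
    wear_out = weibull_hazard(t, shape=5.0, scale=6.0e4)          # increasing
    return infant_mortality + useful_life + wear_out


if __name__ == "__main__":
    for hours in (10, 100, 1_000, 10_000, 40_000, 60_000):
        print(f"t = {hours:>6} h   h(t) = {bathtub_hazard(hours):.2e} failures/h")
```

With these assumed parameters, the printed rate starts high, flattens near the constant term, and then climbs again, which is the shape the three periods describe.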
The infant mortality period, shown as Area “A” in Figure 2-1, occurs
early in the instrument’s life, normally within the first few weeks or
months. For the user, this type of failure typically occurs during the
factory acceptance test (FAT), during staging, or just after installation.
Failures during this period are primarily due to manufacturing defects or
mishandling before or during installation. Most manufacturing defects are
caught before the instrument is shipped to you, through the manufacturer's
testing and burn-in procedures. Be careful with rushed or expedited
shipments, though, because vendors may bypass some of their testing and burn-in
procedures to satisfy your schedule. Mishandling is more difficult to
control. Inspection, observation, and care before and during installation
can minimize mishandling.
The second phase on the bathtub curve is the useful life period,
shown as Area “B” in Figure 2-1. This is where the failure rate, called the
random failure rate (λ), remains constant. The time length of this period is
considered the useful life of the instrument. Normal failures during this
period are considered to be statistically random. Repairing an instrument
that fails during this period, rather than replacing it, effectively restores
its reliability. Often, individual instruments that could be repaired are
simply replaced for expediency. In that case the instrument is effectively
nonrepairable from the user's point of view, but the overall system is still repairable.
Measures of Reliability
An important concept to understand during this period is the
instrument’s mean-time-to-failure (MTTF), a measure of reliability of the
instrument during its useful life period. The MTTF is the inverse of the
failure rate (1/λ) during the constant-failure-rate period. The MTTF is not
related to the useful life of the instrument, which is the time between the
end of the infant mortality period and the beginning of the wear-out
period. A device could have an MTTF of 100,000 hours but a useful life of
only three years. This means that during the three years of its useful life,
the device is unlikely to fail, but it may fail rather rapidly once it enters its
wear-out period.
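The relationship between failure rate, MTTF, and useful life in this example can be checked numerically. The sketch below assumes the standard constant-failure-rate (exponential) model, in which the probability of surviving to time t is exp(-λt). The function names and the 8,760-hour year are illustrative choices, not part of the original text.

```python
import math

HOURS_PER_YEAR = 8760  # 365 days x 24 hours (assumed continuous operation)


def failure_rate_from_mttf(mttf_hours):
    """Constant failure rate (failures per hour) as the inverse of the MTTF."""
    return 1.0 / mttf_hours


def survival_probability(mttf_hours, hours):
    """Probability of no random failure in `hours`, assuming a constant failure rate."""
    lam = failure_rate_from_mttf(mttf_hours)
    return math.exp(-lam * hours)


if __name__ == "__main__":
    mttf = 100_000                       # hours, from the example in the text
    useful_life = 3 * HOURS_PER_YEAR     # three years, from the example in the text
    p = survival_probability(mttf, useful_life)
    print(f"MTTF = {mttf} h is about {mttf / HOURS_PER_YEAR:.1f} years")
    print(f"Probability of surviving the 3-year useful life: {p:.3f}")
```

Under this model the device has a survival probability of about 0.77 over its three-year useful life. The MTTF describes the failure rate during the useful life, not how long that period lasts.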
Another example illustrating the difference between MTTF and
useful life is the human death rate, the failure rate of a human “instrument.”
For humans in their thirties, this rate is estimated to be 1.1 deaths per 1,000
person-years, which corresponds to an MTTF of about 909 years. This is much longer than our
“useful life,” which is usually less than 100 years. In other words, in their
middle years people are very “reliable” (subject only to the random failure
rate). But past that, in their wear-out period, their reliability decreases
rapidly. Another example is a computer disk drive with an MTTF of 1
million hours but a useful life of only five years. Within its useful life, the
drive is very reliable, but after five years the drive will begin to wear out
and its reliability will decrease rapidly. The drive with an MTTF of 1
million hours, however, would be more reliable than a drive with an
MTTF of 500,000 hours with the same expected useful life.
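The two examples above can be reproduced with a few lines of arithmetic. The sketch below converts an observed rate into an MTTF and compares the two disk drives by their probability of failing within one year. It assumes a constant failure rate (exponential model) and continuous operation of 8,760 hours per year; the function names are illustrative.

```python
import math

HOURS_PER_YEAR = 8760  # assumed continuous operation


def mttf_from_rate(failures, exposure):
    """MTTF in the units of `exposure` (e.g., person-years) per observed failure."""
    return exposure / failures


def annual_failure_probability(mttf_hours):
    """Probability of at least one failure in a year at a constant failure rate."""
    return 1.0 - math.exp(-HOURS_PER_YEAR / mttf_hours)


if __name__ == "__main__":
    human_mttf_years = mttf_from_rate(failures=1.1, exposure=1000)
    print(f"Human MTTF: {human_mttf_years:.0f} years")

    for mttf in (1_000_000, 500_000):
        p = annual_failure_probability(mttf)
        print(f"Drive MTTF {mttf:>9} h: {100 * p:.2f}% chance of failure per year")
```

Under these assumptions the 1-million-hour drive has roughly a 0.87% chance of failing in a given year of its useful life, against roughly 1.74% for the 500,000-hour drive.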
A related measure is mean-time-to-repair (MTTR), the mean time
needed to repair an instrument. MTTR has several components as shown
below:
MTTR = Mean time to detect that a failure occurred
+ Mean time to troubleshoot the failure
+ Mean time to repair the failure
+ Mean time to get back in service
The second item, “Mean time to troubleshoot the failure,” is of
particular interest. It is a major component of MTTR that affects the
uptime or the availability of an instrument.
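The MTTR breakdown above is a simple sum, which makes it easy to see how much of the total each component contributes. The sketch below uses a small data class with hypothetical times in hours; the class name, field names, and values are assumptions for illustration only.

```python
from dataclasses import dataclass


@dataclass
class RepairTimes:
    """Mean times, in hours, for each component of MTTR."""
    detect: float
    troubleshoot: float
    repair: float
    return_to_service: float

    def mttr(self):
        """MTTR as the sum of its four components."""
        return self.detect + self.troubleshoot + self.repair + self.return_to_service

    def troubleshooting_share(self):
        """Fraction of MTTR spent troubleshooting."""
        return self.troubleshoot / self.mttr()


if __name__ == "__main__":
    # Hypothetical values, not measured data.
    example = RepairTimes(detect=0.5, troubleshoot=2.0, repair=1.0, return_to_service=0.5)
    print(f"MTTR = {example.mttr()} h")
    print(f"Troubleshooting share = {example.troubleshooting_share():.0%}")
```

In this hypothetical case troubleshooting accounts for half of the MTTR, so shortening it has the largest effect on downtime.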
Mean-time-between-failures (MTBF) is a measure of the reliability of
repairable equipment. It is the MTTF plus the MTTR:
MTBF = MTTF + MTTR
Vendors often use the terms MTTF and MTBF interchangeably.
If the MTTF is much larger than the MTTR, this is an acceptable
approximation.
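A short calculation shows how small the difference between MTTF and MTBF is when repairs are quick relative to the time between failures. The numbers below are hypothetical, and the function names are illustrative.

```python
def mtbf(mttf_hours, mttr_hours):
    """MTBF = MTTF + MTTR."""
    return mttf_hours + mttr_hours


def relative_error_using_mttf(mttf_hours, mttr_hours):
    """Relative error made by quoting the MTTF in place of the MTBF."""
    return mttr_hours / mtbf(mttf_hours, mttr_hours)


if __name__ == "__main__":
    # Hypothetical instrument: MTTF of 100,000 h and MTTR of 8 h.
    print(f"MTBF = {mtbf(100_000, 8)} h")
    print(f"Relative error = {relative_error_using_mttf(100_000, 8):.4%}")
```

With an MTTF of 100,000 hours and an MTTR of 8 hours, the two figures differ by less than 0.01%. The approximation becomes poor only when the MTTR is a noticeable fraction of the MTTF.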
“Availability” is the fraction of time the instrument is available to
perform its designated task. Availability is given by the equation:
Availability = MTTF / (MTTF + MTTR) = MTTF / MTBF
An availability of 0.99 would mean that an instrument is available
99% of the time.
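To make the 0.99 figure concrete, the sketch below computes availability and the downtime it implies. It assumes the usual steady-state definition, availability = MTTF / (MTTF + MTTR), and an 8,760-hour year; the MTTF and MTTR values are hypothetical and the function names are illustrative.

```python
def availability(mttf_hours, mttr_hours):
    """Steady-state availability, assumed to be MTTF / (MTTF + MTTR)."""
    return mttf_hours / (mttf_hours + mttr_hours)


def expected_downtime_hours_per_year(avail, hours_per_year=8760):
    """Expected hours per year during which the instrument is unavailable."""
    return (1.0 - avail) * hours_per_year


if __name__ == "__main__":
    # Hypothetical instrument: MTTF of 990 h and MTTR of 10 h.
    a = availability(990, 10)
    print(f"Availability = {a:.2f}")
    print(f"Expected downtime = {expected_downtime_hours_per_year(a):.1f} h/year")
```

An availability of 0.99 sounds high, but under these assumptions it corresponds to about 87.6 hours of downtime per year.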
To have a high mean-time-to-failure (i.e., a low failure rate), select a
well-designed, sturdy instrument and apply it properly. To have a low MTTR,
select an instrument that is designed for maintainability and install it so
that it can be maintained. Unfortunately, other factors that influence
instrument selection, such as cost, delivery, and engineering preference,
can reduce availability. (That is what keeps troubleshooters in business.)
The Wear-out Period
The third period on the bathtub curve is the wear-out period shown
as Area “C” in Figure 2-1. This is where the instrument is on its last legs; it
is wearing out. Detecting the beginning of this period is a key to knowing
when to replace rather than repair an instrument, before it becomes a
“maintenance hog.” Because the instrument as a whole is wearing out
during this phase, it makes more sense to replace it than to repair
individual components.
Mechanical equipment with rotating or moving parts begins wearing
out immediately after it is installed. Such equipment typically has only the
infant-mortality phase (A) and the wear-out phase (B), though the wear-out
phase for mechanical equipment should have a shallower slope than
the wear-out phase of an electronic instrument. The failure curve for
mechanical equipment is shown in Figure 2-2.
Catastrophic failures (such as an instrument being hit by a
forklift truck or struck by lightning) are not considered in the bathtub
curve, nor are failures due to human error or abuse. While these types of
failures cannot always be prevented, they can be minimized.