
What the...! System Failures

Alan Harding


Abstract

Failures of artifacts and systems usually take us by surprise, even though we know that things do fail. This piece is pre-reading for the engineering module of the new Maritime Management BSc top-up, introducing concepts around system failures relevant to systems engineering. It is an overview of a topic that is then explored in depth in the lecture series.

Pre-reading material that is not overlong or involved can help garner interest in a subject and provide an early view of where a series of topics is headed, giving some of the drier content more meaning.

We can learn much about a subject from exploring failures from the past, and particularly accident reports, where the multifaceted nature of accidents is analysed in detail. An accident that occurred in aerospace is briefly broken down in this short paper. This is then used as a template for student analysis of maritime accidents such as the Costa Concordia.

Introduction

We can learn much about an engineering subject from exploring failures of components and systems from the past, and particularly accident reports, where the multifaceted nature of accidents is analysed in detail. In this article an accident that occurred in aerospace is briefly broken down. This can be used as a template for student analysis of maritime accidents such as the Costa Concordia as part of their maritime technology management module.

Why Systems Fail

The failure of an electronic or mechanical system often comes as a surprise. If the failure is safety critical, such as a structural failure, the result can be a disaster. A system performing a function is often more than just the equipment; it includes the operators. Those operators function in an environment that includes the ambient conditions, ergonomics, culture and training.

All systems will eventually fail. Even the pyramids crumble, albeit after millennia. Failures occur for a variety of reasons that include, for example, wear, crack growth from defects, and radiation embrittlement. We often deal with these by ascribing a life to a component. However, this cannot always be determined, and so we end up with estimated failure rates that are often based on historical data. Software is different, as it is either right or wrong for the circumstance that arises.

Manufactured parts can include defects, and under mechanical stress those defects can result in cracks that grow over time, so some components are given an operating life. That life has a margin of error ascribed to it, and at the end of an operating period the component is inspected, repaired or replaced. Because software is either right or wrong, its integrity is assured by the design, test, and inspection process. Those processes are governed by engineering standards that are applied in various industries, such as the nuclear industry.

Because systems fail, we carry the risk of those failures occurring, and we have become used to accepting some risk. For example, in the UK we experience some 3 × 10⁻⁹ deaths/km when driving, and that is accepted, although some improvements are sought. For civil aircraft, a rate of 10⁻⁷ failures/hr is an agreed engineering standard. This is attained by design, maintenance, regulations, and training.
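To put those rates in context, the short sketch below converts them into rough per-year and per-flight figures. The annual driving distance and flight duration are illustrative assumptions, not figures from this article.

```python
# Rough illustration of the accepted risk figures quoted above.
# The annual mileage and flight duration are illustrative assumptions.

driving_fatality_rate = 3e-9   # deaths per km driven (UK figure quoted above)
aircraft_failure_rate = 1e-7   # failures per flight hour (civil aircraft standard quoted above)

annual_km = 10_000             # assumed annual driving distance
flight_hours = 10              # assumed long-haul flight duration

print(f"Expected driving fatalities per year:        {driving_fatality_rate * annual_km:.1e}")
print(f"Approx. probability of a failure on 1 flight: {aircraft_failure_rate * flight_hours:.1e}")
```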

Failure Rates and Fault Trees

The causes of a particular failure can be varied, and to enable analysis of this we use fault trees. A fire can occur for a variety of failure reasons, and rates for those failures, based on historical data and engineering knowledge, can be ascribed. An example is shown below, where either a fuel leak or an electrical defect could cause a fire:

Here the top event C (a fire) sits above two basic events, A (a fuel leak) and B (an electrical defect), joined by an OR gate: either event on its own leads to the fire.

The probability of C = (probability of A) + (probability of B)

However, if both failures are required to result in the outcome (an AND gate), then:

The probability of C = (probability of A) × (probability of B)
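To make these two combination rules concrete, the short sketch below evaluates a top event through an OR gate and an AND gate. It is a minimal illustration rather than anything from a real analysis; the leak and defect probabilities are invented for the example, and the OR gate uses the simple sum quoted above (a rare-event approximation).

```python
# Minimal fault-tree gate arithmetic for two independent basic events.
# The probabilities used here are invented purely for illustration.

def or_gate(p_a, p_b):
    """Either A or B causes the top event (rare-event approximation: P ~ P(A) + P(B))."""
    return p_a + p_b

def and_gate(p_a, p_b):
    """Both A and B are required for the top event: P = P(A) x P(B)."""
    return p_a * p_b

p_fuel_leak = 1e-4          # hypothetical probability of a fuel leak
p_electrical_defect = 1e-5  # hypothetical probability of an electrical defect

print("Fire via OR gate: ", or_gate(p_fuel_leak, p_electrical_defect))   # ~1.1e-4
print("Fire via AND gate:", and_gate(p_fuel_leak, p_electrical_defect))  # ~1.0e-9
```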

A fault tree can include human factors and environmental conditions:

Human factors can include environment (e.g. illumination, communication, culture, ergonomics, and training). Environmental factors can include weather, temperature, vibration etc. Poor instrumentation, for example, has led aircraft pilots to shut down the wrong engine in an engine failure event. Hot brakes have led to fuel fires where fuel leaks have occurred.

Software-induced failures may occur where an operation is conducted outside the conditions envisaged or tested. These are not failures of the software itself, but failures of the software development process. Code does not fail, but its implementation may result in an error.

One misconception about failure probabilities is that if the rate is low, say 1 failure per 100,000 hours, then a failure will not happen in the first hours of operation. Just after the RAF Tornado entered service, one made a big hole in the Queen's Sandringham estate. The failure had a 1 × 10⁻⁶ probability. The likelihood at any time depends on the failure mode, and a rare event can happen at any time if time is not a factor in the failure mode.
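One way to see this is to model a constant failure rate with an exponential distribution, where the chance of failure is the same in the first hour of service as in any later hour. The sketch below is a simple illustration of that point, using the 1-per-100,000-hour rate quoted above.

```python
import math

# Constant (time-independent) failure rate: time to failure follows an exponential
# distribution. The hazard is the same in hour 1 as in hour 100,000 (memoryless),
# so a "rare" failure can strike a brand-new system.

rate = 1e-5  # 1 failure per 100,000 hours, as quoted above

def prob_failure_within(hours, rate):
    """Probability of at least one failure within the given operating hours."""
    return 1 - math.exp(-rate * hours)

print(f"Failure within the first 10 hours:    {prob_failure_within(10, rate):.2e}")
print(f"Failure within the first 1,000 hours: {prob_failure_within(1_000, rate):.2e}")
print(f"Failure within 100,000 hours:         {prob_failure_within(100_000, rate):.2e}")
```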

Mitigations

The results of a system failure can be mitigated. Mitigations do not stop failures but lessen their impact. Lifeboats, parachutes, and insurance are mitigations against fatal hazards. Fire extinguishing systems are mitigations, but they can also be included in a fault tree to address a top hazard (an out-of-control fire).

Redundancy

One approach to system failures is to introduce redundancy: a back-up generator, two engines, or multiple channels of control. This has an Achilles heel: the common mode failure. In 1982 a British Airways 747 flew through a volcanic ash cloud just off Indonesia and all four engines failed. Common mode failures can result from design, manufacture, operation and environment. One way round this is to have a different design and manufacture for each channel.
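As a rough illustration of why this matters, the sketch below compares the chance of losing all engines when failures are treated as independent with the chance when a small common-cause fraction is included (a beta-factor style split; the per-engine probability and the fraction are invented for the example, not figures from this article).

```python
# Illustration of how a common mode failure undermines redundancy.
# The per-engine failure probability and the common-cause fraction (beta)
# are invented for this example.

p_engine = 1e-4   # hypothetical probability that one engine fails on a flight
n_engines = 4
beta = 0.01       # hypothetical fraction of failures that are common cause (e.g. ash cloud)

# Naive view: engines fail independently, so losing all four is vanishingly unlikely.
p_all_independent = p_engine ** n_engines

# Beta-factor view: a common cause takes out all engines at once with probability beta * p_engine.
p_all_with_common_mode = beta * p_engine + ((1 - beta) * p_engine) ** n_engines

print(f"All engines fail (independent model): {p_all_independent:.1e}")       # ~1e-16
print(f"All engines fail (with common cause): {p_all_with_common_mode:.1e}")  # ~1e-6
```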

Accidents

Accidents invariably result from a chain of events. This is often referred to as the Swiss Cheese model: if all the events (or holes in the cheese) do not line up, the accident will not happen. The events can long predate the accident itself. For example, poor training that may have occurred years before can become significant.

An aircraft accident in 2009 shows how small occurrences can build. Air France 447 was flying from Brazil to France when it plunged into the Atlantic. The cause was icing of a sensor and the pilots' subsequent actions, which caused the aircraft to stall and fall out of the sky. However, there was a sequence of events, any one of which might have prevented the outcome:

• The pilots were briefed about weather on the route: they could have avoided it.
• The captain decided to fly through the weather.
• The captain decided to take a break just before the bad weather, leaving the less experienced crew in control.
• The pilots in control did not comprehend the sensor failure warning and its effect on the flight control system.
• The pilots took opposing actions and did not communicate.
• One pilot took, and continued with, a fatally incorrect action.

To this list can be added poor training of the pilots and, possibly, poor company regulation and culture. The captain of AF447 re-entered the cockpit as the aircraft plunged earthwards. I imagine he exclaimed ‘ce que…!’.

