A senior scholar at the National Academy of Engineering explains why human error is a symptom of most safety failures, not their cause
In the eyes of philosopher Paul Virilio, each accident is an “inverted miracle.” “When you invent the ship, you also invent the shipwreck,” he said in Politics of the Very Worst. Mishaps—such as the Boston molasses disaster, oil spills, or cyberattacks—often generate tighter rules and safeguards to prevent future blowouts. However, corporations can become complacent about their safety records. An “atrophy of vigilance” sets in as organizations overlook ominous signs, becoming victims of “normalization of deviance,” unknowingly or knowingly doing the wrong thing. As scholar Barry Turner observed, “We provoke the hazards whose risks we then have to learn to cope with.”
Many organizations have the unenviable task of operating high-risk technologies where mistakes can cost lives, dollars, or both. On an ink-blue Pennsylvania dawn in March 1979, a mélange of malfunctions melted the core of a commercial nuclear reactor at Three Mile Island. Design flaws, equipment frustrations, and worker faults aligned like holes in a stack of Swiss cheese, fostering a free flow of failures. Similarly, risks ratcheted up in the Bhopal gas tragedy of 1984 and the destruction of the space shuttle Challenger in 1986, followed months later by the Chernobyl nuclear accident. These failures exemplify what the scholar Charles Perrow termed “normal accidents.” He meant that, given the nature of these technologies and the systems operating them, catastrophic accidents should be seen not as anomalies but as predictable outcomes.
Two key features distinguish technologies prone to normal accidents: they are highly complex and tightly coupled. High complexity means a system’s components can interact in ways its designers and operators didn’t anticipate. Tight coupling means a small component failure can spread its consequences through the entire system, triggering a catastrophic collapse. Perrow argued that making such systems risk-free is impossible; catastrophic failures may never be eliminated, but their risks can be diminished. A simple oversight or moment of distraction in the air-traffic control system can lead to a midair collision between hundred-million-dollar planes carrying hundreds of passengers. Racing cars and roller coasters are further examples. Yet another is the control room of a utility grid, which is as complex as a nuclear reactor. None of these cases comes close to the wicked complexities of ecological traumas, epidemics, financial meltdowns, and climate maladies.
Over the years, some organizations—particularly those that are flexible, non-hierarchical, and highly attuned to risk—have developed nearly error-free approaches to reduce catastrophes, if not eliminate them. In a “high-reliability organization,” routine operations are hierarchical, with a chain of command. But in high-pressure situations, the organizational structure undergoes a dramatic shift; it becomes much more horizontal. Individuals must recognize potential problems and deal with them immediately, without approval from higher-ups. The effect is a continuous-learning orientation, what scholars Karl Weick and Kathleen Sutcliffe described as “mindful organizing.” Such a mentality strives for “a minimum of jolt, a maximum of continuity.” The hallmark of a high-reliability organization “is not that it is error-free but that errors don’t disable it.” When a work structure is designed to encourage this mindfulness, people can pick up a potential problem or error early and respond robustly.
The “normal accident” and “high-reliability organization” theories are valuable because they are simple. They advocate for redundancy, the idea of deliberately duplicating critical components of a system to control risk and increase reliability. But more reliability doesn’t necessarily mean more safety. Reliability and safety are often confused, so it’s essential to differentiate them.
In engineering, safety is generally about acceptable risks and avoidable harms; reliability is about satisfying a particular requirement under given conditions over time. One property can exist without the other, but the two can also coexist and compromise each other. A system can be reliable and unsafe if one disregards the interactions among components, even when each component fully meets its requirements. A system can be safe but unreliable if the rules and procedures aren’t dependably followed. Reliability is neither sufficient nor necessary for safety. Inspecting a pipeline valve can tell us much about the valve’s reliability but not whether a refinery is safe. Safety is a function of how that valve interacts with other refinery units. That’s why an organization, even one with a “culture of safety,” may be unable to prevent accidents if it’s fixated on reliability. In highlighting these differences, systems engineer Nancy Leveson reminds us that safety is an emergent property of a system, not a component property.
“Messy safety” problems occur when it’s impossible to exhaustively test the range of system behaviors. Such complexities force design changes that must be consciously considered so that other errors aren’t introduced. “Productivity and safety go hand-in-hand for hazardous industries and products. They are not conflicting except in the short term,” writes Leveson. In a messy problem, an organization may treat productivity and safety as opposing goals, or conflate them as one and the same. Whether the tension is between productivity and safety or between affordability and profit, and whether it can be reconciled, safety needs to be understood using systems theory, not reliability theory. Accidents and losses should be seen as a dynamic control problem rather than a component failure problem, Leveson argues, which requires considering the entire social system, not just the technical part. Human error is a symptom of, not the reason for, most accidents, and such behavior is affected by the context in which it occurs. This view should inform safety from the outset, before a detailed design even exists.
Normal-accident theory was part forethought, part fatalism, arguing that avoiding complex hazards is impossible despite planning. High-reliability organizations are about reducing errors while treating safety and reliability as if they were the same. Leveson’s systems-theory approach suggests safety control, not in a military sense, but through “policies, procedures, shared values, and other aspects of organizational culture” that help dissolve messy problems. Messy safety issues are about lessening the frequency of things that go wrong and increasing the frequency of things that go right. Messy safety should also reveal the biting vulnerabilities often buried in various processes and presentations. Processes are essential for operations but can also be misused and flawed. A process isn’t a hard solution to safety or a substitute for leadership. Processes can even worsen failure when mindlessly used. As one observer put it: “In engineering, like flying, you need to follow the checklist, but you also need to know how to engineer.”
Similarly, engineers use presentations to communicate design concepts and results without always realizing that their slide decks can invoke “death by PowerPoint.” The ubiquitous use of slides for technical presentations is itself a safety hazard. Tragically, PowerPoint was implicated in NASA’s 2003 technical discussions about the reentry risks facing the space shuttle Columbia, which soon after broke apart, killing its seven crew members. The slides compressed information on significant risks into poorly formatted bullet points, a practice highlighted in the Columbia Accident Investigation Board’s review: “The Board views the endemic use of PowerPoint briefing slides instead of technical papers as an illustration of the problematic methods of technical communication at NASA.”
PowerPoint is “a social instrument, turning middle managers into bullet-point dandies,” as a writer in The New Yorker described its “private, interior influence” in editing ideas. “PowerPoint is strangely adept at disguising the fragile foundations of a proposal, the emptiness of a business plan; usually, the audience is respectfully still . . . , and, with the visual distraction of a dancing pie chart, a speaker can quickly move past the laughable flaw in his argument. If anyone notices, it’s too late—the narrative presses on.”
Indeed, the Columbia Accident Investigation Board critiqued NASA’s process and presentation philosophy: “The foam debris hit was not the single cause of the Columbia accident, just as the failure of the joint seal that permitted O-ring erosion was not the single cause of Challenger. Both Columbia and Challenger were lost also because of the failure of NASA’s organizational system.” NASA’s culture and structure fueled the flawed decision-making, ultimately undermining the entire space shuttle program.
Dissolving messy safety problems requires constantly comparing current conditions with emerging requirements and failure modes. The aviation industry wouldn’t have thrived without dramatically reducing accident rates. Thanks to the industry’s safety and maintenance protocols, public confidence in aviation has steadily increased since the 1960s, when only one in five Americans was willing to fly. While human-induced errors are quantifiable, the behavior that leads to them is not; hence, flight training remains central to crew preparation. And finally, with messy safety problems, it can be tempting to blame others, but “blame is the enemy of safety,” Leveson notes. The goal should be to design systems that are error resistant, not operator dependent.
Public safety frequently comes down to how safe is safe enough. But how fair is “safe enough”? And how often do engineers make decisions that impose risk on others, and to what extent do they reflect on those acts? Substituting “fair” for “safe” isn’t some linguistic quibble, note scholars Steve Rayner and Robin Cantor. The question of the fairness of safety is about debating the effects of technology based more firmly on trust and social equity than on what’s more measurable. This also means that safety-critical engineering, like the ideal of freedom, requires eternal vigilance. And because engineers recognize that absolute safety is impossible, their decisions must be mindful of the risks imposed on others. Engineering design almost always involves trade-offs between safety and considerations of cost, feasibility, and practicality. Such decisions create ethical challenges that rarely appear on a PowerPoint slide but are critical to alleviating wicked problems. Trust and fairness are intuitive elements that easy definitions cannot encapsulate. However tempting definitions may be, as the Nobel laureate Toni Morrison wrote in Beloved, they belong to the definers, not the defined.
The engineer and advocate Jerome Lederer—whose career ranged from maintenance roles in the earliest US Air Mail Service flights of the 1920s and the preflight inspection of Charles Lindbergh’s Spirit of St. Louis to leading FAA and NASA safety programs—knew about the varieties of experience in safety. Lederer was called the “Father of Aviation Safety,” and he imagined a “psychedelic” dreamworld made of collaborative, trusting professionals. In it, plaintiff lawyers participated in aircraft design before accidents, not afterward; every safety suggestion resulted in greater passenger comfort; engineers invited others to contribute to design from scratch and were friends with test pilots, not foes; and engineers eagerly received new ideas from everywhere, so that the “NIH” factor, “Not invented here,” would become “Now I hear.” Such manners, Lederer observed, were essential for a working safety culture.
The polymath British judge Lord John Fletcher Moulton, on whom Lederer drew, said manners include all things “from duty to good taste.” In his 1912 article “Law and Manners,” the jurist presented “three great domains of human action.” One was Positive Law, where an individual’s actions followed binding rules. The second, Free Choice, covered an individual’s preferences and autonomy. Nestled between these two domains was the third, Obedience to the Unenforceable, which was neither force nor freedom. This self-imposed domain was about how a nation trusted its citizens and how those citizens upheld that trust. Lapses in Obedience to the Unenforceable show up as negligence, complacency, decreased oversight, and deviation from responsible practices. Lederer observed that, like doctors and lawyers, engineers have a code of conduct. “Engineers, however, have more problems in putting theirs into effect because they function more often as part of an organization, subject to organizational pressures.”
Another facet is that the success of a safety program depends on the context. Technologies often outpace our ability to conduct a safety program or even grasp its complexity. And ironically, as we saw with Steve Frank’s paradox of robustness, continuous failure reduction may increase vulnerabilities. In an analysis of “ultrasafe” systems, where the “safety record reaches the mythical barrier of one disastrous accident per 10 million events,” scholar René Amalberti observes that an optimization trap can perversely lower system safety “using only the linear extrapolation of known solutions.” The design of these ultrasafe systems becomes a political act rather than a technical one, with a preference for short-term yields over more durable measures. Amalberti suggests that a safe organization shouldn’t be viewed as a sum of its defenses but as a result of its dynamic properties. Learning those would create a “natural or ecological safety,” which aims not to suppress all errors but to control their fallout within an acceptable margin.
Add to this the diversity of perspectives that influence how a system produces or prevents errors, errors that in some instances prove deadly. In fire safety, for example, many different forms of expertise can be legitimately claimed in a tightly contested professional domain. Scientists can declare their authority just as firefighters can claim theirs, as can lawyers and regulators. These views collided in the investigation of a June 2017 fire disaster in London, the deadliest UK fire in over three decades. Grenfell Tower, a 24-story residential building in London’s North Kensington, caught fire, killing over 70 people and injuring dozens more. A refrigerator malfunction in one of the apartments started a fire that spread rapidly in all directions, sped by the highly combustible external facade.
The British prime minister launched an official public inquiry, which confronted a grieving and justifiably angry public. Media coverage focused as much on the fire performance of claddings as on national fire-safety governance and policy. The coverage also emphasized trust, fairness, and expertise and how they underpinned the practice and perception of safety. Luke Bisby was one of the expert witnesses for the inquiry. He wrote six technical reports, spanning several hundred pages, that were the subject of multiple public hearings. Bisby saw firsthand how a hard safety problem could become a wicked safety problem. The nexus of regulation, innovation, education, professionalism, competence, ethics, corporate interests, social and environmental policy, and politics created a complex mess of interests, power, and, alas, victims, primarily immigrants and people of color.
As an engineer, Bisby was frustrated that there was no coherent voice on this subject and that the opinions came from all directions. The experience, however, made him more reflective. To truly understand what had occurred at Grenfell, Bisby spent five years tracing decades of parallel narratives in government policy, regulation, fire-safety design oversight and approval, and professional responsibility. The effort tested his belief in the primacy of science and rationality and, ultimately, his faith in engineering.
A year after the Grenfell Tower fire, Bisby spoke at an international meeting of fire-safety experts. He lectured about his work on sociological issues in fire-safety regulation. The audience asked him about his expertise in performance-based design. While Bisby is both a proponent of performance-based engineering design and a critic of the often thoughtless rule-following that many building codes enable, he told the audience, “I believe the idea of expertise is a myth.” He crossed his hands over his black suit and conference badge. “My definition of an expert is someone who is profoundly aware of their incompetencies.” The audience was stunned by what he said. Bisby explained how people are usually generous to themselves by overestimating, not underestimating, their competencies. Bisby tells his students regularly that their engineering degree will make them deeply aware of how little they know. His idea is to instill, early on, a mentality of “chronic unease,” an intellectual humility required for promoting public safety. Paradoxically, only by becoming more competent can people recognize their incompetence. Specialized expertise is not a pitfall, but inattention to one’s fallibility is. “We all must constantly ask ourselves if we have any idea what we’re talking about,” Bisby said bluntly. “If we did that, the world would improve.”
Excerpted from Wicked Problems: How to Engineer a Better World. Copyright (c) 2024 by Guru Madhavan. Used with the permission of the publisher, W.W. Norton & Company, Inc. All rights reserved.