Astronaut Dr. Charles J. Camarda has uncovered a recurring cause of accidents that no one has articulated before: the loss of a research culture that places a premium on learning and the quest for knowledge. He shows how to develop high-performing research teams, and networks of such teams, that can solve anomalies rapidly and help prevent catastrophes in complex high-risk/high-reliability organizations.
Astronaut DR. CHARLES J. CAMARDA is an inventor, author, educator, and internationally recognized invited speaker on subjects related to engineering, engineering design, innovation, safety, organizational behavior, and education. He has over 60 technical publications, holds 9 patents, and has received over 20 national and international awards. Dr. Camarda is a NASA veteran with over 22 years of experience as a research engineer, 18 years as a NASA astronaut who flew on STS-114, the return-to-flight mission following the Columbia disaster, and 13 years as a Senior Executive holding many positions within NASA. He is an adjunct professor at several universities and has developed an innovative conceptual engineering design pedagogy called ICED, which he has taught to NASA engineers and which forms the basis for his 501(c)(3) educational nonprofit, the Epic Education Foundation, founded to democratize STEM/STEAM education for students of all ages around the world.
“I absolutely love this book! While reading Mission Out Of Control, I was reminded of my many experiences and lessons from my aviation career in civil and military flying, flight test, and the space shuttle program. While it is not pleasant to remember difficult times, it is far more important to learn from the past and share these experiences to help prevent future occurrences. Dr. Charles Camarda has certainly achieved this. As a researcher, engineer, and astronaut with an impressive understanding of organizational culture, he can succinctly describe the conditions within NASA. In this book, he describes his experiences and uses case studies and other examples to keep the reader engaged. These lessons are timeless, and this book will still be valuable many decades into the future.” —COL. EILEEN COLLINS, first female Space Shuttle commander and author of Through the Glass Ceiling to the Stars
“A bold and courageous exposé of how a big government agency can lose its way. Filled with details that only a research insider would be able to see, Charlie’s work is a masterpiece of fact-finding, problem-solving, and innovative solutions. Once you start believing your own hype, the true story is often embarrassing and hidden, particularly in a bureaucratic organization like NASA. A must-read for everyone in big government and industrial bureaucracies, not only to avoid preventable tragedies but to use the author’s suggestions, approaches, and ideas to enhance and improve their culture and operations.” —DR. BART BARTHELEMY, former Director of the National Aero Space Plane Program and Founding Director of the Wright Brothers Institute
“The story of one man’s integrity and grit, and the challenges he faced in delivering a message that some did not wish to hear.” —DR. MARK J. LEWIS, CEO, Purdue Advanced Research Institute LLC
“The book does a remarkable job of pulling together every possible relevant concept and evidence from organizational research to support the goal of helping organizations get better.” —DR. AMY EDMONDSON, Novartis Professor of Leadership and Management, Harvard Business School; author of The Fearless Organization (Wiley, 2019) and several case studies on Columbia, Challenger, and safety cultures
“Dr. Camarda’s insider critique of NASA exposes its corruption of research by sending vulnerable astronauts on potentially suicidal missions without the checks and balances of scientific inquiry. He outlines what it would take for NASA and other large organizations, such as Boeing, which has been beset by airplane catastrophes, to become high-performance organizations by balancing research goals with performance objectives.” —JACK MATSON, Professor Emeritus, Penn State University, and author of Innovate or Die
“Dr. Camarda describes a near miss not as a success but as a ‘system failure.’ Many firefighters would not have died or been seriously injured over my 32-year FDNY career if the ‘cultures’ and ‘biases’ he teaches us about in his book were understood.” —THOMAS VON ESSEN, FDNY Commissioner during the 9/11 tragedy and author of Strong of Heart (HarperCollins, 2002)
“Dr. Camarda’s remarkably well-written and documented book clearly and effectively presents the methodology for developing and maintaining high-performing organizations.” —DR. RANDY GRAVES, former Director of Aerodynamics, NASA Headquarters
“Charles Camarda has drawn on his rich personal experience at NASA and extensive research to describe how a high-risk organization can build an effective safety culture, as well as how such a culture can erode over time. Managers in any high-risk environment can learn from this book.” —PROF. MICHAEL ROBERTO, Bryant University; author of Unlocking Creativity; formerly Assistant Professor at Harvard Business School and Visiting Assistant Professor at the NYU Stern School of Business
“Mission Out of Control is told from the moving perspective of a friend and colleague of the astronauts who perished on Columbia. It should be required reading not only for students of spaceflight history, but also for anyone interested in how organizations develop certain cultures, how those cultures can change over time, and how, in turn, culture can lead to success or ultimate failure. In the end, this is also a book about speaking truth to power; the story of one man’s integrity and grit, and the challenges he faced in delivering a message that some did not wish to hear.” —DR. MARK J. LEWIS, CEO, Purdue Advanced Research Institute LLC

“Mission Out of Control exposes the systemic breakdown and erosion of the ‘research engineering’ culture that made the Apollo program so successful. Until NASA is reinvented, safe return to the moon and on to Mars will remain very high risk.” —JOHN NEER, former Chief Engineer for Lockheed Martin during the Columbia Accident Investigation
Dr. Charles Camarda
Headline Books, Inc. Terra Alta, WV
Mission Out of Control: An Astronaut’s Odyssey to Fix High-Risk Organizations and Prevent Tragedy
by Charles Camarda
copyright ©2025 Charles Camarda
All rights reserved. No part of this publication may be reproduced or transmitted in any form or by any means, electronic or mechanical, including photocopy, recording, or any information storage system, without written permission from Headline Books, Inc.
To order additional copies of this book, for book publishing information, or to contact the author:
Headline Books, Inc.
P.O. Box 52
Terra Alta, WV 26764
www.HeadlineBooks.com
mybook@headlinebooks.com
ISBN 9781958914502
Library of Congress Control Number: 2024945844
PRINTED IN THE UNITED STATES OF AMERICA
To my amazing family who have been by my side physically and spiritually throughout my journey and who have shown me amazing grace, love and support through some very trying times. And to my mentor Jim Starnes who showed me how to be a researcher and who provided a safe environment for me to learn and grow.
Contents
Foreword....5
Introduction....7
Part I: It’s the Culture, Stupid!....14
Chapter 1: Boiling a Frog–How Great Organizations Lose Their Way....16
Chapter 2: Culture as Cause–Connecting the Dots, Challenger to Columbia....37
Chapter 3: Cosmetic Fixes and Near Misses....
Chapter 4: When All Else Fails....128
Part II: Creating a High-Performance Culture–How to Build a Research Culture and Network of High-Performing Teams....179
Chapter 5: Research as a Distinct Culture....180
Chapter 6: The idea of a “Group,” a “Team,” and a “High-Performing Team” (HPT)....210
Chapter 7: Developing High-Performing Teams to Solve Epic Challenges....240
Chapter 8: Fixing a Broken Culture–Putting It All Together....284
Epilogue....313
Glossary of Terms....316
Acknowledgments....320
Foreword

No organization, however successful in its initial stages, has ever escaped the need for reform at some point in its existence. There are numerous historical examples, from the Marian reforms that made the ancient Roman army the most formidable military of its day, to the sweeping changes of the Magna Carta that spread power more evenly across medieval England, to Martin Luther’s Reformation that modernized Christianity, and to General Billy Mitchell’s campaign to establish air power force projection in American war theory. In all these cases, there were considerable and even violent reactions within the organizations affected, which decidedly included the defamation and disparagement of the reformers, the intent of which was to force them to shut up and quit. Unfortunately, from ancient times to the present, the negative and outraged reaction by those in power to those calling for reform has not changed. The need for improved management and reform in a host of large corporations is evident today through our observation of their often self-defeating policies that bring on the loss of billions of dollars for their stockholders. But what about the most respected, admired, and even beloved organization within our federal government? Has it also been infected by poor leadership and destructive policies? Indeed, I am referring to the conquerors of the moon, the exalted organization that won the very heavens for us, our very own National Aeronautics and Space Administration (NASA). How could anyone dare suggest there is anything wrong with this organization of heroic American pioneers pushing into the final frontier? It would take a very brave person, indeed, and it would have to be someone who has seen the agency from within and without and not only studied its problems but withstood the anger and threats of its entrenched managers. Fortunately, there is such a person: the author of this book. After the deaths of his friends and colleagues aboard Columbia, a disaster that did not and should not have happened, Astronaut Charlie Camarda and his fellow crewmates worked diligently through their grief and many stress-filled days to return the United States to space on the next shuttle flight.
This allowed NASA to once more take on the prideful mantle of success, but only after paying a very steep price, a price that seems at times to have been already forgotten but may have to be paid again unless there are changes within both the organizational structure and culture of the agency. Like Dr. Camarda, I have great respect for NASA’s astronauts, engineers, and technical staff but believe improvement is required in how its managers make decisions on nearly everything from programmatic priorities to safety. Like doomed canaries in coal mines, NASA’s failures of management, which have caused such awful sacrifice by its astronauts and their families, provide us with a powerful warning that something is very wrong in our approaches to leadership and teamwork, not only at NASA but in our once innovative and productive private sector as well. To begin fixing things, I believe the very best starting place for those who oversee our space agency, and also for those who wish to lead our most important commercial companies back to their former greatness, is to read this book in its entirety and heed its sound advice.
Homer Hickam, retired NASA engineer and author of Rocket Boys/October Sky
Introduction

It was a cold, still afternoon in Star City, Russia, when a Russian crewmate crossed the commons that separated the Yuri Gagarin Cosmonaut Training facility from the U.S. cottages to tell the American crew the tragic news: the Space Shuttle Columbia had disintegrated in the skies over southeast Texas during Earth entry. We were in shock; we had just lost seven friends, three of whom were classmates I had trained with for over two years. We ran into our cottage, turned on the TV, and stood transfixed, watching in disbelief as the news stations played and replayed the launch and entry video footage. The launch video showed a huge piece of foam pop off the large external tank (ET), which housed the cryogenic liquid propellant, strike the left wing of the shuttle orbiter, the vehicle that carried the crew, and explode into a particle cloud that trailed past the vehicle. The video footage taken 16 days later, during the crew’s fiery descent through the atmosphere and return to Earth, showed the breakup of the vehicle into thousands of glowing pieces streaking across the morning sky over southeast Texas. My emotions bounced back and forth between sadness and extreme rage: how could engineers and program managers in Houston not have understood the severity of the foam strike? The piece that had popped off was the size of a large Styrofoam cooler, hundreds of times larger than prior debris strikes that had caused serious damage to the fragile thermal tiles that protected the aluminum structure during entry. More importantly, why didn’t the former head of the astronaut office, Bob Cabana, call me? He knew I had over twenty-two years of research experience in thermal protection systems (TPS) when he hired me to be an astronaut. At Langley, I headed a research branch of some of the sharpest people who devoted their careers to understanding this exact phenomenon. We were responsible for analyzing and testing the fragile TPS of hypersonic vehicles like the space shuttle. How could engineers in Houston not have contacted me or any of the experts at Langley or the other research centers?
How could they have possibly concluded that there was no critical damage worthy of reporting to the crew when the foam strike first occurred after launch? What I learned from studying the causes of accidents in the aerospace industry, as in numerous other industries, is that these stories and questions about improper behavior and flawed decision-making are not uncommon. What I also learned was that most organizations fail miserably in their attempts to right the wrongs that cause an accident. They fix what they believe to be the proximate technical causes, make meek gestures to fix safety protocols, processes, and things like “organizational governance,” pat themselves on the back when similar accidents are not imminent, and sigh with disbelief when tragedies are repeated years later, with disastrous loss of life, pain and suffering, and financial ruin. In fact, the failure rate of companies trying to change or transform their culture is over 80%. The Columbia Accident Investigation Board (CAIB) published the results of its investigation a little over six months after the accident. For the first time in history, the primary cause of an accident was attributed to the failure of an organization’s culture. I had been working for NASA for eight years when the Challenger accident occurred, and now, post Columbia, I was determined to devote the remainder of my NASA career to understanding and helping solve the problems that led to the death of my colleagues and friends on STS-107. I realized the NASA I remembered when I was hired in 1974 at Langley was very different from what I was experiencing when I was selected as an astronaut in 1996 at the Johnson Space Center (JSC). I had seen a decided shift from what I would later identify as a strong “research culture” to what noted sociologist Diane Vaughan would call a “production” culture—a culture that was driven to meet the schedule and budget demands of a very public space program. Chapter 1 takes us back in time to understand what made NASA during the Apollo years such an amazing, high-performance organization, how it evolved over time, and the factors that led to its slow, gradual decline. I spend considerable time in Chapter 2 describing the myriad cultures and sub-cultures that exist at NASA and how these organizational cultures evolved during the life of the Agency and its predecessor organization, the National Advisory Committee for Aeronautics (NACA). I show how large, complex organizations typically do not have a singular, homogeneous culture but instead are composed of a multitude of cultures at multiple levels within the structure of an organization, even at the granular group or team level. You can only imagine how much more complex this becomes with much larger multinational companies like Boeing, for example, that have over ten times the number of employees and evolve and merge with totally different companies throughout their lifetimes.
Each company that is absorbed carries with it many years of evolution under different leadership, with a vision, mission, and values of its own. Admiral Harold Gehman, who headed the Columbia Accident Investigation Board (CAIB), relied heavily on renowned sociologist Diane Vaughan and her studies of the causes of the Challenger accident to convince other board members that the bad behaviors and culture that caused the Challenger accident in 1986 were directly linked to and responsible for the Columbia accident 17 years later. Although the technical causes of the accidents were different, the same social and political pressures and cultural issues were present at both times, which resulted in similar bad behaviors and flawed decisions. Chapter 2 will highlight the key sociological, behavioral, and organizational elements that were directly related to the causes of both accidents and try to address why NASA’s attempts to correct such bad behaviors failed. I dedicated forty-six years to working at NASA, first as a research engineer, then as an astronaut, and finally as a senior executive. I served as a crew member on the space shuttle mission that flew immediately after Columbia (STS-114) and formed and led several key research teams that identified the technical cause of the disaster, as well as developed new repair technologies for astronauts to use to prevent similar incidents. During that time, I experienced firsthand how NASA’s culture contributed to the Columbia accident and how that same culture remained unchanged and caused several other (unreported) near misses throughout my remaining tenure at NASA. Post-Columbia, I researched hundreds of books, articles, and academic papers concerning the causes of accidents and had the opportunity to interview some of the leading authors and researchers in this field. This research led me to identify similar instances of complex organizations where culture and organizational behaviors lead to recurring accidents. It also helped me understand how my research background enabled me to recognize a culture that had been overlooked previously: the research culture. When most people think of NASA, they think of an amazing organization of scientists and engineers working at the cutting edge of aeronautics and space technology, able to accomplish the most complex, daunting challenges. It is an organization that is consistently selected as the “best” place to work of all the government agencies. It would be difficult for anyone to believe how corrupt and dysfunctional such an agency could become without my devoting several chapters outlining in painful detail how wrong that initial assessment could be.
As much as it pains me to relive the stories in Chapters 3 and 4, I feel it is necessary to illustrate what every organization is capable of, to show how insidious the gradual slide to dysfunctionality can be, how easily it can be overlooked, and how it can result in the careless loss of human life and economic ruin. I will do this by describing the immediate slide back to a dysfunctional culture, which resulted in three near misses with disaster and a critical anomaly that was observed on a reinforced carbon-carbon (RCC) wing leading edge panel on our space vehicle, Space Shuttle Discovery, upon our return from space on STS-114. The anomaly was a series of large voids beneath the fragile silicon carbide (SiC) oxidation protection coating on the starboard/right wing of the vehicle in one of the most highly heated regions, RCC panel 8R. The blow-by-blow description and timeline of what occurred when I first realized the criticality of this anomaly, how it was misrepresented by the same technical working group that downplayed the severity of the Columbia foam impact, and how badly I was treated will shock most readers. The organizational changes which NASA had put in place to appease the requirements of the CAIB, such as the development of the NASA Engineering and Safety Center (NESC), were not only ineffective in discovering safety hazards; the NESC actively participated in silencing the concerns of experienced subject matter experts and even joined forces with the Shuttle Program to help develop flight rationale to continue flying defective hardware. Yes, that is correct. The same organization that was supposed to provide objective engineering and safety oversight for the Space Shuttle Program was, in fact, aligned with it, helping to provide reasons to continue to fly with known safety hazards! I begin the second half of the book with a description of what I consider to be the real cause of accidents like Challenger and Columbia: the organizational loss of what I call a “research culture.” This research culture was the reason NASA’s predecessor organization, NACA, advanced the United States to become the world leader in aeronautics and spawned the burgeoning aerospace industry in this country. The essence of a true research culture is a persistent curiosity and thirst for knowledge, the use of the scientific method, and a rigorous building-block approach for validating new knowledge. This is a process that relies on analytical solutions and/or numerical methods to describe a phenomenon, careful validation of those predictions by intelligent experiments and testing to failure, and the use of mathematical nonlinear programming or optimization to design and predict how hypothetical ideas and/or solutions will perform. A sound research culture has a healthy preoccupation with failure and a constant commitment to understanding and predicting how systems will fail.
Above all, a research culture places psychological safety and a psychologically safe environment at the top of its priorities, ensuring that every voice is heard and that people feel free to engage, disagree, dissent, and take interpersonal risks without fear of reprisal or recrimination. I describe the differences between excellent engineers and research engineers and show how vital it is to develop integrated teams of scientists, research engineers, engineers, and skilled technicians to solve the complex, highly coupled technical problems that lead to disastrous failures. The seeds of this research DNA spread from NACA to create NASA and secure the U.S. victory in the space race to the moon in the late ’60s. I will describe how approximately 40 research engineers from two NACA Centers, Langley and Lewis, formed JSC and eventually all the other NASA Centers; trace the various cultures of these centers and how they interacted during the Apollo missions to the moon; and discuss why and how they were successful. I will show how this once-great research culture declined due to a lack of funding and the total ignorance of NASA program managers of its importance to the health of a supposedly cutting-edge technical agency, one that must look far into the future to initiate the research necessary to achieve breakthrough, not incremental, advances in technology. The second half of this book builds on this understanding of how great organizations like NASA lose their way, fail big, and continue to do so, and explains why recurring disasters happen. It then lays out a systematic way NASA and other large organizations can transform themselves and build research networks of high-performing teams to identify early “weak signals” of technical or behavioral dysfunction and rapidly diagnose and fix problems within existing “recovery windows” to prevent tragedies from occurring. Teams are how work gets done today. Many books and technical papers have been written describing the many reasons for faulty decisions in high-stakes environments, often with suggestions on how to prevent such problems in the future. In organizations around the world, work is no longer individual. It is networked and connected. The more complex the work to be done, the more important the team. All teams are not equal, however. It is important to define what a team is, what it is not, and how it differs from a typical group of individuals working cooperatively to solve a complex problem. While all teams can be considered groups, not all groups are teams! Chapter 6 defines these terms and focuses on “High-Performing Teams” or HPTs, detailing the attributes of a team and how these attributes function to define the dynamics and interrelationships that are necessary to monitor team health and predict team performance and outcomes.
It adopts a scientific lens on team performance coupled with real-time, first-person observations and field analysis of teams breaking down in high-stress and high-knowledge-need environments. It lays out a strategy to effectively create, motivate, organize, and lead high-performing teams (HPTs), and it includes key case studies of team breakdown in numerous sectors along with detailed, firsthand experiences illustrating the problems and proposed solutions. Chapter 7 explains why a “command” organizational structure with dense, structured hierarchies of control and accountability, which exists in many large organizations, can impede the rapid response and technical depth needed to address critical anomalies in time to prevent weak signals from growing into large, sometimes catastrophic failures. A research network-of-teams (RNoT) and flat organizational structure is the recommended solution to mitigate risk and prevent disaster. I will present several examples of how I initiated such HPTs to diagnose the technical cause of the Columbia accident, create a validated, physics-based understanding of the foam impacts on the orbiter vehicle, and develop on-orbit wing repair strategies. While these teams are usually created after accidents occur, they are typically disregarded during “normal” operations because such capabilities are believed to be excessively expensive and are viewed as a program cost. I will show how very small HPTs are a critical investment that can have a significant return on investment (ROI) and save lives. A research network of teams also allows workers to assemble ad hoc teams to troubleshoot and address problems as they arise. This increased agility allows for consistent, efficient, and safe systems engineering practices. The final chapter will describe several strategies for organizations either to develop a high-performance learning organization or to reclaim an organization’s successful core ideology. It will pose solutions that take advantage of technology and organizational psychology to create, educate, and connect teams to collaborate and solve not only technical problems but also the social and cultural problems that degrade team performance. More importantly, it will provide assessment and health metrics that leaders can use to ascertain team health and improve team culture and performance.
References
1. Gehman, H. W., et al.: “Columbia Accident Investigation Board.” Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
2. Vaughan, Diane: “The Challenger Launch Decision – Risky Technology, Culture, and Deviance at NASA.” The University of Chicago Press, 1996.
3. McCurdy, Howard E.: “Inside NASA – High Technology and Organizational Change in the U.S. Space Program.” The Johns Hopkins University Press, 1993.
Part I: It’s the Culture, Stupid!

On January 28, 1986, while driving back to work from lunch, I heard on my car radio that the Space Shuttle Challenger had exploded seventy-three seconds into its flight. I rushed back to my office and joined my colleagues clustered around an office TV. At the time, I worked as a research engineer at the NASA Langley Research Center, primarily analyzing, designing, and testing advanced structures, thermal protection systems (TPS), and wing leading edges for hypersonic vehicles like the space shuttle. Hypersonic vehicles fly faster than five times the speed of sound (Mach 5), approximately 3,806 mph or 5,582 fps at sea level. At these speeds and greater, flight through the atmosphere causes aerothermal heating, and temperatures on the vehicle increase considerably. Like the rest of my colleagues, I was shocked and distressed by the disaster. I sought answers, asking other researchers what had gone wrong. Over the following two and a half years, NASA investigated the incident and found and fixed the technical issue: a faulty field joint design for the Shuttle Solid Rocket Boosters, which caused the joint to open under loading and allowed hot gases to blow by two rubber O-rings, escape, and ignite the liquid fuel tank during launch. About seventeen years later, I watched from our cottage in Star City, Russia, as the Space Shuttle Columbia disintegrated upon Earth entry. By then, I had become an astronaut and was training as a backup crewmember for Expedition 8, a mission that entailed an extended stay on the International Space Station (ISS). On Columbia, I lost seven friends and colleagues, three of whom were classmates I had trained with for over two years. So, you could say that I took the Columbia disaster personally. We quickly learned the most probable technical cause of the accident: during launch, a large piece of foam had come off the space shuttle’s external tank and struck the orbiter’s wing leading edge, damaging its thermal protection system and causing Columbia to disintegrate as it traveled through the atmosphere on its return to Kennedy Space Center (KSC).
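For readers who want to check the speed figures quoted above, a Mach number is simply flight speed divided by the local speed of sound, and the speed of sound falls with altitude. A minimal worked conversion, assuming the standard sea-level speed of sound of roughly 761 mph (about 1,116 ft/s):

$$v = M \cdot a_{\text{sea level}} \approx 5 \times 761\ \text{mph} \approx 3{,}806\ \text{mph}, \qquad 3{,}806\ \text{mph} \times \frac{5{,}280\ \text{ft/mi}}{3{,}600\ \text{s/hr}} \approx 5{,}582\ \text{fps}$$

The same arithmetic explains a figure that appears later in this chapter: at the altitudes where the X-15 flew, the speed of sound is closer to 675 mph, so Mach 6.7 there corresponds to roughly 4,520 mph rather than the sea-level equivalent.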
Six months after the accident, the Columbia Accident Investigation Board (CAIB) released a shocking finding. They wrote, “In our view, the NASA organizational culture had as much to do with this accident as the foam1.” Never before had such a claim been made about a technology-driven organization like NASA. But I knew, intuitively, that the board’s assessment was correct. In my career, I had experienced firsthand many of the cultural deficiencies that led to both accidents. Worse, I had witnessed the gradual erosion of the remarkable culture responsible for NASA’s once-storied past: a research culture, one that most experts had overlooked or did not fully understand, but one I was aware of and worked within during my first 22 years at NASA. The title of the first half of this book calls out the continued ignorance of supposedly smart technical personnel and high-level program managers within the agency, who either failed to understand the complex meaning of culture or knowingly brushed its warnings aside after multiple tragedies and allowed it to persist and even erode further, resulting in recurring near misses and further tragedy. While renowned sociologists like Diane Vaughan refused to place blame on anyone at NASA following the Challenger accident2, there was no excuse for the leaders of NASA not to have read, no, studied, her book and papers, which clearly highlighted the cultural causes of that accident, and to have fixed the problem. Vaughan’s exhaustive research post-Challenger and her work on the CAIB post-Columbia enabled the board’s findings to map the causal links of both accidents to a NASA culture that clearly had not changed after multiple accidents. More importantly, following Columbia, the continuance of these dysfunctional behaviors almost caused two more accidents and led to the mishandling of individuals who spoke up and tried to alert upper management to impending safety risks. The fact that these employees were silenced and threatened is disgraceful and negligent. I have absolutely no problem addressing these concerns from an insider’s perspective: as an astronaut who flew on the very next flight following the Columbia accident, STS-114; as a research engineer of over 22 years who was proficient in the engineering disciplines responsible for the incidents in question; and as one of the persons who was threatened and who fought attempts to be silenced and to silence others. In addition, I have spent over 15 years researching the social, psychological, and behavioral issues and can serve as what Dr. Vaughan would call an “embedded ethnographer”—someone who embeds themselves deeply and over a long period of time in a community to objectively observe and document its everyday life, interactions, and behaviors.
Chapter 1
Boiling a Frog – How Great Organizations Lose Their Way

Most great, highly successful organizations that have endured and prospered for a century or more have what authors Jim Collins and Jerry Porras call in their book “Built to Last”3 a winning core ideology: a combination of core values or guiding principles plus a fundamental reason for existence or purpose. It is a guiding star on the horizon, not to be confused with money, specific goals, or business strategy. The authors researched dozens of corporations and highlighted what differentiated the ones that survived over 100 years, at times even losing and then regaining their core ideology, from those organizations that did not survive. One of those winning corporations, Boeing, is currently struggling with the loss of its core technical values after merging with McDonnell Douglas. The cultural slide away from those core values has been cited as a cause of Boeing’s two tragic 737 Max disasters (Indonesia’s Lion Air Flight 610 on October 29, 2018, and Ethiopian Airlines Flight 302 on March 10, 2019). The Boeing disasters would be the second time that culture was raised as the cause of an accident in the aerospace industry! Similar cultural causes can also be found in accidents in other high-risk/high-hazard industries, like the Three Mile Island (TMI) nuclear reactor accident, the British Petroleum (BP) Deepwater Horizon oil spill, and the Union Carbide chemical plant accident in Bhopal, India, which killed over 20,000 people and affected over 500,000. An organization’s core values are the DNA that drives the vision, and together, they attract the best and the brightest to work long, hard hours. These new employees have the passion and drive to excel and to be part of a grand mission or purpose bigger than themselves. An organization’s culture is defined by noted sociologist Diane Vaughan as the “values, norms, beliefs, and practices that govern how institutions function”1,2 and is very much related to the idea of an organization’s core ideology.
More than anything else, it is a company’s cultural devotion to its core values that enables it to persist and succeed during its most difficult times. Sometimes great companies take great risks and set Big Hairy Audacious Goals (BHAGs), in Collins’ words; sometimes they have visionary leaders like Bill Gates of Microsoft and Steve Jobs of Apple; and sometimes they have both, like Elon Musk, whose vision for his company, Space Exploration Technologies Corporation (SpaceX), is a multi-planet species and the colonization of Mars. By setting his sights far into the future and addressing some of the most pressing needs upfront, Elon was able to transform a struggling commercial space startup into one of the top commercial space corporations in the world, reducing the cost of payload to low Earth orbit by almost an order of magnitude in less than ten years. In addition to a great vision, mission, and core ideology, there is, according to noted business author Warren Bennis in his book “Organizing Genius – The Secrets of Creative Collaboration,” the idea of great groups or high-performing teams (HPTs)4, whose magical “it-factor,” or cohesion, enables them to overcome unbeatable odds and accomplish the impossible. These groups act as if they are on a “mission from God.” Bennis and Biederman dissect and analyze groups and entire organizations such as the creative artistic efforts of “Troupe Disney” and Black Mountain College, the successful presidential campaign team of Bill Clinton, the technical successes of the Manhattan Project, and the Lockheed Skunk Works. In Chapter 6, we will address in detail the qualities and attributes of great groups or HPTs and how they compare to highly successful sub-units within NASA and other organizations. Understanding the teams and cultures that coalesced to create the magic that enabled NASA to pull off one of the most extraordinary engineering feats of all time, landing men on the surface of the Moon and returning them safely back to Earth, is a good place to begin. Since most complex organizations are composed of dozens, if not hundreds or thousands, of working groups or teams, it is only natural to assume that there could be as many sub-cultures as there are sub-units. In Dr. Vaughan’s words, “The idea of organizational culture must be refined even more, however, for rules, rituals, and beliefs can evolve which are unique to work groups.”2 Yes, working groups such as those that worked on the O-ring problems before Challenger and the ET foam release and damage assessment groups before Columbia. It takes only one of these groups losing its way, overlooking or misdiagnosing critical anomalies, for disaster to occur.
NACA – What Made NASA Great

I used to get really upset when I heard the term “the NASA culture” used by the media, as if NASA were a homogeneous collection of a Headquarters in Washington, D.C., and nine field centers and facilities across the United States with a singular set of values, norms, beliefs, and practices. Diane Vaughan’s analysis of the Challenger accident focused primarily on one NASA Center, the Marshall Space Flight Center (MSFC), because it was responsible for the Solid Rocket Boosters (SRBs) whose field joint failure was the primary technical cause of the accident. The research for her Ph.D. dissertation and book2 took ten years. During that time, she analyzed over 122,000 pages of NASA documents, 2,500 pages of testimony transcripts from the Presidential Commission (Rogers Commission) report on the Challenger accident5, 9,000 pages of transcripts from government interviews, and numerous personal and telephone interviews. In her role as a “historical ethnographer” (her words), she was able to identify numerous NASA cultures within MSFC, such as an engineering culture, a work group culture, a production culture, and a culture of production. As someone who had worked for NASA for almost 30 years, when I first read The Challenger Launch Decision, I was very impressed with how accurate her assessment of the behaviors she classified was, considering her restriction to working as a historical ethnographer rather than as someone who observed NASA from within. Imagine how effective she would have been had NASA employed her to actively monitor and observe the agency from the inside as an “embedded ethnographer,” someone who objectively observes a culture from within without influencing the outcomes. That is something she requested and was refused by NASA following Columbia.6 Following the Columbia accident and upon my return from space, I would later interview Dr. Vaughan at her office at Columbia University in Manhattan and give her a signed collage that we flew on our mission, STS-114, to thank her for her work on the Columbia Accident Investigation Board and her service to NASA. Author Howard McCurdy, in his book “Inside NASA”7, best describes the organizational history of NASA as a confederation of cultures, “a confederation of institutions, each with its own history and traditions.” Yes, to truly understand NASA and the multitude of its cultures and sub-cultures, you must understand the predecessor organization from which it was formed, the National Advisory Committee for Aeronautics (NACA), and its research laboratories (shown in figure 1), the German rocket culture of Wernher von Braun and the Army Ballistic Missile Agency (ABMA), the Air Force systems management culture, and their unique founders and leaders, their visions, and missions.
In 1915, the United States established the National Advisory Committee for Aeronautics (NACA) after the country’s lead in aeronautics, following the Wright brothers’ accomplishments, quickly evaporated as the pressures of World War I pushed France, the German Empire, and the British Empire to develop warplanes. The U.S., by contrast, did not fly a single plane in that conflict. NACA’s purpose during this period was “to supervise and direct the scientific study of problems of flight with a view toward their practical solution and to determine the problems which should be experimentally attacked, and to discuss their solution and their application to practical questions.”8,9 In plain English, NACA’s mission was to discover as much about flying machines as possible, as quickly as possible, and to use that knowledge to help the United States aircraft industry develop the greatest airplanes possible. This mandate was the seed of early NASA’s research culture. It established that what mattered most was truth, knowledge, discovery, and the freedom to think critically, experiment, and validate new knowledge. To achieve this goal, NACA formed an intelligence network of some of the most knowledgeable mathematicians, scientists, and engineers. In 1917, NACA established the Langley Memorial Aeronautical Laboratory (LMAL) as the central hub of that intelligence network. Named after the famed aviation pioneer Samuel Pierpont Langley, LMAL would later become the NASA Langley Research Center (LaRC), where I spent the bulk of my career. LMAL was led by mathematicians, physicists, and engineers from some of the top schools in the nation who traveled to a small town in Virginia called Hampton, where these strange scientist types became known as “NACA Nuts” (NACA pronounced “Nack-uh”) by the locals because of their peculiar eccentricities. They were geeky “technical sophisticates par excellence,” the sort of engineer “who wanted to know the RPM of his vacuum cleaner and asked that his lumber be cut to the nearest sixteenth of an inch.”9 This early team at LMAL was very young and inexperienced; they had to learn the fundamental physics of flight through experimentation and the empirical relationships and theories they would investigate and develop. Hence, three groups formed to conduct the research needed to understand the physics of flight through the atmosphere: the theoreticians (mostly physicists, scientists, mathematicians, and engineers), who built the mathematical or numerical models to predict the outcome of a laboratory or wind tunnel test; the experimentalists (the engineers and scientists who developed the wind tunnels, instrumentation, and laboratory facilities), who would conduct the experiments or tests; and the highly skilled technicians, the craftsmen who built and instrumented the models and test components to collect the data during experimental testing.
Often, these researchers would use a “cut-and-try” method of testing ideas held in their “mind’s eye” as much as numerical analyses and parameter variation to understand the fundamentals of, say, transonic flow (flow close to the speed of sound), and to develop the unique wind tunnels needed to conduct precise experiments9,10. These researchers would use a rigorous scientific method of analysis coupled with experimentation to validate the phenomena they were interested in studying, for example, airflow over a wing and the resulting air loads that had to be carried by the vehicle’s structure. Langley researchers used an in-house, “hands-on” approach to analyze, build, test, fail, discover, and learn. Their researchers focused on aeronautics, fluid flow, flight mechanics, aircraft design, materials, and structures. To truly understand the various cultures of LMAL, you must also study the history of the great leaders who founded the organization and of the world-class researchers who worked there and developed some of the top technical innovations of the time9. The great theoreticians and experimentalists at Langley were at constant odds with one another as they struggled to understand the principles of flight. This healthy tension was allowed to flourish in a psychologically safe environment where feisty researchers would critically evaluate ideas and validate new knowledge through intelligent experiments. They utilized a “hands-on,” building-block approach to construct knowledge by testing and analyzing elements, components, subsystems, and integrated systems in laboratory test facilities and wind tunnels. Knowledge thus gained and validated would then be used to design, build, and test full-scale systems and would eventually be verified by actual flight testing. This research culture demanded that its employees have an incessant thirst for truth and knowledge and an unrelenting quest for accuracy. It was a true meritocracy, where deference is always paid to the person with the knowledge, regardless of rank or status within the organization. One of the primary products of NACA was its technical papers and reports, which documented its aeronautical research in an open forum for critical review. These early researchers were allowed to take full ownership of every aspect of their research, working closely with the machinists, instrument developers, and laboratory technicians. This early mentorship in running small projects “hands-on” prepared them to confidently progress to technical management of much larger programs. They were learning the entire process of applied research and development from the ground up, which would enable them to become very strong technical leaders as the organization grew.
Applied research is different from fundamental or basic research in that it is focused on the practical solution of a specific problem and the advancement of knowledge in a particular field, as per NACA’s mandate, “the problems of flight.”8,9 Throughout its history, NACA produced over 16,000 technical reports that were highly regarded, sought after, and used by industry and academia. “The goal of the NACA was to provide the best aeronautical research that it could in all fields, whether it be low- or high-speed aerodynamics, gas dynamics, structures, or guidance and control7.” An in-depth description of this research culture is given in Chapter 5. It was the primary reason for the success of the NACA organization and the reason that the U.S. rapidly became the world leader in aeronautics before World War II. “Research done at Langley in the fields of subsonic stability and control, loads, propulsion, and structures – that is, research on the practical aeronautical problems of the day – contributed significantly to the design of military aircraft essential to the Allied victory9.” Since LMAL was the first laboratory formed and its excellent staff often seeded the cultures of the other laboratories, Langley is often referred to as the “Mother” Center. The next NACA laboratory, the Ames Aeronautical Laboratory at Moffett Field in California, was formed in 1939 and staffed with a nucleus of Langley researchers. Named after Joseph Sweetman Ames, a physicist and founding member of NACA, the Ames laboratory pursued studies in aerodynamics with a focus on high-speed flight11. One year after Ames was formed, a third laboratory, the Lewis Flight Propulsion Laboratory in Cleveland, Ohio, was established to concentrate on engine and propulsion research, augmenting and leading the work being conducted in private industry at that time. These three research laboratories maintained a very similar applied-research approach, as opposed to the more basic or fundamental research being conducted in most academic institutions at the time. There was some overlap in the various disciplines studied at the NACA laboratories, which allowed competing theories to be thoroughly vetted, a vital component of a thriving research culture. What we will learn when we study the attributes of high-reliability organizations (HROs) is that most of the characteristics of a research culture, such as redundancy and overlap, are also hallmarks of successful HROs. NACA was not a “political” organization, nor was it bureaucratic; its technical and scientific standards became the dominant criteria for decision-making.
Researchers were allowed to try ideas without a formal chain of command, to fail, and to learn as quickly as possible. This “permission to try and try again”12, or permission to fail, together with NACA’s flat organizational structure, was a critical part of the rapid advancement of research. “Learning by repeated attempts may appear cumbersome, but failures indicated areas where further research was needed to improve the understanding of flight phenomena. At Langley, the mistakes were just as important as the successes, for they sowed the seeds of future accomplishment.12” Perhaps the remote locations of the first three NACA laboratories across the country helped maintain this technical purity. The term “engineer in charge” was often used to describe the powerful technical leadership in upper management which enabled the success of NACA and early NASA. After World War II, NACA Langley engineers and technicians set up a field station to study high-speed flight using experimental aircraft on a barren portion of the Mojave Desert in southern California, at a U.S. Army Air Forces test site beside a large dry lakebed. The Muroc Flight Test Unit was established in 1946, and NACA and the Army Air Forces tested the Bell X-1 aircraft, which Chuck Yeager flew to become the first person to break the sound barrier on October 14, 1947. Twenty years later, the U.S. Air Force and NASA would fly the North American X-15, which reached the hypersonic speed of Mach 6.7 (6.7 times the speed of sound at that altitude, or 4,520 mph). NACA was rapidly gaining useful knowledge of engine- and rocket-propelled flight through the atmosphere, which would be put to immediate use at the dawn of the space race with Russia. Full-scale flight testing of new ideas, concepts, systems, and subsystems was a very important part of the knowledge verification process to understand the limits of performance and, thus, minimize risk.

The NACA Culture Meets the Space Age

Prior to the Soviet launch of Sputnik 1 on October 4, 1957, Robert Gilruth, as the deputy director of the Langley Pilotless Aircraft Research Division (PARD), had been testing the aerodynamics of transonic and supersonic flow over airfoils by firing rocket models over the Atlantic Ocean from a modest launch facility at Wallops Island, Virginia. Langley’s cutting-edge research into high-speed flight at ever-increasing altitudes was directly responsible for its leadership role in the newly formed Space Task Group (STG), which was created in 1958 in response to the Sputnik challenge of the Soviet Union. Robert Gilruth would head the STG, which was located at Langley and consisted of 37 research engineers (27 from Langley and 10 from Lewis) who managed the Mercury space program, NASA’s first manned missions into space.
The STG would include some of the great leaders of the young space agency: names like Max Faget, head of the Flight Systems Division, where spacecraft were designed and tested, and the designer of the Mercury capsule; Chuck Mathews, head of the Operations Division; Christopher C. Kraft Jr., his assistant and head of Mission Control (the “Man Called Flight”); and other noteworthy names in the space industry like Gene Kranz (the famous flight controller during the Apollo 13 mission) and a young Glynn Lunney, people I would later meet as an astronaut13. The strong research culture that pervaded the NACA laboratories could now be used to seed the early beginnings of NASA and its space program.

The Army Ballistic Missile Agency (ABMA) – The German Rocket Culture

The next important culture to influence the development of NASA came from a group of about 125 German rocket scientists7, engineers, and technicians headed by Wernher von Braun who, at the end of the war in Germany, surrendered to U.S. troops and were brought to America under “Operation Paperclip,” a secret program by the United States after World War II (WWII) to recruit German technical experts to work for the U.S. government and the military. Following WWII, NACA employees conducted research on the aerodynamics and the stability and control of missiles. They did not develop their own rocket programs; those were considered the province of the military. The German rocket team worked for the U.S. Army, testing their V-2 rockets at White Sands Proving Ground in New Mexico before being relocated to Redstone Arsenal in Huntsville, Alabama, as part of the Army Ballistic Missile Agency (ABMA). While the research engineering culture spawned at Langley was the dominant culture of NACA, one that spread to the early NASA research centers (Langley, Ames, and Lewis) and the space flight center in Houston (the Johnson Space Center), it was primarily the German rocket culture that helped form the other two space flight centers at NASA, the Marshall Space Flight Center (MSFC) and the Kennedy Space Center (KSC). The culture that permeated ABMA and MSFC was a product of its visionary leader, Dr. Wernher von Braun. Von Braun was a charismatic leader who was technically brilliant and politically adept, and whose far-reaching vision and creativity for space exploration could inspire young and old and capture the imagination of people like Walt Disney. His philosophy was grounded in attention to technical detail and testing to failure to understand the limitations of design concepts. His German rocket scientists believed in controlling all aspects of a rocket project using an in-house team, very similar to the NACA culture.
culture. He fought to maintain control of hardware development as much as possible during the early years of the Apollo program, so much so that the first eight of the Saturn I rocket’s stages were built at the Huntsville facility. The von Braun team used to drive the program managers and industry contractors crazy early in the space program because they would take apart rocket components and engines built by outside contractors to understand how they were built, put them back together, and then test them. They did not trust outside contractors’ work and reluctantly had to accept hardware developed outside their shop as the Apollo program grew and the number of contractors and hardware necessary to build the Saturn V rocket grew so large it became impossible to do everything in-house. Von Braun argued for keeping an in-house, hands-on capability to maintain the technical competence necessary to catch mistakes made by outside contractors. This is another very important element of a good research team, being able to maintain technical excellence and mentor the next generation of engineers needed to scale up during the Apollo program and, thus help minimize risk. He maintained an environment of psychological safety where everyone could speak up and raise issues, mistakes, and even failures without any fear of recrimination. There are stories of how he stood up to his boss on the German V-2 rocket program at Peenemunde, who was going to shoot the engineer responsible for a failure, and took full responsibility to protect his team. Or how he celebrated a team member responsible for an early Redstone rocket failure who came forward and admitted a mistake which probably led to the failure by praising him in front of the team with a bottle of Champagne. He led by example because he knew the only way to succeed was to create an environment where failures, which are expected to happen when dealing with complex systems, should be brought out into the open to be studied and learned from. These were successful failures.—the “right kind of wrong” as Harvard Business School Professor and author Amy Edmondson would research much later and write about extensively14. Like the NACA research culture, the von Braun team held a healthy respect for maintaining an in-house technically excellent team to act as a second set of eyes to catch mistakes by doing independent inspections, analyses, and testing. They maintained a healthy skepticism for anything manufactured externally, conducting single prototype tests of individual components to failure to discover faults, flight item tests to identify manufacturing errors, and complete spacecraft tests of integrated systems. The rocket scientists at MSFC preferred an incremental, step-by-step approach to verifying engine components through flight testing as opposed to 24
their Air Force missile counterparts, who leaned toward ground testing and full-up rocket tests. This step-by-step flight test strategy was similar in ways to the NACA research culture, which used a building-block approach to construct new knowledge that was heavily tied to and validated by rigorous analytical and numerical correlation. The NACA and ABMA research cultures would clash with the Air Force missile culture when the American space program and Apollo became more complex, and systems engineering and new systems management techniques offered by leaders in the Air Force Intercontinental Ballistic Missile (ICBM) Program would be required to keep the space race on track to meet schedule and budget pressures.

The Air Force Missile Program – Systems Engineering and Systems Management Culture

Following World War II, the U.S. Air Force focused much of its rocket work on developing guided missiles that could carry a nuclear warhead. The large, complex systems necessary to produce reliable, accurate, long-range intercontinental ballistic missiles (ICBMs) relied on a complex network of Air Force and industry contractors to analyze, design, manufacture, test, and qualify such systems. The Air Force was forced to use external contractors because it did not have the in-house technical competencies the Army had, and it did not have the time to recruit and train new engineers. The Air Force also relied on a relatively new “systems engineering” approach to develop these complex weapon systems, which in turn required a systems management philosophy that was very different from that used by the NACA and ABMA rocket scientists. The complexity of the systems and subsystems required to develop reliable missiles forced them to use systems engineering to understand how individual systems and subsystems (e.g., the propulsion system or engines, propellant tanks, guidance, navigation, and control systems, etc.) could be integrated and how they would interact to perform reliably. The systems engineering and management function was often contracted out; “the system allowed them to formulate requirements, manage costs, control schedules, check performance, and plan operations7.” It was a scientific way of managing and codifying the thousands of steps involved during the research and product development process. Processes called configuration management became necessary to ensure that design changes were tracked with hardware and that interfaces between systems were understood, as were the interactions at those interfaces during all phases of development. A typical example of an interdisciplinary interaction in early rocket failures
was the sloshing of fuel in the fuel tanks, which produced dynamic loads on the vehicle, causing stability and control problems. Hence, engineers had to understand how loads at the subsystem level could affect the performance of the full-scale flight system. This new field of engineering was quite different from academic research that would delve deeply into one specific discipline, for example, materials science or structures, without addressing interactions at the higher system levels. Chapter 5 will give a much more detailed explanation of how interdisciplinary systems and their responses can lead to totally unexpected outcomes if not properly understood and analyzed. You can imagine that, for such large, complex systems, the management of information and communication between organizations at every system and subsystem level can be quite burdensome. One small detail or change to one of millions of components could result in failure and tragedy.

NASA – The Early Years

The launch of Sputnik 1 on October 4, 1957, was a clear signal to the United States that we were in a technological race with Russia. Dominance in key areas of nuclear technology development, missile technology, and the importance of the “high ground” of space loomed heavily on the minds of government and military leaders in Washington, D.C., and throughout the country. President Eisenhower signed the National Aeronautics and Space Act in July of 1958, and NASA began operations as a civilian space agency on October 1, 1958. Two of the new agency’s top priorities would be the “expansion of human knowledge of phenomena in the atmosphere and space and the preservation of the United States as a leader in aeronautical and space science technology and in the application thereof to the conduct of peaceful activities within and outside the atmosphere15.” Clearly, research and development were to be a primary purpose of NASA as per the Space Act, which created it. The NACA laboratories at Langley, Lewis, and Ames were quickly integrated with work being conducted in the Army Ballistic Missile Agency (ABMA), which became the Marshall Space Flight Center; the Goddard Space Flight Center (GSFC), which consisted of a nucleus of scientists and engineers from the Naval Research Laboratory who had been working on the Vanguard missile program; the NACA High-Speed Flight Research Station, which became the Dryden Flight Research Center; the Jet Propulsion Laboratory; the Johnson Space Center, which was initially formed with a nucleus of engineers from the Space Task Group (STG) from Langley and Lewis; and the Kennedy Space Center, which drew staff from the ABMA German rocket team.
The initial STG was composed of about 27 researchers from Langley and ten from Lewis. Langley played a prominent role in the initial space studies and led the Mercury Program from its research center in Hampton. In fact, the first seven Mercury astronauts were stationed at Langley when they began their training. The principal designer of the Mercury capsule was Langley engineer Max Faget. Its initial heat shield development was a combined NACA effort, and the initial flight tests used a Little Joe rocket at Wallops Island, Virginia. All were the result of extensive research and technical support from the Langley team. The hands-on research DNA from NACA, which flowed into the technical leadership and engineering teams at JSC, integrated well with the strong engineering and test culture from the ABMA. This strong in-house research and test culture provided the key ingredients that propelled the early success of the space program in the U.S. during the Mercury and Gemini programs. However, it became evident very early during the buildup of the Apollo Program that scaling up to analyze, design, build, and test the complex subsystems and systems necessary to safely land and return humans from the surface of the moon would require a much larger network of industry contractors, academic researchers, and government resources, all of which had to be scientifically managed to ensure the reliability of a rocket system designed to carry humans in space. The efficient, flat organizational structure and informal chain of command and authority, once the standard for NACA, now had to give way to a more military, hierarchical chain-of-command structure and “contracted-out” vs. “in-house” control over manufactured hardware. While the NACA research culture was still able to continue in-house work with very deep penetration into work being done at contractor facilities during Mercury, times were changing as the Apollo program grew and schedule pressures mounted to meet President Kennedy’s “Man on the Moon” challenge, issued at Rice University on September 12, 1962, where he uttered those famous words: “We choose to go to the moon in this decade and do the other things, not because they are easy, but because they are hard.” NASA had fewer than eight years to accomplish this epic challenge. Fortunately for NASA, sound research and engineering staff mentored and raised at NACA became leaders and filled the senior management ranks of the fledgling organization: leaders like Bob Gilruth, who became director of the Johnson Space Center and who had conducted extensive research on high-speed flight, and Max Faget, who directed engineering and was the primary designer of the Mercury space capsule. These great men provided the strong technical leadership needed to go toe-to-toe with many of the new leaders at NASA from the Air Force who pushed the systems program management philosophy, which
was now needed to meet the demands of President Kennedy’s challenge. Air Force leaders like General Samuel C. Phillips, who was the vice commander of the Air Force Ballistic Missile Systems, joined NASA to manage the Apollo Program. NACA leaders were not accustomed to the strict “chain-of-command” enforcement of communication of the military culture, not to mention the level of power the military brass was used to wielding. However, at that time, NASA had real technical powerhouses running the engineering side of the program, and they could hold their own when it was necessary to win a technical argument over a purely budget- and schedule-driven decision. This healthy tension was necessary. A great example of the strong technical and leadership culture and psychologically safe environment during early NASA is the story of how a feisty, sharp Langley researcher, John Houbolt, went around the chain of command at Langley and sent a letter to Bob Seamans at NASA HQ to push his Lunar-Orbit-Rendezvous (LOR) strategy to go to the moon16. Even though von Braun led Apollo rocket development at Marshall and his team was pushing the development of a very large rocket, the NOVA, which would have been developed at MSFC and which would have flown an Earth-Orbit-Rendezvous (EOR) mission, at the end of a heated meeting at MSFC, von Braun stood up and said, “It is the position of Marshall Space Flight Center that we support the lunar orbit rendezvous plan.13” Without consulting with his team, von Braun had switched sides, the sign of a true leader who was a rocket scientist first and a bureaucrat second! Oh, how things would change within the agency over time. Years later, when Apollo budgets began to sink precipitously and large centers like JSC and MSFC were locked in competition for funding, such leadership would be hard to find. Over the years, after the Challenger accident, according to Diane Vaughan, NASA engineers would become “servants of power”2 and bow to the demands of much more powerful program managers. One reason was the imbalance of power between a strong program office and a very weak, underfunded, and less-respected engineering directorate. During NACA, the scientists and engineers ruled the organization; scientific integrity, curiosity, knowledge, and truth reigned, not political clout or the programmatic power of the purse strings! Many NASA historians credit the success of the Apollo Program to several factors: the galvanized will of the American people toward the triumph of our free democratic capitalist ideology over the totalitarian, socialist regime of our communist rivals in the Soviet Union during a raging cold war; the increased level of funding and sustained political support; and the technical excellence of our government, industry, and academic partnerships. While all the above are true, I believe it was our depth and breadth of applied research and hands-on
product development cultures à la NACA and ABMA, together with the systems management discipline of the Air Force, which stressed strong, centralized programmatic control, that enabled the networks to communicate and function effectively as a high-performing network-of-teams17, with the help of some feisty creative geniuses like John Houbolt and Max Faget; visionary, technically adept leaders like Robert Gilruth and Wernher von Braun; and an environment that was psychologically safe18 and allowed critical thinking, creative ideas, and voices to flourish. The average age of the Apollo team was young, 26, and the average age at NASA at that time was 38. Everyone on the team knew and understood the risks of flying humans to space; in fact, many had come from the high-speed flight test world, where the survival rate of test pilots was only 1 in 4. According to one Apollo program leader, “Anyone that gets on the end of a flaming rocket and doesn’t recognize the risks and dangers associated with it does not understand the problem. On the other hand, and I emphasize this very, very carefully—we would never fly a manned vehicle if we knew something was wrong with it until we fixed it7.” I want you to remember these words when we discuss the NASA culture pre- and post-Challenger and Columbia.

Figure 1. – Organizational makeup and historical evolution of the National Aeronautics and Space Administration (NASA) and its major field centers from its predecessor organization, the National Advisory Committee for Aeronautics (NACA)

The culture during Apollo was preoccupied with failure. Some may view this as a negative aspect; however, experts on high-reliability organizations (HROs) like Karl Weick19 contend a
main principle of most HROs is this constant concern and sensitivity to the possibility of failure. This “normalization of risk,” the acceptance of failure, and the anticipation of trouble “led to an atmosphere in which these things could be discussed openly7.” We will compare this NASA culture to one which would, over the years, devolve to a point where it would accept deviance as normal (the “normalization of deviance”) and which would develop terms like “acceptable risk” to provide the rationale to continue flying vehicles with known anomalies2.

NASA – The Gradual Slide

It took time for early NASA to lose some of the earmarks of the successful, high-performance culture that allowed it to rise to the challenge of the space race of the ’60s and to devolve into the bureaucratic behemoth it would become post-Apollo. Some theorize that bureaus experience a natural life-cycle progression over time. Economist Anthony Downs posits that bureaucracies go through an initial threshold phase where they must survive an early death; a period of rapid growth in size and social significance; and a contraction phase where recruitment slows, average age increases, opportunities for promotion decrease for all but the administrative population, and the ability to attract the best and the brightest begins to wane20. As bureaucracies age, they become less flexible, risk-averse, and conservative. They should die a natural death when they have outlived their usefulness; however, for government bureaus, that is seldom the case. There are cases where organizations can experience “qualitative” growth without expansion following the contraction phase, as long as they can provide incentives to attract high-quality personnel, stay innovative and relevant, and satisfy the needs of the society called upon to sustain them. While NASA was still able to attract top technical people in the decades following Apollo, the environment forced them to perform less attractive functions like managing contracts, writing requirements documents and requests for proposals (RFPs), and other managerial and administrative tasks. NASA survived its initial threshold due to the early accomplishments of its Mercury and Gemini space programs. Money was never an issue early on during the Apollo program because of the gravity of the socio-geo-political environment created by the Cold War with Russia and the epic challenge proposed by a charismatic, popular President Kennedy. Once NASA was formed in 1958, it experienced the predicted growth spurt for a new bureaucracy. NACA grew from 11 employees in 1919 to a total of almost 8,000 employees when it became NASA in 1958, with a ratio of about 34% scientists and engineers, 9% technical support, 44% craftsmen and
technicians, and 13% clerical. At its peak during the Apollo Program, NASA totaled 33,538 employees, with a ratio of 40% scientists and engineers, 13% technical support, 17% craftsmen and technicians, 18% clerical, and 12% professional administrators. The growth spurt for NASA during the 8-year ramp-up of Apollo was just over 3,000 people per year, as opposed to slightly over 200 people per year during the 39-year life of its predecessor, NACA. In only eight short years, NASA had quadrupled in size. In today’s dollars, its budget increased an order of magnitude, from $4.5B in 1959 to $45B in 19677. It was a young, thriving organization with a “frontier culture” looking to achieve new firsts and venturing into the unknown. Top managers, raised and mentored in the research cultures of NACA and ABMA, maintained their research and technical prowess and were able to maintain a level playing field with the ever-growing and powerful professional administrators and program managers. The agency now also had to bow to the demands of political pressures (the “elites”2) and public support. This balance of power between the technical and program management sides of the organization ensured a healthy tension and a psychologically safe environment where technical risks could be weighed against schedule, budget, and political pressures. More importantly, since the newly created space centers’ leadership came from the research centers, collaboration with key researchers with the specific expertise and skills necessary to solve critical problems was immediate, much like a small-world network where access to the right person with the right expertise could be guaranteed and trusted. Ideas for developing small-world networks of researchers, engineers, technicians, and technically qualified program and project managers will be discussed in the second half of this book as a way large bureaucratic organizations can be structured to maintain a high-performance culture. I was an invited panelist at an Economist event in Berkeley, California, in 2011 called “The Ideas Economy: Innovation – Entrepreneurship for a Disruptive World.” Our session was called “Move Fast and Break Stuff.” On stage with me was the vice president for product at Facebook. We discussed how entrepreneurial companies must rapidly innovate and mature new products. Elon Musk and his new company, Space Exploration Technologies (SpaceX), had recently achieved successful orbital flights with its Falcon 1 rocket and ISS re-supply missions with its Falcon 9 rocket. The first question I received from the audience, as a senior advisor for innovation and engineering development at NASA, was from a Gen-Xer: “Why isn’t NASA more like SpaceX?” My reply was, “You young punk, NASA WAS SpaceX during Apollo.” My point being that by then, NASA had drastically
changed from the agency I remembered when I was hired in 1974, let alone the high-performance organization it was during the Apollo program (40 years earlier). In fact, if you follow the progress of SpaceX during the buildup of its Falcon 1, Falcon 9, Falcon Heavy, and Starship rockets, you will notice it almost mirrors the full-up flight test strategy used by the von Braun German rocket team, at the insistence of its “Air Force” managers, at ABMA and early MSFC and KSC. SpaceX was able to fail big because it was a private company whose founder had deep pockets and was willing to bet everything he had. Public support for the space program began to decline in 1967 and continued to do so until 1975, when NASA entered its contraction phase. The ’60s and early ’70s were a tough time for America; protests of the war in Vietnam, civil and racial unrest, and concerns for the environment weighed heavily on the minds of many Americans. Even at its height, the public never fully supported the amount of funding the U.S. spent on the Apollo program (Gallup polls indicated a 58% to 33% split against spending $40B for space); by the end of the program, the ratio was 8 to 1 in favor of cutting spending to NASA7. The Apollo Program ended on December 14, 1972, when Apollo 17 left the lunar surface to return astronauts Eugene Cernan, Harrison Schmitt, and Ronald Evans from the moon back to Earth. As NASA contracted, so did the entire aerospace industry in the United States. I remember this very vividly because I was a sophomore in aerospace engineering at the Polytechnic Institute of Brooklyn in 1972. I watched as engineering students dropped from aerospace departments to mechanical engineering, civil engineering, and even psychology departments in the hopes of staying in school, beating the draft, and possibly getting a job when they graduated. The few of us who remained in aero for four years had a real passion that sustained us as we proudly crossed the stage to receive our diplomas in Brooklyn in 1974. As funding to NASA decreased, the individual NASA centers that grew most during Apollo (JSC, MSFC, and KSC) began to compete with each other, and even with the research centers, for the lion’s share of the budget to sustain the large infrastructure and workforce that had grown during the expansion phase. More out-of-house work and contract management functions slowly crept into the everyday tasks required of the technical staff. The human spaceflight budgets for JSC, for example, were an order of magnitude larger than the budget of one of the research centers. JSC and MSFC fought tooth and nail for leadership of any future space programs. Collaboration and exchange of expertise from the research centers to the space flight centers decreased steadily over time. Funding for hardware testing for programs like the Hubble Space Telescope, the Space Shuttle, and the International Space Station (ISS) decreased so much
that engineers had to constantly fight with less technically capable program managers to include much-needed testing in the programs to qualify hardware for spaceflight. When I was a newly selected astronaut, I would watch in disbelief as engineers at JSC tried desperately to advocate for instrumentation to be included in the ISS so they could assess the structural health of the space station as it experienced the thermal and structural dynamic load cycles it would encounter throughout its expected lifetime—information which would be sorely needed by NASA to intelligently extend the life of the station past its design life expectancy, which it did anyway. Full-up flight testing of unmanned rockets was out of the question. Even the Space Shuttle, one of the most complicated and novel design ideas for a partially reusable space vehicle, flew its first flight with a crew onboard. Unheard of. In my mind, John Young and Bob Crippen have to be the bravest astronauts ever to step up and fly STS-1. Even though the Space Shuttle had the capability to fly and land totally automated, we chose not to use it. Even the first and only flight of the Russian-pirated copy of our U.S. Space Shuttle, the Buran, flew to space and landed unmanned, never to fly again due to serious thermal protection system problems. NASA was incurring unnecessary risks to meet the schedule and budget pressures being imposed by the political elites. I still remember Thanksgiving dinner over at the Youngs’ house when John Young and his wife, Susie, were looking at a picture of John stepping off STS-1 after its landing: Susie’s remark, “I didn’t think you would come back alive,” John’s reply, “Neither did I,” and the totally shocked look on Susie’s face! Apparently, she didn’t know how brave her husband was and how willing he was to accept the risk. The hefty budgets and large, hierarchical management offices became the normal way of conducting the “business” of space. We used to joke at the research centers, with our meager budgets, that we could survive on the scraps brushed off the tables at JSC and MSFC that were deemed wasteful spending. The lack of a true “dual career ladder” drove engineers to join the ranks of management earlier than desired, where promotions were much easier and often determined by the number of people you managed and the size of your budget, not success or technical competency. When I was hired at NASA Langley in 1974, work on the Space Shuttle program was in its early stages of development. Much of the early concept development and trade studies were conducted at Langley and in industry in the late ’60s. Thermal protection systems (TPS) like the lightweight ceramic tiles were being developed at Ames. My branch conducted analyses and hypersonic
wind tunnel testing of concepts under simulated entry, aero-heating, and load conditions. Materials scientists at Lewis and Langley were developing new materials like reinforced carbon-carbon (RCC), which was one of the concepts being considered for the most highly heated regions of the vehicle: the wing leading edges (the curved front part of the wing) and its nosecap21. During this period, most of the NASA research centers had a separate source of funding that was not controlled by the large program offices like Shuttle. The money flowed directly to these centers and was controlled by local technical managers studying the latest vehicle concepts, their systems, and subsystems. We could evaluate performance, test to failure, evaluate our ability to understand the physics of these multidisciplinary systems, develop methods to analyze and predict performance, and even explore novel ideas of our own. It was the research and the quest for new knowledge that motivated the work of real research engineers. For example, one of the unique wing leading edge designs to reduce the local heating of the wing leading edge was an idea proposed by a man named Calvin Silverstein called a heat-pipe-cooled wing leading edge22. This novel idea would enable the Shuttle to use a much more durable metallic material as opposed to the relatively new RCC design. The RCC design was passive and could survive at much higher temperatures as long as its very thin and brittle oxidation protection coating of silicon carbide (SiC) was not breached; otherwise, the carbon structure beneath would rapidly burn through, a hole could form, and the vehicle would be lost (the very conditions which caused the Columbia to break up during Earth entry). I was able to prove the feasibility of the heat-pipe-cooled concept; however, the RCC design had already been selected for the Space Shuttle23. The beauty of having that freedom at the research centers was that you could continue to develop promising ideas and possibly provide solutions to problems as they arose while the vehicle, in this case the Space Shuttle, was being developed, manufactured, and tested. A high-performance culture is crucial for an organization to rise to the level of a high-reliability organization (HRO) and demonstrate a high level of performance in complex and dynamic environments while managing to avoid catastrophic failures. Maintaining such a culture and environment so that disasters do not recur is crucial. The role of social and behavioral science in understanding an organization’s culture, when problems arise, and what organizations can do to prevent a slide down this “slippery slope”24 is often not fully appreciated by the left-brained, not-so-technical managers, and, hence, attempts to transform and fix the problems fail and similar tragedies recur. God
bless social scientists like Diane Vaughan for their valiant attempts to alert NASA executives to the importance of culture. The real question is: why do storied organizations like NASA and Boeing continue to make the same mistakes over and over again, causing repeated disasters, even after the cultural root causes are known and so clearly laid out by sociologists like Vaughan and others following the Challenger and Columbia accidents?

References

1. Gehman, H. W., et al.: “Columbia Accident Investigation Board.” Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
2. Vaughan, Diane: “The Challenger Launch Decision – Risky Technology, Culture, and Deviance at NASA.” The University of Chicago Press, 1996.
3. Collins, Jim, and Porras, Jerry I.: “Built to Last – Successful Habits of Visionary Companies.” HarperCollins Business Essentials, 1994.
4. Bennis, Warren, and Biederman, Patricia Ward: “Organizing Genius – The Secrets of Creative Collaboration.” Basic Books, 1997.
5. Presidential Commission on the Space Shuttle Challenger Accident: Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident. 5 vols. Washington, D.C.: Government Printing Office, June 6, 1986.
6. Vaughan, Diane: “NASA Revisited: Ethnography, Theory, and Public Sociology.” American Journal of Sociology, Volume 112, Number 2, September 2006.
7. McCurdy, Howard E.: “Inside NASA – High Technology and Organizational Change in the U.S. Space Program.” The Johns Hopkins University Press, 1994.
8. Public Law 271, 63rd Cong., 3rd sess., passed March 3, 1915 (38 Stat. 930).
9. Hansen, James R.: “Engineer in Charge – A History of the Langley Aeronautical Laboratory, 1917–1958.” The NASA History Series, NASA SP-4305, 1986.
10. Ferguson, Eugene S.: “Engineering and the Mind’s Eye.” The MIT Press, 1994.
11. Muenger, Elizabeth A.: “Searching the Horizon – A History of Ames Research Center, 1940–1976.” NASA SP-4304, 1985.
12. Schultz, James: “Crafting Flight: Aircraft Pioneers and the Contributions of the Men and Women of NASA Langley Research Center.” NASA History Series, NASA SP-2003-4316, 2003.
13. Schefter, James: “The Race – The Uncensored Story of How America Beat Russia to the Moon.” Doubleday, 1999.
14. Edmondson, Amy: “Right Kind of Wrong: The Science of Failing Well.” Atria Books, 2023.
15. The National Aeronautics and Space Act. Public Law 85-568, July 29, 1958.
16. Brooks, Courtney G.; Grimwood, James M.; and Swenson, Loyd S. Jr.: “Chariots for Apollo: A History of Manned Lunar Spacecraft.” NASA SP-4205, 1979.
17. McChrystal, General Stanley: “Team of Teams – New Rules of Engagement for a Complex World.” Penguin Publishing Group, 2015.
18. Edmondson, Amy C.: “The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth.” John Wiley and Sons, 2019.
19. Weick, Karl E., and Sutcliffe, Kathleen M.: “Managing the Unexpected – Sustained Performance in a Complex World.” Wiley, 2015.
20. Downs, Anthony: “Inside Bureaucracy.” Little, Brown and Company, 1967.
21. Camarda, Charles J.: “Space Shuttle Design and Lessons Learned.” NATO Science and Technology Organization Lecture Series on “Hypersonic Flight Testing,” STO-AVT-234-VKI, March 24–27, 2014, von Karman Institute, Rhode-St-Genèse, Belgium.
22. Silverstein, Calvin C.: “A Feasibility Study of Heat-Pipe-Cooled Leading Edges for Hypersonic Cruise Aircraft.” NASA CR-1857, 1971.
23. Camarda, C. J., and Masek, Robert V.: “Design Analysis and Tests of a Shuttle-Type Heat-Pipe-Cooled Leading Edge.” Ninth Intersociety Conference on Environmental Systems, San Francisco, CA, July 16–19, 1979.
24. Vaughan, Diane: “System Effects: On Slippery Slopes, Repeating Negative Patterns, and Learning from Mistake?” in William H. Starbuck and Moshe Farjoun, eds., “Organization at the Limit: Lessons from the Columbia Disaster.” Oxford, UK: Blackwell, 2005: 41–59.
Chapter 2
Culture as Cause – Connecting the Dots, Challenger to Columbia

The original research culture at NASA began to change following the Apollo program. NASA experienced a period of contraction where budgets dropped precipitously, civil service and support personnel at the centers were reduced, the agency struggled to sell ideas for human space exploration to Congress, and public interest in space was waning. Regardless, images of routine access to orbiting space stations and colonies on the moon and Mars still lingered in the imaginations of the Apollo generation and one or two generations to follow. NASA proposed ideas for human-tended orbiting space stations that would be resupplied by a space “tug” or shuttle, which would routinely carry crews and supplies in a shirtsleeve environment much like a conventional airliner. NASA oversold the capabilities of its Space Shuttle program to members of Congress and had to struggle to find funding by partnering with the Department of Defense (DoD) and promising it could meet the satellite launch demands of both the military and private sectors. Many would blame the ever-changing design requirements as a major reason for the failings of the shuttle design selection; however, there were more serious reasons. To understand the technical causes of the Challenger and Columbia accidents, as well as the organizational behaviors responsible, we need some background.

Space Shuttle Design

Between 1970 and 1980, at the end of the Apollo program and the advent of the shuttle program, the locus of power shifted from the research centers to program managers at the spaceflight centers (Johnson, Marshall, and Kennedy), career bureaucrats who may or may not have had a degree in engineering and who had very little understanding of research and its processes. These program managers
instituted what Diane Vaughan, a professor of sociology at Columbia University and contributor to the CAIB, calls a “production culture,” which she identified as one of the key contributing factors to both the Challenger and Columbia accidents1,2. Worse yet, the production culture led to an “operations culture,” one step further removed from engineering development, at KSC, which managed launch operations; at MSFC, which managed the propulsion and tankage elements of the Space Shuttle and payload operations; and at JSC, which led the shuttle program and managed all in-flight and on-orbit operations as well as astronaut training. Good engineers at all three centers were now busy looking over the shoulders of contractor engineers responsible for hardware development and testing instead of actually doing in-house analysis and testing. The role of most engineers at NASA was now just monitoring the hardware analysis and testing of the contractors, with very little in-house, independent verification; they performed only the “operation” of said hardware. Yet they had the responsibility for safely operating such hardware in space’s extreme environment with astronaut crews. This was something Dr. von Braun would not stand for during Apollo. He fought to maintain 10% of the funding for Apollo to be spent on in-house engineering and testing. He knew this was the only way to maintain a technically competent workforce to oversee the contractors and to catch mistakes. This shift brought about dramatic changes in how NASA functioned. NASA stopped funding competing design approaches to technologies and vehicles and instead selected a single concept or “point design” as the winner early in the concept development phase of the contract proposal process. The contract would usually be awarded to the “lowest bidder” because NASA, a government organization, had to be ever vigilant and accountable for the taxpayer funding of a public venture. NASA selected a winged orbiter concept that would land on a runway and was radically different from previous space capsule designs that plunged the crews into the ocean upon return from space. It had many new technologies that were not thoroughly tested prior to selection, and hence, there were lots of unknowns or “knowledge gaps” that had to be understood or closed prior to actual flight. In other words, NASA would have to go to remarkable lengths during the shuttle development program to ensure that the chosen concept was feasible. The program managers figured that by eliminating multiple teams conducting research across multiple centers, they could streamline production and cut costs. But this move had the opposite effect. Figure 1 illustrates schematically how the ever-changing design requirements and funding issues in phases A through C of the shuttle proposal and phase-gated product development cycle resulted in the selection of the partially reusable
shuttle configuration shown in Phase C. The shuttle was originally required to fly somewhere between 50 and 100 missions per year, have a turnaround from landing to launch of less than two weeks, maintain launch costs on the order of $1,000 per pound of payload to low Earth orbit (LEO), and provide airline-like operations while on the ground and being readied for flight3. By far, the most significant objectives of the program were to demonstrate a significant reduction in the cost of travel to LEO and routine access to space. As shown in figure 1, the Shuttle program was a dismal failure on both counts: the system had an average flight rate of 4.5 missions per year with a recurring cost of approximately $4B per year and 18,000 ground personnel to operate, resulting in a launch cost of $32,000 per pound to LEO. In addition, the shuttle safety record was poor, with two tragic events that lost 14 crew members and two vehicles in a total of 135 missions4 (a probability of loss of vehicle and crew of 1:67.5).
Figure 1. – The effect of point-based design and making an early design decision in a program with ambiguous and changing requirements and large knowledge gaps in critical systems technologies4.
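These outcome figures follow directly from the numbers quoted above; a rough back-of-the-envelope reconstruction (not an official NASA cost model) runs as follows:

$$\text{Loss rate} = \frac{2\ \text{losses}}{135\ \text{flights}} = \frac{1}{67.5}, \qquad \text{Cost per flight} \approx \frac{\$4\text{B/year}}{4.5\ \text{flights/year}} \approx \$890\text{M}$$

At $32,000 per pound to LEO, an approximately $890M flight implies an average delivered payload of roughly $890\text{M} / \$32{,}000\ \text{per lb} \approx 28{,}000$ lb, well below the orbiter’s advertised payload capacity: another measure of how far the realized economics fell from the promise.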
Early knowledge gaps in selected design concepts led to late failures during product development. The later you catch a failure, the more expensive it is to fix and the more time it takes. Several such examples arose during the development of the Space Shuttle. In fact, on the first trip of the Shuttle Orbiter across the country to KSC for launch preparations atop its 747 carrier aircraft, there was extensive damage to the fragile TPS ceramic tiles, many of which had popped
off in flight. The first launch was delayed over one year, and a NASA Tiger Team, led by my boss at Langley, Paul Cooper, was tasked to find and fix the root cause of the TPS problem4,5. What the reader will learn many times over is that when NASA programs were faced with serious technical challenges, the agency relied on its research centers to save the day. However, these same space flight centers would argue that the cost of conducting applied research was too high and drive bureaucrats at NASA HQ to reduce funding. Early selection of a very fragile TPS tile system proved to be extremely expensive to refurbish after every flight, making the idea of a “space transportation system” or “space tug” impossible and a commercial failure3,4. The initial requirements and performance of the Space Shuttle, when proposed to Congress early on to gain support, were far from the actual results during the operation and retirement of the program3-7.

Figure 2. – Space Shuttle Final Configuration4.

The final Shuttle design configuration is shown in Figure 2, with primary components and individual elements such as the ground handling systems shown in Figure 3. A list of the primary Shuttle systems is shown in Table 1. The two Solid Rocket Boosters (SRBs) were fabricated by Thiokol in Brigham City, Utah, and were selected over liquid-fueled concepts because of their reliability, lower estimated developmental costs, and ease of refurbishment after recovery at sea. They were 149 feet (46 m) long and 12 feet (3.7 m) in diameter, and each generated 3.3 million lbs (14.7 million newtons) of thrust. The SRBs were used as a set of matched pairs (i.e., loaded from the same batches of propellant ingredients to minimize thrust imbalance), and each was made up of four solid rocket motor segments. The four segments were mated at KSC in the Vehicle Assembly Building using an aluminum ring structure and bolts. The ring and bolts that joined motor segments together were known as the field joint (figure 4). (Problems with the nonlinear deformation of this joint under load and the change in material properties of the O-rings at cold temperatures were determined to be the technical cause of the Challenger tragedy on January 28, 1986.)8,9 During launch, six seconds after the Space Shuttle Main Engines (SSMEs), manufactured by Rocketdyne in Canoga Park, California, were ignited
and checked out, the SRBs ignited and provided primary steering control for the duration of their 120 seconds of operation. After their fuel was expended, eight booster separation motors were ignited to separate the two boosters from the stack. The expended motors descended under parachutes and were recovered at sea. There, the nozzles were plugged, the water was removed from the casings, and the motor segments were towed back to the launch site and later shipped back to the manufacturer for refurbishment. The SRB nosecaps and nozzle extensions were not recovered. The External Tank (ET) was manufactured by Lockheed Martin in New Orleans, Louisiana. It was 154 feet (47 m) long and 28.6 feet (8.7 m) in diameter and contained 1.6 million pounds (725,747 kg) of propellant, with the liquid oxygen (LOX) tank above the hydrogen tank, separated by an intertank region. The liquid hydrogen fuel and LOX oxidizer were supplied under pressure to the three SSMEs during ascent. The ET was attached to the Orbiter at one forward attachment point and two aft attachment points. The liquid oxygen tank was an aluminum monocoque construction in the shape of an ogive and contained 143,351 gallons (543 m³) of LOX. The liquid hydrogen tank was an aluminum semi-monocoque construction that contained 385,265 gallons (1,458 m³) of fuel. The ET’s TPS consisted of spray-on foam insulation (SOFI) and pre-molded ablator materials, which maintained cryogenic temperatures of the propellants, eliminated air liquefaction, minimized boil-off, and reduced ascent heating of the structure. This foam coming off the vehicle during launch would be a major problem, causing extensive damage to Shuttle TPS tiles that had to be repaired prior to launch, and it was the eventual technical cause of the Columbia accident.

Figure 3. – Primary Space Shuttle Elements4.

The Main Propulsion System (MPS) was composed of the SSMEs and their controllers, a propellant management system, a helium system, an integrated health management system, a hydraulic system, thrust servo actuators, etc. New high-performance engines, the SSMEs, had to be developed to meet the demands of the shuttle architecture. The engines were reusable, had variable thrust, and used liquid hydrogen for fuel and cooling and LOX as the oxidizer. The three SSMEs were configured in a triangular arrangement, and each had to perform at a thrust level of 375,000 pounds-force (lbf) at sea level and 470,000 lbf in a vacuum (corresponding to 100%), and 417,300 lbf at sea level and 513,250 lbf in a vacuum (corresponding to 109%). The SSMEs burned 750 gallons (2.8 m³) of hydrogen and 280 gallons (1.1 m³) of oxygen per second. The Space Shuttle Orbiter was manufactured by Rockwell International in Palmdale, California, and was constructed primarily of aluminum for load-bearing structures, protected externally by reusable surface insulation (RSI) TPS. The most highly heated sections, like the nosecap and wing leading edges (WLEs), were fabricated from reinforced carbon-carbon (RCC), which could withstand temperatures and maintain structural properties up to 3,000°F (1,649°C). The Orbiter was divided into nine major sections: 1) forward fuselage, 2) wings, 3) mid-fuselage, 4) payload bay doors, 5) aft fuselage, 6) forward reaction control system (RCS), 7) vertical tail, 8) Orbital Maneuvering System (OMS)/RCS pods, and 9) body flap. The Orbiter was 122 feet (37 m) long and 57 feet (17 m) high and had two delta wings with a span of 78 feet (24 m). The Orbiter body flap thermally shielded the three SSMEs during Earth entry and provided pitch control trim during atmospheric flight after entry. The vertical tail consisted of a structural fin surface, a rudder/speed brake surface, a tip, and a lower trailing edge. The rudder split into two halves to serve as a speed brake during landing. Prime contractors selected to manufacture the major shuttle components often subcontracted to other aerospace companies for subcomponents, so funding and jobs were distributed at a granular level throughout the United States. For example, the Orbiter wings were manufactured by Grumman Aerospace in Bethpage, Long Island, New York; the RCC nosecap and wing leading edges were manufactured by LTV Aerospace Corporation
in Ft. Worth, Texas; the reusable surface insulation (RSI) thermal protection system ceramic tiles were manufactured by the Lockheed Missiles and Space Company in Sunnyvale, California; and the windows were manufactured by Corning Glass Company in Corning, New York. A complete list of primary space shuttle systems is shown in Table 1.
Table 1 – Space Shuttle Systems
Figure 4. – Space Shuttle Solid Rocket Booster (SRB) field joint and redesign post-Challenger8.
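As a quick consistency check on the propulsion figures quoted above, total thrust at liftoff can be estimated from the numbers in the text (a back-of-the-envelope sketch; actual thrust varied with throttle setting and altitude):

$$T_{\text{liftoff}} \approx \underbrace{2 \times 3.3 \times 10^{6}}_{\text{SRBs}} + \underbrace{3 \times 3.75 \times 10^{5}}_{\text{SSMEs at sea level}} \approx 7.7 \times 10^{6}\ \text{lbf}$$

By these figures, the two solids supplied roughly 85% of the thrust at liftoff, which is why the integrity of the SRB field joints shown in Figure 4 was so critical during the first two minutes of every mission.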
The engineers, scientists, and program managers who helped develop the thousands of components, subsystems, and systems that comprised the space shuttle used a reductionist approach to functionally decompose the problem and systems engineering principles to relate the elements and to predict the integrated behavior as if it were a “complicated,” deterministic problem as opposed to a “complex” problem10. Complicated systems are characterized by having many components or parts, but these parts are usually well organized and follow clear cause-and-effect relationships. In complicated systems, the relationships between the components are deterministic, meaning the behavior or performance of the system can often be understood and predicted beforehand (a priori) by breaking it down into its constituent parts. Complex systems, on the other hand, involve a large number of interconnected components, and the interactions between these components are often nonlinear and may exhibit emergent behavior that is greater than the mere sum of the parts. In complex systems, the relationships between components are not easily predictable, and the system as a whole can exhibit behaviors that are not immediately apparent from the characteristics of its individual parts. In layman’s terms, it is impossible to predict exactly how the system will behave a priori and when and if failures will occur (a toy illustration of this distinction follows Table 2 below). The construction of knowledge and subsequent reduction of knowledge gaps for each complex, multidisciplinary subsystem, such as the thermal protection system (TPS) or cryogenic tanks, followed a rigorous analysis-experiment-design process and “building-block” approach to validate understanding and to predict performance and failure (this will be explained more fully in Chapter 5). This building-block approach allows a system to be studied at a basic level and at increasing levels of complexity as the system is built up, to understand the interactions that occur among the constituent parts and to represent a realistic embodiment of the technology more closely. Failure, testing to failure, and learning from failure were a fundamental part of the “research engineering” culture at NASA and are critical elements in the construction-of-knowledge process. The skills required to identify critical failure mechanisms, construct a building-block experiment/analysis approach to understand fundamental, physics-based behaviors, and predict complex, system-level performance up to and including failure were carefully nurtured and mentored. As funding for actual research dried up in the ’80s and ’90s, NASA’s flight centers stopped collaborating with the research centers and began to take a competitive stance with each other, pointing fingers after each accident (Challenger and Columbia). They hid data from one another and angled to prove
that their solution was the best, even when the data didn’t support it, in hopes of receiving more funding and prestige than the other centers. Another piece of the problem, from an organizational standpoint, was that NASA had transitioned from a highly centralized program management structure during Apollo to a “Lead Center” approach where, for example, JSC was in charge of the entire shuttle program; however, MSFC was in charge of most of the key elements of the entire system (ET, SSMEs, and SRBs). There were major technology integration and communication gaps, such as the liberation of ET foam (an MSFC responsibility) during launch, which struck the Orbiter vehicle, critically damaged its RCC wing leading edge (a JSC responsibility), and caused the Columbia disaster. JSC was quick to point fingers at MSFC when clearly the ownership and leadership of the system integration of the entire vehicle resided in-house at JSC.

Linking the Cultural, Organizational, and Social Causes of Challenger and Columbia

Diane Vaughan and other social and behavioral scientists identified numerous causes for the Challenger and Columbia accidents1,2,11. In my research post-Columbia, I noted more than 40 terms and phrases used by these scientists to define the actions and behaviors they claim to have caused these and similar accidents in other high-risk/high-reliability organizations (see Table 2). I have added behaviors that I also experienced as a member of the research and design team responsible for shuttle technologies, as an astronaut who flew and operated the shuttle hardware and interacted with the safety and Shuttle Program managers and flight directors, and as a senior advisor and member of the newly formed NASA Engineering and Safety Center (NESC).
Table 2 – Influences on Decision-Making Leading to the Columbia Disaster
1 – Expressions used by Diane Vaughan in “The Challenger Launch Decision”
2 – Expressions used by Michael Roberto in “Lessons from Everest – The Interaction of Cognitive Bias, Psychological Safety, and System Complexity”
3 – Expressions used by Henry Petroski in “Design Paradigms – Case Histories of Error and Judgment in Engineering”
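To make the complicated-versus-complex distinction described earlier concrete, the following minimal sketch (illustrative only; it models no actual shuttle subsystem, and the function names are invented for this example) contrasts a linear system, in which a small initial uncertainty dies out and behavior is predictable a priori, with a nonlinear one, in which the same uncertainty grows until prediction fails:

    # Toy contrast between a "complicated" (linear, predictable) system and a
    # "complex" (nonlinear, chaotic) one. Illustrative only.

    def linear_step(x):
        # Linear rule: differences between nearby states shrink every step.
        return 0.9 * x + 0.05

    def logistic_step(x, r=4.0):
        # Nonlinear rule: the logistic map is chaotic at r = 4, so differences
        # between nearby states roughly double every step.
        return r * x * (1.0 - x)

    def final_gap(step, x0, eps=1e-9, n=40):
        # Propagate two nearly identical initial states for n steps and report
        # how far apart they end up.
        a, b = x0, x0 + eps
        for _ in range(n):
            a, b = step(a), step(b)
        return abs(a - b)

    print("linear   :", final_gap(linear_step, 0.3))    # ~1e-11: the gap decays
    print("nonlinear:", final_gap(logistic_step, 0.3))  # order 0.1 to 1: the gap explodes

The component-level rules are equally simple in both cases; only the nonlinearity differs. This is the sense in which functional decomposition alone cannot certify a complex vehicle: the behavior that matters emerges at the system level and cannot be read off the parts.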
External Causes of the Tragedies – History, Schedule, Politics, and Budget

Several external causes have been presented that are related to both the Challenger and Columbia tragedies: NASA’s history and the agency’s struggle to gain the political support needed to maintain budgets necessary to complete the development of the shuttle vehicle; the tendency to oversell the technical and operational maturity of programs to Congress and to low-ball cost estimates; and schedule and budget pressures. These external causes highlight the framework by which complex, high-technology programs such as the space shuttle are constrained and limited in their methods of design and operation. Many of the historical accounts describing the shuttle’s development and the sociopolitical climate surrounding its genesis can be found in the CAIB report and in an excellent collection of papers addressing the causes of such accidents and ideas for their prevention11. The CAIB was one of the first groups to lay a proportionate share of the blame on the Executive and Legislative Branches of the U.S. government for schedule and budget pressure and for not allocating sufficient resources to NASA to ensure safe operation of the shuttle. However, in my mind, even if the funding were available, it would not have been properly spent to resolve the key technical and organizational issues—in Diane Vaughan’s words, “In a complete break from accident investigation tradition, responsibility for the
accident would extend beyond middle managers to Congressional leaders, the White House, and NASA Administrators”12. The CAIB recognized the role of the White House on January 5, 1972, when President Nixon announced that the shuttle would be: “designed to help transform the space frontier of the 1970s into familiar territory, easily accessible for human endeavor in the 1980s and 1990s. This system will center on a space vehicle that can shuttle repeatedly from Earth to orbit and back. It will revolutionize transportation into near space by routinizing it” [emphasis added]1,3.

Pre-Challenger: The Space Shuttle was sold to Congress and the American public as a robust, operational vehicle that would fly somewhere between 50 and 100 launches per year and reduce the cost per pound of placing payloads into low Earth orbit by orders of magnitude. NASA’s effective sales pitch of the Shuttle as an “operational” vehicle after only four flights (post Space Transportation System-4 (STS-4)) would continue to haunt the program and ingrain a set of expectations that could not be met. On July 4, 1982, Columbia landed at Edwards AFB, where President Ronald Reagan declared that Columbia and her sister ships would be “fully operational, ready to provide economical and routine access to space for scientific exploration, commercial ventures, and tasks related to the national security” [emphasis added]1. This declaration of the shuttle as an operational vehicle led the agency down a dangerous path where it would not be easy to explain away the myriad technical anomalies and costly delays that were soon to occur in the early life of the vehicle. To admit that these anomalies were the result of uncertainties in analysis, testing, and physical understanding would be to admit the experimental nature of the program and possibly lead to its premature demise. Hence, the engineering communities at NASA willing to raise concerns about anomalies were slowly being silenced by the much more powerful program offices. In addition, management professor Michael Roberto contends that cognitive biases13,14 play a very important role in clouding judgment and allowing a predisposition to “downplay the possibility of ambiguous threats” and “maintain a stubborn attachment to existing beliefs”15. For example, the belief that the shuttle was operational after only four flights made it cognitively as well as politically very difficult to emphasize problems with critical systems like the thermal protection system (TPS) and solid rocket boosters (SRBs) as they occurred. This also led to a tendency of working groups and individuals to “normalize deviance” and develop meaningless terms like “in-family” and
“accepted risk” to describe anomalies and rationalize continued flight in spite of such deviations2. In addition, pressure to reduce costs and turnaround time to achieve the predicted performance of the shuttle vehicle heightened the importance of schedule and budget and played an important role in determining how critical decisions were made. Diane Vaughan describes this “culture of production” as “analogous to the forces of the political/economic environment: the ideologies of professional engineering and historic shifts in policy decisions of Congress and the White House at the start of the Shuttle Program combined to produce in NASA the capitalistic conditions of competition and scarcity associated with corporate crime”12. It is a culture that emphasizes a business-oriented, compartmentalized subdivision of labor and a set of practices and rules to improve efficiency and meet cost and schedule demands. This subdivision of labor and tasks and the highly bureaucratic organizational structure for solving complex problems impede the flow of critical information and help create what Diane Vaughan refers to as “structural secrecy”2. In addition to cognitive biases, which tend to lead us to stay the course, Joseph Lorenzo Hall describes NASA as a “path dependent” organization that tends to make decisions based on, and have its present state defined by, its history16. As such, NASA, established in 1958, had a clear mandate and a blank check to place a man on the moon before the Soviets, in line with President John F. Kennedy’s vision as articulated in his famous “man on the moon” speech of 1961. However, at the end of the Apollo era, NASA experienced substantial budget cuts, which emphasized the culture of production and helped legitimize the normalization of deviance necessary to meet launch schedule pressures. In an effort to meet funding requirements, NASA had to join forces with the Air Force, which resulted in significant compromises to the configuration, design, payload requirements, performance requirements, etc. The CAIB report concluded that this “technically ambitious design resulted in an inherently vulnerable vehicle, the safe operation of which exceeded NASA’s organizational capabilities as they existed at the time of the Columbia accident.”1 In addition, cost ceiling limitations imposed by the Office of Management and Budget (OMB) led NASA to make tradeoffs that produced near-term savings but much larger operational costs incurred throughout the vehicle’s life. For example, the selection of a fragile, reusable ceramic tile thermal protection system (TPS) resulted in over 30,000 uniquely sized insulation blocks, which had to be meticulously bonded to a strain isolator pad (SIP), which in turn was bonded to the aluminum structure of the Orbiter to allow differential thermal growth without cracking the tiles during entry. The brittle and fragile nature of this
Charles Camarda
TPS resulted in an average of about 30,000 person-hours of refurbishment and repair in the Orbiter Processing Facility (OPF) at Kennedy Space Center (KSC) after every mission to repair damaged tiles and to re-waterproof tiles located in the highly heated regions on the belly of the vehicle.

Pre-Columbia: Schedule and budget pressures translated into launch pressures and continued to play a very important part in the post-Challenger environment, leading to the Columbia tragedy. The linking of the space shuttle to the construction of the International Space Station (ISS) complicated schedules and the timing of specific payloads and manifests1. Because of this interdependence, a common occurrence in complex systems, the completion of the ISS played a very important role in increasing schedule pressures to meet formal agreements with NASA’s international partners (e.g., the push toward the “Core Complete” configuration with the addition of Node 2, which signaled the ability to add international partner segments such as the Japanese Experimental Module (JEM) and the Columbus Orbital Facility (COF)). As stated in the CAIB, by November 2002: “With the International Space Station assembly half complete, the Station and Shuttle programs had become irreversibly linked. Any problems with or perturbations to the planned schedule of one program reverberated through both programs. For the Shuttle Program, this meant that the conduct of all missions, even non-Station missions like STS-107, would have an impact on the Node 2 launch date.”1

Budget pressures increased prior to Columbia with NASA Administrator Dan Goldin’s policies toward “faster, better, cheaper” (FBC) projects; the reduction of research and development (R&D) spending; and the reduction of manpower on the SSP to free up people to complete the ISS and to work on his preferred project, human exploration of Mars. In 1995, the “Kraft Report” was published, and the space shuttle was once again viewed as “a mature and reliable system…” by many top advocates and managers.17 This view helped lead to the privatization of the space shuttle as a cost-cutting measure and the resulting creation of a contractor-run program (the Space Flight Operations Contract (SFOC)) under the United Space Alliance (USA). This resulted in NASA engineers adopting an “oversight” role instead of an “insight” role with respect to the lead contractor to ensure the safe operation of the space shuttle. This privatization also cut layers of oversight and safety personnel to streamline resources and minimize cost. In November 2001, the White House chose Sean O’Keefe, a “bean counter” (Deputy Director of the White House Office of Management and Budget (OMB)) without real technical depth, as
NASA Administrator. “His appointment was an explicit acknowledgment by the new Bush Administration that NASA’s primary problems were managerial and financial.”1 This is very reminiscent of the claims made by Boeing leadership regarding both 737 MAX disasters.

Internal Causes of the Tragedies - Cultural, Organizational, Social, and Behavioral

The behavioral, cultural, and social dimensions of two technical issues, each of which led to tragedy (Challenger and Columbia), are compared and shown to exhibit striking similarities: the Shuttle SRB joint/O-ring anomaly and the ET bi-pod foam debris anomaly. It will be shown how the same negative behaviors and culture that led to the first anomaly and accident were still in existence and responsible for the Columbia accident.

Normalization of Deviance:

Pre-Challenger: The term “Normalization of Deviance” became one of several household terms post-Columbia. Diane Vaughan, in her book “The Challenger Launch Decision,” uses this term to describe behavior that an individual, “work group,” or organization first identifies as technically deviant, subsequently reinterprets as within the norm for acceptable performance, and finally labels officially as “acceptable risk”2. This behavior may also be related, as mentioned above, to others listed in Table 2, such as arrogance and overconfidence, and can help inculcate a belief system that skews the meaning of past successes and degrades the ability to accurately predict future probabilities. Using such a lens to view deviant behavior as “normative” could also affect the trending of results to confirm a predisposed position or stance. This trend spawns related jargon such as “in-family” to imply an increased familiarity or knowledge about such behaviors, thus diminishing cause for concern and downplaying or discounting risk14. “In-family” and “acceptable risk” are terms that “working groups” invent as they create the norms, processes, and procedures that form the culture of their groups and help the program office develop “flight rationale.” From reference 2 (pgs. 62 & 63): “A five-step decision-making sequence in which the technical deviation of the SRB joint from performance predictions was redefined as an “acceptable risk” in official decisions. This sequence was repeated, becoming a pattern. Each time tests or flight experience produced anomalies that were signals of potential danger, the
risk of the SRBs was negotiated between working engineers at NASA and Thiokol. Each time, the outcome was to accept the risk of the SRBs.” The dominant belief, grounded in a three-factor technical rationale, was that the SRB joints were an acceptable risk; therefore, it was “safe to fly.” It was shown how this construction of risk was not based on peer review and how the testing was inadequate, which led to a reliance on the Thiokol design and on the prior successful testing and reliability of the Titan III solid rocket motor design2. In addition, the Titan III design used only one O-ring, and the SRBs used two and were, thus, thought to be a redundant design. However, the designs were decidedly different (longer tang, reuse of the SRBs and subsequent concerns about out-of-round casings, etc.), and nonlinear structural deformations of the Shuttle SRBs would cause an opening of the joint and an increase in the gap the O-rings had to seal8. Since none of the tests accurately simulated the true structural flight response of the joint, the critical failure mechanism was not known, nor was it properly simulated during testing. The Challenger disaster was not caused by the lack of resilience of the O-ring material at cold temperatures, as theatrically demonstrated by renowned physicist Richard Feynman, who dunked a piece of O-ring into a glass of ice water during the Rogers Commission, but rather by a faulty field joint design and inadequate structural modeling. The Mark Salita model used to predict the actual behavior of the SRB joint was severely lacking, just as the Crater model used to predict ET foam impact damage was during the Columbia accident18. Both failures occurred because of a lack of connection to real researchers within NASA: a lack of research culture.

The three-factor rationale that the O-Ring Work Group had devised to justify launching in the face of clear deviation from nominal performance was: 1) a safety margin, 2) the “experience base,” and 3) the self-limiting nature of the O-ring problem. In many ways, this same “work group culture” produced a similar rationale for continuing to fly RCC leading edges with clear anomalies, as will be shown in Chapter 4. This pattern indicated the existence of a “work group culture” (described more fully in a later section) in which the managers and engineers working closely on the SRB problems constructed beliefs and procedural responses that became routinized. The sealing capability of the SRB joint was in violation of three industry standards: 1) it sealed by extrusion (joint rotation caused a larger-than-normal gap, and the pressure extruded the O-ring into the gap) and not by compression of the O-ring; 2) the gap size was too large; and 3) the initial compression level was too low. The tests conducted to rationalize the suitability of the design did not adequately simulate the deformation of the
joint under realistic operating conditions; hence, the data generated to satisfy the “safety margin” component of the work group’s three-factor flight rationale was never verified. They did not follow a research culture approach to validate new knowledge or unexplained behavior. Much of the experience base argument of the second factor relied on Titan launches with a very different joint (a reliance on prior successes). In addition, incrementally accepting larger and larger anomalies while relying on limited past successes is not a sound safety rationale. It will be shown in a later section that a similarly careful examination of the flight rationale for RCC leading edges found it flawed and lacking sound basis. Yet NASA continued to fly eight additional Shuttle missions before changing out discrepant wing panels. The O-Ring Work Group’s faulty three-factor rationale for accepting risk was conveyed up the hierarchy and institutionalized as a legitimate means for the construction of risk (Ref. 2, pg. 120). In January 1985, STS-51C experienced the worst case of erosion to date: blow-by of hot motor gases past the primary O-rings occurred and blackened the grease between the O-rings. The team was very concerned because, in addition to the impingement erosion, there was now blow-by erosion, which removes O-ring material at a much faster rate. This launch had also experienced very cold temperatures pre-flight. These apparent “strong signals” were discounted as weak and later viewed as mixed signals after several subsequent successful flights. This became a standard or routinized way to discount risk and strive to establish flight rationale.

Pre-Columbia: The Space Shuttle Orbiter was never designed to withstand debris impacts from the ET foam. ET foam debris liberation, like poor SRB joint sealing capability, was a violation of design specifications (deviant behavior). These anomalies were both signals of potential danger that were accepted after initial analyses incorrectly concluded that the deviations could be tolerated. Based on a limited understanding of the potential failure mechanism and insufficient analysis and test correlation, the program accepted rather than eliminated the technical deviations and set in place, once again, a precedent for “accepting risk” and normalizing deviance. In the case of the ET bi-pod foam liberation, it was determined that there were 12 prior incidents of foam liberation from the left bi-pod (occurring on 10% of all flights for which imagery of the ET bi-pod was available).1 Much later (~2007), it was discovered, after additional data mining, that there had been a total of 28 incidents of bi-pod foam liberation prior
to Columbia. In addition, during the launch of STS-112 on October 22, 2002, a flight that preceded Columbia (STS-107) by only one intervening mission, a large piece of bi-pod ramp foam hit and damaged the ET Attachment Ring on the SRB skirt. This apparent “wake-up call,” a precursor signaling an impending tragedy (similar to the STS-51C blow-by erosion of the primary O-ring described above), was not heeded. In both cases, there was not enough “data” to verify the correlation of cold temperatures to O-ring performance and/or poor joint behavior, nor was there enough data to verify there was a problem with larger and larger pieces of foam loss and subsequent hits to the Orbiter. Yet a program to investigate the danger of very large pieces of foam debris striking the Orbiter was never initiated by the Space Shuttle Program Office (SSPO). In my humble opinion, both of these deficient responses to critical anomalies would have been avoided if JSC and MSFC had maintained the strong research culture and technical leadership they had during the early Apollo years and their networked connection to their Research Centers! In fact, the collection of data by analysis and test was not even initiated. The Crater model had been used to correlate damage from very small pieces of foam striking Shuttle tiles (~3 cu. in.), pieces much smaller than the very large pieces of foam that were liberated from the ET bi-pods (~600-1300 cu. in.), and not once was it suggested that such debris could cause catastrophic damage to the RCC leading edges. The Crater model was really nothing more than a “curve-fit” of only 50 small pieces of ET foam striking shuttle tiles at various angles of incidence and speed. Notice from Figure 5 that while the threshold curve provides a lower bound between the damage and no-damage regions, there is much scatter in the data, and the equation in Figure 5 bears no real relation to the “physics” of the problem!18-20 Not once did the key technical work groups investigate the effect of much larger foam debris strikes on TPS tiles or RCC leading edges. No one ever asked the “what-if” questions to ascertain the safety of successive flights. Once again, the work group culture normalized the deviant behavior and rationalized the successive, incrementally more serious incidents as “acceptable.” Never once did they reach out to the research engineers at Langley or Glenn, who had the requisite knowledge. “With each successful landing, it appeared that NASA engineers and managers increasingly regarded the foam-shedding as inevitable, and as either unlikely to jeopardize safety or simply an acceptable risk.”1 What was truly appalling were entry Flight Director Leroy Cain’s statements during an ABC interview, where he appeared to question Rodney Rocha’s failure to speak up and then asserted, “I fully believe that every single person in the organization was in agreement, that we did not have a safety of flight issue.”21 How arrogant, how misguided, how wrong! There were smart people who understood the severity of the problem, but they were never contacted. “The distinction between foam loss and debris events also appeared to have become blurred.”1
Figure 5 – Theoretical damage/no-damage transition curve equation with results of computations (blue), all impacts into Shuttle thermal tiles contained in the database (black)17,18, and the Columbia investigation tests (red)18.
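A rough back-of-the-envelope check, using only the approximate volumes quoted above (my arithmetic, not a figure from the investigation), shows the scale of the extrapolation being asked of this curve-fit:

$$\frac{V_{\text{bi-pod debris}}}{V_{\text{calibration}}} \approx \frac{600\ \text{to}\ 1300\ \text{in}^3}{3\ \text{in}^3} \approx 200\ \text{to}\ 430$$

In other words, Crater was being applied to foam pieces roughly two orders of magnitude larger in volume than anything in its empirical database, consistent with the roughly 400-fold volume extrapolation noted later in this chapter.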
Work Group Culture: A work group is described as the people in an organization who interact because they have a common central task2. The difference between a work group, a team, and a high-performing team (HPT) will be discussed in Chapter 6. In their interaction, work groups create norms, beliefs, and procedures that are unique to their particular tasks. In the case of the anomaly related to the SRB O-rings, an O-Ring Work Group was formed to study the problem and report back to SSP management at appropriate board meetings at selected times leading up to and including the Flight Readiness Review (FRR). In the case of the anomaly related to ET foam debris, however, several working groups were necessary to address the mechanism for foam loss; the aerodynamic transport of the foam, to determine where it would strike the Orbiter; and the impact and eventual damage to various sensitive portions of the Orbiter. In particular, the Shuttle tile work group and the Leading Edge Structural Subsystem-Problem Resolution Team (LESS-PRT) were the two leading groups that would eventually have to evaluate the danger of debris strikes from the ET foam to the fragile TPS systems and assess the risk to the crew and vehicle. A Damage Assessment Team (DAT) was formed in real-time following the bi-pod foam hit during the launch of Columbia, while the crew was in orbit. This team did not have a clear charter and
mandate, and it had only loose ties and reporting requirements to established program boards and the Mission Management Team (MMT). The DAT: “suffered in both structure and positioning. The DAT was an ad hoc group with poorly defined lines of authority and limited access to resources within the NASA hierarchy.”15 The DAT was composed of members of several of the other work groups, such as the LESS-PRT, and, hence, tended to follow established norms and procedures already adopted by those groups. The drops and misses in communication from such teams will be explained in the section on structural secrecy.

Pre-Challenger: Diane Vaughan describes a process that develops a culture, “production of culture,” which begins at the lowest level of the organization within the technical work group2. In the case of the SRB work group, Marshall (MSFC) engineers worked side-by-side with Thiokol engineers in an adversarial/cooperative relationship – the civil servant or government engineers had to be the “bad news guys” on the one hand, finding fault with the contractor’s (Thiokol) analyses and tests, and yet had to cooperate with the contractor toward a common goal, the launch of Shuttle. On the government side of the house, there was a natural adversarial relationship between the systems and engineering (S&E)/engineering side of the house and the program management side. These internal and external relationships were also present pre-Columbia between JSC and its contractors and exist today post-Columbia. As explained above in the “Normalization of Deviance” section, the O-Ring Work Group developed a routine system for accepting risk once a hazard was identified that could not be eliminated or controlled. This systematic process, defined by Diane Vaughan as “construction of risk,” followed six distinct phases in the case of the O-ring: 1) a redundant system – as opposed to the single O-ring used on the Titan, the SRB had two O-rings and was, hence, redundant, and no problems were expected; 2) signals of potential danger – the actual performance of the SRB joints was found to deviate from design predictions, creating uncertainty (the joint rotated and created a gap between the tang and clevis); 3) acknowledgement of escalated risk – verified by the establishment of resources to understand the problem better; 4) review of the evidence – using the principles of “scientific positivism,” where quantitative data is used to resolve a dispute; 5) official act indicating the normalization of deviance – the construction of risk, which leads to the formal acceptance of risk and development of “flight rationale”; and 6) Shuttle launch – the completion of the next and successive launches further solidifies the “rationale” and established processes and
establishes a “work group culture” that becomes ingrained within an already complex and insular institutional process.

One thing we noticed when looking at the technical work groups for human spaceflight was that once the shuttle was deemed operational, there was less and less participation and involvement from outside organizations. Research and development (R&D) Centers like LaRC, ARC, and GRC played less and less of a role in the analysis and testing of hardware and support for the shuttle. The membership of the shuttle work groups took a more operations-oriented view of their respective technologies, more of an oversight versus an insight role into what the prime shuttle contractors were doing, and began to slide into what Michael Roberto calls a “confirmatory response” mode, where the natural human tendency is to maintain the status quo and to seek out confirming cues or data that provide or add to the rationale for flight14. Couple this confirmation bias with a biased social reward system (Table 2), which rewards the behavior of the work group and provides a logic that sustains the flight schedule, and you have a self-perpetuating system that incrementally accepts higher and higher risk as it inches toward an inevitable disaster. This example once again illustrates how natural cognitive biases, behavioral responses, and an evolved work group culture can combine to produce negative outcomes. There are, however, positive corrective actions that can be taken at select times in the decision-making process, which can help prevent being trapped in this circular reasoning. However, NASA failed miserably in adopting changes that could have helped prevent recurring tragedies and/or guide the agency back to its research roots.22

Pre-Columbia: A very similar “work group” culture developed within the TPS community at JSC, Boeing, USA, and Lockheed. Because of the distinct material, thermal, and structural differences between the types of TPS, there were two distinct work groups: a Tile Work Group, which focused on the ceramic tiles, and the LESS-PRT, which was responsible for all RCC components (nose cap and wing leading edges). Prior to the privatization of the Space Shuttle in October 1997 and the award of the Space Flight Operations Contract (SFOC) to USA, most of the technical work groups were led by NASA employees. After privatization, the leadership changed hands, and NASA maintained an “oversight” role as opposed to an “insight” role, with technical leadership transferring to the contractor. JSC was responsible for the entire SSP, and Space Shuttle Program Managers held an inordinate amount of power and
even wielded some technical leadership. This significant imbalance of power between Shuttle Program Managers and the Engineering Directorate at JSC solidified the subordinate role of the engineers to the more powerful “elites,” as described in reference 2 and in a later section entitled “Sub-Cultures of Johnson Space Center (JSC).” The contractor-led work groups, such as the LESS-PRT, were under constant pressure to provide program management with flight rationale. Similar to the O-Ring Work Group, the LESS-PRT developed its own set of norms, rules, and procedures for inspecting, refurbishing, and certifying the RCC components as ready for flight. The unique nature of the RCC material and its oxidation protection system meant there was only a small, select group of researchers and engineers across the country with experience in this material system and the Earth-entry heating environment it was designed to withstand. After more than 25 years of experience working with RCC on the SSP, it is understandable why the LESS-PRT viewed itself as the only truly knowledgeable group capable of understanding the behavior of the RCC components, even though much of the actual research on material performance and oxidation was done at NASA’s Research Centers and elsewhere. Several other social and behavioral influences contributed to the work-group culture of the LESS-PRT: 1) the subordinated position of the engineers to the program managers/“elites” and their reliance on them for money, resources, facility support, and technical direction muted their ability to speak up and/or take a stand on key technical issues; 2) the operations/sustaining-engineering orientation of the JSC engineers, having only an “oversight” role instead of a leadership role, tended to evolve more of a “confirmatory” response, which would help provide flight rationale; 3) a reward system that favored engineers who aided the program in developing flight rationale and minimizing cost overruns and schedule delays (see the later section entitled “Biased Reward System”); and 4) a reliance on a perceived “can do”/“failure is not an option” Apollo culture and history, and an arrogance that only the human spaceflight Centers understood engineering related to the Shuttle Program, caused the LESS-PRT culture to become very isolated and insular and very defensive toward external peer review and criticism. In addition, the team’s strong ties to and acceptance from the program’s “elites,” earned through a past track record of working “with” the program to provide “flight rationale,” gave the team an inordinate amount of power, which, in turn, led to arrogant behavior, as will be explained later.

When STS-45 returned from space in 1992 with a significant gouge (1.9 in. x 1.6 in.) on the upper surface of RCC leading edge panel 10-right
(10R), the LESS-PRT went to LaRC to seek help in understanding what could have caused such damage: a hypervelocity MMOD hit, a ballistic-velocity debris hit, etc. Research engineers at LaRC were asked at that time by JSC engineers to conduct several low-speed impact tests, and it was never really decided whether the damage was caused by a low-speed impact from a man-made object during ascent or an MMOD strike on orbit. The testing and analysis were minimal, and there was never a plan to develop a physics-based understanding of debris impacts to RCC. In fact, the makeup of the LESS-PRT was predominantly materials, thermal, and aerothermal expertise, with minimal structures, thermal structures, or impact representation. Had the debris hit the lower surface of the leading edge on panel 10R, it would probably have resulted in a catastrophe, based on what we have learned post-Columbia. It is interesting to note that while the LESS-PRT recognized the intolerance of RCC to impacts by hard, dense objects, there was never any interest in pursuing impact testing of ET foam debris strikes to RCC. Even when the size of the foam debris grew to hundreds of times the size being evaluated for impact strikes to the “more fragile” TPS tiles, the LESS-PRT did not ask the “what if” questions and seek out analysis and/or tests to understand the damage tolerance of RCC to such impacts. The LESS-PRT lacked the “thirst for knowledge” of a true research team. In 2000, STS-103 returned with a SiC coating chip missing from panel 8L; in 2001, STS-102 returned with another SiC coating chip missing from panel 10L. In both cases, the coating chips were liberated from approximately the same region (the slip-side edge along the “joggle/step gap” region on the lower surface close to the stagnation region), and it was originally thought that debris impacts could have been the cause of these coating losses. However, the LESS-PRT did not pursue an analysis and/or test program to understand what caused the SiC coating chip losses from these critical panels, nor did they try to understand the impact damage tolerance of the RCC. This incident will be explained in detail in Chapter 4. An e-mail from Mike Stoner (Ref. 1, pg. 141 (fig. 6)) indicated that Mike Gordon (a Boeing employee, RCC Sub-System Manager (SSM), and leader of the LESS-PRT) concurred regarding the foam debris strike: “It didn’t look like a big enough piece to pose any serious threat to the system. At T+81 seconds the piece wouldn’t have had enough energy to create a large damage to the RCC WLE system. Plus, they have analysis that says they have a single mission safe re-entry in case of impact that penetrates the system.” There was NO data or analysis to support such an assertion, and furthermore, without knowledge of the severity of impact damage, it would have been impossible to
claim that the vehicle could survive entry! As the then-leader of the LESS-PRT and RCC SSM, Mike Gordon had no expertise in the structural or ballistic impact behavior of foam and RCC. He did not question the consequences or conduct any analysis and/or test to verify his reasoning. This shoot-from-the-hip, knee-jerk response to a serious, complex technical problem by someone who is considered an “expert” in RCC was and is unconscionable! This know-it-all attitude from an individual and a technical work group evolves over time in an environment of NASA’s “can do”/“failure-is-not-an-option”/invincible culture, a culture of arrogance, a defensiveness toward honest criticism and peer review, and a teamwork/consensus proclivity for confirming the status quo. After being selected as an astronaut in 1996 and after learning of these coating losses, I asked Don Curry (the then NASA leader of the LESS-PRT) if anyone had considered thermal stress as a cause of the RCC coating liberation, and he said no. To my knowledge, the LESS-PRT never conducted any thermal structural analyses, nor did they pursue finding the root cause of the significant RCC coating degradation post STS-102/STS-103. I asked him if he would like my Thermal Structures Branch at NASA Langley to take a look, and he declined the offer! It is also interesting to note that after the imagery enhancement and debris transport analysis indicated that the ET bi-pod foam could have impacted the left RCC wing undersurface (around panels 8L to 10L), the LESS-PRT: 1) never made a formal request to the MMT for imagery; 2) never conducted impact analysis or testing to understand the damage tolerance of the RCC wing to foam; 3) did not contact any external NASA Center, DOD, National Lab, academia, etc. to solicit help (they did not even contact NASA LaRC, whom they had previously called on to understand what caused impact damage on STS-45); 4) did not stand up and publicly support Rodney Rocha and his request for imagery or even approach the Engineering Director of JSC to solicit support; and 5) were silent when a person who was not a technical expert in RCC, Cal Schomburg, stood up and convinced program managers the strike would not be a problem for the RCC leading edges. In fact, the key leaders of the LESS-PRT knew full well what Cal Schomburg’s position was and were in agreement, as shown in the e-mail by Mike Stoner dated January 17, 2003, on the next page (Ref. 1, pg. 141 (fig. 6)).
Figure 6 – Email from Leading Edge Structural Subsystem-Problem Resolution Team (LESS-PRT) leadership agreeing with Calvin Schomburg’s position that the RCC is resilient and that the ET bi-pod foam debris strike would not be an issue.
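A simple kinetic energy estimate shows how unfounded the “not enough energy” assertion in this e-mail was. Using the approximate figures later established during the investigation (a foam block of roughly 1.67 lb striking at about 777 ft/s relative to the Orbiter, the conditions of the post-accident impact test; these numbers were, of course, not available to the LESS-PRT in real time, but comparable estimates could have been made):

$$E_k = \tfrac{1}{2}mv^2 \approx \tfrac{1}{2}\left(\frac{1.67\ \text{lb}}{32.2\ \text{ft/s}^2}\right)\left(777\ \text{ft/s}\right)^2 \approx 1.6\times 10^4\ \text{ft}\cdot\text{lbf} \approx 21\ \text{kJ}$$

That is roughly the muzzle energy of a .50-caliber round, hardly a negligible load for a brittle RCC panel; a minute of arithmetic, had anyone performed it, would have contradicted the claim.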
It should also be noted that while we have concentrated on the LESS-PRT work group because of this group’s role in the post-Columbia RCC panel 8R anomaly (to be discussed in Chapter 4), the Tile Work Group also performed inadequately pre-Columbia. Their failure to recognize the obvious inability to accurately predict foam impact damage to tile using Crater (which required extrapolating 400 times beyond the calibrated foam-particle volume); their inability to forcefully convey the likely burn-through of impact-damaged tile up the chain of command; and their silence when people like Cal Schomburg claimed that, even though Crater predicted damage to the aluminum structure, the 0.1-in.-thick inner densified layer of tile would still be intact were not defensible either. While several extenuating circumstances will be shown to have impeded the ability of these work groups to successfully diagnose and prevent a tragedy, it is important to realize that in a healthy organization, where dissenting opinions are not only tolerated but encouraged and communication up and down the hierarchy flows unrestricted, the two prior tragedies could have been avoided22. It will also be noted that if proper authority for diagnosing the nature of these anomalies had been given to the research centers and properly
integrated teams formed, these problems could have been properly diagnosed and fixed within the recovery window to prevent tragedy.

Operations versus Research Culture: As explained above, the longer the shuttle continued to fly, the more the idea of the shuttle as an “operational” vehicle became accepted and permeated all organizations at all levels of management at the Johnson Space Center (JSC). The program management leadership became more and more distant from real hands-on engineering knowledge and leaned heavily toward the operations and management side of the organization. In fact, the leadership at NASA HQ during Administrator Dan Goldin’s tenure treated management skill as paramount and saw less need for technical experience to lead a predominantly technical organization or program. Program managers tended to be selected from operations- and process-related jobs at KSC, JSC, and MSFC. The external factors mentioned above placed more and more emphasis on schedule and budget and, hence, emboldened a “Production Culture” focused on efficiency and near-term cost cutting (e.g., the selection of Sean O’Keefe as NASA Administrator), to the point that the top-level program managers at JSC, pre-Columbia, had even less experience as engineers than those at MSFC prior to Challenger. This substitution of operations expertise for rigorous engineering training in top Shuttle management positions, and the selection of Flight Directors, such as Linda Ham, to lead the Shuttle Mission Management Team and fill other top positions throughout JSC and the SSP, resulted in a dangerous trend that once again legitimized a work group culture that normalized deviance and provided rationale for the flight of STS-107. In addition, there was no appreciation for the complexity of the technical problems and the need to defer, if need be, to the lowest member of the organization to obtain the technical expertise required to adequately describe and understand a problem. This tendency to look for the “watered down” (simplified) version of the problem, a low tolerance for lengthy technical explanations, a lack of critical thinking, and a lack of deference to technical expertise resulted in the upward transmission of key information without proper attention being paid to the key simplifying assumptions; the errors inherent in the analyses and tests; and the probability, risk, uncertainty, and sensitivity analyses. Subjective terminology and jargon such as “conservative” and “worst-on-worst” (used to denote the worst of several worst-case load conditions applied all together, supposedly resulting in a very conservative design or condition) were often used incorrectly to create an illusion of safety, reliability, and so forth.
On one occasion, when I was Director of Engineering at JSC after my flight (post-Columbia), I had to have one of my engineers, Doug Drewry, step down from presenting to Shuttle Program lead Wayne Hale after Wayne became frustrated because he was unable to understand what Doug was saying and began to shout and berate him in public. Wayne called a time-out to the meeting, and when we stepped outside the conference room, he began to shout at me and point his finger in my face. When Wayne saw he could not intimidate me, and I called his bluff and told him I was going to go back into that meeting and explain to everyone on the teleconference what the real issues were, he backed down. It was obvious Wayne did not understand structural mechanics. What he said to me after the meeting was even more disturbing; he wanted me to give him a one-hour presentation on structural mechanics. Imagine the arrogance: a program manager thinking he could understand the field of structural mechanics from such a short briefing. Such was the nature of NASA in 2006. Oh, how far NASA technical management had slipped since the Robert Gilruth, Max Faget, and Wernher von Braun era of Apollo.

Several factors in Table 2 can converge and potentially magnify negative behaviors or cultures. For example, program managers selected with less technical expertise, granted more power and authority than other organizations, and groomed in a “failure is not an option” environment will be less tolerant of dissent, less likely to admit what they do not know, and more likely to favor ideas and/or hypotheses that support their own predetermined views and simplified understanding of a very complex problem. It may even lead them to believe they truly understand a problem (after a one-hour primer on structural mechanics) which they do not, and to make critical decisions without conferring with the appropriate technical community. Having a “Shared Cognitive Frame,” such as the Shuttle as an operational vehicle, represents a “Cognitive-level factor contributing to the organization’s muted response to the foam strike threat. Specifically, NASA framed the Shuttle program as a routine production environment, a subtle but powerful shift from the vigilant research and development mindset that permeated NASA in the Apollo years”14,23. In addition, program managers coming from a process or operations background have very little tolerance for problems/issues without solutions and tend to view the research necessary to understand complex technical behavior as overhead and/or a “science project.” The reluctance to hand over any portion of a technical problem from JSC or MSFC to one of the Research Centers was partially due to the perception of the researchers’ natural tendency to “play in
the sandbox” and take much too long to fully understand a problem instead of obtaining the expedient “80%” solution. The SSP felt it could not expect to stay on schedule if it relied on an R&D approach to solve all of its problems. This attitude persisted even though the SSP had firsthand experience working with the R&D Centers on the return-to-flight (RTF) program following the Columbia accident and had seen the effectiveness of these centers in producing key technologies under budget and ahead of schedule. It was business as usual for the SSP to squander more money and resources to maintain loyal support from the local SSP contractors and vendors. A Mission Operations Directorate (MOD) “failure is not an option” philosophy is not compatible with an R&D view of design, development, and problem solving, in which most advances are made following failures. Shuttle program managers, incapable of admitting how much they did not know and reliant on their own “chief engineers” for guidance and competing theories for solving problems, found it very difficult to realize and admit their ignorance and mistakes. They had lost all connection with the real “experts” within their own agency who could really help.

“Can-Do,” “Invincible” History: Much has been said and written about NASA during the early Apollo years as a “Can-Do” organization that had always been able to make good on any claims or goals. As reported in reference 1: “prior to both accidents, NASA appeared to be immersed in a culture of invincibility, in stark contrast to post-accident reality. The Rogers Commission24 found a NASA blinded by its “Can-Do” attitude, a cultural artifact of the Apollo era that was inappropriate in a Space Shuttle Program so strapped by schedule pressures and shortages that spare parts had to be cannibalized from one vehicle to another.” In addition, “engineers and program planners were also affected by “Can-Do,” which, when taken too far, can create a reluctance to say that something cannot be done”1. This same attitude is related to the “overconfidence bias”13 and results in a natural tendency to oversell a new project to accommodate unrealistic cost and schedule demands.

In the initial months to a year following the Columbia tragedy, this “Can-Do” culture, coupled with an operations-oriented leadership for RTF, created unrealistic estimates of 6 weeks to 3 months for the agency to return to flight, which caused premature down-selection of technologies and methodologies for inspection, TPS repair, damage tolerance, and the root cause and solution of ET foam liberation. Coupled with a lack of systems engineering expertise, this resulted
in numerous dead ends and missed opportunities; it eventually enabled a return to flight, but it took a long time and was very costly. In Annex A of reference 25, Individual Observations by Mr. Joseph W. Cuzzupoli and Mr. Richard H. Kohrs: “The utilization of operational-type management and engineers made the return to flight of the Space Shuttle difficult. Nevertheless, the result was enormously positive for NASA. They got there!” In addition, the lack of engineering experience and leadership, together with this “can-do” attitude, translated into a TPS repair philosophy that focused on solutions to the worst-case problems for the worst-case conditions (e.g., a 16 in. x 16 in. hole in the hottest region of the hottest RCC panel). Many simpler and more readily available RCC repair solutions, applicable for temperatures up to 2800 °F and more moderate damage, were discarded prematurely in the striving to reach the maximum temperature condition of 2960 °F. In fact, the region on the leading edge where temperatures exceeded 2800 °F totaled only 149 sq. in. (1.4% of the surface area of the leading edge). In essence, we were throwing away many simpler 99% solutions in search of the “Holy Grail,” the 100% solution; was this “Can-Do” spirit or just plain arrogance? While this “can-do” mentality was positive and supported by the leadership in MOD at the time of the Apollo program, it should also be noted that, at that time, those same leaders recognized and valued the importance of dissent and actively sought such “out-of-the-box” approaches to problem-solving. What we have today, however, is a lack of tolerance for failure, which discourages the natural cycle of failure and discovery that is necessary in an R&D program. In fact, it is highly probable that JSC did not even recognize that RTF would be an R&D project. If it did, why was project leadership placed in the hands of MOD? With a lack of tolerance for failure, you develop a culture that suppresses mistakes and, hence, fails to learn from them as an organization and, at the very worst, tends to hide or cover up mistakes.

Culture of “Experts”

An R&D culture like that at NASA LaRC developed some of the most highly qualified researchers in the fields of aerothermodynamics, aerodynamics, aeroelasticity, structural mechanics, composite materials, shell stability, structural optimization, and structural dynamics. Many of these researchers spent their entire lives developing an understanding of their particular technical discipline or area of expertise. They gained this knowledge and understanding
over 30-40 years of analysis, testing, and analysis/test correlation. Never once would you hear the term “expert” used by any one of these individuals, nor would you hear other colleagues refer to them as “experts.” At NASA JSC, however, this distinction was bestowed frequently and easily, even upon junior engineers (less than 10 years’ experience). The JSC culture allowed such individuals to be anointed with such titles and recognition and, in many cases, since there was no real technical peer review of their work, enabled them to rise in stature within their own local organization to a level that allowed them unfettered access to senior program managers. What the CAIB team uncovered and could not understand was how such informal access by one such “expert,” Cal Schomburg, was able to influence the decisions of senior managers without the slightest proof by data, analysis, test, or past experience. Many times, the answer to such questions was very simple. The lack of technical understanding by the senior program managers regarding Cal Schomburg’s area of expertise caused them to believe the words of an individual who had no prior expertise in the impact and damage tolerance of RCC, in aerothermal testing and/or analysis of RCC, or in RCC materials and structures. In fact, the LESS-PRT had no such expert on their team, nor was such expertise sought.

The arrogance, ego, power, and authority of many of the supposed “engineers” who curried favor with and crossed over to the side of the “elite” program managers tended to grow as the Orbiter continued to fly successfully. No matter how close to the edge of the envelope we came after each new anomaly, the credibility of such “experts” continued to rise, as did their stature and place amongst the program elites. They were called on time and time again, formally and informally, to provide “flight rationale” for program managers to maintain schedule and budget. The engineers who were the most adept at conducting themselves at senior-level boards, manipulating and presenting data, and developing new jargon and terminology to imply a greater understanding were the fastest to rise through the ranks of the program. The qualities that sustained mid-level leaders striving for senior leadership positions were compliance with the strict chain of command and the rules that sustain bureaucratic accountability and structural secrecy (to be discussed shortly); a consensus-driven disposition, not prone to rocking the boat; and an intolerance of dissent, coupled with an insecurity apt to block any effort that might prove them wrong or, in their minds, lessen their luster and chances for advancement. As mentioned above, since the senior-level program managers had become far removed from technical insight, their capability to calibrate the discipline expertise of their sources was severely lacking. Hence,
their reliance on “any” expert to corroborate their views was often sufficient, and there was very little probing into the “credentials” of such internally bestowed authorities. What is unfortunate is that, due to other social and behavioral influences (e.g., a can-do mindset, arrogance, and an imbalance of power), when asked a question outside their area of specialization or expertise, these “experts” would often answer with the same authority as they would if the question were in their own specialty, as noted pre-Challenger and pre-Columbia and as will be shown later when discussing the RCC panel 8R anomaly. Without the ability to question, probe, and challenge such answers without being labeled a “non-team” player and dissident, oftentimes the incorrect responses from the so-called “experts” became fact. This occurred more often than one would care to imagine.

Sub-Cultures at Johnson Space Center (JSC)

JSC is the lead center for human spaceflight; it commands the largest share of the NASA budget and wields an enormous amount of power and public prestige. In addition to its financial clout and sway over the success of its thousands of space contractors, JSC enjoys an enormous amount of political support. Moreover, because of significant cuts in applied research, the NASA Research Centers (LaRC, GRC, and ARC) became more reliant on, and hence more influenced by, the JSC culture as they depended more and more on the program funding generated at JSC. In addition to having engineering and program subcultures somewhat similar to those of other centers, JSC also developed several other very powerful sub-cultures, such as the astronaut and flight director (mission operations) sub-cultures. How these various sub-cultures interact significantly affects how decisions are made regarding human spaceflight. JSC is viewed by other centers as uncooperative and arrogant. In fact, when asked how each center viewed each of the other centers at a Wallops Island NASA management training retreat that I attended (a typical icebreaker that senior management instructors use at the beginning of courses), JSC was viewed by every other center represented at the retreat as arrogant and having “attitude with an edge.” The JSC contingent’s response was also typical: “Yeah, we know we are the center of the universe, so what?” They proceeded to have T-shirts manufactured to proudly display “JSC – Attitude with an Edge.” Space and operations centers like JSC viewed the research centers as “Ivory Towers” where no “real work” gets done. In fact, while I was considering applying for the Astronaut Corps, the Center Director of JSC, George Abbey, asked me: “When
are you going to leave that ‘library’ (NASA LaRC) and come to work for a ‘real’ engineering center?”

Program Sub-Culture: Having described JSC as one of the most powerful of all the NASA centers, it is no wonder that the JSC program management culture would be unique and quite powerful. In fact, the imbalance of power between the program and engineering organizations at JSC (with respect to money, resources, and influence over outside contractors) had a serious effect on the willingness of engineers to bring dissenting views up the chain of command from the Engineering Directorate to the program side of the center and to defend them successfully. It was not uncommon for powerful program and project managers at JSC to shout down people and ideas and pound tables at key program board meetings, such as the Shuttle Program Requirements Control Board (PRCB), without any retribution from above. The program managers at JSC truly knew their position in the pecking order and quite often were untouchable by even the NASA Administrator, for political reasons. It is understandable how, with such a great imbalance of power among the various organizations at JSC, the program subculture there could dominate and intimidate other critical organizations such as engineering. In fact, the Director of Engineering at JSC while the Columbia crew was in orbit, Frank Benz, never questioned whether his engineering team had scrutinized the sophomoric impact analysis presented to the program, nor did he defend his engineer Rodney Rocha’s request for a picture to be taken. Diane Vaughan2 describes this as the natural order of affairs in a “Production Culture”: “The social administrative arrangements of engineering also contain cultural scripts that are integral to the occupational worldview. Engineering is a bureaucratic profession… Their ‘place’ in the hierarchical system is clear. The daily existence of engineers in production organizations exemplifies what Braverman identified as the ‘separation of conception from execution’ that began with Frederick Taylor’s introduction of ‘scientific management’ into the workplace in the early twentieth century. Workers’ control over their craft was altered when planning responsibilities were taken from the individual craft worker and shifted to the managers, leaving the workers to follow orders, implementing plans without access to the full picture.”2

In the days of the Apollo Program, the power was more evenly distributed: the program managers had significant prior experience in research and engineering and could understand and make key technical decisions with sound judgment; the engineering leaders were respected because it was widely recognized that we had never embarked on such a unique mission and R&D was crucial; and
there was plenty of money to test and fail and learn what we did not know prior to launch. The environment we were operating in at NASA during Columbia was quite different: funding for human spaceflight was very tight (trying to juggle three programs on a flat budget); most of the engineering teams had operated in an oversight or “sustaining engineering” role with little R&D and design expertise for over 20 years; and program managers and even senior engineering managers had very little hands-on engineering experience. The SSP production culture described above had become, at JSC, a bloated bureaucratic morass of rules, regulations, procedures, board meetings, and telecons, which prevented the effective transmission of data and the ability to rapidly solve key technical issues and anomalies. Evidence of how ineffective this operations-oriented production culture had become could be seen in the delays, false starts, and unintegrated approach of the return-to-flight effort (discussed in the next chapter). MOD, not engineering at JSC, led this uncoordinated effort, with very little experience in R&D and no “systems-level” understanding of the complete problem. In effect, the key leadership did not understand the big picture of the R&D problem at hand and, hence, was not able to effectively direct and manage the effort25.

In the recommendations section of the CAIB Report (ref. 1, pg. 208), the Board recognized several of the issues with the culture at JSC and noted: “The cultural impediments to safe and effective Shuttle operations are real and substantial, as documented extensively in this report. The Board’s view is that cultural problems are unlikely to be corrected without top-level leadership. Such leadership will have to rid the system of practices and patterns that have been validated simply because they have been around so long. Examples include the tendency to keep knowledge of problems contained at a Center or program; making technical decisions without in-depth, peer-reviewed technical analysis; and an unofficial hierarchy or caste system created by placing excessive power in one office. Such factors interfere with open communication, impede the sharing of lessons learned, cause duplication and unnecessary expenditure of resources, prompt resistance to external advice, and create a burden for managers, among other undesirable outcomes. Collectively, these undesirable characteristics threaten safety.” It is interesting to note that the above quote describes exactly the patterns, behaviors, and work group culture exhibited by the LESS-PRT working group prior to and directly after the Columbia tragedy. In an e-mail, I stated: “The LESS-PRT is a very hierarchical, close-knit working group which is not prone to ask for outside help or review and often does not
recognize the need when appropriate.” While this statement may seem harsh, the facts remain: this team did not raise a request for on-orbit pictures during STS-107 after the foam impact; they did not have a physical understanding of ballistic foam impacts to RCC; they did not contest the judgment of tile TPS experts like Cal Schomburg, who had little experience with RCC material; they had dictatorial control over all data and hardware concerning the RCC leading edges and nose cap; and they never once sought outside help from the researchers at NASA LaRC who had years of experience with RCC and whom they had called upon years earlier to conduct impact tests of RCC to understand what caused damage during STS-45. In addition, the CAIB recognized how the formal and informal “pecking order” affected decision-making: “Structure and hierarchy represent power and status. For both Challenger and Columbia, employees’ positions in the organization determined the weight given to their information by their own judgment and in the eyes of others. As a result, many signals of danger were missed.”1 This was also alluded to in the section entitled “Culture of Experts” and will be discussed further in the sections on “Bureaucratic Accountability” and “Structural Secrecy.”

This characteristic of the culture also resulted in the lack of “deference to technical expertise” mentioned earlier, as witnessed by the reluctance of Linda Ham (leader of the MMT for STS-107) to seek out the person who had requested the imagery during the Columbia mission and/or the correct technical expertise to make an informed decision. The Mission Management Team (MMT) knew about the potential damage to Columbia’s wing while the crew was in orbit but decided, in the words of MMT chair Linda Ham, that “There isn’t much we can do about it.” The CAIB would later determine that there was a challenging but feasible rescue plan. We could have sent a second shuttle up to rendezvous with Columbia in space and ferry the stranded astronauts back home. While this plan would have required several other astronauts to risk their lives, it still would have been better than doing nothing, and it most assuredly would have been the choice of Apollo Flight Director Gene Kranz (remembering Apollo 13). And I can say with 100 percent certainty that I and any other astronaut would have leaped at the chance to save our colleagues. Instead, Ham told the Columbia crew they were safe to return home, effectively sealing their fate. This disaster should have served as a wake-up call, an opportunity for NASA to reevaluate its processes and return to its research roots. Unfortunately, the opposite happened, and the safety conditions at NASA only slid from bad to worse.
Astronaut Office Sub-Culture: Another very powerful organization within JSC is the Astronaut Office. The entire center revolves around the ability to place the human members of the Astronaut Office into space and to return them safely back to Earth. In a sense, this could result in a pampered group of prima donnas who are the envy of many in the agency and at JSC and who are held in very high regard by the public and the government. In fact, many of the engineers, flight controllers, pilots, and flight directors at JSC had at one time or another applied to the astronaut program and may have been rejected. As you might imagine, this undercurrent of feelings can affect the social dynamics of board meetings and the decision processes. This section will hopefully provide some insight into the astronaut sub-culture and how this office interacts with other organizations at JSC. This office has also evolved over the years from how it was during the Apollo Program.

The Astronaut Office is a self-contained society with its own rules, regulations, policies, and procedures, governed primarily by a “military-like” culture. The leader of the office when I was first selected in 1996 was a pilot astronaut and, as such, a graduate of a military test pilot school. There was a definite caste system within the office, in descending order: military test pilots/Shuttle pilots, non-Shuttle pilots, military Mission Specialists (MS), military radar intercept officers (RIOs, “back-seaters”), JSC MSs, and civilian MSs; and make no mistake about it, military test pilots ruled! Most of the “leadership” positions within the office were held by military members of the Astronaut Corps, and the perception that the military prepares leaders correctly perpetuated this tendency within the office. Unflown civilian MSs (Mission Specialists) had very little voice within the office, regardless of prior experience, expertise, or leadership positions. It was not uncommon to hear an experienced astronaut pilot tell an un-flown astronaut that the reason he/she was not being heard was that they had not flown yet, and that once you flew, you would see a sea change in how your voice was heard within the Astronaut Office and throughout JSC, as if 14 or so days in space would automatically elevate even the most incompetent person to a position of authority and recognition. As a military-like organization, the Astronaut Office strictly required the corps to follow the “chain of command”; not doing so was severely frowned upon and could affect an astronaut’s career and future flight assignments.

Pre-Columbia, the Astronaut Office stressed the fact that its members were operators and not engineers. The training focused on following set procedures, called flight data files (FDF), which were similar to checklists and covered almost every conceivable nominal and off-nominal condition. This focus of attention
on flight operations, and the lack of tolerance for detailed engineering reports, was quite evident in the way typical Monday morning all-hands meetings were run within the office. It was no wonder that very few within the Astronaut Corps even knew that a large piece of foam had hit the TPS of Columbia prior to Earth entry and its eventual disaster. The astronaut representative at the MMT for STS-107 had no reason to notify anyone in the corps because engineering was not considered a primary function of the office. There was no penetration or probing by the Astronaut Office at that time, and the results of the MMT regarding the safety of the crew were accepted. It did not matter that many of the mission specialists had significant experience in select specialized fields that could have been of service; the policy was that we were supposed to give up our previous lives and follow the policies, procedures, and chain of command of the Astronaut Office. Even the engineering concerns and recommendations of Capt. John Young (Commander of the first Shuttle flight, STS-1, veteran of Gemini and Apollo missions, and one of the best astronaut engineers ever) would go unheeded during most Monday morning meetings, to the point of his being treated rudely by other, less knowledgeable astronauts within the office. I was dumbfounded by this because, on several occasions, I found John's words to be right on target, even down in the details of my own technical area of expertise! Many times after the accident, I asked the leaders of the Astronaut Office to let me brief the astronauts on what they needed to do to ensure they were safe to fly and how to reach the key researchers within the agency and around the country who had the expertise required for technical answers to anomalies they were experiencing. To this day, I have never been allowed to give this closed-door briefing to my astronaut colleagues. Teamwork was very important within the office, and more often than not, dissenting opinions were taken as the mark of a non-team player and frowned upon. Getting along and consensus were important within the Astronaut Office, as you can imagine, especially for training individuals for long-duration spaceflight. The "oddball," lone-wolf, researcher-type personalities might have a tough time fitting into such a program. In addition, because being an astronaut is such a high-profile job, replete with all the advantages and disadvantages of media attention, the Astronaut Office was a very closed, insular organization where public disclosures were tightly controlled, key information (such as future flight assignments) remained in the hands of only a select few individuals (the "in" crowd), and rumors and perceptions could easily make or break a person's career and chances for flight. Astronauts were told not to offer opinions, and any public statements were to reflect the formal position of the Astronaut Office. As an office, we were to speak
with one voice and were often supplied the "talking points" prior to a media event.

Engineering Sub-Culture: The engineering profession is not as precise and "rule following" as the general public would believe, and when a failure occurs, "outside investigators consistently find an engineering world characterized by ambiguity, disagreement, deviation from design specifications and operating standards and ad hoc rulemaking."2 Engineers are described as rule-following, yet in complex technical systems the rules are continually evolving. Technological systems, even complex ones, become operational only after a significant period of time, once a sufficient number of tests and analyses have been conducted to verify a comprehensive understanding of the system and accurately predict its behavior. New insights discovered during this operational period are used to fine-tune or refine the design to improve its performance, operability, and maintainability. During the early design and development of a new technology, such as a space vehicle, a different type of "engineer" is needed to innovate, design, and develop that technology. Researchers, designers, or research engineers are critical during this design/development/discovery phase and are quite different from the types of engineers required to operate and maintain the health of the technology once it has been developed and is operational. These researchers and designers are not strict rule-followers by nature because they are typically blazing a new trail, learning and discovering rapidly on the fly. There are no rules, procedures, or blueprints to follow, just the ideas and imagination of the R&D team. The necessity for failure, and the toleration of failure, during this phase of the design is critical for the success of the project, as described in the section entitled "Operations vs. R&D Culture." The "Engineering Culture" described by Diane Vaughan refers to what could be called "sustaining engineering," which describes engineers whose primary role is to maintain an operational system, identify critical failure mechanisms, and recommend possible incremental improvements. Such an engineer fits within a "production" environment where he or she knows his or her place within the hierarchical bureaucracy led by the managerial "elites." When production, efficiency, and profit become the bottom line, the engineers understand their role within such a system, and "the institutional arrangements have led many scholars to conclude that engineers are 'servants of power': carriers of a belief system that caters to dominant industrial and government interests….Research has established the engineering profession's acceptance and endorsement of
bureaucratic authority relations and capitalistic concerns about the costs and efficiencies of production systems."2 The engineering culture at JSC had devolved significantly from the research culture which enabled the success of Apollo. As stated in reference 2 with regard to the rapid growth of NASA as a function of the space race: "As a consequence of these developments, the space agency's pure technical culture began to be compromised by changes in the premises of accountability at NASA. According to Romzek and Dubnick, during Apollo, the space agency relied on professional accountability: control over organizational activity rested with the employees with technical expertise. It was a system built on trust of and deference to the skills of those at the bottom of the organization….In the 1970s, however, professional accountability struggled to survive as the Agency adopted the trappings of bureaucratic accountability."2 Bureaucratic accountability is defined as control from the top, with strict allegiance to hierarchy, procedures, rules, and chain of command. This concept will be discussed further in a later section.2 The years after Apollo saw a change in the engineering culture at JSC and its eventual slide to a position of oversight, sustainability, and maintainability of a very complex system, the Space Shuttle, which was deemed "operational" and whose engineers had to hand over responsibility to a private commercial company, United Space Alliance (USA), to manage and operate. This slide from a once-strong engineering and research culture to more of an oversight role eroded much of the in-house technical expertise related to the creative/exploratory side of engineering in favor of what Michael Roberto11 would call the "confirmatory" side. More and more, JSC Engineering had become an arm of the Shuttle Program and, in a matrix organization where one of its bosses sat on the much more powerful "Program Manager" side, often tended to satisfy the priorities or needs of the SSP. Hence, the emphasis on developing flight rationale (to somehow develop a plan or strategy to continue to fly shuttles and maintain schedule and budget) was first and foremost on the minds of many engineers and engineering managers at JSC. There were no rewards for finding problems that did not have near-term solutions. There was no emphasis on asking the "what if" types of questions common in an "exploratory response"15 mode or R&D type of culture, which would ask pointed technical questions in the face of new anomalies without fear of recrimination or accusations of not being a team player. There was no
"reward" system for people who found and reported errors, as is the case for most HROs, as you will learn in Chapter 6.

Flight Director/MOD Sub-Culture: Another organization at JSC that is very influential is the Mission Operations Directorate (MOD), which is predominantly composed of the Flight Directors and Flight Controllers responsible for the planning and operation of human spaceflight missions (Shuttle and Station). Much of the original leadership in operations came from the military side of the house and, hence, made up the senior managers in MOD during the initial stages of the human space program. Much like the current Astronaut Office, MOD is also a hierarchical bureaucracy that enforces a strict chain of command and adherence to rules and procedures. The flight controllers, who are trained to be "experts" in the operation of a specific system, for example, environmental control and life support, may grow to become proficient in the operation of several other systems (e.g., structures and mechanisms; guidance, navigation, and control (GN&C); propulsion; etc.) and eventually become a flight director (FD). There are different types of flight directors. For example, ascent and entry FDs are responsible for the very dynamic phases of a mission where a great deal of critical activity is packed into only minutes of time (e.g., a total of only about 8 minutes from launch to orbit). Most of the anomalies that may occur during these dynamic phases of flight have been rehearsed prior to flight, and solutions and strategies to mitigate the outcome of various anomalies and/or combinations of anomalies have been simulated and practiced many times before the mission. Almost all possible failure scenarios have been studied, and the result of all this knowledge is a comprehensive set of flight rules (a description of the philosophy behind the flight procedures, called the flight data file or FDF). The FDs and astronauts are trained to strictly follow these flight rules and procedures during the course of every mission. The controllers who eventually rise to the top to become FDs are very capable of thinking critically, rapidly assessing critical issues during the dynamic phases of flight, and acting decisively in selecting the correct course of action, making split-second life-and-death decisions. This is a very demanding and difficult job that requires a tremendous amount of focus and confidence, which is developed over years of simulated flight training and actual mission support. As you can imagine, an ascent and entry FD is much like a commanding officer on the front lines, having to make split-second decisions with incomplete, ambiguous, and sometimes conflicting data. The style of leadership required
for this task is aligned more closely with a directive style as opposed to a contemplative style, as you might expect. There is very little time, hence very little appreciation, for subordinates who come to the FD with problems but without solutions or potential options. Hence, it might be expected that there is very little tolerance for dissent and the utmost need for a united team effort, or teamwork. On-orbit operation of a space vehicle or space station is a completely different animal in that there is usually much more time to think and contemplate when a problem arises. The problem may not be life-threatening, and there may be several solutions or paths that may or may not be viable. On-orbit operations for a space station mission, which spans approximately six months to a year, are also very different from the launch, entry, and on-orbit operations of the Space Shuttle. During ISS operations, many more of the routine daily operations, such as attitude control, health monitoring, and even the operation of the robotic manipulator system (robotic arm), can be commanded from the ground with little monitoring by the crew. There are many fewer emergencies that require immediate action by the crew and, hence, although the crew is trained to strictly follow procedures for most specific tasks, training emphasizes a general skill set because less of the day-to-day maintenance requires direct supervision from the ground. As you might imagine, there can be considerable tension among the astronauts, the flight control team, and the program managers regarding how much control and responsibility is allocated to each organization. This struggle for control and authority plays out during, say, the on-orbit management of anomalies by the MMT on the ground and factors into the decisions made there, as each organization brings with it a separate set of needs, priorities, cultural norms, and cognitive biases. Like the Astronaut Office, MOD enforces strict adherence to the chain of command and rule-following. Reading between the lines of the CAIB report, it appears that Linda Ham (a former ascent and entry flight director), as head of the MMT, was probably upset with Wayne Hale (also a former ascent and entry flight director at JSC, then serving a rotational assignment at KSC as the Shuttle Program Manager for Launch Integration during the Columbia tragedy) for not going through the chain of command (her) before going out with a request to the Air Force for on-orbit imagery using secure assets. "Hale's earlier call to the Defense Department representative at Kennedy Space Center was placed without authorization from Mission Management Team Chair Linda Ham. Also, the call was made to a Department of Defense representative who was not the designated liaison for handling such requests. In order to initiate the imagery request through official channels, Hale also called Phil Engelauf at
the Mission Operations Directorate, told him he had started Defense Department action, and asked if Engelauf could have the Flight Dynamics Officer at Johnson Space Center make an official request to the Cheyenne Mountain Operations Center. Engelauf started to comply with Hale's request." (CAIB, ref. 1, pg. 152) "After the Department of Defense representatives were called, Lambert Austin telephoned Linda Ham to inform her about the imagery requests that he and Hale had initiated. Austin also told Wayne Hale that he had asked Lieutenant Colonel Lee at the Department of Defense Manned Space Flight Support Office about what actions were necessary to get on-orbit imagery." NASA had made and rescinded an imagery request to the DOD in less than 90 minutes. "Linda Ham asked Lambert Austin if he knew who was requesting the imagery. After admitting his participation in helping to make the imagery request outside the official chain of command and without first gaining Ham's permission, Austin referred to his conversation with United Space Alliance Shuttle Integration Manager Bob White on Flight Day Six, in which White had asked Austin, in response to White's Debris Assessment Team employee concerns, what it would take to get Orbiter imagery." (CAIB, ref. 1, pg. 153) Rodney Rocha's request for imagery and concern regarding TPS damage was taken to ascent/entry FD Leroy Cain, who, after checking with Phil Engelauf (a senior FD from MOD), sent the following e-mail (figure 7):
Figure 7 – E-mail from ascent/entry FD Leroy Cain regarding requests for imagery.
This e-mail (fig. 7), sent by an FD to many of the FDs in MOD, states that the senior FD, Phil Engelauf, considered requests for imagery to be a "dead issue." In essence, since MOD is very similar to the Astronaut Office with
respect to the organization speaking with only one voice, this was a statement to all FDs and members of MOD that the MOD position was that we did not need imagery! In addition, MOD tended to circle the wagons to protect its own during the accident investigation and the RTF phase post-Columbia. Lambert Austin left NASA after the accident; all other parties associated with this imagery request remained at NASA.

Cognitive Biases: In "Lessons from Everest," Michael Roberto describes how pre-existing "cognitive biases" helped contribute to the deaths of two very experienced climbers and members of their separate teams during their quest to climb Mount Everest on May 10, 1996.13 He describes how these biases, which contributed to the tragedy of two very complex and dangerous expeditions, are endemic to many high-risk, high-performance teams and are related to decisions that were made during the Challenger and Columbia tragedies. References 14 and 26 relate the role of cognitive biases in the shuttle tragedies and offer suggestions for countering some of these biases and dealing with ambiguous threats during what the authors term a "recovery window" of opportunity and the decision process that ensues. A recovery window is defined as the period of time between a threat and a major accident (or prevented accident) during which constructive collective action may be feasible. The recovery window for STS-107/Columbia began two days into the mission, when the large piece of bi-pod foam from the ET (approximately 1.6 lb and 1,290 cu. in.) hit the undersurface of the left wing (either the RCC or the shuttle tiles) at approximately 800 ft/s (545 mph), and ended 16 days later with the breakup of Columbia upon Earth entry. The differences in responses and decisions leading up to the Challenger and Columbia tragedies and the Apollo 13 incident can be partially explained by how the teams responded to a clear threat to life (Apollo 13) versus the Shuttle examples, where the threats were ambiguous. In the case of the Columbia accident, either there was no clear threat or danger as perceived by the MMT, or the threat was downplayed for reasons explained below. In actuality, however, the recovery window should have begun in 1981, when the first large pieces of foam came off the ET during launch (a recovery window of 22 years)!
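To put those numbers in perspective, consider a simple back-of-the-envelope estimate of the kinetic energy of the strike; the arithmetic below is mine, using only the approximate mass and speed quoted above:

$$ m \approx 1.6\ \mathrm{lb} \approx 0.73\ \mathrm{kg}, \qquad v \approx 800\ \mathrm{ft/s} \approx 244\ \mathrm{m/s} $$

$$ KE = \tfrac{1}{2}\,m\,v^{2} \approx \tfrac{1}{2}\,(0.73\ \mathrm{kg})\,(244\ \mathrm{m/s})^{2} \approx 2.2\times 10^{4}\ \mathrm{J} \approx 16{,}000\ \mathrm{ft\cdot lbf} $$

That is roughly the energy of a 1,500-kg automobile rolling into a wall at about 12 mph, delivered to a thin, brittle reinforced carbon-carbon panel. Whatever the uncertainty in the exact mass and velocity, no rough estimate of this kind supports treating such a strike as a mere "maintenance" issue.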
Cognition is the mental process of knowing, or that which comes to be known through awareness, perception, reasoning, judgment, and intuition. Cognitive biases are the natural tendencies that affect what we know, or think we know, and the mental models we create: a stubborn attachment to an existing belief. Research on human cognition indicates that people are predisposed to downplay the possibility of an ambiguous threat. When faced with an ambiguous threat, we tend to hold on to those stubborn beliefs and formulate our understanding of a situation automatically. In the sections below, we will try to show how cognitive biases at the individual and group or organizational levels play an important role in shaping the values, standards, and norms that help formulate the work group culture and define the processes used to construct risk and make decisions. Cognitive biases offer us another lens through which to view and understand why and how organizations tend to downplay ambiguous threats, and how to prevent this from clouding our judgment and subsequent responses during potential recovery windows. In the present section, we will describe these biases and their relation to the tragedies.

Confirmation Bias: Confirmation bias14 is a stubborn attachment to existing beliefs, making sense of situations automatically and then favoring information that confirms rather than disconfirms these initial views. "This confirmation bias makes it difficult for us to spontaneously treat our interpretations as hypotheses to be tested."27 This natural human tendency to "favor confirmatory data and discount discordant information"14 was displayed when MMT lead Linda Ham and SSP senior managers (Ron Dittemore and Ralph Roe) relied heavily on Cal Schomburg's "opinions" without actively seeking out opposing views, as mentioned earlier. Diane Vaughan identifies an "informal chain of command" as part of the problem, which allowed Cal Schomburg easy access to the seniors of the program; yet the "automatic" acceptance of the answer (confirmation that the debris was not a threat or risk) should have been challenged by the leaders of the SSP and the MMT. This tendency to favor the answer we would like to hear is natural and must be actively challenged using critical thinking precepts. What happens is that a routine tendency develops in which opposing views are required to "prove there is a problem," and a double standard emerges such that the technical rigor required to prove a problem is much greater than that required to "prove we are safe." As a NASA engineer interviewed in reference 15, Torarie Durden, notes: "The level of factual experimental data that you had to produce to articulate that you didn't agree with something made it almost prohibitive in terms of time and cost/benefit analysis…in the sense that you're going against the norm. To disprove a theory within NASA required an unbelievable amount of rigor and analytical data."
There was a double standard: what was required to prove a reason not to fly far exceeded what was necessary to prove flight rationale! This is the exact opposite of what you would expect in a sound safety culture, where the onus is on proving you are safe to fly! In addition, when the program holds the keys to all the necessary information (data, resources, funding, etc.), you may find yourself in the untenable position of trying to prove the program is in error without access to the key data necessary to prove your position: a "catch-22." This is exactly what happened with the RCC panel 8R anomaly, as will be described in detail. Couple a confirmation bias with a natural predisposition to downplay ambiguous threats and a reward system that actively recognizes solutions that provide "flight rationale," as opposed to the "Chicken Littles" who "cry wolf" and only identify problems (Rodney Rocha was referred to as a "Chicken Little" by SSP Chief Engineer Paul Shack when Rodney raised the issue of the debris strike on STS-107 and said it could be a threat), and you have a prescription for disaster.1 As mentioned earlier, before Columbia there was ample evidence that foam debris liberation during launch (a deviant and unanticipated consequence of a poor design) could cause critical damage to vulnerable systems (TPS (tile and possibly RCC), windows, control surfaces, etc.). Yet the SSP and the LESS-PRT failed to ask the probing questions needed to understand the risk of foam debris like the ET bi-pod foam ramp losses, which were orders of magnitude larger than what had been tested (3 cu. in. vs. 1,290 cu. in.). In addition, the SSP did not maintain camera systems to better view this debris liberation, better correlate observations with models, and assess risk. A reliance on prior rules and decisions, and on so-called "expert" opinions on impacts of foam on tile, "confirmed" initial, automatic, idiosyncratic explanations and assessments of risk and provided an answer which minimized threats to program schedule and budget, or so it was believed. In addition, the eager acceptance of the unsuitable "analytical model" for impact damage, Crater, much like the acceptance of the Salita model for O-ring erosion,3,15 illustrates this natural tendency toward confirmation bias. In the case of the RCC panel 8R anomaly, it was the NESC's dedicated initial focus on providing the SSP with "flight rationale" (confirming cues or data with which to validate flying, rather than developing an understanding of the problem that could better comprehend the risks), with only two weeks to the launch of STS-117, that led to the incorrect recommendation to fly, as will be explained in Chapter 4.
Sunken Cost: Michael Roberto describes the "sunken cost" effect as "…a tendency for people to escalate commitment to a course of action in which they have made substantial prior investments in time, money or other resources. If people behaved rationally, they would make choices based on the marginal costs and benefits of their actions. The amount of any previous unrecoverable investment is a sunk cost and should not affect the current decision. However, research demonstrates that people often do consider past investment decisions when choosing future courses of action. In particular, individuals tend to pursue activities in which they have made prior investments. Often, they become overly committed to certain activities despite consistently poor results. They 'throw good money after bad' and the situation continues to escalate."13 Sunken cost is one of the forces that hooks addictive personalities into becoming problem gamblers, for example. On the climb to the summit of Everest in May of 1996, Rob Hall and Scott Fischer, two very experienced mountain climbers and successful leaders of expeditions, recognized that a dangerous escalation of commitment could occur as climbers approached the summit. Despite this, the sunken cost effect played a significant part in the decision processes which caused these experienced climbers to break even their own pre-arranged turn-around times and continue toward the summit, resulting in the loss of Hall and Fischer and another three members of their two separate expeditions. Pre-Columbia, the SSP decision to continue flying, which Diane Vaughan viewed as a "normalization of deviance," can also be viewed as the result of a combination of cognitive biases such as "confirmation bias" and "sunken costs." In the latter case, the flight rationale to continue flying with foam debris was based on confirming data of foam impacts, which bolstered confidence in survivability (albeit data from a much smaller size of foam debris), and on reliance upon previous decisions which had classified the risk as "acceptable." In addition, past decisions also involve a commitment of ego by high-level managers who have invested reputations and careers. Sunken cost "also inhibits us from questioning initial sensemaking. Having invested heavily in a course of action, individuals become reluctant to abandon it and consider new directions."14 The "summit," in the case of the Space Shuttle post-Columbia, and possibly pre-Columbia, was the approaching retirement of the shuttle fleet and the completion of the construction of the ISS. These shuttle and ISS "summits" could be shown to be major factors in the decisions made
concerning the continued flights post-STS-114 despite RCC anomalies, and in the suggestion to consider flying with 0 of 4 engine cut-off (ECO) sensors as opposed to the nominal 4-of-4 sensor requirement. The stakes were high and the summits were so close; that was precisely the time we needed to be most vigilant, so that the decisions we made were not tainted by hard-to-recognize cognitive biases.

Shared Cognitive Frame: A cognitive frame is defined as "mental structures – tacit beliefs and assumptions – that simplify and guide participants' understanding of a complex reality."15 When we refer to cognitive frames, we refer to meaning overlaid on a situation by human actors. Hence, a shared cognitive frame is the way in which an organization or group of individuals views or frames a problem, operation, or environment (e.g., the way the Shuttle Program viewed the Space Shuttle as an operational vehicle in a routine production environment) and the rules or meaning which that frame implies. Another shared cognitive frame was the belief in NASA as a "can-do" organization. "The operational frame and the schedule pressure manifested itself during the recovery window, when Linda Ham voiced concerns that the time spent maneuvering the shuttle to take additional photos would impact the schedule. In a research context, new data are valued; in a production context, schedule is king."15 This operational frame, and the belief in the ability to "pull off" the near impossible in the face of insurmountable odds and constraints, leads good people and engineers to promise the world, underplay risks and threats, and sell Congress programs on a shoestring budget and an impossible schedule. It fosters the sense of invincibility mentioned previously and biases our decisions and judgment, for example, the acceptance of 3-6 weeks as a reasonable amount of time to be ready to return to flight following the Columbia tragedy. Even if upper management realized this was an impossible demand, many believed that this unrealistic constraint had to be imposed on the "engineers" or else they might slack off (the "elite" vs. engineer mindset). This totally ridiculous idea of misleading the people you are trying to "lead" fosters a mistrust that has the opposite effect of what was originally intended and is detrimental to productivity.

Overconfidence Bias: Michael Roberto cites "overconfidence bias" as a contributing factor impairing the judgment of Everest expedition leaders and climbers Hall and Fischer.13 "Researchers have found that people are typically overconfident in
their judgment. Scholars have confirmed this finding in studies of people from a wide range of fields, including academia, business, medicine, and the military."28 Both Hall and Fischer were extremely talented and experienced climbers. Hall had climbed to the top of Everest four times, guiding 29 clients. He exhibited signs of overconfidence and occasionally bragged "that he could get almost any reasonably fit person to the summit."29 Fischer was also a very experienced high-altitude climber, although he had reached the summit of Everest only once. He once told his team, "We've got the Big E figured out, we've got it totally wired."29 This same overconfidence bias was exhibited by many of the SSP managers during the Columbia tragedy, as has already been mentioned. Consider the statements made by Cal Schomburg regarding his "opinion" that the damage would not be a safety-of-flight concern (even for the RCC leading edge), and by the LESS-PRT leads regarding the impact tolerance of RCC (from figure 6: "Basically the RCC is extremely resilient to impact type damage. The piece of debris (most likely foam/ice) looked like it most likely impacted the WLE and broke apart. It did not look like a big enough piece to pose any serious threat to the system."). It will be shown in a later section how this same leader of the LESS-PRT made similar heuristic judgments without data, from allaying concerns about potential hail damage to the WLEs while the Orbiter sat on the launch pad following a hailstorm at the Cape, to the current RCC panel 8R anomaly. Both judgments were later determined to be wrong (data showed that hail damage could have caused subsurface RCC damage, and the discrepant RCC panels were later replaced for successive flights because of significant sub-surface delaminations, as predicted by the IR NDE methods developed prior to STS-117, which could have been in place prior to STS-121)! In addition, this same "overconfidence bias" persists today, even after the large PAL ramp ET foam loss on STS-114 and the ice frost ramp (IFR) losses on STS-121. In the former case, the probability of a PAL ramp loss on STS-114 had been predicted to be 1/10,000. In the latter case, the probability of critical damage from IFRs had been predicted to be ~1/300. Initial images of the IFR liberation looked ominous and had several senior managers in the SSP and at JSC scrambling at what looked like five large pieces of foam from different IFRs, which flew over and under the starboard WLE. The next day, when it was learned that the pieces were smaller than originally thought and came from one large IFR piece which broke up in the airflow, jubilant MMT and SSP managers were "high-fiving" each other, acting as if what had occurred actually proved the "safety" of their initial "guesses." Unfortunately, such strokes of luck continue to reinforce the incorrect behaviors and previous decisions of completely oblivious
managers. When we come to the description of the 8R anomaly, you will be completely amazed (or, by then, maybe not) by the arrogant statements by these same senior managers regarding what they could predict/divine concerning the timing of RCC coating chip loss based on two prior flights!

Recency Bias: The "recency effect" or "recency bias" occurs when decision makers place too much emphasis on information that is readily available and, in this particular case, recent.13 Evidence from the Everest disasters suggests that recency bias may have impaired the judgment of the expedition leaders: the climbers had enjoyed remarkably good weather conditions on Everest in recent years and may have underestimated the likelihood of dangerous storms. In the case of Columbia, the data most available to the MMT came from a very inadequate impact prediction tool called Crater, which was grossly misused (the correlated debris sizes were orders of magnitude smaller than the bi-pod foam, and impacts to RCC material were not even part of the database). It is totally unconscionable that not one engineer spoke up, knowing the tenuous predictive capability of this tool for the intended application. A large piece of bi-pod foam that came off two flights prior to STS-107 caused serious damage to an SRB attach ring, yet it was not catastrophic and, hence, was viewed as a recent successful experience. During the FRR for STS-120, a senior project manager for Orbiter stood up and said he had faith that the SiC coating chips would not come off because they had not on the recent previous flight, STS-114.

Psychological Safety: Probably one of the most damning cultural aspects of the human spaceflight community at NASA at the time of both accidents was an environment that was not psychologically safe. Psychological safety is the belief that the workplace is safe for interpersonal risk-taking, such as questioning others or sharing unpopular but relevant views. Team psychological safety is the "shared belief that the team is safe for interpersonal risk-taking.30,31 It means that team members demonstrate a high level of trust and mutual respect for one another. Moreover, it means that team members do not believe that the group will rebuke, marginalize, or penalize individuals for speaking up and challenging prevailing opinions."30 This was the primary reason engineers like Rodney Rocha felt threatened if they spoke out against the beliefs or opinions of the program office. Roberto indicates that the Mount Everest expedition teams "did not openly discuss issues and errors and that group members did not feel comfortable
expressing dissenting views. The unwillingness to question team procedures and exchange ideas openly prevented the groups from revising and improving their plans as conditions changed."13 Several elements could have undermined team psychological safety: 1) perceived status differences within the team, 2) the style of the expedition leader, and 3) a lack of familiarity among group members. It is striking how status differences, the leader-guide-client protocol (similar to following a rigorous chain of command), and displays of arrogance and violent outbursts by the leadership mirror the behaviors reported before Challenger and Columbia. The status gap between the "lowly engineers" and the program managers is illustrated by statements such as Rodney Rocha's, concerning Columbia, during an ABC News interview21: "I couldn't do it (speak up more forcefully)… I'm too low down…. And she's (Linda Ham (MMT Lead)) way up here." The environment within NASA JSC and MSFC was not what it should have been; dissenting opinions were effectively prevented from being heard.1,2 The NASA leadership often challenged dissenting views and put contractors and lower-level engineers in vulnerable positions: "Top shuttle program managers held to previous Flight Readiness Review assessments. In the Challenger teleconference, where engineers were recommending that NASA delay the launch, the Marshall Solid Rocket Booster Project Manager, Lawrence Mulloy, repeatedly challenged the contractor's risk assessment and restated Thiokol's engineering rationale for previous flights." In effect, Mulloy was challenging Thiokol's credibility concerning previous analyses and decisions. As mentioned earlier, this is one of the reasons why it is very difficult to change previous decisions unless you encourage an environment where failure is an option and admitting prior "mistakes" is not career suicide! At one point, exasperated by the engineering analysis and the concerns about launching at temperatures below 53°F, Mulloy stated, "My God, Thiokol, when do you want me to launch, next April?" You will hear similar challenges and cheap shots many times when program managers question the intent of dissenters as trying to stop the Shuttle Program (e.g., "When do you want me to launch, never?"). There was a tremendous amount of pressure not to be the one who had to stand up and say his or her problem was critical enough to hold up a launch. This is a totally different mindset from Toyota's policy, where anyone on the production line can pull the "Andon Cord" and stop the entire assembly line, without any fear of retribution, if something does not look right.30 Hence, many critical issues were not raised to a high enough level soon enough, for fear of retribution by the program managers. According to Mulloy:
"You always went to those reviews hoping that somebody else would have a worse problem than you did so that it wouldn't be your problem that held up a launch. We Project Managers joked among ourselves about it. We called it being the 'long pole,' the lightning rod, the one that absorbed all the attention and electricity, so to speak, by having the problem that delayed the launch."2 Regarding the Columbia accident, MMT Chair Linda Ham made many statements in meetings reiterating her understanding that foam was only a maintenance problem and not a safety-of-flight condition. Once again, an authority was highlighting prior flight rationale decisions and asking the community to defy or refute those decisions and/or get the previous group to admit a mistake or faulty logic! That will not change unless there is a climate where dissent is encouraged and dissenters are not subject to recrimination or ridicule as "Chicken Littles" (the term SSP Chief Engineer Paul Shack used regarding Rodney Rocha's request for imagery during Columbia's mission). Some of the conclusions of the CAIB concerning psychological safety: "Similarly, Mission Management Team participants felt pressured to remain quiet unless discussion turned to their particular area of technological expertise, and, even then, to be brief. The initial damage assessment briefing prepared for the Mission Evaluation Room was cut down considerably in order to make it 'fit' the schedule. Even so, it took 40 minutes. It was cut down further to a three-minute discussion topic at the Mission Management Team. Tapes of the STS-107 Mission Management Team sessions reveal a noticeable 'rush' by the meeting's leader to the preconceived bottom line that there was 'no safety-of-flight' issue (see Chapter 6). Program managers created huge barriers against dissenting opinions by stating preconceived conclusions based on subjective knowledge and experience rather than on solid data. Managers demonstrated little concern for mission safety."1 "By firmly stating her confidence in prevailing views about foam strikes at the outset, Ham inadvertently created pressure for people to engage in self-censorship despite their reservations. Without meaning to, managers often reduce psychological safety when they actively seek endorsement of their own views. For example, Ham sought Schomburg's opinion at several critical points in an MMT meeting to bolster the belief that foam strikes did not present a problem. Like most human beings, she did not engage in a comparable effort to seek dissenting views."12 Regarding psychological safety post-Columbia, it is apparent that the environment is at best not much better, and probably worse, than it was pre-Challenger and pre-Columbia. High- and mid-level managers who have expressed dissent have been re-assigned and excluded from e-mail distribution lists, calls have been made to their supervisors, etc. This
will be expanded upon when we discuss the case study of the 8R RCC anomaly, which follows.

Organizational: Hierarchy, Bureaucratic Accountability, and Structural Secrecy: Bureaucratic accountability is defined as control from the top, with a strict allegiance to hierarchy and chain of command, following rules and procedures, and relaying information up the hierarchy.2 It stands in contrast to the "professional accountability" the space agency relied on during the Apollo era, where control over organizational activity rested with the employees with technical expertise, and deference was given to the skills of those at the bottom of the organization: "In the 1970s, however, professional accountability struggled to survive as the agency adopted the trappings of bureaucratic accountability. Control at the top, superior-subordinate relationships, orders, close supervision, rules and regulations, and hierarchical reporting relations began to dominate NASA's technical culture. Many engineers, instead of doing hands-on engineering work, were shifted to supervisory oversight of contractor's work. Less time was spent in the labs, and more time was spent at the desk and/or traveling. Project managers and engineers alike were burdened with the procedural and paperwork demands of central clearance as they responded to the organization's burgeoning non-technical administrative apparatus."2 The SSP had become a bureaucratic morass, structurally and procedurally, and it severely constricted and slowed the transfer of key information through an organization that needed rapid dissemination of information and problem solutions as anomalies occurred that were outside the normal experience base and not well understood. A simple example of the complex flow of information through the review boards necessary to process information, analyze and technically review it, and make decisional judgments is shown schematically in figure 6 below. The rules, policies, and procedures necessary to navigate this process consume a tremendous amount of time to understand and master. Couple that with the time required for briefing compilation, scheduling, consulting intricate institutional chains of command, and the reviews necessary to even enter the system, and the result is excessive delay and the watering down of key technical details and assumptions. The outcome, oftentimes, was a PowerPoint presentation containing numerous mistakes and misinformation, presented to key program boards that were required to make technical decisions on what was really an R&D problem. Knowledge of how to navigate the complex maze of boards, the presentation protocol and format of the
PowerPoint presentations, and the informal avenues for acquiring buy-in from key managers in positions of authority around the table prior to the meetings were in most cases more important than the technical rigor and peer review of the problem/anomaly (form was more important than content). The very nature of this process resulted in the insular resolution of issues with very limited access to objective, non-advocate peer review.
Figure 6 – Board structure of the Space Shuttle Program.
Structural secrecy is described as "the way patterns of information, organizational structure, processes and transactions, and the structure of regulatory relations systematically undermine the attempt to know and interpret situations in all organizations."2 At NASA, structural secrecy concealed the seriousness of the O-ring problem, contributing to the persistence of the scientific paradigm on which the belief in the acceptability of the risk was based. Structural secrecy can also be defined as the way that organizational structure and information dependence obscure the severity of problems from the people responsible for problem oversight and decision-making. The Rogers Commission concluded after Challenger that there appeared to be "a propensity of management at Marshall to contain potentially serious problems and to attempt to resolve them internally rather than communicate
them forward."24 The House Committee on Science and Technology later found the opposite to be true: communication was open; however, the way the information was analyzed and presented, and the way the O-ring work group constructed risk, was such that the problem was presented as understood and the team as having sound rationale for flight.32 Dissenters like Leon Ray, in the case of Challenger, were "accepted" into the O-ring work group and allowed access to the data the group had, which "confirmed" the rationale to fly. By winning over Leon Ray, the work group helped to legitimize its decision. In addition, all three safety organizations "depended on the engineers in the SRB work group (O-ring work group) for information. These information dependencies affected the regulators' definition of the situation." This led the House Committee to declare "that the information that reached the top decision makers was 'filtered': it was interpreted for Levels II and I by those beneath them in the hierarchy, whose position and expertise enabled them to interpret data and pass it on." The key word in the above quote is "filtered." Leon Ray's management felt he was "crying wolf," just as Paul Shack, the chief engineer for the SSP during Columbia, viewed Rodney Rocha's attempts to obtain imagery as "Chicken Little." Both times, the engineers who dissented from the work group culture's and the program culture's decisions were "accepted" into the group, and their positions changed. There are several cultural and behavioral reasons why this may have occurred, as presented in earlier sections. In addition, structural secrecy and bureaucratic accountability concern the control and transfer of information: how it is manipulated, filtered, and presented, and how the work groups then use that information to construct risk. Once the work group has made a decision, it is very difficult to reverse: "For the work group to reverse itself at any point prior to the Challenger launch teleconference would have required a rejection of the scientific paradigm advanced in FRR for all previous launches. To do so was an occupational risk, jeopardizing the engineers' professional integrity, causing them to lose face in all those forums. Either way, they were wrong then or wrong now. Larry Wear said: Once you've accepted an anomaly or something less than perfect, you know, you've given up your virginity. You can't go back…. It's very difficult to go back now and get very hard-nosed and say I'm not going to accept that" (ref. 2). During an MMT meeting on January 24, 2003, the Debris Assessment Team's (DAT's) 40-minute presentation on the potential severity of damage based on the Crater analysis was reduced to a 1-2-minute discussion by Don McCormack (MER manager), which did not contain the DAT charts and was cut short by Linda Ham. The issue of the enormous amount of uncertainty
in the analytical results never registered with Linda Ham (the MMT Chair). McCormack was further removed from the technical details of the DAT and, hence, was not able to stress the uncertainty and the limited capability of the Crater code. According to reference 1: "Structure and hierarchy represent power and status. For both Challenger and Columbia, employees' positions in the organization determined the weight given to their information, by their own judgment and in the eyes of others. As a result, many signals of danger were missed. Relevant information that could have altered the course of events was available but was not presented…. In the more decentralized decision process prior to Columbia's re-entry, structure and hierarchy again were responsible for an absence of signals. The initial request for imagery came from the 'low status' Kennedy Space Center, bypassed the Mission Management Team, and went directly to the Department of Defense separate from the all-powerful Shuttle Program. By using the Engineering Directorate avenue to request imagery, the Debris Assessment Team was working at the margins of the hierarchy. But, some signals were missing even when engineers traversed the appropriate channels. The Mission Management Team Chair's position in the hierarchy governed what information she would or would not receive. Information was lost as it traveled up the hierarchy. A demoralized Debris Assessment Team did not include a slide about the need for better imagery in their presentation to the Mission Evaluation Room (MER). Their presentation included the Crater analysis, which they reported as incomplete and uncertain. However, the Mission Evaluation Room manager perceived the Boeing analysis as rigorous and quantitative. The choice of headings, arrangement of information, and size of bullets on the key chart served to highlight what management already believed. The uncertainties and assumptions that signaled danger dropped out of the information chain when the Mission Evaluation Room manager condensed the Debris Assessment Team's formal presentation to an informal verbal brief at the Mission Management Team meeting." It will be shown in the discussion of the RCC panel 8R anomaly in Chapter 4 that not only was I, the person with the dissenting opinion, denied access to relevant data; I was actively removed from e-mail distribution lists I had access to prior to my dissent, I was not allowed to participate on the NESC team working the anomaly, and attempts were made to discredit and disparage my intentions and my character. The 8R anomaly was not brought to the attention of the SSP PRCB until almost two years after its discovery. The NESC's initial position was to work together with the program team, the LESS-PRT, and to focus on helping to provide them with "flight rationale." It was not until the dissenting
NESC engineer provided a 117-page document, which used whatever technical information he was able to obtain by prior access, that the NESC realized that there was not sufficient rationale for flight and went against the LESS-PRT and OPO’s recommendations to fly as is with discrepant RCC panels! You decide which analysis of the culture of structural secrecy was correct, reference 24 or reference 31, the Rogers Commission or the House Committee on Science and Technology? References 1. 2. 3. 4. 5. 6. 7. 8. 9. 10. 11. 12. 13. 14.
15. 16. 17. 18.
90
Gehman, H. W., et. al.: “Columbia Accident Investigation Board.” Report Volume 1, U. S. Government Printing Office, Washington D. C., August 2003. http://www.nasa.gov/columbia/ home/CAIB_Vol1.html Vaughan, Diane: “The Challenger Launch Decision – Risky Technology, Culture, and Deviance at NASA.” The University of Chicago Press, 1996. Jenkins, Dennis R.: “Space Shuttle – The History of the National Space Transportation System: The First 100 Missions.” 2010 Dennis R. Jenkins Publisher. Camarda, Charles J.: “Space Shuttle Design and Lessons Learned.” NATO Science and Technology Organization Lecture Series on “Hypersonic Flight Testing.” STO-AVT-234-VKI, March 24-27, 2014, von Karman Institute, Rhodes-St-Genese, Belgium. Cooper, Paul A. and Holloway, Paul F.: “The Shuttle Tile Story.” Aeronautics and Astronautics, Vol. 19, No. 1, Jan. 1981, pp. 24-34, 36. Camarda, Charles J.; Scotti, Stephen; Kunntu, Iivari; and Perttula, Antti: “Rapid product development methods in practice – case studies from industrial production and technology development.” Presented at the ISPIM Connects Ottawa, Ottawa, Canada, April 7-10, 2019. Camarda, Charles J.; Scotti, Stephen; Kunntu, Iivari; and Perttula, Antti: “Rapid Learning and Knowledge-Gap Closure During the Conceptual Design Phase – Rapid R&D.” Technology Innovation Management Review, March 2020 (Volume 10, Issue 3). Nemeth, Michael P. and Anderson, Melvin S.: “Axisymmetric Shell Analysis of the Space Shuttle Solid Rocket Booster Field Joint.” NASA TP 3033, January 1991. McDonald, Allan J. and Hansen, James R.: “Truth, Lies, and O-Rings: Inside the Space Shuttle Challenger Disaster.” University Press of Florida, 2009. Mitchel, M.: “Complexity – A Guided Tour.” Oxford University Press, 2009. Starbuck, William H. and Farjoun, Moshe eds.: “Organization at the Limit – Lessons from the Columbia Disaster.” Blackwell Publishing Ltd., 2005. Vaughan, Diane: “NASA Revisited: “Ethnography, Theory, and Public Sociology.” American Journal of Sociology, Volume 112 Number 2, September 2006. Roberto, Michael A.: “Lessons from Everest – The Interaction of Cognitive Bias, Psychological Safety, and System Complexity.” California Management Review Vol. 45. No, 1, Fall 2002. Edmondson, Amy C.; Roberto, Michael A.; Bohmer, Richard M. J.; Ferlins, Erika M.; and Feldman: “The Recovery Window: Organizational Learning Following Ambiguous Threats in High-Risk Organizations”. In: “Organization at the Limit: Lessons from the Columbia Disaster”. Blackwell Publishing, London, Starbuck, William H. and Farjoun, Moshe eds. 2005. Vaughan, Diane: “Changing NASA: The Challenges of Organizational System Failures”. In Critical Issues in the History of Spaceflight.” NASA SP-2006-4702. Stephen J. Dick and Roger D. Lanius Editors, 2006. Hall, Joseph Lorenzo: “Columbia and Challenger: Organizational Failure at NASA. Space Policy 19 (2003) pp 239-247. Elsevier Ltd. 2003 Kraft, Christopher; Borman, Frank; Jeffs, George; Lindstrom, Robert; Maultsby, Thomas; and Rigell, Isom: “Report of the Space Shuttle Management Independent Review Team.” (“The Kraft Report”). NASA TM-110579 February 1995. Report of the Columbia Accident Investigation Board Vol. II Appendix D.12 – Impact Modelling, Government Printing Office, Washington D. C., October 2003. http://www.nasa.gov/columbia/ caib/html/VOL2.html
Charles Camarda 19. 5. Grosch, D. J. and Riegel III, J. P.: “Ballistic Testing of Orbiter Tiles.” Southwest Research Institute Final Report #06-2720 prepared for Rockwell International, San Antonio, Texas, February 10, 1989. 20. 6. Goodlin, Drew L.: “Orbiter Tile Impact Testing.” Southwest Research Institute Final Report #18-7503-005 prepared for NASA JSC, San Antonio, Texas, March 5, 1999. 21. ABC News Primetime Live video: Final Mission. July 7, 2003. https://www.youtube.com/ watch?v=DPDufha00u8 22. Vaughan, Diane: “System Effects: On Slippery Slopes, Repeating Negative Patterns, and Learning from Mistake?” in William H. Starbuck and Moshe Farjoun, eds., “Organization at the Limit: Lessons From the Columbia Disaster.” Oxford UK: Blackwell, 2005: 41-59. 23. McCurdy, Howard E.: “Inside NASA: High Technology and Organizational Change in the U.S. Space Program.” The Johns Hopkins University Press 1993. 24. Presidential Commission on the Space Shuttle Challenger Accident (1986): Report to the President by the Presidential Commission of the Space Shuttle Challenger Accident. 5 Vols. Washington D.C.: Government Printing Office, June 6, 1986. 25. Final Report of the Return to Flight Task Group. July 2005. 26. Roberto, Michael A.; Bohmer, M. J.; and Edmondson, Amy C.: “Facing Ambiguous Threats.” Harvard Business Review, reprint R0611F, November 2006. 27. Einhorn, H., J. and Hogarth, R. M.: “Confidence in Judgment: Persistence in the Illusion of Validity.” Psychological Review 85, 395-416, 1978. 28. Lichtenstein, S.; Fischoff, B.; and Phillips, L.: “Calibration of probabilities: The State of the Art to 1980.”In “Judgment Under Uncertainty: Heuristics and Biases.” Kahneman, D.; Slovic, P.; and Tversky, A. eds. Cambridge University Press, New Your, N.Y., 1982. 29. Krakauer, J.: “Into Thin Air: A Personal Account of the Mount Everest Disaster.” Anchor Books, New York, N.Y., 1997. 30. Edmondson, A.: “Psychological Safety and Learning Behavior in Work Teams.” Administrative Science Quarterly, 44 (1999): 350-383, at p. 354. 31. Edmundson, Amy C.: “The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth.” John Wiley and Sons 2019. 32. U.S. Congress, House, Committee on Science and Technology, “Investigation of the Challenger Accident: Report.” Washington, D.C. Government Printing Office, 1985. 33. President Bush Announces New Vision for Space Exploration. January 14, 2004, NASA Headquarters, Washington, D.C.
Chapter 3
Cosmetic Fixes and Near Misses

One week after the Columbia accident, the U.S. Expedition 8 prime crewmembers, Mike Foale and Bill MacArthur, and its backup crew (me and Leroy Chiao) were allowed to stop training in Russia and return to the U.S. to be with our friends and family in Houston, and to grieve with our colleagues and classmates. I was anxious to get home to see how JSC was responding to the accident; were the engineers at JSC reaching out to the researchers at Langley and the other NASA research centers to understand the technical cause of the problem and how to fix it?

Kent Rominger ("Rommel") was the head of the Astronaut Office, and he held a meeting offsite at the local Holiday Inn one evening to brief the astronauts and their families on how the office would proceed and what our roles would be with respect to the accident investigation. He related everything he knew about the accident and wanted us to be responsive to the needs of the families who lost their loved ones. This was to be a time of healing. I foolishly thought I did not need to heal. I was angry and wanted to jump right in and start fixing the problems. I was asking Rommel lots of probing technical questions, trying to understand what had happened, when a fellow classmate, good friend, and Marine Shuttle pilot, Charlie Hobaugh, came over, took me aside, and asked me very nicely to tone it down a bit. I guess a little of my anger and my New Yorker desire to chastise (kick ass and take names) the people who could have allowed this to happen was showing, and this was not the right time. I couldn't help it; in my opinion, this was not just an organizational failure, as Diane and others determined. This was a failure of engineers, program managers, and flight directors (FDs) who were not doing their jobs, and a failure of a culture that prevented well-intentioned people from speaking up, admitting what they did not know, and actively seeking out the really knowledgeable people we had
within the agency that could have helped. In my mind, Diane Vaughan was too kind in her book1 and refused to blame individuals; she did not understand that what was missing was the research culture I was so accustomed to at Langley. How could she be expected to understand? She was trained as a social scientist and could not be expected to know the subtle differences between engineers and research engineers. I knew that was the case because, over the seven years I had been training as an astronaut at JSC, I had watched engineers misdiagnose technical anomalies, like the cause of the composite slat structural failures on the ISS treadmill. Bill Shepherd, the first commander of the ISS, relied on this treadmill for the critical exercise needed to prevent mineral loss in his bones during extended stays in space at zero gravity (0-g).

The next morning, I walked into one of the first team meetings at JSC regarding the accident. It was the aerothermodynamics team, the team analyzing the heat loads caused by the slamming of the Orbiter vehicle into the Earth's atmosphere at hypersonic speeds, trading kinetic energy for heat energy, which allows the Orbiter to slow down and glide to a horizontal landing on a runway. I was very pleased to see that the JSC engineers had reached out to the researchers at Langley, who had already started conducting hypersonic wind tunnel tests to try to understand the anomalous temperature readings that Leroy Cain and his flight control team had monitored as Columbia was crossing over Texas. The temperature readings had hinted something was wrong. I was very happy to see that this JSC team had finally made connections with the research centers and seemed to be collaborating. It was too bad that the impact damage assessment team and the LESS-PRT at JSC had not reached out to the research centers while the STS-107 crew was in orbit. Those researchers would have known immediately that the Crater analysis was inadequate, could have used more rigorous analysis tools to highlight the criticality of the problem, pushed for pictures to be taken by the Air Force, and helped with the planning of a rescue and/or repair mission!

One of the next meetings I attended was led by Ralph Roe, the Shuttle Program manager responsible for the Orbiter vehicle, and was called the Orbiter Vehicle Engineering Working Group (OVE-WG). Ralph had decided, immediately after the accident, that he would lead an internal accident investigation team at JSC and begin the process of collecting and analyzing all the data relevant to the accident. He did this for several weeks. I thought it was very strange that NASA would allow the Shuttle team to conduct its own accident investigation, but I did not say anything. Admiral Gehman, who was appointed to lead the Columbia Accident Investigation Board (CAIB) right after the accident, quickly
put a stop to that and told Ralph and Linda Ham (the head of the MMT during STS-107/Columbia) to stand down. He would not allow them to conduct their own investigation. NASA Administrator Sean O'Keefe tried to object; however, once the news media got hold of the story, O'Keefe realized the public would never allow NASA to investigate itself and backed off2.

I had been very suspicious right after the accident occurred, when I heard the accident investigation would be run by members of the Department of Defense, because I feared that they would try to protect NASA, another government agency, and not get to the real root cause of the problem. I quickly changed my opinion when I learned more about Admiral Gehman and how he understood the social and organizational science theories of Diane Vaughan and their impact on the investigation. I had researched all of Dr. Vaughan's work, as well as that of other social scientists, and was sure there was something there. The Admiral was, in my mind, very prescient and understood the importance of the social and cultural causal links between Challenger and Columbia. It was Admiral Gehman who made sure the members of the CAIB read Diane's book "The Challenger Launch Decision"1, invited her to be a formal member of the CAIB, and listened to and understood her sociological analysis and analogies to Challenger. His leadership produced the most extensive, open, and rigorous accident investigation NASA had seen, one which went well beyond the proximate technical causes of most investigations and proved to the world the importance of the work of research scientists in the fields of sociology, psychology, cognition, and behavior in understanding and correcting the real root causes of such problems.

It was during an OVE-WG meeting led by Ralph that I realized that many of the JSC Flight Directors and Shuttle managers were in denial about the real causes of the accident and were circling their wagons in defense mode, protecting their own and trying to prove their decisions were not flawed (cover your ass (CYA) was a term that came to mind). It was pathetic. In fact, only days after the accident, the head of the Shuttle Program, Ron Dittemore, had held up a large piece of ET foam during a media interview and stated that the foam did not cause the accident: "We are comfortable with the foam." There was no contrition, no admission of any mistakes or misjudgments. According to Harwood2, Leroy Cain, the Entry Flight Director during STS-107, stated that everyone agreed the foam could not have caused any damage. His first action, once it was obvious we had lost the vehicle, and most likely the crew, was to call out to his team to "Lock the doors" and to make no outgoing calls and receive no incoming calls, with the intent of preserving the data and, I assume, preventing any leaks to the press. I watched Cal Schomburg, the TPS manager and
wrongly perceived "RCC expert," who had stated quite forcefully while the STS-107 crew was in orbit that the foam would not cause any damage to the RCC wing leading edge, stand up and be forceful and loud at the OVE-WG meeting. When I would push back on Cal's understanding of the problem, he would get hotter, louder, and redder in the face. I could see Ron Dittemore give me the death stare from a corner of the room where he was standing. How could these people who had screwed up so badly remain so arrogant, I thought. I was mad, and I showed it as I shouted down Cal and glared at Dittemore. Ralph tried desperately to keep things civil. My displeasure with him also showed. Why hadn't he realized the severity of the foam strike to the Orbiter? That was my first brush with Ralph, and unfortunately, it was not good. I was making enemies fast, and I truly thought that I would never fly in space, but I really did not care.

In spite of the fact that the CAIB had been initiated almost immediately following the accident, Roe decided to create the OVE-WG team (fig. 1), pulling together things like the timeline of events, failure analysis, data analysis, and documentation (e.g., corrective action reports (CARs), hazard controls, certification of flight readiness (CoFR), debris assessment, etc.), while Paul Munafo at MSFC headed a team to understand why and how the ET foam was coming off the vehicle. There was a mix of senior shuttle industry managers from Rockwell (Bo Bejmuk), Boeing (John Mulholland), and United Space Alliance (Pam Madera); JSC Flight Directors (Phil Engelauf, Leroy Cain, and John Muratore); and JSC engineers (e.g., Julie Kramer and Phil Glynn). The team was busy trying to correlate the timeline of events and sensor readings to determine the technical cause of the Columbia disaster.

Program managers' first reaction to a problem was to create a hierarchical organizational chart, based on their best guess as to the key areas of needed work and some work breakdown structure, and to identify leaders they felt appropriate for the individual tasks. This hierarchical "command" structure3 was typical of how most organizations at NASA functioned and, as explained in Chapter 2, can often impede the flow of information up, down, and across organizations, what Diane Vaughan referred to as structural secrecy1. The methods used and the cast of characters did not change. Everything was to be worked internally, primarily by JSC and its contractors, led by the same people who had failed miserably at recognizing the criticality of the technical problem which downed Columbia. These were the same people who had watched large foam losses since the very first shuttle flight (for over 22 years) without any real physics-based understanding of what was causing the foam to come off or what would happen if the debris impacted the
vehicle. This was the same strategy they would use to create and lead the teams that developed the return-to-flight (RTF) strategy and planning.
Figure 1. – Orbiter Vehicle Engineering Working Group organization.
Lessons Not Learned

The causal links between Challenger and Columbia were so strong there was no denying the social, organizational, and cultural issues had not been corrected post-Challenger, even though the presidential commission that studied the Challenger accident, the Rogers Commission, seemed thorough and NASA appeared to address most of its recommendations. One of the conclusions of the commission was:

"In view of the findings, the commission concluded that the cause of the Challenger accident was the failure of the pressure seal in the aft field joint of the right Solid Rocket Motor. The failure was due to a faulty design unacceptably sensitive to a number of factors. These factors were the effects of temperature, physical dimensions, the character of materials, the effects of reusability, processing, and the reaction of the joint to dynamic loading."4

In her analysis of what went wrong and why NASA had succumbed to another related tragedy 17 years after Challenger, Diane Vaughan noted the
"Commission did not name organizational culture as a culprit."5 NASA and the commission failed post-Challenger in several areas5:

• The commission identified "human factors" as "contributing causes," meaning they were of lesser, not equal, importance. NASA's organizational system was not given the causal significance it was in the CAIB report6.
• The commission attributed communication failures to individuals and did not recognize the organizational structure and structural secrecy as playing a role.
• NASA did not identify all layers of NASA's organizational system as targets for change.
• NASA's initial changes were implemented as the commission directed, but the changes to the safety structure were not.
• NASA remained powerless as a government agency to change its institutional environment because it remained dependent on external political pressures and budgetary decisions made elsewhere (external to NASA, for example, in Congress and by the president).
• NASA took the commission's mandates to make changes as an opportunity to make additional changes within the agency, which resulted in unintended negative consequences (e.g., NASA's inability to monitor reductions in Boeing personnel who moved back to California, taking with them the Crater impact tool knowledge).

NASA did heed the commission's recommendation to centralize program management and safety. Control of the Shuttle Program post-Challenger was shifted from JSC to NASA Headquarters in an attempt to restore the communication excellence of Apollo, and NASA created the Headquarters Office of Safety, Reliability, and Quality Assurance (SR&QA). However, SR&QA did not have direct control over all safety operations, and NASA failed to recognize the importance of organizational structure in impeding communication.

The last paragraph of Diane's book "The Challenger Launch Decision"1 is prophetic in that the author recognized that the necessary social and cultural changes would not occur and that another accident would probably happen7. In her words: "History repeats, as economy and production are again priorities. The lingering uncertainty in the social control of risky technology is how to control the institutional forces that generate competition and scarcity and the powerful leaders who, in response, establish goals and allocate resources, using and abusing high-risk technical systems."1
In the wake of Challenger and Diane's research, the CAIB recognized the importance of social science and expanded its causal model to treat the social and cultural causes of the Columbia accident as equal in weight to the proximate technical cause, the debris strike. Normalization of deviance, lack of effective decision-making, production pressure, hierarchical organizational structures, bureaucratic accountability, structural secrecy, and lack of psychological safety were all identified as primary causes. The CAIB made 29 recommendations for NASA's return-to-flight (RTF), and an independent Return-to-Flight Task Group (RTF-TG), led by astronauts Tom Stafford and Richard Covey, would review the CAIB's recommendations and make its own evaluation for NASA8.

Unfortunately, NASA openly ignored the cultural issues raised by Vaughan and haughtily brushed aside the insightful recommendations of the board only months after the accident. NASA Administrator Sean O'Keefe rapidly established a new engineering and safety center, the NASA Engineering and Safety Center (NESC), as a quick answer to one of the board's recommendations for an independent safety and technical authority. However, making NASA technically excellent would have required an understanding of the research culture which made NASA's predecessor organization, NACA, and the Apollo Program so successful. Unfortunately, this would be impossible because the applied research DNA which was successfully encoded at the birth of NASA in 1958 had not survived 40 years of harsh budget cuts made to fund expensive and sometimes wasteful human space endeavors led by managers with scant technical DNA left in their genes.

A true "research culture" has an unquenchable thirst for knowledge that is unrelenting and persists until every anomaly is understood and can be predicted through rigorous validation by analysis and experiment. It demands a psychologically safe environment that ensures critical thinking, candid discussion, and dissent. It is never driven by schedule or budget pressures. A true research culture would have automatically satisfied all the CAIB's recommendations and addressed all critical technical anomalies and gaps in knowledge.

How could the NESC be expected to return NASA to technical excellence and prevent accidents? Its chosen leader was the Space Shuttle Orbiter Project Manager at the time of the Columbia accident, Ralph Roe. The short answer: it couldn't; it did not even have the word research in its title. However, it did provide a soft landing for a valued member of the elite of the NASA human spaceflight community, Roe. Rumor was "he knew where the bodies were
buried," a common response when powerful JSC managers were rewarded for failures and mistakes instead of being punished, as most of the rank and file would have been. Even the NESC's bloated budget and the lure of pay raises could not attract the best and brightest researchers, and its close ties with the Shuttle Program made it more of a partner in providing "flight rationale" than the independent technical and safety oversight the CAIB demanded. NASA's culture grew worse. Instead of having to prove we were safe to fly, we now had to prove we weren't safe to fly, with a double standard for the evidence/data required to do so, and a powerful NESC many times impeding the attempts.

A mere eight months after Columbia, at a two-day NASA retreat (NASA's Top 40 Leaders Conference on the Wye River, Maryland), the NASA Administrator, as Vaughan would recount, began the meeting by advising the attendees to "read the CAIB report carefully because 'not everything in it was true' and that NASA HQ was checking into what it would mean to 'legally comply' as opposed to 'fully comply' with the Board's recommendations."7 These statements made it clear to all senior leaders in attendance that NASA intended to meet only the bare minimum of the recommendations needed to get the space shuttles flying again so it could complete the construction of the International Space Station (ISS), a schedule- and pressure-driven symptom of the "production culture" noted by Vaughan. The recommendation Diane made to the Associate Administrator of Human Spaceflight, to use embedded ethnographers, professionals trained in sociology, to monitor the progress of cultural change within the agency, fell on deaf ears7.

The NASA Administrator made snide comments to Dr. Vaughan at the Wye River Conference about her book on Challenger ("book sales must be up"), and a senior Flight Director, in an e-mail to her about a media sound bite she had given ("even my high school boyfriend called, but NASA never called"), described her words as "a very cheap shot...You kind of had me interested until then...too bad." In my opinion, many leaders at NASA treated Dr. Vaughan disrespectfully. Years later, when I returned from space following STS-114, I met with Dr. Vaughan at Columbia University in New York City and gave her a signed montage from our mission to thank her for her efforts in trying to help NASA post-Columbia!

In addition, only two short months after taking the reins, NASA's new administrator, Mike Griffin, would cancel the $10 million culture-change contract with Behavioral Sciences Technology, which had been put in place by his predecessor to satisfy another major CAIB recommendation. This wasn't just
a brush-off; this was a slap in the face to every member of the CAIB team that had worked so hard to help NASA recover, transform, and get back on track. It was a clear signal that NASA was not taking the primary social and cultural causes of the accident seriously! How could they? Most left-brained engineers and technical leaders considered psychology and sociology "soft" sciences.

Return-to-Flight STS-114

The period from the Columbia disaster on February 1, 2003, to the launch of the next shuttle mission, STS-114, on July 26, 2005, I refer to as return-to-flight (RTF). During these two and one-half years, NASA struggled to positively identify the technical cause of the Columbia accident; to create a strategy that would ensure that all remaining shuttle flights would be safe; to develop and mature the required technologies and flight procedures; to correct the organizational, cultural, and social causes of the accident; and to attempt to satisfy the recommendations of the CAIB6 and the RTF-TG8.

When I returned home from Star City, Russia, one week following the accident, I walked into the office of the chief of the Astronaut Office, Captain Kent Rominger ("Rommel"), and told him I did not have to be an astronaut any longer. I had worked at NASA for over 25 years on some of the most challenging hypersonics problems on joint programs with DARPA, the Navy, and the Air Force, and I could be more useful helping to put together the teams to understand the cause of the accident and to develop the technology that would help us fly safely. Rommel told me that was not necessary; I could remain in the Astronaut Office, and he would allow me to do what I thought was necessary. Several weeks later, I penned a 15-page white paper and presented it to Rommel. In it, I discussed my thoughts on the causes of the accident (technical and cultural) and how we should proceed. The first step forward would have to be the determination of the root cause of the accident:

"We may never know the exact cause of the STS-107 accident; however, several important facts are evident: during liftoff, debris hit the underside or leading edge (LE) of the left wing; the debris was most likely one or more pieces of foam from the ET; the debris could have also included other items such as Super Lightweight Ablator (SLA), SLA filled with liquid nitrogen or air, foam with ice, etc.; the debris probably struck either an RCC LE panel or T-seal, an LE carrier panel, an acreage TPS area, a TPS area near the MLGD (Main Landing Gear Door) seal, etc. TPS/substructure damage could have caused entry heating to burn through the skin and result in subsequent structural damage; hot gas could have entered through a compromised MLGD seal or through a hole in either the RCC LE or a lost LE carrier panel.
Even if the exact source or cause of the accident is not known, all damage scenarios such as those listed above have to be addressed and understood, and either their solutions realized prior to RTF or the risks deemed acceptable to the program (loss of a shuttle, not the crew). We have to ensure we fully understand all the potential mechanisms of failure, that we can model them analytically, and that our analyses are correlated/benchmarked with experimental results."

My Proposed RTF Strategy:

In a note to Rommel dated March 2003, two months after the Columbia accident, I proposed the following strategy and recommendations for a safe and expeditious return to flight of the Space Shuttle. It was my best guess, given the information I had at that time:

1. Determine the root cause(s) of the Columbia tragedy, make sure we understand the mechanisms of failure, and develop multiple solutions to fix the problem.
   a. Identify the proximate technical cause.
      i. The problem is a complex interdisciplinary problem that must be analyzed as a coupled aero-thermal-material-structural system using appropriate analysis methods.
      ii. Understand, by analysis and test, the various failure mechanisms of the ET foam being liberated at different locations on the cryotank.
      iii. Create a team of experts to develop a "physics-based" ballistic impact analysis capability to predict foam impact damage to the Shuttle Orbiter.
   b. Identify the organizational behaviors, cognitive biases, and cultures that contributed to the accident and ensure a "psychologically safe" environment.
      i. Ensure a safe path, without recrimination, to top managers for problems to be addressed.
      ii. Ensure the MMT includes objective subject matter experts (SMEs) without any ties to the SSP.
      iii. NASA and contractor working group teams need sufficient representation of experienced engineers in all domains relevant to the focus of the problem (e.g., the LESS-PRT should have had experts knowledgeable in thermal structures, structural mechanics, RCC materials, aerothermodynamics, heat transfer, impact, and damage tolerance, etc.).
2. Ensure a rigorous, methodical scientific approach to problem-solving, based on careful validation of analysis methods with representative experiments.
   a. Analytical models and methods have to be correlated by experiment and should be able to predict failure.
   b. We have to develop a "physics-based" analytical method to predict TPS and structural damage to the Orbiter.
   c. Re-assess all analytical modelling tools and failure prediction methods (e.g., the 90-degree RCC damage tests in the JSC 10 MW Arcjet Facility were found to be non-conservative; these tests had been used to certify ablation predictions and damage growth of RCC material for the Orbiter).
   d. Analysis methods have to be developed that account for the multidisciplinary, coupled nature of most problems: "zooming out" to capture the "systems engineering" perspective and yet being able to "zoom in" to identify even the most microscopic details if necessary (see the Shuttle Tile Story9,10).
3. Key capabilities and technologies need to be developed prior to RTF:
   a. Non-Destructive Evaluation (NDE) methods to determine the integrity of:
      i. RCC structural components (wing leading edges and nose cap) and coating:
         1. In-situ capability to detect voids, coating flaws, cracks, etc., while panels are on the vehicle in the Orbiter Processing Facility (OPF) and/or in orbit.
         2. Investigate IR-thermography for in-situ inspection of WLEs.
      ii. Create an NDE tiger team to investigate NDE techniques to inspect and characterize ET foam (contact key personnel at NASA LaRC's Instrument Research Division who are working on terahertz methods, shearography, IR-thermography, etc.).
   b. Full-field and local displacement and stress measurement capabilities for static and high-speed ballistic testing to obtain critical material properties of foam and other materials, which will be necessary to predict failure mechanisms.
4. Inspection and repair:
   a. Develop a strategy to enable inspection and on-orbit repair of the most probable TPS (tile and RCC) damage and locations.
   b. Understand at a systems level all implications of the repair strategy, for example: sensitivity to tripping the boundary layer at certain locations; shear flow and loads; aerothermal heating and loads (aero, thermal, vibration, on-orbit, etc.); restraint and dexterity requirements for the extra-vehicular activity (EVA (spacewalking)) astronaut performing the repair; potential to cause unintended damage to nearby structure/TPS; etc.
5. Investigate ways to make the Orbiter more robust with respect to foreign object/debris impact damage, for example, using more durable RSI tiles in high-probability impact areas and toughening the RCC WLE and nose cap.
6. Investigate the use of integrated health monitoring (IHM) to determine the extent and location of damage and debris strikes.
7. Aging aircraft/spacecraft:
   a. The shuttle was designed for a 10-year service life and 100 flights. The average age of the current fleet is over 20 years. We should utilize the structural and materials experts at NASA and in industry who understand how to inspect, analyze, and re-certify "aging aircraft" to develop a program to certify the operation of future shuttle flights and ensure safety.
8. Trajectory analysis:
   a. Develop analytical tools that could be coupled with thermal/structural damage analyses and growth predictions to optimize potential trajectories to minimize damage growth and ensure safety.
9. Develop an unmanned Orbiter de-orbit and landing capability:
   a. The Shuttle Orbiter has an automated landing capability except for some manual actions performed by the crew (e.g., throwing switches to drop the landing gear and extend the air data probes prior to landing). Having the ability to undock and land autonomously would remove the need to risk landing a damaged vehicle with even a partial crew in order to save the Orbiter (and very possibly the U.S. human space program); the entire crew could remain on the International Space Station (ISS) and await rescue.
10. Take a "systems engineering" approach to the entire damage assessment and repair strategy. Figure 2 illustrates the interrelated and interdisciplinary aspects of a repair strategy.
    a. You need an inspection capability to detect damage and, most importantly, critical damage.
    b. You need an impact damage analysis and test team to be able to predict damage and to assess what is shown by inspection.
    c. You need an on-orbit NDE capability to scan the damage visually (cameras and lasers) and measure the damage site.
    d. The Damage Assessment Team (DAT) conducts analysis and experiments to determine what constitutes critical damage. This team has to communicate with the inspection team to ensure the size of damage can be measured accurately enough (see the sketch below).
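The requirement in item 10.d is, at bottom, a detection problem: the inspection's measurement error must be small relative to the critical damage size, or a "cleared" inspection means little. A minimal sketch of that logic in Python; every number here is invented for illustration, not a Shuttle Program requirement:

```python
import random

# Toy model: can an inspection with a given measurement noise reliably
# separate sub-critical from critical damage? All numbers are invented.
critical_size = 2.5   # cm: damage larger than this is assumed not survivable
sigma_measure = 1.0   # cm: assumed 1-sigma inspection measurement error
true_size = 3.0       # cm: an actually-critical flaw

rng = random.Random(0)
trials = 100_000
# Count how often the noisy measurement reads below the critical threshold.
missed = sum(rng.gauss(true_size, sigma_measure) < critical_size
             for _ in range(trials))
print(f"critical flaw cleared as safe in {missed / trials:.1%} of inspections")
```

With 1 cm of noise against a 2.5 cm threshold, a 3 cm critical flaw is "cleared" roughly 30% of the time; shrink the assumed noise to 0.2 cm and the miss rate collapses to well under 1%. That is the quantitative sense in which the inspection and damage assessment teams must agree on requirements.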
Figure 2. – Representation of some of the interrelated systems engineering aspects of a damage assessment strategy for return-to-flight of Space Shuttle post-STS-107, Columbia tragedy.
On Columbus Day, October 12, 2003, Rommel called to tell me I was assigned as a crewmember on the Return-to-Flight (RTF) mission, STS-114. I was sitting at my desk working on ideas for on-orbit repair of the RCC wing leading edge and was totally shocked (I had never expected to fly in space because of how many enemies in high places I had been making since my return to Houston). I thanked Kent and immediately called my wife and family to tell them the news.

Some of my recommendations to Rommel were met with resistance; however, the SSP, quite to my surprise, developed a strategy that was very similar
to what I had proposed months earlier. Even recommendation 9, to develop an unmanned undock and autoland capability for the Orbiter, which, as one might expect, was met with much resistance by the astronaut pilot community, was in place prior to the RTF of STS-114.

NASA's Strategy for Return-to-Flight (RTF):

The RTF solution strategy identified key technologies and capabilities that needed to be developed to help ensure vehicle safety throughout the remaining life of the Shuttle Program. A summary of the RTF strategy and technology requirements is shown in Figure 3 and listed here:

1. Determine the root cause of the Columbia tragedy.
2. Understand the cause of the ET bipod ramp foam loss during launch.
3. Understand the mechanism of debris transport in the airstream to predict probable impact locations, velocities, and orientations.
4. Develop a capability to accurately predict debris impact damage to the Orbiter.
5. Develop a monitoring and detection capability to pinpoint debris strikes to the Orbiter WLE.
6. Develop enhanced ground imaging capability to identify launch debris threats.
7. Develop launch-to-orbit imaging capability to monitor ET foam loss during ascent.
8. Develop accurate on-orbit inspection capability for the crew to inspect the vehicle on flight day 1 (FD1) and throughout a mission as needed.
9. Eliminate all potentially critical debris sources/threats (e.g., foam, ablator, ice, etc.).
10. Investigate toughening the vehicle to resist potential debris threats.
11. Develop an on-orbit TPS repair capability.
12. Develop procedures to ensure the International Space Station (ISS) could serve as a "safe haven" (officially termed Contingency Shuttle Crew Support (CSCS)) for the Space Shuttle crew in the event of catastrophic debris damage to the Orbiter.
Figure 3. – Space Shuttle Program (SSP) multi-pronged return-to-flight (RTF) strategy.
During the RTF period, the NASA research centers were instrumental in developing modelling methods to predict debris impact damage to the Orbiter wing leading edge, in determining the technical cause of the Columbia accident, and in developing on-orbit inspection, detection, and repair capabilities. However, the Space Shuttle Program (SSP) placed leadership of most of the RTF elements in the hands of FDs and SSP mid-level managers who had very little understanding of how to run technology development projects. There were false starts and questionable decisions; the process took longer than necessary and was ugly; and the next two flights following Columbia experienced several near tragedies resulting from a similar cause, ET foam shedding8.

The SSP was never able to determine the cause of ET foam shedding or the probabilistic risk of large pieces of foam debris shedding and causing critical damage. Toughening the RCC wing leading edge or other TPS was judged too costly and too slow to implement. Because the threat of debris strikes causing critical damage could not be eliminated, the SSP added a plan for the rapid launch of a rescue shuttle to save the crew, which could shelter on the ISS as a safe haven for a period of 30 days while awaiting the arrival of the rescue vehicle. While this plan might sound reassuring, it did little to increase crew safety. For one, conducting repairs while in orbit is problematic, especially if there is significant damage. Second, at that time, the ISS was designed to host only three astronauts. Each space shuttle flew seven people. So, if the ISS were to house all ten crewmembers, then three times as many people as intended
would be stuck on the ISS, breathing the oxygen, eating the food, drinking the water, and producing CO2 and waste. In the best circumstances, the ISS could have supported ten astronauts for thirty days. That would give NASA only thirty days to plan and execute a safe launch to rescue the crew, a near-impossible feat even under ideal circumstances, but NASA would try its best to ensure such a capability prior to every launch. However, if the astronauts had been stranded because of a systemic failure that needed fixing (which is what happened on STS-114), a rescue mission would be rife with a similar risk of losing the crew and one of only three remaining shuttle vehicles. In other words, NASA's plan didn't solve the problem or even reduce much of the risk. They just found a superficial solution that had the potential to further exacerbate an already dangerous situation.

The RTF-TG concluded, "To date, the tile and RCC repair techniques developed by the Agency are not considered sufficiently mature to be a practicable repair capability for STS-114."8 In addition, the SSP never felt on-orbit repair was necessary for return to flight. However, I felt very strongly that it was, and that was the reason I initiated an independent RCC repair team.

The SSP tried desperately to develop a strategy for assigning risk to foam debris threats, but this was found to be inadequate. NASA directed the use of probabilistic risk assessment (PRA) as a means to determine the debris threat. The PRA analysis proved to be inadequate11, was used to justify flight rationale for STS-114 and the next mission, STS-121, and resulted in two near misses that could have caused accidents like Columbia. The problem with trying to develop a PRA to predict the ET foam impact threat was that it required knowledge of what caused foam to fall off the ET, how that foam would be transported through a very complex airstream between two vehicles, where and what it would strike, and whether it would cause critical damage if it struck the vehicle. Unfortunately, we knew very little about any of these phenomena, so any probability we assigned would be pure guesswork. Garbage in/garbage out, as my professors used to say. However, that did not stop the PRA team from touting estimates for risk assessments, which fed the still-entrenched production culture quite nicely. Most of the senior-level managers with signature authority to sign the Certificate of Flight Readiness (CoFR) at the FRR were unable to comprehend the technical arguments of the PRA experts and accepted the results with very little technical critique and/or discussion.
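To see why chaining poorly known probabilities produces a meaningless point estimate, consider a minimal Monte Carlo sketch. The event structure and every number below are my own illustrative assumptions, not the SSP's actual PRA models:

```python
import math
import random

def sample_mission_risk(rng: random.Random) -> float:
    """One draw of P(critical damage) = P(shed) * P(strike | shed) * P(critical | strike)."""
    def log_uniform(lo: float, hi: float) -> float:
        # Sample across orders of magnitude to reflect how little was known.
        return math.exp(rng.uniform(math.log(lo), math.log(hi)))

    p_shed = log_uniform(0.01, 1.0)      # large foam loss on a given launch
    p_strike = log_uniform(0.001, 0.5)   # debris transported onto a critical zone
    p_critical = log_uniform(0.01, 1.0)  # strike energy exceeds the damage threshold
    return p_shed * p_strike * p_critical

rng = random.Random(42)
risks = sorted(sample_mission_risk(rng) for _ in range(100_000))
print(f"optimistic (5th pct):   1 in {1 / risks[5_000]:,.0f}")
print(f"median:                 1 in {1 / risks[50_000]:,.0f}")
print(f"pessimistic (95th pct): 1 in {1 / risks[95_000]:,.0f}")
```

When each factor in the chain is uncertain by an order of magnitude or more, the product is uncertain by several; quoting any single number hides, rather than conveys, how little was actually known about each factor. That is the garbage-in/garbage-out problem in quantitative form.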
Technical Cause of the Columbia Accident:

Post-launch photographic analysis of STS-107, launched on January 16, 2003, indicated that a large piece of insulating foam, used to protect the left bipod joint that attaches the ET to the Orbiter, had broken off 81.7 seconds after launch (fig. 4). The foam impacted the underside of the left wing near reinforced carbon-carbon (RCC) wing leading edge panels 5 through 9 (fig. 5). The foam was approximately 21 to 27 in. (53 to 69 cm) long and 12 to 18 in. (30 to 46 cm) wide, and it was traveling at a relative velocity of 416 to 573 mph (186 to 256 m/s) at impact.
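The counterintuitive part of that relative velocity is that the foam did not fly into the Orbiter; the Orbiter flew into the foam. Lightweight foam released into a high-dynamic-pressure airstream decelerates almost instantly while the vehicle keeps going. The sketch below reproduces the order of magnitude of the closing velocity; every parameter value (foam mass, drag area, air density, vehicle speed, gap distance) is a rough assumption of mine, not the CAIB's reconstruction:

```python
# Minimal sketch: foam released at vehicle speed decelerates under drag
# while the Orbiter (treated as holding its speed) closes the gap.
V = 700.0        # vehicle airspeed at ~81.7 s, m/s (assumed, roughly Mach 2.4)
rho = 0.09       # air density near 20 km altitude, kg/m^3 (assumed)
m = 0.75         # foam mass, kg (~1.7 lb, assumed)
cd_area = 0.065  # drag coefficient x frontal area, m^2 (assumed, tumbling piece)
gap = 18.0       # bipod-to-wing-leading-edge distance, m (assumed)

t, dt = 0.0, 1e-4
v_foam = V                      # foam starts at vehicle speed
x_rel = 0.0                     # separation closed so far
while x_rel < gap:
    drag_decel = 0.5 * rho * v_foam**2 * cd_area / m   # quadratic drag
    v_foam -= drag_decel * dt
    x_rel += (V - v_foam) * dt  # vehicle closes on the slowing foam
    t += dt

rel_mph = (V - v_foam) * 2.23694
print(f"closure time ~{t:.2f} s, relative impact velocity ~{rel_mph:.0f} mph")
```

With these assumptions the answer lands in the 400-600 mph range quoted above: in roughly a tenth of a second the Orbiter effectively ran over its own debris.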
Figure 4. – Foam debris cloud after ET bipod foam impacts Shuttle Orbiter Columbia (ref. 6).
Figure 5. – ET bipod foam and approximate impact point on STS-107, under the wing leading edge.
One of the first tasks for the RTF team was to determine if the ET bipod foam strike to the Orbiter wing leading edge was the technical cause of the accident. Toward that end, the OVE-WG appointed a Boeing engineer, Dan Bell, to lead the development of a full-scale impact test at Southwest Research Institute (SwRI) that would recreate the foam strike that caused the Columbia accident (fig. 6). I was concerned that the SSP was marching off and conducting very large, expensive tests without even talking to the experts at Langley and Glenn, who had years of experience in analyzing and conducting such ballistic impact tests. On Feb. 25, 2003, I asked Dan about his test plans, what analyses he was planning to use to design the tests and predict results, and what instrumentation he would use to collect the data needed to verify the analysis. To my amazement, he told me he did not need any help; he knew what to expect, could predict the damage, and would know if the panel would survive. He also said that they were planning on using only six strain gages to monitor the structural behavior of the impacted RCC panel. I told him that was inadequate and that I could call some folks at the research centers to help create the analysis models. Dan told me he didn't need my assistance. The very next day, Glenn Miller of JSC engineering sent out an e-mail stating that the NASA Technical Integration Team was not
interested in additional impact analysis, and since JSC engineering was strained to support the accident investigation, he could not support an activity that had no customer! He was told he was not allowed to pass on data to a non-OVE analysis team without the approval of the MRT. Basically, I was told by Ralph Roe’s OVE team that they did not need any help, even though they had no valid impact analysis tools. I would not let this response stop me from doing what was needed.
Figure 6. – Full-scale impact test of an ET foam debris strike to a Space Shuttle Orbiter wing leading edge at Southwest Research Institute.
I could not believe that NASA would allow such an inadequate and expensive test. It would not provide any meaningful data for the program; it was little more than a great photo op for the media, one which allowed NASA Ames Center Director and CAIB member Scott Hubbard to jump up and down with glee after the test and announce that "we (NASA) found the smoking gun," when the 14 in. x 14 in. pizza-box-size hole in the wing made it clear the foam would have definitely caused the accident (fig. 7). This is not how a research organization would have conducted business.
Figure 7. – Space Shuttle ET foam impact test result at Southwest Research Institute to prove the technical cause of the Columbia accident.
I was appalled at the responses from the OVE-WG and the SSP. I happened to meet John Neer, Chief Engineer for Lockheed Martin Missiles and Space, who was an advisor to the CAIB, and I discussed my displeasure at how the SSP was running the RTF effort. I told John I could create a team of researchers from NASA that could put together a plan to develop a physics-based modeling capability that could hopefully predict the impact response of ET foam striking the RCC wing leading edge. John convinced the CAIB that this was the correct way to proceed. They halted the rush for the full-scale test, and I quickly called my friends at LaRC and GRC. While most of the research centers were reluctant to offer JSC help unless JSC requested it, I was close friends with the center director at Langley, Delma Freeman, and the chief engineer at Glenn, Dr. Woodrow Whitlow, and, without permission, I reached out for assistance. They immediately gave me access to the individuals who were the subject matter experts in ballistic impact analysis and testing.

The team and I developed a presentation and got on the OVE-WG agenda to present our plan to create an accurate physics-based modeling tool that could be ready to predict the results of the first full-scale impact test. When Ralph Roe and others heard I would be presenting, he asked for a private briefing for himself and John Mulholland of Boeing (the company that had created the Crater impact tool) prior to my presentation to the OVE team. After they digested what I was going to present, they decided I should put Glenn Miller's name on the presentation, and, oh, by the way, he would lead the project12. That is correct: Glenn Miller, the person
who did not think the analysis should be done unless it was requested by the SSP, would now lead the effort. This was classic Ralph Roe and JSC program managers: they would rather let someone internal to the Space Shuttle Program at JSC lead a research project regardless of whether that person had any expertise in the research being conducted!

The R&D impact analysis team we initiated was able to develop a modeling and simulation tool that accurately predicted the very large damage to the RCC wing leading edge shown in figure 7, and it took only 3 to 4 months. NASA and the Space Shuttle Program could have had an accurate tool in place while the crew was in orbit if they had started just four months earlier; there had been warning signs as early as STS-7 (June 1983) and as recently as STS-112 (October 2002). In Chapter 7, we will use this as one of several examples of how to develop research-based networks to address anomalies within a recovery window13 before they become tragedies.

First Near Miss – STS-114 Protuberance Aerodynamic Load (PAL) Ramp

The large piece of foam that dislodged and caused the Columbia accident came from the left bipod region of the ET (see figs. 5 and 8). The bipod ramp is a wedge-shaped foam structure 10 in. (25.4 cm) long, 14 in. (35.6 cm) wide, and 12 in. (30.5 cm) tall, with a total mass of about 2.2 lbs. (1 kg). The ramps are applied by hand-spraying BX-250/265 foam over the bipod fittings, which are covered with Super Lightweight Ablator (SLA) during processing (fig. 9). During the STS-107 accident investigation, it was found through dissection of existing bipod ramps that hand-spraying over such a complex geometry was prone to produce internal voids and defects in the foam10. Since it was impossible to predict exactly how the bipod ramps were jettisoning from the vehicle during launch, it was decided to conduct analysis and wind tunnel tests to determine if the bipod ramp was necessary to prevent overheating of the fittings and adverse flow over the vehicle. Once it was decided the ramps were not necessary, they were removed from the ET for all remaining shuttle flights. Our crew would now be assured we would not face a bipod debris threat during our launch. Now, if the PRA community had predicted the probability of a bipod ramp hitting our vehicle to be 0% after it was removed from the vehicle, I would have believed them. That would have been the only calculation from that team, at that time, I would have agreed with.
Figure 8. – Shuttle External Tank (ET) bipod foam ramp.
Figure 9. – Process steps for building the shuttle ET bipod ramp.
During crew training, we would typically visit many of the shuttle manufacturing and processing facilities around the country to meet the teams working so hard to ensure our flight would be safe. It was during a trip to the Michoud facility in Louisiana, where they manufactured the ET, that we were
able to get an up-close look at the ET and, while watching the teams process and inspect the tanks, noticed the protuberance aerodynamic load (PAL) ramp. The PAL ramp was another large foam section, originally designed to protect the cable trays of the ET liquid oxygen (LO2) and liquid hydrogen (LH2) tanks from high-velocity crossflows. I remember our crew requesting that the SSP remove the PAL ramp prior to our launch, just as it had removed the bipod foam ramp. We were told this was not possible because it would take too long to certify by analysis and wind tunnel testing.

During the flight readiness review (FRR) for our mission, the SSP presented the PRA team's result that the calculated probability of PAL foam release causing catastrophic damage to the vehicle was less than 1/10,000; it was therefore determined that it was not necessary to remove the PAL ramp. This decision was made even though the team investigating the causes of foam release had stated that the primary cause of ET foam release was thermal stress in very thick foam sections such as the PAL ramp, and even though dissection of sections of an ET had shown serious bond-line delaminations exactly where the largest thermal stresses were being predicted.

On July 26, 2005, approximately 127 seconds after the launch of STS-114 and shortly after Solid Rocket Booster (SRB) separation, the highly improbable happened; a very large piece of the ET PAL ramp foam broke away from the tank, entered the airstream, and flew under our starboard wing (see fig. 10). If the PAL debris had struck our starboard wing, it would have caused more severe damage than Columbia had encountered. We would not have been able to repair the wing leading edge on orbit with enough certainty that a returning crew would be safe. Hence, we would have had to perform the very first automated, uncrewed undocking from the ISS and landing of a Shuttle Orbiter, or accept its destruction during entry over the Atlantic Ocean. Our crew would then have had to use the ISS as a safe haven and hope that a rescue vehicle would save us. This rescue probably would not have been an option, however, since the new head of the Shuttle Program, Bill Parsons, had grounded the shuttle fleet right after launch when it was learned the PAL ramp had dislodged. My wife found out there would be no rescue mission when she heard the news on television that evening. She was never notified by NASA. Luckily, the PAL ramp missed our vehicle, and the survey of our vehicle using the shuttle robotic arm (the space shuttle remote manipulator system (SSRMS)) and the newly developed orbiter boom sensor system (OBSS), conducted when we woke up on our first day on orbit, indicated we had experienced no debris damage to the vehicle.
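The thermal-stress mechanism the foam team cited can be illustrated with back-of-the-envelope numbers. The sketch below uses a standard constrained-thermal-stress formula; every material property is an assumed, order-of-magnitude value for a rigid spray-on foam, not measured ET foam data:

```python
# Rough thermal stress estimate at a thick foam section's bond line.
# All values are illustrative assumptions, not ET foam specifications.
E = 20e6          # foam elastic modulus, Pa (~20 MPa, assumed)
alpha = 7e-5      # coefficient of thermal expansion, 1/K (assumed)
nu = 0.3          # Poisson's ratio (assumed)
T_ambient = 300.0 # K: foam applied near room temperature
T_LH2 = 20.0      # K: liquid hydrogen tank wall at cryogenic fill

dT = T_ambient - T_LH2
# Biaxially constrained thermal stress: sigma = E * alpha * dT / (1 - nu)
sigma = E * alpha * dT / (1.0 - nu)
print(f"constrained thermal stress ~{sigma / 1e6:.2f} MPa")

strength = 0.4e6  # assumed foam tensile strength, Pa (~0.4 MPa)
print(f"assumed tensile strength    {strength / 1e6:.2f} MPa")
```

With these rough numbers the thermal stress is on the order of, or exceeds, the foam's own strength, which is why any void or bond-line flaw in a thick, highly constrained section like the PAL ramp is a plausible crack starter at cryogenic fill.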
Figure 10. – Large piece of ET protuberance aerodynamic load (PAL) ramp, which was liberated approximately 127 seconds after liftoff of STS-114.
It became apparent that two years of effort to understand the causes of ET foam losses from critical areas of the ET, such as the large acreage areas, ice/frost ramps, and intertank-to-tank flanges, had not been sufficient to quantify the root cause of the foam liberation or to prevent large foam losses from the ET, as shown in figures 11 and 12 and listed in Table 1.
Figure 11. – Large pieces of the ET protuberance aerodynamic load (PAL) ramp and additional releases post-launch of STS-114.
Figure 12. – ET foam debris loss areas following the launch of STS-114.
Table 1. – ET Foam Losses During STS-114
Calculated probabilities for the actual ET foam debris that broke loose during launch and that could have exceeded the allowable damage thresholds are shown in Table 2. It is of interest to note that the a priori calculation of the probability of PAL ramp debris causing critical damage was less than 1/10,000! This calculation, coupled with other programmatic concerns at the FRR for STS-114, resulted in a decision to launch without removing the PAL ramp foam. It should also be noted that
following the return of STS-114, the calculated probability of a critical impact of PAL ramp foam, given the known time of release and mass, was determined to be 1:26. That is correct; we had had a 1 in 26 chance of incurring a strike and probably critical damage. Needless to say, it was decided to remove the PAL ramp prior to the next shuttle launch, STS-121. To this author's knowledge, no one has ever gone back to recalculate and/or explain how such an improbable event occurred or why the PRA analysis was so inaccurate. Major changes to the way probabilities were being calculated did not occur, and PRA continued to be used to assess risk for the next shuttle launch, STS-121, and successive flights thereafter.
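The jump from "less than 1 in 10,000" to "1 in 26" is conditional probability at work: the a priori number multiplied together the poorly known chances of release, of a critical trajectory, and of critical damage, while the post-flight number conditioned on a release that had already happened, with a known mass and time. A minimal sketch with invented factor values (they are chosen only to make the arithmetic visible, not to reproduce the SSP's actual inputs):

```python
# Illustrative only: these factor values are invented, not the SSP's.
p_release = 0.001            # a priori chance a large PAL piece comes off
p_hit_given_release = 0.04   # debris transported onto the Orbiter
p_critical_given_hit = 0.9   # a piece this large likely causes critical damage

prior = p_release * p_hit_given_release * p_critical_given_hit
print(f"a priori risk:    1 in {1 / prior:,.0f}")

# After the flight, the release is a fact (probability 1), and the known
# mass and release time pin down the transport odds:
posterior = 1.0 * p_hit_given_release * p_critical_given_hit
print(f"conditioned risk: 1 in {1 / posterior:,.0f}")
```

The a priori number looked comforting only because the release probability, the factor NASA understood least, was set very low; once the release actually occurred, that optimistic factor dropped out of the product entirely.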
Table 2 – Calculated Conditional Probabilities of Impacts Exceeding TPS Damage Thresholds Based on Time of Release, Mass, etc.
Second Near Miss – STS-114 Gap Fillers

Days before the launch of STS-114, our crew was on the launch pad checking out our ride with the new NASA Administrator, Mike Griffin. I knew Mike because I had worked on a team he led in 1993, which investigated Single-Stage-to-Orbit (SSTO) options for safe, inexpensive, and reliable access to space. I really liked Mike; I thought he was a brilliant, no-nonsense leader who could read people and would not stand for any BS. He took me aside from the other members of the crew and told me that when I returned from space, he wanted me to be the Director of Engineering at JSC. I was caught by surprise and speechless. I told him I really wanted to return to pursue research at Langley and that I could easily travel to DC any time he needed me, but he pressed me further. He said he really needed someone like me to make JSC technically excellent and to fix the culture. I knew how badly JSC was broken and how powerful the program
offices were, and I knew the fights I would be having, so I told him, "Mike, I don't mind going into the ring with one hand tied behind my back, but if you tie both hands, it's a no." He told me he would support me, and I naively accepted the challenge.

Our crew knew how bad the culture was at JSC post-Columbia; several of us had had run-ins with program managers while working side by side with them during the RTF period. At one point, we had a crew meeting to discuss whether we should request a changeout of our lead flight director, Paul Hill. Flight Directors and Shuttle Program managers were also tired of the confrontations they were having with me and, on more than one occasion, asked upper management to have me taken off the flight. If it weren't for the courage of my commander, Eileen Collins, I would have been removed. Many years later, on a podcast with Eileen called "Leading Edge Discovery" (produced by ITSP Magazine), I asked her why she stood up for me like that. Eileen said, "Charlie, my crew can say whatever they want whenever they want. We just killed seven crewmembers on Columbia, and if we are going to tell people they cannot speak up, we still have a problem." Wow, Eileen may very well have been one of the very few people at NASA at that time who truly understood the meaning of psychological safety!

The environment at JSC was so toxic that our crew had an agreement prior to launch: at the end of every day, we would talk privately with the head of the Astronaut Office, Rommel, and no one else. This was unprecedented; Flight Directors were gods in Mission Control and were privy to all communication with the crew except for private family and medical communications. I carried a contact list of key researchers in the U.S. (what I called my friends-of-Charlie (FoC) network) in my crew notebook, which I could use to bypass Mission Control and speak directly with researchers I trusted if there were any issues during our mission.

It was during our approach to the ISS, when we were 600 feet below it and Eileen commanded a backflip of the Orbiter (called the R-bar Pitch Maneuver (RPM) in NASA speak) so the crew onboard the ISS could photograph and video our vehicle for TPS damage, that we saw two small pieces of felt called gap fillers sticking up between the protective tiles near the Orbiter's nose (fig. 13). We were totally caught off guard because we had never seen this on prior shuttle missions. This protuberance could potentially disrupt airflow during our entry back to Earth and lead to catastrophe. People on the ground debated doing a spacewalk to fix it. My aerothermal research friends at Langley, Tom Horvath and Scott Berry, were analyzing whether this could be a problem and had sent up a short white paper with their analysis and their concern that this protuberance
could trip the boundary layer (the thin layer of flow close to the solid surface of the vehicle) and cause excessive heating. From space, I called Tom to determine the criticality of the gap-filler problem and to hear his recommendations directly. I called him from our computers on the ISS, which were patched directly to Tom's cell phone on Earth, bypassing MCC. His team's expertise and perseverance forced the informed decision for two of our crew to perform an unplanned emergency spacewalk. This critical decision was later proven to have saved the lives of our crew. Thanks to our connection to the researchers with the correct knowledge, and a commander who respected and listened to the expertise of her crew, JSC made the right decision in this instance. The cause of this "near miss," ET foam loss during launch, was still unresolved, yet NASA would continue to misdiagnose the criticality of the problem, which would eventually lead to yet another near miss!
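For readers who want a feel for why roughly an inch of protruding felt matters at hypersonic speeds, a common screening approach compares the protrusion height to the local laminar boundary-layer thickness. The sketch below is a generic, textbook-style estimate with assumed flow properties; it is emphatically not the method or the numbers the Langley team actually used for STS-114:

```python
import math

# Generic roughness screening: compare protrusion height k to the local
# laminar boundary-layer thickness delta. All values are assumptions.
k = 0.028      # gap-filler protrusion height, m (~1.1 in, assumed)
x = 8.0        # distance from the nose along the surface, m (assumed)
u_e = 3000.0   # boundary-layer edge velocity, m/s (assumed)
rho_e = 0.02   # edge density, kg/m^3 (assumed, high altitude)
mu_e = 2.5e-5  # edge viscosity, kg/(m*s) (assumed, hot boundary layer)

Re_x = rho_e * u_e * x / mu_e
delta = 5.0 * x / math.sqrt(Re_x)   # flat-plate laminar estimate
print(f"Re_x ~ {Re_x:.2e}, delta ~ {delta * 100:.1f} cm, k/delta ~ {k / delta:.1f}")
# Rules of thumb treat k/delta approaching O(1) as a likely trip; here the
# protrusion is several times the estimated boundary-layer thickness.
```

An early, asymmetric trip means turbulent heating, several times the laminar level, arriving far sooner in entry than the TPS was certified for, which is why removing the gap fillers by spacewalk was worth the risk.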
Figure 13. – Most forward of two protruding gap fillers on STS-114 as photographed from ISS during R-Bar Pitch Maneuver (RPM).
Third Near Miss – STS-121 Ice Frost Ramps (IFRs) and My Firing as Director of Engineering at JSC

When we returned from space and completed all the obligatory media events, I began the process of applying for the Director of Engineering job at JSC. It was pretty easy, since the administrator wanted it and I received plenty of help from the folks at JSC Human Resources (HR). I would be replacing Frank Benz, the Director of Engineering at JSC during the Columbia accident, who was also slated to receive a soft landing after the disaster as Director of JSC's White Sands lab in El Paso, Texas. Griffin selected me to lead engineering
at JSC even before he selected a new center director for JSC, Mike Coats. In fact, he told me the going-in position for the new center director would be his or her acceptance of me as the Director of Engineering. This was also unconventional and put me in a very uncomfortable position and working relationship with Coats. I had several confrontations with the newly selected center director, ex-astronaut and Shuttle commander Mike Coats. I felt he did not respect my engineering leadership and direction and was listening to others for insight. I really did not have any time to try to win his respect. Mike Griffin had given me a very tough job, and I spent my time building an environment in the Engineering Directorate at JSC that was psychologically safe and that understood the need to reach out for technical expertise elsewhere within NASA when my engineers needed help. My first responsibility was to ensure every successive shuttle mission was safe and that we would not have another tragedy.

Unfortunately, it appeared to me that Mike Griffin was laser-focused on his new Constellation Program, which would build the next rocket system to replace the soon-to-be-retired Space Shuttle and propel space exploration back to the Moon, Mars, and deep space. This program was the agency's response to President Bush's "Vision for Space Exploration," which he announced on January 14, 2004, following the Columbia tragedy, as a way to regain public enthusiasm for space. Exactly as Diane Vaughan predicted, the external forces affecting NASA would continue to apply the production pressure which had led to prior disasters. In fact, on several occasions, I was told by Mike that I needed to spend more time on the Constellation Program and less time working shuttle. One time, I had to tell him point blank that I had people flying on a real space vehicle in a matter of weeks while his Ares vehicle was still on paper. I knew Mike did not like to hear that, because the Ares I rocket was the concept his team had selected after only a 60-day study in DC called the Exploration Systems Architecture Study. The Constellation Program would suffer many technical design problems and weak leadership, which led to cost overruns and schedule delays, and it would be cancelled in 2011.

Mike believed that the Flight Director he selected to lead the Constellation Program, Jeff Hanley, had the expertise to lead what should have been considered a technology development program. When I tried to offer assistance to Jeff by proposing one of my highly talented FoCs from LaRC, Dr. Stephen Scotti, as his chief engineer, I was rebuffed by Jeff, who went directly to Mike to complain, which got me in further trouble with my boss at HQ. It was pitiful. Months later, Mike would assign a chief engineer from JPL, Brian Muirhead, to help Jeff and ensure the success of the Constellation Program.
Charles Camarda
Our JSC engineering team conducted a total review of the next shuttle vehicle to launch, STS-121, on Space Shuttle Discovery, the same vehicle I had flown into space. It was to launch almost one year after STS-114 because time was needed to ensure the removal of the PAL ramp would not cause any problems. During our review with Center Director Michael Coats, we identified all the technical issues and concerns. The meeting lasted only an hour and a half; as you can imagine, we did not have enough time to dive into the technical details, and it was obvious, in my mind, that Mike was not receptive to hearing such details. During that briefing, Mike Coats was informed that JSC Engineering felt we should remove all IFRs prior to flight because we rated the seriousness of an IFR event as probable and catastrophic. This designation meant that the likelihood of an IFR becoming dislodged during launch was probable and that it would cause catastrophic damage! The statement in Section 6 of the JSC Engineering "Position on the Debris Risk from the Existing Configuration of the External Tank (ET) Ice Frost Ramps" read, "JSC Engineering, as the Independent Technical Authority (ITA) for the Integration Safety Engineering Review Panel (ISERP), is insisting that the debris risk due to IFR be classified as Probable/Catastrophic." [14] We were advising Coats that the JSC Engineering position was that we should not fly as-is! The SSP was slowly realizing that the only sure way to reduce the ET foam debris threat was to eliminate as many of the large sections of thick foam as possible. With the bipod and PAL ramp foam sections already removed, two independent risk assessments of the remaining foam, prepared for the STS-121 FRR, predicted the probability of a debris impact threat from the next critical sections of ET foam, the ice frost ramps (IFRs), to be between 1:75 and 1:110. Yes, that is correct; the SSP still relied on PRA analysis for flight rationale even though it had been proven totally inadequate for the PAL ramp predictions on STS-114. I tried to meet with Mike Coats prior to the FRR for STS-121; however, he did not have time to see me that evening. It was the JSC Engineering Directorate's position that we should not fly STS-121, and Mike Coats was unavailable to hear the details. It was apparent from these conditional probability calculations, which were now based on past instances and histories of IFR foam losses, that the likelihood of ET foam loss from these areas was "probable" and the severity would be "catastrophic" (as shown in the hazard severity and likelihood chart in Table 3). There were two hazards documented in the "probable-catastrophic" box (the upper right-hand corner, in red, of Table 3) at the FRR for STS-121.
Table 3 – Hazard Likelihood and Severity Matrix Presented to Flight Readiness Review Prior to Launch of STS-121
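To make the numbers above concrete, here is a minimal sketch, in Python, of how a per-flight probability estimate such as the 1:75 to 1:110 range lands in the "probable" row of a likelihood-severity matrix. The bin thresholds and function names are my own illustrative assumptions, not the official SSP likelihood definitions used at the FRR.

```python
# Sketch: mapping a per-flight event probability onto a hazard
# likelihood/severity matrix. Bin floors below are assumed for
# illustration; they are not the official NASA/SSP definitions.

LIKELIHOOD_BINS = [          # (bin name, minimum per-flight probability)
    ("probable",   5e-3),
    ("infrequent", 1e-4),
    ("remote",     1e-6),
    ("improbable", 0.0),
]

def likelihood(p_per_flight: float) -> str:
    """Return the first bin whose floor the probability meets or exceeds."""
    for name, floor in LIKELIHOOD_BINS:
        if p_per_flight >= floor:
            return name
    return "improbable"

def classify(p_per_flight: float, severity: str) -> str:
    return f"{likelihood(p_per_flight)}/{severity}"

# The two independent assessments at the STS-121 FRR put the IFR debris
# threat between 1:75 and 1:110 per flight:
for odds in (75, 110):
    p = 1.0 / odds
    print(f"1:{odds} -> p = {p:.4f} -> {classify(p, 'catastrophic')}")
```

With any reasonable choice of bins, odds of 1:75 to 1:110 per flight sit squarely in the top row of the matrix, which is why both hazards plotted in the red "probable-catastrophic" corner of Table 3.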
When it was time to sign the Certificate of Flight Readiness (CoFR), the leader of the FRR went around the table and asked each signee if they thought we were ready to fly. When it was Mike Coats' turn, he responded, "JSC is go for launch." I could not believe it. I was sitting in the audience, and my engineers behind me were squirming in their seats. Our center director had refused to listen to his director of engineering. Immediately after Coats' reply, I rose to the podium and said, "I'm Charlie Camarda, the Director of Engineering at JSC, and I believe we should not fly. The ice frost ramps should be removed." And I proceeded to explain why. This was a massive breach of protocol: I spoke out of turn and publicly undermined my supervisor. You could have heard a pin drop in the room. If you could have seen the look on Mike Coats' face as I stood at the podium, you could have easily guessed what was going to happen to me (fig. 14).
Figure 14. – Flight Readiness Review for STS-121, June 6, 2006, at Kennedy Space Center.
Right after I stood up, the head of Safety and Mission Assurance (S&MA), Bryan O'Connor, and then-NASA Chief Engineer, Chris Scolese, both stood up, supported me, and recommended not to fly as-is but to remove the IFRs prior to the launch of STS-121. In addition, even SSP Program Manager Wayne Hale insisted that the IFR hazards remain "probable-catastrophic" in the risk matrix amid many recommendations to change the designation! The decision to launch and accept the risk of another potential accident was made by the administrator, Mike Griffin, against the recommendations of all three technical and safety leaders in the shuttle decision tree. Mike stood up and stated he would "accept the risk," and we would fly STS-121 as-is. To this author's knowledge, this was the first time in the history of the Shuttle Program that we had ever flown the vehicle with a hazard in the red upper right-hand corner of the hazard probability-severity matrix! So much for the independent technical authority of the newly formed NASA Engineering and Safety Center (NESC). Both Bryan and Chris relented; they decided not to appeal Griffin's decision and signed the CoFR, but not without first crossing out the words "I concur with proceeding with the mission" and adding the caveats noted in figure 15. Chris' decision not to appeal was based on the ISS providing a "safe haven" (CSCS, Contingency Shuttle Crew Support) in the event of a critical debris strike, even though this option for redundancy had been proven ineffective when the Shuttle Program grounded the fleet after the loss of the PAL ramp on my flight. Bryan's reason was that his administrator had formally accepted the risk. It was very apparent that there were no teeth in NASA's two safety organizations, S&MA and the NESC, and no respect for the opinions of its technical leaders.
Figure 15. – Signatures of the Head of Safety and Mission Assurance Bryan O’Connor and Chief Engineer Chris Scolese on the CoFR for STS-121 with caveats.
Three days after the FRR, I was fired from my position as the director of engineering at JSC by Mike Coats. Mike Griffin had a golden opportunity to support me and make a strong statement to the NASA team that psychological safety was critical to ensuring safety, but he chose not to. Instead, Griffin accepted Coats' reasoning for firing me. These actions sent a clear message to all engineers at JSC and throughout NASA: if you speak up and dissent, your career is at risk. After all my efforts to lead by example and develop a safe environment, the message my team heard destroyed that work. Instead, it is my belief that Mike Griffin and Mike Coats ensured that a "whack-a-mole" environment would persist at NASA, the opposite of what renowned author and researcher of psychological safety Amy Edmondson would call a "Fearless Organization." [15] I could not believe Coats would fire me a week before the launch of STS-121. I pleaded with him to wait until after the launch because of the anxiety this would cause for the crew and their families, but his hubris wouldn't let him. A shuttle commander who knew well the emotions the crew and their spouses would be feeling would not relent. The wife of one crewmember and a close friend of another called me at home for some explanation. They were very worried for the safety of their loved ones.
The day prior to launch, Chris Scolese called me at home at Mike Griffin's request and asked me if I would watch the launch from Mission Control to view the video downlink of the IFRs to see if there was a problem. I could not believe Mike could not ask me himself after he allowed me to be fired. Being a loyal NASA employee and doing everything I could to ensure the astronauts would be safe, I said I would. I watched the video downlink from the ET cameras in real time with John Muratore, the SSP Chief Engineer, and witnessed a large section of acreage-area ET foam adjacent to an ice frost ramp (IFR) break away during launch and fly over the wing leading edge. Photographs of the ET taken by the crew on orbit during ET separation are shown in figure 16. Such a hazard was one of the two events that had been determined to be "probable-catastrophic." As the downlinked video footage was reviewed post-launch, it appeared that several IFRs had been liberated into the airstream and transported directly past the starboard wing leading edge. The debris seemed to pass both under and over the wing leading edge, and it remained a considerable concern for the Mission Management Team throughout the evening, until the crew completed the flight day 2 inspections of the leading edge. I remember Bob Cabana approached me at the MMT meeting with a frightened look on his face and asked me what I thought. "Did the foam hit the wing leading edges? Could it have caused damage?" I thought to myself, is he worried about the crew, or is he more concerned about his reputation if, God forbid, we have another tragedy? Either way, I was pissed; we could have avoided this. I looked him straight in the eye and told him, "I DO NOT KNOW; it was impossible to tell from the video footage." We would have to wait for the STS-121 crew to complete their full day of robotic inspection to determine their fate! Once the ET photos were downlinked, it was apparent that only one large piece of acreage-area foam next to an IFR had been liberated and that it broke up in the flow into several pieces. Later that day, when the crew completed their inspections, the downlinked photo and laser data from the Orbiter Boom Sensor System (OBSS) indicated that the leading edge had received no visible damage.
Figure 16. – Ice Frost Ramp (IFR) failure during launch of STS-121 as photographed by a crewmember post-ET separation.
When the "all clear" came down by radio from the ISS crew, I watched shuttle program managers, flight directors, and senior management literally high-fiving one another—a sight that has remained seared into my brain. My colleagues had barely escaped yet another near miss with disaster, and the program and Mission Control teams were celebrating when, in my view, what was called for was a somber reckoning of what could have happened. As you will learn in Chapter 6, a major attribute of an HRO is to treat near misses as if they are failures. After STS-121, we again put shuttle missions on hold until we could determine whether it was safe to fly without the most recent piece of foam that came off, the IFRs. Even then, NASA still clung to its production culture and failed to lead with research and analysis into why the foam was shedding. They removed IFRs, just as I had recommended at the FRR, after the recent near miss and kept flying, knowing there were many more large pieces of ET foam remaining that could cause a problem. It was during this dangerous near-repetition of the same mistake that I realized something was deeply rotten at NASA, and the culture would not change without a massive effort. NASA refused to address the social and cultural issues that caused the Challenger and Columbia accidents and continued to use the same bad habits for constructing risk, creating flight rationale, and discounting dissenting voices, narrowly missing a tragedy on the next two missions. Just how far NASA had fallen will be highlighted by the behaviors and actions it took next to dismiss a critical anomaly I found on the RCC wing leading edge of my vehicle, Space Shuttle Discovery, months after our return from space following our mission, STS-114.

References

1. Vaughan, Diane: "The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA." The University of Chicago Press, 1996.
2. Cabbage, Michael and Harwood, William: "Comm Check: The Final Flight of the Shuttle Columbia." Free Press, 2004.
3. McChrystal, General Stanley: "Team of Teams: New Rules of Engagement for a Complex World." Penguin Publishing Group, 2015.
4. Presidential Commission on the Space Shuttle Challenger Accident (1986): Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident. 5 Vols. Washington, DC: Government Printing Office, June 6, 1986.
5. Vaughan, Diane: "System Effects: On Slippery Slopes, Repeating Negative Patterns, and Learning from Mistake?" in William H. Starbuck and Moshe Farjoun, eds., "Organization at the Limit: Lessons from the Columbia Disaster." Oxford, U.K.: Blackwell, 2005: 41-59.
6. Gehman, H. W., et al.: "Columbia Accident Investigation Board Report, Volume 1." U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
7. Vaughan, Diane: "NASA Revisited: Theory, Analogy, and Public Sociology." American Journal of Sociology, Vol. 112, No. 2, September 2006.
8. Final Report of the Return to Flight Task Group. July 2005.
9. Cooper, Paul A. and Holloway, Paul F.: "The Shuttle Tile Story." Astronautics & Aeronautics, Vol. 19, No. 1, Jan. 1981, pp. 24-34, 36.
10. Camarda, Charles J.: "Space Shuttle Design and Lessons Learned." NATO Science and Technology Lecture Series on Hypersonic Flight Testing, STO-AVT-234-VKI, March 24-27, 2014, von Karman Institute, Rhode-St-Genèse, Belgium.
11. Camarda, Charles J.: "Failure is Not an Option... It's a Requirement." Presented at the 50th AIAA/ASME/AHS/ASC Structures, Structural Dynamics, and Materials Conference. AIAA Paper 2009-2255.
12. Camarda, Charles J., Throckmorton, David A., and Miller, Glenn: "Impact Modeling and Test." Presented to Orbiter Vehicle Engineering (OVE), March 11, 2003.
13. Edmondson, Amy C.; Roberto, Michael A.; Bohmer, Richard M. J.; Ferlins, Erika M.; and Feldman, Laura R.: "The Recovery Window: Organizational Learning Following Ambiguous Threats." in William H. Starbuck and Moshe Farjoun, eds., "Organization at the Limit: Lessons from the Columbia Disaster." Oxford, U.K.: Blackwell, 2005: 220-245.
14. Evans, Carol; Muratore, John; Gomez, Ray; Jarrell, George; and Shack, Paul: "JSC Engineering Position on the Debris Risk from the Existing Configuration of the External Tank (ET) Ice Frost Ramps." June 14, 2006. TEO-5.
15. Edmondson, Amy C.: "The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth." John Wiley & Sons, 2019.
Chapter 4
When All Else Fails “The ring ain’t the thing; it’s the play. So gimme a stage where this bull here can rage. And though I can fight, I’d much rather recite. That’s entertainment!” “Raging Bull,” film by Martin Scorsese
After I was fired as Director of Engineering at JSC by my center director, Mike Coats, days before the launch of STS-121, for standing up at the FRR and stating that I disagreed with his decision that JSC was "go for launch," I penned the following message to my 1,200 employees:

Subject: My Parting Words to the Organization I am Honored to Have Had the Opportunity to Lead

Team, I want you all to know how proud I am of the efforts you have made to ensure a safe return to flight for STS-121. I am most proud of the way I can count on you to do and say the right thing and stand up and be counted. I have witnessed it daily as your director, and I know this firsthand after serving as a crewmember on STS-114. My wife and family were confident knowing my safety was in your hands (she had a speed dial list with all the key engineers' phone numbers). I was most proud at all the PRCBs and at the recent FRR when you stood up and presented your dissenting opinions and your exceptions/constraints for flight. I believe we have come a long way in a very short time, and I truly believe you will become the jewel in the exploration crown for this agency.

I cannot accept the methods I believe are being used by this center to select future leaders. I have always based my decisions on facts, data, and good solid analysis. I cannot be a party to rumor, innuendo, gossip, and/or manipulation to make or break someone's career and/or good name. I refused to abandon my position on the MMT and asked that, if I would not be allowed to work this mission, I would have to be fired from my position, and I was. I am truly sorry I will not be there with my team after all our hard work. I will be there in spirit, and I am only a phone call away if you need me. We have much to do to prepare to support this mission, and I am sure you will be the ultimate professionals that I know you are and exemplify the spirit of "teamwork" which will be needed to get the job done. Please do not let this affect your focus at this crucial time! I have been offered a position and will continue to support this agency, which I love, and be a good team member.

Thank you, and God bless you all, Charlie

Please forward this note to all EA employees.

Below is an excerpt posted by Keith Cowing, founder of NASA Watch, a NASA/government watchdog website focused on NASA and spaceflight, dated June 26, 2006. Reader's note from Keith Cowing of NASA Watch: "What kind of leadership gets rid of the Director of Engineering days before a launch? In particular a person that has vast proven practical knowledge of the technical issues that the next shuttle flight will have to deal with? Couldn't such leadership wait until STS-121 wheels stop?? Typical JSC strategy, either you are with them or against them. People get railroaded, become a pariah, and everyone is afraid of being next. At least Charlie refused to be part of such culture, and now he pays the price. This is a dark day at NASA."

After clearing out my office on the 9th floor of the JSC HQ building, Building 1, I moved into my new "office" on the first floor, a small utility room with no windows, a refrigerator, a copy machine, and a fax machine. I was reassigned as Ralph Roe's deputy in the newly created NASA Engineering and Safety Center (NESC), the organization I had helped Ralph create when he first started to
select employees. I had asked to help Ralph select recruits for the NESC to ensure he would select the right people, as opposed to some of his buddies in the program world who had little technical and very little research expertise. I was only partially successful, because many of the top researchers were not attracted to an organization that was not led by someone they respected as a researcher, one who could lead and support the research necessary to prevent future failures. My initial, albeit naïve, reaction was to view this reassignment as a positive. I thought it would give me an opportunity to bring about real change to the NASA safety program. I quickly discovered that it was, in reality, a punishment and a way to silence me and cancel my voice while at the same time appearing, through the news media, to value my commitment to safety. In the years that followed, I would find that most of my recommendations to Ralph fell on deaf ears. In my opinion, Ralph's job was to keep me in line and make sure I was silenced and remained a pariah at JSC and NASA. I spent the next two years fighting to investigate a potentially life-threatening anomaly that was found on one of the critical RCC wing leading edge panels after my mission, STS-114. I was foiled at every turn. I sent hundreds of emails to dozens of subject-matter experts and leaders in the NESC and to Space Shuttle Program (SSP) managers. I also emailed Chris Scolese (Chief Engineer), Bryan O'Connor (Head of S&MA), and Mike Griffin (NASA Administrator). Many of the emails had enclosures and references documenting my technical and cultural analyses and my differences with NESC, SSP, and LESS-PRT logic and decisions. Many went unanswered, and those who did respond usually refused to provide the information I needed. I was prevented from speaking with the technician who discovered the anomaly, and I was removed from the email list for the team in charge of the wing leading edges. Take a second for this to sink in: the deputy of NASA's newly formed engineering and safety organization, the NESC, who just happened to be an astronaut who flew on the return-to-flight mission and who, by the way, had spent twenty-two years as a researcher in this exact technical area, was not granted access to information that directly pertained to the space shuttle's safety. On two separate occasions, my colleagues received direct orders not to assist me. It got so bad that, in an act of desperation, I filed a Freedom of Information Act (FOIA) request on April 1, 2008, to access the records I needed to prove the anomaly was critical and systemic to RCC panels on all vehicles. Of course, this proved pointless due to delays by the HQ machinery, and when I did receive a response, it left out all the relevant technical information I had requested and sent me in an infinite do-loop back to HQ and the people who were blocking me
from getting the data to start with. I will use the anomaly of the starboard RCC wing panel 8R, found on our vehicle, STS-114 Space Shuttle Discovery, after landing, as a case study of how a dysfunctional culture is hard to correct and persists even after being identified as the cause of a recent tragedy. I saved hundreds of emails, audio recordings, technical papers, program presentations, and personal logs of events to validate the veracity of this reporting and to develop timelines that add context to the urgency of specific events. I will highlight a persistent and dysfunctional work group culture and its flawed construction of risk policies; the production culture and all its inherent pressures to launch; a weak safety culture; and the inability of the newly created NESC to understand the technical issues, along with its tendency to align itself with shuttle program decisions as opposed to being an objective and independent technical authority. I will also use this case study to demonstrate simple changes which I believe could have been made, and which I will elaborate on in the second half of the book, to show how to correct a failing organization and prevent such accidents.

Reinforced Carbon-Carbon (RCC) Material and Wing Leading Edges

To understand the criticality of the RCC wing panel 8R anomaly we are about to discuss, and the enormity of some of the decisions that were made regarding the safety of the crews that continued to fly the space shuttle after the anomaly was known, the reader needs some background on the RCC material system, the structural components which comprise the wing leading edges, the location of the anomaly on the wings, and the working group responsible for the RCC wing leading edges, the Leading-Edge Structural Subsystem-Problem Resolution Team (LESS-PRT).

Reinforced Carbon-Carbon (RCC) Material:

Reinforced Carbon-Carbon (RCC) is not a material, per se, as much as it is a "material system" or "system of materials." RCC was developed for use in ballistic missile nosecones in the '60s and '70s; however, unless it has sufficient oxidation protection, it will ablate, or erode by oxidation and fluid flow, during atmospheric entry. Hence, developing and certifying this material for extended reuse was always a development challenge for the Shuttle Program. RCC is a composite made by curing graphite fabric that has been pre-impregnated with phenolic resin and laid up in molds to a desired shape [1]. The polymeric resin is converted to carbon by pyrolysis, a chemical change brought about by the addition of heat. The part is then densified to increase its mechanical properties
by repeated cycles of infiltration with furfuryl alcohol and heating. This process can take months and is very costly. A thin silicon carbide (SiC) oxidation protection outer layer for the carbon-carbon (C-C) is formed by a diffusion coating process that converts the outer three plies of the C-C part to SiC (fig. 1). The very thin (0.04 in. (0.1 cm)) protective SiC layer, less than four times the thickness of an eggshell, has a different coefficient of thermal expansion (CTE) than the underlying C-C substrate; hence, thin cracks form through the SiC layer during processing cooldown, as shown on the right in figure 2. It is this thin, brittle oxidation protection layer that keeps the carbon substrate from rapidly burning up like a piece of charcoal during Earth entry! The thin cracks could serve as pathways for oxygen to penetrate the subsurface material, oxidize it, and cause voids beneath the fragile coating [2,3] (left side of fig. 2). These voids beneath the surface of the SiC coating are critical because they are not visible on the surface, and they could jeopardize the integrity of the RCC coating, cause chips of coating to break loose and expose the carbon-carbon structure, and lead to oxidation, burn-through of the RCC panels, and hot-gas ingress during Earth entry, which could result in a catastrophe (similar to that of STS-107). To help prevent this problem, tetraethyl orthosilicate (TEOS) is applied via vacuum impregnation to fill remaining porosity. A Type-A sodium silicate glass sealant is then applied to the outer surface to fill the craze cracks in the SiC coating and impede oxygen penetration to the C-C substrate. The Type-A sealant fills the craze cracks and, after curing, forms a glass outer coating. The craze cracks close during entry heating, and since the glass is viscous at those temperatures, some of the glass flows onto the outer surface and is driven away by shear flow forces. Because of the unknown reusability of this material, a rigorous inspection and test program was instituted to ensure subsurface mass losses and potential coating integrity issues would not cause a catastrophic failure. However, the devil was in the details: the selection of a relatively new technology without sufficient analysis and testing to verify its performance required a standing army of people on the ground and caused shuttle operations costs to skyrocket, as the ceramic tile TPS had. Also, this material was not understood well enough as a system (i.e., its coupled aerodynamic, thermal, structural, and material behavior) to identify and predict key critical failure mechanisms prior to design selection and use; for example, there was no impact testing to simulate the severity of a debris strike and to understand coating breach or loss and possible burn-through during Earth entry. Hence, the working group responsible for the RCC wing leading edge, the Leading-Edge Structural Subsystem-Problem Resolution Team (LESS-PRT), led by Mike Gordon (Boeing) and Don Curry (NASA JSC),
would not have had the knowledge necessary to predict the integrity of their system if it were impacted by debris prior to Earth entry, as was the case for STS-107.
Figure 1. – Reinforced Carbon-Carbon (RCC) material system. [1]
Figure 2. – Reinforced Carbon-Carbon (RCC) material craze cracks and voids. [1,2]
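The craze cracking described above follows directly from the CTE mismatch. Here is a back-of-the-envelope sketch; every property value in it is an assumed, illustrative number, not data from this book or from RCC qualification records.

```python
# Back-of-the-envelope estimate of the biaxial tension locked into a thin,
# brittle coating whose CTE differs from its substrate's. All values are
# assumed, illustrative numbers.

E_sic   = 400e9    # Young's modulus of SiC, Pa (assumed)
nu_sic  = 0.17     # Poisson's ratio of SiC (assumed)
cte_sic = 4.5e-6   # CTE of SiC, 1/degC (assumed)
cte_cc  = 1.0e-6   # in-plane CTE of the C-C substrate, 1/degC (assumed)
dT      = 1500.0   # cooldown from the conversion-coating temperature, degC (assumed)

# On cooldown the coating wants to shrink more than the substrate allows;
# the substrate holds it stretched by the mismatch strain:
mismatch_strain = (cte_sic - cte_cc) * dT

# Biaxial tension in a thin film constrained by a much thicker substrate:
sigma_tension = E_sic * mismatch_strain / (1.0 - nu_sic)

print(f"mismatch strain: {mismatch_strain:.2e}")
print(f"coating tension: {sigma_tension / 1e6:.0f} MPa")
# Even allowing for high-temperature stress relaxation, a thin ceramic
# conversion coating cannot carry anywhere near this much tension, so it
# relieves the strain by craze-cracking through its thickness.
```

The point of the estimate is qualitative: the mismatch essentially guarantees through-thickness craze cracks, which is why the glass sealant described above exists at all.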
Reinforced Carbon-Carbon (RCC) Leading Edges:

Each wing leading edge has 22 individual RCC panels and RCC T-seals that allow overall wing mechanical growth during loading and accommodate thermal expansion differences between panels and attachment structures (fig. 3). Because the leading edges operate as a hot structure and can reach surface temperatures as high as 3000 °F (1649 °C) during entry, the leading edge can radiate a portion of that heat to space to cool the hotter regions, such as the stagnation region. In addition, internal radiation from the lower (hotter) wing surface could radiate
to the upper (cooler) surface to help cool and equalize temperatures. Each RCC panel and T-seal has a unique shape, and, because of strict fit tolerances to prevent hot-gas ingress while allowing differential thermal expansion, each panel component is custom fit to each vehicle and wing segment (port and starboard). The forward spar of the wing was covered with insulation and a thin superalloy shield to protect it from high temperatures within the wing cavity and potential sneak flow of hot-gas ingress during entry. One of the wing leading edge panels recovered after the Columbia accident had its inside surface coated in silver, probably due to the vaporization of the thin internal superalloy heat shield covering the forward spar of the wing (as shown in fig. 4). That gives you some idea of the laser-like heating of the hot-gas jet that probably flowed through a hole in the wing and vaporized the metallic shield. Installation of a shuttle RCC wing panel onto the forward spar at KSC is shown in the lower left of figure 3.
Figure 3. – Reinforced Carbon-Carbon (RCC) wing leading edge system.
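The radiative-equilibrium balance described above can be sketched in a few lines: the hot SiC surface sheds the incoming convective heat by re-radiation, and the equilibrium temperature grows only as the fourth root of the heat flux. The emissivity and flux values below are assumed for illustration.

```python
# Radiative equilibrium of a hot-structure leading edge: the surface settles
# at the temperature where re-radiated flux balances convective heating,
#   q_in = eps * sigma * T**4   =>   T = (q_in / (eps * sigma)) ** 0.25
# The emissivity and heat-flux values are assumed, illustrative numbers.

SIGMA = 5.670e-8   # Stefan-Boltzmann constant, W/(m^2 K^4)
EPS   = 0.85       # emissivity of the coated RCC surface (assumed)

def t_equilibrium(q_in: float) -> float:
    """Radiative-equilibrium temperature (K) for heat flux q_in in W/m^2."""
    return (q_in / (EPS * SIGMA)) ** 0.25

def kelvin_to_fahrenheit(t_k: float) -> float:
    return (t_k - 273.15) * 9.0 / 5.0 + 32.0

for q in (2e5, 4e5, 6e5):   # entry-class heat fluxes, W/m^2 (assumed)
    t_k = t_equilibrium(q)
    print(f"q = {q / 1e4:4.0f} W/cm^2 -> T = {t_k:4.0f} K "
          f"({kelvin_to_fahrenheit(t_k):5.0f} F)")
# At roughly 60 W/cm^2 the surface runs near 1900 K, close to the 3000 F
# quoted in the text; note the weak sensitivity, since T scales as q**(1/4).
```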
Figure 4. – Reinforced Carbon-Carbon (RCC) wing leading edge panel recovered from STS-107, the Columbia accident.
Leading Edge Structural Subsystem-Problem Resolution Team (LESS-PRT):

Throughout the 30-plus years of the Shuttle Program, there were many issues concerning the leading edges that had to be addressed. It was the job of the LESS-PRT to evaluate the performance of this new technology before and after every mission. Although the LESS-PRT has "Team" in its title, it operated more like a working group. According to Jon Katzenbach and Douglas Smith [6], a "working group" is defined by strong leadership, individual accountability, and alignment with the broader organizational mission; it discusses, decides, delegates, and recommends. A "team," by contrast, exhibits shared leadership roles, mutual accountability, and a specific team purpose, and it encourages open-ended discussion. We will discuss in detail the difference between working groups and teams, and especially what it takes to be a high-performing team, in Chapter 6. During the development phase of the shuttle and its early flights (late '70s to early '80s), the members of the LESS-PRT worked closely with the material scientists and research engineers at Langley, Ames, and Glenn to understand the aero-thermal-structural performance of RCC under the various loading conditions that would simulate the cyclic launch, orbit, and Earth-entry environments the hardware would be subjected to. Thousands of high-speed wind tunnel and arc-jet tests were conducted to mimic the aerothermal heating and flow conditions through the atmosphere, along with thermal-mechanical load tests to understand the material properties and failure limits of the material
and its SiC oxidation protection coating. Over the years, this close collaboration and technical exchange between the LESS-PRT and the NASA research centers evaporated, and the LESS-PRT had less and less contact with their colleagues at the research centers. The working group became more insular in how it developed processes and procedures regarding the RCC leading edges, such as documentation, inspection techniques, repair procedures, and decisions on the flight worthiness/readiness of the RCC components. The LESS-PRT was a working group valued by the SSP managers because its findings directly affected their decision to fly or not fly and, thus, were directly related to the production pressures of schedule and budget. The "working group culture" [5] of the LESS-PRT developed many of the same characteristics and behaviors reminiscent of the O-ring working group at the time of the Challenger accident. The LESS-PRT, at the time of the Columbia accident, was headed by Michael Gordon, a Boeing materials scientist who had followed the development and processing of the RCC leading edges, manufactured by Lockheed Martin's Missile and Fire Control (LMMFC) facilities in Dallas, Texas, throughout most of the shuttle program. I met Mike during my astronaut years at JSC, and we became good friends. He was a hard worker who had a passion for his job. Mike worked closely with the JSC subsystem manager for the leading edges, Dr. Donald M. Curry. I had known Don for many years, since I was a researcher at Langley. When I was an astronaut at JSC, I worked with Don on an RCC heat-pipe-cooled wing leading edge concept for the shuttle and honored him with a coveted NASA Silver Snoopy Award, an award given by astronauts to people who went above and beyond to ensure their safety. Don was a well-respected Apollo-era researcher at JSC whose expertise was understanding the ablation and recession/degradation of the Apollo capsule heatshield. He earned his PhD studying multi-dimensional mass flux and flow in porous media, which he used to develop damage models for predicting RCC material ablation during atmospheric entry. Don was valued as a true expert in all things related to RCC ablation, and together he and Mike pretty much dominated the leadership and decision-making within the LESS-PRT. Organizations represented in the LESS-PRT working group included other NASA centers, Lockheed Martin (LM), and the United Space Alliance (USA). USA was a spaceflight operations company; initially a joint venture equally owned by Rockwell International and Lockheed Martin, and later by Boeing and LM. There would be some tension between members of Boeing, LM, and USA, which I believe happens when large, complex organizations merge. The technical areas covered
by the LESS-PRT included aerothermal analysis, materials science, structures, heat transfer, non-destructive evaluation (NDE), and safety. Until the loss of Columbia during the STS-107 mission on February 1, 2003, many of the LESS-PRT engineers believed the RCC wing leading edges to be durable and robust, even though small areas of coating loss could cause oxidation, burn-through, and large holes during the 15-minute heating phase of entry. After the Columbia tragedy, the LESS-PRT and the nation learned how much they did not know about the reliability, impact damage tolerance [4], and Earth-entry survivability of the Orbiter. Furthermore, in addition to the limited understanding of impact resistance, the tragedy revealed that large knowledge gaps remained regarding the causes of systemic, cyclic aerothermal exposure damage to RCC leading edges.

RCC Inspection Techniques and Safety Controls:

Rigorous procedures were developed by the LESS-PRT to inspect the leading edges using visual means, tactile methods, non-destructive evaluation (NDE) (e.g., ultrasound (C-scans), X-ray, eddy current, and infrared camera), and, destructively, by using plugs of RCC pulled after flight to monitor the rate of subsurface mass loss and void growth. The gold standard for inspection following the Columbia tragedy became IR thermography, a technique developed by Dr. Bill Winfree, a researcher at NASA Langley. It uses flash thermography (basically, a radiant heat pulse applied to the external surface of the component, followed by analysis of the transient thermal gradients) to identify surface and even subsurface defects (voids) that are not visible to the human eye. This relatively new technique was adopted for our mission, STS-114, and so, prior to our flight, every RCC component was thoroughly inspected using all available NDE means. Typical methods included visual, tactile, tap tests, X-ray, ultrasonics (UT), IR thermography, and eddy current, complemented with destructive tests of process control coupons to ensure material compliance and coating integrity. Post-Columbia, there was a concerted effort to develop other NDE techniques that could detect small amounts of damage, including subsurface damage, in situ. The reason for this push for better inspection methods was that, following the Columbia accident, we were rapidly learning how fragile the RCC system was to small amounts of coating loss as well as to subsurface damage (delaminations, cracks, backside coating loss, etc.) caused by foam impacts. This drastically changed the critical damage criteria to much smaller coating losses (from 0.25 in. dia. to 0.08 in.), and, in addition, we also learned that foam
impacts can cause backside coating loss and subsurface delaminations in the substrate without any visible surface coating damage. This meant that prior inspection techniques (visual and tactile coating checks) were no longer valid for ensuring RCC panel integrity (prior to this finding, the RCC panels would not receive ultrasonic or X-ray inspections for subsurface or backside defects for 8-10 flights, until scheduled maintenance was performed). IR thermography was developed as a viable in-situ tool and was adopted as part of the RTF plan; however, most of the validation testing was initially conducted on "simulated" damage to RCC specimens with "flat-bottomed" holes of various sizes to calibrate the technique to known damage. Very little testing on impact-damaged specimens was conducted until "impact" specimens became available, and very little testing was conducted on real damaged specimens that exhibited subsurface mass loss at the SiC/carbon interface or near highly curved regions such as the anomaly on panel 8R. Most of these "safety" controls developed by the group to ensure the integrity of the RCC components were shown to be inadequate to prevent an accident, as we later learned during the 8R anomaly case study [9].

Visual Inspection:

Post-flight, all RCC outer surfaces undergo a 100% visual micro-inspection to identify discrepancies that do not conform to the ML0601-0002 criteria (visible surface cracks filled with glassy sealant and no visible voids (<0.003 in. wide); craze cracks > 0.001 in.; SiC chips > 0.015 in. deep; and pinholes > 0.04 in. in diameter). Visual inspection of panel 8R pre- and post-flight did not detect any cracks/flaws. However, after post-flight IR NDE inspection detected a flaw [7] (CAR LRU-TES-3-32-099), the panel was shipped to LMMFC on 12/6/05. During in-process inspection at LMMFC per MRD 583201, out-of-family craze cracks were discovered which measured 0.013 in. wide [8]: more than an order of magnitude above what was considered discrepant (>0.001 in.) and over twice as wide as out-of-family (>0.005 in.) cracks. It is not certain whether the initial discrepant craze cracks were found before or after the removal of the glass sealant; however, it is certain that the remainder of the large craze cracks were not found until after the sealant was removed. In addition, there were many other reported instances where out-of-family craze cracks (>0.005 in. wide) were not discovered until the glass sealant was removed and the part was heat treated as part of the refurbishment process. Hence, the reliability of visual inspections for finding discrepant (>0.001 in. wide) cracks or even out-of-family (>0.005 in. wide) cracks was very low (panel 8R craze cracks were not
found by post-flight visual inspections). Therefore, the LESS-PRT should have questioned such safety controls and, at the very least, not touted the efficacy of such controls during subsequent FRRs (which they routinely did).

Tactile Inspection:

The existing tactile inspection criterion, V09AJ0.075, was designed to detect damage at the SiC/carbon interface. The rationale for this methodology was based on a belief that "coating adherence integrity is largely based on the integrity of the underlying carbon substrate" and that the "hands-on" pressing evaluation of regions with larger craze cracks "will identify interfacial oxidation manifestation that will cause the SiC coating to loosen and flake from the RCC, thereby exposing the carbon substrate" [1]. There are several reasons why this "logic" was shown to be false and very misleading, and reliance on such an inspection as a means of detecting the onset of critical defects was shown to be seriously flawed [9]. In fact, it was well known, and admitted by the LESS-PRT in several briefings, that they did not have a coating integrity standard or even a tensile evaluation to quantitatively determine the integrity of the SiC coating with subsurface damage (even though we had been flying shuttles for more than 30 years)!

IR Thermographic Non-Destructive Evaluation (NDE):

IR thermography was selected by the SSP as the best in-situ NDE technique to use after every mission to determine subsurface defects, coating loss, and cracks. As mentioned earlier, it was maturing in its capability to detect "real" impact damage and subsurface mass loss; however, it was by no means "certified," even in its capability to reliably detect subsurface flaws in the acreage areas, prior to STS-114. The IR NDE research team at Langley, led by Bill Winfree, was rapidly learning and trying to transfer that knowledge to the inspection teams at JSC and KSC. A critical setback to the ability of this technique to detect critical voids along the slip-side edges of RCC panels (also called the "joggle/step-gap" region) surfaced post-STS-114 with the failure of the method to predict the damage to panel 8R, damage which has subsequently been proven to have existed pre-flight. It will be shown below that pre-flight IR thermography images from STS-114 show subsurface IR indications when reprocessed using an improved "line-scan" technique developed later by researchers at NASA LaRC.
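To keep the inspection thresholds straight, here is a small sketch encoding the craze-crack limits quoted in the Visual Inspection discussion above; the numeric limits come from the narrative, but the function name and structure are my own illustration, not anything from the ML0601-0002 document.

```python
# Sketch: classifying a measured craze-crack width against the visual
# inspection limits quoted in the text. The numeric limits are from the
# narrative above; the code structure is illustrative only.

DISCREPANT_IN    = 0.001   # cracks wider than this are discrepant
OUT_OF_FAMILY_IN = 0.005   # cracks wider than this are out-of-family

def classify_craze_crack(width_in: float) -> str:
    if width_in > OUT_OF_FAMILY_IN:
        return "out-of-family"
    if width_in > DISCREPANT_IN:
        return "discrepant"
    return "in-family"

# Panel 8R cracks measured 0.013 in. wide at LMMFC after the glass sealant
# was removed: over twice the out-of-family limit and 13x the discrepant
# limit, yet pre- and post-flight visual inspection had flagged nothing.
for width in (0.0005, 0.002, 0.013):
    print(f"{width:.4f} in. -> {classify_craze_crack(width)}")
```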
STS-114 RCC Wing Leading Edge Panel 8R Anomaly – Case Study

I was told once by the head of the Astronaut Office that the only negative response he ever received from one of my supervisors was that "I had a low tolerance for stupidity." While I thought this observation was laughable, I also felt it was unjustified. I actually do have a very low tolerance for two types of people: arrogant people who think they know more than they really do, and bullies. But when those traits converge in a handful of people who hold the power of life and death over my friends, I have zero tolerance. As I think back, I cannot believe any high-risk organization (HRO) like the NASA Astronaut Corps would consider a "low tolerance for stupidity" a negative behavior. The 8R anomaly story is technically involved. It describes not only the complex technical phenomena regarding the RCC WLE but also the social, behavioral, and cultural complexities of several interconnected organizations throughout the agency and its contractor base. To help the reader, I have created visual timelines and logs of events, personal communications and email references, and presentations regarding the disposition of RCC panel 8R and related events concerning programmatic decisions on safety of flight. The early part of the story actually begins in January 2000, over five and a half years before the 8R anomaly was discovered post-flight of STS-114 in September 2005.

Missed Opportunities 1 and 2 – RCC Panels 8L (FY 2000) and 10L (FY 2001):

A visual timeline of the sequence of events describing the nondestructive and destructive investigations of RCC panel 8R, taken from Space Shuttle Discovery (OV-103) after landing on August 9, 2005, is shown in figure 5. Notice the two prior instances of indications of a very similar anomaly along what is called the slip-side joggle-step-gap region (JSGR) of the wing leading edges (the region between the RCC panel and its adjacent T-seal; see figures 6 and 7). The hottest RCC panels along the wing leading edges are panels 8 to 10, because this is the location on the wing where the bow shock from the nose of the vehicle interferes with the bow shock of the wings. In addition, the JSG region experiences additional, enhanced heating because the flow impinges on the forward-facing step of the adjacent panel, as shown by the inset in the upper right of figure 7. It is believed by this author that, over time, the cyclic heating in this region caused an accelerated loss of substrate carbon-carbon material, causing the associated voids beneath the SiC-coated surface to
grow faster. This was, in my mind, a systemic problem associated with aging and degradation of the RCC material.
Figure 5. – Visual timeline of the sequence of events detailing the inspections (visual, tactile, NDE, and destructive evaluations) concerning RCC wing leading edge panel 8R and the anomaly discovered post-flight of STS-114 (August 2005). [13]
Figure 6. – Prior damage experienced on panels 8L (STS-103) (12/7/2000) and 10L (STS-102) (3/27/2001) of OV-103 post-flight. [9,11]
Figure 7. – Reinforced Carbon-Carbon (RCC) Wing Leading Edge System (WLES) and "joggle"/step-gap region. [9]
RCC panel 8L of OV-103 was found to be missing a SiC coating chip post-flight of STS-103, which landed on 12/27/1999; the damage was documented in a Corrective Action Report on 12/29/1999 [10] and is shown in figure 6. The damage was large enough to exceed the repair criteria, and the panel was removed, shipped to Lockheed Martin Vought Systems (LMVS) for inspection, and later scrapped. During the inspection, it was noted that the damage had "trademark impact signatures" and that "no evidence of generic coating adherence or coating loss problem existed." It was also noted that "carbon substrate was exposed and oxidized." However, upon further inspection of panel 8L during an NDE evaluation using IR thermographic techniques, presented on 10/1/01 [11], large subsurface indications of mass loss were present, very similar to those found post-flight of STS-114 on panel 8R. Reference 11 contains one of the first indications from the SSP that it might be helpful to investigate other NDE techniques for detecting subsurface defects in RCC material. Inspectors also noticed the large areas of SiC coating loss along the slip edge of 8L (similar to that of RCC panel 8R of OV-103 following STS-114). It is also interesting to note that the documentation stated that "evidence suggests that the damage evolved over many flights – NOT IMPACT DAMAGE" (original emphasis from ref. 11). In addition, there is no mention of out-of-family craze cracks in the CAR for 8L [10]. (This is very interesting because, in several presentations made by the LESS-PRT to the SSP, an attempt was made to draw a connection between the magnitude of subsurface coating loss experienced on 8R and the very
large craze cracks.) Upon a closer inspection of the local damage on 8L, it does not appear that there were large out-of-family craze cracks as noted on panel 8R post-STS-114 [9]. The evidence of oxidation of the substrate may have indicated that this SiC chip came off prior to or during entry heating; however, it is also very possible that the oxidation was already present and the chip became dislodged after touchdown. On April 3, 2001, two weeks after the landing of STS-102 and a little over one year after the missing chip on RCC panel 8L was discovered, the same Space Shuttle vehicle (OV-103, Discovery) experienced a similar coating loss, only this time on RCC panel 10L [12]. The damage was recorded as 2.0 in. long, 0.3 in. wide, and 0.018 in. deep, with exposed and oxidized substrate. This was a large enough area of unprotected carbon substrate that it could have caused a catastrophe during entry had it been dislodged early in the heating phase. My good friend Don Curry briefed us in the Astronaut Office on the status of the RCC TPS on the Orbiter in 2001. When I saw the RCC coating chip that had popped off on STS-103, I approached Don after the presentation and offered the assistance of my colleagues in my old branch at LaRC, the Thermal Structures Branch (TSB). I told him I believed this was an indication of a thermal stress problem that the experts in my branch could help his working group diagnose and analyze. Don turned me down and said that he did not think that was the problem. I was totally surprised by Don's reply to an offer of free assistance from some of the best structures and materials scientists, specialists in RCC material who had helped him and his working group mature this system for reusable flight. The fact that SiC coating had come off RCC panels in critical locations of the vehicle in two separate anomalies, that IR imagery indicated linear subsurface degradation, and that the LESS-PRT still would not immediately contact the research experts at Langley and Glenn and conduct an exploratory study to understand the root cause dumbfounded me. In my mind, these were two missed opportunities to fix the problem within a "recovery window" [13] (over five years long for the 8R anomaly) and avoid a potential disaster. What I would come to learn was that this resistance to seeking outside help from the research centers had become normal operating procedure for the LESS-PRT and many of the engineering staff at JSC.
Missed Opportunity 3 – An Anomaly is Found Post-Flight of STS-114 in RCC Panel 8R

The LESS-PRT had missed not one but two prior anomalies, spaced a little over a year apart, which should have led them down an exploratory research approach to understand the root cause and serious nature of a large SiC coating chip loss in this critical area. The vertical bars in figure 5 identify the periods of shuttle launches and landings. I use the shuttle launch dates to highlight the urgency in diagnosing the severity of the anomaly, the pressure I was feeling to convince my safety office, the NESC, of the criticality of the problem, and how launch pressures affected key decisions to continue flying safety-critical, questionable hardware. On 9/13/05, the LESS-PRT had data showing an IR indication of subsurface damage to panel 8R on OV-103 (Discovery) in a location similar to the two prior occurrences (figure 8).
Figure 8. – Indication of a subsurface anomaly on panel 8R (OV-103) post-STS-114, which was not evident during pre-flight IR inspections and was observed using in-situ IR thermography with the panel still on the vehicle. [14,15]
This was the first use of IR NDE on RCC panels in situ (in place on the vehicle), and hence there could have been issues with the inspection method. Further NDE inspections on 11/21/05 revealed a 30-in. linear indication of a subsurface anomaly along the entire slip side of the panel, disturbingly similar to the indications on panels 8L and 10L over five years earlier (see figure 9) [16].
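For readers unfamiliar with how flash thermography "sees" a void: after the heat pulse, the surface over a subsurface void cools more slowly, because the void blocks conduction into the depth, and the IR camera images that transient surface contrast. The toy one-dimensional model below makes the effect visible; the geometry, property values, and the decision to model the void as a thin, near-zero-diffusivity layer are all simplifying assumptions of mine.

```python
# Toy 1-D model of flash thermography: compare the surface cooldown of an
# intact laminate with one containing a subsurface void, modeled here as a
# thin layer of near-zero thermal diffusivity. All values are assumed.

def surface_cooldown(alpha, dx, dt, steps, pulse=100.0):
    """Explicit 1-D heat diffusion with insulated boundaries.
    alpha: list of thermal diffusivities per cell (m^2/s).
    Returns the surface-temperature history after a flash at the surface."""
    n = len(alpha)
    T = [0.0] * n
    T[0] = pulse                      # flash deposits heat in the surface cell
    r = dt / dx**2
    history = []
    for _ in range(steps):
        Tn = T[:]
        Tn[0] = T[0] + alpha[0] * r * (T[1] - T[0])          # insulated face
        for i in range(1, n - 1):
            Tn[i] = T[i] + alpha[i] * r * (T[i+1] - 2*T[i] + T[i-1])
        Tn[-1] = T[-1] + alpha[-1] * r * (T[-2] - T[-1])     # insulated back
        T = Tn
        history.append(T[0])
    return history

dx, dt, n = 1e-4, 2e-4, 40             # 4 mm depth in 0.1 mm cells (assumed)
alpha_cc = 4e-6                        # C-C thermal diffusivity, m^2/s (assumed)
intact = [alpha_cc] * n
voided = intact[:]
voided[8:10] = [4e-8, 4e-8]            # "void" ~0.8 mm down blocks conduction

good = surface_cooldown(intact, dx, dt, steps=2000)
bad  = surface_cooldown(voided, dx, dt, steps=2000)
for k in (100, 500, 1000, 2000):       # contrast is small early, grows later
    t = k * dt
    print(f"t = {t:4.2f} s  intact = {good[k-1]:6.2f}  over void = {bad[k-1]:6.2f}")
```

The sketch also hints at why the in-situ setup mattered: the usable contrast lives in a narrow time window and depends on pulse energy, geometry, and orientation, so hood placement and procedure can make a real indication easy to miss.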
Figure 9. – Linear indication noticed upon further inspection at the USA Hangar N nondestructive evaluation (NDE) facility using IR thermography (sometime prior to a USA Engineering report dated 11/22/06). [16]
In July of 2006, further inspections using a tape pull (one of the LESS-PRT "certified" techniques for tactile inspection) and dental-pick probing to assess coating integrity indicated that large sections of coating flaked off readily, as noted by Lockheed technician Ali Nasseri in a presentation dated 4/4/07 [16]. When I alerted the NESC to the seriousness of the problem, the LESS-PRT removed all mention of how easily the coating flaked off from all subsequent presentations, and I was forbidden to speak with the technician directly to try to ascertain how tenuous the coating integrity was. In addition, they failed to mention that this was a recurring problem that had occurred on two prior missions. Instead, in the charts they presented to the SSP about the status of 8R, they stated: "Except for Panel 6R, no other RCC part has documented out-of-family craze cracks to the extent seen on Panel 8R" and "The Panel 8R condition is believed to be confined [to] this part only and is not a safety of flight concern for the rest of the LESS hardware." The LESS-PRT also mentioned that they needed to "Establish a Coating Integrity Standard for Flown RCC." When I mentioned to the team my surprise that there was no SiC coating integrity standard for such a critical shuttle component that had been flying for over 26 years, they changed their chart one day later to read "Establish a tensile evaluation for Flown RCC." Developing a "standard" would have required a much more rigorous scientific approach to understanding the structural integrity of the degraded RCC and its qualification as safe for flight. It would also have required much more time, which would have caused major program schedule issues! When they finally relented years later, and the NESC attempted to determine coating integrity experimentally as a function of the magnitude of subsurface mass loss, the results would prove alarmingly low.
Radiographic images, taken tangent to the surface of the panel edge, indicated that the voids beneath the surface of the RCC were as deep as 0.04 to 0.08 in. [17] (see fig. 10). Hence, if a SiC chip were lost prior to entry, the flow would subject the exposed substrate to additional heating, as shown in figure 5 [18,19]. This is what alerted me to the seriousness of this problem; yet why was the LESS-PRT unfazed?
Figure 10. – Tangential X-rays of panel 8R post-STS-114, pre- and post-glass-sealant removal. Image (b) was filtered to enhance and sharpen the void area. [17]
You can tell from the level of subsurface damage that the seriousness of the anomaly was known to the LESS-PRT prior to the very next flight following mine, STS-121; yet nothing was mentioned at the Flight Readiness Review (FRR) for STS-121 to alert the Space Shuttle Program (SSP) that the integrity of the RCC panels and/or of the newly developed IR NDE inspection technique was in question and a potential safety-of-flight concern. If I had known this when I stood up at that FRR, concerned about the IFRs, I would have added it as one more reason we should not have flown STS-121 as-is! When technicians began probing the fragile coating, large flakes came off readily, exposing an oxidized RCC substrate approximately 6.25 in. long and 0.394 in. wide, as shown in figure 11 [16]. When I saw the extent of this damage,
my jaw dropped. I was shocked the LESS-PRT was not raising major alarms and calling for a stand-down of the shuttle fleet!
Figure 11. – First region of SiC coating removed from RCC panel 8R (OV-103), approximately 6.25 in. long and 0.394 in. wide. [16]
I first learned of the 8R anomaly on my flight almost two years after it was first noted on 9/13/05. I was sitting at my desk at Johnson Space Center (JSC) on May 1, 2007, when I received an email from the LESS-PRT [20] with an attached enclosure of a team briefing concerning an anomaly they were working, located on the right wing leading edge (WLE) of my vehicle, Space Shuttle Discovery (OV-103), which they had discovered just weeks after we landed on August 9, 2005 [8]. Instead of using this as an opportunity to gain knowledge of the anomaly by conducting a rigorous forensic inspection, the LESS-PRT proceeded to follow established "working group" procedures [5], treated this as a standard anomaly, and continued to process panel 8R, which had incurred unprecedented SiC coating damage, for refurbishment and reuse (excavating out the damaged area as part of the processing for repair). In doing so, the LESS-PRT destroyed critical evidence/data and missed an opportunity to collect valuable data that could have helped them understand the seriousness of the problem, showing a very serious lack of critical thinking.
The SSP, guided by their technical working group responsible for RCC components, the LESS-PRT, continued not to contest the launch but instead relied on them to help provide flight rationale for three successive launches (STS-121, STS-115, and STS-116) before I had even learned of the anomaly. It was then that I began my odyssey to school the NESC, the LESS-PRT, and the SSP as to the seriousness of this problem and to try to halt further launches until this anomaly was well understood. In my mind, there was absolutely no good reason why the SSP could not replace the discrepant RCC panels with good RCC spares and continue working on the problem. This would have had a minimal impact on the schedule and budget. What the LESS-PRT should have learned after the Columbia accident was that the tests they were relying on to understand the effect of entry heating to the wing leading edges (WLEs) and the analyses they developed to predict the survivability of the RCC WLEs were not conservative as they had claimed at numerous FRRs but were instead nonconservative and the possibility of burn-through and structural failure during Earth entry was very possible. Post Columbia, my good friend and aerothermal researcher at LaRC, Dr. Peter Gnoffo19, had conducted analyses for the CAIB, which indicated that flow over a hole or damaged section of the RCC leading edge can cause more severe heating at an angle as opposed to 90 degrees to the surface which was used to qualify RCC burn-through by the LESS-PRT. This would have meant that all of the predictions for burn-through using the RCC Damage Growth (RDGT) developed by members of the LESSPRT would be inaccurate and nonconservative. The LESS-PRT knew this in August of 200511, yet they continued to tout the conservative predictions using this tool incorrectly at several FRRs. Their original critical damage criteria had to be changed several times from an original criterion of a 0.25-inch hole through the WLE to a 0.08-inch hole through the WLE and SiC coating loss less 0.08-in. for lower surface RCC WLE panels 8 to 10!12 The LESS-PRT had known that the RCC panel 8R on our WLE had experienced severe coating degradation, which readily flaked off over a region as large as over 0.4-in. wide and 6.5-in long as shown in figure 11. Detailed timelines of events concerning the processing of OV-104 (Orbiter Vehicle-104) WLE panel 8R both pre-and post-STS-114 given in reference 15 (a 100-plus page white paper I had written to explain the seriousness of the problem to the shuttle program). The failure of the LESS-PRT to understand the seriousness of the 8R anomaly and their reluctance to seek the expertise of researchers at LaRC and GRC is a testament to how working group cultures can degrade and become dysfunctional. What was even more
What was even more concerning was that the newly formed NESC, which was instituted to correct such failures, proved woefully inadequate in doing so.
Missed Opportunity 4 – Failure of the NESC to Recognize the Criticality of the 8R Anomaly
When I opened the email dated 5/1/07 (reference 20), with its attached presentation by Clifford Pigford8, I was shocked to see how serious the damage shown above was; I was even more shocked that my good friends and colleagues did not call me immediately, knowing my expertise in this area and the fact that I had actually flown on this vehicle. It had been almost two years since STS-114 landed on the dry lakebed at Edwards Air Force Base. This email set off a string of events, which I detailed in my log, and it set me on a collision course with my friends in the LESS-PRT, upper management within my own organization, the NESC, upper management in the SSP, and NASA Headquarters, including the new Chief Engineer, Mike Ryschkewitsch, and NASA Administrator Mike Griffin. In my first email to the leaders of the LESS-PRT, I said that I was shocked, that I agreed they needed to develop a coating integrity standard, and that I disagreed with their assessment: “The damage we are seeing (figure 7) [conflicts] with your statement on page 4 of reference 21, that this is ‘not a safety of flight concern’ (seems I have heard this statement one too many times). I would recommend standing down until you can predict such discrepancies prior to flight or at least have verified your coating standard. Let’s not rely on past successes.”21 I immediately contacted the upper management in the NESC to alert them that we had a serious safety problem. The response from Ralph’s deputy, Tim Wilson, was that he would meet with Bob Piascik (a materials researcher at LaRC with little RCC knowledge) and develop a list of questions, and that I should refrain from “anything that may be seen as ‘inflammatory rhetoric’” and let our processes work24. Now, I had not used any inflammatory rhetoric; however, I was very concerned because we were approaching the launch of STS-117, which was scheduled in about one month. My feeling was that the LESS-PRT reacted because I disagreed with their knee-jerk confirmation that this anomaly was not a problem, and they were beginning the process of silencing my concerns. I had also started to notice behavioral issues with how the LESS-PRT was dismissing critical information regarding impact damage to the orbiter, as noted below. There were two IR indications noted post-STS-114 that were not noted pre-flight: one on panel 8R and the other on T-Seal 6L (6L-T).
There was one other IR indication noted post-flight on STS-114, on T-Seal 6R (6R-T). The indication on RCC T-Seal 6L also coincided with a probable impact recorded by the wing leading edge impact detection system (WLEIDS) during ascent22. The WLEIDS incorporated dozens of vibration sensors on each wing on STS-114 and all subsequent flights, in the hope that it could alert the team to impact strikes on the WLE and help us in our visual scans for damage while on orbit. The WLEIDS team was led by an engineer at JSC, George Studer, who fought vigorously to have the technology ready for our RTF mission. This issue was raised to the Shuttle Problem Resolution Control Board (PRCB) and the LESS-PRT with the recommendation that it be recorded as an in-flight anomaly and an on-orbit impact that may have caused probable subsurface damage, because there were two confirming indications (the IR indication post-flight and the probable impact during ascent as noted by the WLEIDS). However, the LESS-PRT was quick to dismiss this logic and, together with the SSPO, classified this as a case of “hangar rash,” an idiosyncratic term used by the LESS-PRT to account for rough handling by the ground crew during installation. Later, we will talk about a similar rush to wrong conclusions in the LESS-PRT’s decision after a hailstorm at the Cape with the orbiter on the launchpad. Both of these are examples of what Diane Vaughan6 would classify as structural secrecy and a working group’s construction of risk, which normalizes deviant behavior and can lead to the dismissal of potential early signs of problems. This is also reminiscent of the O-Ring Working Group’s assessment of “blow-by” of the Shuttle Solid Rocket Motor’s O-rings. Whether the result of an on-orbit debris hit or mishandling on the ground, this anomaly was never cataloged by the safety and reporting system. It is also troubling to note that this was another instance in which the LESS-PRT and the SSPO missed an opportunity to think critically and question whether the NDE IR inspection capability, which was new and first used on this mission, might be having problems identifying subsurface defects (e.g., issues with IR accuracy related to setup sensitivity, hood orientation, procedures, and/or instrumentation errors). Instead, it was much easier to use a simple, idiosyncratic heuristic explanation, one never verified by any paperwork, even though documentation of such alleged rough handling during ground installation would normally have been required. This begins to surface a behavior I noted earlier, referred to as “confirmation bias,” where a double standard is applied that requires more rigor
to prove there is an anomaly that opposes a rationale for flight, as compared to the lower level of rigor required to dismiss a potential anomaly and confirm flight rationale; to put it another way, it was easier to accept and/or normalize “deviant” or unexplained anomalies/events. I began to highlight the issues I was noticing with this working group to the NESC in an email in which I stated, “The LESS-PRT is a very hierarchical, close-knit working group which is not prone to ask for outside help or review and oftentimes does not recognize the need when appropriate.” Unfortunately, this email was leaked to the LESS-PRT by someone within the NESC, a supposedly objective and professional organization. It totally destroyed the friendship and collegial relationship I had with that working group and set off a firestorm of less-than-friendly emails between myself and Mike Gordon, which destroyed further relationships with members of the group. In that same email to the NESC, I made the following recommendation: “I would not be so prone to accept idiosyncratic explanations for such complex behavior. The Program needs real data to prove why we are safe to fly. I view these as strong and not weak signals with documented past evidence that this is a recurring problem waiting to happen again. Let’s not rely on past successes.” And “I recommend an outside peer review (outside of the current LESS-PRT) to get to the bottom of this issue. I think Corky Clinton (Raymond Clinton at MSFC) and folks from his team (the RCC Aging Team) can help provide answers or ways to get them.”23 This was my first attempt at suggesting to the NESC that we needed an outside team of experts to work on this problem! Corky Clinton was a good friend at MSFC, and I highly regarded him because of his materials expertise and his objective handling of the RCC aging study. I felt that what we were seeing was related to aging, since the RCC panels that experienced this degradation had 20 to 30 flights. The rush to prevent the launch of STS-117 with 5 suspect panels: I continued to inundate the NESC with relevant technical information and even used some of their own anomaly review criteria to show them the fallacy of continuing to allow Shuttle launches in view of several critical issues relating to this specific RCC anomaly in the “joggle”/step-gap region of the wing leading edge. For example: criterion 2 – repeated anomalous indications of evidence of precursors to failure without actually failing; criterion 6 – is the subject of strong alternative viewpoints; criterion 7 – repetitive material review dispositions (MRDs) for the same condition; criterion 8 – root cause of failure not identified; and criterion 12 – flight rationale based on engineering judgment unsubstantiated
by test or analysis.24 I also composed a slide for the NESC team showing what I believed to be some of the key warning signs of problems with a team’s culture and behavior (see figure 12), which I enclosed in an email to the team23. These warning signs were based on several years of my prior research into the behaviors responsible for accidents like Challenger and Columbia. It was quite obvious to me that the members of the NESC team were not well versed in prior behavioral and cultural research studies, and I believed this short checklist would be helpful.
Figure 12. – List of warning signs for the NESC to look out for when dealing with teams that exhibit dysfunctional behaviors that could lead to suppression of critical knowledge.
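For readers who think in code, the screening logic I was urging on the NESC can be captured in a few lines. The encoding below is purely my own illustration, not an official NESC artifact; it lists the criteria quoted above that the 8R anomaly met, and the escalation rule is the obvious one:

    # Toy encoding of the NESC anomaly review criteria cited above and the
    # ones the 8R anomaly met; the escalation rule is my own illustration.
    criteria_met = {
        2: "repeated anomalous indications of precursors to failure without failing",
        6: "subject of strong alternative viewpoints",
        7: "repetitive material review dispositions (MRDs) for the same condition",
        8: "root cause of failure not identified",
        12: "flight rationale based on unsubstantiated engineering judgment",
    }

    def disposition(met):
        """Any single met criterion should trigger a deeper, independent review."""
        return "escalate for independent review" if met else "no action required"

    print(f"criteria met: {sorted(criteria_met)} -> {disposition(criteria_met)}")

The point of the exercise is that the 8R anomaly met five of the NESC’s own criteria, any one of which should have been enough to force a harder look.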
Throughout May, I worked frantically to collect as much data as I could to analyze the causes of the anomalies we were seeing. This was crucial because, in my mind, we had allowed three Space Shuttles to launch post-STS-114 (STS-121, STS-115, and STS-116 (see figure 1)) after we had significant evidence that this was a systemic problem that could cause an accident, and we were now getting ready to launch STS-117 in early June. I had to convince the NESC of the urgency and seriousness of this problem. I had an extremely short amount of time to educate an entire organization about
organizational behavior, psychological safety, and the systemic technical problem causing the 8R anomaly. It was amazing to me that few people in the NESC had read most of the CAIB reports and other relevant accident investigation papers and research. I recommended references 4, 5, and 13 (the entire book) as must-reads to all the safety people I talked to at that time. On 5/3/07, I alerted Ralph that I was hearing very disturbing news from the working engineers: they were intimidated about working with JSC and the LESS-PRT. Engineers who knew me would come to me because they knew I could be trusted, would maintain their anonymity, and would carry their concerns forward. I also alerted the NESC that someone on our previous email chain had leaked my thoughts to members of the LESS-PRT, which helped to destroy my relationship with that team. I began my briefings to the NESC RCC anomaly team (formed by Ralph and led by Bob Piascik, against my recommendation) on May 9, and at the same time, I continued to try to get as much data as I could to fully understand the problem. I highlighted the importance of this specific RCC panel location, the “joggle”/step-gap region (JSGR) of panels 8-10, and the inadequacy of all inspection methods currently in use to validate the integrity of these RCC panels. I also started to identify the key Friends-of-Charlie (FoCs) who had the correct expertise to help analyze this coupled, interdisciplinary problem and tried to encourage the NESC to add them to the team that was forming. I was desperately trying to collect as much information as possible, and, not to my surprise, it appeared the LESS-PRT and the NESC were throwing up every roadblock they could. For example, I had asked a member of the NESC, Curtis Larson, to get as much information as he could about the numerical tools the LESS-PRT was using (the over-temperature tool (OTT) and the RCC damage growth tool (RDGT)) to determine whether damaged RCC could survive Earth entry. My reason was that I had discovered that the RDGT was not as conservative as the LESS-PRT had touted at many FRRs. In fact, unbeknownst to me, a senior and well-respected materials scientist, Dr. Charlie Harris, had written a scathing NESC report highlighting the shortcomings of tools like the RDGT, which the LESS-PRT was using to qualify the integrity of RCC components, as early as April 2005 (reference 28). Curt was prevented from obtaining the requested information and told all requests had to go through Justin Kerr (a JSC engineer and member of the LESS-PRT)26! Justin’s response to Curt was the following: “We are preparing for a flight—we are less than two weeks out, and I cannot have the team distracted with requests that would deviate us from the proper focus…on the mission.
I ask from now on that ALL requests for review of the DAT (damage assessment tools) and operations be communicated to me first and not directly to the Boeing and engineering team. We will make the data and expertise to review these requests with you available at times that are appropriate and safe to support. These latest requests are clearly outside the scope of the RCC panel 9R/NDE review. They were not part of our forward work plan and do not contribute to 117 flight rationale.” Notice the wording: “these latest requests are outside the scope...do not contribute to STS-117 flight rationale”! It appeared the only concern of the LESS-PRT working group, the SSP, and, to my surprise, the NESC was work that would contribute to developing flight rationale. In other words, the NESC and LESS-PRT were focused on identifying reasoning, or “confirming cues,” which the SSPO could use to keep flying shuttles with known critical anomalies. How could collecting information that could discredit the accuracy of the analytical tools the LESS-PRT was using to plead their flight-rationale case at the FRR not be considered important and critical for the managers at the FRR, who would need to make informed, intelligent decisions? This was exactly the reverse of the safety-culture thinking called for by the CAIB and by the Rogers Commission for Challenger: “Far from meticulously guarding against potential problems, the Commission found that NASA had required ‘a contractor to prove that it was not safe to launch, rather than proving it was safe.’”4,27 What was most disturbing was that I, a Deputy for Advanced Projects within the NESC, would be considered by my own organization as a dissenting opinion, be kept on the outside, and eventually be denied access to critical information that I needed to prove we were not safe to fly! On May 22, I sent my dissenting opinion29 to Ralph Roe and Bob Piascik stating that we were not safe to fly STS-117; I was joined by one other member of the NESC who agreed that we did not have sufficient flight rationale to launch. On May 25, I posted it to the NESC website. I often wonder if the LESS-PRT would even have addressed the 8R anomaly and RCC panel concerns at the FRR for STS-117 if I had not raised it as an issue. The NESC convened on May 22 to review the RCC sub-surface anomaly30, and their conclusion was that there were several issues of concern and uncertainty, placing several entries in the region of the safety-of-flight risk matrix where the likelihood of occurrence was probable and the outcome catastrophic (the red and yellow portions of the matrix). Hence, the NESC determination at the FRR for STS-117, two weeks before launch, was that “Further risk mitigation is needed to develop acceptable flight rationale for STS-117.” According to the NESC’s position at the
time of the FRR, the 8R anomaly was severe, it lacked an understanding of root cause, the rate of degradation growth was unknown and not well understood, and mission life models did not address this failure mode30. With less than two weeks to go until the launch of STS-117, I felt confident that, given the NESC’s position at the FRR not to fly, and given that the NESC was the premier, newly created technical authority and safety organization, the Shuttle Program would stand down and correct the anomalies on five RCC panels (10R, 14L, 16L, 17L, and 17R) on OV-104 prior to the launch of STS-117. Each of these RCC panels had recently been inspected using IR thermography, and all exhibited anomalies similar to 8R, with IR readings greater than 0.2 Winfrees, a magnitude deemed to indicate a significant loss of SiC coating integrity (similar to that exhibited by the 8R anomaly). Months later, when the NESC finally conducted rigorous research into this anomaly and researchers at LaRC conducted scientific coating integrity studies (which should have been done much earlier by the LESS-PRT), we were able to show that coating integrity decreases precipitously, to a level well below allowable limits, as IR readings approach 0.2 Winfrees! On June 5, three days prior to the launch of STS-117, Ralph Roe held a special NESC Review Board (NRB) telecon during which he directed his team to try to help the LESS-PRT and the SSPO “develop flight rationale for STS-117”! In my opinion, this was highly inappropriate for the head of an objective safety organization. In other words: try to find any confirming cues/evidence that would help the program make a “Go for Launch” decision! The NESC was supposed to have separate lines of governance and be an objective technical authority, according to the Columbia Accident Investigation Board4 and the Stafford-Covey Commission32. It came as no surprise, after Ralph’s instructions to his team on June 5, that a mere one day later the NESC had reversed the decision it had made less than one week earlier and aligned itself with the message from the LESS-PRT that there was indeed sufficient rationale to fly STS-117 as is. At a meeting one week after the FRR for 117 and only two days before launch (L-2), Mike Stoner of Boeing, a member of the LESS-PRT, presented the following summary to the Shuttle Program33, as shown in figure 13.
Figure 13. – Mission Management Team (MMT) presentation two days before the launch (L-2) of STS-117 to support their rationale to fly Orbiter Vehicle 104 (OV-104) as is33.
You notice in figure 13a that the LESS-PRT focused on trying to understand the “joggle” region and conducted multiple IR line scans of the five panels that exhibited an anomaly. They mainly focused on any differences between the readings pre- and post-STS-115, which had flown in September 2006 on the same vehicle, OV-104, Space Shuttle Atlantis. Their unverified hypothesis was that if the anomalies did not show an appreciable difference before and after flight, the damage was not growing and, hence, we should be good to fly as is. This was an assumption that had no basis in fact and had not been corroborated by any testing up to that time. Not to mention that an IR reading ≥ 0.2 Winfrees was an indication of a severe loss of coating integrity in a critical, highly heated region that had never been adequately analyzed. This faulty construction of risk is reminiscent of another dysfunctional working group, the Challenger O-Ring working group. The NESC and the LESS-PRT apparently either had not read my white paper9, in which I stated my hypothesis that if the RCC panels had not been cleaned and the glass removed, the glass could fill voids beneath the SiC surface and mask the true extent of the subsurface damage, or they disagreed with my assumptions without discussing it with me. I also explained how each of the supposed control methods the program had in place to identify discrepant panels was inappropriate: tactile inspection (for coating integrity in suspect regions), visual inspection (for abnormally large craze cracks), and IR thermography (for identifying subsurface voids), as discussed above9. I also spent considerable effort explaining the reason the IR readings grew so much for my flight: there were very large craze cracks that grew when heated during Earth entry, and much of the glass filling the voids was entrained in the flow and swept out of the cavity below the SiC coating. Hence, the logic
and evidence the LESS-PRT used to determine we would be safe to fly (figure 13c) was totally faulty. Their rationale for flight in figure 13c, “no measurable differences in NDE signatures pre and post-STS-115,” was asserted by showing only the pre- and post-flight images of panel 10R, a claim I had already disproved in my white paper, as shown in figure 14 for three panels flown on STS-115. The LESS-PRT had cherry-picked the data that would support their claim/hypothesis, what is called a disqualification heuristic and/or confirmation bias!
Figure 14. – Comparison of IR line scans of discrepant panels pre-STS-115 and pre-STS-117, disconfirming the LESS-PRT’s hypothesis of no changes in IR line scans pre- and post-STS-115 (reference 9).
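The flaw in that construction of risk is easy to see in miniature. The sketch below is my own illustration with invented readings (in Winfrees), not actual NDE data; it contrasts the “no pre/post-flight change” screen with the absolute-threshold screen I was arguing for:

    # Invented readings, for illustration only; the real panel data are in figure 14.
    INTEGRITY_LIMIT = 0.2  # Winfrees; at/above this, coating integrity is degraded

    def difference_screen(pre, post, tol=0.05):
        """LESS-PRT-style rule: pass if the reading did not grow across a flight."""
        return abs(post - pre) < tol

    def absolute_screen(pre, post):
        """Alternative rule: fail any panel already reading at or above the limit."""
        return max(pre, post) < INTEGRITY_LIMIT

    panels = {"A": (0.25, 0.26), "B": (0.31, 0.30), "C": (0.08, 0.09)}
    for name, (pre, post) in panels.items():
        print(f"panel {name}: difference screen "
              f"{'PASS' if difference_screen(pre, post) else 'FAIL'}, "
              f"absolute screen {'PASS' if absolute_screen(pre, post) else 'FAIL'}")

Panels A and B are stable from one flight to the next but already well past the integrity limit; the difference screen clears them anyway, which is precisely how a degraded panel keeps flying.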
The LESS-PRT conveniently omitted the IR line scans of panels 14L and 17R from their L-2 briefing, for obvious reasons. My dissenting opinion34 highlighted the unknowns in our ability to infer the quantitative level of coating integrity, but what we did learn was that IR readings above 0.2, as shown in figure 15, were indeed consistent with regions where the coating integrity of panel 8R was seriously degraded and posed a flight risk. As shown in figure 15, in large areas where the IR readings were above 0.2, the coating flaked off “readily.”33 This information was known by the LESS-PRT prior to the FRR but was purposefully not presented; it was relegated to their backup slides, which were never shown at the FRR. Once again, the LESS-PRT dismissed any information which disproved their confirmatory position that we were “go for launch.”
Figure 15. – LESS-PRT presentation to the MMT two days before launch (L-2) of STS-117 showing IR line scan data for RCC panel 8R post-STS-114 and associated areas of damage33.
Later work that I requested from the LESS-PRT, an investigation of the structural integrity of the SiC coating that should have been conducted years earlier, was performed by an RCC materials scientist at Langley, Wally Vaughn, and the results quite dramatically show how coating tensile strength drops off exponentially at IR readings of 0.2! Unfortunately, the figure could not be presented because the material properties are claimed to be ITAR-restricted. Suffice it to say that panels with IR readings of 0.2 or greater had little to no coating tensile strength prior to the coating popping off, hence the technician’s observation, “Out of family cracks liberated coating readily16.” It was appalling to me that the NESC stood by while the SSPO and the FRR team allowed another flight to launch given the severity of the anomalies on five critical RCC panels, the lack of sufficient safety controls (tactile, visual, and NDE), and the unsupported technical evidence of sufficient coating integrity. They did not even question the misleading data presented by the LESS-PRT. The LESS-PRT was also prepared to show predictions made using their RCC Damage Growth Tool (RDGT), which I would later have to convince my boss Ralph and the NESC was flawed and not conservative as had been touted in numerous FRRs, and which still indicated that even if coating were lost during Earth entry, the panels would survive.
Knowing this would be another defensive tactic used by the LESS-PRT, I had to spend considerable time collecting data to prove that, indeed, the RDGT was not conservative as they had boasted but instead was nonconservative, and that the panels they said would survive would instead fail5,8. I was unsuccessful in delaying the launch of STS-117, and the program decided to fly it as is, without replacing the discrepant panels with good spares. I was extremely disappointed and angry with my own safety organization for flipping its readiness-for-flight decision only two days prior to launch (L-2) without what I considered to be credible logic. Regroup and try to collect data to convince the NESC of the severity of the problem: I continued sending emails to Ralph and other members of the NESC requesting more data, analysis, and testing, and made a formal request for a safety stand-down before we flew another shuttle mission. I created a visual timeline of the events (figure 16), most of which are corroborated by emails, to highlight key decisions and the impediments I encountered in collecting data, and to create context for the culture and environment that persisted and even grew worse post-Columbia.
Figure 16. – Visual timeline of events in the STS-114 RCC panel 8R anomaly story.
I am a man on a mission, trying frantically to educate the key leaders in the NESC on the severity and urgency of the problem. I feel I am fighting my own safety organization, which is unheard of, in addition to the LESS-PRT, the SSPO, MOD, and yes, even my own Astronaut Office at times. Believe it or not, a crew getting ready to launch wants to do just that: launch. They are ready to go, they think they understand all the risks, and as long as the risk of catastrophe is less than 1:68 (the assumed odds for launching on the space shuttle), they are usually dead set against any delays! On June 20, I sent an email to Ralph telling him the LESS-PRT had taken me off their email distribution list. That is correct: the LESS-PRT was impeding the flow of information to someone in the NESC who was collecting data to prove their flight rationale was not sufficient and that we were flying vehicles with critical anomalies. What is even more distressing is that it appeared this did not faze Ralph or the NESC one iota! Throughout the course of events I requested several times to be reinstated on the LESS-PRT email distribution list, and I finally gave up after it was never resolved. I also pleaded several times with Bob Piascik, Clint Cragg (newly appointed lead for the NESC anomaly team), Ralph, and others to read the 117-page white paper9 I was hurrying to complete, which analyzed the 8R anomaly and presented the evidence against flight rationale. In an email to Clint and others on the “RCC Leading Edge Team” on August 26, 2007, I made the following request: “I sent you all a copy of my paper on the 8R anomaly. Much of the information in that paper was conducted by Peter Gnoffo, Sandy Walker, and Bill Winfree. I tried to piece together a credible scenario for what is a potential root cause of the coating degradation (subsurface mass loss). Pleas (sic) send me your comments regarding my interpretation of the data and whenever applicable please add your own data/results to refute any statements I have made which may be incorrect. I would like to finalize all comments by next week so I can go to tech editing and distribute to a wider audience. I will have to assume if I do not receive your comments that you are content with the analysis. I am most concerned with the CFD aerothermal, the thermal structures, and the NDE. Thank you, Charlie”
In the email above, I was asking the team to technically review my analyses, to refute any items they disagreed with, and to provide their own data and analysis so we could have a technical discourse. This is what research scientists do. True researchers discuss data, ask questions concerning the type and rigor of analytical models and assumptions, and critically assess whether there are any issues or potential discrepancies. However, not one member of the NESC read the details and analyses I presented, and in his email response, Clint Cragg even admonished me at one point, telling me it was not his job to “quality check” my paper:
“Charlie, I have been thinking of your request to review your report for some time. Your report remains essentially a dissenting opinion to the work our team has done. We have looked at the same facts as you have and have come to different conclusions. Additionally, we are now beginning to delve into the root cause determination. Because of this, I believe that our job is not to proof your technical arguments. In any case, Charlie, it’s not our job to do a quality check of your paper. That responsibility, I believe, is yours. It would be a mistake to assume that no comments from us constitutes contentment. Clint”
I added the highlights above to dissect some of the points in Clint’s response:
1) I was a dissenting opinion, so the NESC had to detach themselves from me as they responded to my position in reference 9 (hence the “our team,” of which, I would clearly be told several times, I was NOT a part).
2) They “looked at the facts and came to different conclusions,” which was exactly my point; if they disagreed, I wanted them to contact me so we could have a scientific discussion. As it stood, I did not have any information or data as to why they disagreed with my analyses. Since I was also a senior member of the NESC, I was hoping I could be on the NESC team; however, it became very clear that my own organization was going to become one of my biggest impediments to getting to the truth about this anomaly.
3) Without a scientific discussion of competing ideas/hypotheses, the NESC was now on a mission to discover the “root cause.” In reality, the NESC’s mission should have been to determine if the vehicles were safe prior to flight. A real root-cause determination would take much longer (if it were possible at all) to prove, and we could possibly lose another shuttle crew.
What I quickly realized during this process was that the NESC team, while it did have many so-called SMEs and designated “Technical Fellows” in several
key domains, was not performing at what I would call a “High-Performing Team” (HPT) level, because its members had never worked many interdisciplinary, highly coupled technical problems and did not possess the necessary critical thinking and collaboration skills. I was trying desperately to help train this team, but the leadership was preventing me at every step. The people on the team who did have these skills, like Sandy Walker from my Thermal Structures Branch and Peter Gnoffo from the Vehicle Analysis Branch (both at LaRC), read my paper, understood that it was a critical issue, and worked collaboratively with me. My frustration with the NESC grew steadily with every email exchange, and in desperation I went around my chain of command and began communicating directly with NASA Headquarters: specifically, with the newly appointed NASA Chief Engineer, Mike Ryschkewitsch; with Chris Scolese, who had been appointed associate administrator; and at one point with the administrator himself, Mike Griffin (up to this point, I had not spoken with Griffin since he refused to come to my defense when I was fired as Director of Engineering by Coates). At this point, instead of stressing the technical issues, which many did not really understand, I stressed the sociological and behavioral issues, which were similar to the “Echoes of Challenger,”4 and how they were preventing the flow of knowledge necessary to make intelligent decisions. Needless to say, I was making plenty of enemies; however, I really did not care. In my mind, this was very much a matter of life and death for, in particular, my friends, colleagues, classmates, and crewmates: the astronauts! In an email to Mike Ryschkewitsch, I told him about my displeasure with the NESC team and implored him to think about getting a truly outside, objective team to investigate37. It was clear that the NESC team was closely aligned with the LESS-PRT and SSPO and was even working with them to help create flight rationale. The only person who really read my white paper9 at that time was my good friend, the world-renowned aerothermodynamics expert Dr. Peter Gnoffo. He responded38: “The first theme is defining the technical problems with ‘aging’ in the joggle region. You make a strong case for defining the problem, suggesting the root causes, and explaining why any current inspection techniques miss the degradation… The second theme develops your concerns that the analyses to clear these regions for flight readiness are not sufficiently conservative. Essentially, if one accepts that the degradation can be missed (I do), then the analyses to clear for flight do not adequately encompass the severity of the off-nominal environment.”
Clearly, what you will learn is that more and more of the research community who were knowledgeable in this technical area, the “Friends of Charlie” (FoCs), begin to weigh in on this problem because they realize how serious this issue is. Unfortunately, the NESC had not begun to integrate these FoCs into their team, and when they did, they chose to selectively dismiss members who opposed their ideas. I spent the next few weeks trying to collect data, recommend key team members, and push my case forward. This data was absolutely critical. For a very short time, I was connected to Kevin McClam (JSC Safety), who was instrumental in obtaining the processing and disposition histories of the RCC panels. With these histories of the local “joggle” region, the evolution of problems from prior flights, the number of flights of each RCC panel, and the inspection/refurbishment/repair schedules, I was able to establish this as a systemic problem related to the age of the component, its location on the vehicle, and its past processing. When the LESS-PRT learned of Kevin’s assistance to me, he, as a contractor, was immediately scolded, prevented from talking directly with me, and ordered to go through the NESC39. This was pretty interesting, since I was a deputy director in the NESC. Obviously, they meant through Ralph Roe and whomever Ralph designated. It is quite obvious from Kevin’s email that the OPO (Orbiter Program Office) required him to go through them for any information requests and that “it was concurred by the NESC that any new requests for PRT (Problem Resolution Team) involvement/information (other than work they are nominally performing in regard to this issue) needed to go through them.” Hence, the NESC complied with this request to impede the flow of information to a dissenting individual within their own organization—me!
However, thanks to Kevin’s help, I was able to dive into the histories of all past RCC anomalies and present the panel histories of all panels on OV-104 for STS-117 (see Table 1).
Table 1. – RCC Panel Histories Prior to Damage Incident or STS-117 Launch
Historical analysis of results9 – Looking into the history of discrepant RCC panels which had instances of coating degradation similar to panel 8R, degradation which caused the premature retirement of the panels from the active fleet9, you notice that all of the panels, with the exception of panel 16R, were from OV-104; they had to be scrapped after 27-31 flights, and most had to be refurbished at least once after about 20 flights (panel 6R on OV-105 had a total of 19 flights and a defect which was 29 in. long x 0.50 in. wide with crazing of 0.005 in.). Panel 16R was found to have an out-of-family craze crack (0.55 in. long and 0.006 in. wide) after only seven flights; it was repaired, flew three more flights, was refurbished, and then flew seven more times before being scrapped (after 17 flights). In addition, the panels that were discrepant and scrapped, or on which we found subsurface linear indications near the JSG region, correspond to regions on the leading edges where heating is increased due to local flow conditions (panels 6-10 and 15-18). As shown in Table 1, four of the five panels with linear indications on OV-104 had 18 flights (panels 14L, 16L, 17L, and 17R), and one had 16 flights (panel 10R). This was one of the reasons I believed this was a systemic problem related to the aging and degradation of RCC panels, and hence the urgency in educating the key people at the FRR for STS-117 on the seriousness of the problem, as well as on the lack of understanding of it by the LESS-PRT and the NESC.
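To make the aging argument concrete, here is a small sketch of my own; the thresholds are the service-history observations above, not program limits, and the flight counts follow Table 1:

    # Thresholds taken from the service-history pattern described in the text.
    REFURB_FLIGHTS = 20  # most degraded panels needed refurbishment near this age
    SCRAP_FLIGHTS = 27   # degraded panels were retired after 27-31 flights

    def age_status(flights):
        if flights >= SCRAP_FLIGHTS:
            return "at or beyond typical retirement age"
        if flights >= REFURB_FLIGHTS:
            return "in the typical refurbishment range"
        return f"{REFURB_FLIGHTS - flights} flights short of typical refurbishment age"

    # The five suspect panels on OV-104 before STS-117.
    suspect_panels = {"10R": 16, "14L": 18, "16L": 18, "17L": 18, "17R": 18}
    for panel, flights in suspect_panels.items():
        print(f"panel {panel} ({flights} flights): {age_status(flights)}")

Every suspect panel sat within a few flights of the age at which earlier panels had needed refurbishment, exactly what one would expect of a systemic aging mechanism rather than random handling damage.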
My hypothetical reasons for the irregular IR readings and incorrect classification of integrity of all RCC panels in the “joggle”/step-gap region of the slip side of the WLE panels – The conventional wisdom for the subsurface oxidation and degradation of the carbon-carbon substrate is shown in figure 17.
Figure 17. – Illustration of the SiC coating loss due to cyclic subsurface oxidation and degradation9.
Oxygen can reach the carbon substrate through microcracks in the SiC oxidation-protection surface layer. These microcracks form from thermal stresses during the cooldown step of RCC fabrication, driven by two phenomena: the difference in coefficient of thermal expansion (CTE) between the SiC coating and the RCC substrate, and nonlinear thermal gradients (see figure 17a). During entry heating, the carbon substrate is oxidized and escapes as a gas through the narrow microcracks, as shown in figure 17b, forming subsurface voids. The Type-A glass fill is used to block the oxygen from contacting the carbon during entry heating and causing it to oxidize. The glass is not much protection when exposed directly to the level of heating experienced during Earth entry and will quickly either be swept away in the flow or evaporate/ablate. Nevertheless, many of the arguments the LESS-PRT used at FRRs touted the effectiveness of the Type-A glass sealant as some backup or redundant protection system during entry. Remember how the Type-A glass was completely removed during the RCC cleaning procedure of bakeout at 2400 °F? Well, the temperatures on the wing leading edge can exceed 3000 °F for up to 15 minutes! Multiple flights can cause successive mass loss and failure, as shown in figure 17c. According to the LESS-PRT, the SiC “craze squares” may become dislodged by external forces such as air loads, vibration, impact, acoustic loads, and flexure/bending11,40. One external load the LESS-PRT had not considered was thermal loading and the resulting thermal stresses. Now, considering the increase in thermal contact resistance at the SiC/carbon interface, it is easy to understand that during the very high heating portion of the entry trajectory, where heat fluxes can be greater than 60 BTU/ft2-sec, the peak temperatures, temperature gradients, and resultant thermal forces and moments on the fragile SiC chips can be very large (see figures 5, 10, and 11). This increase in local temperatures also results in the rapid evaporation of the Type-A sealant and a subsequent increase in mass loss rates. This could potentially increase the local thermal stresses and deformations and, once again, increase the probability of failure of the coating during Earth entry.
Figure 17. – Effect of subsurface mass loss on increasing peak RCC temperatures and temperature gradients for entry heating and resulting thermal forces and moments.
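To give a feel for the mechanics, a standard first-order estimate (my own textbook sketch, not the LESS-PRT’s or the NESC’s analysis) of the residual biaxial stress locked into a thin coating bonded to a thick substrate during cooldown is

$$\sigma_c \approx \frac{E_c\,(\alpha_c-\alpha_s)\,(T_{\text{process}}-T_{\text{room}})}{1-\nu_c},$$

where $E_c$ and $\nu_c$ are the coating modulus and Poisson’s ratio and $\alpha_c$, $\alpha_s$ are the coating and substrate CTEs. Because the CTE of SiC exceeds that of the carbon-carbon substrate, the stress is tensile, which is why the coating craze-cracks on cooldown. For scale on the entry environment, the heat flux quoted above converts to metric units as

$$60\ \tfrac{\text{BTU}}{\text{ft}^2\text{-s}} \times \frac{1055\ \text{J/BTU}}{929\ \text{cm}^2/\text{ft}^2} \approx 68\ \tfrac{\text{W}}{\text{cm}^2}.$$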
When the NESC finally listened to my advice and allowed Dr. Sandra Walker of my old Thermal Structures Branch (TSB) to conduct a detailed thermal stress analysis of this region, as I had instructed, the results validated this hypothesis. Dr. Walker’s finite element stress analysis of a typical RCC panel segment with a JSG region, for various thermal loadings and assumptions, is shown in figure 18 (reference 40). There is a 42% increase in stress in the corner of the SiC chip.
Figure 18. – Deformations and stresses for a non-uniform ΔT and an assumed 0.4 in. long delamination at the SiC/carbon interface (simulating significant local subsurface mass loss).
Imagine if Don Curry had listened to me in 2001 when, after his presentation to the Astronaut Office on RCC WLEs, I offered to have members of the TSB conduct a thermal stress analysis of this anomaly. Instead, he dismissed my offer. We could have solved this problem and found a safe solution years earlier. The second half of this book will describe how creating research networks of high-performing teams, with the correct leadership and key researchers, can solve critical problems quickly, within reasonable recovery windows. In my white paper, which followed my dissenting opinion,9 I also alerted the NESC that they needed to look at transient thermal gradients and stresses, for cases with and without subsurface mass loss/porosity and with cyclic nonlinear material behavior, to fully understand the potential failure mechanisms of the SiC coating.
In addition, the local in-plane temperature gradients due to the highly localized aerothermal heating at the knee of the “joggle”/step-gap region, as shown in figure 19, also need to be included in the analysis8,38.
Figure 19. – Locations of increased heating on the shuttle wing leading edge during Earth entry and further increases due to joggle/step-gap region heating8,42.
This local increase in heating would only increase the severity of the thermal stress problem. Some members of the NESC RCC Team did not agree with Sandy’s analyses and attempted to intimidate her. When I heard of this, I immediately notified Ralph and another senior materials scientist at LaRC, Dr. Charlie Harris. The results in figure 19 were presented by Dave Schuster of the NESC on November 1, 2007, and are a compilation of work by K. Rajagopal of Boeing and Peter Gnoffo of LaRC. You notice in figure 19 that there are two regions on the WLE where heating is more intense (between panels 8-10 and 16-19). In these regions, the heating on the forward-facing step presented to the flow in the joggle region can increase further, possibly even double. It should now be very obvious why this region is critical and why it was not well understood by the LESS-PRT. In addition, if you look at the Shuttle RCC WLE panel refurbishment and replacement histories (Table 1), you notice that all five panels that had anomalies prior to STS-117 were within these regions! How did the LESS-PRT and the NESC miss this obvious pattern of damage and its relationship to prior anomalies? Quite simply, the NESC was not a research organization and had little experience creating high-performing teams to solve complex interdisciplinary problems! It should be noted that if a SiC chip were dislodged during Earth entry (bottom of figure 7), this would greatly increase the forward-facing step size and the additional heating to this damaged region. All of this was presented in my dissenting opinion and white paper months ahead of the NESC validation.
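As rough arithmetic only, combining the numbers quoted in this chapter (my own back-of-the-envelope figure, not a program prediction): if the local flux in the high-heating zone is already on the order of the 60 BTU/ft²-sec cited earlier, a factor-of-two step augmentation in the joggle region implies

$$q_{\text{joggle}} \approx 2 \times 60\ \tfrac{\text{BTU}}{\text{ft}^2\text{-s}} = 120\ \tfrac{\text{BTU}}{\text{ft}^2\text{-s}} \approx 136\ \tfrac{\text{W}}{\text{cm}^2},$$

which helps explain why the damage concentrated in these joggle regions.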
Much of this work was also used to prove to the LESS-PRT that the RCC Damage Growth Tool (RDGT) they had been using to predict burn-through of damaged RCC panels was very nonconservative, and that the predictions they had been using to validate flight readiness were incorrect and misleading! In my communication with the NESC, I recommended a series of aerothermal tests in the NASA arc jet facilities, along with aerothermal and thermal stress analyses, to understand the mechanism of material degradation and failure in this region. My requests were ignored.
Stepping up the heat prior to launch of STS-118
After my disappointment at the NESC not being able to convince the SSP to stand down from flying shuttles after the launch of STS-117, I doubled my resolve to collect data to make my case and to educate members of the NESC RCC Team (now being led by Clint Cragg). My pleas to be reinstated on the LESS-PRT email list, for the NESC team to read my white paper9, and my calls for an external independent review were all rebuffed (see Appendix A, 6/25/07 to 8/8/07). On 7/31/07, I requested a private audience with Ralph to discuss my censorship by the LESS-PRT and my concerns with the performance of his RCC anomaly team. I traveled from Houston to meet Ralph in person in his office at LaRC, Hampton, Virginia. When I entered Ralph’s office, I pleaded my case, at which point he told me, “Our team has the best structures, materials, and aerothermal people available. How can they be wrong?” That was when I told Ralph, “Your team sucks.” He totally blew up, threw over a table in his office, and shouted at me to get out. Over and over again, he screamed, “God damn it, I said get out.” I did not budge. In fact, I sat down across the office from him and asked him if I could start breaking things (one of his Space Shuttle models had fallen off the desk in his rage and had broken). I told Ralph I would not leave until he calmed down and listened carefully to what I was telling him. When Ralph finally calmed down and listened, his eyes widened. There was a glimmer of understanding of the veracity of my claims and the severity of the problem. Perhaps his team did miss something? What Ralph could not comprehend was how his team, which consisted of some very good engineers and researchers in multiple disciplines, had missed the criticality of this problem. As explained earlier, these problems are complex and interdisciplinary; they require researchers who are used to working collaboratively across multiple disciplines, in a converged way, to understand the problem deeply enough to reach its real root cause. Ralph
ordered me to meet with his team and to make my case. I had prepared about two hours of presentation slides, which I went over with his team in two phases. That afternoon, 8/1/07, I gave a two-hour presentation to the NESC RCC Team. Bob Piascik could not make the meeting, so I would have to come back the following week to brief him and the remainder of the team43,44. I meticulously stepped through each claim of flight rationale by the LESS-PRT and proved that their presentation to the FRR for STS-117 had falsely claimed the RDGT to be conservative; the tool was in fact nonconservative, and it would predict burn-through of several panels with anomalies if the coating was lost! I was very tough on the NESC RCC Anomaly Team. I questioned each of their prior assumptions and their reasons for aligning with the LESS-PRT position that we had flight rationale for STS-117. I challenged them to think critically, question everything, and dig deep to find the truth. I cautioned them not to accept everything they were hearing from the LESS-PRT but to do their own analyses and investigations. Apparently, someone at that meeting felt I was too aggressive and even accused me of being abusive to a friend and fellow colleague of mine from the materials division at LaRC, Wally Vaughn. Wally was an excellent materials scientist who specialized in carbon-carbon and RCC and someone whom I highly regarded. I was totally in the dark about this accusation, and when I met with Bob Piascik the following morning, he cornered me in an empty office, proceeded to accuse me of abusing Wally Vaughn, and threatened to take me outside if I ever treated Wally that way again. I was totally caught off guard. I denied I was abusive and proceeded to tell Bob he should never accuse someone when he had no proof, and that if he ever spoke to me like that again, we would indeed step outside. I was blown away; I did not feel I had been abusive to Wally or anyone else at that meeting. I approached other members of the team who were at the meeting and asked if anyone thought I was abusive, and they looked at me as if I had two heads. “No, Charlie, you weren’t abusive.” I sent an email to the team, and to protect myself, I spoke to a NASA attorney and requested an investigation. I have witnessed Shuttle Program managers in the past accuse people of being abusive when they wanted to silence voices that opposed the “party line.” I was not going to let that happen to me, because astronaut lives were at stake. At that second meeting, the NESC hired an independent observer, Mr. Johnny Garcia, to monitor my behavior, mediate the proceedings, and probably identify my supposed “abusive” behavior. I later, in an email to Associate Administrator
Chris Scolese, requested an IG investigation of the matter to definitively clear my name45. I was taking no chances at this point. I was unable to convince the NESC, the SSPO, and NASA Safety and Mission Assurance (S&MA) that we were unsafe to fly STS-118. Apparently the LESS-PRT had convinced the FRR team that their predictions for damage growth were conservative and that we were safe to fly. I continued my fight, collected more data, and wrote a 50-page white paper in which I showed the inadequacy and nonconservatism of the LESS-PRT RDGT’s analytical predictions9,46. I distributed this paper widely and finally reached a senior materials scientist at LaRC, Dr. Charlie Harris, who was well respected by Ralph and the NESC. When Dr. Harris read my paper, he was convinced the LESS-PRT analysis of damage growth was flawed and made this fact known to Ralph. In fact, Charlie Harris had been one of the first people to question this tool, in a much earlier report he published for the NESC, as noted, two years before28. The NESC’s reasoning for flight rationale for STS-117, STS-118, and STS-120, noted earlier, is presented in reference 31. Basically, the NESC just parroted back the rationale proposed by the LESS-PRT, with minor additions based on additional testing. They totally ignored my proof that the three methods used to inspect the vehicle for anomalies were ineffective as safety controls and that the RDGT the LESS-PRT was using to determine burn-through was inaccurate and nonconservative. I do not stop. I continue making pleas to the head of S&MA (astronaut Bryan O’Connor), the associate administrator (Chris Scolese), NASA’s chief engineer (Mike Ryschkewitsch), and even the NASA Administrator (Mike Griffin). I demand an independent team, and at that point, the NESC decides to create a supposedly “independent” team to address the anomaly. The name of the new team was the RCC SiC Liberation Tiger Team. The only problem was that the new team was led by a member of the LESS-PRT (Justin Kerr) and was populated overwhelmingly by members of the LESS-PRT (37%) and NESC team members (~43%), who basically worked together with the LESS-PRT; several of the recommendations I made for membership, like Dr. Fran Hurwitz of GRC, were rejected because those individuals had opposed some of the LESS-PRT’s decisions in the past. Only about 20% of the team had any research training and expertise and were external to, and independent of, the SSPO/LESS-PRT/NESC. Every lead position on this team was held by a member of the LESS-PRT and/or one of the key companies working closely with the LESS-PRT and SSPO. The makeup of the team was so lopsided that even the Astronaut Office representative, Jose Hernandez, had been an engineer at JSC who worked on the
LESS-PRT prior to his selection as an astronaut. This is a typical technique used to bias the majority of a supposedly “independent” technical team toward a conclusion that advocates a desired outcome: fly as is. Although I was listed as one of the “Independent Assessment” group, I immediately asked Ralph to have my name removed. I realized early on that they only wanted to include my name as a means to show impartiality (they had no intention of actually listening to my advice, as evidenced by prior history), and I could definitely see what the true purpose of this team was: to create the impression that the LESS-PRT and the NESC had acted responsibly and that their decisions were defensible at some level (basically what we would call CYA: cover your ass). No way would I put my name on such a report. In fact, the few members who were truly objective (Dr. Sandra Walker, Dr. Wally Vaughn, Dr. John Koenig, and Dr. Kamran Daryabeigi) found sufficient fault with this process that they incorporated into the final report a dissenting opinion to the position proposed by the majority of the team. In my frustration with the lack of objective technical oversight of how the 8R anomaly was being handled, I wrote yet another white paper that not only addressed my technical concerns but also included a detailed behavioral-science, cultural, and sociological analysis of the dysfunctional behaviors which continued to persist post-Columbia, in spite of, and in some cases because of, the NESC15. I also redoubled my efforts to prove that the LESS-PRT RDGT was not conservative and that SiC coating with subsurface anomalies registering readings greater than 0.2 Winfrees lacked integrity. Success at last: The newly formed RCC SiC Liberation Tiger Team was rapidly trying to come up to speed on the anomaly, and they relied heavily on the ideas, hypotheses, research, and analyses I had already conducted9,15,44. One of the key defenses of the LESS-PRT at previous FRRs was that the SiC coating chips had sufficient structural integrity that they would not liberate prior to or during Earth entry, even when there were indications of significant mass loss as detected by IR NDE (> 0.2 Winfrees). Dr. Wally Vaughn conducted numerous tests of SiC coating tensile strength as a function of IR NDE readings, as mentioned earlier, and proved the LESS-PRT and SSP claims false once and for all: tensile strength dropped precipitously at IR readings of 0.2 and higher. I was finally able to convince the NESC that we did not have flight rationale to continue flying STS-120 without removing and replacing the discrepant
panels with good RCC spares. The NESC’s position at the FRR for STS-120, to replace discrepant panels with high IR readings, is shown in figure 20 (reference 47).
Figure 20. – NESC recommendations at the Flight Readiness Review for STS-120 (reference 48).
You can see that the NESC was now recommending the replacement of the three discrepant panels with IR NDE indications > 0.2 Winfrees (RCC panels 9R, 12L, and 13R) as the only way to bring the risk down from the unacceptable category to the acceptable category. They also showed calculations, based on my revised assessment of the LESS-PRT RDGT, that all three panels would burn through if the coating was liberated during or prior to Earth entry heating. On October 20, three days before the launch of STS-120, NASA Associate Administrator Chris Scolese sent me and NASA Chief Engineer Mike Ryschkewitsch an email. In it, he agreed with me that we did not have sufficient rationale to fly STS-120 without replacing the discrepant panels. I was asked by Ralph to attend the FRR for STS-120. John McManamen, the SSP chief engineer, had been asked to brief SSP Program Manager Steve Poulos prior to the meeting on the details of the 8R anomaly and our issues with the flight rationale. Unfortunately, John was not up to speed on all the details of this anomaly, and most of the information he fed Steve was incorrect. I had an opportunity to discuss my concerns with Mike Gordon during a break, in a heated exchange on the balcony of the building where the FRR was being held. I pleaded with him to “do the right thing.” Unfortunately, I was unable to persuade Mike, the LESS-PRT, or Steve Poulos and the SSP. Steve stood up and presented the SSP position on why they had sufficient rationale to fly. He bought into the entire story the LESS-PRT was selling and even made some amazing proclamations as to why we were ready based on “past successes” (reminiscent of the “success bias” noted by the Challenger and Columbia
accident investigations)4,5. This would make six flights in a row on which we had flown space shuttles with serious anomalies in critical hardware locations. NASA is now realizing that I am beginning to give public presentations, to organizations outside of NASA, regarding my concerns about the state of the culture I was seeing at the agency. One was to the U.S. Chamber of Commerce Space Enterprise Council meeting in D.C. on October 17, 2006, entitled “NASA Culture.” I made no attempt to hide my displeasure and told all my supervisors at the time about my intention to speak openly. I also began sharing my analysis of the RCC damage growth tool and its inaccurate representation of damage growth. Finally, thanks to my analyses and the support of other respected researchers close to Ralph, the NESC agreed that there was no definitive technical basis to fly as is with panels exhibiting WLE joggle-region damage. On April 2, 2008, I received word from astronaut Bill McArthur that the SSP had made the decision to change out panel 10R on OV-104. It had been a grueling TWO-YEAR battle, but I had finally convinced at least a few people that the story they had been hearing from the LESS-PRT was not exactly true. The members of the LESS-PRT were angry. They continue to believe to this day that they were correct and would have continued to fly without replacing panels with serious anomalies. Flight Director LeRoy Cain (the same flight director who was on console during the Columbia tragedy) called a meeting to console the members of the LESS-PRT. He did not believe the decision to change out the panels was the correct one. At one point, Justin Kerr spoke directly to me and said he felt certain the panels were OK as is, and that the only reason the program decided to change them out was me. Imagine that. Little Charlie Camarda was able to make the SSP change its decision to launch as is and instead spend millions of dollars to do the correct thing: change out bad panels for good ones. Even if that was the only reason and the program really had not learned any lesson, I was happy to do my part to fix a problem that could have potentially cost the lives of my friends and colleagues. I would suffer it all over again to ensure that would not happen.
One last outrage
On May 9, 2008, one day after my 56th birthday, I got a call from Ralph Roe. He told me he had some bad news. He had just heard that the IG (Inspector General) was investigating a report of workplace violence. Supposedly, someone had heard that I threw a punch at Mike Gordon (Boeing lead for the LESS-PRT) and
Supposedly, someone had heard that I threw a punch at Mike Gordon (Boeing lead for the LESS-PRT) and missed during the FRR for STS-120. I said, "What? That is it; that is the last straw. Either this gets cleared up by close of business today, or I am going to the New York Times," and hung up. Moments later, Ralph's boss called from Headquarters, telling me the same story. He said it came from someone at the Cape (KSC), and that Bill Parsons (center director at KSC) and Mike Coats (center director at JSC) knew about it. I told him this was BS and that I was tired of my reputation being trashed by hearsay. I repeated the same thing I had told Ralph: "If you and the IG do not resolve this by close of business today, I am going to the New York Times." He then proceeded to tell me I did not want to do that; he said they also had records of my personal files with the Houston Police Department, which I took as an implied threat. At that point, I totally lost it; I was being threatened, and attempts were being made to destroy my good name. All I could think about were the sacrifices made by my immigrant Italian grandparents and my parents, the values they instilled in me and my brother, and all the hard work it took to build my reputation and my career. I was determined to fight. I was furious now, so I proceeded to ask Ralph's boss, "Did you ever have to deal with someone that doesn't give a fuck? Well, right now, you are talking to someone who doesn't give a fuck! You are talking about my good name and my reputation. Now let me repeat, if you do not get back to me by COB today, I am giving my story to the New York Times, and we will see whom the public believes." (Click, I hung up the phone.) Yes, that is correct: I called the senior executives' bluff and cursed at my boss's boss at NASA HQ! I was fuming. How could this accusation rise to the level of two NASA center directors based on hearsay, without anyone even confirming it with the supposed victim, Mike Gordon? At that point, I called the local IG office at NASA JSC and asked to speak with an agent regarding the case. John Corbett met me in my office; I told him the story and asked him what he thought. I said, "Did you ever hear of someone throwing a punch at someone at a crowded meeting (the FRR had over 300 participants) and missing? And then what? What would the other person do after I missed the punch? Would the fight end, or would you expect a brawl to break out in front of all 300 people at the FRR?" He said I had a whistleblower case and recommended I move forward. I did not. Instead, I put my head down and continued to work for the agency I love for another 11 years, doing everything I possibly could to help save an organization I was watching die a slow death in front of my eyes.

So why the quote from the movie Raging Bull? Well, if you saw the movie, it is a grueling, graphic story of middleweight champion Jake LaMotta, who takes savage beatings in the ring, usually ending in victory but not without major physical injuries.
And when Jake finally suffers terrible punishment against the ropes at the hands of a great fighter he highly respected, Sugar Ray Robinson, he looks up at Ray, bloodied, his eyes half closed, and says, "You never got me down, Ray…you never got me down." Jake fought the best fighters of his era but never went down, never even took a knee. Am I proud of the language I used? Absolutely not. What I am proud of is the fact that I never took a knee. I never succumbed to the pressure, intimidation, isolation, false accusations, and attempts at character assassination. I continued trying to fix the problems within NASA until my retirement in 2019. As Director of Engineering, I tried very hard to instill the precepts of a research culture. I told every employee they had unfettered access to me if ever they had questions or dissenting opinions, even if it meant going around their chain of command. I would not disclose their requests; I would maintain their confidence and trust; and I would carry the fight forward and take the heat, something their prior director, Frank Benz, should have done.

The second half of this book is devoted to the researchers who solved the tough problems following the Columbia tragedy, enabling the shuttle's return to flight and safe retirement. It will step you through what it takes to transform a dysfunctional organizational culture and create a high-performance team. I feel that if I had not explained my experiences in detail, most readers would not believe how far a great organization like NASA could sink, how insidious its slide into cultural dysfunction could be, and how difficult it is to correct. I can sleep well at night knowing I did my part to shine a spotlight on a very dysfunctional organizational culture and spent my remaining career at NASA trying to fix it. I have no regrets.

References
1. Gordon, Michael P.: "Leading Edge Structural Subsystem and Reinforced Carbon-Carbon Reference Manual." Boeing Report KL0-98-008, October 19, 1998.
2. Jacobson, Nathan S.; Roth, Don J.; Rausser, Richard W.; and Curry, Donald M.: "Oxidation Through Coating Cracks of SiC-Protected Carbon/Carbon." NASA TM 214834, June 2007.
3. Presentation of the status of OV-105 nose cap damage by Mike Gordon (subsystem manager (SSM) of the LESS) on 4/22/04, entitled: "OV-105 Nose Cap Damage."
4. Gehman, H. W., et al.: "Columbia Accident Investigation Board." Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
5. Vaughan, Diane: "The Challenger Launch Decision – Risky Technology, Culture, and Deviance at NASA." The University of Chicago Press, 1996.
6. Katzenbach, Jon R., and Smith, Douglas K.: "The Discipline of Teams." In HBR's 10 Must Reads on Teams. Harvard Business School Publishing Corporation, 2013.
7. Corrective Action Report (CAR) TES-3-32-0991: "Post Flight Infrared Thermography NDE Has Identified an Internal Void Not in Pre-Flight NDE." 12/5/05.
8. Initial presentation charts to the LESS-PRT dated 5/1/07 by Clifford Pigford, entitled: "Panel 8R Test Plan."
9. Camarda, Charles J.: "Volume I – Technical – A Question of Flight Rationale for All Space Shuttle Flights Post STS-114 and A Proposed Root Cause of RCC Panel 8R Anomaly on OV-103." White paper, August 2007.
10.
11. Corrective Action Report (CAR) TES-3-28-0622, entitled: "LH RCC Panel #8 on OV-103 Was Damaged on Leading Edge, Substrate is Visible." December 29, 1999.
12. Presentation to the Shuttle Program Engineering Review Board (PERB) dated Oct. 1, 2001, entitled: "Develop a Nondestructive Evaluation (NDE) Method for RCC Components." https://drive.google.com/file/d/1ruCPtVSz3Uf6n9p-S8Jj7_e3_AJIZ74c/view?usp=sharing (TR-21)
13. Corrective Action Report (CAR) (JSC PR# AE2175-0), entitled: "Damage Was Discovered on the Left Hand Leading Edge RCC Panel Number 10." May 30, 2001.
14. Edmondson, Amy C.; Roberto, Michael A.; Bohmer, Richard M. J.; Ferlins, Erika M.; and Feldman: "The Recovery Window: Organizational Learning Following Ambiguous Threats in High-Risk Organizations." In: Starbuck, William H., and Farjoun, Moshe, eds.: Organization at the Limit: Lessons from the Columbia Disaster. Blackwell Publishing, London, 2005.
15. Data presented to the NESC by Mike Gordon, entitled: "Investigation of OV-103 Panel 8R." 5/22/07. (File title "Supporting_NESC_AuditK".) (TR-5 Mike Gordon Timelines) https://drive.google.com/file/d/1SxK0oZkt9Vdzs-pqVeAqpxA0Hjuqq-Vs/view?usp=sharing
16. Camarda, Charles J.: "Shuttle Orbiter Reinforced Carbon-Carbon (RCC) Wing Leading Edge Subsurface Anomaly. Volume II – Culture as Cause: The Cultural, Social, Behavioral, Political, and Organizational Issues Related to the Disposition of Recent RCC Anomalies and Subsequent Construction of 'Risk' and 'Flight Rationale.'" White paper, August 2007.
17. Presentation to the LESS-PRT dated 4/5/07, entitled: "OV-103 Panel 8R Following STS-114 – Update to the LESS-PRT," presenting the Panel 8R status following the second coating integrity evaluation on April 4, 2007, by Phil Gendreau, Mike Stoner, Ali Nasseri, Ken Spinks, and David Wright. https://drive.google.com/file/d/1kwJlca269iflq2IokJO8AV3cQEqsdvuS/view?usp=sharing
18. NASA Engineering and Safety Center (NESC) paper entitled: "Orbiter Wing Leading Edge Structural Subsystem (LESS) Reinforced Carbon-Carbon (RCC) Panel Subsurface Anomaly." June 2007. https://drive.google.com/file/d/1WxQ0iJR8lrQw4K1jzg3GvMLvvdu8Algc/view?usp=sharing
19. Briefing to the Aerothermal Team supporting the CAIB, entitled: "LaRC Computational Aerodynamics – Supporting STS-107 Columbia Investigation." April 23, 2003.
20. Gnoffo, Peter A., and Alter, Stephen J.: "Simulation of Flow Through a Breach in Leading Edges at Mach 24." AIAA Paper 2004-2283.
21. Email from the LESS-PRT with enclosure entitled: "Panel 8R Test Plan," dated 5/1/07, for a special presentation to the LESS-PRT. (TE-2A)
22. Email from Charles Camarda to leaders of the LESS-PRT dated 5/1/07 at 9:33 AM. (TE-2)
23. Presentation by Tammy Gafka on the Wing Leading Edge Impact Detection System (WLEIDS) post STS-114. (TR-19)
24. Email from Charles Camarda to the NESC team dated 5/2/07 at 4:23 PM. (TE-4B)
25. Email from Charles Camarda to Bob Piascik dated 5/2/07 at 12:17 PM. (TE-8)
26. SSP-ISS Recurring Anomalies Review: Appendix C – Review Criteria. (TR-23) https://drive.google.com/file/d/1AfAGEcun05AZSN4HkF-ikROve0R8Khgz/view?usp=sharing
27. Email from Justin Kerr to Curtis Larsen dated 5/23/07 at 2:49 PM. (TE-14)
28. Presidential Commission on the Space Shuttle Challenger Accident: Report to the President by the Presidential Commission on the Space Shuttle Challenger Accident. 5 vols. Washington, D.C.: Government Printing Office, June 6, 1986.
29. Harris, Charles: "Preliminary Findings: Peer Review – RCC Damage Growth Tool." April 22, 2005. (TR-1)
30. Camarda, Charles J.: "Concerns with the Current Rationale for Flight of STS-117." May 22, 2007. (TR-7)
31. NESC Review Board presentation, 5/22/07. https://drive.google.com/file/d/1OWqfaohNvRAEF811bujbGjGW9S0c5-F/view?usp=sharing (TR-24)
32. NESC flight rationale for STS-117, STS-118, and STS-120: "NESC's Data Supporting Rationale for Flight." 9/21/07. https://drive.google.com/file/d/1ElOcUBiv-sHMsd95XV0Qn9QZTBKIxIx5/view?usp=sharing (TR-25)
33. Stafford, Thomas P.; Covey, Richard O.; et al.: "Return-to-Flight Task Group Final Report." July 2005. https://drive.google.com/file/d/1oJvrbvtDc1bqyDXyx3uAx19UAezNWJUP/view?usp=sharing
34. STS-117 prelaunch MMT briefing by Mike Stoner. 6/05/07. https://drive.google.com/file/d/1bcgkX2bSp_Q2D1c-egdLAwwqC4ry2QtW/view?usp=sharing (TR-26)
35. Camarda, Charles J.: "Dissenting Opinions for the Launch of STS-117." 5/22/07. https://drive.google.com/file/d/1OaGAgvBUiL7WSRkyySLg18uqpBu0b4tg/view?usp=sharing
36. Email from Charles Camarda to the NESC requesting the team read the 117-page white paper concerning flight rationale for all Shuttle missions post STS-114. August 26, 2007. (TE-51)
37. Email from Clint Craig to Charles Camarda dated August 26, 2007, at 3:24 PM. (TE-50)
38. Email from Charles Camarda to Mike Ryschkewitsch dated August 26, 2007, at 4:43 PM. (TE-52)
39. Email from Pete Gnoffo to Charles Camarda dated August 31, 2007, at 10:38 AM. (TE-53)
40. Email from Kevin Mcclam to Charles Camarda dated September 12, 2007, at 1:49 PM. (TE-65)
41. Presentation to the LESS-PRT concerning the STS-100/STS-105 Flight Readiness Reviews, entitled: "OV-103 Left Hand WLE RCC Panel 10L Damage." 4/6/01. https://drive.google.com/file/d/1Utr_--ek39WK0BwkIaK4_fa5GaSEUTb/view?usp=sharing (TR-27)
42. NESC Final Report: "Orbiter Wing Leading Edge Structural Subsystem (LESS) Reinforced Carbon-Carbon (RCC) Panel Subsurface Anomaly." August 2007. https://drive.google.com/file/d/1-02HS2Vmt2saR5vXCBYEmCOWLByoPJ2o/view?usp=sharing (TR-28)
43. Presentation of aerothermal CFD analysis of the shuttle wing leading edge by Dave Schuster (NESC): "Shuttle WLE RCC 'Joggle' Region Heating CFD Evaluation." 11/1/07. https://drive.google.com/file/d/1hqZOSJMoSaOw70-G_otNHb18pKSaRj_u/view?usp=sharing (TR-?)
44. Presentation to the NESC by Charles Camarda: "RCC Panel 8R Anomaly Investigation – Implications to Flight Rationale" (Part I). Aug. 8-9, 2007. https://drive.google.com/file/d/1JwYRiE6vehWDrqgqsi9g8EkoNTGOjQZz/view?usp=sharing (TR-29)
45. Presentation to the NESC by Charles Camarda: "RCC Panel 8R Anomaly Investigation – Implications to Flight Rationale" (Part II). Aug. 8-9, 2007. https://drive.google.com/file/d/1mUOwZMVB6uX_2YCsxw2aiq1rprChPH4f/view?usp=sharing (TR-30)
46. Email to Chris Scolese, Ralph Roe, et al., dated August 9, 2007, at 8:53 PM. (TE-48)
47. Camarda, Charles J.: "Evaluation of RCC Damage Growth Tool with Respect to Claims of Conservatism." November 2007. https://drive.google.com/file/d/1RVHlXAyAgFI1vkzYJQ4G4EfH5r1tehLE/view?usp=sharing (TR-31)
48. Camarda, Charles J.: "Shuttle Orbiter Reinforced Carbon-Carbon (RCC) Wing Leading Edge Subsurface Anomaly. Volume II – Culture as Cause: The Cultural, Social, Behavioral, Political, and Organizational Issues Related to the Disposition of Recent RCC Anomalies and Subsequent Construction of 'Risk' and 'Flight Rationale.'" August 2007. https://drive.google.com/file/d/1rxitEwScgNla2HEC2nmHQbUaseMAG4K8/view?usp=sharing (TR-32)
49. NESC presentation at the Flight Readiness Review for STS-120. 10/15/07. https://drive.google.com/file/d/1kOfI5jnqssBvHXLpsP0v_EQxakp8XQoJ/view?usp=sharing (TR-33)
50. Email to Charles Camarda and Mike Ryschkewitsch dated October 20, 2007, at 10:42 AM. (TE-102)
51. Email from Charlie Harris to Charles Camarda dated November 17, 2007, at 9:06 AM. (TE-119)
Part II

Creating a High-Performance Culture – How to Build a Research Culture and Network of High-Performing Teams

The first half of this book spent a lot of time discussing the technical, social, cultural, and behavioral problems of NASA, a once-premier technical powerhouse with creative solutions to some of the most daunting challenges in aeronautics and space. Many of the technical and cultural causes of the Challenger and Columbia accidents have also been found to be similar to the causes of other accidents in various industries (e.g., the Boeing 737 MAX crashes). I described what made NASA great and how it slowly evolved into a bureaucratic behemoth that, like similar high-risk/high-hazard enterprises, struggles to prevent recurring disasters. I went into excruciating detail in some cases to show how even amazing organizations like NASA can become corrupt and how insidious the real root causes of failure, such as dysfunctional cultures, can be. The second half of this book will evaluate several philosophies of accident theory, which may seem contradictory at first but which, in this author's view, are each partly right and partly wrong. I will try to explain what is missing from both and what it will take to transform failing organizations into high-performance organizations.
Chapter 5
Research as a Distinct Culture

Normal Accident Theory (NAT), High Reliability Organizations (HRO), and NASA

Normal Accident Theory (NAT): In 1984, Yale sociologist Charles Perrow wrote a book based on his studies of accidents in high-hazard systems, in which he posits that technical systems that are complex and exhibit "tight coupling" will induce and propagate failures in ways that are unfathomable to operators in real time. Such technologies (e.g., nuclear power plants, petrochemical plants, aircraft and airways, marine systems, etc.) are accidents waiting to happen, and their occurrence should be considered "normal accidents" with huge adverse potential.1,2 You can understand why what came to be known as "normal accident theory" (NAT) would be viewed as a "pessimistic" model among the organizational researchers studying accidents. Perrow, although a sociologist, initially focused on the technical aspects of the problem. He recognized that it was the nature of these technical systems (their complexity and tight coupling) that made it impossible to safeguard against failures happening in multiple subsystems and causing totally unexpected consequences, accidents, and catastrophic failures.

There is a difference between complicated and complex problems. "Complicated problems originate from causes that can be individually distinguished; they can be addressed piece by piece; for each input to the system, there is a proportionate output; the relevant systems can be controlled and the problems they present admit permanent solutions. On the other hand, complex problems and systems result from networks of multiple interacting causes that cannot be individually distinguished and must be addressed as entire systems. That is, they cannot be addressed in a piecemeal way; they are such that small inputs may result in disproportionate, nonlinear effects; the problems they present cannot be solved once and for ever but require to be systematically managed, and typically any intervention merges into new problems as a result of the interventions dealing with them; and the relevant systems cannot be controlled – the best one can do is to influence them, learn to 'dance with them'."3

Presumably, for simple and even complicated problems, we have the requisite knowledge of how each of the individual parts and components will behave, how and with which components they will interact, what the performance of the system will be, and, yes, even how and when it will fail (all, of course, within a reasonable level of accuracy). In other words, there are no knowledge gaps in the technology used in the system; it is deterministic, and performance can be predicted a priori. For example, engineers who study the behavior of solid structures under various loading conditions can readily predict how a structure will deform and how it will eventually fail, provided we know: the loading conditions (forces and moments; environment such as pressure and temperature); how those loads are applied (at a single point or distributed over a length; in-plane tension/compression, bending, or combined); and the material properties of the solid, such as whether it is homogeneous (properties similar throughout) or heterogeneous (properties change with spatial location within the body, e.g., chopped fibers or inclusions added to increase strength and stiffness), whether it is isotropic (properties the same in all directions) or anisotropic (properties change with direction), whether the material is linear elastic (it deforms proportionally to the load and, once unloaded, returns to its original shape along the same path), and whether its properties depend on the rate at which the load is applied. Engineers are taught how to simplify basic natural laws, such as the conservation of mass, momentum, and energy, and some fundamental laws of thermodynamics, to make assumptions and create theories of how a structure behaves, and then go into the laboratory to test their assumptions and validate their theoretical models (more on how this is done in the next section). And so, for a very "simple" structure such as a beam made of a linear elastic material, with a clearly defined load (e.g., a tip load) and well-defined boundary conditions (e.g., it is embedded in a wall so that translation and rotation are constrained), we can accurately predict how that beam will deflect under a static tip load of magnitude P.
That is the job of a structural engineer, trained in the most current methods of analysis and the assumptions appropriate to various structural configurations (plates, beams, trusses, thin-walled shells, stringer-stiffened panels, etc., all of which can behave differently), applied loadings, and environmental conditions. It is the job of the materials scientist/engineer to understand the material the structure is made of at the microscopic, molecular level: how, for example, the crystalline structure of a particular alloy will initiate failure and how those microscopic failures can grow, thus providing insight to the structural designer as to when and if a component will deform into the inelastic or plastic range and how it will eventually fail. Structural designers can then use this information, together with the "assumed" operating conditions for the components they are designing, to limit the operating range (e.g., maximum values and types of loading conditions, safety margins) so that the component will not fail throughout its expected life.

You can easily make even a simple problem complex by making the material heterogeneous and anisotropic, such that its behavior under load can produce coupled deformation modes, which can interact with the environment and produce totally different failure mechanisms and unexpected failures if not properly understood. For example, the structural engineers on the Space Shuttle Program were never able to understand and predict how and why pieces of foam were popping off the external tank, because the foam was heterogeneous and anisotropic, the loading conditions and environment were not accurately understood, and the application method varied depending on the position on the ET. This lack of knowledge or understanding of the technology of a system is called a knowledge gap. Unfortunately, when NASA selected the design for the space shuttle, they knew they did not fully understand each of the technologies and how they would behave and interact, and they hoped they would gain that knowledge during the development phase of the program. In doing so, they incurred and accepted additional "risk."

As the number of parts, components, and subsystems increases, and as the interconnectedness, interactions, and interrelatedness between them increase, the system becomes complex. Components in complex systems are often driven to be multifunctional. Thus, a failure in one component can indirectly affect the operation of a completely different function; for example, fuel cell operation produces electricity for the crew, and a byproduct of the reactions provides a valuable water supply for the astronauts in space. Imagine hundreds of such connections, and it becomes impossible to assess the consequences for other, seemingly unrelated systems, producing signals that can confound even well-trained operators such as flight controllers and astronauts. Managing and understanding such systems often involves dealing with nonlinear dynamics, uncertainty, and challenges in predicting outcomes.
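A little arithmetic shows how quickly this interconnectedness overwhelms intuition (the numbers are illustrative, not a count of actual shuttle interfaces). A system with $n$ components has up to

$$\binom{n}{2} = \frac{n(n-1)}{2}$$

pairwise interactions, so just 100 components admit 4,950 possible pairwise interactions, before counting the higher-order combinations in which three or more components act together.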
Imagine now thousands of parts, hundreds of components, and dozens of subsystems on a vehicle like the space shuttle, all interacting and varying the types and magnitudes of the loads delivered to various components, and you can begin to understand why it becomes impossible to predict the behavior and failure of such a system. Add to this already complex problem a system feature called coupling. Coupling refers to the degree of connectedness or interdependence between different components or subsystems within a system. Take, for example, the gaps between the wing-leading-edge panels on the space shuttle. Tight coupling means that very small, unexpected changes in the interaction of one of the 22 RCC wing leading edge panels and its T-seals could cause increased aerothermal heating, burn-through, and catastrophic failure. Thus, the mechanical deformation of the panels directly affects a seemingly unrelated function: the aerothermal heating the wings will experience.

In addition, when highly technical systems are designed, engineers often follow a process called optimization, in which one or two performance measures are maximized or minimized. For most aerospace systems, weight is one of the properties usually minimized: a lightweight rocket structure enables you to carry more payload to orbit for a given propulsion system. Most often, optimization yields systems that have higher performance but are also very sensitive to small changes in system parameters. Operating such sensitive, complex systems at the edges of the envelope, for maximum efficiency or to meet performance and/or capacity requirements, can lead to disaster. For example, the shuttle's wing leading edges operate very close to their maximum temperature limits (within 200 degrees of the 3,200-degree-F limit of the SiC oxidation-protection coating), and nuclear power plants operate at peak capacity to satisfy the power demands of a city stressed by severe weather. When the coupling between components is tight and direct, very small disturbances can cause rapid, unexpected responses, which makes recovery much more difficult, if not impossible.

When you add humans into any system, complexity usually increases. Imagine trying to predict the behavior of human operators, with their inherent limits in sensing, cognition, perception, and judgment, as they make decisions that control the operation of a complex system. They have limited capability to understand the exact state of the system at any given time. Now add hierarchical layers of management that can impede the flow of information and slow solutions, layers having less technical understanding of the system
but more status/weight in decisions, plus layers of political policy and public scrutiny constraining acceptable actions, and you have a recipe for disaster.

You would have to ask yourself, as did Perrow: if accidents in such systems are normal and expected, why have we not seen more Three Mile Islands (TMIs), Chernobyls, or Fukushimas? Perrow describes how failures can occur at various levels in the design, equipment, procedures, operators, supplies and material, and environment (DEPOSE).1 He defines four levels of failure: 1) part failure, 2) unit or component failure, 3) subsystem failure, and 4) system failure. Levels 1 and 2, he states, could be classified as incidents, and levels 3 and 4 as accidents. There are several ways designers and manufacturers of such systems can build in reliability by affecting any of the DEPOSE parameters, which can minimize the different levels of failure and prevent them from growing from incidents or anomalies into subsystem and entire-system failures. All of these safety features or procedures usually come at a cost (a decrease in performance or capacity, an increase in monetary costs, etc.) and can sometimes lead to additional, unexpected interactions and/or failures and thus reduce safety. For example, increasing the number of engines on an airplane provides a redundant mode to save the vehicle if a single engine fails; however, the probability of an engine failing somewhere increases, and because of the complexity and coupling of the system, a single engine-component failure could degrade another subsystem (e.g., a turboprop blade ruptures a critical hydraulic line) and cause an entire system failure. This cascading of small failures into a full system failure is characteristic of complex, tightly coupled systems.
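Perrow's point about redundancy can be made with a toy calculation (the failure probability here is illustrative, not a real engine statistic). If each engine fails independently with probability $p$ per flight, the probability that at least one of $n$ engines fails is

$$P_{\ge 1} = 1 - (1 - p)^{n}.$$

With $p = 0.001$, going from two engines to four roughly doubles the chance of some engine failure (from about 0.002 to about 0.004); whether the added redundancy makes the aircraft safer then depends entirely on whether a single engine failure stays contained or, in a tightly coupled design, cascades into other subsystems.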
High Reliability Organizations (HRO): Around the same time that NAT was being discussed in the literature, a group of researchers at the University of California, Berkeley (Todd LaPorte, Gene Rochlin, and Karlene Roberts) began a study widely known as the "High Reliability Organizations" (HRO) project.4,5 The quest of these HRO researchers was to study how high-hazard/high-risk organizations that operate complex technical systems can do so flawlessly and prevent drastic catastrophic consequences. Instead of studying failures and catastrophes, they were driven to understand why some organizations seemed to experience successful, safe operation of such high-hazard/high-risk systems for long periods of time. The organizations they studied ranged from elements of U.S. Navy Carrier Group Three (which included two aircraft carriers, Enterprise (CVN 65) and Carl Vinson (CVN 70)) to two American nuclear power plants and elements of the Federal Aviation Administration's air traffic control system.7 Their intention was not to be prescriptive, and thus present solutions to problems organizations like NASA, the U.S. Navy, and Boeing might face, but rather to identify the common characteristics of HROs that might help explain the successes they were experiencing. In some ways, High Reliability Theory (HRT) might be considered a more "optimistic" approach to accident theory. It is no wonder the CAIB viewed the principles and characteristics of HROs as a benchmark NASA could use to help transform itself after the Columbia tragedy.6 However, NASA's questionable safety performance has led several social scientists to question whether NASA, as a large public organization, can ever achieve the status of a high reliability organization, or HRO.2,7 Instead, as one author suggests, at most NASA should become a "reliability-seeking" organization since, "more generally, it is important to develop standards that can be applied to organizations such as NASA, which have to juggle production or time pressures, substantial technical uncertainties, safety concerns, efficiency concerns, and media scrutiny."2

I contend that all the HROs studied are themselves very close to tragic disaster, and if you consider near misses a sign of failure (as some HRO enthusiasts suggest), the classification and distinction of HRO may seem dubious. However, the study of HROs and high reliability theory (HRT) is a very worthy endeavor if it can prescribe (something most academics prefer not to do) ways to develop large technical organizations, responsible for very large, high-hazard, complex, tightly coupled systems, that operate flawlessly or near flawlessly without suffering catastrophic failures. While complexity and tight coupling are used to define the technical problem (e.g., a space shuttle, a nuclear reactor, or an aircraft carrier battle group), these terms can also describe elements of the total environment that encompasses the problem, including social aspects of the system, the government and political system and its policies, the organizational structure, and the regulatory and safety organizations. Hence, the entire socio-political-technical system has to be studied and modeled. As you are beginning to learn, it is impossible to predict the behavior of complex systems. However, there are ways we can reduce the complexity of these systems, loosen their coupling, and vary the organizational structure, operational procedures, sensing, control, feedback, and regulation to help make their management and operation more tractable and safer.
The HROs studied were so hazardous to public safety that they had to be operated flawlessly; otherwise, a catastrophic failure could result in the loss of many lives or disrupt critical services for a large population. The attributes that describe what an HRO is and what characteristics or traits it embodies are debated, but they include terms such as hypercomplexity, tight coupling, extreme hierarchical differentiation, redundancy in control and information systems, an exceptional degree of accountability, and rapid feedback. Researchers studying what it takes to operate such risky technologies could be viewed as taking an "optimistic" outlook, intent on developing strategies to ensure the risk can be managed effectively and such organizations operated safely.

But what actually qualifies an organization to be considered an HRO? Almost every one of the high-risk industries mentioned above has experienced a catastrophic accident, in some cases with significant loss of life. Some academics argue that to qualify for HRO status the organization should not be too large (apparently, an aircraft carrier with 6,000 people qualifies, but NASA, with approximately 18,000 civil servants, would not), failures have to be sufficiently catastrophic (a chemical plant accident killing thousands qualifies; the loss of seven crewmembers on a space shuttle might not), and/or a single failure would have to cause the demise of a particular industry or endeavor (the loss of a single space shuttle crew may not cause the end of human space exploration).7 Author James Casler notes NASA could never be considered to be, or become, an HRO, based on his set of dimensions by which public organizations can be evaluated with regard to HRO characteristics, which include: 1) openness to the external environment, 2) simplicity of objectives, 3) social performance demands, 4) degree of risk, 5) process complexity, 6) organizational redundancy, 7) operational control, 8) professional diversity, 9) organizational learning, and 10) organizational mindfulness.7

I can find fault with, and a counterargument to, almost every one of the dimensions noted above, and I can give counterexamples based on the social and technical causes of past accidents at NASA and on the operating principles which I believe can transform large public or private organizations into high-performance learning organizations that mitigate risk and prevent catastrophic failure. For example, the idea that an HRO should be isolated or closed to its external environment to limit "distracting intrusions and interferences from the external environment," as the "skunk works" was isolated from its parent company Lockheed, might sound reasonable.7 However, the networking of high-performing teams of experts external to the Shuttle Program proved
to be necessary, after the Columbia tragedy, for understanding the criticality of foam impacts and for developing an impact-modeling capability that helped ensure the safety of future flights. Isolating "operators" from distractions during dynamic phases of HRO operation might be appropriate to some degree, for example during launch and entry of the shuttle. However, when anomalies happen during other phases, e.g., when the crew is on orbit, there is usually sufficient time for external analysis and review, as there was for the ET foam impact, to make an intelligent, informed decision.

With regard to the maximum size of an HRO, I agree that as organizations grow, they develop multiple layers of bureaucracy and a hierarchical command-and-control structure. Such a structure can make it more difficult to communicate effectively between the differentiated "silos" (up, down, and across the command structure), and important information can be filtered, distorted, slowed down, and even neglected. However, during Apollo, NASA, although a confederation of multiple centers and cultures with varying goals, was able to operate as a closely coupled network of managers, scientists, research engineers, and technicians because of a mutually shared vision and a strong research culture, which maintained a powerful hold over the managerial and political forces that could otherwise sway safety decisions toward competing values.

Many proponents of HRT contend that most HROs have a singular goal and that safety has to be of the utmost concern because the enormity of catastrophe is so great that it cannot be allowed to happen; and that large public organizations with multiple, broad goals, like the exploration of space, which rely on taxpayer support for funding, would never be able to commit the resources necessary to prevent a catastrophic accident. Some argue that large public organizations like NASA cannot afford to prioritize safety over all other values; they must serve multiple contradicting values, while most HROs exist in closely regulated environments that force them to take reliability seriously.2 And yet all proposed HROs have to meet efficiency and capacity demands in addition to reliability and safety, and we see corruption due to the revolving door between regulatory agencies and industry as a causal factor in accidents like the Boeing 737 MAX failures and nuclear reactor failures like TMI.

Some of the other criticisms as to why NASA can never achieve HRO status include: 1) NASA does not have the requisite "social isolation" necessary to restrict external pressures (e.g., political pressure and public scrutiny, to which it must be responsive federally and publicly); 2) large public organizations are beholden to taxpayers and hence have to be accountable for the efficient expenditure of resources, which would make it unrealistic to ensure the funding necessary
to maintain safety as a primary, solitary goal; 3) the heterogeneity of its culture works against having a consistently strong safety culture throughout; 4) the consequences of a space accident are not sufficiently catastrophic (NASA can lose a crew of seven astronauts, but we would not lose the entire space program); 5) NASA's organizational structure offers little redundancy; etc. Yet the smaller private enterprises studied as HROs, like nuclear power plants, are still subject to external pressures for efficiency, capacity, and public support; they fail, and yet they are assumed to remain within the category of HRO status. Explain to me how air traffic control can be viewed through the lens of a high reliability organization if it allowed two separate jets, commandeered by terrorists, to fly into the Twin Towers in NYC and end nearly 3,000 lives. As we draw the "system" boundary around a larger or higher level (engineers are taught to define the system they are analyzing using a dashed line), for example, from the subsystem (aircraft engine), to system level 1 (aircraft), to system level 2 (air traffic control), to system level 3 (geopolitical ideologies around the world), you can see that one of the causes of 9/11 was the impediments to information flow across several government bureaucracies. I would like to mention at this time another strategy for addressing such complex problems, which General Stanley McChrystal used to fight the spread of Al Qaeda in Iraq, and which he called "Team of Teams."8 In the following sections, we will talk about a research network of teams, which demonstrates how a network approach to rapid knowledge acquisition and problem-solving can exist within a hierarchical organization to reduce such failures of communication and information flow.

I believe it depends on where you draw your dashed lines around the "system" or "organization" as to what you identify as a high reliability organization (HRO). Clearly, some small subunits of the system may appear not to exhibit any failures over extended periods of time; however, when viewed within the context of the much larger system, the HRO label fails. For example, the ATC unit studied by the Berkeley team, the Oakland En Route Air Traffic Control Center, was studied as an HRO; yet if we extend our dashed line to include not only that local unit but the much larger system, we see failures can and do occur. The same is true for organizations in general. Organizations are divided into sub-units or sub-organizations all the way down to a multitude of individual working groups, branches, divisions, and/or teams. It takes only one individual team underperforming in some capacity (e.g., the O-ring working group or the LESS-PRT) for a catastrophic tragedy to occur.
How these individual teams interact socially, learn, share information, and aggregate to solve problems is critical to the safe, successful, and efficient operation of the entire organization. Hence, I have to disagree with discounting NASA as an HRO merely because it has broad goals and mission statements compared to a highly focused HRO such as a nuclear power plant, whose singular stated purpose is safety. If you widen your dashed line a bit around a "nuclear power plant" to include the parent power company, the public, and the government, you will see that the same production pressures that drive NASA, Boeing, Exxon, and Union Carbide also drive the local plant. It is often these external pressures, imposed by politics, governments, and/or regulatory agencies, that have the power to slowly degrade the culture of an organization and, over time, produce the safety issues which cause tragedies. Safety is always important and a driving concern (probably more so the closer you get to the operators on the deck of the aircraft carrier, the astronauts launching in a space capsule, or the operators within the nuclear plant control room), but make no mistake: profit, money, shareholders, politics, and taxpayers/citizens factor into every decision at some level. So, to transform an entire organization and be effective, one has to consider the entire ecosystem. Instead of picking apart every one of the ten dimensions listed above, what I would like to focus on are the common characteristics of many high reliability organizations, to show how and why NASA was one in a prior life (i.e., NACA) and how it and other organizations can do what many say is impossible: transform broken organizational cultures and become high-performance, high reliability organizations.
Qualities of a High Reliability Organization (HRO): A list of some of the outstanding qualities or characteristics of HROs is presented in Table 1.2,4,5,7,9-12

• High technical competence throughout the organization
  • Aggressively search to know what is unknown
  • Constant training
  • Team-based learning strategies such as Crew Resource Management
• Strong culture of safety and reliability
  • Incentives for error discovery
• A constant, widespread search for improvement across any dimension of reliability
  • Analysis of precursor events and a clear demarcation of conditions that lie outside of prior analyses
  • Reward and incentive systems that recognize the cost of failure and the benefits of reliability
• A formal structure of roles, responsibilities, and supporting relationships that can be transformed under conditions of emergency or stress into a decentralized, team-based approach to problem solving
• Diverse in terms of culture, tasks, technology, and education and training, but guided by a singular purpose
• Redundancy in control and information systems
• Exceptional degree of accountability
• Rapid feedback
• Flexible delegation of authority
• Respect for frontline operators
• Communication of the "big picture" to everyone in the organization
• Systems thinking
• Mindfulness – viewing HROs not as closed systems but within a broader environmental context

Table 1 – Characteristics of a High Reliability Organization
As you can see from Table 1, numerous qualities or attributes differentiate HROs from other organizations, and while it is difficult for any of the HROs studied to exhibit all of them to a high degree, we will identify some of the overarching ones: the foundation of a high-performance culture and organization that can sustain exceptional performance at peak capacity and in extreme environmental conditions, and still do so safely. For example, authors Roberts, Bea, and Bartles suggested three of the attributes as keys to enhancing reliability in complex organizations: 1) aggressively seek to know what you don't know; 2) design a reward and incentive system that recognizes the costs of failure as well as the benefits of reliability; and 3) consistently communicate the big picture of what the organization seeks to do, and try to get everyone to communicate with each other about how they fit into it.9 The attributes listed apply at all levels, from the individual to the team to the organization. Take, for example, one attribute: communication.
Communication must be open and transparent and must flow rapidly up, down, and across a hierarchical organizational structure. The overarching environment of the organization has to be psychologically safe; otherwise, fear of bringing issues up the chain of command to the appropriate system leads will result in tragedies similar to Columbia. Leaders who do not possess an aggressive search for knowledge, or what I will later categorize as an attribute of a "research culture," will not understand the importance of what is not known and will default to a mindless, procedural response, as did Linda Ham, who dismissed requests for imagery because she could not identify who was making them. She did not bother to ask the Damage Assessment Team (DAT) or NASA's Intercenter Photo Working Group (IPWG), both of which had been working on the issue, if there was a need. Rodney Rocha, the engineer who voiced the need through the IPWG, stood down for fear of recrimination and did not push for an answer as to why his request for imagery while the crew was in orbit was rejected.

I believe some of the most prescient work in understanding the important characteristics of an HRO was developed by Karl Weick and Kathleen Sutcliffe: "Unexpected events can be disorganizing. It takes both anticipation and resilience to manage unexpected disruptions, a combination of what we call mindful organizing."13 From their studies emerged an implicit pattern of five principles of HROs, which became more explicit as a more varied set of organizations was studied. The five principles are listed in Table 2. When I describe what I define as a research culture, you will understand why I believe it was NASA's loss of its research-culture DNA that led to two avoidable accidents, Challenger and Columbia. A true research culture embodies most, if not all, of the key principles of an HRO. Once a culture is developed and ingrained within an organization or society, it persists even through the changeover of key leaders and managers. For example, the safe, efficient, and successful operation of an aircraft carrier battle group has to be sustained even though "the Navy demonstrably performs very well with a young and largely inexperienced crew, with a 'management' staff of officers that turns over half its complement each year, and in a working environment that must rebuild itself every 18 months." And, in addition, "every 40 months there is almost 100% turnover of the crew, and all officers will have to be rotated through and gone on to other duty."11 I also believe, as Weick suggests, that the name "high reliability organization" can be misleading, implying a static state, as opposed to his preferred term, high reliability organizing. In that sense, all "HROs" should be viewed as high reliability-seeking organizations, as suggested by Boin and Schulman.2
Table 2 – Mindful Organizing and Weick's Five Principles of an HRO
1. Preoccupation with failure
2. Reluctance to simplify interpretations
3. Sensitivity to operations
4. Commitment to resilience
5. Deference to expertise
What Karl Weick and Kathleen Sutcliffe found was that most HROs were "preoccupied with failure"; they understood that the systems they were operating were imperfect and that they had to be ever vigilant and anticipate potential failures. They realized that the natural tendency for groups trying to understand complex systems is to build simple mental models, and that this tendency must be resisted at all costs. Just look at the simple model (Crater) the Damage Assessment Team (DAT) at JSC clung to in a desperate effort to predict whether a piece of foam 400 times the size of any ever tested would cause critical damage to the shuttle's fragile thermal protection system. Instead, NASA should have developed and nurtured skeptics like Rodney Rocha and turned to true subject-matter experts (SMEs), who knew the Crater tool was inadequate and realized the team needed more data, an image of the wing on orbit, and/or an improved and validated physics-based analysis model to make an intelligent decision. The decision Linda Ham and Ralph Roe made while the Columbia crew was in orbit rested on an unfounded mental model: that foam shedding was "in-family" and merely a concern for turnaround and maintenance! Successful HROs maintain a sensitivity to day-to-day operations and even small anomalies
or indications of potential issues. The SiC chips that had been coming off the space shuttle's wing leading edges in the most highly heated regions for over five years before they were discovered following my mission, STS-114, were discounted by leaders of the LESS-PRT as clear signs of aging; they were never looked into, even after being identified as a warning sign of something that could have caused a disaster. Instead, Weick recommends three processes: 1) detection of small failures (HRO Principle 1), 2) differentiation of categories (HRO Principle 2), and 3) watchfulness for moment-to-moment changes (HRO Principle 3).

A Research Culture – What Made NACA and Early NASA Great

Culture has been cited as the cause of several accidents, like NASA's Challenger and Columbia disasters; however, most analysts fail to recognize and understand the one critical culture which, when it is missing, is the root cause of most accidents involving complex technical systems like space shuttles or commercial airplanes like the Boeing 737 MAX. In Chapters 1 and 2, NASA was described as a confederation of diverse, heterogeneous cultures and subcultures made up of 10 geographically dispersed centers throughout the country as well as NASA Headquarters in Washington, D.C. The critical feature missing in the understanding of the cultural causes of accidents to date has been the oversight of a distinct culture within NASA and many other technical organizations, the critical element that made NASA and its predecessor, the National Advisory Committee for Aeronautics (NACA), successful; it is what I call the "research" culture. If you can understand the subtle but very important distinction between a "research engineer"14 and an engineer, you will be able to understand the key differences between an "engineering" culture and a "research" culture, and the real reasons why large technical organizations like NASA and Boeing fail big and are incapable of changing their culture to prevent future disasters. I believe this lack of understanding of the importance of a research culture, and of why researchers must lead the collaboration with the engineering and program cultures within an organization to rapidly identify and fix critical anomalies before they lead to major catastrophes, is the primary reason for disasters like Columbia.

Scientists, engineers, and research engineers: Scientists study the world around them; they develop hypotheses that lead to theories that try to explain what they observe in the natural world. It could be the study of matter, energy, and the forces which interact to cause motion and/or
deformation; airflow over a moving object; or the heat generated by the high-speed flow over an object moving through the atmosphere. Their theories are validated by carefully planned and executed experiments, which they use to corroborate their hypotheses and associated assumptions. The laws that scientists develop are used by engineers to design and build machines to accomplish specific tasks. The difference is stated quite nicely by Theodore von Kármán, one of the great theorists of aerodynamics: "Scientists study the world as it is, engineers create the world that has never been." This is what inspired me to be an "engineer": to use the right, creative side of my brain to create something totally new, something I could only imagine in my mind's eye!

The act of conducting research is the process by which scientists/researchers study a phenomenon they see, develop an analytical or mathematical representation based on their hypotheses, and validate their theories in the laboratory by experiment. True researchers or scientists focus on the unknown; they are constantly trying to improve their theories to predict the behaviors they observe, and they are driven to understand the minutest anomalies or sources of disagreement between their theories and what they observe in laboratory experiments and in the real world. There is another class of engineer, what I am calling a "research engineer."14 A research engineer is an engineer who conducts "applied" or "use-inspired basic research"15 to understand and design new concepts or machines. The research engineering method developed at Langley was "the product of fruitful engineering science: a solid combination of physical understanding, intuition, systematic experimentation, and applied mathematics."16 While pure, basic, or fundamental research is the study of the unknown for the sole purpose of extending the body of knowledge about a subject, applied or "use-inspired basic research"15 seeks to understand the unknown as it applies to a specific purpose or use. Such was the case for the state of aeronautics at the turn of the century, and it was the reason for the establishment of NACA, the National Advisory Committee for Aeronautics. To understand the core ideology and culture of NASA is to understand the DNA of its predecessor organization, NACA, and the transition from NACA to NASA, as explained in Chapter 1.

The Research Culture at NASA Langley When I Was Hired in 1974: At the end of my junior year studying aerospace engineering at the Polytechnic Institute of Brooklyn, Professor Pasquale Sforza asked our class if anyone was interested in a summer internship at NASA. Three hands shot up immediately; mine was one of them. Of those three, two of us received internships at NASA's Langley Research Center: myself and my good friend and classmate, Peter Gnoffo. We didn't know it at the time, but our internships
would set us both up for long careers at NASA. Peter would eventually become a world-renowned leader in the field of aerothermodynamics (the study of the heating caused by high-speed flight through an atmosphere), and I would lead teams of researchers studying and designing structural systems for hypersonic vehicles, eventually getting to fly on one as an astronaut and test some of our ideas in space.

When Peter and I reported to the Langley Research Center in Hampton, we were two curious students who had no idea what our internship experience would be like. Talk about a culture change: we were two New York/New Jersey boys dropped into the heart of the South. I can still remember packing my car and saying goodbye to my little Italian mother, who was crying her eyes out as I left home. And I can still remember my first stop for gas in Delaware along the Eastern Shore. The gas station attendant checked my oil, washed my windows, and filled my tank, and when I reached into my pocket to give him some money for a tip, he waved his hands and said, "No, no, we don't take tips." Oh my God, where was I, and where was I going? A land where people refused tips. This was unheard of. If I had not tipped an attendant in NYC after he filled my tank, washed my windows, and checked my oil, I could just hear the expletives, the derision, and the insults I would have received! This was only the very beginning of my education in cultural differences. I thought that by living in New York I had experienced every kind of ethnic, religious, and socioeconomic culture possible; however, I had really never spent much time outside New York, and definitely not in the South.

After my first week of work at Langley, I knew I was spoiled forever and that I would never be happy in any job other than research. I learned several things that summer, but none more important than this: research is not just a process; it's a culture. True, genuine research requires a unique working culture to support it. I encountered this research culture for the first time at Langley, where I had the pleasure of working for twenty-two years as a research engineer and eventually a technical manager before I was selected to be an astronaut and moved to Texas to work at the Johnson Space Center (JSC).

Figure 2 – Dr. James H. Starnes, Head of the NASA Langley Research Center Structural Mechanics Branch (1981 to 1999)

The people who mentored me at Langley were the tops in their fields: people like Jim Starnes in structural mechanics, Dr. Carson Yates in aeroelasticity, Dr. Ahmed
Noor in finite element analysis, and Dr. Raphael Haftka and Dr. Jarek Sobieski in multidisciplinary optimization (MDO). I spent that summer working with Dr. James H. Starnes (figure 2). Jim, as he liked to be called, was a hulking bear of a man with a modest, cave-like office under a stairwell in an old WWII-era building on the east side of NASA Langley, adjacent to Langley Air Force Base. Every morning, I would find Jim hard at work in his office, which was in total disarray, poring over stacks of computer listings piled on the floor, deep in thought. And almost every morning he would explain to me what it meant to be a researcher.

On my first day, he walked to his tiny chalkboard and drew three ellipses in an oval pattern with double-ended arrows connecting each. In one ellipse he wrote "analysis"; he wrote "experiment" in the next and "design" in the third (figure 3). He explained that this was the scientific research method in its simplest, barest-bones form. He taught me how we begin with a physical observation, a structural concept (a typical example for structures researchers), or a problem; we attempt to model that observation as best we know how, analytically or numerically. We then evaluate our representation of that observation by a test/experiment. More often than not, we "fail," or our model of the physics of the problem is found to be lacking. The errors could lie in our experimental representation of the "real" observation (initial conditions, boundary conditions, physical properties, etc.), or they could lie in our simplified model (simplifying assumptions, numerical model, etc.). We iterate in these two worlds of experiment and analysis (double-ended arrows) until we understand the discrepancies and can correlate our analytical representation of behavior with what we observe in the laboratory to within some desired level of accuracy.
Figure 3. – World view of the construction of knowledge for research and design according to Dr. James H. Starnes.
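Jim’s three-ellipse diagram is, at bottom, an iterative algorithm. The toy Python sketch below is purely illustrative (the “laboratory,” the one-parameter model, and the numbers are all invented, not any actual NASA code); it simply shows the loop of predict, test, fail, revise, and repeat until analysis and experiment correlate within a desired accuracy:

```python
# Toy stand-ins: run_experiment() plays the laboratory; the model is a
# one-parameter idealization that is revised after every failed test.
def run_experiment(load_case):
    return 0.82 * load_case                # what the test article actually does

def analyze(model, load_case):
    return model["knockdown"] * load_case  # simplified analytical prediction

def research_loop(load_case, tolerance=0.01):
    model = {"knockdown": 1.0}             # initial, optimistic assumption
    while True:
        predicted = analyze(model, load_case)
        measured = run_experiment(load_case)
        if abs(predicted - measured) / measured <= tolerance:
            return model                   # correlated within desired accuracy
        # A "failure": revise the simplifying assumption and go around again.
        model["knockdown"] += 0.5 * (measured - predicted) / load_case

print(research_loop(100.0))                # converges toward knockdown ~ 0.82
```

The arithmetic is beside the point; what the loop captures is that the model earns trust only through repeated correlation against the test article.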
Dr. Starnes always stressed the importance of testing to failure. The true test of our understanding of a problem is whether we can anticipate every conceivable “failure mechanism” a priori and accurately predict when failure will occur. It is failure that keeps researchers humble; it is the true test of what we know and the stark reminder of how much we do not know. One of Jim’s notable quotes was “Good judgment comes from experience, but experience comes from bad judgment.” Jim kept a “scrap heap” of his structural mechanics failures lining the walls of the James H. Starnes Structural Mechanics Laboratory at Langley (so named in honor of Jim’s contributions to the world of structural mechanics). He would proudly walk mentees and visitors through his lab and point to failures; then he would tell a story about each and describe in detail all that was learned from those failures, which helped to accelerate his branch’s understanding of how aircraft and spacecraft structures behave and how they could fail! Once we believe we truly understand the physics of a problem and can predict failure, we step into the world of design. We now use automated methods like nonlinear programming and MDO to vary design parameters/variables and wander rapidly and systematically through design space, avoiding constraints until we stumble upon a design that satisfies all constraints and optimizes whatever objective function we are given: minimizing some, like mass or cost, or maximizing others, like speed or range. We then go back to the laboratory to determine if we can reliably predict the behavior and failure of our “optimum” design. Once again, we fail and, in the process, discover that we have exceeded our understanding of the problem by moving beyond the bounds of our prior assumptions into a region of design space where we encounter an unanticipated “failure mechanism.” This process is repeated many times, and each time we fail, we learn, and we develop a much better understanding of the problem and, more importantly, an understanding of our limitations in predicting the behavior of an actual, imperfect artifact and/or an idealized, and imperfect, model. Without the humbling experience that failure brings, there can be no real appreciation of the limitations of our mere mortal capabilities and the increased vigilance it demands. This was the job of a researcher, and this was the career I would choose. You have to think of figure 3 in three dimensions, as a flat spiral. This same process is employed at every level of what Jim would call a “building-block approach,” from the coupon level to the full-scale concept test and evaluation. At each trip around the spiral, as problem size and complexity increase, you improve your understanding of the whole system by validated analysis correlated by experiment.
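The design leg of the loop maps naturally onto constrained optimization. Below is a minimal sketch using SciPy; the panel-sizing model, loads, and allowables are invented for illustration and stand in for the far richer nonlinear programming and MDO formulations described above:

```python
# Minimal constrained-design sketch: minimize panel mass subject to a
# stress constraint. The structural model is a made-up illustration.
from scipy.optimize import minimize

RHO, LENGTH = 2.7, 100.0          # g/cm^3, cm (illustrative aluminum panel)
LOAD, ALLOW = 50000.0, 270.0      # applied load (N), allowable stress (MPa)

def mass(x):                       # objective: panel mass in grams
    t, w = x                       # design variables: thickness, width (cm)
    return RHO * t * w * LENGTH

def stress_margin(x):              # inequality constraint: margin >= 0
    t, w = x
    stress = LOAD / (t * w) * 0.01 # N/cm^2 converted to MPa
    return ALLOW - stress

result = minimize(mass, x0=[1.0, 10.0],
                  bounds=[(0.1, 5.0), (5.0, 50.0)],
                  constraints=[{"type": "ineq", "fun": stress_margin}])
print(result.x, result.fun)        # the "optimum" lands on the constraint
```

Note that the optimizer happily parks the design right on top of whatever constraints it was given, which is exactly why the return trip to the laboratory so often uncovers a failure mechanism that was never modeled.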
It was during that summer internship at Langley that Jim introduced me to one of the most brilliant aerospace scientists I would ever know, Dr. Raphael Haftka, a post-doctoral student from Israel on grant to NASA, working on the latest advances in optimization, which incorporated the aeroelastic failure mode called flutter as a constraint in the automated design process. Flutter was a unique phenomenon: a dynamic instability, usually catastrophic, caused by the coupling of the aerodynamic forces or loads with the structural deflections. Such problems could not be solved separately but had to be solved as a coupled set of nonlinear dynamic equations. Years later, “Rafi” would become my dissertation advisor at Virginia Polytechnic Institute and State University. Another very important element in the construction of knowledge, and a legacy of Jim’s mentorship, is the importance of determining all “failure mechanisms.” To do this, research engineers must adhere to a “building-block approach” to knowledge construction. They must look at their own concepts and theories very objectively and aggressively seek out peer review and critique from others.
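In its generic textbook form, the coupling that makes flutter so difficult can be written as a single matrix equation of motion (this is the standard aeroelastic statement, not necessarily the specific formulation Rafi used):

\[
\mathbf{M}\,\ddot{\mathbf{q}} + \mathbf{C}\,\dot{\mathbf{q}} + \mathbf{K}\,\mathbf{q} = \mathbf{Q}_{\mathrm{aero}}\!\left(\mathbf{q},\dot{\mathbf{q}};\,V\right),
\]

where q is the vector of structural deflections; M, C, and K are the structural mass, damping, and stiffness matrices; and the aerodynamic loads Q_aero on the right-hand side themselves depend on the deflections and the airspeed V. Because the loads feed back into the very motion that creates them, the two sides cannot be solved separately; flutter onset is the airspeed at which the coupled system first admits oscillations that grow rather than decay.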
Figure 4. – Example of a stepwise building-block approach in the design and development of an evacuated, honeycomb-sandwich panel reusable cryogenic hydrogen tank for the single-stage-to-orbit (SSTO) X-33 vehicle.
An example of the stepwise, building-block approach was the development of a reusable cryogenic tank concept for the single-stage-to-orbit (SSTO) X-33 vehicle17 (a schematic of the approach used in the development of this concept
for the SSTO X-33 vehicle is shown in figure 4). My initial idea was to develop a more efficient and reusable cryogenic hydrogen tank for the X-33 vehicle by utilizing a very efficient honeycomb-sandwich structural design and then evacuating the honeycomb (pulling and maintaining a vacuum) so that it would also be a good insulator (similar to a thermos bottle) and would eliminate the need for fragile external foam insulation, which could degrade and shed, as was the case for the external tank (ET) of the space shuttle. This is the essence of a “multifunctional” design feature, and it added a new, complex failure mode. Following the process described in figures 3 and 4, we conducted preliminary thermal and structural analyses to verify system performance and identified all the critical failure mechanisms that would have to be addressed prior to detailed analysis and design. A series of thermal, structural, and operational test specimens and tests were then strategically planned, each to address a particular failure mechanism and/or operational procedure early in the design cycle, before performing a full-scale cryogenic tank test. Small-scale material and structural coupons and components were fabricated and tested to verify material properties and structural limit loads, like the in-plane tension and compression, flatwise tension, and bending tests shown in figure 4, as well as combined thermal and structural load conditions of larger components and their nonlinear interactions. These tests may include interfaces between multiple components, attachments, manufacturing details, etc. The closer the test article approaches the true embodiment and operational use of the concept, the more rigorous the analysis must be in order to predict behavior, performance, and failure. One laboratory test was devised to address a particularly critical failure mode: the loss of vacuum within the honeycomb core and the pumping and condensation of outside air into the honeycomb, forming a highly combustible pool of liquid oxygen (cryopumping), or the leakage of liquid hydrogen through the internal sandwich face sheet. Any liquid trapped within the honeycomb could cause a rupture of the tank as the vehicle climbed through the atmosphere or the liquid level in the tank decreased: the trapped liquid would heat and turn to gas, which would expand and build up internal pressure, blowing the structure apart. To better understand this failure mechanism, we fabricated small 6 in. x 6 in. “ravioli”-shaped graphite-epoxy specimens with a composite honeycomb core and sealed the flattened edges. The specimens were evacuated and cold-soaked in liquid nitrogen to simulate thermal loading during cryogenic propellant tanking. Numerous attempts were made to maintain a vacuum within
the honeycomb, with no success. The microcracks within the composite face sheets and the closeouts and joining features of the panels were sufficient to cause leaks. The same leaks in a full-scale hydrogen tank could cause a disaster, and it was apparent that a honeycomb-sandwich cryogenic tank would be prone to cryopumping problems. This cryopumping phenomenon was also one of the explanations for why foam was popping off during the launch of the space shuttle: voids in the foam could trap air, which could liquefy on the pad and then turn to gas and expand as the vehicle climbed through the atmosphere. These very small and inexpensive laboratory tests could quickly address a critical failure mechanism and give insight into methods to obviate such failures18,19. After I was selected to become an astronaut and was training at JSC, the NASA X-33 Program Lead from MSFC, Gene Austin, called me and invited me to review the X-33 progress. Astronaut pilot Scott Horowitz flew me in his backseat in a NASA T-38 to the Lockheed Skunk Works facility at Palmdale, California, for the review. During the meeting, it became quite clear that the Lockheed team had acknowledged they were using “my” honeycomb sandwich design for the tank but had overlooked one critical detail: They decided that evacuating the complex, multilobed sandwich structure was not necessary. I told them this would be a mistake based on our laboratory tests, which indicated a potential failure mechanism, but I was brushed off by their lead engineer, Paul Landry, a Brooklyn Poly graduate whom I had met during the X-33 selection process. Unfortunately, two years later, as I was sitting in my office at JSC, I read in Aviation Week and Space Technology that the X-33 cryotank had suffered a catastrophic failure during a test in November 1999 (figure 5), and, soon thereafter, the program was canceled. I left a few phone messages on Paul’s work phone; funny, I never received a return phone call from him.

Figure 5. – Damaged X-33 composite hydrogen cryotank.19
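A rough ideal-gas estimate shows why even a little trapped liquid is so dangerous. The Python below uses invented, illustrative numbers (a 1 cm³ honeycomb cell and 0.1 g of condensed nitrogen), not measured X-33 values, and crudely assumes all the liquid has vaporized at each temperature:

```python
# Order-of-magnitude cryopumping estimate: nitrogen condenses into an
# evacuated honeycomb cell during cold soak, then vaporizes and warms
# through ascent. Ideal gas law; all numbers are illustrative only.
R = 8.314          # gas constant, J/(mol*K)
V = 1.0e-6         # cell volume, m^3 (~1 cm^3)
m = 0.1e-3         # condensed liquid N2, kg (assumed)
M = 28.0e-3        # molar mass of N2, kg/mol

n = m / M          # moles of trapped nitrogen
for T in (200.0, 300.0):       # K, as the cell warms during ascent
    P = n * R * T / V          # Pa, assuming all liquid has boiled off
    print(f"{T:5.0f} K -> {P / 101325:5.1f} atm")
# ~300 K gives roughly 90 atm inside a cell that was built to hold vacuum.
```

Tens of atmospheres inside a cell designed to hold a vacuum is more than enough to blow face sheets apart, which is exactly the failure mode the “ravioli” specimens exposed.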
The X-33 Investigation Team, led by Jim Starnes, would conclude that the causes of the accident were: “1) Microcracking of the inner facesheet with subsequent gaseous hydrogen (GH2) infiltration, 2) Cryopumping of the exterior nitrogen (N2) purge gas, 3) Reduced bond line strength and toughness, 4) Manufacturing flaws and defects, 5) Infiltration of GH2 into the core, which produced higher than expected core pressures. (The investigation revealed this phenomenon, which was an unexpected contributor to the failure mechanism.)”19 Another noteworthy observation by the investigation team was that “the design of this system should have been pursued using a building-block design approach, accompanied by in-depth technical penetration to quantify risk. The building-block design approach used did not adequately address the complexity of the system.”19 Jim passed away tragically in 2003 from a brain aneurysm caused by a slip and fall. I do not remember if I ever told him of my attempt in 1997 to reach out and correct the problem I had found with the X-33 cryotank. I am sure he is up there looking down and smiling, knowing I had been a good student and listened to his words of wisdom! This research process to construct and validate knowledge (figures 3 and 4), while seemingly simple, took me a long time to fully learn and comprehend. And what I discovered was that it is not simply this process that makes someone a researcher. It is, rather, a commitment to the culture. Part of the research culture is a willingness to submit to the process almost entirely. But fundamentally, it is a thirst. Jim would stare into my eyes intensely and say, “Researchers have a thirst for knowledge. You HAVE to have that THIRST for knowledge!” Researchers and true scientists have this “thirst,” this burning curiosity for the truth and a passion and determination to understand every anomalous behavior observed in the laboratory, and they are driven to predict how, when, and why it occurred and whether it would occur in the real world: the world where theories are applied to understand and predict the behaviors of very complex systems such as aircraft and spacecraft, high-reliability/high-risk systems whose failure could cause loss of life. In Table 1, this “thirst” is described as “aggressively seeking to know what is unknown,” and in Table 2, it can be understood as a combination of principles 1, 2, and 3. I feel certain that adherence to the values of a healthy research culture, like the one that existed at Langley at NASA’s inception in 1958, and mentorship by true visionary leaders like Jim would have prevented the development of the bad sociological, cultural, organizational, and psychological behaviors which were evident during the Challenger and Columbia disasters and could have prevented
those and other similar tragedies, like the Deepwater Horizon oil spill and the Boeing 737 Max crashes, which will be discussed in Chapter 8. True leaders like Jim had character and integrity. The values of a healthy research culture within an organization include: 1) highly competent employees who are humble and have a strong curiosity and thirst for knowledge, a desire to continuously learn, and a resistance to simplify or satisfice; 2) deference to the person with knowledge regardless of status or position; 3) a psychologically safe environment and culture which embraces aggressive dissent; 4) a strict appreciation of failure and an understanding of how to fail intelligently to ensure rapid, collective learning; 5) open and transparent access to all relevant data and communication; and 6) the encouragement of the maturation of competing theories and analyses to validate and advance the body of knowledge. True “experts,” or the people I would infrequently refer to as “experts,” were researchers who were highly competent yet extremely humble. These are the people I would later refer to as the Friends-of-Charlie (FoC) network. Humility is a necessary attribute because once highly competent people cross the line and become arrogant, they cease to listen and cease to learn all the things they refuse to admit they do not know. They will never be able to grow as individuals who can admit their failings and work collaboratively as high-performing team (HPT) members. Langley had some of the greatest mentors, and it was such a unique place to work. First-level managers like Jim, the branch heads, were “experts” in their fields; however, never once did I hear them or anyone else at Langley refer to themselves or others with that term. These were individuals who were confident in what they knew, yet supremely humble and quick to admit what they did not know. It is only when we cross the line from highly competent to arrogant that we stop learning. Dr. James Starnes was the one person most responsible for the safe acceptance of polymeric composite materials in the primary structures (wings and fuselage) of modern aircraft. Yet, he would correct anyone who would call him an “expert.” There was no “culture of experts” at Langley, unlike at JSC, as I discovered years later. JSC was awash with arrogant program managers, engineers, and flight directors who are still reluctant to admit to the part they played in allowing the Challenger and Columbia tragedies.
Knowledge-Based Hierarchy

A critical component of a healthy research culture is an incessant thirst for knowledge and the natural deference to the person(s) with the requisite knowledge, regardless of their seniority or status. Knowledge at Langley was not assumed to reside with, or be dependent upon, organizational hierarchy. At Langley, it was common for directors, division chiefs, and technical branch heads to defer to the person with “the knowledge,” no matter what their position in the bureaucratic strata. Researchers were trained in the scientific method to validate new knowledge through analysis and experimentation. The construction of knowledge of complex systems was developed using a building-block approach and testing to failure at multiple levels, similar to the very early subsystem and component flight tests of the MSFC/ABMA rocket scientists under Wernher von Braun. Knowledge was vetted by internal and external reviewers and technical editors in peer-reviewed academic and industry journals and in specialty conferences where experts from around the world would meet to share ideas and advancements in various fields. According to Karl Weick, some of the questions an organization should ask when probing whether it is prone to simplify include13:

• “To what extent do people around here take things for granted?
• Is questioning encouraged at all levels?
• Are people encouraged to express different views of what is happening? When they do, do we label them troublemakers?
• Are people shot down when they report information that could interrupt operations?
• Do people listen carefully to each other? Is it rare that people’s views are dismissed?
• Do we appreciate skeptics around here? Do we strive to challenge the status quo?
• Do people show a great deal of respect for each other, no matter what their status or authority?
• When something unexpected happens, do people spend more time conducting an analysis of the situation rather than merely advocating for their familiar view of what happened?
• Do we develop people’s interpersonal skills, regardless of their rank or position?”
After reviewing the above questions and recalling the behaviors that groups like the LESS-PRT, the SSP, and the NESC exhibited in Chapters 2-4, we can see the organizational problems that NASA exhibited quite clearly! We also saw how the culture of “experts” at Johnson Space Center (JSC) weighted the opinions of senior leaders in the working groups and Space Shuttle Program Office (SSPO), like Cal Schomburg, over the legitimate questions and concerns of capable engineers like Rodney Rocha. The culture during Columbia was such that if a person asked tough technical questions of the individuals leading these working groups, he/she was considered a non-team player and viewed very negatively. The mantra was teamwork and a drive toward consensus: the very antithesis of a healthy research culture. These so-called “experts” would rise in power within the organization and would serve to quiet or silence dissenting opinions, leading to a groupthink mentality and resulting in poor decision-making. It was a self-fulfilling prophecy whereby getting ahead meant getting along; hence, there was very little tolerance for dissent the higher you rose in the organization. The dysfunction grew at NASA, and the system naturally weeded out the good people who would stand up and question the status quo.

Psychological Safety

The key ingredient or underpinning of all successful teams, according to Julia Rozovsky and Google’s Project Aristotle20,21, and a quality of all great research organizations, is a psychologically safe environment. It was only after Julia was able to view the Google data regarding the teams they were studying through the lens of psychological safety, a concept proposed by researcher Amy Edmondson22, that she was able to make the connection. Julia Rozovsky realized she was able to study and learn much more effectively when she was in a group where she felt supported and nurtured and was not criticized when she asked questions that openly exposed her lack of knowledge and/or misunderstanding, rather than in a group that would use that knowledge to destroy her self-confidence. Psychological safety is an emergent property of a group or team. It is an environment in which all members feel safe to take interpersonal risks, question authority, openly dissent, and candidly criticize others without fear of recrimination or negative actions22. Think about it: We watch the news today and witness countless researchers and scientists being vilified and even silenced if their views differ from the majority consensus opinions. This is not a healthy research environment. Such an environment will impede curious thought and critique of ideas, critical thinking, and creativity. The COVID-19 pandemic was one such stark example. Who were the real “experts” in the
medical community? Why were citizens vilified for even questioning the studies of the so-called “experts”? It was a non-psychologically-safe atmosphere that caused the vilification of good engineers like Rodney Rocha during the Columbia disaster and Allan McDonald of Thiokol during the Challenger disaster, and that caused them to stand down and allow decisions to be made that cost lives. The careers of both men suffered, as did mine, as we were outcast from the club: social pariahs who were no longer invited to events and shut off from crucial information. It takes more than team surveys and classroom lessons on psychological safety to ensure your organization provides a safe environment and that all employees and team members feel safe. Leaders have to walk the talk, and what better way than to reward people who find problems and speak up. I can remember running at lunchtime at JSC with the head of the Shuttle Program, Bill Parsons, a six-foot-plus ex-Marine, who was shouting down at me as we ran, “Charlie, I tell my team over and over that if they see something wrong, they should SPEAK UP!” I looked up at Bill and replied, “Maybe next time, don’t scream it at them.” Actions speak louder than words: Lead with the behaviors you want to see in your team, and reward those behaviors publicly (Table 1, incentives for error discoveries, and Table 2, preoccupation with failure). When I was Director of Engineering at JSC, I suggested to my boss, Mike Coats, that I wanted to establish two awards: the Max Faget Award for technical excellence and the John Young Award for those people who showed the courage to speak up to authority and were willing to put their badge on the table to do the right thing. Capt. John Young was my favorite of all the astronauts. He was an outstanding test pilot and engineer who worked night and day in shuttle simulators flying abort scenarios to ensure the astronauts were safe. John was never afraid to speak his mind, and everyone at KSC working in the Orbiter Processing Facility (OPF) felt comfortable going to him with any problems they saw regarding safety of flight. They knew they could trust him to keep their names anonymous and carry their concerns to the right people, who would act on them. My requests for both awards were denied!

Intelligent Fast Failure15

The next attribute of a healthy research culture is its tolerance of, and, yes, even reliance on, failure. True researchers know that failure is not only an option; it’s a requirement, one that is necessary to understand the true limits of all assumptions made in the modeling of the phenomena of interest. Researchers are trained to design and conduct experiments to accurately collect the data
necessary to validate their understanding of the problem. Hence, the proposal post-Columbia to conduct a very expensive, full-scale impact test with very minimal instrumentation would never have even been considered by any of the three NASA research centers. In addition, researchers seem almost preoccupied with failure and look at all the possibilities that could lead to failure prior to design and throughout the experimental flight-testing phases. The culture at NASA Langley Research Center was one in which failure was tolerated; according to NASA historian James Schultz, engineers there had “permission to try and try again.” “Learning by repeated attempts may appear cumbersome, but failures indicated areas where further research was needed to improve the understanding of flight phenomena. At Langley, the mistakes were just as important as the successes, for they sowed the seeds of future accomplishment”6. This was the essence of a true research culture, one in which open debate, disagreement, intelligent discourse, and critical thinking were mandatory and protected, even encouraged; hell, they were necessary if we were to determine the REAL root causes of critical anomalies that could cost lives! When the program office had been struggling for over a year to develop a wing leading edge repair technique with no success, I took Don Pettit and, together with some FoCs from Langley, we developed a repair technique in only three months, which I was able to fly on my mission, STS-114, the first flight after the Columbia tragedy. Contrast this with the “failure is not an option” line attributed to JSC Flight Director Gene Kranz in the movie “Apollo 13,” and you can understand the failure-tolerant world of research at Langley as compared to the risk-averse one at JSC, where engineers were steeped in a production culture driven by schedule, budget, and political influence, and where dissenting voices were quickly silenced. While those words were never really spoken by Gene Kranz, the philosophy was deeply ingrained in the culture of the Mission Operations Directorate (MOD) at JSC that Mr. Kranz was instrumental in developing. And while Gene strove for technical competence in his MOD team, that competence turned into a culture of arrogance as the Space Shuttle Program succeeded, a culture that persisted and grew even after two tragic failures. Just think back to the responses made by some of the flight directors post-Columbia (e.g., doubting that the culture prevented people from speaking up, or questioning the intentions of CAIB sociologist Diane Vaughan).
Transparency and Collaboration

A research community operates in an open and collaborative environment that allows the transparent sharing of analytical and test results in enough detail that experiments can be recreated and corroborated or discredited by others in refereed journal publications. This peer review by a trusted community ensures the accuracy of knowledge and helps to highlight any inconsistencies. The research community in question must be one which is psychologically safe and allows for rigorous and aggressive dissent among its constituency. Alternate and even competing methods of analysis and testing are evaluated and compared, the purpose being a more comprehensive understanding of the unknowns and an increasingly accurate methodology for analysis of behavior and prediction of failure. Both the Challenger and Columbia accidents occurred because the working groups responsible for understanding critical anomalies, like the cause of hot gas blow-by past one and then both Solid Rocket Booster (SRB) O-rings, or the effect of External Tank (ET) foam strikes on the Orbiter vehicle, did not have the right team with the correct experience in solving such problems. The “knowledge” was not transparent and vetted by a community of objective researchers; its flow was impeded, and yes, sometimes even filtered and restricted. Hence, the “mathematical” models of the physical phenomena they were trying to understand were woefully inadequate. In fact, in the case of the SRB O-rings, the real root cause of the problem was not really the material properties of the rubber at low temperatures, as played up even by famous physicist Richard Feynman, but the nonlinear structural deformation of the SRB field joints during loading. It was really an inadequate structural design and an overly simplified analysis of joint deformation, which was only successfully corrected when teams of research engineers correctly modeled the structural behavior and redesigned the joint that connected the cylindrical sections of solid propellant. You guessed it: It was the shell experts of the Structural Mechanics Branch led by my mentor Dr. James H. Starnes (and Friends-of-Charlie (FoCs) Dr. Michael Nemeth, Dr. Bill Greene, and Dr. Norm Knight) who correctly modeled the nonlinear structural behavior of the SRB joint, which enabled a successful redesign. I can list dozens of examples when programs were faced with tragedies and daunting technical challenges and were forced to rely on the “researchers” to find and fix their engineering design mistakes. Unfortunately, once problems were solved, programs returned to business as usual, and ties to real researchers evaporated, just as they did post-Columbia.
Encouragement of Competing Theories

The true essence of a scientific and research culture is the encouragement of competing theories and even seemingly “redundant” teams working on competing concepts. The current NASA, in an effort to minimize costs driven by a multitude of pressures (among them a bloated human space program and ineffective business/program management practices), has scaled back and closed many unique laboratory facilities at multiple centers that were viewed as redundant and has sought to have specific capabilities reside at only one center. The reason NACA and NASA were successful in the past is that we had competing teams of researchers advancing the state of the art of multiple technologies simultaneously. These teams sought to optimize their specific concepts and served as excellent objective sources of critique for other, competing work. The metallic and reinforced composite thermal protection systems (TPS) developed at Langley, for example, were always competing with the more fragile, lightweight ceramic foam tiles developed at Ames Research Center. I like to use a term borrowed from a good friend and very creative leader, Dr. Bart Barthelemy: what he calls “competitive collaboration”25. You have the best of the best develop and optimize their ideas and collaborate with competing teams to create new ideas by incorporating elements from each. This cross-pollination is crucial for innovation and also for organizational learning. The reason NASA believed supporting competing technologies was redundant and expensive is that its senior managers use an outdated process for product development and technology maturation. Using the intelligent fast failure and rapid concept development approaches we used at Langley to develop the new technologies we flew on my return-to-flight mission, we accomplished what most program managers thought impossible, in an order of magnitude less time. We failed smart, fast, small, cheap, early, and often, and we learned fast and furiously, similar to what companies like SpaceX appear to be doing today. The question remains, however, whether SpaceX and other commercial space companies truly embrace a research culture, and if so, how long they can sustain it. The process for creating high-performing teams (HPTs) using intelligent fast failure to solve “impossible” challenges will be discussed in Chapters 6 and 7.

References
1. Perrow, Charles: “Normal Accidents: Living with High-Risk Technologies.” Princeton University Press, 1991.
2. LaPorte, Todd R. and Consolini, Paula M.: “Working in Practice But Not in Theory: Theoretical Challenges of ‘High Reliability Organizations.’” Journal of Public Administration Research and Theory, Volume 1, Number 1, 1991.
3. Poli, Roberto: “A Note on the Difference Between Complicated and Complex Social Systems.” CADMUS, Volume 2, Issue 1, October 2013.
4. Rochlin, Gene I.: “Reliable Organizations: Present Research and Future Directions.” Journal of Contingencies and Crisis Management, Volume 4, Number 2, 1996.
5. Boin, Arjen and Schulman, Paul: “Assessing NASA’s Safety Culture: The Limits and Possibilities of High-Reliability Theory.” Public Administration Review, November/December 2008.
6. Gehman, H. W., et al.: “Columbia Accident Investigation Board Report,” Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
7. Casler, James G.: “Revisiting NASA as a High Reliability Organization.” Public Organizational Review, Volume 14, pages 229-244, 2014.
8. McChrystal, Stanley: “Team of Teams: New Rules of Engagement for a Complex World.” Penguin Publishing Group, 2015.
9. Roberts, Karlene H.; Bea, Robert; and Bartles, Dean L.: “Must Accidents Happen? Lessons from High-Reliability Organizations [and Executive Management Commentary].” Academy of Management Executive, Vol. 15, No. 3, August 2001.
10. La Porte, Todd R.: “High Reliability Organizations: Unlikely, Demanding and at Risk.” Journal of Contingencies and Crisis Management, Vol. 4, No. 2, June 1996.
11. Rochlin, Gene I.; La Porte, Todd R.; and Roberts, Karlene H.: “The Self-Designing High-Reliability Organization: Aircraft Carrier Flight Operations at Sea.” Naval War College Review, Volume 51, No. 3, Summer 1998.
12. Rochlin, Gene I.: “Safe Operation as a Social Contract.” Ergonomics, Vol. 42, No. 11, 1999.
13. Weick, Karl E. and Sutcliffe, Kathleen M.: “Managing the Unexpected: Sustained Performance in a Complex World.” Wiley, 2015.
14. Hansen, James R.: “Engineer in Charge: A History of the Langley Aeronautical Laboratory, 1917-1958.” The NASA History Series, NASA SP-4305, 1986.
15. Stokes, Donald E.: “Pasteur’s Quadrant: Basic Science and Technological Innovation.” The Brookings Institution, 1997.
16. Schultz, James: “Crafting Flight: Aircraft Pioneers and the Contributions of the Men and Women of NASA Langley Research Center.” NASA History Series, NASA SP-2003-4316, 2003.
17. Johnson, Theodore F.; Natividad, Roderick; Rivers, H. Kevin; and Smith, Russell W.: “Thermal Structures Technology Development for Reusable Cryogenic Propellant Tanks.” NASA TM-2005-213913, September 2005.
18. Petroski, Henry: “To Engineer Is Human: The Role of Failure in Successful Design.” Vintage Books, April 1992.
19. Anon.: “Final Report of the X-33 Liquid Hydrogen Tank Test Investigation Team.” NASA MSFC, May 2000.
20. Duhigg, Charles: “What Google Learned from Its Quest to Build the Perfect Team.” The New York Times Magazine, Feb. 25, 2016. https://www.nytimes.com/2016/02/28/magazine/whatgoogle-learned-from-its-quest-to-build-the-perfect-team.html
21. Rozovsky, J.: “The Five Keys to a Successful Google Team.” re:Work Blog, November 17, 2015. https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/
22. Edmondson, Amy: “The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth.” John Wiley and Sons, 2019.
23. Matson, Jack V.: “Innovate or Die: A Personal Perspective on the Art of Innovation.” Paradigm Press Ltd., 1996.
24. Edmondson, Amy C.: “The Right Kind of Wrong: The Science of Failing Well.” Atria Books, 2023.
25. Barthelemy, Bart: “The Sky Is Not the Limit: Breakthrough Leadership.” St. Lucie Press, 1997.
Chapter 6
The idea of a “Group,” a “Team,” and a “High-Performing Team” (HPT)

“NASA WAS SpaceX 50 years ago!”
—Dr. Charles Camarda
Teams are how work gets done today. In organizations around the world, work is no longer individual. It is networked and connected. The more complex the work to be done, the more important the team. Why, then, did some NASA teams fail miserably during the Columbia tragedy and the return-to-flight period following it, and yet others, which will be presented, were so successful? All teams are not equal; it is a science to create, nurture, lead, and sustain what I call high-performing teams (HPTs). First, it is important to define what a team is, what it’s not, and how it differs from a typical group of individuals working cooperatively to solve a complex problem. While all teams can be considered groups, not all groups are teams! This chapter defines these terms and focuses on “High-Performing Teams,” or HPTs, detailing the attributes of a team and how these attributes function to define the dynamics and interrelationships that are necessary to monitor team health and predict team performance and outcomes.

What is a Team?

Jon Katzenbach and Douglas Smith1 define a team as “a small number of people with complementary skills who are committed to a common purpose, set of performance goals, and approach for which they hold themselves mutually accountable.” In contrast, a group is a collection of individuals who are usually independent of each other, with different sets of tasks that are carried out
by individuals. Working groups are formed in organizations when information needs to be shared and decisions need to be made. Teams, on the other hand, have a strong sense of purpose and commitment and a shared sense of accountability. They are typically laser-focused on a specific purpose, which may or may not be the same as the broad organizational mission but may be critical to achieving that mission. The output of a high-performing team is a collective product that is usually greater than the sum of its parts. There is a synergy that emerges in an HPT when individual members are focused on a meaningful purpose or mission. This will become clearer when we discuss team mission, the concept of an Epic Challenge, and the attribute of team cohesion. In this chapter, we will use the term “team” to describe not only a single unit with a specific objective within a larger mission but also an interconnected network of teams throughout an organization, which is often necessary to accomplish complex projects and missions. Breakdowns locally, in any part of this network, as was the case with the Leading Edge Structural Subsystem Problem Resolution Team (LESS-PRT), can have very serious consequences, such as the loss of the crew and space shuttle in the Columbia accident. It takes only one of the hundreds of small teams needed to operate a complex system to malfunction to cause serious repercussions that can result in disaster for high-risk/high-hazard organizations. Teams are usually necessary when problems are too complex to be solved by an individual and require a set of diverse skills and disciplines working together in a coordinated or integrated fashion. The need for interdisciplinary and transdisciplinary collaboration and problem-solving becomes more critical as the complexity and coupled nature of the problem grow. Hence, the difficulty, or epic nature, of the problems we will use as case studies, such as the reinforced carbon-carbon (RCC) wing leading edge repair problem, will serve to differentiate HPTs from ordinary teams, whose function may be, for example, to sustain day-to-day operations of a complicated but well-understood system with predictable outcomes.

Communication as an Indicator of Team Strength

MIT researcher Sandy Pentland2 suggests that it is possible to identify successful teams by measuring patterns of communication among team members and between team members and people external to the team. Collecting large sets of sociometric data from teams wearing electronic badges developed at MIT’s Human Dynamics Laboratory and measuring things like tone of voice, body language, and who talked to whom and how much, he was able to predict, with reasonable consistency, the patterns of communication associated with successful teams. This
work had previously been explored by Granovetter in his study of the strength of weak ties3. The “It Factor,” as Pentland calls it, which identifies a cohesive, high-performing team, can be determined by three dimensions of communication: energy, engagement, and exploration. Energy is a measure of the number and nature of exchanges among team members, weighted by value in descending order: face-to-face, phone or videoconference, and email or texting. Engagement reflects the distribution of energy among individual pairs of team members (e.g., members A & B, A & C, and B & C, etc.). Engagement is considered to be extremely strong if all team members have relatively equal and reasonably high energy with all other members. Exploration is the communication that team members engage in outside the team; high energy between a team and external individuals is considered indicative of a high-performing team. Team communication patterns over time can also be used to map and analyze behavior and help identify problems, or “weak signals,” which could be used to engage coaching to intervene and provide helpful insights. That elusive “It” factor, or, as it is sometimes called, team “cohesion,” was also studied by author Warren Bennis. Indeed, what Bennis4 calls “Great Groups” are defined by much more than their patterns of communication. He studied seven such Great Groups that were highly creative and effective and identified many attributes that enabled the magic to happen. They spanned fields such as technology (the Manhattan Project, the Palo Alto Research Center (PARC), Apple, and the Lockheed Skunk Works), entertainment (the Disney troupe), education (Black Mountain College), and politics (the Clinton campaign)4. From these studies and many others, I have compiled a list of attributes and behaviors that affect team performance, which is shown in figure 1. The creation, motivation, and sustainment of an HPT depend upon the many variables shown below and the dynamic interrelationship of those variables over time. In addition, the attributes themselves can be interrelated and interdependent, as can be seen. The complexity of these interactions makes it extremely difficult to monitor and predict team behaviors. The three dimensions of communication researched by Pentland are a beginning and offer some hope for further research in this area. In the following chapters, we will address new work in this area and how we can use artificial intelligence (AI), natural language processing (NLP), and machine learning (ML) to augment our understanding of this complex ecosystem and possibly offer recommendations identifying weak signals of behaviors that could lead to problems.
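Pentland’s three dimensions are straightforward to compute once you have a log of who talked to whom over which channel. The sketch below is a minimal illustration; the channel weights, the log format, and the min/max evenness proxy for engagement are my assumptions here, not Pentland’s published measures:

```python
from collections import Counter
from itertools import combinations

# Each record: (sender, receiver, channel). Channel weights reflect the
# descending value described above; the exact weights are assumed.
WEIGHT = {"face_to_face": 3.0, "call": 2.0, "email": 1.0}
TEAM = {"A", "B", "C"}

log = [("A", "B", "face_to_face"), ("B", "A", "face_to_face"),
       ("A", "C", "call"), ("C", "B", "email"),
       ("A", "X", "email"), ("C", "Y", "call")]   # X, Y are outsiders

pair_energy, exploration = Counter(), 0.0
for sender, receiver, channel in log:
    if sender in TEAM and receiver in TEAM:
        pair_energy[frozenset((sender, receiver))] += WEIGHT[channel]
    else:
        exploration += WEIGHT[channel]   # communication outside the team

energy = sum(pair_energy.values()) / len(TEAM)   # average energy per member
pairs = [pair_energy[frozenset(p)] for p in combinations(sorted(TEAM), 2)]
engagement = min(pairs) / max(pairs) if max(pairs) else 0.0  # evenness proxy
print(f"energy={energy:.2f} engagement={engagement:.2f} "
      f"exploration={exploration:.2f}")
```

In this toy log the A-B pair dominates, so engagement scores low even though total energy is respectable, which is precisely the kind of imbalance the badge studies flagged.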
Figure 1. – Attributes, characteristics, and behaviors of teams that affect performance.
Even though a significant range of variables exists that impact team success, we can identify clusters and attributes that successful teams share. Let’s get started by looking at some of the factors that make for a successful team.

Team Attributes, Characteristics, and Behaviors

Psychological safety: As mentioned earlier, perhaps no attribute is more important to ensuring group success than psychological safety5-7. Psychological safety is a shared belief among individuals as to whether it is safe to engage in interpersonal risk-taking in the workplace. It encourages an environment where employees feel safe to voice ideas, willingly seek and provide honest feedback, collaborate, take risks, and experiment, and are able to engage in constructive conflict without fear of recrimination. In 2012, Google embarked on an initiative code-named Project Aristotle to study hundreds of Google’s teams and determine why some teams stumbled while others excelled8. The researchers struggled to find patterns, and it wasn’t until they discovered the concept of psychological safety in academic papers by authors like Amy Edmondson that things fell into place. Google’s data indicated that psychological safety, more than anything else, was critical to making a team work. As Julia Rozovsky of Google notes: “Psychological safety was by far the most important of the five key dynamics we found. It’s the ‘underpinning’ of the other four9.”
Other case studies, like the attempts to silence dissenting opinions concerning serious anomalies with shuttle hardware only months after the Columbia accident, demonstrate the role psychological safety (or its absence) plays. It is such an important element of high-performing teams that we will spend some time analyzing its effects and discuss ways to unobtrusively monitor the level of psychological safety as teams evolve over the course of their mission, and how to implement corrective actions. Another consideration is that if there is even one team in a large organization that does not ensure psychological safety, the system can experience a catastrophe. When you have a very large hierarchical organizational structure (as described below), it is very easy for mid-level managers and their teams to stray from the core ideology of the organization. Leaders at the highest levels have to ensure every individual has a voice and a path to reach upper management safely without fear of recrimination. Because this is such an important topic for overall team performance, we will be recommending ideas, which we have used in several examples, for preventing such breakdowns in organizational psychological safety. During my tenure as Director of Engineering, I made it very clear I wanted to hear all voices and dissenting opinions. I told each employee I did not care if we were in a large meeting with people outside our directorate, and even external to NASA; if they ever heard me say something that was not technically correct, I would expect them to stand up and correct me. We can all make mistakes, but what I would not want is someone external to our team walking away with misinformation. I would reward people who caught mistakes. One of the first things I did as director was call in Rodney Rocha, the brave engineer who stood up and requested imagery of the Orbiter while the shuttle crew was in orbit. I privately thanked Rodney for having the courage to speak up during Columbia. I then told him, just as I told everyone in my organization, that if they saw something that did not look right and felt uncomfortable speaking up, they could go around the chain of command if they had to, and I would listen to their concerns; we would resolve the issue through a detailed discussion with the technically competent engineers within our directorate, or external ones if needed, and I would maintain their anonymity. After our center director fired me as Director of Engineering for speaking up at the Flight Readiness Review (FRR), Rodney came into my office and thanked me, and then he told me something that was very touching: He told me I was the only person at JSC who had ever thanked him for coming forward and speaking up.
Leadership style: Leadership style is important in shaping group culture and ensuring goals are achieved. Leadership styles vary along a spectrum from a directive style, which issues commands and expects immediate compliance, to a more contemplative one, in which decisions are made in an open forum where information and data are shared after much discussion and thought. For example, a launch-and-entry flight director at NASA’s Mission Control Center would tend toward a more directive style simply because launch and entry are very dynamic phases of flight for a space shuttle, and there is very little time to discuss ideas (it takes only 8 minutes after launch for the shuttle to reach orbit). Anyone who has watched a space launch on television or the movie “Apollo 13” can recall visions of a leader, like Flight Director Gene Kranz, listening attentively to multiple communication loops simultaneously, asking questions, and barking orders to be relayed to his team and the crew in space. His time is precious, and any possible anomaly, or even a weak signal of potential danger, has to be addressed as expeditiously as possible. In addition, the style of communication has to be precise, to the point, and as briefly worded as possible. Almost every possible failure mode has to have been thought of, trained for, and analyzed by the crews and mission control teams prior to launch to ensure success. Once a potential failure mode has been identified, the team on the ground aligns with the one in the spacecraft, and both follow very specific, predetermined procedures carefully laid out on cue cards or in books the crew carries, called the Flight Data File or FDF. Several of the examples I will present deal with high-performing teams that try to do things that are next to impossible and require breakthroughs in technology and/or major scientific advances to succeed. I coined the term “epic challenge” to describe these problems. One such challenge was the development of a hypersonic vehicle that would take off like an airplane from a conventional runway, accelerate to 25 times the speed of sound (Mach 25) through the Earth’s atmosphere directly into orbit, return to Earth unfueled, land, and be able to turn around and fly again in less than a week. The vehicle was called the National Aero-Space Plane (NASP) or X-30, and the program was classified as a secret, limited-access defense program labeled Copper Canyon. I had the pleasure of being one of the elite researchers selected to work on the NASP Program and to learn from the very able and innovative leadership of its director, Dr. Bart Barthelemy10. I also had the opportunity to lead the Structures and Materials Technology Maturation Program, which tried to solve some of the most daunting technical challenges of the day, since the NASP would experience heating
rates over 1,000 times higher than the space shuttle and would require active cooling by liquid hydrogen over close to 50% of its structure to maintain temperatures within operating ranges for even the high-temperature materials we had yet to develop. Yes, you heard correctly: The materials that were light enough, strong enough, and tough enough to survive the flight conditions were not available in even thumbnail-size quantities and, in some cases, had not been invented yet at the start of the program! To be a leader of a team on a quest to solve an epic challenge requires special qualities. If you want to build an HPT, you have to select a leader who has a preponderance of the qualities listed in Table 1:

Table 1 – Leadership Qualities of a High-Performing Team
• Humble, not arrogant – willing to share the glory
• Inspirational – can transform mundane projects into a mission from God
• Flexible leadership style (from directive to contemplative)
• Purveyor of hope/optimistic
• Obsessive attention to detail
• Keen eye for talent
• Provides for and protects team members (acts as a buffer)
• Has a comprehensive understanding of the problem/challenge
• Curator of talent, like a symphony conductor
• Has the ability to let every member of the team see the “Big Picture”
• Maintains a psychologically safe team environment and culture

Great leaders are humble and fully aware of the skills of every person on the team and the importance of those skills for the success of the mission/project. They are very willing to give and share credit for all the successes and equally willing to take responsibility for any failures along the way. I believe that, depending on the circumstances and needs of the team, certain members will have to rise and lead based on their unique skills and the needs of the program/project at any given time. A good leader will recognize those times and, when necessary, either rise to meet the challenge or delegate leadership authority accordingly. One of the qualities of HROs was their ability to be flexible and adapt to changing scenarios, such as emergencies on the flight deck of an aircraft carrier. As explained earlier with the flight director example, there are different styles of leadership, and depending on the nature of the problem or task, leaders may have to change their style for the success of the mission. A “consultative” or
“participative” style of leadership may be required at times to allow team members time to carefully analyze, conduct experiments, and even fail. Yes, fail, because a great leader recognizes the importance of failure for learning and for preventing failures, as long as the experiments and failures are carried out in an intelligent way for the purpose of rapidly closing knowledge gaps critical to the success of the mission. Team leaders also have to be aware of individual and team behavior and recognize that as teams form, they will experience various phases of development11, and leaders must support the needs of the individual team members to advance to a high-performing phase where the team can excel. They have to be able to comprehend the “Big Picture,” to see the problem or challenge from the “hundred-thousand-foot level,” and to understand the individual complexities and disciplines that have to interact, from a “systems” standpoint, to make the project a success. Hence, they have to have a keen eye for talent, and even for personalities, to understand how the individuals they select for the team will fit in. They are, in effect, like symphony conductors. They may not be the most talented musicians, but they can organize and lead a team of some of the most capable musicians to exceed all expectations and create an outstanding musical experience! In addition, they must ensure that every member of the team is able to see and comprehend the grand vision or purpose of the team and understand the critical importance of each of their individual contributions to achieving success. They will bring out the best in everyone for the greater good of achieving that shared vision. It is also a good idea for leaders of high-performing teams to exhibit a sense of humor and be able to identify when teams need rest, relaxation, and fun. As a leader, I have used humor in many situations, often to defuse aggressive team behaviors and confrontation. I have also used humor in self-deprecating ways to show the team that I do not know everything, that I too am fallible and fail, and that this is OK! This also helps promote psychological safety. It is a signal that we are all in this together and that their expertise and judgment matter. A good leader knows when to insert himself/herself into the middle of arguments and ensure that all members of the team play fair and exhibit the utmost respect for one another. Great leaders also have an obsession with what might seem like the minutest details and a desire for perfection or an ideal outcome. They are also able to assess when better can be the enemy of good enough and prevent unnecessary delays without compromising safety or missing program milestones. As you will see in the Return-to-Flight (RTF) On-Orbit Repair Case Study, what was
needed in the early stages of the concept development was a working prototype that could demonstrate the concept and attain buy-in from the shuttle program managers12,13. The prototype we demonstrated was not an exact copy of the actual system with the actual materials; however, it was good enough to demonstrate the idea and convince the program managers it was a viable solution! My first mentor, Jim Starnes, was unassuming and very humble, yet his vision was so grand that he probably had more influence on the development and adoption of composite materials for commercial and military aircraft than any entire aerospace organization in the United States. He formed amazing partnerships in industry and academia across the globe, and when he spoke, people who mattered listened. Jim did not need to use fear, intimidation, or reward to motivate his team; he only had to ask, and people would drop what they were doing and do their very best to support him. Jim Starnes is an ideal model for leaders of all teams whose mission is to advance research and knowledge. Many of the ideas about research, knowledge construction, and transdisciplinary team performance I discuss have their roots in my mentorship and relationship with Jim Starnes. Personality qualities that correlate well with a transformational leader will be discussed later. What I learned while training as an astronaut is that leadership of teams in extreme environments, where there is a high level of risk, such as in space or on the battlefield, is a very special case14. As levels of stress and stressors increase, they affect many dimensions of individual and team behavior that leaders have to recognize in real time, fully understand, and act on with rapid decisions during time-critical moments, say during emergencies. We train as astronauts for hundreds of hours as a crew in numerous simulated environments and situations, sometimes for years, to ensure we can function effectively as a team in extreme conditions. In fact, before we are selected as part of a crew/team, we often experience training in multiple varied groups during exercises such as water and land survival, cold-weather survival, flight training in the T-38 aircraft, and emergency simulations in all relevant spacecraft (e.g., space shuttle, Soyuz, International Space Station (ISS), Extra-Vehicular Activity (EVA, or spacewalks), etc.). Another great leader I had the pleasure of working for in the Astronaut Office, and with during my mission, STS-114, was my commander, Eileen Collins. Eileen, like Jim or astronaut John Young, would never have to “push” you to do something. Great leaders “pull” you. They only have to ask or suggest, and you are driven to work your hardest and do your best to satisfy their highest standards.
Cohesion (the “It” Factor): Cohesion is viewed as one of the most fundamental aspects of teams15. One definition of team cohesion is the tendency of a group to remain united while working toward a goal or to satisfy the emotional needs of its members. It is that certain intangible “something,” or “It Factor,” that creates the strong bonds and esprit de corps among team members, allowing the magic to happen. You cannot put your finger on it, but you know it is there when watching a sports team, listening to a symphony, or sitting riveted in your seat, transfixed as characters interact on a stage. The team is “in the flow,” everything clicks, and synergy and creativity emerge16. Team cohesion is also a function of psychological safety, trust, team culture, leadership, creativity, and the ability to explore ideas. The interrelationships between these many factors, and the dynamic relationships that vary over time, all have a complex effect on team cohesion, which must be monitored and maintained throughout the course of the task or project to ensure a successful mission outcome. The nature of the team challenge or mission is also a critical factor affecting dedication. I have witnessed numerous instances where meaningful, purposeful work galvanized a team to overcome adversity and successfully accomplish the most daunting challenges against all odds. Highly cohesive teams display trust and mutual respect toward other members, assume mutual accountability, are committed to shared goals, display loyalty, communicate openly, equally, and candidly with one another, acknowledge and share the contributions of other members, and make decisions as a group. They achieve high levels of collaboration because their members have a strong sense of identity built on mutual trust and confidence in each other’s abilities. Such teams possess high levels of team emotional intelligence (EI), which is an awareness and understanding of emotions and empathy among members.
Communication: The level, intensity, manner, and type of communication among team members are very important, as mentioned above and by researchers like Sandy Pentland2. Patterns of communication among members of a team, and between teams and external subject-matter experts (SMEs), such as energy, engagement, and exploration, are very important signs that teams are functioning effectively. Teams can also be taught how to monitor communication for signs of trouble and make appropriate corrections in real time. Advances are being made in the analysis of “Big Data” to understand not only the individual but also the collective
intelligence/creativity/behavior of groups and teams; expressions like “Social Physics” have been coined, and algorithms are being developed in the hope of one day producing predictive models of behavior the same way physicists create models of observed phenomena in the natural world2,17. Using social network analysis (SNA), we can potentially ascertain the status of individuals within a team by noting the quantity and tone of communications, with positive tone and higher quantity usually directed at persons of higher status. These same higher-status individuals are then usually perceived to be the most influential. SNA labels, such as influencers, bridges, extroverts, and hermits, are used to identify perceived roles, connectivity within the network, and patterns of communication (a minimal sketch of this kind of analysis appears at the end of this section). In addition, it is possible to build trusted small-world networks of key SMEs, such as the Friends-of-Charlie (FoC) network discussed earlier, which were used to rapidly build and transfer information to creatively solve some of the epic challenges described in the examples which will follow.
Communication is critical to optimum team performance and learning and is highly related to and dependent upon other factors such as psychological safety, trust, cohesion, and personality; it is also highly dependent upon organizational structures, hierarchies, and cultures. Noted sociologist and author Diane Vaughan defines a term, structural secrecy, as “the way patterns of information, organizational structure, processes and transactions, and the structure of regulatory relations systematically undermine the attempt to know and interpret situations in all organizations,” and identifies it as one of the causes that concealed the seriousness of the O-Ring problem before the Challenger accident18. Interestingly, “The Challenger Launch Decision” was published in 1996, the year I was selected to be an astronaut, and it was the first book I read after the Columbia accident. I was the Director of Engineering at Johnson Space Center immediately following the Columbia accident and after my mission on STS-114, and I can tell you even I struggled to understand the maze of boards; the rules, protocols, procedures, and objectives of each board; the organizational complexity; the language used (a compilation of complex technical terms, acronyms, and specialized terms and definitions like accepted risk, waiver, C1, etc.); and the governance structure and organizational hierarchy, not to mention the blizzard of paperwork that had to be read just to try to keep up with the bombardment of super-efficient and brief technical presentations19! Team members must maintain free and open lines of communication, and all team members should be encouraged to learn as much about other related disciplines as necessary. In addition, all information has to be conveyed using simple terminology that is domain- and solution-neutral so that members in other disciplines can understand the ideas/concepts; relate them to similar or analogous ones in their own domain of expertise; and then utilize all their experience and techniques for solving the analogous problems to the challenge being addressed.
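To make the SNA ideas above concrete, here is a minimal sketch using the open-source networkx library. The message log, team members, and role thresholds are hypothetical inventions for illustration; a real analysis would also score the tone of each message, which this sketch omits.

```python
import networkx as nx
from collections import Counter

# Hypothetical communication log: one (sender, receiver) pair per message.
messages = [("ana", "raj"), ("raj", "ana"), ("ana", "mei"), ("mei", "ana"),
            ("raj", "mei"), ("lee", "ana"), ("kim", "ana"), ("ana", "lee")]

# Aggregate the log into a directed, weighted communication graph.
G = nx.DiGraph()
for (src, dst), n in Counter(messages).items():
    G.add_edge(src, dst, weight=n)

# "Influencers": members who receive a disproportionate share of messages.
in_strength = {m: sum(d["weight"] for _, _, d in G.in_edges(m, data=True))
               for m in G}
# "Bridges": members who sit on many shortest paths between other members.
betweenness = nx.betweenness_centrality(G)
# "Hermits": members with very few connections in either direction.
hermits = [m for m in G if G.degree(m) <= 1]

print("most-messaged member:", max(in_strength, key=in_strength.get))
print("bridge scores:", betweenness)
print("possible hermits:", hermits)
```

Even a toy analysis like this surfaces the role categories described above; the hard part in practice is collecting honest communication data without eroding the very trust you are trying to measure.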
Failure, tolerance to failure, and aversion to risk: While a “failure is not an option” philosophy/mantra might suit an “operational” vehicle that supports human life, it can have disastrous consequences when applied to the operation and/or development of experimental or research vehicles. In fact, the two space shuttle tragedies resulted, in part, from organizational behaviors that relied on a “can-do” spirit and a professed belief that the space shuttle was an operational vehicle. My philosophy toward failure is slightly different; I believe that “Failure is not an option…It’s a requirement”! High-performing teams have to understand failure and risk and, most importantly, how to learn from failure. As a NASA researcher and technical branch head, I understood how important failure is for success. Failure is a natural part of learning, and I conducted several courses and lessons teaching professional engineers how to fail, practicing what my friend and colleague Jack Matson (Professor Emeritus from Penn State) calls “Intelligent Fast Failure (IFF)”20. We taught students how to fail smart, fast, small, cheap, early, and often20-23. One of our case studies, the RTF Impact Dynamics Team story, highlights how high-performing research teams can rapidly learn from failure, using an IFF building-block approach to construct knowledge and close knowledge gaps regarding ballistic impacts of external tank (ET) foam on a shuttle RCC wing leading edge.
Failure is a necessity for research and development. Discovery and innovation go hand-in-hand with failure, and failure to permit failure impedes exploration, discovery, and innovation. Failure is also critical in the early conceptual phases of engineering design. Professor Henry Petroski24 describes a good design as one that “obviates failure” by adequately addressing potential failure mechanisms early in the design process. If one understands and can visualize potential failure mechanisms early enough in the design process, it opens up avenues for new ideas that simply and elegantly eliminate these modes of failure before they can ever occur. Studying the histories of past failures of similar or analogous systems provides a starting point for this critical analysis. Not all failures are created equal. As mentioned earlier, Amy Edmondson describes what she calls a “Spectrum of Failure”25, which ranges from “praiseworthy” to “blameworthy” (Table 2). Every research scientist and/or research engineer is taught that the way to understand a new phenomenon is
to iterate through a rigorous analysis/experiment cycle to correlate analytical or mathematical models of behavior and performance with carefully designed and instrumented exploratory experiments. Certain failures are not only better than others, as mentioned above; they are necessary in order to succeed. We have to be creative in designing our experiments so that we can learn from each test as inexpensively and as quickly as possible! Hence, a team’s tolerance to failure is related to the level of creativity that emerges from the team and is also dependent upon a leader who understands this and encourages exploration, discovery, and even failure.
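As a generic sketch of that analysis/experiment cycle (my own illustration, not a NASA procedure), the loop below treats each test as a small, fast, cheap experiment that updates an analytical model until prediction and measurement agree; the physics and numerical values are invented for the example.

```python
def run_experiment(x):
    """Stand-in for an instrumented exploratory test; true behavior unknown."""
    return 2.7 * x

def model(x, k):
    """Analytical model with an uncertain parameter k to be calibrated."""
    return k * x

k, rate, tol = 1.0, 0.05, 1e-3
tests = 0
for tests in range(1, 101):
    x = 1.0 + tests % 3              # vary the test condition cheaply and often
    error = run_experiment(x) - model(x, k)
    if abs(error) < tol:             # model now correlates with experiment
        break
    k += rate * error * x            # learn from the mismatch (LMS-style update)

print(f"calibrated k = {k:.3f} after {tests} tests")
```

Each early iteration is a small, intelligent failure: the mismatch between model and test is exactly the knowledge gap being closed.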
Resilience
In general, resilience is a measure of the capacity to withstand and recover from challenges, pressures, and/or stressors26. It is important to consider resilience “of what,” “to what,” “for whom,” and “over what timeframe,” in what Helfgott terms a “Resilience Framing Cycle”26,27. Are we talking about the resilience of an individual, a team, a biological ecosystem, an economy, an engineered system, etc.,28 to a perceived threat (individual), a hazardous mission (team), an environmental stressor (climate change), or a natural or technological disaster or technical anomaly; for the sake of psychological health, team success, maintenance of biodiversity, financial stability, or mission success; over a specified timeframe? In addition, the definition of resilience can, and perhaps should, be expanded to include not only the robustness/resistance phase, which includes absorbing the disturbance and maintaining acceptable performance, and the stability/recovery phase, which includes recovering from the disturbance and returning to the original
features or performance, but also an adapting/benefiting phase, which involves moving forward to a situation that could be more beneficial and advantageous. Studies of the resilience of complex systems across domains such as ecology, organizations, engineering, economics, and psychology suggest that much can be gained from cross-disciplinary research and the application of interdisciplinary system models29. Warren Black believes we can learn much from how highly resilient natural systems bounce forward to an improved state that emerges after a disruption because of intelligent responsiveness and adaptation30. Teams of individuals constitute such a complex system, in which outcomes are not deterministic but result from numerous nonlinear, coupled interactions, behaviors, and decisions with the surrounding environment and circumstances.
Individual resilience, a person’s ability to bounce back and recover from a severe stressor, can be a function of multiple factors: personal psychological characteristics (e.g., positive attitude, internal sense of control, cognitive flexibility, and emotional stability); level of physical durability and fitness; and social and emotional support. However, a group of highly resilient individuals does not necessarily mean a team formed by this group will be highly resilient. The team’s resilience will depend upon many other defining attributes, such as the team mission (levels of risk, danger, extremeness); the skill mix; collective and individual personalities, intelligence, and creativity; trust; in effect, almost every other attribute listed in figure 1. Team makeup and training can also affect resilience, for example, if members are cross-trained or there are multiple individuals with similar skills in the event the team loses one member due to illness. Teams that routinely have to experience extreme or hazardous conditions in extreme environments, such as firefighters, astronauts, emergency medical technicians, and elite military teams, usually require rigorous training in simulated conditions to develop the requisite skills and reflex responses necessary to survive and sustain performance to successfully accomplish their mission.
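One simple way to see the three phases at work is to track team performance over time. The quantification below is my own illustration, not a formula from the resilience literature cited above: performance drops when the disturbance hits (absorb/resist), climbs back toward the baseline (recover), and may settle above it (adapt/benefit).

```python
# Hypothetical performance history, one value per time step; the disturbance
# hits at step 2 and the team eventually "bounces forward" past its baseline.
performance = [1.00, 1.00, 0.55, 0.60, 0.70, 0.85, 1.00, 1.05, 1.10]
baseline = 1.0
hit = 2

# Robustness/resistance: how much performance was lost while below baseline.
loss = sum(max(baseline - p, 0.0) for p in performance[hit:])
# Stability/recovery: first step after the hit at which baseline is regained.
recovered = next(i for i, p in enumerate(performance) if i > hit and p >= baseline)
# Adapting/benefiting: did the team end up better off than before?
bounced_forward = performance[-1] > baseline

print(f"cumulative loss: {loss:.2f}, recovered at step {recovered}, "
      f"bounced forward: {bounced_forward}")
```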
Stories of experience
I had the honor of being selected as a crew member of STS-114, the space shuttle mission immediately following the Columbia disaster. When we are first selected as astronaut candidates (ASCANs), we undergo extensive foundational training in all the systems and subsystems of the vehicles we will be flying on, living in, and operating, such as the space shuttle, the International Space Station (ISS), the Soyuz, and T-38 jet
trainers. We train as individuals and as teams for thousands of hours in simulated nominal and off-nominal conditions during all phases of operation: launch, on-orbit, abort, and entry. We have specialized training in extreme environments for extended periods of time to help select crews for long-duration missions on ISS and/or for specialized missions such as ours, STS-114, whose primary focus was the evaluation and testing of technologies and procedures that would ensure the survival of future crews in the event of debris impacts to the vehicle that could cause a tragedy similar to Columbia’s.
The “systems” view of the “team” must include the larger view of the “team-of-teams”31 or “network-of-teams,” which is necessary to make such a complex mission succeed. Hence, the “crew” trains and conducts simulations with the Mission Control team at JSC; the Launch Control team at Kennedy Space Center (KSC); the training teams, which cover individual skills such as robotic arm operation as well as team skills during vehicle docking; and the hundreds of individual specialized technical teams which support vehicle health and maintenance. This suggests an organizing framework for examining teams in extreme environments14. The input factors used to evaluate team processes and emergent states include both individual-level and team-level attributes, as well as environmental-level factors that can cause stress/stressors. An example of such team challenges and stressors that require resilience is described in reference 26. Key team processes or emergent states that may be affected by extreme settings and stress are connected with the five primary psychological mechanisms that impact team performance through these processes. The emergent states here include status within the team and the team’s formal and informal hierarchy; team roles and responsibilities and the nature of the work (e.g., whether or not it is viewed as meaningful); team cognition and communication (e.g., how adept the team is at learning new concepts and sharing that knowledge); and interpersonal relations (affected by individual and collective personalities and levels of emotional intelligence). The outcomes can be monitored, measured, and used as an adaptive feedback loop to effect positive intervention and success. Team leaders, for example, can identify when team focus narrows during times of stress and take appropriate action before potential catastrophes occur; consider, for example, airline accidents in which the pilot and co-pilot lose situational awareness while laser-focused on
a local instrument reading and miss the mountain looming immediately in front of them.
I was surprised to see firsthand how effective team training simulations in extreme environments were in identifying interpersonal and personality issues among crewmembers. During our cold-weather survival ISS expedition training in Cold Lake, Canada, our team of five crewmembers (ranging from senior astronauts to astronaut candidates (ASCANs) like me) was airlifted to a frozen lakebed with provisions for seven days and nights, and physically, psychologically, and cognitively stressed and monitored as we performed simulated ISS tasks. Leadership roles were rotated during the mission; we were monitored by a group of Canadian military personnel (unbeknownst to us at the time) as well as by a simulated Mission Control team onsite via walkie-talkie; and feedback was provided at the end of the weeklong session. It was an eye-opening experience for me to see fellow classmates vie for the attention of the senior shuttle commander on the team with the hope of gaining an early flight assignment at the expense of other crewmates, breakdowns in leadership under stress, and the total chaos one evening during the team’s failed attempt to light the stove while I was on an early-morning emergency scouting mission with a fellow teammate, astronaut Dave Wolf. In only one short week under very realistic hardship conditions, it was obvious which crewmates I would want to spend six months to a year on ISS with, which ones I would trust with my life, and which ones I probably would not want to fly with. After my experiences in extreme survival training, I would fly in a heartbeat with Dave Wolf, Rich Linnehan, and many others of my astronaut colleagues.
Diversity – Team diversity is an essential ingredient for many reasons and for many different team missions. Diversity in thought, cognitive modes and processing, expertise, culture, ethnicity, race, religion, personality, etc. is critical for a high-performing team to solve complex problems and overcome disruptions creatively. How a team learns, and how deeply it dives into a problem and immerses itself in it, is critical. Looking at the problem objectively through many lenses and experiences can expand the design space to new ideas and solutions. To avoid failure and to address critical failure modes early in the conceptual design process, teams must constantly assess their core capabilities and understand when it is necessary to add an external subject-matter expert
(SME) or complementary skill to the team to accomplish the mission or desired outcome. In several of the case studies, it became obvious that the existing teams that were formally in charge of solving a challenge were struggling and, in the case of the RCC on-orbit repair of the wing leading edge, it was necessary to initiate a small covert team of key researchers and technicians to unobtrusively explore design options not being considered and also to address the reasons why the formal team’s solutions were failing! Team members have to be tolerant of diversity, share mutual respect, and recognize the importance of “listening” to each member’s ideas and points of view. Imagine a homogeneous team composed of members with a very low tolerance for risk or failure. Chances are that this team will struggle if its mission is to develop a creative, breakthrough solution to a very challenging problem!
Diversity of individual team members, and diversity of skills within an individual (e.g., multidisciplinary skills), are critical, especially for missions that are constrained to just a handful of team members. For example, future space missions that go deeper and deeper into space will have time lags in communication due to the sheer distance the electromagnetic signals have to travel, limited as they are to the speed of light. Astronauts on the surface of Mars may have round-trip communication delays of up to 40 minutes. For these missions, teams may be limited to as few as four members, will have to operate autonomously, and will need all the requisite skills to solve problems without the assistance of Mission Control. Hence, having a member of the crew who is a specialist in only one portion of one field could be very limiting. Personally, if I were on a Mars mission, I would love to have as my crewmate a medical doctor like my classmate, Dr. Lee Morin, who can not only perform surgery but is a computer genius, has a PhD in engineering, and can build and operate complex equipment: truly a “multifunctional” human if ever there was one!
Team size and makeup – Selecting the members of a team is critical for success and is highly dependent upon the type of team, the mission of that team, its size, and the outcome you would like to achieve. The size of the team should be large enough to accomplish the mission but not so large that it impedes agility, speed, and effectiveness. Depending on the mission or phase of the mission, the team size can be flexible, having a “flexible core”32 which grows and shrinks to accommodate the needs of the “team” as a function of time and phase. The development of a complex product, such as a space vehicle, requires networks of teams with varied skills and performance objectives throughout the product lifecycle. The team responsible for the conceptual design phase has to be very
creative, open, and collaborative, take chances, and be willing to explore numerous innovative ideas rapidly and simultaneously. Creativity and the ability to accept/tolerate risk and explore the design space are some of the most important attributes in this phase of the design. Once a final design is selected, other teams of engineers are required to conduct rigorous analyses and tests to validate performance and ensure successful fabrication and operation. The same was true for the two teams we created to solve the on-orbit RCC wing leading edge repair problem.
Members of a high-performing team (HPT) or great group are driven to excel and to achieve the unimaginable at all costs, without any fear of failure. They are motivated by the impossible and accept pressure as a means to stoke their creative genius. They can juggle numerous tasks simultaneously and always keep the end goal in sight. They must be selfless and open to sharing ideas and credit with teammates, and they must be curious and have an insatiable thirst for knowledge. High-performing team leaders have to be very selective in the team members they choose so as not to poison the environment and culture of the team. They must also be willing to excise non-high-performing members from the team if necessary. Team members must be open to diversity and recognize the importance of “listening” to each person’s point of view and perspective. They are DOERS and tinkerers who love to try new ideas. Above all, all team members must be able to work together collaboratively and cooperatively and get along!
Environment – The environment within which the team is immersed and must operate is important in determining team configuration. We must consider all aspects of the environment (physical, emotional, psychological, and virtual) to ensure we understand all possible interactions and potential problems. We touched on “extreme” environments, which can impart high levels of danger and risk to individuals, cause stress, and negatively (and sometimes positively) affect how the individuals on the team, and the team as a collective entity, handle that risk. The environments you create for training and simulation often depend heavily upon the specific mission and objective of the team. Workspaces that foster collaboration and effective communication, in the physical world as well as the virtual one, have to accommodate three “affordances,” according to Anne-Laure Fayard: proximity, privacy, and permission33. If these affordances are not carefully moderated and measured, they could potentially have the opposite of the effect you are trying to achieve. For example, if an open workspace with opportunities for random interactions is what you are trying to create to encourage a creative solution to a difficult
problem, the lack of privacy may limit key interactions and discussions. Creativity seems to flourish in the aesthetically pleasing environments of Google and of innovative product developers at IDEO34, where, in addition to the physical environment, the affordances of permission to experiment and fail are crucial for conceiving creative solutions to design problems. Affordances can be as simple as the handle on a door: the handle allows people to enter; permission gives them the right to open it. The environment is coupled with other attributes to varying degrees depending on the mission, culture, personality, leadership style, and psychological safety. Brilliant and creative solutions to very complex problems were also conceived in the most austere environments: at Los Alamos for the Manhattan Project, in garages for PC startups like Apple4, and for the on-orbit wing leading edge repair team12,13. My own theory for these last three examples is that the “epic” nature of the problem, and its extreme importance to the nation, the company, and the agency, respectively, trumped the environment and forced the needed collaboration to happen.
Creativity – Imagination is the “capacity to conceive something that does not exist… the ability to conjure new realities and possibilities”; in John Dewey’s words, “to look at things as if they could be otherwise.”35 The premise is: “The general assumption is that a will to act must precede imagination, that you decide to do something before you imagine what it is. The reality is that imagination comes first. It must. Until and unless we have the emotional and intellectual capacity to conceive of what does not yet exist, there is nothing toward which we are to direct our will and our resources.”35 Einstein regarded imagination as the ultimate sign of intelligence and is quoted as saying, “Logic will get you from A to B. Imagination will take you everywhere.” Creativity can be defined as the mental process involving the generation of new ideas or concepts, or a new association of the creative mind between existing ideas or concepts. Hence, if imagination is the capacity to conceive that which is not and, according to reference 35, creativity is imagination applied, then I contend that innovation is the process by which imagination and creativity become relevant! What differentiates an invention from an innovation is that while an invention is the transformation of knowledge into new products, processes, and/or services, an invention is only considered an innovation if it is noteworthy, leads to widespread use, and benefits society! Imagination provides the spark, creativity is the fuel, invention is the engine, and innovation is the vehicle that provides value. Depending on the objective, say transportation, and constrained by a set of requirements, the vehicle (literally and figuratively) could be a bicycle, a car, a boat, a plane, etc.!
Depending on the specific mission, the most important attribute in determining a team’s success can be its ability to think creatively and to develop breakthrough ideas to solve seemingly impossible challenges. Creativity is directly related to environment, leadership, the makeup of the team, diversity, psychological safety, culture, tolerance to failure, communication, and personality. The leader of a creative team must create an environment that allows the creative individuals on the team to flourish. According to Warren Bennis, “There are two ways of being creative. One can sing and dance. Or one can create an environment in which singers and dancers flourish.”4 The leader must also be able to select the right mix of team members to make the “collective magic” happen: that team cohesion, that esprit de corps, which allows the synergy that enables the genius to emerge effortlessly. Leaders of creative teams also have to know how to be “technology brokers” and “develop a strategy for exploiting the networked nature of the innovation process.”36 This idea of a network, or network-of-networks, to connect the right people and teams and create the right physical and virtual environment will be discussed later using several case studies, such as the on-orbit repair of the space shuttle wing leading edge. In 2008, together with faculty from MIT, Georgia Tech, Penn State, and several NASA SMEs, I developed a methodology called innovative conceptual engineering design (ICED) to infuse innovation into the engineering design process and to teach young NASA engineers how to work together in teams and develop innovative, breakthrough solutions to difficult engineering challenges12,22. We would teach students how to work together and communicate, and also instruct them in several methods for enhancing their individual and collective creativity. Some of the techniques we taught included functional problem decomposition, brainstorming37, the theory of inventive problem solving (TRIZ)38, and biologically inspired design39. Many of the ideas we developed for teams whose mission is to develop breakthrough solutions to “epic” challenges are an integral part of the ICED methodology. I used this same philosophy when I developed teams to work on several of the Columbia accident and return-to-flight challenges (Chapter 7).
Organizational structure and governance – Teams, even very small teams, can have a formal or informal organizational structure and rules of governance. More often than not, numerous teams in a large organization are interconnected in a network to carry out the complex business of the company/organization. Even large hierarchical organizations like an aircraft carrier group, NASA, and/or Boeing matrix their workers across the organization to work on specific projects and programs. The parent organization, for example, the structural
mechanics division or the carrier flight team, provides mentorship and training within the parent group, while the worker is rated on his or her performance on a specific project or program. The organizational structure can have a major effect on the efficacy and response of a team in performing its critical mission. In many large, established organizations, a hierarchical structure develops, which is called a “Command” structure in General Stanley McChrystal’s book “Team of Teams”31. In such organizations, the flow of information up, down, and across teams and suborganizations is drastically impeded, slowing response time and the sharing of critical information. Depending on the type of organization or the organizational culture, there can be very strict rules and procedures for how that information is passed through the layers of the hierarchy. In military organizations, for example, there can be very serious consequences for “breaking the chain of command” and going around your in-line supervisor to take your ideas or complaints to someone higher up in the organization or to one with “more status.” How information is vetted and decisions are made may require very strict protocols and rules of governance. Remember the concept of structural secrecy and the tortuous path of the Space Shuttle Control Boards shown in figure 6 of Chapter 2. While these rules, processes, and procedures are often necessary to ensure safety in the operation of very complex hardware in extreme environments, strict adherence to these very rules, without the use of critical thinking and sound judgment to question “weak signals” of anomalous behavior, can often be the very cause of the accidents and tragedies you are trying to prevent, as in the cases of Space Shuttles Challenger and Columbia18,19.
Many organizations are looking at other product development strategies to stay commercially competitive in the global marketplace. Even government organizations whose bottom line is not profit, such as the Department of Defense (DoD), are looking at innovative strategies such as set-based design to rapidly explore the design space, evaluate numerous design concepts, and learn and close knowledge gaps as quickly as possible before down-selecting a final concept to manufacture and produce40. Unfortunately, NASA has devolved into an organization that cannot seem to get out of its own way to innovate and produce breakthroughs in science and technology at a reasonable cost and/or within an acceptable timeframe like it used to during Apollo! NASA’s culture during Apollo allowed brash young engineers to circumvent the chain of command and propose disruptive ideas like John Houbolt’s “Lunar Orbit Rendezvous” (LOR) and have them selected. If not for great leaders like Wernher von Braun, who recognized LOR as a better idea than
their own, it is doubtful NASA would have been successful in its moon mission. In addition, the average age of employees at NASA today is more than twice what it was during Apollo, and over 50% of those employees are eligible to retire. Interestingly, the average age of SpaceX employees is approximately the same as that of the NASA Apollo team: 26 years old! NASA is also sorely lacking critical technical expertise, a result of drastic decreases in applied research funding over the past 30-plus years. This is not an uncommon occurrence for large corporations. As they grow, they tend to become more bureaucratic and hierarchical and somehow lose their “core ideology”41. Companies find different strategies for maintaining the startup DNA that was responsible for their initial success and growth. Some large companies, like Lockheed, created totally separate sub-organizations and subcultures, like the Lockheed Skunk Works, to develop “breakthrough” solutions to impossible problems4,42. The Lockheed Skunk Works was unique in that it worked on top secret/classified programs, which forced it to stay isolated from the parent company yet also enabled it to create and maintain a unique culture and innovative product development strategy. Later in the book, we will present ideas that companies can use to develop innovative subcultures within their organization that can spread and scale as needed.
Team personality – When forming a team, individual personalities should be considered, as well as how those personalities will work together as a unit: the collective team personality. Individual personality has been studied in its relationship to leadership and, in particular, to determine which personality attributes are important for transformational leaders. Great or transformational leaders obtain support by inspiring followers to identify with a vision that reaches beyond their immediate self-interests. The Five-Factor Model43, or “Big Five,” has revolutionized personality psychology, replacing instruments such as the Myers-Briggs Type Indicator (MBTI)44. The Big Five traits are broad personality constructs that are manifested in more specific traits and are represented as follows: 1) Extraversion – the tendency to be outgoing, assertive, active, and excitement-seeking; 2) Agreeableness – tendencies to be kind, gentle, trusting/trustworthy, and warm; 3) Conscientiousness – indicated by achievement and dependability, and the trait that best correlates with job performance; 4) Emotional Adjustment – often labeled by its opposite, Neuroticism, a tendency to be anxious, fearful, depressed, and moody; and 5) Openness to Experience – a tendency to be creative, imaginative, perceptive, and thoughtful. Openness to experience is the only one of the five factors that correlates well with intelligence. Three of the Big Five personality traits that correlate with
transformational leaders are, in order of importance, agreeableness, extraversion, and openness to experience45. Studies have also been conducted on a collective personality measure for teams or groups of individuals. While individual personality is an intrapersonal phenomenon with a foundation in biological and cognitive processes, collective personality traits are inherently an interpersonal phenomenon based on shared routines that emerge from interactions46. According to reference 46, “as individuals in a collective work together, they begin to develop shared expectations and norms that, in turn, lead to the emergence of observable behavioral regularities.” In some ways, the collective personality of a group or team can be related to its culture. A collective personality profile was developed using the adjective-based measure of the Big Five and was used to evaluate attributes like team leadership and performance (a toy aggregation sketch follows at the end of this section). Some initial findings indicate that transformational leadership is positively related to collective team personality attributes like agreeableness, openness, extraversion, and conscientiousness. Transformational leadership, in turn, inspires, motivates, and helps create and enhance a healthy team culture that cascades throughout the organization. Conscientiousness and agreeableness were found to be related to an increased consistency in team performance. There is much more to be studied and learned about personality (individual and collective) and its effect on team behavior and performance. One thing to note is that when considering the collective personality of a team, one must consider the changes in that collective personality over time. While most individual personalities do not change appreciably after adulthood, teams gain and lose members over time, and the rules and norms that govern collective behavior can also change.
Remember my first experience with the idea of a “collective” personality, during my senior management training at NASA’s Wallops Flight Facility on the Eastern Shore of Virginia, when all the other NASA Centers described the personality of JSC as “arrogant.” Two weeks after the Wallops training, I submitted my application to be an astronaut, was selected, began training, and instantly recognized the arrogant culture my Wallops classmates had foretold. Years later, I would distill the cause of the Columbia tragedy from the over 40 sociological/psychological/behavioral terms noted in Chapter 2 to just one: ARROGANCE. Cal Schomburg, the LESS-PRT, and the shuttle tile TPS community did not understand the physics of the foam debris strike, yet they overconfidently responded with a resounding answer that it would definitely not cause any critical damage. How accurate the rest of the agency was in capturing the core culture of JSC almost nine years before the Columbia tragedy!
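As promised above, here is a toy aggregation sketch. It is my own simplification for illustration, not the adjective-based measure of reference 46, which treats collective personality as an interpersonal phenomenon rather than a simple average; the members and their 1-5 scores are hypothetical.

```python
import statistics

# Hypothetical 1-5 self-report scores for each member on the Big Five traits.
team = {
    "ana": {"extraversion": 4, "agreeableness": 5, "conscientiousness": 4,
            "emotional_adjustment": 3, "openness": 5},
    "raj": {"extraversion": 2, "agreeableness": 4, "conscientiousness": 5,
            "emotional_adjustment": 4, "openness": 4},
    "mei": {"extraversion": 5, "agreeableness": 2, "conscientiousness": 3,
            "emotional_adjustment": 4, "openness": 5},
}

# Summarize the collective profile and flag traits where members diverge.
for trait in next(iter(team.values())):
    scores = [member[trait] for member in team.values()]
    mean, spread = statistics.mean(scores), statistics.pstdev(scores)
    note = "  <- members diverge" if spread > 1.0 else ""
    print(f"{trait:>20}: mean {mean:.1f}, sd {spread:.2f}{note}")
```

A leader might read the flagged rows as the places where shared norms, and hence a stable collective personality, will take the longest to emerge.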
In addition to my experience at NASA creating and working on teams of professional engineers, I have also spent the past ten years creating and analyzing hundreds of teams of students in the United States, Finland, and Australia, working to solve challenging engineering problems. We monitor these teams of students as they mimic actual product design and development processes to understand the effects of attributes like team personality and culture on team creativity, critical thinking, performance, etc. I initially used personality tests like Myers-Briggs44, cognitive modes surveys48, and other instruments to measure individual academic skills, cognitive modes, creative abilities, 3-D visual thinking, computer-aided design (CAD)/computer-aided manufacturing (CAM) abilities, and building/fabrication skills to pre-select students before placing them on teams. Working with Boeing’s AerosPACE (Aerospace Partners for Advanced Collaborative Engineering) students, we currently use the Big Five personality test and 3-D visual thinking tests to help pre-select individual teams.
Cognition – Cognition is “the mental action or process of acquiring knowledge and understanding through thought, experience, and the senses.”49 There are many theories of how individuals think, learn, and sense the physical world around them. Psychologists have developed various methods for measuring what is often called “general intelligence,” or “g,” from performance on a wide variety of cognitive tasks. This is sometimes referred to as the intelligence quotient, or IQ. Researchers have also studied how groups of individuals or teams make sense of the world and how they create knowledge to better understand unknown phenomena and predict observable outcomes50. This collective intelligence, or “c,” is a measure of the ability of a group/team to perform a wide variety of tasks50. What was discovered was that the results “support the hypothesis that a general collective intelligence factor (c) exists in groups.”50 It was also observed that the average and maximum intelligence scores of individual group members are not significantly correlated with c! In other words, groups composed of members with the highest IQs did not score the highest collective IQ, or c! Also, collective intelligence did not seem to be positively correlated with team attributes like cohesion and motivation, but it did seem to correlate well with 1) the average social sensitivity of group members (positively), 2) the variance in the number of speaking turns taken by individual group members (negatively), and 3) the proportion of females in the group (positively). There is still much more research to be done in the area of collective intelligence and team sense-making; we discuss the knowledge construction of different teams and compare and contrast results in Chapter 7.
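To illustrate the idea of c (in the spirit of reference 50, though not the authors’ actual statistical procedure), the sketch below treats a collective intelligence factor as the dominant component of a matrix of team scores across several tasks. The data are synthetic, with a shared factor deliberately injected so there is something to find.

```python
import numpy as np

rng = np.random.default_rng(0)
n_teams, n_tasks = 40, 6
scores = rng.normal(size=(n_teams, n_tasks))      # task-specific noise
scores += 1.5 * rng.normal(size=(n_teams, 1))     # shared "c"-like factor

# Standardize each task, then take the first principal component.
z = (scores - scores.mean(axis=0)) / scores.std(axis=0)
u, s, vt = np.linalg.svd(z, full_matrices=False)
c_estimate = z @ vt[0]        # one "c" score per team (sign is arbitrary)

explained = s[0] ** 2 / (s ** 2).sum()
print(f"first factor explains {explained:.0%} of score variance")
print("example c estimates:", np.round(c_estimate[:3], 2))
```

In the actual study, this kind of general factor tracked social sensitivity and turn-taking far better than it tracked the members’ individual IQs.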
Culture – As discussed in Chapters 2 to 4, “culture is a set of solutions produced by a group of people to meet specific problems posed by the situations that they face in common.”51 According to Diane Vaughan, noted sociologist and author of The Challenger Launch Decision, “these solutions become institutionalized, remembered, and passed on as rules, rituals, and values of the group”18. While culture may sometimes be attributed to an organization as a whole, in most large organizations there can be as many cultures as there are subunits. This idea of numerous subcultures is definitely true of NASA, for example. As you read the histories of each NASA center, you understand the difference in cultures between entire centers, and between the directorates, divisions, and branches within those centers. The ways in which the NASA research centers were trained to acquire and create knowledge, to communicate, and to make sense of and solve complex, transdisciplinary problems, such as finding the root causes of the Columbia accident and accurately predicting foam impact damage at ballistic speeds, were quite different (see the section in Chapter 7 entitled “RTF Impact Dynamics Team”). It was evident from the CAIB that the primary cause of the Columbia accident was the dysfunctional culture and the inability of the LESS-PRT to adequately assess the criticality of the foam impact19. We discussed in detail how poorly groups like the LESS-PRT constructed knowledge/risk and arrived at their poor decisions, how that dysfunctional culture persists even today within NASA, and why it is so difficult for organizations and teams to correct. It was also evident that it takes only one team in the hundreds or thousands within a large organization to show even “weak signals” of technical or behavioral issues for tragedies to happen. The culture of a team and/or team-of-teams is highly dependent on and related to several other key team attributes, such as its leadership, level of psychological safety, cognition and ability to think critically, communication, trust, and identity. The next chapter recommends several ways for organizations and teams to monitor their culture and take corrective actions to help transform the culture back to a healthy, high-performing state.
Team identity – Team identity comprises the qualities, beliefs, personality, and/or expressions that make a group or team unique or distinctive. It is that distinctive quality of a particular team that sets it apart from others and makes team members proud to be a part of it. What do people typically think of when they think of NASA? What quality or identity comes to mind: teamwork, a can-do attitude, brilliant problem solvers? Hila Lifshitz-Assaf studied knowledge boundaries at NASA and the critical role the professional identity of teams at JSC played in either accepting or rejecting the idea of “open innovation” as a way to solve
some challenging life science problems there52. What she learned was that even though some teams responded in surveys that they accepted open innovation from outside sources to help solve their problems, actual observations indicated otherwise! The offending teams viewed their identity as “Problem Solvers” who were rewarded for solving challenges in the life science community at JSC, and they resisted external help. However, the groups that accepted the idea of “open innovation,” and help from others, had a team identity more aligned with “Solution Seekers,” whose real value was finding the solution wherever, and from whomever, they could. They valued external exploration and help from outside their core team to find successful solutions to problems. Where Problem Solvers might tend to impede the progress of other teams trying to find solutions and thus gain recognition, Solution Seekers were not concerned with individual recognition; their goal was to solve the problem by any means possible! My view of a high-performing team and/or “team-of-teams” is one that is very proficient at finding and maturing solutions to complex problems. In later chapters and throughout the case studies, you will learn how to develop team learning skills and ways to find key SMEs to bring into your “solution-seeking” network at just the right time to accelerate the solution process.
As unimaginable as it might sound, one such SSPO team at JSC, in charge of the formal on-orbit repair of the wing leading edge problem, even tried to sabotage the success of the parallel team that we had created covertly (in astronaut Don Pettit’s garage) to solve the very same problem (see RTF On-Orbit Repair)21. What could possibly be the reason a NASA SSPO team would want to prevent another team from helping to save astronauts’ lives on orbit? This highlights very dysfunctional team behaviors and provides some clues as to how team identity and culture can harm a healthy team or organization and possibly put lives at risk. When our HPT (research team) developed several viable solutions to the RTF On-Orbit Repair Problem, we were totally transparent and open and shared our work with the formal SSPO team, even though they had actively tried to impede our progress. The result was a solution that we eventually flew on our RTF mission following the Columbia disaster. The primary contractor responsible for producing the repair kit had stolen our idea, without giving any recognition to our team’s contribution. Imagine how actions like these could harm a healthy culture and destroy open, transparent communication, trust, and the sharing of ideas!
Trust – Interpersonal trust is defined as “the willingness of a party to be vulnerable to the actions of another party based on the expectation that the other party will perform a particular action important to the trustor, irrespective
of the ability to monitor or control that other party.”53 Trust has been shown to enable open communication and information sharing among team members and is strongly related to an environment of psychological safety and to leadership traits that support trust building among team members. The open sharing of ideas and knowledge enhances team learning/cognition, creating an environment where creativity can flourish. So, it is no surprise that trust was found to be especially important for teams whose mission is new product development54. Building and maintaining trust is more challenging for globally dispersed teams with increased reliance on virtual, computer-mediated communication (e.g., emails, videoconferencing, shared collaboration spaces, social media, etc.). Global dispersion includes more than just geographic dispersion, which has been studied extensively; it includes other moderating factors such as national diversity, computer-mediated communication (CMC), and team membership flexibility. The importance of trust is higher when teams are globally dispersed. Monthly face-to-face meetings reduce the need for CMC; when reliance on CMC is high, the importance of trust for team effectiveness is also high. When teams must rely on CMC as a prime mode of communication, nonverbal cues such as facial expressions and posture are limited, and there are increased misunderstandings and potential for conflict. As the potential for team flexibility increases, there seems to be an increased need for trust to ensure team effectiveness.
Challenge/mission – The last, but by far not the least important, attribute that determines a team’s effectiveness is the challenge or mission of that team. An “epic” challenge is one so difficult, or seemingly impossible, that it demands the very best from everyone, and it is an honor to be selected as a member of a team attempting to solve it. The “epicness” of the challenge alone has the power to inspire, motivate, and sustain the team on its Odysseus-like journey. Most of the stories I share in the next chapters could be considered epic challenges. The only way to solve an epic challenge is with an HPT and with leadership that understands what it takes to make even ordinary challenges rise to epic proportions.
Now that we understand some of the key attributes of high-performing teams, the next steps are to understand the types of teams needed and how to create them, lead them, nurture them, and sustain them to solve anomalies before they become large failures which lead to disaster. Having the correct HPT is critical for driving to the real root causes of anomalies, because only then can we be sure the solutions we develop will correct the problems and prevent disasters.
References
1. Katzenbach, Jon R. and Smith, Douglas K.: “The Discipline of Teams.” Harvard Business Review, March 1993.
2. Pentland, Alex “Sandy”: “The New Science of Building Great Teams.” Harvard Business Review, April 2012.
3. Granovetter, Mark S.: “The Strength of Weak Ties.” American Journal of Sociology, Vol. 78, Issue 6, May 1973.
4. Bennis, Warren and Biederman, Patricia Ward: “Organizing Genius – The Secrets of Creative Collaboration.” Basic Books, 1997.
5. Newman, Alexander; Donohue, Ross; and Eva, Nathan: “Psychological Safety: A Systematic Review of the Literature.” Human Resource Management Review 27 (2017), 521-535.
6. Edmondson, Amy C. and Lei, Zhike: “Psychological Safety: The History, Renaissance, and Future of an Interpersonal Construct.” Annual Review of Organizational Psychology and Organizational Behavior, 1, 23-43.
7. Edmondson, Amy C.: “The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth.” John Wiley and Sons, 2019.
8. Duhigg, Charles: “What Google Learned from Its Quest to Build the Perfect Team.” The New York Times Magazine, Feb. 25, 2016. https://www.nytimes.com/2016/02/28/magazine/what-google-learned-from-its-quest-to-build-the-perfect-team.html
9. Rozovsky, J.: “The Five Keys to a Successful Google Team.” re:Work Blog, November 17, 2015. https://rework.withgoogle.com/blog/five-keys-to-a-successful-google-team/
10. Barthelemy, Bart: “The Sky Is Not the Limit – Breakthrough Leadership.” St. Lucie Press, 1997.
11. Tuckman, Bruce W.: “Developmental Sequence in Small Groups.” Psychological Bulletin 63, pp. 384-399, 1965.
12. Camarda, Charles J.: “A Return to Innovative Engineering Design, Critical Thinking, and Systems Engineering.” Keynote address presented at the International Thermal Conductivity Conference (ITCC) and the International Thermal Expansion Symposium (ITES), Birmingham, AL, June 24-27, 2007. https://drive.google.com/file/d/0B84i3cJ_nNa0d3JiOGl4NWwwQ28/view?usp=sharing
13. Camarda, Charles J.: “Space Shuttle Return-to-Flight Following the Columbia Tragedy.” NATO Science and Technology Organization Lecture Series on “Hypersonic Flight Testing,” STO-AVT-234-VKI, March 24-27, 2014, von Karman Institute, Rhode-St-Genese, Belgium.
14. Hannah, Sean T.; Uhl-Bien, Mary; Avolio, Bruce J.; and Cavarretta, Fabrice L.: “A Framework for Examining Leadership in Extreme Contexts.” The Leadership Quarterly 20 (2009), 897-919.
15. Salas, E.; Estrada, A. X.; and Vessey, W. B.: “Team Cohesion: Advances in Psychological Theory, Methods, and Practice.” Vol. 17, Emerald Group Publishing, 2015.
16. Csikszentmihalyi, Mihaly: “Creativity – Flow and the Psychology of Discovery and Invention.” Perseus Publishing, 1997.
17. Pentland, Alex: “Social Physics – How Social Networks Can Make Us Smarter.” Penguin Books, 2014.
18. Vaughan, Diane: “The Challenger Launch Decision – Risky Technology, Culture, and Deviance at NASA.” The University of Chicago Press, 1996.
19. Gehman, H. W., et al.: “Columbia Accident Investigation Board.” Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
20. Matson, Jack V.: “Innovate or Die – A Personal Perspective on the Art of Innovation.” Paradigm Press Ltd., 1996.
21. Camarda, Charles J.: “A Return to Innovative Engineering Design, Critical Thinking, and Systems Engineering.” Keynote address presented at the International Thermal Conductivity Conference (ITCC) and the International Thermal Expansion Symposium (ITES), Birmingham, AL, June 24-27, 2007. https://drive.google.com/file/d/0B84i3cJ_nNa0d3JiOGl4NWwwQ28/view?usp=sharing
22. Camarda, Charles J.; Bilen, Sven; de Weck, Olivier; Yen, Jeannette; and Matson, Jack: “Innovative Conceptual Engineering Design – A Template to Teach Problem Solving of Complex Multidisciplinary Design Problems.” American Society for Engineering Education Annual Exposition and Conference, Louisville, Kentucky, 2010. https://drive.google.com/file/d/0B84i3cJ_nNa0RlJTTkZvTVlMMzA/view?usp=sharing
23. Camarda, Charles J.; de Weck, Olivier; and Do, Sydney: “Innovative Conceptual Engineering Design (ICED): Creativity and Innovation in a CDIO-Like Curriculum.” Proceedings of the 9th International CDIO Conference, Massachusetts Institute of Technology and Harvard University School of Engineering and Applied Sciences, Cambridge, Massachusetts, June 9-13, 2013.
24. Petroski, Henry: “To Engineer Is Human – The Role of Failure in Successful Design.” Vintage Books, 1992.
25. Edmondson, Amy C.: “Strategies for Learning from Failure.” Harvard Business Review, The Failure Issue, April 2011.
26. Alliger, George M.; Cerasoli, Christopher P.; Tannenbaum, Scott I.; and Vessey, William B.: “Team Resilience: How Teams Flourish Under Pressure.” Organizational Dynamics 44 (2015), 176-184.
27. Helfgott, Ariella: “Operationalizing Systemic Resilience.” European Journal of Operational Research 268 (2018), 852-864.
28. Helfgott, Ariella: “Operationalizing Resilience: Conceptual, Mathematical and Participatory Frameworks for Understanding, Measuring and Managing Systems Resilience.” University of Adelaide, 2014.
29. Fraccascia, Luca; Giannoccaro, Ilaria; and Albino, Vito: “Resilience of Complex Systems: State of the Art and Directions for Future Research.” Hindawi Complexity, Volume 2018, Article ID 3421529, 44 pages. https://doi.org/10.1155/2018/3421529
30. Black, Warren: “Building Resilience into Modern Organizations (A Complexity Science View).” August 8, 2018. https://www.linkedin.com/pulse/building-resilience-modern-organisations-complexity-science-black/
31. McChrystal, Stanley; Fussell, Chris; Collins, Tantum; and Silverman, David: “Team of Teams: New Rules of Engagement for a Complex World.” Penguin Publishing, 2015.
32. Graham, Margaret B. W., and Shuldiner, Alec T.: “Corning and the Craft of Innovation.” Oxford University Press, New York, 2001.
33. Fayard, Anne-Laure and Weeks, John: “Who Moved My Cube? – Creating Work Spaces That Actually Foster Collaboration.” Harvard Business Review, July-August 2011.
34. Kelley, Tom: “The Art of Innovation – Lessons in Creativity from IDEO, America’s Leading Design Firm.” Doubleday, 2001.
35. Liu, Eric and Noppe-Brandon, Scott: “Imagination First – Unlocking the Power of Possibility.” John Wiley & Sons, 2009.
36. Hargadon, Andrew: “How Breakthroughs Happen – The Surprising Truth About How Companies Innovate.” Harvard Business School Press, 2003.
37. Osborn, Alex F.: “Applied Imagination – Principles and Procedures of Creative Thinking.” Charles Scribner’s Sons, 1957.
38. Altshuller, G.: “And Suddenly the Inventor Appeared – TRIZ, the Theory of Inventive Problem Solving.” Technical Innovation Center, Inc., 2004.
39. Allen, Robert: “Bulletproof Feathers – How Science Uses Nature’s Secrets to Design Cutting-Edge Technology.” The University of Chicago Press, 2010.
40. Singer, D. J.; Doerry, N.; and Buckley, M. E.: “What Is Set-Based Design?” Naval Engineers Journal, 121 (4), pp. 31-43.
41. Collins, Jim and Porras, Jerry I.: “Built to Last – Successful Habits of Visionary Companies.” HarperCollins, 1994.
42. Rich, Ben R. and Janos, Leo: “Skunk Works.” Little, Brown, and Company, 1994.
43. Tupes, E. C. and Christal, R. E.: “Recurrent Personality Factors Based on Trait Ratings.” (Tech. Rep. ASD-TR-61-97) Lackland Air Force Base, TX: U.S. Air Force, 1961.
44. Briggs Myers, Isabel and Myers, Peter B.: “Gifts Differing – Understanding Personality Type.” Davies-Black Publishing, 1980.
45. Judge, Timothy A., and Bono, Joyce E.: “Five-Factor Model of Personality and Transformational Leadership.” Journal of Applied Psychology, Vol. 85, No. 5, pp. 751-765, 2000.
46. Hofmann, David A., and Jones-Christensen, Lisa: “Leadership, Collective Personality, and Performance.” Journal of Applied Psychology, June 2005.
47. Prewett, Matthew S.; Brown, Matthew I.; Goswami, Ashita; and Christiansen, Neil D.: “Effects of Team Composition on Member Performance: A Multilevel Perspective.” Group and Organization Management, 2018, Vol. 43(2), pp. 316-348.
48. Wilde, Douglas J.: “Teamology: The Construction and Organization of Effective Teams.” Springer, 2009.
49. Oxford Dictionary: www.oxforddictionaries.com. Retrieved 2016-02-04.
50. Woolley, Anita Williams, et al.: “Evidence for a Collective Intelligence Factor in the Performance of Human Groups.” Science, Vol. 330, October 29, 2010.
51. Van Maanen, John and Barley, Stephen R.: “Cultural Organization: Fragments of a Theory.” In Organizational Culture, Sage, 1985.
52. Lifshitz-Assaf, Hila: “Dismantling Knowledge Boundaries at NASA: The Critical Role of Professional Identity in Open Innovation.” Administrative Science Quarterly, 2017, pp. 1-37.
53. Mayer, R. C.; Davis, J. H.; and Schoorman, F. D.: “An Integrated Model of Organizational Trust.” Academy of Management Review, Vol. 20, No. 3, pp. 709-734, 1995.
54. Muethel, Miriam; Siebdrat, Frank; and Hoegl, Martin: “When Do We Really Need Interpersonal Trust in Globally Dispersed New Product Development Teams?” R&D Management 42, 1, 2012.
239
Chapter 7
Developing High-Performing Teams to Solve Epic Challenges

Types of Teams

Not all teams are created equal. There are many different types of teams, depending on factors such as the team’s mission, environment, objectives, and cognition (e.g., domains of knowledge, methods of knowledge sharing); the nature of the problem (technical/non-technical, unidisciplinary to transdisciplinary, coupling between disciplines/domains, etc.); the level of disruption/transformation; the level of global dispersion; the level of risk; and the size and complexity of the problem (see figure 1).
Figure 1. – Types of Teams.
The specific type of team and/or mission, together with several other moderating factors like size, problem complexity, level of team-member global dispersion, organizational structure and governance, constraints (time, money, resources, political pressure), and virtual efficacy, will determine the magic number of essential attributes (shown in fig. 1 of Chapter 6) and the prioritized order of critical/essential items that must be considered when creating and managing a successful HPT. I have selected examples of real teams to highlight what makes them successful and will identify and compare, whenever possible, examples of underperforming teams and/or breakdowns in team performance. The types of teams listed in figure 1 are not meant to be exhaustive; in many cases, actual teams may be a combination of types. The need for different types of teams, and the weighting of various attributes during the phases of a product development life cycle, will also become apparent (e.g., during the conceptual design phase, one might stress learning, creativity, and tolerance for risk as compared to the development or operational phases).

Teams that do things: This category of teams includes conventional product development, which might be an incremental improvement of an existing commercial product that a company markets. Usually, companies that produce classes of products like cars or airplanes have a long history of such product development methods and teams of experts who understand the design space, subsystem requirements, manufacturing details, technology risks, and safety concerns. They have typically developed a series of rules of thumb, processes, and procedures that are tried and true, which they rely on to guide incremental improvements to products and processes. Truly high-performing versions of such teams will recognize the complex nature of their problem, understand that outcomes are not deterministic and thus cannot be predicted, and have risk reduction, mitigation, and control strategies in place to identify when “weak signals” of impending anomalies can arise and cause catastrophes. This category also includes research and development teams whose objective is to develop a comprehensive understanding of a physical domain, ranging from a single discipline that is relatively uncoupled from other disciplines (unidisciplinary) to a highly complex interdisciplinary problem that can only be solved by a team that converges multiple disciplines in a transdisciplinary way, integrating and creating new knowledge at the intersection of discipline boundaries and conceiving new schemas that transcend conventional paradigms. The example used to illustrate
the importance of selecting the right team for this job is the RTF Impact Dynamics Team (RIDT). This example will compare the results of the SSPO team’s approach to this problem with those of the RIDT. The key attributes that determined the success of the RIDT included: 1) team makeup and leadership, 2) team cognition, 3) level of psychological safety, 4) trust, and 5) epic nature (epicness) of the challenge.

Teams that run/operate things: Teams that run or operate things include the Space Shuttle Program Office (SSPO) of the late 1980s and later. After the first four shuttle flights, the program was deemed “operational.” This designation was more political than technical; it did not reflect an actual understanding of the space shuttle and its complex systems and subsystems. In reality, the space shuttle should always have been maintained as a flight experiment, with all the necessary attention to rigorous technical detail, analysis, experimentation, critical thinking, and anomaly resolution required of any experimental flight test program. Instead, political pressures demanded that the space shuttle be considered an operational vehicle after only four flights1. Usually, teams that run or operate complex systems have years of experience producing and operating hundreds or thousands of similar products and a proven track record of technical understanding of the intricate systems and subsystems, along with risk mitigation strategies to ensure accurate products that meet all requirements and have satisfied thousands of hours of verification and validation testing. In these cases, the analytical methods used to assess risk and probabilities of failure are more accurate because they are based on many laboratory experiments and real “flight test” experiences. NASA operates quite differently from commercial companies. The concepts and designs it studies are usually experimental, and only very few vehicles or prototypes (one of a kind) have ever flown or been tested. The culture of NASA during the Apollo program was a research and development (R&D) culture that understood the complexities and the “unknown unknowns” and had a healthy pessimism that placed an emphasis on experimentation, test, analysis, and analysis-test correlation. It also maintained a psychologically safe environment2. The key attributes that were degraded, and whose degradation led to the Columbia accident, were: 1) psychological safety, 2) cognition and critical thinking, 3) culture, 4) aversion to risk, 5) organizational structure, and 6) team makeup.
Teams that design things: Design teams usually work in the earliest phases of the product development cycle/process. The design process can begin either with a totally blank sheet of paper, on one end of the spectrum, or with designers having many initial concepts that have already been developed or prototyped from which to choose ideas. Designers must understand their customers’/stakeholders’ needs, wants, desires, and constraints and develop a working solution that fits into what engineers call the “design space.” Depending on the problem, the need for creative, innovative solutions can vary immensely. The four examples of design problems I have personally worked on are the National AeroSpace Plane (NASP)3, the RTF On-Orbit RCC Wing Leading Edge Repair4, an Alternate Launch Abort System (ALAS)5,6 for the Constellation vehicle, and the Land Landing of the Orion Capsule7. The first example fits into the classification of what I would also call an “epic challenge.” The on-orbit repair of the RCC wing leading edge also fits this category because of the technical complexity, time criticality, and size of the worst-case damage (a pizza-box-size hole, 14 in. x 14 in.) of the solution needed prior to the RTF mission (STS-114). The other two design problems did not rise to the level of an epic challenge, but they required an HPT to solve because of the level of creativity required and the bureaucratic impediments involved. Typical key team attributes associated with design teams include 1) creativity, 2) psychological safety, 3) team makeup/skill mix, 4) diversity, 5) critical thinking, 6) communication that values open, engaging, and exploring modes, and 7) team identity as solution-seekers.

Teams that sell things: There are some teams that are totally dedicated to sales and selling. While I do not include a specific case study as an example, selling is something we all have to do, either individually or as a team, at some point in our lives. Several of the case studies do include elements of selling design concepts, theoretical solutions to difficult problems, and strategies for transforming organizations or architectures for colonizing space. Most of the examples of reference 8 are about selling, in particular about teams whose sole purpose is to put together a winning proposal to win a large aerospace contract. The author describes a process for using personality dimensions of team leaders and members to align with the perception of the client’s/stakeholder’s wants, needs, and desires and, thus, increase the probability of being awarded the sought-after contract. I absolutely believe emotional intelligence is necessary for this type of team to really understand the most important desires of the stakeholders and to translate the team’s ideas back to them to make the sale. There is so much more that
is required to actually “make” the sale or close the deal. For example, once you have the contract, do you have the correct team to deliver the product on time and within budget? What I hope to demonstrate are ideas to revolutionize and transform the way we conceive, select, evaluate, and develop solutions to complex problems, solutions that have a high probability of success. The teams I initiated and helped develop to solve many of the challenges I describe in the case studies that follow did just that. The teams solved impossible challenges within budget and ahead of schedule, much like other great organizations such as Lockheed’s Skunk Works9! The Skunk Works team won many of the government’s most critical technology contracts because of its reputation and identity: it could solve the most daunting challenges within budget and ahead of schedule, and if it could not, it would tell you! Lockheed was the only prime contractor in the NASP Technology Maturation Program (TMP) that would not cost-share expenses with the government in the hopes of receiving the prime contract award to build the NASP vehicle, and it was the first company to concede to the government that the vehicle was impossible to build. There is something to be said for that kind of reputation and organizational integrity.

Teams that recommend things: There are some teams whose job is to provide consultation and support to help an organization make a crucial decision. Such teams must have both a deep and broad understanding of a particular subject and/or problem and have the confidence of the client or stakeholder that their recommended solution will be both objective and on point. In other words, your team must have a proven track record that others value and are willing to invest in. A NASA example of this type of team, or team-of-teams, was the complex network of teams that outlined and executed a successful return-to-flight strategy for the Space Shuttle Program following the Columbia accident. The STS-114 RTF strategy details the network of teams, the key SMEs, the technologies that had to be developed, and the robust strategic plan necessary to ensure a safe return to flight operations4,10. However, as described by some members of the Return-to-Flight Task Group, “The utilization of operational-type management and engineers made the return to flight of the space shuttle difficult11.” Both the research impact dynamics team (RIDT) and the R&D RCC wing leading edge repair team highlight the need for research engineering leadership, which proved very effective in solving many of the RTF problems.
Teams that solve “epic” challenges: This is the type of team that most appeals to me personally. The Apollo race to the moon was the epic challenge President John F. Kennedy posed to the American people in the ’60s, when I was a boy growing up in Queens, N.Y. It is the reason I selected NASA as my one and only employer for over 45 years of my life. It is the reason I selected research on hypersonic vehicle structures as my specialty early in my career and why I became an astronaut. What truly motivates me is the challenge of doing the impossible: what no one else has been able to do, what many have tried to do and failed, being part of or leading a team trying to solve the “seemingly” impossible challenges for the good of society. I believe a truly “epic” challenge can draw the best-of-the-best people as members and stoke the fire and passion necessary to ensure the team can sustain the low periods and persevere to a successful conclusion. This is why I sign my astronaut photos, “Reach for the stars…Never give up!” What you will hopefully learn from this chapter is what it takes to lead an HPT to solve epic challenges and to be an effective team member who helps make the impossible a reality. The epic challenge I will present is the RTF RCC Wing Leading Edge On-Orbit Repair. This is one of several challenges we used to develop a cyber-social, team-of-teams, or research network-of-teams approach to solving such problems. It was not only a success, but I was fortunate to fly an on-orbit repair kit on our mission, STS-114. It was a means of repairing our vehicle in the event of a debris strike on launch, which would ensure our safe return. Our repair kit was flown on every successive mission until the retirement of the space shuttle in 2011. Some of the key team attributes for solving epic challenges include 1) strong leadership, 2) psychological safety, 3) cognition and critical thinking, 4) team makeup, 5) communication, 6) cohesion, 7) creativity, and 8) tolerance for risk.

Teams that transform things: The final type of team we discuss in this book is a team that is formed to transform an organization’s workforce, culture, way of doing business, product development process, and/or competency management strategy. This type of team must have a deep understanding of the organization or process it is trying to transform. It must be able to provide a logical, sound strategy, identify the ways that strategy can fail, and recommend safeguards to keep those failure modes from becoming reality. As mentioned earlier, most organizations that try to transform their culture fail. In the next chapter, I will propose ideas that can help transform struggling organizations into high-performance
organizations that can solve complex technical problems and maintain technical excellence. The ideas presented are ones that I have advocated and presented to NASA, Boeing, and other non-aerospace companies.

A Research Network of Teams (RNoT)

When I first read Stanley McChrystal’s book “Team of Teams”12, I immediately realized that the way my branch at Langley, the Thermal Structures Branch (TSB), operated and interacted with the other divisions and branches at Langley and the other research centers was very similar to a team-of-teams approach, the TSB being the core team with a dense network of connections to expertise within and external to NASA. I would describe the organizational structure at NASA Langley in the ’70s and ’80s as operating, for the most part, like a flat organizational structure or a network, as opposed to the strict, formal hierarchical command structure or command-of-teams structure described in reference 12 and shown in figure 2. We were able to interact freely with all branches, and unless serious resources were necessary for a particular task, most work could be accomplished “off the books,” so to speak, without requiring a “charge code.”
Figure 2. – Differentiation of the types of organizational structures and their attributes, according to Stanley McChrystal and “Team of Teams”12.
Integrated research of systems team: The Thermal Structures Branch (TSB) at NASA LaRC was much more than what the aerospace industry would call a conventional integrated product development team (IPDT). Rather, it was a collection of subject-matter-experts with cross-disciplinary skills in multiple key domains (e.g., structural mechanics, thermal structures, heat transfer, aerothermodynamics and fluid
flow, materials science, etc.) related to hypersonic vehicle structures and their related systems and subsystems, such as thermal protection systems (TPS), cryogenic tankage, and hot, passive, and cooled structures. Different members of my branch might have specialized in a particular subsystem, such as the TPS, and analyzed it as a “thermal structure” to understand how the aero, thermal, mechanical, and pressure loads would cause it to deform, strain, and eventually fail. We would use the same building-block approach described in Chapter 5 to understand each element of the “system” and constantly improve our analyses and experimental tests so we could design a better system. Within our branch, one person might better understand a particular discipline, such as thermal stress or aerothermal heating and heat transfer. We would try to understand all behaviors but, if we needed help, we could reach out to a specific expert in another branch, for example, the Structural Mechanics Branch, to dive deep and understand the failure modes of a thin-walled structure under combined loading. The Vehicle Analysis Branch at Langley would look at the design of an entire vehicle for a specific mission. They would reach out to us to understand whether their simplifying assumptions and analyses at the vehicle level were consistent with our predictions in the laboratory. We were constantly zooming out to the vehicle level to determine if our designs for next-generation TPS, for example, could improve vehicle performance, reusability, and/or reliability. I call this type of research team or branch an integrated systems research branch (see figure 3). The hypersonic vehicles studied by the TSB included the X-15, the Apollo capsule, the space shuttle, the NASP, and the single-stage-to-orbit (SSTO) experimental vehicle called the X-33 (see upper right of figure 2).
Figure 3. – Definition of an integrated systems research branch and a team-of-teams network to solve complex problems.
The collection of team members with cross-disciplinary skills (shown by the dashed oval) is connected as a “team-of-teams”12 with an integrated network of key subject-matter-experts (SMEs) in each individual discipline to rapidly assess potential failure mechanisms and anomalies. The success of this network of teams, and its open, rapid collective learning and transparent dissemination of knowledge, enabled us to rapidly solve complex technical issues and anomalies. My training and mentoring of individual researchers and teams at the subsystem and system levels developed the expertise needed to rapidly assemble the correct teams to solve the Columbia and return-to-flight (RTF) problems. Not having this capability is what caused entire organizations to fail to understand critical anomalies and their impacts, literally and figuratively, on the space shuttle. It also explains why the NESC team of really good engineers, and even some researchers, was unable to understand the criticality of the RCC panel 8R anomaly. The examples below demonstrate the effectiveness of developing the right team and ensuring that the team is what I call a high-performing team (HPT).

Formation of the RTF Impact Dynamics Team (RIDT): The first step to return to flight (RTF) post-Columbia was determining and understanding the technical root cause of the accident. The NASA SSP technical teams were in trouble from the very start. Years of working with Southwest Research Institute (SwRI) in the 1990s13-15 had failed to validate a “physics-based” capability that could accurately predict damage to the shuttle TPS (the acreage ceramic foam tiles and the reinforced carbon-carbon (RCC) wing leading edges (WLEs)). The most that was accomplished was a semi-empirical solution, embodied in a computer program called Crater, which was woefully inadequate in predicting impact damage during STS-107, as explained in Chapter 2. Why were the really good engineers and researchers at SwRI unable to solve this problem? When I returned from Russia one week after the accident and suggested to the SSP that I be allowed to create a world-class team of researchers to rapidly develop the knowledge necessary to understand the impact problem, I was told to stand down; it was not necessary. This case study highlights how I sidestepped the conventional management chain to initiate a truly research-based high-performance team (HPT), one whose mission was to “do” things like exploration, discovery, and knowledge creation/building (research)! The primary mission of the RIDT was the construction of the knowledge necessary to understand the physics of the ballistic impact of ET foam on the
shuttle reinforced carbon-carbon (RCC) wing leading edge (WLE). After my meeting with Rommel in March 2003, I began tackling each of the key issues I mentioned in my white paper4. The first task, after my suggestion was dismissed by Dan Bell and Glenn Miller, was to create a research-based impact dynamics team to properly understand the ballistic impact of foam on an RCC WLE. The prior work reported by SwRI using their linear-elastic modeling tools13,14 clearly would not be capable of predicting the large nonlinear deformations (geometric nonlinearity) and nonlinear material behavior that would be necessary to understand the initiation of failure within the RCC material. I knew we needed the material scientists and nonlinear structural mechanics experts of GRC and LaRC to help us solve this problem. As mentioned in Chapter 6, one of the key elements of a successful HPT is the team makeup and skill mix. Getting the researchers from GRC and LaRC to participate would be difficult because JSC was in charge of the shuttle program. They controlled the purse strings and had the power, and it was clear after talking to the senior managers at GRC and LaRC that they were under strict orders not to offer help unless they were ASKED by JSC leadership. The first person I called for help was Del Freeman, the center director of LaRC and a good Friend of Charlie (FoC). I told him I would like to work with several key experts in the Impact Dynamics Group (IDG) at LaRC, led by Dr. Edwin Fasanella16, to help present a physics-based impact dynamics solution plan to the SSP. Without hesitation, Del told me I could pull Ed’s team in to help my effort. The next person I called was another FoC and colleague of mine from LaRC, Dr. Woodrow Whitlow, who had become the Director of Engineering at GRC. I called Woodrow and asked if I could recruit Matt Melis and the Glenn Ballistic Impact Team at GRC to help us on this mission. His response was immediate and just like Del Freeman’s: yes, absolutely! Notice the distinct difference in response between even upper management at the NASA research centers (GRC and LaRC) and the SSP, NESC, MOD, and Engineering Directorates at JSC. Neither Del Freeman nor Woodrow Whitlow was concerned with any blowback or repercussions based on their decisions to do the right thing! Matt Melis, another FoC and colleague, was a very competent structural mechanics researcher at GRC whom I had known for over 20 years and with whom I had worked, along with Woodrow, on the National Aero-Space Plane (NASP) back in the ’80s. I decided I would disregard Glenn Miller’s email announcement that JSC would not need help with ballistic impact analysis and proceeded to secure a
time to present my ideas for a research and development (R&D) RTF Impact Dynamics Team at one of the Shuttle Engineering Review Board (ERB) meetings. Directly after the accident, Ralph Roe initiated and was running the Orbiter Vehicle Engineering Working Group (OVE-WG) to understand the cause of the Columbia accident. When JSC Engineering, the SSP, and Ralph Roe heard that I was preparing to brief the ERB on the need to develop a physics-based impact analysis capability, they immediately requested/ordered me to brief Ralph, Glenn Miller, and John Mulholland (a structures lead at Boeing) prior to my presentation to the ERB. Once John and Ralph saw what I was going to present, they immediately informed me I had to put Glenn Miller’s name on the presentation before I briefed it to the ERB. They probably realized that the team I had put together from LaRC and GRC had the technical expertise to actually succeed in developing a physics-based impact model, and how foolish they would look if we were successful! Clearly, the SSP did not want to appear to lack leadership and technical expertise in such a crucial component of the accident investigation and very possibly wanted direct control of this research team.

Construction of Knowledge – The RTF Impact Dynamics Team (RIDT) used a rigorous scientific research process to construct knowledge of impact dynamics, as described in Chapter 5. Careful, intelligent tests were designed to rapidly obtain the necessary knowledge as accurately and as inexpensively as possible. Experiments were modeled analytically, and results were correlated with tests. The RIDT used a building-block approach comprising three levels of model validation: 1) Level 1 validation consisted of foam impacts onto a rigid backstop instrumented with a load cell to validate the foam and RCC material models (GRC), and a series of three-point bend tests of RCC (Southern Research Institute (SRI)) for the RCC models; 2) Level 2 validation included small foam ballistic impact tests onto 6 x 6 in. RCC plates and coupons; and 3) Level 3 validation comprised full-scale impact tests, including impact tests of Orbiter wing panels 6, 8, and 9 conducted at SwRI. Details of the tests, instrumentation, NDE evaluations, etc. will be discussed below13-21. I based the selection of the research teams on my prior experience with teams of researchers at various technical organizations and their deep understanding and research of ballistic impact on aerospace structures.

NASA Langley Research Center (LaRC) Network – Dr. Edwin Fasanella’s IDG had more than 30 years of research experience as a team conducting nonlinear, high-speed finite element analysis and testing of ballistic impact dynamics of general aviation aircraft, helicopters, DOD aircraft, and space
capsules from Apollo to Orion (see figure 4). The IDG was formed at the end of the Apollo Program. It used the 240-ft.-high Lunar Lander Research Facility (the red and white gantry in figure 4), which was built in the 1960s to simulate moon landings of the Lunar Excursion Module (LEM) and was later reconfigured as a cable-mounted drop tower to simulate low-speed crash tests (figure 4). This facility is now being used to conduct land- and water-landing capsule tests of the SpaceX and Boeing crew capsules. The researchers in the IDG worked closely with materials scientists at NASA LaRC and other research centers to develop laboratory tests to determine material properties at quasi-static and high strain rates. They were proficient in developing the material models for the LS-DYNA finite element code and had included the capability to incorporate material failure predictions into the analysis.
Figure 4. – NASA Langley Research Center (LaRC) Impact Dynamics Research Facility (IDRF) and related impact dynamics research.
In addition to the experimental tests of debris impacting RCC (which were done at GRC), which were needed to validate the LS-DYNA models, tests were also needed to determine the dynamic material properties of the foam, ice, and ablators. These tests were required to develop physics-based material models for LS-DYNA that include strain-rate data. Most of these dynamic material properties were established using a high-speed, bungee-assisted drop tower test at LaRC, as shown in figure 5.
Figure 5. – LaRC bungee-assisted drop tower dynamic material property test facility.
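For orientation, and as a standard definition rather than anything specific to these tests: strain rate is simply the time derivative of strain,

$$\dot{\varepsilon} = \frac{d\varepsilon}{dt},$$

and the two regimes mentioned above differ by orders of magnitude. Quasi-static material tests are typically run at roughly $\dot{\varepsilon} \sim 10^{-4}$ to $10^{-3}\ \mathrm{s}^{-1}$, while ballistic impacts can drive local strain rates of $10^{2}$ to $10^{4}\ \mathrm{s}^{-1}$ (these ranges are order-of-magnitude conventions, not values taken from the NASA reports). Because a material’s crush strength and failure behavior can change markedly between these regimes, rate-dependent material models were essential.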
The material properties of the BX-250 ET foam that impacted Columbia were used by researchers at GRC to develop the Fu-Chang foam model, which would be used in the LS-DYNA finite element material models to correlate the responses of the foam impact tests described below.

Glenn Research Center (GRC) Network – The GRC Ballistic Impact Team began its development as part of the NASA aeronautics research program in the early ’90s. Since the focus of GRC was aircraft propulsion, their early research focused on turbine engine blade containment and failure mitigation for commercial aircraft. In 2001, the GRC team was asked by the SSP to investigate a potential problem with STS-112 related to silicone (RTV) contaminants in the solid rocket booster (SRB) booster separation motors (BSM). It was never known why the SSP did not ask the GRC ballistic impact team to analyze the foam strike and subsequent damage to the aft skirt of the SRB on STS-112. It was also never known by this author why the Mission Management Team (MMT), the SSP, and the Engineering Directorate at JSC did not request the support of these experts and those at LaRC when they realized a very large piece of foam had struck Columbia during liftoff on January 16, 2003! Clearly, there was a breakdown in transparent communication and/or a lack of knowledge among the MMT and the necessary knowledge communities/teams within NASA. The GRC Impact Lab had a small vacuum gun, which they used to study small-projectile ballistic impacts on prescribed target materials (figure 6). This gun used pressurized helium to propel projectiles along a 12-foot barrel with an inside diameter of 2 in. into a containment chamber that could be set to a specified pressure and that held the material target. These tests would be small and inexpensive and could be conducted rapidly to gain much-needed knowledge as quickly as possible.
Figure 6. – GRC Impact Lab Small Vacuum Gun test setup.
The capability to conduct ET foam impact tests at various pressures was extremely important because the ET acreage area was covered by a closed-cell, spray-on foam insulation, a material that maintained ambient pressure within its cells. The quasi-static and ballistic tests of this foam at various pressures (see figure 7) would help determine whether or not the full-scale, sea-level WLE tests at SwRI would be representative of the failure mechanisms incurred during the foam strike on Columbia at altitude during launch!
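A rough way to see the concern (my own illustration, not a calculation from the reports): closed-cell foam applied at sea level traps gas in its cells at roughly atmospheric pressure, so at high altitude, where the ambient pressure $p_\infty$ approaches zero, the cell walls carry a pressure differential of up to

$$\Delta p = p_{\text{cell}} - p_\infty \approx 14.7\ \text{psi} - 0 \approx 14.7\ \text{psi},$$

which could plausibly change how the foam crushes and fragments on impact compared with a sea-level test. The vacuum-chamber tests were designed to settle exactly this question.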
Figure 7. – Foam impact tests at the GRC Impact Lab (vacuum/nonvacuum).
Results of the material property tests and ballistic impact tests indicated that, while the foam behaved differently in vacuum and nonvacuum conditions, the force-time histories recorded by the load-cell impact block were similar17. Hence, it was determined that the sea-level tests of the WLE at SwRI would be a valid representation of the impact response at altitude. Without such knowledge, there would always have remained an uncertainty, or gap in knowledge, as to whether or not tests at sea level were representative of actual flight conditions at altitude! Also, the Fu-Chang foam material model and LS-DYNA ballistic impact results correlated well with tests and accurately represented foam impact dynamic behavior and failures, as shown on the right of figure 7 and in figure 8.
Figure 8. – Foam Impact analyses at GRC.
Researchers at GRC also developed and validated an accurate full-field, high-speed displacement and strain measurement technique using photogrammetry, whose results were correlated with detailed finite element analyses of the foam projectile and RCC target. High-speed cameras viewing an irregular dot or speckle pattern on the targets, an Aramis 3D image correlation photogrammetry system, and subcomponent specimen tests (6-in. square RCC panels) were used to validate the full-field deformation and strain measurement capabilities of the system18,19, as shown in figure 9.
Figure 9. – Aramis high-speed camera system used for full field strain measurement18.
The GRC team was so effective that a full-field deformation/strain measurement capability was ready in time for the first full-scale WLE test at SwRI in July 2003, only five months after the accident occurred (fig. 10)16,18!
Figure 10. – Aramis high-speed camera system used for full-field strain measurements of full-scale shuttle wing leading edge impact test at SwRI16,18.
Over 100 optical targets were attached to the internal surface of RCC panel 8 for photogrammetry measurements. Later tests used a “speckled paint” pattern developed at GRC for full-field photogrammetry measurement. As shown in figure 10, there was excellent agreement between the full-field deflection results of the RCC WLE calculated by LS-DYNA and those measured during the ballistic impact tests. The high-speed measurement technique was useful in validating analytical models and correlating them with failure mechanisms observed during each level of testing, from sub-components to full-scale tests. Hundreds of flat RCC specimens were tested with various types of debris (foam, ablator, ice, etc.) at varying impact angles and velocities (figure 11)16. The damage was recorded using visual and non-destructive evaluation (NDE) methods and was used to assess potential levels of damage to the WLE, both surface and subsurface (e.g., RCC delaminations (separations between plies) and backside coating loss), while the shuttle was on orbit. Understanding backside coating loss was especially important because it would go undetected by the inspection devices used by the crew during on-orbit inspection. Depending on the damage to the front-surface coating and the internal damage to the RCC laminated structure, the result could be increased heating and RCC ablation during entry caused by loss of the backside coating, increased porosity, and flow through the material!
Figure 11. – Impact tests of ET foam cylinders onto 6 x 6 in. square RCC plates16.
Numerous detailed maps of test damage results, LS-DYNA analyses, and NDE measurements were created for teams in Mission Control to use while crews were on orbit to assess the severity of both surface and subsurface impacts and damage (figure 12).
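To make the idea of such maps concrete, here is a minimal, hypothetical sketch of the kind of lookup a damage-assessment team could build on top of data like this. Every number, threshold, and function name below is invented for illustration; the real maps were built from hundreds of tests, NDE records, and LS-DYNA runs, not from a formula this simple.

```python
import math

G = 32.174  # ft/s^2; converts weight in lb to mass in slugs

def impact_energy_ft_lbf(weight_lb: float, velocity_fps: float) -> float:
    """Kinetic energy (ft-lbf) of a debris piece of given weight and speed."""
    return 0.5 * (weight_lb / G) * velocity_fps ** 2

def normal_component(velocity_fps: float, incidence_deg: float) -> float:
    """Velocity component normal to the RCC surface (shallow angles hit softer)."""
    return velocity_fps * math.sin(math.radians(incidence_deg))

# Toy damage classes keyed by minimum normal-impact energy (ft-lbf).
# These thresholds are placeholders, NOT values from the NASA test program.
DAMAGE_CLASSES = [
    (0.0, "no detectable damage"),
    (5.0, "possible subsurface delamination / backside coating loss"),
    (50.0, "front-surface coating loss"),
    (500.0, "possible through-thickness breach"),
]

def assess(weight_lb: float, velocity_fps: float, incidence_deg: float) -> str:
    """Return the toy damage class for a given debris strike."""
    e_normal = impact_energy_ft_lbf(
        weight_lb, normal_component(velocity_fps, incidence_deg))
    label = DAMAGE_CLASSES[0][1]
    for threshold, name in DAMAGE_CLASSES:
        if e_normal >= threshold:
            label = name
    return f"{e_normal:.1f} ft-lbf normal-impact energy -> {label}"

# Example: a small foam divot at high relative velocity and shallow incidence.
print(assess(weight_lb=0.02, velocity_fps=700.0, incidence_deg=12.0))
```

The real products, of course, interpolated measured NDE signatures and validated LS-DYNA predictions rather than a single energy scalar; the sketch only shows why velocity, incidence angle, and energy are the natural axes of such a map.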
Figure 12. – RCC foam impact damage maps.
The foam-on-RCC impact maps (figure 12) relate the impact velocities, angle of incidence, and impact energy to: two different NDE inspections certified by the program (C-scans and IR thermography); displacement/stress isotherms and predicted levels of damage via LS-DYNA; and digital photographs of damage (front-surface, back-surface, and edge views to detect interlaminar integrity, delamination, etc.). It was this type of data that was used to determine that the hailstorm at the Cape while STS-117 was on the launchpad could have damaged the wing leading edges, forcing a rollback of the vehicle to the Vehicle Assembly Building (VAB) for inspection.

Southern Research Institute (SRI) Network – The RIDT also included members from SRI who had research experience in thermal-structural analyses, micromechanics, oxidation-protection coating development, and failure of high-temperature refractory and ceramic-matrix composite materials such as
RCC. The director of the materials group at SRI, John Koenig, and his team had worked very closely with materials scientists and structural mechanics experts at LaRC and GRC on numerous projects in the past. SRI was responsible for conducting three-point bend tests of RCC to failure20 (see figure 13) to provide the stress-strain curves that would be used as input to the LS-DYNA material model. Unlike the SwRI analysis13, which could not predict actual nonlinear material behavior and RCC failure, the RIDT analysis capability could predict nonlinear material behavior, internal damage, failure propagation, and actual failure of the full-scale components analyzed.
Figure 13. – Reinforced Carbon-Carbon (RCC) material property tests20.
Boeing Philadelphia Network – Another important member of the RIDT was the impact dynamics team of Boeing Philadelphia, led by Dr. Jon Gabrys. The Boeing team was very proficient in conducting ballistic impact analysis using LS-DYNA and in testing to understand the crashworthiness of their helicopters (similar to Ed Fasanella’s IDG at LaRC)21. The Boeing Philadelphia team was responsible for generating the LS-DYNA finite element meshes for representative leading-edge panels and T-seals (see figure 14 for a typical foam/RCC panel representation). All three networks of teams conducted independent checks of each other’s analyses and openly published their results so teams external to the RIDT could critique and suggest additional ideas.

Figure 14. – Typical ET foam and RCC panel LS-DYNA finite element mesh density for ballistic impact modeling.
RTF Impact Dynamics Team (RIDT) Communication, Leadership, and Behaviors – As described above, the RIDT was composed of several small teams from multiple organizations (LaRC, GRC, SRI, and Boeing) that all had a very strong foundation in research and used a scientific method to construct knowledge. It was, in a sense, a “network-of-teams” or “team-of-teams” as described in reference 12: although nominally led by a representative of the SSP/JSC Engineering (Glenn Miller), it had a loosely coupled, flat organizational structure with technical leadership from senior researcher Ed Fasanella. Each separate team knew its roles, responsibilities, and accountabilities and had local leadership and connectivity to key structures and materials experts within its own organization, with whom its members had developed years of working experience and trust. The complete impact analysis team, which consisted of LaRC, GRC, SRI, and Boeing, worked extremely well together during the Return-to-Flight (RTF) effort. The team members worked as well as or better than most co-located teams. Although in five separate locations, the team communicated almost daily and met face-to-face (F2F) at JSC, GRC, LaRC, SRI, and KSC every few months. The camaraderie was very strong, and good friendships developed. Daily communications used the following methods:
1. A website for data sharing – rapid download of presentations, pictures, experimental data, videos, computer models, etc.
2. Technical Interchange Meetings (TIMs).
3. Teleconferences with presentations at least twice a week.
4. Daily emails.
5. Telephone conversations.
6. Computer-to-computer file transfers.
The various teams and interested program organizations were geographically dispersed, as shown in figure 15.
Figure 15. – Geographic dispersion of the teams doing ballistic impact dynamics.
What is amazing about the RIDT is that they were able to complete the development and validation of a physics-based ballistic impact modeling capability in only four months, in time to predict the outcome prior to the first full-scale test at SwRI and before the completion of the final CAIB report! This capability could have been in place over six months prior to the launch of STS-107 if only the SSP and the LESS-PRT had recognized the risk of foam damage to shuttle TPS (WLEs and tiles) following the impact to the SRB aluminum skirt during the launch of STS-112 (the mission prior to STS-107), a missed opportunity to develop a capability that could have correctly alerted upper managers in the SSPO to the seriousness of the foam debris strike on Columbia’s port wing! A timeline of the RIDT’s progress and accomplishments is shown in figure 16.
Figure 16. – Timeline of progress and accomplishments of the RTF Impact Dynamics Team.
Full-Scale Ballistic Impact Tests – The full-scale impact test of RCC panel 8 (figure 17) was analyzed prior to the test using the tools developed by the RIDT, as mentioned above. The first full-scale impact test of RCC panel 8 was meant to represent, as closely as possible, the actual flight conditions during the launch of STS-107, Columbia. The large, compressed-air gun at SwRI shot a 1.67-lb. rectangular block of ET foam (having a cross-section of 11.5 x 5.5 in.) at panel 8 at a velocity of 530 mph (777 ft/s). The primary purpose of the test was to determine if a piece of foam traveling at the estimated relative velocity of the foam that impacted Columbia could breach and/or critically damage an RCC WLE panel. Prior to the test, RCC panel 8 was fitted
with numerous photogrammetric targets, as described above (shown in figure 18), and internal high-speed cameras to measure displacements and strains for comparison with LS-DYNA predictions.
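For a sense of the energy involved, a quick back-of-the-envelope calculation from the quoted weight and speed (my own arithmetic, not a figure from the test reports):

$$KE = \frac{1}{2}mv^2 = \frac{1}{2}\left(\frac{1.67\ \text{lb}}{32.174\ \text{ft/s}^2}\right)(777\ \text{ft/s})^2 \approx 15{,}700\ \text{ft·lbf} \approx 21\ \text{kJ},$$

a reminder that even very light foam carries substantial kinetic energy at these relative velocities.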
Figure 17. – Full-scale shuttle external tank (ET) foam impact tests at Southwest Research Institute (SwRI)1.
Figure 18. – Location of photogrammetric targets on RCC panel 8 prior to impact test20.
Results of the full-scale ballistic impact test at SwRI for panel 8R are shown in figure 19. The damage-prediction analysis was conducted prior to the test and correlated very well with the actual test results. A true test of your analytical/numerical models is whether you can predict the system’s behavior and failure prior to the actual experimental test.
Figure 19. – Full-scale ballistic impact test results at Southwest Research Institute (SwRI) and comparisons with LS-DYNA finite element predictions made by the RTF Impact Dynamics Team.
After I was reassigned from my position as Director of Engineering at JSC, I worked for Ralph Roe as his Deputy for Advanced Projects in the NASA Engineering and Safety Center. I recommended the use of the RIDT as an example of what I would call a “Super Problem Resolution Team (SPRT),” an HPT of the kind that was needed and that far surpassed the capabilities of the prior Damage Assessment Teams (DATs) and the LESS-PRT of the SSPO. The RIDT was also recognized for its outstanding work with the JSC Best of the Center One NASA Peer Award. The GRC team also received a GRC One NASA Peer Award, and LaRC was included in the Marshall One NASA Peer Award for material characterization of orbiter debris for the MAPTIS database.

RIDT Team Attributes, Characteristics, and Behaviors

The various attributes that were important to the success of the RIDT network of teams are described below:

1. Cognition and the construction of knowledge – This case study clearly contrasts a team (or network-of-teams) that understands how to work in a converged, interdisciplinary manner to
create and validate knowledge about a heretofore unknown observed phenomenon, such as the impact damage caused by the ballistic impact of foam on a fragile, complex thermal protection system (TPS), with one that does not. Unlike the SSP and LESS-PRT, which struggled for decades to understand this phenomenon and were unsuccessful, the RIDT was able to solve this challenge in only four months! The RIDT was composed of a network of teams from LaRC, GRC, Boeing Philadelphia, and SRI. These teams all had a strong foundation in applied scientific research and were accustomed to using very similar scientific methods to construct knowledge. In fact, several of the researchers had prior working experience with members of the other teams, and they worked very well together. They were very accustomed to critical thinking and agonizing deliberations to vet ideas about the complex behaviors they were attempting to understand. The RIDT worked very well with other high-powered research organizations, like the team at Sandia National Labs that was able to use massively parallel computing power to validate that the lower degree-of-freedom (DoF) finite element (FE) models of the RIDT were accurate and sufficient. This was another example of an HPT exploring outside its core to elicit peer-reviewed, objective critique.

2. Communication – The RIDT was composed of a network-of-teams (team-of-teams)12 from two NASA research centers (LaRC and GRC), one research lab (SRI), and an industry research team at Boeing Philadelphia. Several members of the teams had working relationships prior to the accident, and all teams worked as well as or better than most co-located teams. Although in five separate locations, the team communicated almost on a daily basis and met at JSC, GRC, LaRC, SRI, and KSC every few months. The camaraderie was very strong, and good friendships developed. Communication was daily, open, and transparent. The few negative events/comments from the team were the result of attempts at micromanagement by Space Shuttle Program managers with very little technical expertise, too many managerial briefings, and the lack of an administrative team to help prepare burdensome reports and briefings.

3. Psychological Safety – It is no surprise that the RIDT developed an environment that was psychologically safe and open to dissenting discussions and technical critiques, both from within the team and from competing teams at JSC and elsewhere. All the individual teams were raised in strong research organizations with healthy, psychologically safe environments2 where all competing teams could scrutinize their results, unlike the LESS-PRT, which often controlled critical data and prevented other teams from accessing it.

4. Team size and makeup – The RIDT was made up of small teams of highly competent researchers who understood how to work on complex interdisciplinary problems in a closely converged manner. They increased their membership when needed, for limited amounts of time, to acquire much-needed expertise (e.g., the high-speed, nonlinear impact behavior and failure of RCC as obtained by the SRI team). The initial team also grew to include the Boeing Philadelphia team when the immense amount of work needed to discretize and produce the fine FE meshes for analyzing the full-scale structural test components was realized.

5. Team identity – The RIDT viewed itself as “Solution Seekers” as opposed to “Problem Solvers.”22 The network of teams of the RIDT was very quick to access its own subject-matter-expert (SME) networks to bring in new ideas and expertise when needed to solve the problem and was very willing to share the glory, all for the good of the mission/program.

6. Organizational structure – The RIDT maintained a flat organizational structure and a network-of-teams working model. Although the SSPO assigned a “leader” for the RIDT, it was apparent from the beginning that this role was really a managerial position and did not provide the strong technical leadership that was required. As in all high-performing teams (HPTs), team members gravitate to the people with technical expertise and the ability to lead, and this leadership is dependent upon the expertise needed at the moment. Individual roles and responsibilities were clearly defined and recognized at the outset, and local leadership grew organically from each of the individual teams accordingly. Friction would occur occasionally when SSPO leaders tried to prioritize work based on programmatic needs rather than the priorities of the technical teams. This is the case in nearly all programs, and a healthy tension is often a good thing that helps keep the teams progressing toward the mission, in this case the RTF mission (STS-114). I believe this particular team and mission had just the right mix of healthy programmatic and technical tension.

7. Creativity – The RIDT incorporated several new and very creative experimental techniques and technologies instrumental to their rapid
successes. The first experimental technique, developed by the researchers at GRC, was the use of a plastic sabot to protect the projectile prior to entering the vacuum chamber, enabling research of the foam projectiles as a function of altitude. Another new technology matured at GRC was the use of full-field, high-speed photogrammetry to measure displacements and strains to correlate with LS-DYNA analyses. The RIDT was also very successful in validating the failure theories for foam projectiles and the progressive damage models for RCC material. The RIDT is one of the best examples of an HPT whose mission is to explore, discover, and understand a new phenomenon as quickly and accurately as possible and to create new knowledge. It embodied almost all the elements and characteristics of an HPT and a true research culture. Imagine: the small RIDT was able to accomplish more in only four months than much larger organizations and teams had in decades of such exploration and research. In total, the teams published over 20 research papers in peer-reviewed journals and conferences and won several distinguished NASA awards. It was truly an “epic challenge” because the task was extremely difficult, and the time and funding constraints were very tight. The Space Shuttle Program wanted to fly as quickly and safely as possible for many reasons; most importantly, NASA had to complete the construction of the International Space Station (ISS). Team cohesion and camaraderie were exceptionally high throughout the project, as was team resilience (there were multiple overlaps in expertise within each individual team as well as across teams). The only recommendation I would have made would have been to have stronger, objective technical leadership at the Shuttle Program level to moderate the programmatic leadership of the team and to prioritize resources to possibly expedite results even further.

Formation of the R&D RCC On-Orbit Repair Team(s) (ROORT): Background – In early June 2004, it became obvious that all SSP TPS repair teams were struggling to develop concepts for on-orbit ceramic tile or RCC repair that could survive Earth-entry heating. I approached the head of the Astronaut Office, Capt. Kent Rominger, with a plan for developing a team to brainstorm new ideas for solving the repair problem. I had to convince Rommel that I had the right expertise and knew the right people who could make the magic happen and solve a problem that the much larger NASA/industry/academic teams (probably more than 200 people) had been struggling to solve for more than a year. This would be no easy task since there was much
friction after the Columbia accident between the Astronaut Office and the SSP, and many of the engineers and managers at JSC were tired of showboating, know-it-all astronauts butting into what they felt was their area of expertise. I was not like most of the other astronauts, though; I had spent over 22 years as a research scientist/engineer developing TPS for hypersonic vehicles and had managed a technical branch at NASA LaRC for more than five years. I also had years of experience working with other scientists and engineers leading huge hypersonic vehicle design programs like the NASP, and I had a large network of top experts around the country (the Friends-of-Charlie (FoC) network). Somehow, I convinced Rommel, and he walked me over to convince the Orbiter Project Manager at that time, Steve Poulos, to allow me to form a small team that would work independently and covertly, in parallel with, but not interfering with, the formal SSP RCC Repair Integrated Product Development Team, which was led by Mike Brieden. Mike had a BS in electrical engineering, had mainly worked on avionics systems for space, and did not have any real experience in TPS, materials, and structures development for hypersonic vehicles. However, as was the case at most of the NASA “operations” centers like JSC, the prevailing belief was that any good manager could manage any discipline. In reality, the RTF effort was so large that the workforce was stretched too thin at JSC, and they were very reluctant to give leadership of any human spaceflight programs to members of the “research” centers. Researchers were viewed by many members of the operations centers as academics who have very little applied/“real-world” experience and are always “playing in their sandbox” (spending money on academic interests and not providing any real value). It was only later in the program, after our initial team found solutions, that we were able to convince the SSP and Steve Poulos that we needed a larger R&D RCC repair team. The organizational structure of the RCC Repair Integrated Product Development Team was then expanded to include representation from the NASA research centers (LaRC, GRC, and ARC). The R&D RTF On-Orbit Repair Team (ROORT) activity was broken up into two distinct phases with different team members and missions, as dictated by the phase of the product development life cycle. The very early conceptual design phase, Phase I, was very exploratory and required a small team of key SMEs and innovative thinkers and tinkerers to rapidly conceive, prototype, test, and evaluate dozens or hundreds of ideas, categories of ideas, and parallel solutions. This required a protected (in our case, covert), psychologically safe environment where we could fail and learn smart, fast, small, cheap, early, and often. To ensure a psychologically safe environment, we required extreme secrecy and isolation
from the formal SSP repair effort. This secrecy protected us from the established SSP team of incumbents, who viewed us as competitors and who might try to shut us down for fear we would succeed and make them look foolish! As hard to believe as that sounds, it was unfortunately very true, as you will learn. The Phase II effort was later sanctioned by the SSP and led by an entirely different person, my good friend and colleague at LaRC, Dr. Stephen Scotti. It conducted exploratory studies to conceive additional ideas and utilized more rigorous thermal-structural analysis and testing to mature categories of ideas and alternative options simultaneously, using a rapid concept development (RCD), set-based design approach23.

Phase I – Don Pettit’s Garage10 – Before meeting with Rommel and Steve Poulos, I attended a Technical Exchange Forum (TEF) hosted by JSC in June 2003 to brainstorm repair ideas with invited guests from NASA, industry, and academia. I came away from the forum with several ideas for how to repair an RCC WLE and had enjoyed some of the out-of-the-box ideas presented by the participants. In particular, an old colleague from C-CAT corporation, Francis Schwind, whose company manufactures RCC components for NASA and the DoD, had an idea for RCC repair. His idea was to have an EVA astronaut drill and tap a hole in the RCC wing and screw in RCC fasteners to fill the hole where small damage was located, and to use the fasteners and RCC patches to fix larger damaged areas. Francis even came prepared to the TEF with a crude demonstration of his idea to show the participants. His idea was quickly written off by the leads of the TEF, mainly because Francis was not well known in the human spaceflight community. However, I knew Francis, respected his expertise, had worked with him on other programs, and thought his idea had possibilities. One day, I was sitting in the back of the room at an offsite technical meeting to discuss the status/progress of the SSP TPS repair team and became focused on a local aerodynamic heating problem that was causing all of the patch and plug repair concepts to fail prematurely (see figure 20). The ATK Thiokol plug concept was thick (t > 0.1 in.) and very stiff and had a steep bevel angle (> 6 degrees), which created a protuberance in the flow and increased local and ramp heating. This problem was causing all of the plug concepts to fail for the most severe entry heating locations and conditions. It was not well understood until we led another offsite R&D retreat and asked another FoC, Dr. Peter Gnoffo, to run some parametric studies to provide our team with design guidelines! In addition, we were told that because the ATK plug design was stiff, it required over 1,800 plugs to cover all the various curved WLE
It was at this time that I had an idea for solving the problem: develop a thin, doubly curved RCC plug that could deform to various curved shapes and present a thin, flat profile to the incident hypersonic flow.
Figure 20. – Comparison of local, lip/stagnation heating of RCC plug repair concept with CFD predictions.
I leaned over to Dr. John Koenig of SRI and invited him and several other key RCC researchers (most were members of the LESS-PRT) over to my home for dinner that evening to discuss ideas for solving this problem! At that all-night meeting at my house in League City, Texas, with several key experts in RCC materials, structures, aerothermodynamics, and TPS, I conceived several concepts that I believed could solve the problem. One idea combined Francis Schwind's concept of drilling and tapping a hole in the RCC leading edge with a C-C fastener to fill the hole. My contribution was the idea of making the RCC plug very thin and using RCC or C-SiC material. Now, RCC and C-SiC are very brittle materials that can easily crack during deflection; however, by making the plug very thin, it was hoped we could prevent cracking and also offer a plug that could easily conform to the many differing curved surfaces of the WLE. This could greatly reduce the number of plugs needed, from 1,800 to less than 18! Several of our early ideas are shown in figure 21. Francis Schwind also provided some very interesting ideas for the highly curved sections of the leading edge, such as the chevron design shown in the upper left portion of figure 22.
Figure 21. – Some of Dr. Charles Camarda’s early ideas for repairing an RCC wing leading edge on orbit.
Figure 22. – Covert exploratory on-orbit thermal protection system (TPS) design activity.
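A quick plate-bending estimate, my own back-of-envelope illustration here rather than the team's formal analysis, shows why thinness was the enabler for a brittle plug:

$$\varepsilon_{\max} \approx \frac{t}{2R} \quad\Longrightarrow\quad R_{\min} \approx \frac{t}{2\,\varepsilon_{f}},$$

where $t$ is the plug thickness, $R$ is the radius to which it is bent, and $\varepsilon_{f}$ is the failure strain of the brittle C-C or C-SiC. Because the tightest conformable radius scales linearly with thickness, halving $t$ halves the sharpest leading-edge curvature a plug can hug without cracking; that is the intuition behind one thin, doubly curved plug replacing many stiff, panel-specific ones.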
I decided to explore some of our ideas and generate others, and I created a small "covert" team to rapidly evaluate them. I enlisted a small team from the FoC network and persuaded a good friend and fellow astronaut classmate, Dr. Donald Pettit, to begin exploring ideas in secret in his garage/laboratory and to work with me in our spare time, without interfering with our primary duties as astronauts (see figure 22). I selected Don as my primary partner because he was a brilliant chemical engineer who had worked at Los Alamos National Labs, he had his own garage/laboratory, and he could build almost anything; he was extremely curious and creative and had what is referred to in Zen as a "beginner's mind." This refers to a person who approaches a subject with openness, eagerness, and a lack of preconceptions, even when studying at an advanced level, just as a beginner would! Our small duo expanded into a very small team of key SMEs in the areas of high-temperature structures and materials from other government, industry, and academic connections. This network provided not only key ideas
but also access to key SMEs and other sources of materials without going through the SSP for approvals. Using this network, the team rapidly obtained materials and built and tested numerous wing leading-edge and tile TPS repair concepts. We also explored approaches to drilling holes in C-C and built a prototype doubly curved, thin C-C plug that would conform to numerous curvatures on the leading-edge surface (see figure 22). The size of the circles in the network diagram in figure 22 indicates the number of team members, and the thickness of the lines indicates the relative communication traffic and/or strength of collaboration. This network was called the "Friends of Charlie" (FoC) network because it relied on trusted SMEs with whom I had many years of experience working, as well as on a detailed understanding of the knowledge and experience each person had in a specific domain (the "know who with the know-how"24).

The SSP team had attempted to drill holes in the RCC; however, they found it impossible to develop a drill bit that could penetrate the SiC coating of the RCC. Among the critical constraints on developing this tool were that the EVA power grip tool (PGT), which the astronauts used during spacewalks, was limited to less than 25 in-lbs. of torque, and that the normal force applied to the RCC surface had to be less than 5 lbs. over a 10-sec application period. This last constraint was due to a dynamic motion limit for an EVA astronaut in foot restraints on the Space Station Remote Manipulator System (SSRMS) with the space shuttle suspended by the Shuttle Remote Manipulator System (SRMS). Another constraint was that RCC material for development and experimentation was in short supply and controlled by the Leading-Edge Sub-System Problem Resolution Team (LESS-PRT).

Development of a Self-Advancing Step-Tap Drill – I want to emphasize that a key to the success of the repair effort was the ability to prototype concepts quickly and to experiment/test, fail, and learn rapidly. Don and I tested dozens of ideas for a drill bit to pierce the hard SiC coating of the RCC. We tried everything. At one point, when I was in my dentist's chair, I grabbed my dentist's hand as he was getting ready to drill a tooth, took a close look at his drill, and asked him if I could have a couple of his diamond-tipped drill bits! Needless to say, they would not drill through the SiC coating of RCC. A key "aha" moment occurred in our preliminary design and testing of drill bits, enabling us to develop a strategy to breach the very hard but brittle exterior SiC coating. Using a simple spring-loaded center punch, we were able to easily chip a tiny hole in the SiC coating (after only 3-6 applications). Once the coating was breached and the C-C substrate was showing, it was then very easy to drill through the RCC specimen.
Although most of the damaged regions we would need to repair on orbit would already have substrate showing, it would also be necessary to drill through undamaged RCC, for example, to attach a large area repair (LAR). We then enlarged our small team to include a skilled toolmaker from LaRC (Ron Penner) and went to a small machine shop outside the gate at LaRC to fabricate several of our initial prototypes. The bit design was stepped to keep the required normal force below 5 lbs. We conducted tests using the PGT, which spacewalking astronauts usually carry with them, to verify that we could drill and tap up to a 1-inch hole in RCC with less than 5 lbs. of normal force (figure 23a). We were also able to design, fabricate, test, and flight-certify a set of such drill bits in only seven months from initial conception, ready to fly on our return-to-flight mission, STS-114 (figure 23b).
Figure 23. – Development of self-advancing step-tap drill bits for RCC repair.
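As a minimal sketch of the pass/fail screening each prototype bit faced (my illustration, not the team's actual test procedure; the EVA limits come from the text, while the bench measurements are hypothetical):

```python
# EVA tool limits stated in the text.
EVA_TORQUE_LIMIT_IN_LB = 25.0     # PGT torque limit
EVA_NORMAL_FORCE_LIMIT_LB = 5.0   # max push force on the RCC surface,
                                  # applied over a ~10-second period

def passes_eva_gate(peak_torque_in_lb: float, peak_normal_force_lb: float) -> bool:
    """Simple concept gate: a drill-bit test passes only if it stays
    within both EVA tool limits."""
    return (peak_torque_in_lb <= EVA_TORQUE_LIMIT_IN_LB
            and peak_normal_force_lb <= EVA_NORMAL_FORCE_LIMIT_LB)

# Hypothetical bench measurements for two prototype bits:
print(passes_eva_gate(18.0, 4.2))   # True  -> stepped bit stays within limits
print(passes_eva_gate(18.0, 7.5))   # False -> this bit needs too much thrust
```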
Development of C-C and C-SiC fasteners – Francis Schwind of C-CAT Corp. was responsible for the design of the C-C plugs and C-SiC fasteners. We experimented with many materials, layups, and fastener head designs that would enable a captive tool to interface with the EVA PGT. In addition, the design of the head had to have as low a profile as possible so as not to protrude into the flow and cause excessive local heating, yet it had to be robust enough to accommodate the torque required for securing the fasteners. Some of the concepts developed are shown in figure 24. Fasteners were designed to be used alone, with C-SiC washers, or together with the small area repair (SAR) and large area repair (LAR) concepts.
Figure 24. – R&D development of C-C and C-SiC fasteners and washers to be used alone (to plug small hole damage) and/or in combination with small- and large-area repair (SAR and LAR) concepts.
Development of flexible C-C and C-SiC covers for plugs, SARs, and LARs – One of the key drawbacks of the original ATK plug design was that it was too stiff and required over 1,800 unique 9-inch-diameter plugs to cover all the critical areas of the leading edge. Several of the R&D Team's preliminary concepts therefore aimed at a highly flexible design that could undergo large deformations and flex to hug the curved RCC leading-edge surface. One of the original embodiments, shown in figure 25a, used multiple very thin, curved C-C or C-SiC sheets that could be nested like leaf springs to allow large deformations yet provide enough total thickness to be robust and redundant and give sufficient oxidation protection to the damaged RCC leading edge. Another variation of this idea was a single doubly curved, thin C-SiC shell, shown in figure 26, which, together with a thin, flexible gasket, could provide a redundant attachment means and a further mechanism for preventing flow under the plug and oxidation of the substrate.
Figure 25. – Doubly curved, thin, flexible C-C/C-SiC plug concepts.
To demonstrate the concept and get buy-in from the SSP, I visited C-CAT and designed a curved shell concept out of C-C material. They were able to manufacture it, together with a C-C bolt and a scrapped T-seal, in only one week. Don Pettit and I presented our idea for how an astronaut using a spring-loaded center punch could chip the SiC coating, after which a PGT could drill and tap a precise hole of any size into an undamaged, curved piece of RCC WLE material. We then plugged the hole with a doubly curved, 4-in.-diameter plug secured with a C-C fastener. As we tightened the fastener, the curved plug flattened nicely to the curved shape of the T-seal. This demonstrated that you could manufacture a "very brittle" C-C or C-SiC material to be doubly curved and flexible enough to conform to a highly curved surface (see figure 26). We also explored the use of gaskets to help seal against hot-gas ingress, and we coated the C-C fasteners with glass, which could melt and fuse the mechanically attached fastener to the C-C structure to provide a redundant attachment mechanism.
Figure 26. – Doubly curved, thin-shell plug demonstration and prototype.
The ability to rapidly prototype a working model is very important for demonstrating the feasibility of a concept to program managers, who may be less knowledgeable technically or may hold incorrect preconceived ideas. A working model shows the simplicity and feasibility of the concept and goes a long way toward gaining support. After the flexibility demonstration of the C-CAT RCC doubly curved shell concept (figure 26) at an Orbiter Configuration Control Board meeting on 9/17/04, Frank Lyn, one of the SSP RCC repair leads, contacted Kevin Rivers (a repair lead at LaRC) and one of their prime contractors for RCC repair, ATK Thiokol, and told them what he had just witnessed.
Less than a week later, Kevin Rivers' team machined a singly curved, "potato chip" C-SiC specimen, shown in figure 27, to demonstrate that they too could fabricate flexible, singly curved C-SiC plugs and thereby reduce the total number of plugs required from 1,800 to less than 20.
Figure 27. – C-SiC plug flexibility demonstration to ATK Thiokol by the SSP repair team on 9-23-04.
At the time the R&D Team presented their demonstration of a flexible ceramic plug to the SSP, the program was seriously considering cancelling ATK's contract to produce 1,800 C-SiC plugs because it would take too long and cost too much. Hence, by incorporating ideas from the R&D Team, we were able to fix a problem with the ATK plug design and to fly a set of eight plugs on the first RTF mission, STS-114, in the event of moderate damage to the RCC leading edges during the mission.

Initiation of Phase II – After Don and I had made significant progress on several concepts for repairing not only the RCC WLE but also the ceramic tiles, we decided we were ready to hand over what we had learned to a new, larger R&D Repair Team, which could accelerate the detailed analysis, design, prototyping, and testing using more rigorous analysis methods and test facilities. I selected Steve Scotti to lead this team. Steve was the head of my old branch, the Thermal Structures Branch at LaRC, and Kevin Rivers was a member of that branch; together with key researchers at other NASA centers and in industry, their team would be well suited to conceive new ideas and to rapidly mature multiple concepts simultaneously.
R&D Repair Team Innovative Design Workshop – We hosted a small, 2.5-day innovative design workshop at NASA LaRC in June 2004. We selected a group of key researchers, engineers, scientists, and technicians from around the country and held the workshop at LaRC's Innovation Center, which provided several rooms with floor-to-ceiling whiteboards, A/V equipment, computer capabilities, supplies, IT support, and a facilitator. We organized the meeting to first review the current status of the SSP RCC Repair Project; summarize the design requirements (e.g., cost, schedule, technical requirements, constraints, etc.); present a technology status with respect to several key disciplines (e.g., aerothermodynamics, thermal, materials, and structures); review the status of several key concepts such as crack repair and plug repair; present a short review of effective techniques for enhancing innovative thinking, such as brainstorming and TRIZ25; facilitate several brainstorming sessions; and develop a strategy for cataloging concepts and paring the list down to a manageable size. We arranged for keynote dinner speakers to help the team think outside the box and supplied each member with a copy of reference 25 to read before the meeting.

One of the key presenters at this meeting was a very good FoC and colleague, Dr. Peter Gnoffo, one of the top aerothermodynamic analysts in the world. After his presentation, I asked him what the root cause of the protuberance heating was and whether he could run some parametric analyses to identify design options that would mitigate the bump in heating. In only one day, Peter was able to run a computational fluid dynamics (CFD) program he had developed for multiple plug thicknesses and bevel angles26. The results of his study indicated that if we kept the thickness of the plugs to less than 0.1 in. and the bevel angle to less than six degrees, we could keep the peak temperatures below the oxidation limit of the SiC (3250 F)! This demonstrates that if you can connect with just the right SME (the "know who with the know-how"24) through a small-world network like the FoC and ask just the right questions, you can get the exact answer you need in a very short time.
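To make that workflow concrete, here is a minimal sketch of the kind of parametric screen such a study enables. This is my illustration, not Peter's code: the heating model below is a fabricated, monotone stand-in for his CFD solver, and only the resulting guidelines (thickness under 0.1 in., bevel under 6 degrees, and the 3250 F SiC oxidation limit) come from the text.

```python
from itertools import product

SIC_OXIDATION_LIMIT_F = 3250.0                 # stated SiC oxidation limit
THICKNESSES_IN = (0.05, 0.08, 0.10, 0.15)      # hypothetical sweep values
BEVEL_ANGLES_DEG = (3.0, 6.0, 9.0)

def predicted_peak_temp_f(thickness_in: float, bevel_deg: float) -> float:
    """Toy surrogate: peak temperature rises with protuberance height
    (thickness) and ramp steepness (bevel angle). A real screen would
    invoke the CFD solver here instead."""
    return 3000.0 + 1500.0 * thickness_in + 40.0 * bevel_deg

feasible = [(t, b) for t, b in product(THICKNESSES_IN, BEVEL_ANGLES_DEG)
            if predicted_peak_temp_f(t, b) < SIC_OXIDATION_LIMIT_F]
print(feasible)   # the toy model keeps only the thin, shallow-bevel corner
```

A handful of such runs is enough to bound the feasible corner of the design space, which is exactly the kind of design guideline the plug teams had been missing.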
hadn’t someone in the SSP RCC repair team ever asked such questions? Instead, they continued to watch their plug designs burn up in arcjet tests without understanding the root cause! Selecting and developing the team is crucial. One of the most important aspects of a healthy team is good communication. I was very fortunate to be allowed much freedom in that I was allowed to hand-pick each individual member of the original team. Only one or two of my original candidates could not attend the kickoff meeting or workshop. One key researcher from ATK Thiokol, whom I personally requested and who wanted to come, was prevented from coming by the leader of the SSP team (totally inappropriate and also illegal). The list of attendees is shown in figure 28.
Figure 28. – Attendees of the R&D RCC Repair Team Innovative Design Workshop held at NASA LaRC, 15-17 June 2004.
I believe it is important to have all the key disciplines and background experts available early on. I also believe it is important that every member of the team be given the "big picture" and be allowed to see where his/her piece fits into it. I would also argue that it is important to keep the team as small as possible while retaining the skills required to solve the problem, and only as large as necessary to complete the job within the assigned schedule and budget constraints. There were about 23 attendees, with areas of expertise spanning thermal structures, materials (both metallic and refractory composite), high-temperature seals, coatings, RCC, ablation, manufacturing, aerothermodynamics, structures, and fabrication. In addition, I seeded the team with several out-of-the-box creative thinkers and two astronauts (myself and Dr. Donald Pettit, who also offered on-orbit operations experience). It is important to instill ownership, responsibility, and accountability in every member of the team, ensure everyone has access to the big picture, and develop an atmosphere where rapid learning of multidisciplinary skills is easy and accessible to all.
What was amazing was that in 2.5 days, we developed approximately 60 concepts, which we distilled down to a manageable size (approximately 12) over several weeks. We also developed and tested several prototype concepts successfully in only three months' time. We applied for more than seven patents as a team, of which NASA has chosen to pursue three. The number of concepts grew and shrank over several weeks, as did the "flexible critical mass"27 of people who made up the R&D Repair Team. The approach used in this workshop later developed into a curriculum that became the foundation for the Innovative Conceptual Engineering Design (ICED) methodology28, a pedagogy designed to infuse innovation and set-based design practices into the conceptual design phase of a project.

The goal of the meeting was to exchange knowledge – each participant shared essential information from their perspective – and to define a broad design space by brainstorming concepts that could be used in a repair. Because of the varied backgrounds of the attendees, a good "cross-pollination" of ideas occurred. Dr. Steve Scotti's technical expertise, research culture, and strong leadership were very important for the eventual success of this high-performing team!

Categories of repair concepts resulting from the workshop included metallic and ceramic shells that could deform to fit different surface curvatures, large flexible refractory metal and ceramic sheets that could cover the largest damage areas, soft gaskets and pastes to prevent hot-gas ingress through gaps, and many different types of fasteners and means to drill and tap holes in the leading edge. The critical knowledge gaps for each class were identified during and after the meeting, and separate teams that "championed" a given class of repair concepts were formed to close those gaps. The most critical knowledge gaps were: 1) how a repair concept could be installed and verified by a spacewalking astronaut, 2) whether the concept could withstand the temperatures and pressures of Earth entry, and 3) whether the concept could prevent the hot plasma formed during reentry from entering the interior of the leading edge. Concept "gates" – simple tests and analyses that establish concept feasibility – were defined for each repair concept and provided goals for each team's efforts. Within each concept class, the championing team applied the set-based design philosophy of eliminating the "weaker" solutions. Weaker solutions were identified in several ways: they could have inferior performance as demonstrated by a quantitative metric, such as maximum operating temperature; they could have a larger number of knowledge gaps that could not be easily addressed; or they could have less applicability to the different damage scenarios than other alternatives within the class.
However, when a solution was also applicable outside its development team, such as a high-temperature fastener that could be used with several repair concepts, it was not eliminated. Each repair-class team was allowed to continue its development in parallel as long as possible, because the teams didn't all proceed at the same pace, and a "showstopper" in one class could be revealed late in the development (figure 29). Following this approach, the network shown in figure 28 was reconfigured several times during the effort to include additional participants (such as C-CAT, mentioned earlier) and to "prune" branches that worked on approaches found to be infeasible.
Figure 29. – Rapid Concept Development (RCD) efforts for one concept for space shuttle on-orbit wing leading edge repair.
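The pruning rule itself is simple enough to sketch in a few lines. The following is my schematic illustration of the logic just described, not the team's actual tooling; the concept names and gate results are hypothetical.

```python
from dataclasses import dataclass

@dataclass
class Concept:
    name: str
    gates_failed: int = 0          # feasibility tests/analyses failed so far
    open_knowledge_gaps: int = 0   # critical gaps not yet closed
    shared_across_teams: bool = False  # e.g., a common high-temperature fastener

def prune(concepts: list[Concept], max_gaps: int = 3) -> list[Concept]:
    """Set-based elimination: drop a 'weaker' concept unless other
    repair teams also depend on it."""
    survivors = []
    for c in concepts:
        weak = c.gates_failed > 0 or c.open_knowledge_gaps > max_gaps
        if not weak or c.shared_across_teams:
            survivors.append(c)
    return survivors

candidates = [
    Concept("flexible C-SiC plug", open_knowledge_gaps=2),
    Concept("rigid thick plug", gates_failed=1),  # e.g., burned up in arcjet test
    Concept("C-SiC fastener", gates_failed=1, shared_across_teams=True),
]
print([c.name for c in prune(candidates)])
# ['flexible C-SiC plug', 'C-SiC fastener']
```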
Some illustrative examples of team products for one repair concept, and of the many gaps that were closed, are shown in figure 29. The rapid concept development, set-based design approach23 allowed the feasibility of this concept to be fully evaluated in only three months. Additional technologies and repair concepts developed by the team are shown in figure 30. The development of these innovations followed the methodology described above. The R&D Repair Team also closed the capability gap for repairing a large leading-edge hole, a capability desired by the shuttle program but initially believed to be infeasible.
A contingency repair kit flew on STS-114 and on every mission following the Columbia accident, including the Hubble repair mission (STS-125). This capability was critical for the Hubble mission because the shuttle was in an orbit that prevented it from docking with the ISS and using it as a contingency safe haven in the event of irreparable damage to the vehicle. In planning for RTF, one of the safeguards was to use the ISS as a safe haven, with the ability to launch a rescue shuttle mission within a predetermined short time to save a stranded crew.
Figure 30. – Assorted technologies and concepts developed by the R&D Space Shuttle Wing Leading Edge Repair Team to repair small to large on-orbit damage to the reinforced carbon-carbon (RCC) wing leading edge.
In conclusion, it should be noted that one of the primary reasons members of research branches at NASA were so successful in identifying and fixing RTF problems was that they operated as well-integrated research product development teams accustomed to solving interdisciplinary problems in a converged way.
R&D Return-to-Flight (RTF) On-Orbit Repair Team (ROORT) Attributes, Characteristics, and Behaviors

1. Team Mission – The ROORT's mission was to solve an "epic challenge" that other, much larger teams had been unable to solve after more than a year. This required innovative/creative solutions and breakthrough ideas, and the ability to rapidly mature those ideas in less than three months so they could be certified and ready to fly on STS-114. The fact that the leaders of the ROORT were well-versed in rapid concept development (RCD) and set-based design processes was very important for their success23.

2. Culture – The ROORT consisted of two teams that worked during two serial preliminary conceptual design phases to develop and mature numerous concepts for on-orbit TPS repair. Both teams were led by technical experts with solid foundations in an exploratory research and development culture, as were most of the team members. Hence, the culture was psychologically safe and strongly supported a "permission to fail" exploratory analysis/experiment/design/prototype/fail/learn process.

3. Cognition – As mentioned above, the ROORT was led and dominated by members with strong research engineering and science backgrounds, so the methods for knowledge construction were very similar to those exhibited by the RIDT. The knowledge construction process during Phase I was not at the level of the RIDT, but Phase II analyses and tests were sufficiently rigorous that most knowledge gaps were closed and the selected concepts, which were eventually flown on STS-114, would have been successful. We did not pursue the next level of rigor because this team's mission was the 95% success assurance needed to intelligently select RCC repair design concepts that, once selected, would succeed.

4. Communication – Phase I of the ROORT was kept secret from the SSP's formal RCC Integrated Product Development Team (RIPDT) out of necessity. Communication within the team, however, was psychologically safe, open, and transparent. Once the ROORT was assured some of their ideas had merit and could benefit the program, Phase II was initiated, we sought the acceptance of the formal RIPDT, and the ROORT leader, Steve Scotti, became the deputy for the RIPDT. The ROORT shared all their technology advances with the RIPDT, which led to the rapid development and certification of our ideas/concepts for flight on the RTF mission, STS-114.
5. Psychological Safety – In stark contrast to the SSP RIPDT, members of the ROORT functioned in a totally psychologically safe environment and culture. At one point, the leader of the RIPDT actively tried to prevent members of the ATK Thiokol team from working with the ROORT. That is amazing given that the shared objective was to create a successful repair strategy that would help space shuttle astronaut crews survive a critical debris strike!

6. Team size, makeup, and organization – The core size of the ROORT was between 10 and 20 people. It was composed of key SMEs from all the key technical disciplines (e.g., aerothermal; thermal structures; materials, including high-temperature metallics, refractory composites, ceramics, and polymers; and fabrication), as well as out-of-the-box thinkers and operations experts. Many of the team members came from research organizations that were used to functioning in a transdisciplinary, converged way to solve complex technical problems. The network-of-teams or team-of-teams flat organizational structure and method of communicating and researching tough problems was instrumental in enabling rapid advances, the maturation of breakthrough ideas, and mission success.

7. Team identity – The ROORT viewed themselves as "solution seekers"22. They sought help from any and all sources they could find, within and outside their network. For example, when required, tool design and fabrication experts were teamed with members to produce a working self-advancing step-tap drill bit10,28, as were RCC/C-SiC material science experts from other research labs, like Dr. John Koenig of the Southern Research Institute (SRI) and Francis Schwind of C-CAT.

8. Creativity – The ROORT was composed of SMEs and researchers who were individually very creative and knowledgeable in working collaboratively in teams to enhance the collective creative output. They were also trained in techniques like TRIZ25 and biologically inspired design, and they used an innovative conceptual engineering design (ICED) methodology and rapid concept development (RCD) set-based design product development strategies23,25.

The examples above demonstrate what is possible when you have the right organizational ecosystem and culture and understand how to create, sustain, and enhance high-performing teams (HPTs) to solve complex, multidisciplinary problems.
Both the RIDT and ROORT teams were developed using the conventional methods described and were able to rapidly solve heretofore unsolvable problems in a matter of months, clearly within the recovery windows of most anomalies identified in both shuttle disasters and in many other HROs that have experienced similar tragedies (e.g., the Boeing 737 MAX). How can we improve the capability to detect cultural and technical "weak signals" of dysfunction instantly and rapidly respond to prevent future disasters? More importantly, what does it take to actually reverse the slippery slide to dysfunction, fix the root causes of such problems, and transform a broken culture?

References
1. Gehman, H. W., et al.: "Columbia Accident Investigation Board." Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
2. Edmondson, Amy C.: "The Fearless Organization: Creating Psychological Safety in the Workplace for Learning, Innovation, and Growth." John Wiley and Sons, 2019.
3. Barthelemy, Bart: "The Sky Is Not the Limit – Breakthrough Leadership." St. Lucie Press, 1997.
4. Camarda, Charles J.: "Space Shuttle Return-to-Flight Following the Columbia Tragedy." NATO Science and Technology Organization Lecture Series on "Hypersonic Flight Testing," STO-AVT-234-VKI, March 24-27, 2014, von Karman Institute, Rhode-St-Genese, Belgium.
5. Scotti, Stephen J.: "Orion Alternate Launch Abort System Follow On Technical Assessment." NESC-RP-06-08, Vol. I, May 12, 2011.
6. Scotti, Stephen J.: "Orion Alternate Launch Abort System Follow On Technical Assessment." NESC-RP-06-08, Vol. II, Appendices, May 12, 2011.
7. Camarda, Charles J.; de Weck, Olivier; and Do, Sydney: "Innovative Conceptual Engineering Design (ICED): Creativity and Innovation in a CDIO-Like Curriculum." Proceedings of the 9th International CDIO Conference, Massachusetts Institute of Technology and Harvard University School of Engineering and Applied Sciences, Cambridge, Massachusetts, June 9-13, 2013.
8. Pellerin, Charles J.: "How NASA Builds Teams – Mission Critical Soft Skills for Scientists, Engineers, and Project Teams." John Wiley and Sons, Inc., 2009.
9. Rich, Ben R., and Janos, Leo: "Skunk Works." Little, Brown, and Company, 1994.
10. Camarda, Charles J.: "A Return to Innovative Engineering Design, Critical Thinking, and Systems Engineering." Keynote address presented at the International Thermal Conductivity Conference (ITCC) and the International Thermal Expansion Symposium (ITES), Birmingham, AL, June 24-27, 2007. https://drive.google.com/file/d/0B84i3cJ_nNa0d3JiOGl4NWwwQ28/view?usp=sharing
11. Final Report of the Return to Flight Task Group, July 2005.
12. McChrystal, General Stanley: "Team of Teams – New Rules of Engagement for a Complex World." Penguin Publishing Group, 2015.
13. Report of the Columbia Accident Investigation Board, Vol. II, Appendix D.12 – Impact Modelling, Government Printing Office, Washington, D.C., October 2003. http://www.nasa.gov/columbia/caib/html/VOL2.html
14. Goodlin, Drew L.: "Orbiter Tile Testing." Southwest Research Institute Final Report #18-7503005, prepared for NASA JSC, San Antonio, Texas, March 5, 1999.
15. Grosch, D. J., and Riegel, J. P.: "Ballistic Testing of Orbiter Tiles." Southwest Research Institute Final Report #06-2720, prepared for Rockwell International, San Antonio, Texas, February 10, 1989.
16. Fasanella, Edwin L.; Jackson, Karen E.; Lyle, Karen H.; Jones, Lisa E.; Hardy, Robin C.; Spellman, Regina L.; Carney, Kelly S.; Melis, Matthew E.; and Stockwell, Alan E.: "Dynamic Impact Tolerance of Shuttle RCC Leading Edge Panels Using LS-DYNA." AIAA Paper 2005-3631, presented at the 41st AIAA/ASME/SAE/ASEE Joint Propulsion Conference and Exhibit, 10-13 July 2005, Tucson, Arizona.
17. Carney, K.; Melis, M.; et al.: "Material Modeling of Space Shuttle Leading Edge and Tank Materials for Use in the Columbia Accident Investigation." Proceedings of the 8th International LS-DYNA Users Conference, Dearborn, MI, May 2-4, 2004, pp. 3-45 to 3-55.
18. Melis, Matthew E.; Brand, Jeremy; Pereira, Michael J.; and Revilock, Duane M.: "Reinforced Carbon-Carbon Subcomponent Flat Plate Impact Testing for Space Shuttle Orbiter Return-to-Flight." NASA TM-2007-214384, September 2007.
19. Fasanella, Edwin L.; Jackson, Karen E.; Lyle, Karen H.; Jones, Lisa E.; Hardy, Robin C.; Kellas, Sotiris; Carney, Kelly S.; and Melis, Matthew E.: "Dynamic Impact Tolerance of Space Shuttle Orbiter Wing Leading-Edge Panels." Journal of Spacecraft and Rockets, Vol. 45, No. 5, September-October 2008.
20. Melis, Matthew E.; Revilock, Duane M.; Pereira, Michael J.; and Lyle, Karen H.: "Impact Testing of Reinforced Carbon-Carbon Flat Panels with BX-265 and PDL-1034 External Tank Foam for the Space Shuttle Return to Flight Program." NASA TM-2009-213642/Rev 1, December 2009.
21. Gabrys, J.; Schatz, J.; Carney, K.; Melis, M.; Fasanella, E.; and Lyle, K.: "The Use of LS-DYNA in the Columbia Accident Investigation." 8th International LS-DYNA Users Conference, Livermore Software Technology Co., Livermore, CA, 2004, pp. 3-1 to 3-10.
22. Lifshitz-Assaf, Hila: "Dismantling Knowledge Boundaries at NASA: The Critical Role of Professional Identity in Open Innovation." Administrative Science Quarterly, 2017, pp. 1-37.
23. Camarda, Charles J.; Scotti, Stephen; Kunttu, Iivari; and Perttula, Antti: "Rapid Learning and Knowledge-Gap Closure During the Conceptual Design Phase – Rapid R&D." Technology Innovation Management Review, Vol. 10, Issue 3, March 2020.
24. Larsson, Andreas: "Engineering Know-Who – Why Social Connectedness Matters to Global Design Teams." PhD Dissertation, Luleå University of Technology, 2005.
25. Altshuller, Genrich: "And Suddenly the Inventor Appeared – TRIZ, the Theory of Inventive Problem Solving." Technical Innovation Center, Inc., Worcester, Massachusetts, 2001.
26. Gnoffo, Peter A.: "Patch Heating Comparisons at Mach 18 and Mach 16.5." March 8, 2005.
27. Graham, Margaret B. W., and Shuldiner, Alec T.: "Corning and the Craft of Innovation." Oxford University Press, Inc., New York, 2001.
28. Camarda, Charles J.; Bilen, Sven; de Weck, Olivier; Yen, Jeannette; and Matson, J.: "Innovative Conceptual Engineering Design – A Template to Teach Problem Solving of Complex Multidisciplinary Problems." ASEE, Louisville, Kentucky, 2010.
Chapter 8
Fixing a Broken Culture – Putting It All Together

We have talked a lot about what causes great organizations to slowly slide into a toxic organizational culture, which can lead to recurring disasters with tragic economic and human costs, and about how difficult it is to reverse that slide and return an organization to its storied past. The detailed examples presented were from two NASA Space Shuttle tragedies and the agency's inadequate responses during and after those disasters, which I was able to uncover because I was embedded at different levels within NASA, directly working the return-to-flight period post-Columbia. As such, it was very easy for me to highlight the fierce persistence of such disturbing cultures and resulting behaviors and their resistance to change. These same cultural problems have been shown to be the underlying causes of disasters in many other high-hazard/high-risk industries that must operate as high-reliability organizations (HROs) to ensure public safety and/or the production of critical resources, such as energy production (nuclear and fossil fuel), air and rail transportation, the armed services, disease control, and medical services/emergencies. Culture, as will be shown, was the underlying cause of several accidents even before it was accepted as such following the Challenger and Columbia accidents.

In mid-January 2018, I was invited to participate in a Boeing Leadership Workshop in St. Louis, Missouri, to help build an online course for Boeing employees and others to help "Enable Technical Leaders at Every Level Change Behavior and Produce Transformational Change." During that meeting, I presented most of the horrific details of how badly the NASA culture had failed in its efforts to maintain the safety of two space shuttle crews, as described in Chapters 2-4. After I presented my very stark and candid observations, I was approached by several Boeing leaders who took me aside, looked me in the eyes, and, to my surprise, told me, "Charlie, the culture is worse at Boeing!"
I did not believe what I was hearing. All my experiences with Boeing Commercial, Boeing Defense, and McDonnell Douglas Aerospace personnel and managers had been positive. I felt Boeing was one of the top "learning organizations" I had ever experienced, with very professional executives and a strong commitment to learning throughout the organization, which it also extended to universities and schools throughout the country and around the world. Then the unimaginable happened. In October 2018 and March 2019, two Boeing 737 MAX passenger jets crashed minutes after takeoff, claiming 346 lives. This would be the second time I am aware of that accident investigators would point to culture as one of the primary causes of recurring accidents. What is more disconcerting is that people within the organization, after hearing my presentation, were quick to recognize that the same toxic culture existed at Boeing over eight months prior to the first crash, Lion Air Flight JT610, on October 29, 2018. Rest assured, there were probably many more employees who experienced the boiling-frog-like gradual slide to dysfunction within Boeing much earlier but were unable, for whatever reasons, to right the ship and prevent the tragedies despite having such a long recovery window to do so.

Boeing and the 737 MAX Disasters

On October 29, 2018, Lion Air Flight 610 took off from Soekarno-Hatta International Airport en route to Depati Amir Airport in Indonesia and crashed into the Java Sea 13 minutes after takeoff, killing all 189 passengers and crew. Less than six months later, on March 10, 2019, Ethiopian Airlines Flight 302 departed Bole International Airport in Addis Ababa, Ethiopia, en route to Jomo Kenyatta International Airport in Nairobi, Kenya, and crashed six minutes after takeoff, killing all 157 people onboard. Both accidents occurred on Boeing 737 MAX aircraft, and both flights experienced very similar trajectories and anomalies. Detailed accident investigations in two countries1,2 and Congressional committee hearings3 into the causes of both accidents revealed many disturbing similarities in the technical and cultural issues.

Both crashes were triggered by a failure of one of the two angle-of-attack (AOA) sensors on the aircraft. The AOA sensors indicate the angle of the relative velocity of the airstream with respect to the chord line of the wing during flight; their readings were used by the new flight control software, the Maneuvering Characteristics Augmentation System (MCAS), to send electronic signals commanding the flight control surfaces, usually resulting in a nose-down pitch of the aircraft.
A visual timeline of key events in the history of the 737 MAX aircraft, taken from reference 5, is shown in figure 1.
Figure 1. – 737 MAX timeline of key events from 2010 to 20195.
The 737 MAX, a derivative of Boeing's very successful 737 family, was announced in 2011 to counter a financial threat posed by its competitor, Airbus, which in 2010 had announced the A320neo, a new family of fuel-efficient aircraft using a much larger, more fuel-efficient engine. Rather than design a totally new aircraft, as was originally planned, which would take much longer and cost more, Boeing chose to redesign its successful 737 family and outfit it with larger, more fuel-efficient engines. There was one catch, however. The 737 family was built lower to the ground than the A320, so the new engines would have to be placed higher and farther forward on the 737 wing. This, in turn, changed the aerodynamic and flying characteristics of the MAX so dramatically that it would have to rely on a new electronic control system, MCAS, to prevent a nose-up stall, a condition in which a high AOA causes aerodynamic lift to be lost rapidly, so that the aircraft loses altitude and can potentially crash. If such a condition happens during takeoff at low altitude, the pilots would have very little time to diagnose the cause and react appropriately to save the vehicle.
Hence, if the MCAS diagnosed such a condition – for example, if an AOA sensor read an excessively high angle – it would send electronic commands to pitch the nose of the aircraft down to regain airspeed and lift. As the software was written, MCAS would continue to override manually commanded nose-up inputs, trimming as far as the maximum stabilizer angle would allow. If the pilots did not electrically cut out the MCAS, then as airspeed continued to rise in a nose-down attitude, the forces needed to manually override the electronically commanded inputs could become too great, and the aircraft could continue to dive nose-first into the ground or water. Many have called Boeing's decision to fix a fundamental aircraft design and stability problem with software a "Band-Aid fix." In addition, although the aircraft has two AOA sensors, the system, though critical, was not truly redundant; it had a single point of failure because only one failed sensor was required to activate the MCAS. This alone is not troublesome as long as the pilots have the required training, understand the operation of the MCAS system, its potential failure modes, and its cockpit alerts, and can rapidly disengage the software and manually fly and safely recover the vehicle before increasing airspeed makes the forces too great to override. However, it appeared that Boeing downplayed the criticality of the MCAS system and withheld details of its operation from pilots, FAA regulators, and airlines in order to sell the "new" aircraft as a minimal derivative of the 737 family, one that would not require pilot training and would thus be much less expensive for airlines to operate3.
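The single-point-of-failure argument can be made concrete with a schematic sketch. This is purely my illustration, not Boeing's flight software; the threshold and tolerance values are hypothetical. The first function acts on one sensor, as MCAS reportedly did, while the second shows the kind of cross-check that true redundancy would require.

```python
AOA_TRIGGER_DEG = 15.0        # hypothetical activation threshold
DISAGREE_TOLERANCE_DEG = 5.5  # hypothetical sensor-disagreement tolerance

def mcas_trigger_single_sensor(active_aoa_deg: float) -> bool:
    """Schematic of the reported design: the trigger reads one AOA
    sensor, so a single failed vane can command repeated nose-down trim."""
    return active_aoa_deg > AOA_TRIGGER_DEG

def mcas_trigger_cross_checked(left_aoa_deg: float, right_aoa_deg: float) -> bool:
    """A redundant alternative: inhibit the system (and alert the crew)
    when the two sensors disagree beyond a tolerance."""
    if abs(left_aoa_deg - right_aoa_deg) > DISAGREE_TOLERANCE_DEG:
        return False  # sensors disagree -> inhibit automatic trim, annunciate fault
    return min(left_aoa_deg, right_aoa_deg) > AOA_TRIGGER_DEG

# A stuck vane reading an erroneously high angle while the other vane
# reads a normal value (hypothetical numbers):
print(mcas_trigger_single_sensor(70.0))        # True  -> uncommanded nose-down trim
print(mcas_trigger_cross_checked(70.0, 5.0))   # False -> inhibited by the cross-check
```

The point is not the specific logic but the systems-engineering lesson: a flight-critical function keyed to a single sensor is a single point of failure, however many sensors the airplane carries.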
Most of the key indicators of the 737 MAX problems, both technical and cultural, were evident beginning with the redesign of the 737 aircraft and its outfitting with the MCAS, continuing through the certification process, and persisting after the Lion Air crash (October 29, 2018). The "weak signals" were not so weak and should have been heeded6:

1. Culture Problem – Long before the decision to build the 737 MAX, Boeing merged with McDonnell Douglas, and many of the high-level executive positions were filled with McDonnell Douglas managers who did not have the legacy technical background of the Boeing team and who were driven by company profits, investor concerns, and market share. After the merger, the Boeing culture shifted from one in which sound engineering leaders were present even at the top levels of management to one driven by the financial bottom line. Its non-psychologically-safe environment was exposed by numerous whistleblowers who came forward to speak following the disasters3,7-8. For example, 39% of the Boeing-appointed Authorized Representatives (ARs), to whom regulatory responsibilities of the Federal Aviation Administration (FAA) had been delegated, "perceived 'undue pressure,'" and 29% were concerned about the consequences if they reported potential "undue pressure."3

2. Loss of technical excellence – Prior to its August 1997 merger with McDonnell Douglas, Boeing relied on its heritage of technical excellence and its respect for technical research leadership, which enabled a healthy tension with program managers who had much less technical expertise and were driven to perpetuate a "production culture" and its associated schedule and budget demands. The Boeing Technical Fellows program was a widely respected aspect of the company that awarded its most brilliant researchers a title highly regarded within Boeing and throughout the aerospace community. These Technical Fellows were paid salaries approaching "dual career" equivalence with their programmatic counterparts and were allowed to remain technical and conduct needed research without the cost of administrative overload. Post-Columbia, at NASA, Ralph Roe would attempt to mimic this recognition of technical excellence by establishing a Technical Fellows (TF) designation within the NASA Engineering and Safety Center (NESC) for its top engineers, together with a substantial increase in salary. The difference was that these NASA TFs did not necessarily maintain an objective research role independent of programmatic needs, and some did not have notable research experience. The TF program at Boeing became watered down over time and driven to satisfy diversity requirements9. In the words of one Boeing whistleblower, "There is no respect for an expert culture that has existed through years of experience. There is no acknowledgment of recommendations made by experts or an explanation about why a different decision was made3."

3. Economic Problem – The 737's engines used too much fuel and cost airlines more to operate, and Boeing was losing market share to its competitor, Airbus, and its new aircraft, the A320neo. NASA had a similar problem: while NASA is not concerned with profit as a commercial company is, production pressure nevertheless drove a push for efficiency to keep costs down, as all government bureaucracies are accountable to the American taxpayers.

4. Airframe design/selection – Utilizing the 737 airframe was the quicker and cheaper solution.
5. Aerodynamic stability problem – Moving the new fuel-efficient engines higher and farther forward on the 737 airframe caused stability problems that required the MCAS.

6. Systems engineering problem – Boeing wanted the simplest possible fix, and to keep costs down (financial and production pressure), they insisted that the 737 MAX fit the existing systems engineering architecture and require no additional pilot training and only minimal engineering rework and maintenance. The first rule of systems engineering is to fully understand how each change will affect all other elements of the system. Boeing engineers should have understood the implications of the loss of one AOA sensor, the importance of pilot training on the operation of MCAS, etc.

7. Maintenance problem – Both of the 737 MAX crashes involved an AOA sensor failure. In fact, the flight prior to the Lion Air accident experienced an AOA problem that was neither corrected nor recorded in the maintenance logbook before the Lion Air JT610 flight on October 29.

8. Pilot training problem – Lion Air pilots were never told the details of the MCAS system. The FAA issued an Emergency Airworthiness Directive (AD) after the crash, on November 7, 20184, to correct the unsafe conditions on all 737 MAX aircraft, requiring flight crews to use a revised runaway-stabilizer operational procedure if they encountered certain conditions. The emergency AD was an interim action, and further action was planned based on what the FAA and Boeing learned from investigating the JT610 accident. By the time of the Ethiopian Airlines ET302 crash on March 10, 2019, no one had conducted simulator training on this failure.

Resistance to Cultural Change

Immediately after the Lion Air crash in October, Boeing was in defense mode, denying MCAS was the cause, much as Shuttle Program managers were quick to deny that the ET foam strike caused the Columbia accident. Many within Boeing blamed both accidents on inexperienced pilots and inadequate training in the two countries. In fact, even after the second crash, that of Ethiopian Airlines, Muilenburg's replacement, David Calhoun, in a PBS Frontline documentary, still implied, by his refusal to answer the interviewer's direct question, that the fault for the accidents rested with poorly trained and inexperienced foreign pilots8. Boeing's new CEO appeared to be in denial about the real cause of the accidents.
Attempts by journalist-pilot William Langewiesche to discredit the pilots' responses in both accidents were disputed when Captain Chesley "Sully" Sullenberger, the pilot who famously and successfully crash-landed an Airbus A320 in the Hudson River after bird strikes on takeoff knocked out both engines, replied in a letter to the editor: "Inadequate pilot training and insufficient pilot experience are problems worldwide, but they do not excuse the fatally flawed design of the Maneuvering Characteristics Augmentation System (MCAS) that was a death trap10."

The final observation of the Congressional committee hearing on the Boeing 737 MAX recommended that "Boeing can and must take significant steps to create and maintain an effective, fulsome, and forthright safety culture. However, the Committee's investigation raises questions regarding Boeing's commitment to doing that or even to simply acknowledging that it made mistakes in the design, development, and certification of the 737 MAX aircraft." Truer words were never spoken, and they were very similar to those of the Columbia Accident Investigation Board (CAIB) regarding NASA's ability to transform and its lack of admission of wrongdoing11.

The dysfunctional culture appeared to persist after the 737 MAX disasters, similar to NASA post-Columbia, and was evident throughout the company, even in its newly opened manufacturing plant in South Carolina. A whistleblower, John Barnett, the lead quality control (QC) manager working on Boeing's new airliner, the 787 Dreamliner, at that plant, admitted that "there's a lot of pressure to meet schedule...so there's an incentive not to report the defect that you created because it's gonna be held against you7." John also confirmed that a manager in the South Carolina plant ordered that a defective hydraulic line be used to complete a production plane. Boeing's response to such cultural problems? Denial. Without recognition of mistakes and contrition at the top levels of Boeing or NASA management, how could anyone expect the culture there to change? And so, it did not.

In the months and years following the 737 MAX disasters, Boeing was plagued by numerous anomalies, near-catastrophic incidents12, and safety issues. In 2024 alone: 1) an Alaska Airlines flight made an emergency landing after a shoddily installed door plug blew off in flight on January 5; 2) All Nippon Airways was forced to cancel a takeoff due to a cracked cockpit window on January 13; 3) an Atlas Air aircraft showed flames coming out of an engine in flight on January 18; 4) a Delta Air Lines Boeing 757 lost its nosewheel during taxiing on January 20; 5) a United Airlines 737-8 experienced stuck rudder pedals on February 6, prompting an NTSB probe; 6) a United Boeing 777-200 was forced to return to Los Angeles after one of its wheels fell off after takeoff on March 7; 7) a LATAM Airlines Boeing 787-9 Dreamliner
experienced technical issues that caused the plane to drop suddenly midflight on March 11; 8) on March 13, a United Boeing 777-300 had to turn around and land where it started after a fuel leak was observed; and 9) on March 15, a United Boeing 737-800 arrived at its destination with an exterior panel missing, having seemingly fallen off during flight. While all nine of these incidents may not be attributable to Boeing directly, the damage to its reputation could be irreparable. The cost to Boeing of the 737 MAX tragedies alone amounted to more than $18 billion as of 2024, and it was projected that Boeing could stand to lose as much as $44 billion in revenue from a drop in sales. Even in its final responses to a Congressional hearing, "Boeing blamed industry-wide assumptions regarding pilot response times: In designing MCAS, Boeing relied on well-accepted, industry-wide assumptions in evaluating how pilots would react to the uncommanded activation of MCAS for any reason, including erroneous AOA. Those assumptions proved not to be accurate in these accidents. Accordingly, we now know that there is a greater risk from unintended activation of MCAS due to erroneous AOA data than we originally thought. Our system redesign addresses this issue.3"

Other issues in the Boeing case are very similar to NASA pre- and post-Columbia: 1) the Boeing merger with the McDonnell Douglas (MD) Company and the slide from a culture led by engineers and scientists (a rapid loss of a corporate research culture) to one led by profits and market share (the new CEO post-merger was considered an MD "bean counter," similar to the NASA Administrator at the time of Columbia, Sean O'Keefe); 2) undue production pressure; 3) a fear among employees of speaking up (psychological safety); and 4) the lack of an independent technical authority (aircraft certification was delegated by the FAA to Boeing, creating a clear case of the fox guarding the henhouse and no real objective oversight, similar to NASA's NESC alignment with the Shuttle Program for flight rationale).

"At the Committee's December 2019 hearing on the MAX, Dr. Mica Endsley, a human factors expert and former chief scientist of the U.S. Air Force, noted the importance of cultural changes at both the FAA and Boeing: There has been considerable discussion here today and also previously in the press about concerns about safety culture at both Boeing and the FAA that sort of underlies a lot of the failures we saw in good process and ended up being in good design. The FAA Administrator and Boeing have made a number of announcements of things they are going to do to try to fix that, and we are glad to see that, but changing culture is really hard. You can't just give a one-shot, and it is done. It is something you have to do every day.
It has a lot more to do with actions than with words, and so, the importance of really following up on those actions of taking safety issues very seriously, of reprioritizing safety with regard to production and cost in schedule, those changes are going to require a lot of continued interaction by management…3"

The bottom line is that changing a culture is very hard and takes years, and leaders have to "walk the talk," lead by example, and monitor progress at all levels!

Culture has also been recognized as a contributing cause of numerous disastrous accidents in multiple complex, highly technical industries, such as oil and natural gas, the chemical industry, and the nuclear industry. While the word culture is not mentioned in the report on the British Petroleum (BP) Deepwater Horizon oil spill disaster in the Gulf of Mexico, which killed 11 people, spilled over 210 million gallons of oil, and cost the company over $65 billion in losses13, culture has been recognized as a major contributing factor by whistleblowers and analysts alike, who point to BP's ambitious climb and rapid growth in becoming one of the top global leaders in oil production and its drive to cut costs and reduce standards13,14. Once again, the warning signs of a slippery slide to dysfunction at BP were evident early on: the Texas City oil refinery explosion in 2005, which killed 15 people, and the 200,000-gallon oil spill in Prudhoe Bay, Alaska, in 2006. Changes in upper management and organization did little to reverse the slippery slide and avoid the largest oil spill the world would ever see. Similar to the NASA and Boeing disasters, the technical slide at BP from a "research culture" caused managers and engineers to "misinterpret vital test results that would have confirmed that hydrocarbons were seeping from the well. In addition, BP relied on an older version of a complex fail-safe device called a blowout preventer or BOP that had a notoriously spotty track record.15"

Fixing a Broken Culture – Returning to a High-Performance Organization

Leadership – Top Down, Bottom Up: To truly change the culture of an organization takes years and requires a very strong and consistent commitment from the top leaders of the organization and complete agreement and buy-in from leadership staff at all levels within it. Remember, it only takes one small team (e.g., the O-Ring work group at MSFC or the LESS-PRT at JSC) or one organization (e.g., the Engineering Directorate, the NASA Engineering and Safety Center (NESC), or the Space Shuttle Program (SSP) Office at JSC) within the agency/company to destroy a psychologically safe environment and allow a disaster to occur.
Leaders have to lead by example, rewarding employees who are not afraid to speak up candidly and who are ready to admit when they do not understand or do not know the answer to a critical technical question. Leaders must be ready to make drastic organizational changes to weed out remnants of past dysfunctional behaviors and to actively reward behaviors that reinforce the core values and attributes that ensure a high-performance organization, such as a research culture, enabling a restoration of past glory and success.

One of the first things I did as the newly appointed Director of Engineering at JSC was to hold a 3-day offsite retreat in The Woodlands, Texas, for all our Tier-2 and Tier-3 managerial staff (division chiefs and branch heads). I knew what needed to happen to achieve the technically excellent engineering team my administrator, Mike Griffin, had tasked me to build; however, the journey would have to be one my team would recognize and own for themselves. We used ideas from "Built to Last"16 to structure the process: we would examine what made the JSC Engineering Directorate great and successful during the Apollo years, identify its core values and purpose, objectively examine our current state, and plan an envisioned future and a strategy for how to get there! This process was very painful and, at times, I thought, almost hopeless. At one point, when we took stock of all the prioritized values listed by my senior team leaders, I realized there was one value I thought was essential that was missing: TEAMWORK! I could not believe what I was seeing and hearing. Could you imagine the JSC Apollo engineering gods like Max Faget and Bob Gilruth building an organization that would lead the U.S. to successfully land on the moon and return safely in 8 years without having a core value like teamwork? Luckily, that evening, we had planned for an inspirational speaker to address our teams. Walter Bond, an NBA basketball professional, spoke about what it means to be the best "sixth person" on the bench who supports the team, and about the difference between confidence and arrogance. The next day, when our teams reviewed their selections for organizational core values, to my surprise, teamwork was high on their list! It was beautiful. I did not have to push my team; Walter's lecture "pulled" the teams to choose teamwork as a necessary key value.

My insistence on maintaining the culture at all levels is based on my firsthand experience attempting to accomplish this monumental task and on my naïve assumption that my administrator understood what it would take to change a culture and would have my back. Instead of supporting me, he overruled our expert advice when I stood up at the Flight Readiness Review (FRR) for STS-121, where I and two other engineering and safety leaders said we should not fly.
Mission Out of Control
Then, when I was fired/reassigned three days later by my center director, he did not support me. Instead of leading by example, demonstrating the importance of psychological safety and technical excellence, in my opinion, he singlehandedly and probably unknowingly sent the NASA human spaceflight culture back to its darkest moments pre-Columbia, where the engineering function was subservient to the program managers, dissent would not be tolerated, and production pressure ruled the day! This is not leadership from the top down and the bottom up that consistently stresses key core values such as psychological safety, and it is definitely not leadership that displays by actions what you are preaching to the team (what some call walking the talk). Instead, it destroyed trust and sent mixed messages.

Because most leaders of highly complex technical organizations come from the “left-brained” science and engineering professions, there is very little understanding of and tolerance for the “softer” sciences like sociology, psychology, and organizational behavior. Hence, the tendency is to dismiss the importance of understanding human behavior in processes such as knowledge capture and intelligent decision making. I would highly recommend a leadership structure in which sociologists and organizational behavior researchers work side-by-side with C-suite executives who intend to change a corporate culture. This can also be accomplished by incorporating technology like AI alongside key sociologists and behavioral scientists trained to identify weak signals of technical and cultural dysfunction.

Building a research culture throughout the organization: Building a research culture like the one that existed in NACA and early NASA takes time and requires leadership that understands what research and applied research are and what it takes to create an environment where failure is embraced as a necessary requirement for discovery and learning, and where dissenting opinions and critique by external experts are welcomed. Leaders of a research organization must be drawn from experienced and accomplished scientists and researchers who are the best of the best. They must be allowed to continue their research and yet lead and direct teams without the burden of administrative overhead. This culture may be easy to create when new organizations are first established; however, it is hard to maintain such an environment as external pressures (e.g., production pressure and politics) force growth and conflicting values begin to emerge, as was the case for NASA during Apollo and for Boeing post-merger with McDonnell Douglas. Hence, the fight to maintain such a capability can only be articulated at the top by someone who understands and has the freedom and the courage to speak truth to power.

As described in Chapter 5, a research culture is committed to knowledge, the creation of new knowledge, and the rigorous validation of that knowledge. The organization must place a premium on hiring the brightest minds, supporting their advanced learning, and mentoring employees hired to conduct applied research by allowing them to explore and grow in a psychologically safe environment. It must establish and maintain premier laboratories and facilities to conduct experiments to the highest standards, which simulate the actual environment and loading conditions and utilize the latest sensors and instruments to accurately measure parameters to correlate with analytical and numerical predictions. It must also recognize the importance of a highly skilled and trained technician staff that works side-by-side with engineers and researchers to use the most advanced manufacturing techniques to develop the precise instrumented wind-tunnel and laboratory models and fixtures. Information and new knowledge must be transparent and allowed to flow freely and be shared by everyone within the organization.

NACA and early NASA had three research laboratories/centers, which created a nexus of cutting-edge theoretical and experimental research that supported the growth and development of world-class engineers and scientists throughout the agency. As long as such a research core of SMEs is allowed to form dense networks with reach throughout the entire organization, it is possible for the entire organization to continually learn, solve complex anomalies as they arise, and prevent potential disasters. When NASA decided to reduce funding to the core research centers because research was viewed as an expense instead of a critical investment, the depth and breadth of critical knowledge slowly began to dry up. When the leaders of non-research centers at NASA grew in power and started regarding researchers as “playing in the sandbox,” wasting money, and unable to solve real “hardware” problems, the communication between the nodes in the network slowly stopped, the research branches’ capability throughout the network dried up, and critical understanding of complex problems like ET foam debris impact damage was unavailable to the JSC teams when it was needed. The absence of technical ability, together with enlarged egos, prevented so-called “leaders” from making the correct technical decision, unlike leaders of old like Wernher von Braun, who selected Lunar Orbit Rendezvous (LOR) over his own idea, a decision which ensured the success of the Apollo program.

Post-Columbia, the NESC attempted to fill that void by selecting good engineers and researchers, anointing them with the title of Technical Fellows, and creating “communities of practice” in key knowledge areas like aerodynamics, structures, materials, etc. However, without continuous, independent, objective applied research (as opposed to program-directed research), and by imposing a command organizational structure rather than a flat, networked one, research declined, knowledge sharing remained opaque, and the flow of knowledge to the people who needed it was impeded. The NESC preferred to control the information flow and “filter” it from within its own network of engineers. In addition, the NESC did not have the experience it needed to form high-performing teams (HPTs) trained to solve complex, highly coupled, interdisciplinary problems. Hence, the NESC teams were unable to identify the RCC panel 8R anomaly as a systemic aero-thermal-structural aging issue.

General Leslie R. Groves, military leader of the Manhattan Project, had to learn very quickly that a directive style of leadership and a command-and-control military organizational structure that compartmentalized and stifled communication between scientists working on separate teams would not be successful. Instead, what was needed was the technical expertise and contemplative leadership style of renowned theoretical physicist J. Robert Oppenheimer, a connected, networked system of teams at Los Alamos and throughout the U.S., and the collaboration and sharing of information. Groves had to delegate authority and allow more flexibility to solve a problem as complex as the development of the atomic bomb18.

In a true research culture, competing theories and hypotheses are allowed to be explored, discussed, and evaluated with a common set of standards and requirements to achieve common goals or missions. For example, the fibrous ceramic thermal protection systems (TPS) proposed by NASA Ames researchers for reusable spacecraft should be evaluated alongside ceramic and/or metallic TPS concepts developed at Glenn and Langley Research Centers against standards that meet reliability and reusability goals. The teams at each center, while competing, should share and evaluate ideas and collaborate (e.g., competitive collaboration17 or co-opetition) to create advanced concepts and solutions. The research cannot be “directed” from a command structure but must grow organically, as the pieces of the puzzle are understood and come together and new knowledge is gained in a building-block approach. Instead, when budgets get tight, competing ideas and capabilities are viewed as “redundancies” by senior managers with systems engineering and business degrees; funding and resources are cut; and wind tunnels and laboratories are closed.
Researchers share their work and body of knowledge with other subject-matter experts (SMEs) at conferences hosted by professional societies that recognize and expand the body of knowledge in varied disciplines around the world. These conferences are ideal times to network with the best and brightest from multiple disciplines, cross-pollinate ideas, and form new partnerships. The organization should reward excellence and recognize individual and collaborative efforts with awards and salaries that are commensurate with similar achievements on the programmatic side of the organization (e.g., a true dual ladder). This creates a strong technical structure with authority at the same level as the highest programmatic levels, one that will be respected and will ensure a healthy technical-/production-culture tension.

Building and growing a research culture throughout the organization

To maintain a healthy respect for research and prevent business and program leaders from viewing research as a cost instead of an investment, future project and program leaders have to experience a problem-based/challenge-based program of study whereby they are trained to apply new principles of product development and project management to “real” interdisciplinary challenges which parallel existing programs, with small teams at the project level using advanced analyses and product development strategies. The parallel project team examples in Chapter 7 have been proven to rapidly uncover and solve critical program issues and inform major program choices for very little funding and an orders-of-magnitude return on investment (ROI).

For example, while researching elements of the Constellation Program, which was initiated in 2005, it became apparent that the Orion capsule and its launch abort system (LAS) were severely overweight, and the program was desperate to find solutions. An informal independent study, led by me and Steve Scotti, indicated a new LAS/capsule design could save considerable mass to orbit. When we presented our idea to the NASA and Lockheed program leads, it became apparent that the program was facing another serious problem for which they had no solution, one they were reluctant to publicize and kept hidden (opaque communication). The problem, it was determined, was that the aeroacoustic loads of the baseline Ares I design grossly exceeded structural limits. In fact, the loads were so excessive that the U.S. did not even have an acoustic test facility to certify the capsule for launch. That is correct: The Constellation Program manager was proceeding at full speed with their design, knowing that it had a major acoustic problem and no solution. The bridge ahead was out; nevertheless, the engineer and conductor were throttling at full speed to meet schedule and budget demands.

When it was clear that our proposed idea to save mass could solve both problems, the program managers gave us the go-ahead to mature our idea further. After our first meeting with the NASA and Lockheed program managers for the Orion capsule and its LAS design, Ralph Roe wanted us to follow the strict direction of the program managers as to what was needed to understand and solve the problem. I refused. I knew that, left to non-technical program managers and engineers, we would fail, so I proceeded to give our team the flexibility it needed. We were allowed to develop a small research team of 15 people from LaRC and Glenn to investigate the design of a new LAS we came to call the alternate launch abort system (ALAS). The ALAS design study, which included some of the first-ever wind-tunnel experimental results for the capsule/LAS configuration, produced a concept that reduced acoustic loads; yielded a LAS/capsule design that reduced mass and made it possible to launch an additional 1,200 lbs. to orbit; expanded the research body of knowledge for predicting acoustic loads for launch vehicles; and resulted in a patent for a LAS design that is currently being used for NASA’s Artemis Program and the Space Launch System (SLS)19-21. The ALAS design team worked in parallel with the Orion Program Office on a non-interference basis, rapidly solved the acoustic problem, and was awarded an NESC Group Achievement Award.

This small team could have served as an excellent project for hands-on learning and training for future program managers, engineers, and researchers to learn the critical skills they would need to transform and sustain a high-performance organizational culture. It would instill an understanding of the value that a research culture can provide to a program. Instead of placing a flight director without the proper training and mentorship in charge of a research and development project, once again I chose Steve Scotti, an FoC who had the correct research expertise and leadership style. When I later attempted to counsel the Constellation Program manager, Jeff Hanley, and offered my assistance by providing a highly qualified chief engineer, Steve Scotti, whom I had enlisted for a one-year detail assignment from Langley, I was rebuked and later called to NASA HQ in DC to be scolded by my administrator for supposedly “criticizing” the Constellation PM. Months later, Mike Griffin appointed Brian Muirhead of JPL to be Jeff Hanley’s chief engineer. It was too late, however, and the program was eventually canceled.
You cannot train someone who has never worked as an engineer to recognize what they do not know and what help they need by having them take week-long training programs at NASA’s facility at Wallops Island. A healthy learning organization realizes the need for proper mentorship and hands-on training, starting with smaller projects, and knows how to grow those individuals to lead much larger and more complex programs, similar to how the early Apollo program managers had been trained at Langley.

Building a learning organization – training, mentorship, and HPT team building: Every high-performance learning organization should have a thirst for knowledge and maintain continuous learning as one of its core values. The development and advancement of the knowledge, skills, and abilities (KSAs) of every employee is essential to ensure competence across all domains to accomplish the mission and objectives of the organization. For example, the research teams within the organization should allow and support continued learning and career development: advanced degrees, online training/education, participation in professional conferences, quality laboratory facilities and resources, and senior mentorship in building multidisciplinary teams to solve complex, highly coupled technical problems. The organization should also provide shared knowledge management systems and resources in which in-house expertise is collected, curated, stored, and easily accessed via online tools that use AI for enhanced search of information and of SME contacts both inside and outside the organization.

Transforming Education and Complex Problem Solving

What I realized post-Columbia was that something was lacking in the education and mentorship of our engineers at NASA if they could somehow miss the inadequacy of the Crater program’s impact predictions. As NASA’s Senior Advisor for Innovation and Engineering Development, I started researching learning and training within NASA and throughout the U.S. to understand the root cause and to learn how we could transform engineering education to ensure the correct skills were being stressed at NASA and in higher education institutions throughout the country. I developed a short course for NASA engineers on innovative engineering design as part of the NESC Academy and was allowed to work on a two-year detail at the Polytechnic Institute of New York (PINY) (formerly my alma mater, Brooklyn Polytechnic, and now the Tandon School of Engineering following its merger with NYU) as Distinguished Engineer in Residence to help infuse innovation and creativity into the engineering curriculum there. I developed a hands-on course for engineering students to design and build innovative solutions to complex problems, what I called “epic challenges,” problems that even NASA was struggling to solve. This idea came to me after the work Don Pettit and I did exploring solutions for the on-orbit repair of an RCC wing leading edge for the shuttle in his garage/laboratory. The pedagogy I developed with professors from MIT, Georgia Tech, and Penn State we called Innovative Conceptual Engineering Design (ICED)22,23. The course was a huge success for undergraduate engineering and high school students in New York/New Jersey, and we later developed a program for students and educators in the U.S., Finland, and Australia.

I researched many of the innovative ideas for using technology as part of the learning process24,25 and connected with some amazing innovators: people building learning experience platforms, like Valamis in Finland (Janne Hietala)26; people developing learning and competency management systems in industry, like iQ4 (Frank Cicio); Boeing Chief Learning Officer (CLO) Dr. Michael Richey27-29, who had developed one of the best two-semester university design-build-fly capstone programs, called AerosPACE, and who was developing online education programs with MIT and edX for Boeing employees and students around the world; and PINY Chief Information Officer (CIO) Peter Morales. These four people, Janne, Mike, Peter, and Frank, helped shape my thoughts on how to transform learning and create a collaborative virtual platform/environment for innovation, education, and complex problem-solving called the “Collaboratory.” Dr. Michael Richey and I traveled to Boston in 2014 to give keynote addresses on the transformation of education at the Siemens PLM Analysts Meeting30 (https://www.youtube.com/watch?v=xKSpi6dlMQ0). Many of the ideas we proposed in 2014 are still relevant and in use today.

Over the next five years, we experimented with elements that we were developing and maturing at various levels and thought about how we could put those pieces together to rapidly solve multiple problems simultaneously: how to search for the correct SMEs with just the right knowledge needed to solve complex, highly coupled engineering problems (epic challenges); how to take those experts and develop high-performing teams (HPTs); how to link networks of such HPTs with the curated information they would need to accelerate learning; and how to use social network analysis to monitor knowledge capture, the flow of information, and the technical and cultural behaviors to identify “weak signals” of dysfunction and correct them in real time. In 2019, we all traveled to NASA HQ to suggest a strategy that could help transform the culture of NASA and companies like Boeing and prevent recurring disasters.
The Collaboratory – A Collaborative Environment for Innovation, Education, and Complex Problem Solving

I coined the term Collaboratory in 2011 and defined it to be “a virtual environment which links the collective intelligence and creativity of the ‘crowds’ with subject-matter-experts to effectively collaborate, learn, and share resources to resolve complex, coupled, multi-disciplinary problems in engineering design by simultaneously maturing numerous, diverse, innovative solutions.” When I met Mike Richey and Janne Hietala in 2014, we all agreed that efficiently finding the right learning material, data, and key SMEs to help us solve critical problems was essential. With the information age came an exponential rise in available knowledge, and searching for relevant information became very cumbersome (Figure 1). Wasting time reading irrelevant information or trying to find just the right SME could easily consume a valuable percentage of a day’s work. The search can be further compounded when looking for just the right person, because the wrong SME can spin your wheels by not being forthright and admitting upfront that they are not the right expert, and/or by sending you off to others, like themselves, who waste your time further. Think about how your search could have turned out if you were directed to Cal Schomburg as the “TPS expert” during STS-107 and trusted him as having an understanding of the impact resistance of RCC wing leading edges. Would any of the TPS or LESS-PRT team members ever have pointed to the researchers who truly understood ballistic impact to structures, like Ed Fasanella and his team at LaRC? Hence, having a trusted network like the FoCs enables one to find just the right SME, one who will help answer the question and/or quickly admit they are not the correct person; and one can be assured that, because of their integrity and dense technical network, they can immediately point one to the correct individuals, like Dr. Peter Gnoffo and Dr. Tom Horvath. In social network analysis, this is called a “small-world” network, which rapidly links the right people together with very limited traverses through a multitude of extraneous node-network connections31.
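The small-world effect is easy to see computationally. The sketch below is my own toy illustration, not any NASA tool; it assumes the standard Python networkx library and its Watts-Strogatz generator. A purely siloed organization needs many hops to reach the right person, while a few trusted cross-cutting ties, like those an FoC-style network provides, collapse that distance:

```python
# Toy illustration of the small-world effect: a few cross-cutting
# ties (an FoC-like network) drastically shorten the path to an SME.
import networkx as nx

N, K = 1000, 10  # 1,000 people, each knowing ~10 immediate colleagues

# Purely siloed: a ring lattice where everyone knows only neighbors.
siloed = nx.connected_watts_strogatz_graph(N, K, p=0.0, seed=42)

# Same network with 5% of ties rewired to distant "trusted" contacts.
small_world = nx.connected_watts_strogatz_graph(N, K, p=0.05, seed=42)

for name, g in [("siloed", siloed), ("small-world", small_world)]:
    print(f"{name:12s} avg. hops between any two people: "
          f"{nx.average_shortest_path_length(g):.1f}")

# The siloed lattice averages roughly N/(2K) = 50 hops; rewiring just
# 5% of ties drops this to a handful -- the small-world effect.
```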
Figure 1. – The exponential rate of knowledge growth.
Searching for just the right information: To solve the first problem, rapidly finding relevant knowledge, I reached out to a super-intelligent and highly skilled computer scientist and self-learner, Janne Hietala, who was then Chief Operating Officer of Valamis, one of the most innovative learning experience platform companies in Finland. We selected a problem that a NASA researcher was struggling with at the time, related to the curation of JSC’s expertise and knowledge from over 50 years of spacesuit design and development at NASA (Figure 2)32. It becomes very easy to see that the precise information you may need, even in this local area of expertise or community of practice (CoP) within NASA, is very difficult to retrieve: it is siloed in different formats in various locations and organizations with various limitations on access; it is difficult to curate and validate as to accuracy and authenticity; and it is limited by conventional searching due to specific contexts. Looking at one aspect of the data, 77 public video lectures by retiring experts on spacesuit design, development, and lessons learned, and working under a small contract with Janne Hietala and Valamis, we were able to show how, using artificial intelligence (AI), we could transcribe and curate all 77 video lectures and make them easily searchable using machine learning (ML) and a little assistance from a spacesuit SME33. Imagine looking for just the right expert discussing the topic of water-cooled garments or environmental control of a spacesuit and having to find that knowledge in 77 one-hour videos. How long would that take? Hence, training employees in what are called “fusion skills,”34 those that enable them to work effectively at the human-machine interface, will be important for a successful digital transformation of the workforce. In most AI ventures, humans are needed to assist in training the algorithms. Rather than expecting research engineering teams to all learn and become proficient in AI, cognitive agents, chatbots, for example, could be used to unobtrusively ping them with questions to assist AI learning during an HPT solution of a complex problem or an assessment of team health and performance.
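As a much-simplified sketch of the retrieval step in such a proof of concept (the transcripts, identifiers, and query below are hypothetical placeholders, and the real Valamis work used far richer AI/ML curation33), standard text-retrieval tooling can already make a pile of transcribed lectures searchable:

```python
# Simplified sketch of making curated lecture transcripts searchable.
# Only the retrieval step is shown; the placeholder text is invented.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.metrics.pairwise import cosine_similarity

# In practice: 77 auto-transcribed, chunked, meta-tagged lectures.
transcripts = {
    "lecture_12": "liquid cooling garment design for EVA suits ...",
    "lecture_31": "portable life support system CO2 scrubbing ...",
    "lecture_54": "environmental control, water-cooled garments ...",
}

vectorizer = TfidfVectorizer(stop_words="english")
doc_matrix = vectorizer.fit_transform(transcripts.values())

def search(query: str, top_n: int = 3):
    """Rank lectures by cosine similarity to the query."""
    q_vec = vectorizer.transform([query])
    scores = cosine_similarity(q_vec, doc_matrix).ravel()
    ranked = sorted(zip(transcripts.keys(), scores),
                    key=lambda kv: kv[1], reverse=True)
    return ranked[:top_n]

# Instead of watching 77 hours of video, ask directly:
print(search("water-cooled garment environmental control"))
```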
Figure 2. – The problem of finding relevant knowledge: U.S. spacesuit knowledge capture example32.
Based on conservative estimates of an average worker spending 8.8 hours a week searching for information, an employee workforce of 40,000, a $100/hour average salary, and targeting only a 10% decrease in search time, it was predicted that such a capability could save an organization like NASA or Boeing more than $112 million per month33!

University programs to mimic engineering design challenges and understand team performance: Dr. Michael Richey and I met in 2012. We were both working to develop interesting engineering educational programs for students, using slightly different approaches, while deeply interested in how students learn individually (student-centric learning), but also in how diverse teams of students work together and how they access, share, and create new knowledge; make decisions; and solve problems. Mike’s Boeing program, called AerosPACE (Aerospace Partners for the Advancement of Collaborative Engineering), followed a more traditional systems engineering (SE) process and product development life cycle modeled after NASA’s, as taken from the NASA Systems Engineering Handbook35. The program begins with a three-day kickoff where teams are selected (based on the results of pre-administered surveys), instructed on the basics of the program and the Collaboratory, introduced to SE principles, and set to work solving a representative design-build-fly aeronautics challenge as an icebreaker. Seven teams of ten students from multiple universities are selected to mimic the geographic dispersion of global multinational companies. The students undergo a curriculum of online instruction, interact with various learning modalities (video lectures, online content, pre-/post-survey questions, individual and team assignments, etc.), and are reviewed by faculty mentors, peers, and external Advisory Board SMEs at select product development milestones throughout the two-semester program (e.g., Preliminary Design Review (PDR), Critical Design Review (CDR), etc.). The program concludes with a supervised fly-off of the students’ autonomous unmanned aerial vehicles (UAVs) at a selected airfield. The program follows a problem-based learning (PBL) philosophy, where elements of learning and skills development are sequenced to align with actual analyze-design-build activities so there is a direct connection between understanding knowledge and skills and applying those skills on a real problem/project. Boeing’s AerosPACE program tracked and collected information regarding students’ interactions with teammates, SMEs, faculty, and online educational content as they progressed through the multi-phase product development process (figure 3).
Figure 3. – Network connections of students, SMEs, and online educational content during Boeing’s design-build-fly AerosPACE university capstone program.
At around the same time the AerosPACE program was exploring online learning and problem-based learning with teams of students around the country, I started the Epic Challenge program. It began as an experiment with 30 junior NASA engineers attempting to solve the land landing of NASA’s Orion capsule (a challenge which still eludes NASA to this day), and it used challenge-based learning and a different pedagogy, called ICED22,23, which infused innovation into a rapid concept development (RCD)36 scenario using set-based design practices. To prove the idea would work, I refused to listen to faculty who wanted me to select a problem that was solvable, one for which we already had an answer. Instead, I selected the land landing of a space capsule, a problem whose solution had eluded NASA for over 50 years. The very first experiment was a success: one of the solution concepts proposed by the NASA teams during the one-week workshop was handed off to an MIT graduate student and developed into an advanced design for an in-capsule airbag/seat that would ensure a safe landing for a crewed Orion capsule under parachute descent, save a total of 265 lbs. (36%), and increase on-orbit habitable volume by 26% (figure 4)37. This experiment proved that, using an open innovation forum and very little help/cost ($130,000 and 0.03 full-time equivalents (FTEs)), a team of students could innovatively solve complex problems that a sponsoring organization like NASA struggled to solve. Imagine the ideas and concepts that could be rapidly matured and harvested simultaneously by sponsoring organizations in need of innovative solutions to complex problems.
Figure 4. – Results of the first Epic Challenge: design of a land landing capability for the Orion capsule22,23,37.
The Epic Challenge Program grew to reach thousands of high school and university students in the U.S., Finland, and Australia. Throughout the development of AerosPACE and Epic, Mike and I experimented with learning platforms (e.g., Blackboard, Canvas, edX, etc.); pedagogy (problem-based, phenomenon-based, and challenge-based learning); virtual/online, hybrid, and conventional/classroom methods; collaborative applications (e.g., Slack, Yellowdig, etc.); team selection/optimization; and data analysis formats. To demonstrate the effectiveness of a collaborative platform as an organizational learning and team assessment tool, we selected the AerosPACE program, because it mimicked the engineering design and product development strategies of conventional aerospace companies; the Valamis Learning Experience platform as our dashboard and learning analytics platform (figure 5); and Slack as our collaboration and social networking means of communication and information transfer (figure 6).
Figure 5. – Valamis Learning Experience Platform dashboard for AerosPACE students.
We could track the students as they formed Slack channels to address various aspects of the UAV design process and product development, such as aeronautics, structures, materials, guidance, navigation, and control. The learning analytics would track student trajectories through the course material and peer-to-peer knowledge sharing, learning, and decision-making.
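Mechanically, this kind of tracking can be as simple as turning message logs into a growing graph. The sketch below is a toy reconstruction (the log format and names are invented; real Slack exports and our actual analytics were far richer) of how network formation over time, like that shown in figure 6, can be computed:

```python
# Toy reconstruction of "network formation over time" from a
# Slack-style message log (the log here is invented for illustration).
import networkx as nx

# (week, channel, sender, repliers) -- a stand-in for exported messages.
log = [
    (1, "aero",       "ana", ["ben"]),
    (1, "structures", "cho", ["dee", "ana"]),
    (2, "aero",       "ben", ["ana", "eli"]),
    (2, "gnc",        "eli", ["cho"]),
    (3, "aero",       "ana", ["ben", "dee", "eli"]),
]

last_week = max(week for week, *_ in log)
snapshots = {w: nx.Graph() for w in range(1, last_week + 1)}

# Ties persist once formed, so each weekly snapshot is cumulative.
for week, channel, sender, repliers in log:
    for w in range(week, last_week + 1):
        for replier in repliers:
            snapshots[w].add_edge(sender, replier, channel=channel)

for week, g in snapshots.items():
    print(f"week {week}: {g.number_of_nodes()} people, "
          f"{g.number_of_edges()} ties, density {nx.density(g):.2f}")
```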
Figure 6. – Slack channel and network formation as a function of time during the AerosPACE student program.
We surveyed students and developed algorithms to select teams (using ideas mentioned in Chapter 6 for developing high-performing teams (HPTs)), and we analyzed the effectiveness of each team using the parameters energy, engagement, and exploration, mentioned in Chapter 5, to gauge team performance prior to the fly-off competition38,39. The results of the two highest-scoring teams, Team 3 and Team 7, are shown in figure 7.
Figure 7. – Evaluation of AerosPACE teams using Sandy Pentland’s parameters: energy, engagement, and exploration38,39.
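Pentland’s three parameters reduce to simple counting once communication events are logged. The functions below are my own rough renderings of them (Pentland’s sociometric badges captured much more than message counts38,39, so treat these formulas as illustrative approximations): energy as events per member, engagement as how evenly that energy is spread, and exploration as the share of communication crossing the team boundary.

```python
# Approximate renderings of Pentland's team parameters (refs. 38, 39)
# from a simple event log of (sender, receiver) communication pairs.
# Invented data; real measurements used sociometric badges.

team = {"ana", "ben", "cho", "dee"}
events = [("ana", "ben"), ("ben", "ana"), ("cho", "ana"),
          ("dee", "ben"), ("ana", "mentor"), ("cho", "sme_larc")]

def energy(events, team):
    """Average communication events per team member."""
    n = sum(1 for s, r in events if s in team)
    return n / len(team)

def engagement(events, team):
    """How evenly energy is distributed (1.0 = perfectly even)."""
    counts = [sum(1 for s, _ in events if s == m) for m in team]
    mean = sum(counts) / len(counts)
    spread = max(counts) - min(counts)
    return 1.0 - spread / (2 * mean) if mean else 0.0

def exploration(events, team):
    """Fraction of communication reaching outside the team."""
    n_out = sum(1 for s, r in events if s in team and r not in team)
    return n_out / max(1, sum(1 for s, _ in events if s in team))

print(f"energy={energy(events, team):.2f}  "
      f"engagement={engagement(events, team):.2f}  "
      f"exploration={exploration(events, team):.2f}")
```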
Team 3 edged out Team 7 slightly with regard to Prof. Sandy Pentland’s parameters for measuring team energy (a measure of individual team member participation), engagement (the distribution of team energy equally among team members), and exploration (the extent to which a team is willing to reach outside itself to get additional information)38,39. What was amazing was that, when it came time to evaluate the teams’ UAV concepts during the fly-off, Team 3 did exceptionally well with their unique vertical takeoff and landing (VTOL) aircraft, even though it was more complex and their faculty advisor had advised against it. It was also obvious that one of the other teams was struggling with psychological safety and did not trust their faculty advisor. When directed by their advisor to include him in all Slack channel discussions, the team members quickly formed a student-only private channel. Hence, using even relatively crude measurements to assess team performance gives you some insight into behaviors. Imagine now if we could use trained sociologists and behavioral scientists to work with computer scientists and AI/ML experts to develop and train algorithms to identify team attributes such as psychological safety, cohesion, resilience, critical thinking, etc., and to monitor team communication and knowledge ingestion during key technical decisions.

The Collaboratory: The Collaboratory was conceived as multiple interconnected components that address the critical functions necessary to create an effective and engaging virtual environment, one that can digitally transform an entire organization into a premier learning organization, with the goal of creating a high-performance organization (HPO) that can monitor culture and team performance at granular levels, identify “weak signals” of technical and behavioral dysfunction, and rapidly create teams to address anomalies prior to disasters. Some of the key components of such an environment are described below and shown schematically in figure 8:

1. Knowledge Management System – to collect, curate, and store content (documents, videos, lectures, links, etc.) for use by collaborators.
2. Learning Experience System – to create individual lessons using online, open-source interactive tools that auto-transcribe lectures and chunk and meta-tag them for ease of searching; link reference material; allow collaboration with educators, mentors, and SMEs; embed evaluation and testing; enable automatic alignment with the Engineering Competency Model (ECM), ABET, CDIO, NGSS, and other educational, training, and professional development standards; and link to a Learning Records Store to conduct learning data analytics.
3. Matching Engine and Profiling System – to survey, reference, and allow rapid search of the skills, technical expertise, and resources (laboratory, test, material, etc.) of all participating individuals, teams, and organizations, and to develop validated skills assessments, badges/certificates, etc. This could expand upon existing but outdated competency management systems currently in use within the organization (a minimal sketch of the matching idea follows figure 8).
4. Modeling and Simulation Tools – to access and integrate key modeling and simulation software and tools to analyze, design, prototype, and manufacture ideas/concepts; tools that incorporate computer-aided engineering analysis, design, manufacturing, life-cycle management, and model-based systems engineering (MBSE).
5. Online/Hybrid Course Development – tools for developing massive open online courses (MOOCs) and massive open online projects (MOOPs), or Epic Challenges40, utilizing the latest technical advances in course delivery; tools for easily creating and uploading learning objects; learning analytics, data visualization, AI/ML, cognitive search and recommender agents (chatbots), gamification, etc.
6. Miscellaneous Tools – for network analysis, participant surveys, data analysis, Big Data collection research, data visualization, metrics, gamification of learning, and ideation.
Figure 8. – The Collaboratory, a collaborative environment for innovation, education, and complex problem-solving.
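As promised above, here is a minimal sketch of the Matching Engine idea in component 3 (the profiles and skill tags are invented for illustration; a real system would draw on validated assessments, publications, and usage data): rank candidate SMEs by the overlap between the skills a problem needs and the skills each profile offers.

```python
# Minimal sketch of the Matching Engine idea (component 3 above):
# rank candidates by overlap between the skills a problem needs and
# the skills each profile offers. Profiles here are invented.

profiles = {
    "aerothermal_sme": {"aerothermodynamics", "cfd", "entry physics"},
    "impact_sme":      {"ballistic impact", "crashworthiness", "ls-dyna"},
    "structures_sme":  {"hot structures", "launch abort", "aeroacoustics"},
}

def match(needed: set[str], top_n: int = 2) -> list[tuple[str, float]]:
    """Jaccard similarity between needed skills and each profile."""
    def score(offered: set[str]) -> float:
        return len(needed & offered) / len(needed | offered)
    ranked = sorted(profiles.items(), key=lambda kv: score(kv[1]),
                    reverse=True)
    return [(name, round(score(skills), 2))
            for name, skills in ranked[:top_n]]

# Who should look at a debris-impact anomaly?
print(match({"ballistic impact", "ls-dyna"}))
# -> [('impact_sme', 0.67), ...]
```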
In our three-day workshop at NASA HQ in 2019, our team presented a strategy for transforming the workforce and culture at NASA that could be used in any similar organization, like Boeing. We had two Boeing senior leaders in attendance: Dr. Michael Richey (Boeing Tech Fellow and Chief Learning Officer) and Marc Nance, a Boeing Defense Vice President who had twice invited me to give Boeing leadership lectures on the importance of culture and with whom I had worked to help develop advanced training for Boeing employees. Together with Frank Cicio (CEO of iQ4) and Janne Hietala (COO of Valamis), our team presented our strategy for using technology, together with a carefully designed building-block approach, to team universities, industry, NASA, and other government agencies to help develop the necessary tools and methods, verified on a distributed network of test cases and real-world challenges facing the agency, to simultaneously educate NASA employees on advanced problem-solving and product development challenges (epic challenges), mentor future project and program managers, and incorporate monitoring capabilities to identify technical and cultural “weak signals” and team issues. The product we left NASA was a 60-page how-to white paper with associated costs, based on our prior ten years of experimental work with known outcomes41. I also left an executive summary and some strategic bullet points noting that a failure to transform NASA could result in us losing the next space race to China, which would have a “reverse Apollo effect,” destroy the technical dominance and entrepreneurship within our country, and negatively affect our economy. When we did not even get a response back from NASA, I knew it was time to retire.

Since my retirement, we have witnessed a world leader in commercial aviation, Boeing, imploding following the 737-MAX disasters, and NASA struggling with major delays, cost overruns, and critical technical issues in its attempt to return to the Moon with the Artemis program. It remains to be seen whether NASA and Boeing can correct course, recover from serious cultural issues, and return to being the storied technical and commercial leaders they once were. Yet I remain persistent and ever hopeful in my belief that it is within our power to do so!
References
1. Anon: “Komite Nasional Keselamatan Transportasi Republic of Indonesia: Aircraft Accident Investigation Report PT. Lion Mentari Airlines Boeing 737-8 (MAX); PK-LQP Tanjung Karawang, West Java, Republic of Indonesia, 29 October 2018.” October 2019.
2. Anon: “The Federal Democratic Republic of Ethiopia Ministry of Transport and Logistics: Aircraft Accident Investigation Bureau Investigation Report on Accident to the B737-MAX8 Reg. ET-AVJ Operated by Ethiopian Airlines 10 March 2019.” Report No. AI-01/19, Dec. 23, 2022.
3. DeFazio, Peter A., and Larsen, Rick: “Final Committee Report – The Design, Development & Certification of the Boeing 737 MAX.” House Committee on Transportation and Infrastructure, September 2020.
4. Anon: “Summary of the FAA’s Review of the Boeing 737 MAX – Return to Service of the 737 MAX Aircraft.” November 18, 2020.
5. Herkert, Joseph; Borenstein, Jason; and Miller, Keith: “The Boeing 737 MAX: Lessons for Engineering Ethics.” Science and Engineering Ethics, 2020.
6. Zweifel, Thomas D., and Vyas, Vip: “Crash: Boeing and the Power of Culture.” Journal of Intercultural Management and Ethics, Issue No. 4, 2021.
7. Barbaro, M.: “The Whistle-Blowers at Boeing.” The NY Times, April 23, 2019. https://www.nytimes.com/2019/04/23/podcasts/the-daily/boeing-dreamliner-charleston.html
8. PBS Frontline Documentary: “Boeing’s Fatal Flaw.” Sept. 15, 2021. https://www.bing.com/search?q=frontline%20737%20MAX&FORM=ARPSEC&PC=ARPL&PTAG=5515
9. Afifian, Andrew: “Is Boeing Prioritizing DEI Over Safety.” The Dallas Express, March 8, 2024.
10. Sullenberger, S.: “My Letter to the Editor of the NY Times Magazine.” October 13, 2019. https://www.sullysullenberger.com/my-letter-to-the-editor-of-new-york-times-magazine/
11. Gehman, H. W., et al.: “Columbia Accident Investigation Board.” Report Volume 1, U.S. Government Printing Office, Washington, D.C., August 2003. http://www.nasa.gov/columbia/home/CAIB_Vol1.html
12. Kika, Thomas: “Boeing Plane Incidents Timeline: Full List of 9 Issues in Three Months.” Newsweek, March 25, 2024.
13. Anon: “Deepwater Horizon Accident Investigation Report.” September 8, 2010.
14. PBS Frontline Documentary: “The Deepwater Horizon Oil Spill in the Gulf of Mexico.” Oct. 26, 2010. https://www.youtube.com/watch?v=NzrGZCJojUE
15. Haas Edersheim, Elizabeth: “The BP Culture’s Role in the Gulf Oil Crisis.” Harvard Business Review, June 8, 2010.
16. Collins, Jim, and Porras, Jerry I.: “Built to Last: Successful Habits of Visionary Companies.” HarperBusiness, 1994.
17. Barthelemy, Bart: “The Sky Is Not the Limit – Breakthrough Leadership.” St. Lucie Press, 1997.
18. Bennis, Warren, and Biederman, Patricia Ward: “Organizing Genius: The Secrets of Creative Collaboration.” Basic Books/The Perseus Publishing Group, 1997.
19. Scotti, Stephen J.: “Orion Alternate Launch Abort System Follow-On Technical Assessment.” NESC-RP-06-08, Vol. II Appendices, May 12, 2011.
20. Camarda, Charles J., et al.: “Multi-Functional Annular Fairing for Coupling Launch Abort Motor to Space Vehicle.” United States Patent US 2008/0265099 A1, Oct. 30, 2008.
21. Panda, J.; Burnside, Nathan J.; Bauer, Steven X. S.; Scotti, Stephen J.; Ross, James C.; and Schuster, David M.: “A Comparative Study of External Pressure Fluctuations on Various Configurations of Launch Abort System and Crew Exploration Vehicle.” 15th AIAA/CEAS Aeroacoustics Conference (30th AIAA Aeroacoustics Conference), Paper AIAA-2009-3322, May 2009.
22. Camarda, Charles J.; Bilen, Sven; de Weck, Olivier; Yen, Jeannette; and Matson, J.: “Innovative Conceptual Engineering Design – A Template to Teach Problem Solving of Complex Multidisciplinary Problems.” ASEE Annual Conference, Louisville, Kentucky, 2010.
23. Camarda, Charles J.; de Weck, Olivier; and Do, Sydney: “Innovative Conceptual Engineering Design (ICED): Creativity and Innovation in a CDIO-Like Curriculum.” Proceedings of the 9th International CDIO Conference, Massachusetts Institute of Technology and Harvard University School of Engineering and Applied Sciences, Cambridge, Massachusetts, June 9-13, 2013.
24. Christensen, Clayton M.; Horn, Michael B.; and Johnson, Curtis W.: “Disrupting Class: How Disruptive Innovation Will Change the Way the World Learns.” McGraw Hill, 2008.
25. Christensen, Clayton M., and Eyring, Henry J.: “The Innovative University: Changing the DNA of Higher Education from the Inside Out.” Jossey-Bass, 2011.
26. Camarda, Charles J., and Hietala, Janne: “The Digital Transformation of Learning: Why We Need Another Apollo Effect.” December 2016.
27. Richey, Michael; Zender, Fabian; and Camarda, Charles J.: “Engineering the Future Work Force by a Global Engineering Industry.” 122nd ASEE Annual Conference and Exposition, June 14-17, 2015, Seattle, Washington.
28. Zender, F., et al.: “Aerospace Partners for the Advancement of Collaborative Engineering (AerosPACE) – Connecting Industry and Academia through a Novel Capstone Course.” International Conference for e-Learning in the Workplace, New York, NY, 2014.
29. Zender, F., et al.: “Wing Design as a Symphony of Geographically Dispersed, Multidisciplinary, Undergraduate Students.” 54th AIAA/ASME/ASCE/AHS/ASC Structures, Structural Dynamics, and Materials Conference, Boston, MA, 2013.
30. Camarda, Charles J., and Richey, Michael: “Transforming Education.” Keynote addresses at the Siemens PLM Analysts Meeting, Boston, 2014. https://www.youtube.com/watch?v=xKSpi6dlMQ0
31. Larson, Andreas: “Engineering Know-Who – Why Social Connectedness Matters.” Doctoral thesis, Luleå University of Technology, Department of Applied Physics and Mechanical Engineering, Division of Computer Aided Design, 2005.
32. Chullen, Cinda, et al.: “U.S. Spacesuit Knowledge Capture.” 41st International Conference on Environmental Systems, 17-21 July 2011, Portland, Oregon.
33. Hietala, Janne: “Knowledge Discovery with AI based on Dark Data: NASA LaRC Knowledge Capture AI/ML Proof of Concept.” 2017.
34. Wilson, James H., and Daugherty, Paul R.: “Collaborative Intelligence: Humans and AI Are Joining Forces.” Harvard Business Review, July-August 2018.
35. NASA Systems Engineering Handbook. NASA SP-2007-6105 Rev 1, December 2007.
36. Camarda, Charles J.; Scotti, Stephen; Kunttu, Iivari; and Perttula, Antti: “Rapid Learning and Knowledge-Gap Closure During the Conceptual Design Phase – Rapid R&D.” Technology Innovation Management Review, Volume 10, Issue 3, March 2020.
37. Do, Sydney, and de Weck, Olivier: “A Personal Airbag System for the Orion Crew Exploration Vehicle.” Acta Astronautica 81, 2012.
38. Pentland, Alex “Sandy”: “The New Science of Building Great Teams.” Harvard Business Review, April 2012.
39. Pentland, Alex: “Social Physics – How Social Networks Can Make Us Smarter.” Penguin Books, 2014.
40. The Epic Education Foundation and the Epic Challenge Program. https://epiceducationfoundation.org/
41. Camarda, Charles J.: “FoC Moon-to-Mars Strategy Plan to Transform NASA.” 2019. https://docs.google.com/document/d/1O76JfD_bkvpAXhL7Y5RaE7E1ZCBGVMAF/edit?usp=sharing&ouid=117977985248994911042&rtpof=true&sd=true
Epilogue

What we are witnessing today, following the Boeing commercial 737-MAX disasters, are setbacks in Boeing’s commercial launch business to develop a capability to send crews to the International Space Station (ISS). Boeing’s concept, one of two funded by NASA, uses its Starliner capsule atop an Atlas V rocket. The other company funded by NASA, SpaceX, uses its Dragon capsule atop its Falcon 9 rocket. Having two concepts would thus give the U.S. a redundant capability to launch crews to low Earth orbit (LEO) and dock with the ISS. As of this writing, the SpaceX Crew Dragon has successfully launched 11 crews to the ISS (8 astronaut and 3 commercial crews from May 2020 to March 2024). The first Starliner crewed test flight (CFT-1) launched June 5, 2024, after many delays and technical setbacks, and resulted in the crew being stranded on the ISS because of vehicle helium leaks and multiple reaction control system (RCS) thruster failures prior to docking. Instead of an 8-day mission, the Starliner crew will be stranded in space for over 8 months and will have to be rescued by a SpaceX Crew-9 vehicle, which is scheduled to return the crew in February 2025. This could be another devastating blow to an already struggling international corporation trying to regain its storied past glory.

Instead of preempting NASA and recommending that the Starliner return uncrewed, Boeing’s new CEO, Kelly Ortberg, allowed NASA to make that final decision on August 24, 2024, insisting that it was Boeing’s position that the Starliner vehicle was safe to return with the crew onboard. In my opinion, Mr. Ortberg missed an opportunity to send a message to all Boeing employees and the world that technical expertise, knowledge, and safety were once again the principal values, and that he recognized the difficulties ahead in changing the organizational culture of the corporation. NASA, on the other hand, made the correct decision and showed signs of movement in the right direction regarding its own culture change. The root cause of the Starliner thruster failures was not well understood by analysis and testing, and so there was no definitive reason to believe a mishap would not be imminent on Starliner’s return to Earth. NASA voted in favor of the safety of the astronauts as opposed to potential schedule delays and additional costs.

NASA, however, has one more very critical decision to make to convince this author that it is heading in the right direction regarding the necessary cultural transformation described in this book. NASA’s new human space exploration program to the Moon, called Artemis, uses a rocket called the Space Launch System (SLS), which is larger than the Saturn V and uses legacy hardware from both the Shuttle and Apollo Programs. The core stage of the launcher consists of liquid cryogenic hydrogen-oxygen propellant tanks of similar design to, but slightly larger than, the shuttle’s; four space shuttle main engines (SSMEs), as opposed to the shuttle’s three; two slightly longer, five-segment (instead of four-segment) solid rocket boosters (SRBs); and a command module slightly larger than the Apollo capsule, with an ablative heat shield similar to what was used on Apollo. SLS has about 15% more thrust than the Saturn V and can take about 6% more mass to the surface of the Moon. The initial flight test to qualify the heat shield (an Apollo-like honeycomb design, hand-filled with AVCOAT ablator using caulk guns), called Exploration Flight Test 1 (EFT-1), was a two-orbit, four-hour test of the Orion crew module with a high-energy return to Earth from a high-apogee orbit. The test verified that the Apollo-like honeycomb-filled ablative heat shield could survive a near-lunar entry heating profile (32,000 km/h (20,000 mph), as compared to an actual lunar entry of about 40,000 km/h (25,000 mph)). The heat shield survived entry with very little damage and/or erosion.

The Artemis Program, however, decided to switch heat shield designs to one that used AVCOAT blocks bonded to a base heat shield structure in lieu of the filled honeycomb design, in an effort to save time and money. It was on the next uncrewed integrated flight test, Artemis I, around the Moon in November 2022 that issues occurred. The new block AVCOAT heat shield design lost numerous large chunks of the ablative heat shield during Earth entry, as opposed to the gradual erosion of the honeycomb-filled ablative surface that the Apollo vehicles experienced. Instead of openly reporting the severity of the problem, NASA managers downplayed it. According to Howard Hu, NASA’s Orion Program Manager, “Some of the expected char material that we would expect coming back home ablated away differently than what our computer models and what our ground testing predicted. So, we had more liberation of the charred material during reentry before we landed than we had expected.”1 In another article, NASA states, “We didn’t expect the small pieces that came off versus being ablated” and “There was a healthy margin remaining of virgin Avcoat and temperature data inside the cabin remained at expected levels, so if the crew were on board they would not have been in danger.”2 The second statement gives a false sense of security: the real root cause of the problem had not been identified or understood, and so, depending on when the material jettisoned from the vehicle, the circumstances could have been very different! NASA was forced to come clean as to the seriousness of the problem in May 2024, when a NASA Office of Inspector General report entitled “NASA’s Readiness for the Artemis II’s Crewed Mission to Lunar Orbit” was published and finally showed the public photographs of the damage to the Artemis I heat shield (see below).3

Is NASA up to its old tricks? Will it fly a crewed Artemis II mission with the exact same heat shield as Artemis I, or will it stand down, fully understand the technical root cause, validate the fix on another uncrewed mission, and solve the problem prior to the first crewed mission? Only time will tell.
References
1. Tingley, Brett: “NASA’s Artemis 1 Orion spacecraft aced moon mission despite heat shield issue.” Space.com, March 7, 2023. https://www.space.com/artemis-1-orion-moon-mission-heat-shield-issue
2. Leonard, David: “NASA still investigating Orion heat shield issues from Artemis 1 moon mission.” Space Insider, April 2024. https://www.space.com/nasa-investigate-orion-heat-shield-artemis-1-mission
3. NASA Office of Inspector General: “NASA’s Readiness for the Artemis II’s Crewed Mission to Lunar Orbit.” Report IG-24-011, May 1, 2024.
Glossary of Terms Acceptable Risk – refers to the level of risk that is considered tolerable or manageable in space missions, given the balance between potential hazards and the objectives of the mission. It was the basis for all technical decision making at NASA, from daily decision making to the formalized final decision process immediately before a launch known as the Flight Readiness Review. Bureaucratic Accountability – control from the top with strict allegiance to hierarchy, procedures, rules, and chain of command. Cognition – is the mental process of knowing or that which comes to be known as through awareness, perception, reasoning, judgment, intuition, etc. Cognitive Biases – natural tendencies or biases that affect what we know or think we know, a stubborn attachment to an existing belief. Cognitive Frame – mental structures – tacit beliefs and assumptions – that simplify and guide participants’ understanding of a complex reality. When we refer to cognitive frames, we refer to meaning overlaid on a situation by human actors. Collaboratory – a virtual environment which links the creative intelligence of the crowds with subject-matter-experts (SMEs) to effectively collaborate, learn, and share resources to solve complex, multidisciplinary engineering design problems by simultaneously maturing numerous, diverse, innovative solutions. Complexity – is the study of systems with many interconnected parts, where the interaction between these parts leads to emergent behavior that cannot be predicted by understanding the parts alone. Complex Systems – are networks of diverse components that interact in such a way that the system as a whole displays emergent properties. Properties or behaviors that arise from the interactions of its components but are not present in the individual components of the system. Complicated system - A complicated systems characterized as having having numerous componentsmany components or parts, but these parts are usually wellorganized and follow clear cause-effect relationships. In complicated systems, the relationships between the components are deterministic, meaning the behavior or performance of the system can often be understood and predicted beforehand (a priori) by breaking it down to its constituent parts. Confirmation Bias – a stubborn attachment to existing beliefs and a tendency to downplay the possibility of ambiguous threats. Making sense of situations automatically and then favoring information that confirms rather than disconfirms initial views (ref. 13). Culture – is a set of solutions produced by a group of people to meet specific problems posed by the situations that they face in common. These solutions become institutionalized, remembered and passed on as the rules, rituals, and values of the group. The behavior patterns, arts, beliefs, institutions and all other products of human work and thought, especially as expressed in a particular community or period. 316
Culture of Production – A culture which deviates from a pure technical culture and has to merge bureaucratic, technical and cost/schedule/efficiency mandates. It is a culture which emphasizes a business-oriented, compartmentalized sub-division of labor and a set of practices and rules to improve efficiency to meet cost and schedule demands. Disqualification Heuristic – is an “ideological mechanism or mind-set that leads experts and decision makers to neglect information that contradicts a conviction – in this case that a socio-technical system is safe. Engineer – An engineer is a professional who applies scientific, mathematical, and practical knowledge to design, develop, build and maintain systems, structures, machines, devices or processes that solve real-world problems. Ethnography/Ethnographer – People who study and try to understand a culture by actually participating in the culture they are studying. Friends-of-Charlie (FoC) Network – is a trusted network of key individuals with the right knowledge and understanding who are forthright and who will candidly tell you what they know and exactly what they do not know and how to best obtain that knowledge. Group or Working Group – is a collection of individuals who are usually independent of each other and has a different set of tasks that are carried out by individuals. Working groups are formed in organizations when information needs to be shared, and decisions need to be made. High-Reliability Organizations (HRO) – HRO’s are organizations that operate in complex, high-risk environments but manage to maintain extremely low rates of accidents and failures. Historical Ethnography/Ethnographer – attempting to elicit organizational structure and culture from the documents created prior to an event. The ethnographic historian studies the way ordinary people in other times and cultures made sense of things by “passing from text to context and back again, until (s)he has cleared a way through a foreign mental world” (Diane Vaughan: “The Challenger Launch Decision: Risky Technology, Culture, and Deviance at NASA”, The University of Chicago Press 1996). Multidisciplinary Optimization (MDO) – is the field of engineering that focuses on the optimization of systems that involve multiple, often interacting, disciplines. The goal of MDO is to optimize overall system performance rather than optimizing each discipline in isolation. An overall cost function is selected and minimized subjected to satisfying a multitude of constraints. Mixed Signals – Signals of potential danger followed by signals that all is well, reinforcing the belief in accepted risk. For example, taking corrective actions to prevent blow by of the O-rings and then not seeing any erosion for a couple of flights and thinking you solved the problem. (An incident of O-ring erosion or bipod foam debris would be followed by several launches where the machine behaved properly, so that signals of danger were followed by all-clear signals).
Normal Accident Theory (NAT) – Proposed by sociologist Charles Perrow in 1984, the theory explains that in complex, tightly coupled systems, accidents are inevitable, or “normal,” because of the inherent characteristics of such systems.

Normalization of Deviance – A term describing how behavior that an individual, “work group,” or organization first identified as technically deviant is subsequently reinterpreted as within the norm for acceptable performance and finally officially labeled “acceptable risk.” Three factors help explain the normalization of deviance: the production of a “work group culture,” the “culture of production,” and “structural secrecy.” It is the gradual shift away from what is regarded as “normal” after exposures to “deviant behavior” – behavior straying from correct or safe operating procedure.

Organizational Culture – The values, norms, beliefs, and practices that govern how an organization functions. At the most basic level, it defines the assumptions that employees make as they carry out their work, and it is a powerful force that oftentimes persists through reorganizations and the departure of key personnel.

Organizational Sub-Culture – In a complex organization, discrete work groups will many times develop a subset of a culture, or sub-culture, which may possess some attributes of the larger organizational culture and also include unique elements related to the specific work group task.

Overconfidence Bias – People are typically overconfident in their judgment. This can be related to past successes and possibly also to arrogance!

Production of Culture – The process by which people at the bottom of the technical organization, working within the confines of a “production culture” with an established structure, develop a routinized set of norms, rules, and procedures to address unknown anomalies and drive to an assessment of risk that provides a rationale for flight (“flight rationale”).

Professional Accountability – Control over organizational activity rests with the employees with technical expertise; a system built on trust of, and deference to, the skills of those at the bottom of the organization.

Psychological Safety – The belief that the workplace is safe for interpersonal risk taking, such as questioning others or sharing unpopular but relevant news.

Recency Effect – Decision makers tend to place too much emphasis on the information and evidence most readily available to them (e.g., placing too much emphasis on recent events like “last flight” and ignoring past problems).

Recovery Window – A period of time between a threat and a major accident (or prevented accident) in which constructive collective action may be feasible.

Research Culture – As defined in this book, a culture that maintains a “thirst for knowledge” and the validation of that knowledge using scientific methods. It embraces key values and principles such as psychological safety, a knowledge-based hierarchy, intelligent fast failure and rapid concept development, transparency and collaboration, and the encouragement of competing theories.

Research Engineer – An engineer who conducts applied or use-inspired basic research to understand new concepts or machines. A research engineer at NASA Langley
was the product of fruitful engineering science: a solid combination of physical understanding, systematic experimentation using a “building-block” approach, and applied mathematics.

Routine Signals – The frequent event, even when acknowledged to be inherently serious, loses some of its seriousness as similar events occur in sequence and methods of assessing and responding to them stabilize.

Scientific Positivism – Dispute resolution by numbers: quantitative data sufficient to constitute a scientific proof of engineering logic would convert disagreements into consensus.

Scientist – A scientist studies the world around them; they develop hypotheses that lead to theories that try to explain what they observe in the natural world. Their theories are validated by carefully planned and executed experiments, which they use to corroborate their hypotheses and assumptions.

Shared Cognitive Frame – The way in which an organization or group of individuals views or frames a problem, operation, or environment (e.g., the way the Shuttle Program viewed the Space Shuttle as an operational vehicle in a routine production environment).

Strong Signals – Signals of potential danger that are based on rigorous, systematic engineering data, analysis, and testing.

Structural Secrecy – The way patterns of information, organizational structure, processes and transactions, and the structure of regulatory relations systematically undermine the attempt to know and interpret situations in all organizations (it concealed the seriousness of the O-ring problem). Structural secrecy can also be defined as the way that organization structure and information dependence obscure problem seriousness from the people responsible for problem oversight.

Sunken Costs Effect – The tendency for people to escalate commitment to a course of action in which they have made substantial prior investments of time, money, or other resources (e.g., ego, reputation, etc.).

Systems Engineering – An interdisciplinary approach to designing, integrating, and managing complex systems throughout their life cycles. It focuses on the holistic development of a system, considering not just individual components but also their interactions, dependencies, and the broader operational environment.

Team – A small group of people with complementary skills who are committed to a common purpose, set of performance goals, and approach for which they hold themselves mutually accountable.

Weak Signals – Signals of potential danger based on observational information that is informal and ambiguous: signals that at the time had no apparent clear and direct connection to risk and potential danger, or that occurred once but whose originating conditions were viewed as rare or unlikely to occur again.

Work Group – People in an organization who interact because they have a common central task. In interaction, work groups create norms, beliefs, and procedures that are unique to their particular task.
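For readers who prefer the MDO definition in symbols, a minimal generic statement of the constrained-minimization problem it describes is sketched below; the symbols f, x, g, and h are illustrative placeholders, not drawn from any specific NASA problem:

\[
\min_{\mathbf{x}} \; f(\mathbf{x}) \quad \text{subject to} \quad g_j(\mathbf{x}) \le 0, \;\; j = 1, \dots, m, \qquad h_k(\mathbf{x}) = 0, \;\; k = 1, \dots, p,
\]

where f is the overall cost function, x collects the design variables shared across the disciplines, and g_j and h_k are the inequality and equality constraints contributed by each discipline.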
Acknowledgments

I would like to thank the army of research engineers and scientists throughout the three NASA research centers – Langley, Ames, and Glenn – for their dedication to excellence and to understanding the complex unknowns related to spaceflight, for determining the cause of the Columbia accident, for their work during our return to flight in 2005, and for ensuring we continued to fly the Space Shuttle safely up until its retirement in 2011.

NASA Langley Research Center: Peter Gnoffo, Michael Nemeth, Stephen Scotti, Max Blosser, Kim Bey, Sandra Walker, Tom Horvath, Scott Berry, Mike Gazarik, Kay Wurster, Vince Zoby, Charlie Harris, Wallace Vaughan, Erik Maderas, Bill Winfree, Ed Fasanella, Delma Freeman, Erik Weiser, Mia Siochi, Marshall Rouse, Dawn Jegley, Norm Knight, Dave Moore, K. Song, Ronald Kraeger,
Mark Cagel, Genevieve Dixon, Ray Milneck, James Florance, Karen Jackson, Karen Lyles, Terry St. Clare, Kevin Rivers, Steve Altar, Bill Woods, Chris Glass, Bob Novac, Frank Novak, Damodar Ambur, Charlie Miller, Mark Hilburger.

NASA Ames Research Center: Dave Driver, Joe Lavelle, George Raiche, John Balboni, James Reuter, Tina Panontin, Stuart Rogers, Jay Grinstead,
Dan Dittman, Rabi Metha, Jim Strong, Aga Goodsell, Jan Heinemann, Keith Shackleford, Cathy Schulbach.

NASA Glenn Research Center: Matt Melis, Mike Pereira, Duane Revlock, Kelly Carney, Jay Singh, Erv Zaretsky, Angel Otero, Fred Oswald, Fred Morales, Tim Krantz, Bob Handschuh, Ken Street, Jim Zecrichek, James Frazier, Fran Hurwitz, Beth Opila.
“Dr. Camarda’s insider critique of NASA exposes its corruption of research by sending vulnerable astronauts on potentially suicidal missions without the checks and balances of scientific inquiry. He outlines what it would take to make NASA and other large organizations, such as Boeing, which has been beset with airplane catastrophes, develop high-performance organizations by balancing research goals with performance objectives.” —JACK MATSON Professor Emeritus, Penn State University and author of Innovate or Die
“Dr. Camarda describes a near miss not as a success but a ‘system failure.’ Many firefighters would not have died or been seriously injured over my 32-year FDNY career if the ‘cultures’ and ‘biases’ he teaches us about in his book were understood.” —COMMISSIONER THOMAS VON ESSEN, Commissioner of FDNY during the 9/11 tragedy, author of Strong of Heart, Harper Collins, 2002
“Dr. Camarda’s remarkably well written and documented book clearly and effectively presents the methodology for developing and maintaining high-performing organizations.” —DR. RANDY GRAVES, Former Director of Aerodynamics, NASA Headquarters
“Charles Camarda has drawn on his rich personal experience at NASA and extensive research to describe how a high-risk organization can build an effective safety culture, as well as how such a culture can erode over time. Managers in any high-risk environment can learn from this book.” —PROF. MICHAEL ROBERTO Bryant University, author, Unlocking Creativity, Assistant Professor Harvard Business School, and Visiting Assistant Professor at the NYU Stern School of Business
“Mission Out of Control is told from the moving perspective of a friend and colleague of the astronauts who perished on Columbia. It should be required reading not only for students of spaceflight history, but also by anyone interested in how organizations develop certain cultures, how those cultures can change over time, and how, in turn, culture can lead to success or ultimate failure. In the end, this is also a book about speaking truth to power; the story of one man’s integrity and grit, and the challenges he faced in delivering a message that some did not wish to hear.” —DR. MARK J. LEWIS, CEO Purdue Advanced Research Institute LLC

“Mission Out of Control exposes the systemic breakdown and erosion of the ‘research engineering’ culture that made the Apollo program so successful. Until NASA is reinvented, safe return to the moon and on to Mars will remain very high risk.” —JOHN NEER, Former Chief Engineer for Lockheed Martin during the Columbia Accident Investigation