Team TelNet

What NetOps Can Learn from Apollo 13 — NASA’s Most Successful Failure

Garrett Williamson
• May 29, 2020

This year marks the fiftieth anniversary of three American heroes climbing out of a crippled spacecraft as it bobbed in swells of the South Pacific Ocean. Those heroes were astronauts Jim Lovell, Jack Swigert and Fred Haise. Their mission — Apollo 13 — would eventually be dubbed NASA’s “most successful failure.”

The mission was a failure in that the crew members were unable to complete their main objective: to land on the moon. But it was successful because, 87 hours after their craft suffered an oxygen tank explosion, all three men walked away alive.

It would be easy to scoff at the idea of drawing parallels between this ill-fated mission and the goings-on in a typical Network Operations Center. (My wife did.) After all, the drama involved in troubleshooting and repairing a ship hurtling through space at thousands of miles per hour is just incomparable. Nevertheless, if your network is supporting our critical healthcare infrastructure right now, it is no exaggeration to say that your deliberate actions during an outage could save lives.

It’s with this in mind that I’d like to talk about the key actions taken during the Apollo 13 mission, and what we as network operators can learn from them.

Applying Apollo 13 Lessons to NetOps

1. Always Over-Communicate

“Houston, we’ve had a problem here.” These words, uttered by Jack Swigert, were Mission Control’s first indication that their astronauts were in trouble. A routine maintenance task (what we in the NOC would call ‘Non-Service-Impacting’) set off a chain of events that would push the limits of the crew, their ship and the experts on the ground tasked with bringing them home.

This illustrates one of the most important parts of any major outage response: communication. While their craft was suddenly beset by a flurry of bangs, shimmies and warning lights, Swigert and his crew knew that their survival depended on bright minds on the ground. Getting these guys ‘in the loop’ at the very instant the problem was detected, and continuing to provide them updates in those first critical minutes, may well have saved their lives.

A major network outage can be a stressful, disorienting experience. Amid your own flurry of bangs, shimmies and warning lights, your number one priority is to identify, isolate and fix whatever it is that’s broken. It would be easy to let this alone consume all your energy. Much like the astronauts of Apollo 13, you very likely have a team of experts ‘on the ground’ that can be instrumental in getting your network back together. Now is not the time to ‘see what you can do’.

When seconds count, you want to make sure that everyone in your organization who could help, or might have insight, is aware and involved as soon as humanly possible. Insights can come from places you wouldn’t expect, which is why it’s always better to over-share than under-share.

Put simply: if you have a major problem, communications with your team should be loud, clear and immediate.

2. Make a Checklist

One of the first problems that needed to be solved immediately following the explosion of oxygen tank No. 2 was that the Command Module (the cone-shaped section of the craft which houses the astronauts) was losing power, and quickly. On the Apollo 13 mission, like the Apollo missions before it, electricity was being generated through fuel cells. These cells use oxygen and hydrogen to charge a series of batteries, which in turn power everything from the lights in the cabin, to critical communication and navigation equipment.

With the loss of an oxygen tank, these fuel cells ceased to function, leaving the batteries in a state of rapid discharge. With minutes to spare before the crew was plunged into darkness, they found themselves saved by something actually rather mundane: a checklist.

With electricity being such a crucial resource, NASA already had a plan in place to conserve it should a fuel cell ever fail. A checklist of non-critical systems to power down, and the order in which to do so, was provided to the crew immediately on discovering the power loss condition. By shutting down these nonessential systems, power was reserved for only the most vital instruments.
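For a NOC, the same idea can be written down long before it is needed. The following is only a minimal sketch of an ordered power-down checklist: the system names, their ordering and the helper `shutdown_sequence` are hypothetical and not taken from NASA's procedures or any real network.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ChecklistStep:
    order: int       # position in the shutdown sequence (1 = first to go)
    system: str      # hypothetical system name
    critical: bool   # critical systems are never shut down by this checklist


# Hypothetical example data; names and ordering are illustrative only.
POWER_LOSS_CHECKLIST = [
    ChecklistStep(3, "guest-wifi", critical=False),
    ChecklistStep(1, "lab-switches", critical=False),
    ChecklistStep(2, "digital-signage", critical=False),
    ChecklistStep(4, "core-router", critical=True),
]


def shutdown_sequence(checklist):
    """Return the non-critical systems in the order they should be powered down."""
    steps = sorted(checklist, key=lambda step: step.order)
    return [step.system for step in steps if not step.critical]


if __name__ == "__main__":
    for number, system in enumerate(shutdown_sequence(POWER_LOSS_CHECKLIST), start=1):
        print(f"{number}. Power down {system}")
```

The point of keeping the order and the critical flag in data rather than in someone's head is that nobody has to decide what is expendable in the middle of an outage.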

3. Prepare in Advance (But Be Ready to Improvise)

The success or failure of any disaster response hinges, to a great extent, on preparation. By entertaining our fears and imagining our worst-case-scenarios, we can put in place policies and procedures that can be activated should such a disaster take place. Having these policies and running periodic ‘fire drills’ against them is one of the best ways to ensure you’ll be ready to take action when the worst happens.

Unfortunately, it is not possible to anticipate every hurdle you may encounter. Such was the case with Apollo 13. With the Command Module on life support, the crew was forced to refashion the Lunar Module, a craft designed to transport two men to and from the lunar surface over a mere 20 hours, to serve as their lifeboat for the 80+ hour flight back to Earth.

Necessary procedures included using the Lunar Module’s descent engine, an engine designed for an entirely different purpose, for course corrections while on-board navigation equipment sat powered down and useless. Astronauts were also forced to retrofit Command Module air filters for use in the Lunar Module, using supplies such as duct tape and the cardboard front cover of their flight plan booklet.

4. Develop a Culture of Excellence

To quote the Apollo 13 Review Board report, “The accident is judged to have been nearly catastrophic. Only outstanding performance on the part of the crew, Mission Control, and other members of the team which supported the operations successfully returned the crew to Earth.”

These words belie the profound challenges overcome by the men and women of NASA, many of which had no precedent in aerospace engineering at the time. Ultimately, it was the problem-solving, critical-thinking and collaboration performed by a vast team of engineers, all of them experts in their craft, that saved the crew.

No organization better exemplifies a Culture of Excellence than NASA did in the 1960s. To follow their example, today’s NOCs should embrace a posture of ‘hardheaded self-criticism’, ensuring that we learn from our mistakes.

Better yet, by constantly challenging ourselves in our quest for knowledge, and dedicating ourselves to self-betterment, we may avoid mishaps altogether. History has proven that for a dedicated team of experts with a singular unified vision, very little is impossible.

5. Document Everything

As the dust settled on the Apollo 13 saga, with its crew safely back on Earth, NASA finally had the luxury of asking the all-important question: “What the heck happened?” This question was not purely academic. If an oxygen tank exploded on this mission, what’s stopping one from exploding on the next?

NASA pursued an answer relentlessly, and eventually discovered several key points at which this disaster could have been prevented and was not. The full technical details are outside the scope of what we’re talking about today, but in part, it was discovered that the faulty oxygen tank had been dropped on an assembly line some two years prior to making its debut in space. It only fell two inches, and it was subsequently tested and cleared for flight. That this detail was discoverable after so much time is a testament to something at which NASA proved particularly adept: documentation.

Over the course of operating a network, there are many instances where it can be tempting to cut corners on documentation. This is one of our most mundane tasks, and when all your lights are green, it’s easy to get complacent. Don’t! The most seemingly innocuous detail can make the difference between being able to say “This is why this broke, and here’s how it’s never happening again” and having to say “Gee, I have no clue.” Document everything!
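One low-effort way to make those innocuous details recoverable years later is an append-only log with one timestamped record per event. This is just a sketch under assumptions: the JSON Lines file layout, the field names and the helpers `make_entry`, `append_entry` and `history_for` are all hypothetical, not a description of any particular NOC's tooling.

```python
import json
from datetime import datetime, timezone


def make_entry(device, action, note, when=None):
    """Build one log record; the field names are illustrative assumptions."""
    when = when or datetime.now(timezone.utc)
    return {
        "timestamp": when.isoformat(),
        "device": device,
        "action": action,
        "note": note,
    }


def append_entry(path, entry):
    """Append the record as one JSON line; earlier lines are never rewritten."""
    with open(path, "a", encoding="utf-8") as log:
        log.write(json.dumps(entry) + "\n")


def history_for(path, device):
    """Return every record for a device, oldest first (file order)."""
    with open(path, encoding="utf-8") as log:
        entries = [json.loads(line) for line in log if line.strip()]
    return [entry for entry in entries if entry["device"] == device]
```

A record as small as `make_entry("edge-sw-01", "handling", "dropped about two inches during racking; retested OK")` is exactly the kind of note that seems pointless on the day and invaluable during a root cause analysis.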

The events of Apollo 13 forever shaped the way NASA operates, from its testing and compliance, to its air-to-ground communication, to its disaster response and recovery. Outages can and will happen. It’s our deliberate acts before, during and after that will shape our reputations, our companies’ reputations and the outcome for end-users. It’s up to every one of us to bring our own astronauts home, wherever they may be.

Garrett Williamson

Garrett Williamson is a Network Operations Technician for TelNet Worldwide. Making his home in Southeast Michigan, Garrett is passionate about IP networking, emerging technologies and Australian Shepherds.
