Network Resiliency Is NOT About Redundancy

If you ask a network engineer about “resiliency,” you’ll often hear about redundant links, fast OSPF or BGP convergence, and high-availability clusters. But what if that’s only a tiny piece of the puzzle? What if true resiliency isn’t about the network never breaking, but about what happens—or doesn’t happen—when it inevitably does?

Inspired by a deep dive with networking expert Russ White, let’s reframe what network resiliency really means for our designs and daily operations.

The Resiliency Misconception: What It Is NOT

First, let’s bust some myths. A resilient network is not:

  • The network never being down. (Impossible.)

  • No one ever noticing when it is down. (A nice dream, but not the goal).

  • The routing protocol never needing to converge.

  • Just having redundancy. (Redundancy is a tool, not the outcome).

  • Solely about how fast convergence is.

Chasing these alone leads to complex, fragile systems that are nightmares to troubleshoot.

The True Goal: The “Forget-It-and-Fix-It” Network

Russ White defines resilience brilliantly: “Resilience is not needing to troubleshoot, but also making it easy to troubleshoot and quickly fix things when they do break.”

The goal is a network with a very high MTBF (Mean Time Between Failures) or, more honestly, MTBM (Mean Time Between Mistakes), since most failures are caused by our own changes and errors, combined with a very low MTTR (Mean Time To Repair).

In simple terms: How do we prevent breaks, and when they happen, how do we fix them incredibly fast?

This shifts our focus from pure prevention to a balance of prevention and rapid recovery.
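To make the two numbers concrete, here is a minimal sketch of how MTBF and MTTR could be computed from an incident log. The incident timestamps, the observation window, and the function names are all made up for illustration, and MTBF is taken here as total uptime divided by the number of failures, which is one common convention rather than the only one.

```python
from datetime import datetime, timedelta

# Hypothetical incident log: (failure began, service restored).
incidents = [
    (datetime(2024, 1, 3, 9, 0), datetime(2024, 1, 3, 9, 30)),
    (datetime(2024, 2, 10, 14, 0), datetime(2024, 2, 10, 16, 0)),
    (datetime(2024, 3, 20, 1, 0), datetime(2024, 3, 20, 1, 30)),
]

# Assumed observation window for the log above.
window_start = datetime(2024, 1, 1)
window_end = datetime(2024, 4, 1)


def total_downtime(incidents):
    """Sum of all outage durations."""
    return sum((end - start for start, end in incidents), timedelta())


def mttr(incidents):
    """Mean Time To Repair: average outage duration."""
    return total_downtime(incidents) / len(incidents)


def mtbf(incidents, window_start, window_end):
    """Mean Time Between Failures: uptime in the window per failure."""
    uptime = (window_end - window_start) - total_downtime(incidents)
    return uptime / len(incidents)


print("MTTR:", mttr(incidents))
print("MTBF:", mtbf(incidents, window_start, window_end))
```

Tracking both values over time, rather than either one alone, shows whether a change made failures rarer, repairs faster, or neither.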

The Hidden Time Sinks: Dwell Time and MTTI

Two critical metrics often get overlooked:

  1. Dwell Time: How long does it take you to detect a failure? Is it minutes after a critical app slows, or hours later, when a user finally calls? Good monitoring shrinks dwell time.

  2. MTTI (Mean Time To Innocence): This is the frustrating time spent proving “it’s not the network.” It’s annoying, time-consuming, and often leads to finger-pointing.

Russ argues we shouldn’t chase a low MTTI just to hang up the phone and say “not my problem.” Instead, we should use the network as a platform to help others troubleshoot. Work with the server, application, and security teams. Provide flow data, path traces, and device metrics to help find the real problem faster. This collaborative approach turns the network from a suspected culprit into a central tool for enterprise-wide problem-solving.
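Returning to the first metric, dwell time is simply the gap between when a failure began and when someone noticed it. The sketch below groups that gap by how each failure was detected. The event records, field names, and detection sources are hypothetical and only illustrate the idea.

```python
from datetime import datetime
from statistics import mean

# Hypothetical failure records: when the fault began, when it was
# detected, and what detected it.
events = [
    {"began": datetime(2024, 5, 1, 10, 0),
     "detected": datetime(2024, 5, 1, 10, 4),
     "source": "monitoring"},
    {"began": datetime(2024, 5, 7, 8, 0),
     "detected": datetime(2024, 5, 7, 11, 0),
     "source": "user call"},
    {"began": datetime(2024, 5, 9, 15, 0),
     "detected": datetime(2024, 5, 9, 15, 6),
     "source": "monitoring"},
]


def dwell_minutes(event):
    """Minutes between the start of a failure and its detection."""
    return (event["detected"] - event["began"]).total_seconds() / 60


def mean_dwell_by_source(events):
    """Average dwell time in minutes, grouped by detection source."""
    grouped = {}
    for event in events:
        grouped.setdefault(event["source"], []).append(dwell_minutes(event))
    return {source: mean(values) for source, values in grouped.items()}


print(mean_dwell_by_source(events))
```

A breakdown like this makes it obvious which failures are still being found by users rather than by monitoring.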

Step One: Triage – “What Does ‘Down’ Even Mean?”

Before you can fix anything, you must understand what’s broken. Is it:

  • A business-critical application being completely offline?

  • That same application just being slow?

  • A single VIP who can’t reach email?

  • An entire site offline (and if so, how big is that site)?

  • Or something entirely non-business-related?

Taking two minutes to triage properly prevents you from applying a data-center fix to a printer problem. It focuses your effort on what truly impacts the business.
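The questions above can be sketched as a small triage function. This is only an illustration: the `Report` fields, the priority levels, and the 50-user threshold are assumptions, and every organization would set its own rules.

```python
from dataclasses import dataclass
from enum import IntEnum


class Priority(IntEnum):
    """Hypothetical priority levels; lower number means more urgent."""
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


@dataclass
class Report:
    """A hypothetical problem report as it reaches the network team."""
    business_critical: bool  # Does it touch something the business depends on?
    fully_offline: bool      # Completely down, as opposed to merely slow?
    users_affected: int      # Rough count of people impacted


def triage(report, wide_impact=50):
    """Map a report to a priority. The threshold is an assumed value."""
    if not report.business_critical:
        return Priority.P4
    widespread = report.users_affected >= wide_impact
    if report.fully_offline and widespread:
        return Priority.P1
    if report.fully_offline or widespread:
        return Priority.P2
    return Priority.P3


# A critical application that is slow for many users.
print(triage(Report(business_critical=True, fully_offline=False,
                    users_affected=500)).name)
```

Even a crude rule set like this forces the "what does down mean?" conversation to happen before anyone starts changing configuration.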

The Ultimate Win: Resiliency Makes Troubleshooting Easier

Here’s the powerful, self-reinforcing logic of a resilience-focused design: a network built for high MTBF and low MTTR is inherently simpler, more observable, and better documented. So when a failure does occur, the path to diagnosis and repair is shorter, which pushes MTTR down even further.

The tools and clarity you built to prevent failures become the same tools that help you cure them. Your resilient network becomes easier to troubleshoot, which in turn makes it more resilient—a virtuous cycle.


Verdict / Summary

Forget about building an “unbreakable” network. That pursuit leads to complexity, which is the enemy of stability and troubleshootability.

Instead, focus on building a “forgettable” network.

  • Design for High MTBF/MTBM: Use simplicity, automation guardrails, and validated changes to make failures rare.

  • Engineer for Low MTTR: Build comprehensive visibility (shrinking Dwell Time), design for easy fault isolation, and foster collaboration to move past MTTI.

  • Always Solve the Right Problem: Start every troubleshooting session and design review by asking, “What does resilience look like for this application, for this business, at this time?”

True network resiliency isn’t found in a hardware datasheet. It’s a mindset—a commitment to building transparent, understandable systems that are as easy to repair as they are to live with. It’s about making the network a stable, silent foundation for the business, and a powerful ally when things go wrong elsewhere.
