
Grey Failures and the Law of Large Numbers

Russ White

You’ve just finished building a 1,000-router fabric using a proper underlay and overlay. You’ve thought of everything, including doing it all with a single SKU, carefully choosing transceivers, using only the best optical cables, and running all the software through a rigorous testing cycle.

Time to relax? Perhaps—or perhaps not.

One of the various qualities of quantity is the new kinds of failure modes you discover when building at scale, such as grey failures and the effects of the law of large numbers. Let’s look at each of these to get a better feel for how scale impacts failures.

What is a grey failure? The optical transceiver that drops 1% of its packets, the fiber cable that drops packets only when the air currents caused by a fan blow it around a little, and the router that drops a BGP update because of momentary memory fragmentation are all grey failures. Grey failures are:

  • Intermittent large failures that self-correct too quickly to detect
  • Constant small failures that don’t raise any alarms and yet still impact network performance

At-scale network deployments are more likely to face grey failures than smaller networks because of the law of large numbers, which says that if you have enough of anything, you’re likely to see things you wouldn’t (statistically) expect.
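The law of large numbers here can be made concrete with a little probability. A minimal sketch, assuming independent failures (a simplification) and illustrative numbers:

```python
# Chance of seeing at least one "rare" event across many devices,
# assuming failures are independent (an illustrative simplification).
def p_at_least_one(p_single: float, n_devices: int) -> float:
    return 1 - (1 - p_single) ** n_devices

# A 0.1% per-device problem is rare on one box...
print(p_at_least_one(0.001, 1))
# ...but nearly certain somewhere in a fleet of 5,000 devices.
print(p_at_least_one(0.001, 5000))  # ~0.993
```

The rare event never becomes more likely on any one device; there are simply enough devices that it is almost guaranteed to appear somewhere.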

The law of large numbers exacerbates all kinds of failures, of course. Consider the small network below.

 
[Figure: four networks illustrating grey failures. Source: Russ White]

If each of the routers along row B is down (because of a failure, software upgrade, etc.) for one hour a year, then we have:

  • Net1 will have one hour of downtime each year
  • Net2 will have about 30 minutes of downtime each year
  • Net3 will have about 15 minutes of downtime each year
  • Net4 will have about seven minutes of downtime each year

Note these are very rough numbers, based on a back-of-the-envelope approximation rather than “official math.”
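The same back-of-the-envelope approximation can be written down directly. This sketch assumes Net1 through Net4 have 1, 2, 4, and 8 parallel routers in row B (an assumption about the figure), and that each doubling of parallel paths roughly halves downtime, matching the rough numbers above rather than an exact probabilistic model:

```python
# Rough per-network downtime: 60 minutes/year per router, divided by the
# number of parallel routers in row B. This mirrors the article's rough
# approximation, not "official math."
def rough_downtime_minutes(parallel_routers: int,
                           per_router_minutes: float = 60.0) -> float:
    return per_router_minutes / parallel_routers

# Assumed router counts for Net1..Net4 (1, 2, 4, 8) are illustrative.
for name, n in [("Net1", 1), ("Net2", 2), ("Net3", 4), ("Net4", 8)]:
    print(f"{name}: ~{rough_downtime_minutes(n):g} minutes/year")
```

This reproduces the one hour, thirty minutes, fifteen minutes, and roughly seven minutes quoted above.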

That’s good, right? Well, consider what happens when you put one hundred routers in row B. You get low overall network downtime, but if each router fails at least once a year, you will see a router failure about every three-and-a-half days.
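The three-and-a-half-day figure falls straight out of the failure rate:

```python
# 100 routers, each failing about once a year, means 100 failures per
# year somewhere in the fleet -- one roughly every 3.65 days.
failures_per_year = 100 * 1  # routers x failures per router per year
days_between_failures = 365 / failures_per_year
print(round(days_between_failures, 2))  # 3.65
```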

You’ve increased the network’s global optimization but also caused serious local optimization problems. This is the law of large numbers in action—even if there is some small chance of something happening in one device, if you put enough devices in one place, low probability things will happen.

Expanding the scale to several thousand routers and tens of thousands of optical transceivers ensures you will have daily—or even hourly—failures in your network.

Another problem with expanding the number of parallel links between rows A and C in this small network is that each parallel link adds to the amount and velocity (or rate of perceived change) of control plane state. Increasing the number of links increases the number of paths the protocol must track, the number of neighbor relationships it must build and maintain, and the number of routing updates it must send.

These problems do not mean we should stop building highly parallel Clos and butterfly fabrics—it just means we need to think about the impact of these massively parallel networks on routing protocol convergence. Solutions like distributed optimal flooding and RIFT can control the control plane’s state while allowing massively parallel networks to scale efficiently.

What About Grey Failures?

If there’s even a 1% chance a given optical transceiver will drop 1% of its traffic, and you have 10,000 transceivers in your network, you will have somewhere around 100 transceivers dropping 1% of their traffic. Even a handful of such transceivers can (seemingly randomly) wreck application performance.
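Treating each transceiver as an independent coin flip (a simplifying assumption), the expected count is just n × p, and the chance of escaping with zero bad units is vanishingly small:

```python
# Expected number of "bad" transceivers, assuming each of n units
# independently has probability p of being a 1%-dropper.
n, p = 10_000, 0.01
expected_bad = n * p
print(expected_bad)  # 100.0

# Probability that *none* of the 10,000 transceivers are bad --
# effectively zero at this scale.
p_none = (1 - p) ** n
print(f"{p_none:.2e}")
```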

Grey failures cannot be solved by changing how control planes work, nor can they be avoided at scale. The tools you can use to catch grey failures are:

  • High-volume testing paired with paying attention to even seemingly minor variations in performance
  • Tracing real traffic flows to find lower performance paths in your network
  • Forcing grey failures out into the open by making them into hard failures

Each of these options has positive and negative aspects. For instance, some large-scale operators periodically send massive pings across all their links, then look for even low packet loss levels to find grey failures in optical circuits. Mass ping systems like this work, but they require special tooling, and the tests must often be run while production traffic is drained from the links under test.
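The core of such a mass-ping system is simple: compare sent and received counters per link and flag even tiny loss rates. A toy sketch, where the link names and the 0.1% threshold are illustrative rather than taken from any real tool:

```python
# Flag links whose loss rate exceeds a (deliberately low) threshold.
# counters maps link name -> (packets sent, packets received).
def flag_lossy_links(counters: dict[str, tuple[int, int]],
                     max_loss: float = 0.001) -> list[str]:
    lossy = []
    for link, (sent, received) in counters.items():
        loss = (sent - received) / sent
        if loss > max_loss:
            lossy.append(link)
    return lossy

counters = {
    "spine1-leaf3": (100_000, 100_000),  # clean link
    "spine2-leaf7": (100_000, 99_850),   # 0.15% loss -- a grey failure
}
print(flag_lossy_links(counters))  # ['spine2-leaf7']
```

The hard part in practice is not this comparison but generating enough probe traffic, per link, without disturbing production flows.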

Tracing every packet’s path through the network is difficult and expensive today, but research and thought are being put into this idea, particularly in relation to segment routing (SRv6).

You can sometimes force grey failures to become hard failures by setting low tolerances on certain interface parameters, such as drops. Of course, false positives are still a problem in these situations, so you might want to look for consistent drops across time before bringing an interface down.
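One way to express the "consistent drops across time" idea is to act only when drops exceed a threshold in several consecutive polling windows. A minimal sketch with illustrative thresholds:

```python
# Down an interface only when drops exceed `threshold` in `consecutive`
# polling windows in a row, to filter out one-off false positives.
def should_down_interface(drop_counts: list[int],
                          threshold: int = 10,
                          consecutive: int = 3) -> bool:
    streak = 0
    for drops in drop_counts:
        streak = streak + 1 if drops > threshold else 0
        if streak >= consecutive:
            return True
    return False

print(should_down_interface([0, 50, 0, 40, 0]))   # False: spikes, not sustained
print(should_down_interface([0, 20, 30, 25, 0]))  # True: three windows in a row
```

The streak requirement trades detection speed for fewer false positives; tuning both knobs is where the operational judgment lives.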

There aren’t many “good” solutions to these problems today—but grey failures, exacerbated by the law of large numbers, are one of those places where the quantity of a large-scale network can turn into a quality of the network.
