Thoughts On Switch Failures

Russ White

In April of 2021, Microsoft published a study of switch failures in its data centers. What can we learn?

First Observation: The law of large numbers can often overwhelm resilience through redundancy

I’ve used the example of some (very) old lab testing with EIGRP to show how increasing the number of redundant (parallel) links or paths doesn’t necessarily increase a network’s resilience. This Microsoft paper gives us some real numbers. According to this paper, about 3% of all switches installed in a data center fabric fail within 80 days. If you bring up a new fabric with 2,000 switches, around 60 will fail in the first three months of operation. The law of large numbers is brutal in hyper-scale environments.
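
To make the arithmetic concrete, here is a minimal sketch in Python, assuming the paper’s 3% failure figure and treating each switch failure as independent (an assumption made purely for illustration). It shows both the expected failure count in a 2,000-switch fabric and how quickly the probability that at least one member of a redundant group fails grows as the group gets larger.

   # Back-of-the-envelope sketch. Assumptions: ~3% chance a given switch
   # fails within its first ~80 days, and failures are independent.
   p_fail = 0.03

   # Expected failures in a new 2,000-switch fabric
   switches = 2000
   print(f"Expected failures: {switches * p_fail:.0f}")  # roughly 60

   # Probability that at least one of N parallel (redundant) switches fails
   for n in (2, 4, 8, 16):
       p_any = 1 - (1 - p_fail) ** n
       print(f"{n} parallel switches: P(at least one fails) = {p_any:.1%}")

Adding parallel paths increases the odds that some member of the group is down at any given moment; redundancy only helps if the design can absorb those individual failures gracefully.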

Second Observation: Most failures are caused by hardware

According to the researchers, 32% of all switch failures were caused by hardware. The authors did note that a single vendor had a much higher rate of hardware failures, one they needed to investigate and understand better, but in general, this is a much higher failure rate than I would have expected.

Based on experience, most of these failures are optical transceivers or fiber runs. Optical cabling is the most “delicate” component in a data center fabric (or any other network). Fiber is difficult to terminate correctly, unforgiving of even minor mistakes, and each disconnect/reconnect cycle tends to degrade the connectors. In addition, the optical-to-electrical interface produces a lot of heat—heat which destroys electronics over time.

The study does break out server connectivity, which caused 5% of all failures. This category would include at least some of the optical problems I would expect to show up in a study of this kind.

The second most common failure was unplanned power outages, which caused 28% of all failures. I didn’t expect power outages to be this common. I’ve always had this vague notion that data center power distribution is a solved problem and that when there is a failure it will be a “big failure,” taking out many devices. Based on this paper, I am probably wrong—it looks like individual devices lose power more often than I thought.

Third Observation: Software failures still cause a lot of outages

Software failure was the third-most common cause of outages, coming in at around 17%. The paper has an intriguing note on software failures: “We find that using the same underlying hardware, SONiC switches have a higher survival likelihood than vendor switch OSes.” This statement needs to be explored more fully.

For those unfamiliar with SONiC, it’s a network operating system initially developed by Microsoft and subsequently supported by an open-source community. Because it is open source, many engineers would expect SONiC to be “lower-quality code.”

After all, isn’t part of the vendor’s job to validate each release of their software against some standard test suite? Aren’t the vendor’s software developers directly responsible for the quality of their code, and hence more likely to produce better code than an open source project? The paper’s implication—that SONiC has fewer software defects than commercial code—seems to fly in the face of expectations.

It’s a myth, however, that open-source software is all written by developers working in their free time, with no solid test suite behind it. While some open-source projects rely on a small community of volunteers, and others rely on a single vendor who sells support to gain revenue, there are some projects, like FR Routing and SONiC, that multiple organizations support. (Microsoft is a major supporter of SONiC.)

In the case of open-source projects supported by multiple organizations, many full-time developers are working on the project—they don’t all work for the same company. While there may be a common set of CI/CD tests available to the entire community (and, in the case of FR Routing, run daily across the project by a community-supported nonprofit), each supporting organization will also have internal test suites, acceptance tests, and so on.

Each organization using the open-source package will also find defects, security holes, etc.—and rather than simply tossing them over the cubicle wall to be fixed by a vendor (who must prioritize many thousands of requests each year), they can prioritize the fix themselves, using their resources to make needed changes.

Does all of this make open-source code superior to commercially available code? No. Overall, the quality of the code is just the quality of the code, regardless of the source. Commercial and open-source development each have advantages and disadvantages.

There isn’t space to dive into when to choose which in this article—I’ll leave that for another time. But I will point out another reason the researchers saw a difference between open-source and commercial software: feature count.

Routers and switches are complex. We often don’t think about that complexity, but it’s there, lying under the surface, in millions of lines of code. For instance, when I started working for a major router vendor, the entire routing protocol stack (including OSPF, IS-IS, BGP, and a few others) was probably less than 100,000 lines of code. Today, most BGP implementations by themselves stand at well over 250,000 lines of code.

There have been increases in complexity across every part of a router—each new interface type or speed supported, each new quality of service feature supported, each new way to access the router, each new telemetry method, and so on. They all add complexity.

It’s not that any individual feature of a given open-source implementation is necessarily simpler than any given feature of a commercial implementation. However, it is broadly true that smaller code bases, given the same level of development effort and skill, will tend to have smaller defect counts.
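
As a rough illustration of that point, assume for a moment that two network operating systems ship with about the same latent defect density (the density used below is purely an assumption, not a figure from the paper or from any vendor); the smaller code base simply carries fewer latent defects.

   # Illustrative arithmetic only; defect density and line counts are assumed.
   defects_per_kloc = 0.5  # assumed latent defects per 1,000 lines shipped

   for name, kloc in (("smaller feature set", 100), ("larger feature set", 250)):
       print(f"{name}: ~{kloc * defects_per_kloc:.0f} latent defects "
             f"in {kloc},000 lines of code")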

Given the smaller feature count and the general professionalism of the SONiC project, Microsoft’s statement that SONiC’s uptime rivals or exceeds that of commercial network operating systems isn’t all that surprising.

The bottom line: hardware failures are more common than software failures, and power outages are more common than I expected. When you deploy a lot of hardware, you will experience a lot of failures. Finally, open-source software is not necessarily worse than commercial software—it all depends on the quality of the respective code bases and the set of problems each is trying to solve.