
Preempting Gray Failures With AI/ML

Ethan Banks

The network was definitely up, and had been up. There was nothing in the logs indicating link flaps, spanning-tree convergence events, or routing process adjacency changes. The packets had been, were presently, and presumably would forever be flowing. Flowing like a river.

I was pondering this inaccurate version of reality because of an annoying ticket that wouldn’t go away. One of the oddball apps at the finserv where I was an underpaid network engineer was logging transaction failures. “Was it the network?” the ticket wanted to know. I scrolled through the ticket’s notes, reading the comments of various teams.

That other teams had reviewed the issue was noteworthy. The ticket hadn’t been punted to networking, usually the first action taken for any application performance issue. That behavior drove me a little insane, as we’d have to make some sort of effort to demonstrate that the crapplication was leaning too hard into the database, which was sitting on metal gasping for more CPU. Or RAM. Or IOPS. Or all of those, since the server upgrades never seemed to get done until a customer was screaming bloody murder.

This ticket was not that. The app folks had logged their investigation, and there was definitely no code problem. They had dug into the error logs, and found that every now and then, a transaction would be initiated, but no response would come back. This would suspend the transaction, which is a big problem at a finserv. Transactions must go through. That’s how we all got paid. No transaction, no fee. And that was just our side of it. More to the point, customers tend to get cranky when transactions don’t settle on their account in a timely fashion. Money, you see. It’s a thing.

The server folks had also looked at the metal. The metal in this case was a very special bit of metal, a custom platform from HP built for high availability and costing eye-watering sums of money to have sitting in our data centers. According to the keepers of the special metal, it was fine: underutilized in every metric that mattered, and that was believable.

A Needle In A Haystack

I stared at the question again, logged by a fellow IT team member I’d worked with before and respected. “Was it the network?” Hmm. I took the ticket seriously and started asking questions. What was the starting point of the transaction? And the ending point?

I learned that the transaction was flowing between our two main data centers, located at different corners of the continent. My first thought was that one of our WAN carriers was dropping packets, and my instinct was to open a case with them. But I hesitated. Opening a case with a carrier is only slightly better than screaming into the void.

So, rather than open a ticket with the carrier, I decided to first do a packet walk host to host, documenting every switchport, router link, firewall, and IPS in between. This wasn’t quite as arduous as it sounds. We were fairly good at documentation, and fairly thorough with network element management. Fairly.

Referencing my own mental map and our documentation library, paired with some input from the NOC, I had the physical path nailed down in an hour or so. I started looking for…anything. I didn’t know what I was looking for, exactly: something out of place that would make a light bulb go off in my head.

As I made my way from the access layer into the core of this topology that predated leaf-spine, I discovered that the troublesome transaction would sometimes traverse an etherchannel between the two core switches in my data center on its way to the carrier cloud. And that’s when I had my light bulb moment.

An etherchannel is a bundle of physical links—fiber links in this case—that behave as a single logical link. Each fiber link had a pair of optics, one on either end of the link. As this etherchannel had four links, there were eight total optics involved. The light bulb suggested that I review Ethernet statistics for each of these interfaces, and sure enough, one of the eight optics was throwing errors. Not a lot of errors based on a percentage of traffic flowing across that specific fiber link, but enough to stick out when compared to the other interfaces in the etherchannel group.
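
If you want to see the shape of that comparison, here’s a minimal sketch. The interface names and counter values are invented for illustration, not pulled from that switch: compute an error rate per member link and flag the one that’s wildly out of line with its peers.

```python
# A minimal sketch, not actual switch output: compare error *rates* across the
# members of a port-channel and flag the one that stands out from its peers.
# Interface names and counter values below are invented for illustration.

def error_rate(errors, packets):
    """Errors per packet forwarded on a member link."""
    return errors / packets if packets else 0.0

# Hypothetical per-member counters, as you might scrape from "show interface".
members = {
    "Eth1/1": {"errors": 12,     "packets": 9_800_000_000},
    "Eth1/2": {"errors": 7,      "packets": 9_750_000_000},
    "Eth1/3": {"errors": 58_402, "packets": 9_900_000_000},  # the flaky optic
    "Eth1/4": {"errors": 9,      "packets": 9_820_000_000},
}

rates = {name: error_rate(c["errors"], c["packets"]) for name, c in members.items()}
median_rate = sorted(rates.values())[len(rates) // 2]

for name, rate in rates.items():
    # Flag any member whose error rate is wildly out of line with its peers.
    if rate > 100 * median_rate:
        print(f"{name}: error rate {rate:.2e} vs. peer median {median_rate:.2e}")
```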

To confirm my suspicion, I verified with (probably, it’s been a while) “show port-channel load-balance forwarding-path”. Sure enough, the error-prone link was indeed carrying the traffic for this oddball app between these two very special hunks of metal. And so it was that the root cause of the gray failure was revealed. The network was up. The etherchannel was up. All link members in the etherchannel were up. But one optic was ever-so-slightly bad, resulting in a little bit of dropped traffic, and breaking this fussy, non-TCP application every now and then.
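
Why did this one flow keep landing on the bad fiber? Etherchannel load balancing hashes fields from each flow to pick a member link, so a given source/destination pair gets pinned to the same physical link every time. Here’s a toy illustration of that pinning behavior; it is not Cisco’s actual hash algorithm, and the addresses are made up.

```python
# Toy illustration only, not the switch's real hashing algorithm: an
# etherchannel hashes flow fields to pick one member link, so a given
# source/destination pair rides the same physical fiber every time.

import hashlib

MEMBER_LINKS = ["Eth1/1", "Eth1/2", "Eth1/3", "Eth1/4"]  # four-link bundle

def member_for_flow(src_ip: str, dst_ip: str) -> str:
    """Deterministically map a flow to one member link of the bundle."""
    digest = hashlib.md5(f"{src_ip}->{dst_ip}".encode()).digest()
    return MEMBER_LINKS[digest[0] % len(MEMBER_LINKS)]

# Hypothetical host pair: it always lands on the same link, so if that link
# has the bad optic, this flow eats the errors while other flows look fine.
print(member_for_flow("10.1.1.10", "10.2.2.20"))
print(member_for_flow("10.1.1.10", "10.2.2.20"))  # same link as above
print(member_for_flow("10.1.1.99", "10.2.2.20"))  # may hash to a different link
```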

Maybe Observability Is An AI/ML Problem

I sigh as long as the next networker when I receive a ticket asking me what changed on the network, why the network is slow, why the firewall is dropping traffic, etc. The authors of tickets like that presume to know what the problem is. They revel in firing up the blamethrower instead of collaborating on problem solving. Usually, these tickets are networking snipe hunts that accomplish little. But sometimes, it really is the network.

The art of observability is knowing what to keep an eye on to get in front of these problems. This is where I perk up when vendors start talking about using machine learning and artificial intelligence to bring network issues to the surface. I believe that in the long run, AI and ML might have their best networking use case in observability. Why? Even the most experienced network engineer won’t know everything worth monitoring until they are sitting slack-jawed in the post-mortem, drinking cold coffee and chewing on a stale granola bar. If only they’d been tracking that one metric that was the key to finding the root of the gray failure, they’d have seen the problem coming. But a well-trained algorithm might just be able to pick out the troublesome patterns we don’t notice ahead of time, bring them to the surface, and put us on the path to fixing the issue before customers feel a severe impact.
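
To make that concrete, here’s a minimal sketch of the kind of baseline-and-deviation check an algorithm might run per interface. It is not any vendor’s product, and the poll values and thresholds are invented for illustration.

```python
# A minimal sketch of the idea, not any vendor's product: learn a trailing
# baseline per metric and surface samples that land far outside it.
# The poll values and thresholds below are invented for illustration.

from statistics import mean, stdev

def anomalies(samples, window=12, sigmas=4.0):
    """Yield (index, value) for samples far outside the trailing baseline."""
    for i in range(window, len(samples)):
        history = samples[i - window:i]
        mu = mean(history)
        spread = max(stdev(history), 1e-9)  # guard against a perfectly flat history
        if abs(samples[i] - mu) > sigmas * spread:
            yield i, samples[i]

# Hypothetical input-error counts per five-minute poll on one member link.
errors_per_poll = [0, 1, 0, 0, 2, 1, 0, 0, 1, 0, 1, 0, 0, 1, 37, 52, 0, 1, 44]

for idx, value in anomalies(errors_per_poll):
    print(f"poll {idx}: {value} errors is well outside this interface's baseline")
```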

About Ethan Banks: Hey, I'm Ethan, co-founder of Packet Pushers. I spent 20+ years as a sysadmin, network engineer, security dude, and certification hoarder. I've cut myself on cage nuts. I've gotten the call at 2am to fix the busted blinky thing. I've sat on a milk crate configuring the new shiny, a perf tile blowing frost up my backside. These days, I research new enterprise tech & talk to the people who are making or using it for your education & amusement. Hear me on the Heavy Networking & Day Two Cloud podcasts.