
Maintenance Window Gone Wrong

Andres Sanchez Ramos

This post originally appeared in the Packet Pushers’ Human Infrastructure newsletter. You can read back issues and sign up for free here (and we don’t share your details with anybody ever).


The year is 2014. I’m fresh out of university and working in my first engineering role. As you might infer, I don’t have much real-world experience but I’m eager to learn. I’m working for a local integrator that provides professional services to CSPs and large companies in other industries. I’m assigned to a team that works exclusively on networking projects.

We had a support contract with a CSP that covered all the Juniper devices in their network. One day we got a ticket regarding the failure of a switch that was part of a virtual chassis comprising several devices in production. The outcome of that ticket was the need for an RMA for that specific switch.

My manager said he wanted me to handle the RMA. This was a production device, so the replacement would be done during an overnight maintenance window. At the time I was completely ignorant of what that implied, so I happily told him OK and went on with my day. Ignorance is a short-lived bliss.

Game Day

A couple of days before the maintenance window, my manager sent me the official Juniper documentation on how to substitute a device that’s part of a virtual chassis. I reviewed the documentation and the procedure seemed straightforward, so I thought “This should be simple.”

The day of the maintenance window I reviewed the procedure again at the request of my manager and said I had no doubts. When it was time to leave the office I grabbed the new switch and said goodbye. My manager said he would be on-call in case anything happened.

That evening I played a soccer match, ate a quick dinner, and then went to the site. As I write this today I wonder what I was thinking back then. Running for 90 minutes before a night of work, really?

I met up with an engineer from the customer and chatted for a bit. When it was time to do the work, we approached the rack and realized that not all the cables connected to the switch to be replaced were properly labeled.

How did we not check this when we had more time? Off to a rocky start, I got a little nervous. But the job needed to be done, so we both started working on properly labeling all cables. I believe it was around 2 a.m. when I actually started following the procedure to replace the faulty switch. I was really tired.

When the time came to turn on the switch, big surprise! The console was outputting all text in another alphabet, something that seemed Asian and that I couldn’t understand at all. What had just happened? What did it mean?

It was now 3 a.m. The window would close at 5 a.m. The customer engineer totally panicked because the entire virtual chassis was down and we were at risk of impacting real traffic. Needless to say, I panicked too.

Honestly, I didn’t know what to do, so I just went through the procedure again. But no luck, we still got the same unintelligible text. The customer engineer was pressuring me. He said, “Look, I can extend the window until 6, but you need to figure out what the hell to do to get that thing back up, I don’t care how.”

I could barely think, my heart was pounding. I was scared. I called my manager for help. After a brief conversation he quickly understood he needed to come over ASAP. He solved the problem just before our window closed. I don’t remember exactly how. What I do remember is the look of disappointment on his face when we said goodbye.

At work the next day, he sat me down and told me I should be careful when doing maintenance windows, and to prepare better next time. I felt terrible for failing. I even questioned if I should continue to do this type of work.

Retrospective

While I was disappointed in myself at that moment, I believe our failures tend to be our best lessons. After that experience, several things stuck with me:

  • Respect any window where changes are made to a production environment, because most issues occur when changes are introduced to a running system.
  • Write a detailed procedure for what you are going to do, including what actions you will take if something goes sideways. You won’t think straight when you are scared, tired, or anxious.
  • Test as much as you can before the window and figure out the differences between your lab and the real site.
  • Prepare diligently for the procedure, including getting as much information as possible about the environment and setup before you actually have to do the work.
  • Be mentally prepared for failure, try to test corner cases, and don’t assume all will go smoothly.
  • Manage your energy appropriately: no soccer matches before an overnight!
  • If you are the project leader, be sure the person who will do the work is capable of it. Review the procedure with them and help them get their plan straight.
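The lesson about writing a detailed procedure, including what to do if something goes sideways, can be made concrete by pairing every step with its own rollback and working out in advance how late the work can start. The following is only a minimal Python sketch: the step names, the durations, the placeholder date, and the assumption that undoing a step takes about as long as doing it are all hypothetical and not taken from the actual procedure in this story.

```python
from dataclasses import dataclass
from datetime import datetime, timedelta


@dataclass
class Step:
    action: str    # what to do
    rollback: str  # how to undo it if something goes sideways
    minutes: int   # estimated duration of the action


def latest_safe_start(steps, window_end):
    """Latest start time that still leaves room to undo every step."""
    forward = sum(s.minutes for s in steps)
    # Assumption: undoing a step takes about as long as doing it.
    backout = forward
    return window_end - timedelta(minutes=forward + backout)


def rollback_plan(steps, failed_index):
    """Rollback actions for the steps already attempted, most recent first."""
    return [s.rollback for s in reversed(steps[: failed_index + 1])]


# Hypothetical steps, loosely modeled on a switch replacement.
steps = [
    Step("Label and photograph every cable", "Nothing to undo", 30),
    Step("Disconnect and remove the faulty switch", "Re-rack and recable it", 20),
    Step("Install and power on the replacement", "Power off and remove it", 40),
]

window_end = datetime(2014, 1, 1, 5, 0)  # placeholder date, window closes at 5 a.m.

print("Start no later than:", latest_safe_start(steps, window_end))
print("If step 3 fails, undo in this order:", rollback_plan(steps, 2))
```

With these made-up numbers, a window closing at 5 a.m. means the work has to start by 2 a.m., which leaves no slack for surprises such as unlabeled cables. Writing the rollback next to each step also means the decision about how to back out is made while rested, not at 3 a.m.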

As a curious side note, the manager in this story was the person who introduced me to Packet Pushers.

About Andres Sanchez Ramos: Results-driven electronics engineer with 7+ years of professional experience in information technology. Highly proactive and motivated professional with a strong networking, cloud and programming background, good communication skills and team-player, always looking for a professional challenge and personal growth. Fluent in Spanish and English. I like solving complex problems, learning new skills and I believe in helping others to make a better world. I'm a technology, AI, bio-hacking, psychology, space, adventure and martial arts enthusiast.