You might be considering setting up a LAG port (whether with LACP, manual configuration, etc.), or planning a maintenance window in which you shut down one link within a LAG port, on the assumption that the routing protocol won’t be affected because LAG convergence is typically fast. Well, it might be wise to think twice!
Imagine the following scenarios, in which we can consider two use cases:
- Direct BFD sessions between two routers using a LAG port.
- BFD sessions established between two routers, with an L2 switch between them forming a LAG port.
- Simply accept it as it is; this may not pose a significant issue for you, especially if you have sufficient capacity on the backup link to handle traffic switching.
- Remove BFD if you don’t need it.
- Increase the single interface capacity if feasible, and remove the LAG bundle.
- Use ECMP, which means running multiple distinct interconnections instead of a single LACP link. However, depending on the scale and requirements of your network, it may be more advantageous to establish a LAG bundle than to rely on ECMP. A LAG may help mitigate platform-specific scaling challenges and avoids unnecessary interference with your IGP.
- Configure a more aggressive LACP timer (i.e., fast or short LACP) for faster periodic updates, while allowing BFD timers to be slightly longer; however, this is sometimes not available or not tunable enough, depending on the router vendor/OS.
- Palo Alto Firewalls
- Fortinet firewalls not supported
How BFD and LAG Work
BFD operates at Layer 4 of the network stack: its control packets are ordinary IP packets carried over UDP (port 3784 for single-hop sessions). Therefore, when a router initiates a session over a LAG, the load-balancing algorithm hashes that flow like any other and selects a single member link to carry it.
To make things more interesting, each router may select a different link for its session (in a “non echo”/asynchronous mode).
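To make the link selection concrete, here is a minimal sketch of hash-based member selection. The hash function, the `seed` parameter, the IP addresses, and the link names are all assumptions made for illustration; real platforms use their own (often proprietary) hashing, and only the UDP destination port 3784 and the 49152-65535 source port range come from the BFD specification.

```python
import zlib

def pick_member_link(src_ip, dst_ip, src_port, dst_port, members, seed=0):
    """Hypothetical LAG hash: map a flow's 4-tuple onto one member link.

    Real routers use vendor-specific hash functions; CRC32 is only a stand-in.
    """
    key = f"{seed}|{src_ip}|{dst_ip}|{src_port}|{dst_port}".encode()
    return members[zlib.crc32(key) % len(members)]

members = ["link1", "link2", "link3", "link4"]

# Each router hashes its own outgoing BFD packets independently.
# The addresses, source ports, and seeds below are made up for the example.
a_to_b = pick_member_link("10.0.0.1", "10.0.0.2", 49152, 3784, members, seed=1)
b_to_a = pick_member_link("10.0.0.2", "10.0.0.1", 49153, 3784, members, seed=2)

print("Router A sends BFD on", a_to_b)
print("Router B sends BFD on", b_to_a)
```

Because the 4-tuple of a session does not change, every packet of that session keeps landing on the same member link, and nothing forces the two directions to agree on which one.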
What Is the Issue
You probably began to guess where the issue lies here ^^
BFD is generally configured with more aggressive timers than LACP so that it detects link problems faster. Thus, if there’s an issue on link “3”, BFD may time out before LACP has detected the problem and removed the link from the bundle. Consequently, the BFD client protocol could report the adjacency down, even though the other member links are still healthy.
Also, a BFD session that runs across the aggregate without visibility into the member links cannot guarantee detection of a physical member-link failure, especially if the LACP timer is long (30s, for example). The objective is to confirm link continuity for each member link.
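The race between the two protocols is simple arithmetic. The sketch below compares BFD detection time (interval multiplied by the detect multiplier) with the LACP timeout (three missed LACPDUs: 90s with the slow 30s period, 3s with the fast 1s period). The 300 ms × 3 and 1500 ms × 3 BFD values are assumed examples, not recommendations, and the function names are hypothetical.

```python
def bfd_detection_time_ms(interval_ms, multiplier):
    """BFD declares the session down after `multiplier` missed intervals."""
    return interval_ms * multiplier

def lacp_timeout_ms(fast):
    """LACP times a partner out after 3 missed LACPDUs (1s fast, 30s slow)."""
    periodic_s = 1 if fast else 30
    return 3 * periodic_s * 1000

def bfd_fires_first(bfd_ms, lacp_ms):
    """True when BFD tears the session down before LACP removes the link."""
    return bfd_ms < lacp_ms

aggressive_bfd = bfd_detection_time_ms(300, 3)   # assumed example timers
relaxed_bfd = bfd_detection_time_ms(1500, 3)     # assumed example timers

print(bfd_fires_first(aggressive_bfd, lacp_timeout_ms(fast=False)))
print(bfd_fires_first(aggressive_bfd, lacp_timeout_ms(fast=True)))
print(bfd_fires_first(relaxed_bfd, lacp_timeout_ms(fast=True)))
```

With these assumed numbers, BFD at 900 ms beats both LACP modes; only once BFD detection is stretched past 3s does fast LACP get to remove the failed member first.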
The traditional approach to solving this problem
RFC 7130 approach to resolve this
Micro-BFD: run BFD across all LAG member links. This requires the BFD code to understand the LAG configuration so that it can verify connectivity independently on each constituent link. Whether it is available depends on the vendor and their OS, as not all of them have developed this functionality. (RFC 7130 offers additional information.)
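Conceptually, micro-BFD turns one session per LAG into one session per member link, and a member whose session is down is taken out of the load-balancing set. The sketch below models only that decision; the class, function names, and the `min_links` threshold are hypothetical and do not reflect any vendor's implementation.

```python
from dataclasses import dataclass

@dataclass
class MemberLink:
    name: str
    micro_bfd_up: bool  # state of the micro-BFD session on this member link

def forwarding_members(links):
    """Keep only members whose own micro-BFD session is up."""
    return [link.name for link in links if link.micro_bfd_up]

def lag_is_up(links, min_links=1):
    """Hypothetical policy: the LAG stays up while enough members remain."""
    return len(forwarding_members(links)) >= min_links

# Illustrative state: link3 has failed, the other two members are healthy.
lag = [
    MemberLink("link1", True),
    MemberLink("link2", True),
    MemberLink("link3", False),
]

print(forwarding_members(lag))
print(lag_is_up(lag))
```

In this model the failure of one member only shrinks the bundle; the LAG, and any routing adjacency on top of it, stays up as long as the remaining members meet the threshold.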
Some vendors support the RFC 7130:
Others do not yet:
In conclusion
Configuring BFD is simple and configuring a LAG bundle is simple, but configuring both together needs some thought about how they interact. The more protocols you combine in your design, the more complexity and unexpected results you may face.
As for micro-BFD, depending on your specific requirements it may not be necessary, and simplicity could be prioritized, since enabling micro-BFD may come with limitations that vary by vendor (make sure to refer to the vendor configuration guide for specific details). Ultimately, as always, you have to consider the resilience level you need and your SLA objectives.
This blog post first appeared at LastOpinion. Reprinted with permission from the author.