
Part 4 – Monitoring PSN Load Balancing

Dan Massameno

The best way to know that your configuration is working properly is to measure it with a tool outside of ISE.  Unfortunately, the authentications-per-second rate is not available via SNMP or the REST API.  What ISE does do is generate a SYSLOG message for each authentication.  Every passed or failed authentication produces a message in one of the following categories:

  • CISE_logging category = CISE_Passed_Authentications
  • CISE_logging category = CISE_Failed_Attempts

There are numerous message codes covered by these two broad categories.  If you total them all up, you’ll get a general sense of authentications per time period.  This topic could fill an entire blog post, but if your syslog server can count all these messages and put them in a time-series graph, you’ll see something like this…

Here we see PSN01 receiving all the load and PSN02 receiving almost none of the load.  This is the typical usage pattern when there is no LB in place.
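The counting behind a graph like this can be sketched in a few lines of Python.  This is only an illustration, not how any particular syslog server does it.  It assumes the messages have already been parsed into (timestamp, PSN name, logging category) records; the function name and record layout are hypothetical.

```python
from collections import Counter
from datetime import datetime

# The two logging categories named above that cover authentications.
AUTH_CATEGORIES = {"CISE_Passed_Authentications", "CISE_Failed_Attempts"}


def count_auths_per_minute(records):
    """Count authentication messages per (minute, PSN).

    `records` is an iterable of (timestamp, psn, category) tuples.
    This layout is an assumption; a real syslog server would supply
    its own parsed fields.
    """
    counts = Counter()
    for timestamp, psn, category in records:
        if category not in AUTH_CATEGORIES:
            continue  # ignore everything that is not an authentication
        minute = timestamp.replace(second=0, microsecond=0)
        counts[(minute, psn)] += 1
    return counts


if __name__ == "__main__":
    sample = [
        (datetime(2024, 1, 1, 10, 0, 5), "PSN01", "CISE_Passed_Authentications"),
        (datetime(2024, 1, 1, 10, 0, 40), "PSN01", "CISE_Failed_Attempts"),
        (datetime(2024, 1, 1, 10, 0, 50), "PSN02", "CISE_Passed_Authentications"),
    ]
    for (minute, psn), total in sorted(count_auths_per_minute(sample).items()):
        print(minute.isoformat(), psn, total)
```

Plotting one series per PSN from these counts gives the kind of time-series view described here.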

Before I go any further, I must recommend that all of these configuration changes be tested in a lab environment to the greatest extent possible. This is true for any serious change to an enterprise-grade network. That said, it’s very difficult to test load balancing without production traffic to test with. This is another reason to have high-quality monitoring in place to measure the effect of any change and detect problems before they are experienced by end-users.

A little before 10:00 we turned on IOS-XE based load balancing for the first two PSNs using a small part of our campus.  Here we see the load on PSN02 gracefully ramp up and meet the load of PSN01.

This next graph shows a maintenance window where we upgraded the ISE software on our PSNs.

A little after 8:00 PM we brought PSN01 down to perform the upgrade (blue bars). The orange line ramped up and took the entire load while PSN01 was unavailable. Once PSN01 was upgraded we brought down PSN02 for an upgrade and the blue bars ramped up and took the entire load. When PSN02 was restored, the load evened out and both PSNs were servicing clients.

Our ultimate goal is to scale out with more than two PSNs:

The middle hump was a typical Monday.  On the right side of the chart we see a typical Tuesday as we enabled PSN03, 04, 05, 06, 07, and 08.  Once IOS-XE Load Balancing was enabled we saw the load roughly even across all eight PSNs.  As a happy (and expected) side-effect, the load on PSN01 was much lower because the entire load was being spread across all eight servers.

We also use SYSLOG to monitor Endpoint Ownership Changes (EOC).  An EOC happens when a session hits an initial PSN (e.g., PSN05) but then subsequent AAA packets for that same session arrive at a different PSN.  That’s bad.  When this happens the new PSN must consult the ISE MnT node and get the relevant information for the session.  This represents a cache miss and is very expensive for the AAA process.  In fact, if the system is seeing a lot of EOCs you can expect your performance to be terrible.  This is the reason for using the RADIUS attribute calling-station-id in the stickiness algorithm when using LBs.
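To see why keying on calling-station-id keeps a session on one PSN, here is a minimal sketch of the idea.  It is an assumption-laden illustration only: real load balancers have their own persistence mechanisms, and the function name, the hash choice, and the PSN list are all hypothetical.

```python
import hashlib


def pick_psn(calling_station_id, psns):
    """Map a calling-station-id (typically the client MAC) to one PSN.

    Illustration only: the same input always yields the same PSN, so
    every packet for that endpoint lands on the same server.
    """
    # Normalize common MAC formats so they all hash the same way.
    normalized = calling_station_id.lower()
    for separator in ("-", ":", "."):
        normalized = normalized.replace(separator, "")
    digest = hashlib.sha256(normalized.encode("ascii")).digest()
    index = int.from_bytes(digest[:8], "big") % len(psns)
    return psns[index]


if __name__ == "__main__":
    servers = ["PSN01", "PSN02", "PSN03"]  # hypothetical PSN names
    print(pick_psn("AA-BB-CC-DD-EE-FF", servers))
    print(pick_psn("aa:bb:cc:dd:ee:ff", servers))  # same endpoint, same PSN
```

Because the choice depends only on the endpoint identifier, later packets for that endpoint cannot drift to a different PSN, which is exactly what avoids an Endpoint Ownership Change.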

We can see how many EOCs are happening because every twenty minutes the PSNs emit a SYSLOG message that provides diagnostic information.  Within this message will be information in the form:

  • CISE_System_Statistics <msg_id> <total seg> <seg num> <timestamp> <seq_num> 70011 INFO System-Stats: ISE Counters, <log details>

The <log details> is a sequence of key-value pairs.  One of those keys will be 4_EndPoint_OwnerShip_Change and will indicate the number of EOCs for the last twenty minutes.  If your SYSLOG server can parse these key-value pairs it can produce a report on how many EOCs each PSN is seeing.  At present our systems are seeing zero EOCs.  This is because (I assume) the IOS-XE device “knows” to keep a particular client session stuck to a particular PSN.  It seems to be doing this quite well in our environment.
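Pulling that counter out of the message might look like the sketch below.  It assumes the <log details> portion is a comma-separated list of key=value pairs; that layout, the helper names, and every key in the sample other than 4_EndPoint_OwnerShip_Change are assumptions for illustration, so check them against the messages your own PSNs emit.

```python
EOC_KEY = "4_EndPoint_OwnerShip_Change"


def parse_log_details(log_details):
    """Split 'key=value, key=value' text into a dict (assumed layout)."""
    pairs = {}
    for field in log_details.split(","):
        key, separator, value = field.partition("=")
        if separator:  # skip fragments that are not key=value
            pairs[key.strip()] = value.strip()
    return pairs


def eoc_count(log_details):
    """Return the EOC counter as an int, or None if the key is absent."""
    value = parse_log_details(log_details).get(EOC_KEY)
    return int(value) if value is not None else None


if __name__ == "__main__":
    # Hypothetical sample; only the EOC key name comes from the text above.
    sample = "1_Example_Counter=12, 4_EndPoint_OwnerShip_Change=3, 5_Other=0"
    print(eoc_count(sample))
```

Summing this value per PSN over a day gives the per-PSN report described above.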

Wireless LAN Controllers vs. Catalyst Switches

Our environment uses both Catalyst 9800-80 series IOS-XE based wireless LAN controllers and Catalyst 9k switches, also running IOS-XE.  We use IOS-XE Load Balancing on both kinds of devices.  It seems the load balancing algorithm works equally well on both platforms.

Final Thoughts

IOS-XE Load Balancing is a great feature.  It keeps packets for a single session grouped together and hitting the same PSN to reduce the effect of Endpoint Ownership Changes.  It also distributes batches of transactions fairly across multiple RADIUS servers.  Whether these RADIUS servers are individual load balancers or are direct connections to PSNs, the effect is the same.

The remaining task for the administrator is to monitor the load on all PSNs and make sure they don’t get close to their rated capacity.  The resulting system should be redundant and scalable.
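A simple capacity check could be sketched as follows.  The 80% threshold, the function name, and all of the numbers are hypothetical; take the rated capacity for your PSN model from the vendor's published figures.

```python
def psns_near_capacity(auths_per_second, rated_capacity, threshold=0.8):
    """Return the PSNs whose measured rate is at or above threshold * rating.

    Both arguments are dicts keyed by PSN name.  The threshold default
    is an arbitrary example, not a recommendation.
    """
    return sorted(
        psn
        for psn, rate in auths_per_second.items()
        if rate >= threshold * rated_capacity[psn]
    )


if __name__ == "__main__":
    measured = {"PSN01": 85, "PSN02": 40}   # hypothetical measurements
    rated = {"PSN01": 100, "PSN02": 100}    # hypothetical ratings
    print(psns_near_capacity(measured, rated))
```

Feeding this the peak per-PSN rate from the SYSLOG counts gives an early warning before a server reaches its limit.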