
Wireless Catalyst 9800 WLC KPIs, Part 1

Jun 13, 2022

Part 1 of the 3-part Wireless Catalyst 9800 WLC KPIs

When working in critical wireless infrastructures, it is important to be proactive and determine in advance if there is any potential issue that could impact the end-client experience. Wireless Catalyst 9800 WLC KPIs will help in that task.

In this blog, I will share a systematic approach plus a list of commands that I have used while providing support on the NOC for one of the largest worldwide wireless events. The idea behind it is to keep a close eye on how to monitor Key Performance Indicators (KPIs) for the Catalyst 9800 WLC.

KPI outputs can be collected periodically to create a baseline when the network is working fine, making it easier later to find any deviation by comparing new outputs with previously collected ones.
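
As a rough illustration of that baselining idea, here is a minimal Python sketch. It assumes the Netmiko library and hypothetical device credentials, collects a few KPI commands, and saves them to a timestamped file so later outputs can be diffed against it.

# Minimal baseline-collection sketch (assumes Netmiko and illustrative
# credentials/IP; adapt the command list to your deployment).
from datetime import datetime
from netmiko import ConnectHandler

KPI_COMMANDS = [
    "show version | i uptime|Installation mode|Cisco IOS Software",
    "show redundancy | i ptime|Location|Current Software state|Switchovers",
    "show processes cpu platform sorted",
    "show platform resources",
    "show wireless mobility summary",
]

def collect_baseline(host, username, password):
    """Run each KPI command once and save the outputs to a timestamped file."""
    conn = ConnectHandler(device_type="cisco_xe", host=host,
                          username=username, password=password)
    stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
    with open(f"kpi-baseline-{stamp}.txt", "w") as f:
        for cmd in KPI_COMMANDS:
            f.write(f"===== {cmd} =====\n")
            f.write(conn.send_command(cmd) + "\n\n")
    conn.disconnect()

# Example (hypothetical values): collect_baseline("192.168.25.25", "admin", "secret")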

I have divided WLC KPIs into six different buckets or areas:

  • WLC checks
  • Connection with other devices
  • AP checks
  • RF checks
  • Client checks
  • Packet Drops

KPIs will help us spot issues in any of the six areas mentioned. In this blog, I have included WLC checks and Connection with other devices. Additionally, there will be two more blogs where I will share AP checks, RF checks, Client checks, and Packet Drops.

WLC checks

I usually start by checking the WLC first, since it is the most critical part. If any issues are seen in the controller, they will cascade shortly after as problems with APs and clients. In other words, the idea here is to follow a top-down approach.

While reviewing the health state of the WLC, I would first confirm that the WLC is running the intended version and in install mode. Install mode ensures that the controller boots faster, with a reduced memory footprint. After that, I would check the uptime of the WLC to see if any reload has occurred. Use the command: "show version | i uptime|Installation mode|Cisco IOS Software"

Gladius1#show version | i uptime|Installation mode|Cisco IOS Software
Cisco IOS Software [Amsterdam], C9800 Software (C9800_IOSXE-K9), Version 17.3.5a, RELEASE SOFTWARE (fc2)
Gladius1 uptime is 2 weeks, 5 days, 21 hours, 30 minutes
Installation mode is INSTALL

Check the expected release, uptime, and that the WLC is running in install mode.
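
If you automate this check, a small parsing sketch like the following can flag a deviation. The expected release value and function name are illustrative assumptions, not from the original document.

# Hedged sketch: parse the filtered "show version" output shown above and
# flag deviations from an expected release and install mode.
import re

EXPECTED_VERSION = "17.3.5a"   # assumption: your intended release

def check_version(output: str):
    findings = []
    version = re.search(r"Version (\S+),", output)
    if version and version.group(1) != EXPECTED_VERSION:
        findings.append(f"Unexpected version: {version.group(1)}")
    if "Installation mode is INSTALL" not in output:
        findings.append("WLC is not running in INSTALL mode")
    uptime = re.search(r"uptime is (.+)", output)
    if uptime:
        findings.append(f"Uptime: {uptime.group(1)} (compare against your baseline)")
    return findings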

For a Catalyst 9800 WLC deployed in High Availability, which, by the way, is highly recommended for critical deployments, we need to first verify that the HA pair stack is formed and in a standby-hot state. Secondly, check the stack uptime and each member's individual uptime. Thirdly, identify the number of switchovers between active and standby. Use the command: "show redundancy | i ptime|Location|Current Software state|Switchovers".

Gladius1#show redundancy | i ptime|Location|Current Software state|Switchovers
       Available system uptime = 2 weeks, 1 day, 2 hours, 48 minutes
Switchovers system experienced = 1
               Active Location = slot 1
        Current Software state = ACTIVE
       Uptime in current state = 7 hours, 10 minutes
              Standby Location = slot 2
        Current Software state = STANDBY HOT
       Uptime in current state = 7 hours, 4 minutes

Check the stack uptime, the number of switchovers, and the uptime of each member. A switchover occurred 7 hours ago: slot 1 is the new active and slot 2 reloaded.
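
A hedged sketch of how the same check could be automated: it parses the "show redundancy" text and flags a missing STANDBY HOT peer or a switchover count above the last recorded baseline (the function name and baseline value are assumptions).

# Hedged sketch: flag new switchovers or a standby peer that is not STANDBY HOT.
import re

def check_redundancy(output: str, baseline_switchovers: int = 0):
    findings = []
    m = re.search(r"Switchovers system experienced = (\d+)", output)
    if m and int(m.group(1)) > baseline_switchovers:
        findings.append(f"New switchover(s) detected: total {m.group(1)}")
    if "STANDBY HOT" not in output:
        findings.append("Standby peer is not in STANDBY HOT state")
    return findings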

In HA deployments, the recommendation is to use the RMI feature. This allows monitoring both active and standby through the Wireless Management Interface (WMI) and the Redundancy Port (RP). After that, we should enable the Default-gateway Check to confirm that both active and standby can reach the gateway. Here is a link to the 9800 High Availability deployment guide.

The next step is to check if there are any WLC crashes. Determine if a crash matches the time of a switchover or an unexpected reload. When a WLC crash occurs, it should generate a core dump or a system report. Those files are stored in the WLC harddisk for the 9800-40/80 or in bootflash for the 9800-L/CL. Use the commands: "dir harddisk:/core/ | i core|system-report" and "dir stby-harddisk:/core/ | i core|system-report", and replace harddisk with bootflash for the 9800-L/CL.

Gladius1#dir harddisk:/core/ | i core|system-report
Directory of harddisk:/core/
3661831  -rw-         11260562  Mar 25 2022 22:07:12 +01:00  Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz
3661830  -rw-            48528  Mar 25 2022 21:57:20 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET-info.txt
3661829  -rw-        126548098  Mar 25 2022 21:57:10 +01:00  Gladius1_1_RP_0-system-report_20220325-215658-CET.tar.gz
3661828  -rw-            57191   Mar 9 2021 16:21:48 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET-info.txt
3661827  -rw-        504311304   Mar 9 2021 16:20:51 +01:00  Gladius1_1_RP_0-system-report_20210309-161907-CET.tar.gz
3661826  -rw-         11714625  Nov 19 2020 10:35:54 +01:00  Gladius1_1_RP_0_wncd_30240_20201119-103550-CET.core.gz

Check for cores and system reports. Two cores in the wncd process and two system reports have occurred.

In case we observe any core dump, we can identify the impacted process by checking the file name. For example: WLC_1_RP_0_wncd_16574_20220325-220708-CET.core.gz means the crash occurred in the "wncd" process, and WLC_1_RP_0_dbm_14119_20201104-092800-CET.core.gz means the crash occurred in the "dbm" process. Open a TAC case to identify the root cause of the crash.
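
Since the process name is embedded in the core file name, a small regex can extract it automatically; the following Python sketch is illustrative.

# Hedged sketch: extract the crashing process name from a 9800 core file name,
# e.g. "Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz" -> "wncd".
import re

def crashed_process(core_filename: str):
    m = re.search(r"_RP_\d+_([A-Za-z_]+?)_\d+_\d{8}-\d{6}", core_filename)
    return m.group(1) if m else None

print(crashed_process("Gladius1_1_RP_0_wncd_16574_20220325-220708-CET.core.gz"))  # wncd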

Once we have verified crashes or unexpected reloads, we can continue by reviewing WLC CPU and memory utilization. For CPU monitoring, we need to run the command several times. Detect if there are any processes showing CPU above 80% consistently and not just as a spike. I prefer to execute the command with the sorted keyword. That way, you can focus on processes with high CPU first. We have seen cases where consistently high CPU in the WNCD process led to AP disconnections. However, releases 17.3.5 and 17.6.3 have received additional hardening, with the objective of protecting AP CAPWAP connections in case high CPU occurs. Use the command: "show processes cpu platform sorted | ex 0%      0%      0%"

Gladius1#show processes cpu platform sorted | ex 0%      0%      0%
CPU utilization for five seconds:  14%, one minute:  16%, five minutes:  16%
Core 0: CPU utilization for five seconds: 10%, one minute:  7%, five minutes: 11%
Core 1: CPU utilization for five seconds:  6%, one minute: 28%, five minutes: 12%
Core 2: CPU utilization for five seconds: 48%, one minute: 55%, five minutes: 68%
Core 3: CPU utilization for five seconds: 20%, one minute:  8%, five minutes: 11%
Core 4: CPU utilization for five seconds: 38%, one minute: 13%, five minutes: 17%
Core 5: CPU utilization for five seconds: 14%, one minute: 11%, five minutes: 13%
Core 6: CPU utilization for five seconds:  9%, one minute: 20%, five minutes: 23%
Core 7: CPU utilization for five seconds:  5%, one minute:  8%, five minutes: 18%
Core 8: CPU utilization for five seconds:  7%, one minute: 50%, five minutes: 34%
Core 9: CPU utilization for five seconds: 100%, one minute: 58%, five minutes: 27%
Core 10: CPU utilization for five seconds: 27%, one minute: 17%, five minutes: 25%
   Pid    PPid    5Sec    1Min    5Min  Status        Size  Name
--------------------------------------------------------------------------------
 19056   19037     99%     99%     99%  R          7525896  wncd_0
 21922   21913     96%     97%     99%  R           127488  smand
 19460   19451     37%     34%     33%  R          6363828  wncd_2
 19604   19596     18%     19%     18%  R          4556132  wncd_3

Check CPU utilization per core and per process. Processes wncd_0 and smand are facing close to 100% CPU utilization.
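
To spot sustained high CPU rather than a spike, a sketch like the one below can parse the per-process table and report anything above a threshold; run it against several samples taken over time. The 80% threshold mirrors the guidance above, and the function name is illustrative.

# Hedged sketch: report processes whose 5-minute CPU is above a threshold
# in the "show processes cpu platform sorted" output.
import re

def high_cpu_processes(output: str, threshold: int = 80):
    offenders = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+)\s+(\d+)\s+(\d+)%\s+(\d+)%\s+(\d+)%\s+\S+\s+\d+\s+(\S+)", line)
        if m and int(m.group(5)) >= threshold:   # 5Min column
            offenders.append((m.group(6), int(m.group(5))))
    return offenders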

Catalyst 9800-CL and 9800-L platforms use CPU cores for data forwarding. Therefore, it is expected to see high CPU in ucode_pkt_PPE0. For those platforms, to evaluate data plane performance, use the command: "show platform hardware chassis active qfp datapath utilization | i Load"

Gladius1#show platform hardware chassis active qfp datapath utilization | i Load
CPP 0: Subdev 0            5 secs        1 min        5 min       60 min
Processing: Load (pct)            4            3            4            3

Check the datapath load percentage.

While checking memory utilization, we need to monitor if the device utilization is too high. Subsequently, identify if there are any processes holding memory and not releasing it over time (a leak). Use the commands: "show platform resources" (basic), "show processes memory platform sorted", and "show processes memory platform accounting" (advanced).

Gladius1#show platform resources
**State Acronym: H - Healthy, W - Warning, C - Critical
Resource                 Usage                 Max             Warning         Critical        State
----------------------------------------------------------------------------------------------------
RP0 (ok, active)                                                                               H
 Control Processor       0.79%                 100%            80%             90%             H
 DRAM                    4839MB(15%)           31670MB         88%             93%             H
 harddisk                0MB(0%)               0MB             80%             85%             H
ESP0(ok, active)                                                                               H
 QFP                                                                                           H
  TCAM                   68cells(0%)           1048576cells    65%             85%             H
  DRAM                   420162KB(20%)         2097152KB       85%             95%             H
  IRAM                   13738KB(10%)          131072KB        85%             95%             H
  CPU Utilization        0.00%                 100%            90%             95%             H

Confirm the state is Healthy (H) for all metrics. Review Control Processor and memory utilization.

Gladius1#show processes memory platform sorted
System memory: 15869340K total, 6152000K used, 9717340K free,
Lowest: 9717340K
 Pid    Text      Data   Stack   Dynamic       RSS              Name
----------------------------------------------------------------------
 3546  367768   1404580     136       488   1404580   linux_iosd-imag
23602   22335    449968     136      1052    449968    ucode_pkt_PPE0
24525     847    437624     136     46628    437624            wncd_0
24004     160    373176    3956      6400    373176           wncmgrd
26358     128    344868     136    136628    344868         mobilityd

Check the free memory available. Identify the top processes holding the most memory.

Gladius1#show processes memory platform accounting
Hourly Stats
process                 callsite_ID(bytes)  max_diff_bytes   callsite_ID(calls)  max_diff_calls   tracekey                                  timestamp(UTC)
------------------------------------------------------------------------------------------------------------------------------------------------------------
cpp_cp_svr_fp_0         2887897091          7243446          2887897092          1133             1#e4bd31e0c668be2b8786dec9fcc99486        2022-05-25 14:04
ndbmand_rp_0            3571094529          5453112          3570931712          1119             1#00c5632bf072231d06cf80b8ccc37392        2022-05-09 21:52
wncd_4_rp_0             2556049411          3059712          3028615169          227              1#9f4792f37292983824f5bb97d7e2167c        2022-05-10 14:54
wncd_0_rp_0             2556049411          1990656          3028615168          680              1#9f4792f37292983824f5bb97d7e2167c        2022-05-25 11:05
wncd_2_rp_0             2556049411          1953792          3028615169          682              1#9f4792f37292983824f5bb97d7e2167c        2022-05-13 14:01
smand_rp_0              2887895047          1491984          3028615168          89               1#eaf6dd665e73b1edeee32fb9c5ac8639        2022-05-10 14:54

Check top processes and the number of calls. Stats are hourly, daily, weekly, and monthly.
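
One possible way to automate the leak check is to compare the RSS column of two "show processes memory platform sorted" samples taken some time apart; this hedged Python sketch (the growth threshold is an arbitrary assumption) highlights processes whose memory keeps growing.

# Hedged sketch: compare two samples of "show processes memory platform sorted"
# and highlight processes whose RSS keeps growing, a possible leak indicator.
import re

def rss_by_process(output: str):
    rss = {}
    for line in output.splitlines():
        m = re.match(r"\s*\d+\s+\d+\s+\d+\s+\d+\s+\d+\s+(\d+)\s+(\S+)", line)
        if m:
            rss[m.group(2)] = int(m.group(1))
    return rss

def growing_processes(sample_old: str, sample_new: str, min_growth: int = 10240):
    old, new = rss_by_process(sample_old), rss_by_process(sample_new)
    return {p: new[p] - old[p] for p in new
            if p in old and new[p] - old[p] >= min_growth}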

As a final controller health check, we can do a validation of the hardware. Check the status of power supplies, fans, SFPs, and temperature (only for physical WLCs). Likewise, review the license status and the right number of licenses in use. Use the commands: "show platform", "show inventory", "show environment", and "show license summary | i Status:"

Gladius1#show platform
Chassis type: C9800-40-K9
Slot      Type                State                 Insert time (ago)
--------- ------------------- --------------------- -----------------
0         C9800-40-K9         ok                    2w5d
 0/0      BUILT-IN-4X10G/1G   ok                    2w5d
R0        C9800-40-K9         ok, active            2w5d
F0        C9800-40-K9         ok, active            2w5d
P0        C9800-AC-750W-R     ok                    2w5d
P1        Unknown             empty                 never
P2        C9800-40-K9-FAN     ok                    2w5d
Slot      CPLD Version        Firmware Version
--------- ------------------- ---------------------------------------
0         19030712            16.10(2r)
R0        19030712            16.10(2r)
F0        19030712            16.10(2r)

Gladius1#show inventory
NAME: "Chassis 1", DESCR: "Cisco C9800-40-K9 Chassis"
PID: C9800-40-K9       , VID: V03  , SN: TTM242504SR
NAME: "Chassis 1 Power Supply Module 0", DESCR: "Cisco Catalyst 9800-40 750W AC Power Supply Reverse Air"
PID: C9800-AC-750W-R   , VID: V01  , SN: ART2418F0GJ
NAME: "Chassis 1 Fan Tray", DESCR: "Cisco C9800-40-K9 Fan Tray"
PID: C9800-40-K9-FAN   , VID:      , SN:
NAME: "module 0", DESCR: "Cisco C9800-40-K9 Modular Interface Processor"
PID: C9800-40-K9       , VID:      , SN:
NAME: "SPA subslot 0/0", DESCR: "4-port 10G/1G multirate Ethernet Port Adapter"
PID: BUILT-IN-4X10G/1G , VID: N/A  , SN: JAE87654321
NAME: "subslot 0/0 transceiver 0", DESCR: "10GE LR"
PID: SFP-10G-LR        , VID: V02  , SN: AVD2141KCFB
NAME: "module R0", DESCR: "Cisco C9800-40-K9 Route Processor"
PID: C9800-40-K9       , VID: V03  , SN: TTM242504SR
NAME: "module F0", DESCR: "Cisco C9800-40-K9 Embedded Services Processor"
PID: C9800-40-K9       , VID:      , SN:
NAME: "Crypto Asic F0/0", DESCR: "Asic 0 of module F0"
PID: NOT               , VID: V01  , SN: JAE242711XF

Gladius1#show environment
Number of Critical alarms:  0
Number of Major alarms:     0
Number of Minor alarms:     0

Check power supplies, fan status, SFPs, SPAs, and any alarms.

An example of those Catalyst 9800 WLC KPIs helping to identify an issue was a customer facing a High Availability setup issue between two WLCs. By reviewing the version and hardware installed in both WLCs, we identified a difference in SPA adapters that was preventing the WLCs from pairing as HA.

Connection with other devices checks

In addition to WLC health, we can check the status of the WLC's connections. The most important connections are mobility with other WLCs for inter-WLC roams, telemetry with DNAC/PI for monitoring and automation, and NMSP with DNA Spaces/CMX for location services. We need to ensure that those connections are established and working fine.

Confirm that mobility tunnels with other WLCs are up, use the right encryption and MTU, and that clients can roam or be anchored to other WLCs. If tunnels are down, we can find out whether the issue is occurring in the control tunnel (UDP port 16666), in the data tunnel (UDP port 16667), or in both. Use the command: "show wireless mobility summary"

Gladius1#sh wireless mobility summary
Wireless Management VLAN: 25
Wireless Management IP Address: 192.168.25.25
Mobility Control Message DSCP Value: 48
Mobility Keepalive Interval/Count: 10/3
Mobility Group Name: eWLC3
Mobility Multicast Ipv4 address: 0.0.0.0
Mobility MAC Address: 001e.f62a.46ff
Mobility Domain Identifier: 0x2e47
Controllers configured in the Mobility Domain:
 IP             Public Ip      MAC Address      Group Name   Multicast IPv4    Multicast IPv6  Status                       PMTU
----------------------------------------------------------------------------------------------------------
192.168.25.25   N/A            001e.f62a.46ff   eWLC3        0.0.0.0           ::              N/A                          N/A
192.168.5.35    192.168.5.35   00b0.e1f2.f480   3500-2       0.0.0.0           ::              Up                           1385
192.168.25.23   192.168.25.23  706d.1535.6b0b   DAO2         0.0.0.0           ::              Control And Data Path Down
192.168.25.33   192.168.25.33  f4bd.9e57.ff6b   5500         0.0.0.0           ::              Up                           1005

Check for mobility tunnels that are down and for low PMTU values.
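
For monitoring at scale, the mobility table can also be parsed programmatically; this hedged Python sketch flags peers whose tunnel is not Up or whose PMTU is below an assumed threshold.

# Hedged sketch: scan "show wireless mobility summary" for peers that are not Up
# or have a low PMTU (threshold is an assumption).
import re

def mobility_findings(output: str, min_pmtu: int = 1300):
    findings = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+\.\d+\.\d+\.\d+)\s+\S+\s+\S+\s+\S+\s+\S+\s+\S+\s+(.*)", line)
        if not m:
            continue
        peer, status = m.group(1), m.group(2).strip()
        if status.startswith("N/A"):
            continue                      # local controller entry
        if not status.startswith("Up"):
            findings.append(f"{peer}: {status}")
        else:
            pmtu = re.search(r"Up\s+(\d+)", status)
            if pmtu and int(pmtu.group(1)) < min_pmtu:
                findings.append(f"{peer}: low PMTU {pmtu.group(1)}")
    return findings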

If we have DNAC for Assurance or Provision, we can confirm that the DNAC NETCONF connection is established. Afterward, verify that telemetry statistics for the WLC, APs, and clients are updated in DNAC. Use the command: "show telemetry internal connection". After 17.7, this command has been replaced by "show telemetry connection all".

Gladius2#show telemetry internal connection
Load for five secs: 29%/5%; one minute: 4%; five minutes: 2%
Time source is NTP, 10:21:45.942 CET Wed Nov 4 2020
Telemetry connections
Index Peer Address               Port  VRF Source Address             State
----- -------------------------- ----- --- -------------------------- ----------
    1 192.168.0.105              25103   0 192.168.25.42              Active

Check that the telemetry connection state is Active.

In case we are using DNA Spaces for location, firstly, we can confirm the NMSP connection status and the number of packets transmitted and received. Secondly, check the list of clients in the WLC probing database. And lastly, verify that the client location is updated in DNA Spaces. Use the command: "show nmsp status"

Gladius1#show nmsp status
NMSP Status
-----------
DNA Spaces/CMX IP Address  Active    Tx Echo Resp  Rx Echo Req   Tx Data     Rx Data     Transport
----------------------------------------------------------------------------------------------------------
192.168.0.65               Active    693870        693870        16833737    181084      TLS
192.168.0.66               Inactive  21            21            222         7           TLS

Check for inactive servers and for a mismatch between echo tx/rx counters.
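
A small parsing sketch for the NMSP table, again with illustrative names, can flag Inactive servers or an echo request/response mismatch automatically.

# Hedged sketch: flag NMSP servers that are Inactive or where the echo
# responses sent do not match the echo requests received.
import re

def nmsp_findings(output: str):
    findings = []
    for line in output.splitlines():
        m = re.match(r"\s*(\d+\.\d+\.\d+\.\d+)\s+(\S+)\s+(\d+)\s+(\d+)", line)
        if not m:
            continue
        server, state = m.group(1), m.group(2)
        tx_echo, rx_echo = int(m.group(3)), int(m.group(4))
        if state.lower() != "active":
            findings.append(f"{server}: state {state}")
        elif tx_echo != rx_echo:
            findings.append(f"{server}: echo mismatch tx={tx_echo} rx={rx_echo}")
    return findings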

With the provided checks, we can proactively monitor the health of our 9800 WLC and its connections with other devices like CMX/DNA Spaces, other WLCs, and DNAC. In the next blog, we will share KPIs to monitor APs and RF.

List of commands to use for KPIs and automation scripts

In the document below, there is also a link to a script that automatically collects all the commands. It collects the commands based on platform and release, saves them in a file, and exports the file. The script uses the "Guest-shell" feature, which for now is only available in the physical WLCs 9800-40/80 and 9800-L.

The document also provides an example of an EEM script to collect logs periodically. In conclusion, EEM along with the "Guest-shell" script will help to collect 9800 WLC KPIs and establish a baseline for your Catalyst 9800 WLC.
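
As a rough idea of what such an on-box collector could look like, here is a minimal Guest Shell Python sketch; it assumes the IOS-XE Guest Shell "cli" module and an illustrative output path, so adjust both to your platform and command list.

# Hedged sketch of an on-box Guest Shell KPI collector.
from datetime import datetime
from cli import cli   # built-in module available inside IOS-XE Guest Shell

COMMANDS = [
    "show version | i uptime|Installation mode|Cisco IOS Software",
    "show redundancy | i ptime|Location|Current Software state|Switchovers",
    "show processes cpu platform sorted",
    "show wireless mobility summary",
]

stamp = datetime.now().strftime("%Y%m%d-%H%M%S")
# Output path is an assumption; use a directory shared with the WLC filesystem.
with open(f"/bootflash/guest-share/kpi-{stamp}.txt", "w") as f:
    for cmd in COMMANDS:
        f.write(f"===== {cmd} =====\n")
        f.write(cli(cmd) + "\n\n")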

For the list of commands used to monitor those KPIs, visit the Monitor Wireless Catalyst 9800 KPIs document.


