Alarms
Asset
Field | Data |
---|---|
Category | asset |
Severity | critical |
Message | <node> .<router> has a software version that does not match its peer. |
Threshold | Issued if any node in a router has mismatched versions. Cleared when they are all equal. |
Cause | Troubleshooting Step |
---|---|
Multiple nodes configured within one router have different software versions. | Manually upgrade the node that has the lower version. Upgrade the router from the PCLI by issuing the send command upgrade router <router> <version> or the upgrade button on the Router page on the Conductor's GUI. |
Field | Data |
---|---|
Category | asset |
Severity | critical |
Message | A duplicate asset with id <id> has been detected. Ensure all assets have a unique id and restart salt-minion on asset <id> , which is configured as <node> .<router> . |
Threshold | Issued if any node being manged by a conductor has the same asset id as another node in the authority. |
Cause | Troubleshooting Step |
---|---|
Multiple nodes configured within an authority have the same asset id. | Execute show assets to identify the <router> .<node> that has a duplicate ID. Change the asset id for that node in the Conductor to have a unique id.Tip: Clearing the asset-id value will generate a random value. |
Field | Data |
---|---|
Category | asset |
Severity | major |
Message | Asset <id> , which is configured as <node> .<router> , is not running |
Threshold | Issued when the SSR service stops on a node (must be managed by ZTP). Clears on SSR start. |
Cause | Troubleshooting Step |
---|---|
SSR is not running on node <node> router <router> | Start SSR from the Conductor PCLI by entering with send command start router <router> node <node> or pressing the start button on the Conductor’s router page in the GUI. If the SSR cannot start check systemctl status 128T on that node. |
Field | Data |
---|---|
Category | asset |
Severity | major |
Message | Asset<id> , which is configured as <node> .<router> , failed to install. |
Threshold | Issued when the SSR install fails on a node (must be managed by ZTP). Clears once SSR is installed. |
Cause | Troubleshooting Step |
---|---|
SSR failed to install on node <node> router <router> , which is asset <id> | Issue command show assets <id> to see detailed information on why the install failed and follow the instructions to fix the issue and retry the installation. |
bgp_neighbor
Field | Data |
---|---|
Category | bgp_neighbor |
Severity | major |
Message | Neighbor <ipaddress> failed to reach the ESTABLISHED state. |
Threshold | Issued when the BGP neighbor is not in the ESTABLISHED state. Clears when the BGP neighbor returns to the ESTABLISHED state. |
Cause | Troubleshooting Step |
---|---|
1. The remote IP address is not reachable due to some network connectivity problem. 2. The remote router is not configured to accept a BGP connection. 3. The OPEN message exchange fails. | Use the command show bgp neighbors and review the content for misconfiguration, state machine connection status, and disconnect failures. |
giid
Field | Data |
---|---|
Category | giid |
Severity | major |
Message | DHCP address for interface [<interface name> ] has not been resolved |
Threshold | Issued when DHCP address for interface is unresolved. |
Cause | Troubleshooting Step |
---|---|
Interface configured to obtain address dynamically using DHCP but was not able to acquire one in time. | Ensure the interface is operationally up Ensure the interface is connected to a network with a DHCP server and the server will accept the node’s request for DHCP address. Collect the DHCP statistics to check for any failures. Collect packet traces on the DHCP interface to investigate any protocol level failures. |
Interface
Field | Data |
---|---|
Category | interface |
Severity | critical |
Message | interface operational down |
Threshold | up/down |
Cause | Troubleshooting Step |
---|---|
Interface is down for an Ethernet WAN connection | The next hop networking equipment is down. Troubleshoot by checking for link status on adjacent equipment, adjacent switch ports, etc. |
Interface is down for an HA or LAN connection | The next hop networking equipment is down. Troubleshoot by checking for link status on adjacent equipment, adjacent switch ports, etc. |
The down interface is an LTE interface | Check that strength and status of the LTE connection by using the show device-interface router <router name> id <interface id> command.• If the signal strength is marginal, poor, or 0 the LTE interface is malfunctioning. • If the system mode is not listed as LTE the signal is malfunctioning. • If the Operation Status is down, the LTE interface is malfunctioning. In the event of the conditions above, contact Juniper. |
Field | Data |
---|---|
Category | interface |
Severity | info |
Message | interface administratively down |
Threshold | up/down |
Cause | Troubleshooting Step |
---|---|
The interface is down due to being disabled in the configuration | Re-enable the interface in the configuration. |
Peer
Field | Data |
---|---|
Category | peer |
Severity | critical |
Message | Peer <name> is not reachable. |
Threshold | When all paths to a peer are marked down by BFD. |
Cause | Troubleshooting Step |
---|---|
All “Peer path” alarms to a given peer are triggered. | Review the statistics for show stats bfd by-peer-path to investigate for anomolies.Capture packets on the interface(s) that talk to the peer and look for successful UDP traffic to and from the peer at port 1280. |
Field | Data |
---|---|
Category | peer |
Severity | major |
Message | Peer <name> path is down |
Threshold | When a single path is marked down by BFD. The source of the alarm includes the Node/interface/IP/VLAN. |
Cause | Troubleshooting Step |
---|---|
Router Interface is down. | Enter the show device-interface router <router> node <node> <interface> command to verify the router's interface status. If the interface is down, the next hop equipment is likely down. Troubleshoot the adjacent device(s). |
Adjacency router's interface is down. | Enter the show device-interface router <router> command to verify the adjacency router's interface status. If the interface is down, the troubleshoot the adjacent device’s interface. |
Path health has degraded sufficiently and is impacting performance. | Using the GUI, click the Home icon and select the appropriate view for the current environment. Examine the graph for any anomalies at the time of the alarm. If the loss is 5% or higher the path has degraded. |
Field | Data |
---|---|
Category | peer |
Severity | major |
Message | Peer <name> path MTU is unresolvable. |
Cause | Troubleshooting Step |
---|---|
Maximum Transmit Unit for packet size is unable to be determined. | Set the MTU for the device-interface statically. |
Platform
Field | Data |
---|---|
Category | platform |
Severity | critical |
Message | Security Rekey failed for: <node-name(s)> |
Cause | Troubleshooting Step |
---|---|
Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem still persists please contact Juniper customer support. |
Field | Data |
---|---|
Category | platform |
Severity | critical |
Message | Security Rekey failed for: <node-name(s)> |
Cause | Troubleshooting Step |
---|---|
Issued when a conductor fails to distribute newly created security keys during rekey process to any managed routers. | Make sure failed nodes are running and have connectivity to the conductor. If the problem still persists please contact Juniper customer support. |
Field | Data |
---|---|
Category | platform |
Severity | major |
Message | flow table limit exceeded |
Threshold | greater than 90% of the total flow table |
Cause | Troubleshooting Step |
---|---|
Occurs when 90% or more of the total flow table is utilized. | The alarm is cleared when 80% or less of the total flow table is utilized. |
Field | Data |
---|---|
Category | platform |
Severity | major |
Message | fib table limit exceeded |
Threshold | greater than 90% of the total FIB table |
Cause | Troubleshooting Step |
---|---|
Occurs when 90% or more of the total FIB table is utilized. | The alarm is cleared when 80% or less of the total FIB table is utilized. This may be due to suboptimal configuration or insufficient memory allocated to the SSR software. Contact Juniper support if this alarm persists. |
Field | Data |
---|---|
Category | platform |
Severity | major |
Message | action table limit exceeded |
Threshold | greater than 90% of the total action table |
Cause | Troubleshooting Step |
---|---|
Occurs when 90% or more of the action table is utilized. | The alarm is cleared when 80% or less of the action table is utilized. This table's use is proportional to the number of active flows. |
Field | Data |
---|---|
Category | platform |
Severity | major |
Message | arp table limit exceeded |
Threshold | greater than 90% of the total arp table |
Cause | Troubleshooting Step |
---|---|
Occurs when 90% or more of the ARP table is used. | The alarm is cleared when 80% or less of the table is used. |
Field | Data |
---|---|
Category | process |
Severity | major |
Message | Process has exited unexpectedly: <process-name> |
Threshold | Issued when a SSR system process exits and is cleared when it is successfully restarted |
Cause | Troubleshooting Step |
---|---|
Process exits once and restarts to normal operation | The SSR is designed to restart processes in the event of a failure. If this alarm state is only seen briefly and then clears it is likely that the system has self-recovered. Please report to Juniper customer support. |
Process exits continuously | Contact Juniper customer support |
System
Field | Data |
---|---|
Category | system |
Severity | critical |
Message | Node <node-name> went offline |
Threshold | Issued when an HA node goes offline |
Cause | Troubleshooting Step |
---|---|
The HA peer node has shut down or stopped running | Verify that the HA peer node is powered on and running. If the node is running verify that the SSR service is running without error by issuing the command systemctl status 128T . If the system appears to be running correctly check connectivity between the systems by issuing the PCLI command show system connectivity on both nodes. |
Connectivity between HA nodes is down | HA node connectivity can be evaluated with the PCLI command show system connectivity . If the state to the peer node is not connected check the inter node tunnel status by running the PCLI command show system connectivity internal . All tunnels to the peer node should report “connected”. If connectivity is down verify links between the systems and if they are up then please contact Juniper support. |
Field | Data |
---|---|
Category | system |
Severity | critical |
Message | system memory exceeded |
Threshold | greater than 90% |
Cause | Troubleshooting Step |
---|---|
Alarm triggers above 90% memory usage | The Alarm resets/clears when memory usage drops below 80%. |
A process is consuming excessive memory | Locate the system processes consuming large amounts of system memory by running show stats process memory rss from the PCLI. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | disk space low |
Threshold | less than 10% disk space left |
Cause | Troubleshooting Step |
---|---|
Disk usage is high | Using standard Linux tools such as “df” and “ls” determine which files are consuming large amounts of disk space. In the event that there are unneeded files they should be removed. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | No connectivity to <router> .<node> |
Threshold | When a connection between a node that is present in config/environment config is not present. |
Cause | Troubleshooting Step |
---|---|
The node is not reachable by the conconductor. | Enter `show system connectivity router all node all |
The node is not reachable by its HA peer. | Enter `show system connectivity router all node all |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | Host cpu utilization exceeded |
Threshold | greater than 85% for 30 seconds |
Cause | Troubleshooting Step |
---|---|
Alarm triggers above 85% CPU usage for 30 seconds | The alarm clears when the CPU usage drops below 70%. |
Intermittent process consuming large amount of CPU | If the alarm triggers and clears intermittently this could indicate a periodic load spike or intermittent process workload. Check the current cpu utilization of all processes in the system by using the linux command top or the PCLI command show stats process cpu . |
Process consistently consuming large amount of CPU | If the alarm is constantly active this could indicate an under-provisioned system. Check the current cpu utilization of all processes in the system by using the linux command top or the PCLI command show stats process cpu . Contact Juniper support for guidance on provisioning the system. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | Received config sync info state for node <node-name> with syncVersion=<version> , error=<error> , message=<message> and resulting action=<action> |
Threshold | Configuration synchronization error |
Cause | Troubleshooting Step |
---|---|
The router is unable to receive the configuration from the Conductor | Run the command show asset <asset-id> of the system exhibiting the problem. This will return the status of the asset and provide more detailed information regarding the nature of the problem. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | Hostname [<hostname> ] is unresolved |
Threshold | When a configured hostname is unresolved. |
Cause | Troubleshooting Step |
---|---|
The router was unable to resolve the hostname given in the config | Verify that the hostname is resolvable from linux using a utility like dig . Verify that the hostname has a corresponding /etc/hosts entry |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | No active NTP server |
Threshold | Issued when the system is not connected to any active NTP servers. |
Cause | Troubleshooting Step |
---|---|
The router is having connectivity problems to the NTP server that was selected. | Specify NTP server(s) to connect to. From PCLI, “configure authority router <router name> system ntp server <ntp server address> ”.Make this more resilient by specifying more NTP servers. A common practice is to specify 4 servers. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | Corrupt entitlement certificate received Invalid entitlement certificate received Unable to obtain entitlement certificate |
Threshold | Certificate failure |
Cause | Troubleshooting Step |
---|---|
Unable to read entitlement data from certificate | Ensure that the certificate installed on the system matches the one received from Juniper. Run install128t to reinstall the certificate. If the problem persists, contact customer support to obtain a new certificate. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | SNMP server failure |
Threshold | Unable to communicate to SNMP server |
Cause | Troubleshooting Step |
---|---|
Network connectivity failure or misconfiguration | Ensure that the SNMP server defined in the configuration is reachable. Usually this can be determined by issueing a ping to the server address. If the server does not respond, run a packet capture on the interface used for SNMP to observe if traffic is being generated from the SSR upon event generation. |
Field | Data |
---|---|
Category | system |
Severity | major |
Message | Restart required |
Threshold | Restart is required for configuration to take effect |
Cause | Troubleshooting Step |
---|---|
Non-dynamically reconfiguable filed has been edited | Some fields within the SSR configuration are not dynamic and requires a restart of the SSR process to take effect (e.g. forwarding-cores). From the Conductor Router page, click on the gear icon to issue a restart of the SSR process. Alternatively, from within the linux shell of the SSR Router, issue systemctl restart 128T |
Field | Data |
---|---|
Category | system |
Severity | minor |
Message | Application Identification cache utilization is approaching maximum capacity |
Threshold | Fires at 95% of max-capacity value (default is 10,000) |
Cause | Troubleshooting Step |
---|---|
Capacity of the cache exceeds 95% of the max-capacity configured value | The alarm is cleared once the capacity of the cache goes below 85% of the configured value, and as sessions using those stats expire. The alarm can be addressed by adjusting the max-capacity value under application-identification . App-id stats are tracked per application, per client, and per next-hop. The granularity of per-application, per-client traffic stats will be reduced while the alarm is active on the system. |