Symbol Error Counter: Understanding Link Errors and Troubleshooting Steps - NADDOD Blog

InfiniBand Symbol Error Counter

NADDOD Brandon InfiniBand Technical Support Engineer Jan 8, 2024

The SymbolErrorCounter is the most basic and common indicator of errors on a link. It indicates that an invalid combination of bits was received. To understand SymbolErrorCounter properly, it helps to first understand error counters in general. In their simplest form, error counters track errors on a switch port or an HCA port. In order of severity, from most severe to least severe, the tracked link errors are: LinkDownedErrors, ExcessiveBufferOverruns, LinkErrorRecovery, LocalLinkIntegrityErrors, PortRcvErrors, SymbolErrors.

 

  1. LinkDownedErrors:

 

Explanation: This error occurs when the link between two network devices is physically down or has been intentionally brought down. It indicates a loss of connectivity between the devices.

 

Possible Causes: Physical cable disconnection, power loss, or intentional administrative actions to disable the link.

 

  2. ExcessiveBufferOverruns:

 

Explanation: This error occurs when the buffers in a network device (like a switch or network interface card) become overloaded with data, leading to data loss or corruption.  
 
 
Possible Causes: High network traffic, inadequate buffer capacity, or a mismatch in data transfer rates.

   

  3. LinkErrorRecovery:

 

Explanation: This counter records occurrences of the link error recovery process, in which the link detects a link-level error and attempts to recover from it.
 
 
Possible Causes: Temporary disruptions or interference causing data errors, and the system's attempt to recover from these errors.
 
  4. LocalLinkIntegrityErrors:

   

Explanation: These errors relate to problems with the physical integrity of the local link, indicating issues with the reliability and quality of the connection between two devices.   
 
 
Possible Causes: Physical issues like cable damage, electromagnetic interference, or problems with the network interface hardware.

 

  5. PortRcvErrors:

  

Explanation: PortRcvErrors occur when there are issues with the reception of data on a particular port. It indicates errors in the receiving process on the network device.   
 
 
Possible Causes: Network congestion, hardware malfunctions, or misconfigurations affecting the receiving port.
 
  6. SymbolErrors:

  

Explanation: SymbolErrors indicate that an invalid combination of bits was received during the transmission of data. It often points to problems with the encoding or modulation of the data.
 
 
Possible Causes: Electrical noise, signal interference, or issues with the encoding/decoding mechanisms in the network devices.
 

Understanding and monitoring these different error types helps network administrators diagnose and troubleshoot issues within the network, ensuring the reliability and stability of data transmission.
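The severity ordering above can be turned into a simple lookup when scanning a port's counters. The following is a minimal sketch, assuming the counters have already been read into a dictionary keyed by the names used in this article; the function names `most_severe` and `rank_nonzero` are illustrative and do not come from any InfiniBand tool.

```python
# Severity order as listed in the article, most severe first.
SEVERITY_ORDER = [
    "LinkDownedErrors",
    "ExcessiveBufferOverruns",
    "LinkErrorRecovery",
    "LocalLinkIntegrityErrors",
    "PortRcvErrors",
    "SymbolErrors",
]


def most_severe(counters):
    """Return the name of the most severe nonzero counter, or None."""
    for name in SEVERITY_ORDER:
        if counters.get(name, 0) > 0:
            return name
    return None


def rank_nonzero(counters):
    """Return (name, value) pairs for nonzero counters, most severe first."""
    return [
        (name, counters[name])
        for name in SEVERITY_ORDER
        if counters.get(name, 0) > 0
    ]


# Hypothetical readings for one port.
sample = {"SymbolErrors": 95, "PortRcvErrors": 3, "LinkErrorRecovery": 1}
print(most_severe(sample))
print(rank_nonzero(sample))
```

Looking at the most severe nonzero counter first keeps attention on the error type that matters most for the link.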

 

It is possible to get other link integrity errors on a link without Symbol Errors, but this is not typical. Quite often, if zero Symbol Errors are found but there are LinkDownedErrors or LinkErrorRecovery errors, another read of the Symbol Error counter will reveal that you happened to read it just after it had been reset by a link recovery action.
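The check described here, a zero symbol count alongside link downs or recoveries, is easy to encode. This is a small sketch with a hypothetical helper name, `needs_reread`; it only flags the reading as worth repeating and does not talk to any device.

```python
def needs_reread(symbol_errors, link_downed, link_error_recovery):
    """Return True when a zero symbol count looks like a freshly reset counter.

    A zero SymbolErrors value together with link downs or link error
    recoveries suggests the counter was read right after a recovery cleared it.
    """
    link_was_disturbed = link_downed > 0 or link_error_recovery > 0
    return symbol_errors == 0 and link_was_disturbed


# Hypothetical reading: no symbol errors, but one link down was recorded.
print(needs_reread(symbol_errors=0, link_downed=1, link_error_recovery=0))
```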

 

Some of the common causes of Symbol Errors are: a power cycle or reboot of the device, a cable being pulled or reseated, a leaf being reseated (a hotplug event for the blade server), an event in SFP that has the HCA in the FRU list, or a power event that could have brought down the device. Any of these actions can leave between 88 and 102 Symbol Errors in the counter, with typical counts in the 90s. Therefore, error counters should be cleared after any one of these actions. It is also very important to reset or clear the error counters on a regular basis so that you can determine the rate at which errors occur.
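Two ideas in this paragraph lend themselves to a short sketch: recognizing a count that matches the residue of a link reset, and turning a count into a rate once the time of the last clear is known. The helper names and the idea of tracking `seconds_since_clear` are assumptions made for illustration; only the 88 to 102 range comes from the text.

```python
# 88..102 inclusive: the residue range quoted in the article.
RESET_RESIDUE = range(88, 103)


def looks_like_reset_residue(symbol_errors):
    """Return True if the count matches what a reboot or reseat typically leaves."""
    return symbol_errors in RESET_RESIDUE


def errors_per_hour(symbol_errors, seconds_since_clear):
    """Convert a raw count into an hourly rate since the counters were cleared."""
    if seconds_since_clear <= 0:
        raise ValueError("seconds_since_clear must be positive")
    return symbol_errors * 3600.0 / seconds_since_clear


print(looks_like_reset_residue(95))
print(errors_per_hour(12, 7200))
```

A rate is only meaningful if the counters were cleared at a known time, which is why regular clearing matters.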

 

If Symbol Errors are reported by the HCA, and the count is somewhere between 85 and 100, first determine whether the errors were caused by one of the common causes listed above. If there were outside events that caused the link to go down, clear the error counters on the link. If there were no outside events, monitor the link to see whether the number of errors increases. If the number of Symbol Errors does not increase in about two hours, and no LinkErrorRecovery errors accumulate during that period, you can simply clear the error counters on the link. It is also possible to see the Symbol Error count decrease if there are LinkErrorRecovery errors, because a link recovery sequence includes a clear of the Symbol Error counter.
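The procedure in this paragraph can be sketched as a small decision function. The `Action` enum, the function name `triage`, and the `INVESTIGATE` outcome are assumptions: the article does not say what to do when the count falls outside 85 to 100 or when errors keep growing, so those branches are labeled as such in the code.

```python
from enum import Enum


class Action(Enum):
    CLEAR = "clear the counters"
    MONITOR = "monitor for about two hours"
    INVESTIGATE = "investigate the link"


def triage(symbol_errors, outside_event, symbol_delta=None, recovery_delta=None):
    """Suggest a next step for an HCA port reporting Symbol Errors.

    symbol_delta and recovery_delta are the changes in SymbolErrors and
    LinkErrorRecovery over roughly two hours; None means not yet observed.
    """
    if not 85 <= symbol_errors <= 100:
        # Assumption: the article only covers counts in the 85..100 range.
        return Action.INVESTIGATE
    if outside_event:
        # A reboot, reseat, or power event explains the count.
        return Action.CLEAR
    if symbol_delta is None or recovery_delta is None:
        return Action.MONITOR
    if symbol_delta <= 0 and recovery_delta == 0:
        # No growth and no recoveries during the observation window.
        return Action.CLEAR
    # Assumption: anything else warrants a closer look. This includes a
    # falling symbol count caused by link recoveries clearing the counter.
    return Action.INVESTIGATE


print(triage(93, outside_event=False))
print(triage(93, outside_event=False, symbol_delta=0, recovery_delta=0))
```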

 

For the HCA, the combined reporting of PortRcvErrors and SymbolErrors is important to understand. If a Symbol Error occurs on a data cycle, as opposed to an idle cycle, a PortRcvError will also be recorded. For example, if you are only seeing Symbol Errors on an HCA port, it is likely that no data transfers are being impacted. Therefore, if the number of Symbol Errors is fairly low, maintenance could be deferred to a more convenient time. As the number of Symbol Errors increases, this must be reassessed, because the probability of impacting data transfers will increase. Typically, the number of Symbol Errors will be equal to or greater than the number of PortRcvErrors.

 

If the number of PortRcvErrors is much higher than the number of Symbol Errors, and there are not enough corresponding PortRcvRemotePhysicalErrors or ExcessiveBufferOverrunErrors to explain the difference, it is possible that a remote source HCA is corrupting a CRC that is only checked at the destination HCA.
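The relationship between PortRcvErrors and Symbol Errors on an HCA port can be sketched as a simple classifier. The labels, the function name `assess_port`, and the `margin` parameter are hypothetical; the article says "much higher" without giving a number, so the threshold is left for the reader to choose.

```python
def assess_port(symbol, port_rcv, remote_phys=0, buffer_overruns=0, margin=10):
    """Classify an HCA port from its error counters.

    margin is a hypothetical threshold for "much higher"; the article
    does not specify one.
    """
    if symbol == 0 and port_rcv == 0:
        return "clean"
    if symbol > 0 and port_rcv == 0:
        # Symbol errors landed on idle cycles; data is likely unaffected.
        return "idle-only"
    # PortRcvErrors not accounted for by symbol errors, remote physical
    # errors, or buffer overruns.
    unexplained = port_rcv - symbol - remote_phys - buffer_overruns
    if unexplained > margin:
        # Possibly a remote source HCA corrupting a CRC checked only here.
        return "suspect-remote-crc"
    return "typical"


print(assess_port(symbol=40, port_rcv=0))
print(assess_port(symbol=5, port_rcv=500, remote_phys=20, buffer_overruns=30))
```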

 

Image: ports without symbol errors and with symbol errors

Symbol Errors - This is the total number of minor link errors detected on one or more physical lanes. This means that some symbols (physical layer units) arrived at the end node with errors. In normal conditions this counter will remain zero. If this counter grows, check the cable connectivity or replace the cable.
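Checking whether the counter grows means comparing two readings. The sketch below assumes a text dump in which each counter appears as a name, a colon, a run of dots, and an integer. That layout and the sample values are assumptions for illustration and are not taken from the article; adjust the pattern to match the actual output of your diagnostic tool.

```python
import re

# Assumed line layout: "CounterName:.........123"
LINE_RE = re.compile(r"^\s*(\w+):\.+(\d+)\s*$")


def parse_counters(text):
    """Parse 'Name:....value' lines into a dict, skipping anything else."""
    counters = {}
    for line in text.splitlines():
        match = LINE_RE.match(line)
        if match:
            counters[match.group(1)] = int(match.group(2))
    return counters


def counter_grew(before, after, name="SymbolErrorCounter"):
    """Return True if the named counter increased between two readings."""
    return after.get(name, 0) > before.get(name, 0)


# Hypothetical dumps taken some time apart.
first = parse_counters("SymbolErrorCounter:..............0\nPortRcvErrors:...................0")
second = parse_counters("SymbolErrorCounter:..............12\nPortRcvErrors:...................3")

if counter_grew(first, second):
    print("Symbol errors are growing: check or replace the cable")
```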

 

Resource links: https://mellanox.my.site.com/mellanoxcommunity/s/article/MLNX2-117-1634kn