Monitoring is one of the core functions of DevOps and Nagios is a popular open-source monitoring system for infrastructure and applications of a system.
It enables users to monitor various system components like servers, network devices, applications, and services in real time. On top of that, it also alerts system administrators in case of system outage.
Nagios can be easily integrated with many other DevOps tools including the ones for log management, deployment pipelines, and automation frameworks.
Since it helps DevOps teams ensure smooth running of systems and avoiding down time, optimize web pages, improve system performance and reduce costs across industries:
In this article, the experts at Talent500 will share an overview of how you can tackle the commonly encountered issues.
Let’s get started:
Typing Issues Encountered: Nagios
In this section, we will go through the common errors faced in troubleshooting Nagios and their solutions are:
Configuration errors: Nagios configuration files are fairly complicated and prone to syntax errors. Incorrect syntax, permissions, path and permissions are some common configuration errors.
Permission issues: Simply check that Nagios as well as its associated files have the needed permissions and ownership rights.
Plugin errors: It is well known that Nagios relies on plugins to fetch data from systems, and plugin errors can cause monitoring to collapse. Common plugin errors include incorrect plugin path, incorrect permissions, and incorrect parameters plugin path, and parameters.
Service checks not running: See if your service checks are properly scheduled and keep the system running.
Firewall and network issues: At times, users access remote hosts or need to connect necessary ports.
Resource limitations: Errors are also caused by system resource limitations like RAM and CPU.
Logging and Debugging: You must start verbose logging and debugging when trying to identify using Nagios.
Incorrect alerts: At times, it encounters false alerts and you must verify if the alert thresholds are set properly and that notification reaches the concerned.
Software and OS upgrades: Check if showing that Nagios is compatible with new software and OS upgrades.
Monitoring configuration: In this case, you must check if the Nagios monitoring configuration is correct and that it includes all necessary hosts and services.
Network issues: Nagios relies on network connectivity to gather data from remote systems, and network issues can cause monitoring to fail. Firewall rules blocking access, DNS resolution issues, and network congestion are some of the common network issues.
False positives: When there isn’t really a problem with the system, Nagios might occasionally generate false positive warnings. Incorrect alarm levels or monitoring the wrong metrics are typical reasons behind such errors. Nagios alarms need to be adjusted, and it must ensure that the right metrics are being monitored in order to prevent such false positives.
Example: How to troubleshoot performance graph problems in Nagios?
There are multiple reasons behind performance graph problems in Nagios including incorrect data collection, missing data, or incorrect graph configuration.
Follow these steps to troubleshoot Nagios performance graph problems:
Verify data collection: Firstly, verify that Nagios is correctly collecting performance data from the monitored systems. You simply need to go to the System Information section of the Admin page and see Performance Data Process is green.
See if spooled files exceed 20,000: It is observed that many errors cause the file to stop from any processing and spool up. Here’s the code you must use to count the total files in the respective locations:
ls /usr/local/nagios/var/spool/perfdata/ | wc -l
ls /usr/local/nagios/var/spool/xidpe/ | wc -l
Use the below command if you find more than 20,000 spooled files:
find/usr/local/nagios/var/spool/perfdata/ -type f -delete
Now, Boost Performance Data Logging Verbosity:
Edit the below file from another SSH session:
/usr/local/nagios/etc/pnp/process_perfdata.cfg
Here, change LOG_LEVEL = 2 to LOG_LEVEL = 0 as process_perfdata.pl displays all errors.
If you face a performance data processor’s timeout error, it can be easily controlled by increasing timeout limits.
Increase NPCD Logging Verbosity by editing a file in an SSH session:
/usr/local/nagios/etc/pnp/npcd.cfg
Here change log_level = 0 to log_level = -1
Next you might need to go through the user account and verify metrics.
Similarly, more issues and their solutions can be found on their forum.
Wrap Up
If you currently use Nagios or plan to do so, you must spend ample time on honing your skills when it comes to troubleshooting. From lacking scalability to Naigos Guru’s high TC expectations is concerning for most business owners.
If you want to learn more about DevOps tools and trends, subscribe to our blog now.
Join Talent500 to get the latest leads on remote DevOps opportunities.
Add comment