Nagios vs StatsD: What are the differences?
What is Nagios? Complete monitoring and alerting for servers, switches, applications, and services. Nagios is a host/service/network monitoring program written in C and released under the GNU General Public License.
What is StatsD? Simple daemon for easy stats aggregation. StatsD is a front-end proxy for the Graphite/Carbon metrics server, originally written by Etsy's Erik Kastner. StatsD is a network daemon that runs on the Node.js platform and listens for statistics, like counters and timers, sent over UDP and sends aggregates to one or more pluggable backend services (e.g., Graphite).
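As a rough illustration of what "sent over UDP" means in practice, the sketch below builds a datagram in the StatsD line format (`<bucket>:<value>|<type>`, with an optional `|@<rate>` sample rate) and sends it to the default StatsD port, 8125. The helper names `format_metric` and `send_metric` and the bucket names are hypothetical, chosen for this example only; a real project would more likely use an existing client library.

```python
import socket
from typing import Optional

STATSD_PORT = 8125  # StatsD's default UDP port


def format_metric(bucket: str, value: int, metric_type: str,
                  sample_rate: Optional[float] = None) -> bytes:
    """Build one datagram of the form <bucket>:<value>|<type>[|@<rate>]."""
    line = f"{bucket}:{value}|{metric_type}"
    if sample_rate is not None:
        line += f"|@{sample_rate}"
    return line.encode("ascii")


def send_metric(payload: bytes, host: str = "127.0.0.1",
                port: int = STATSD_PORT) -> None:
    """Send the datagram over UDP; the sender gets no acknowledgement back."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(payload, (host, port))
```

For example, `send_metric(format_metric("page.views", 1, "c"))` would increment a counter, while the type `ms` would record a timer. Because UDP is connectionless, the caller is not blocked waiting for the daemon.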
Nagios and StatsD belong to the "Monitoring Tools" category of the tech stack.
Some of the features offered by Nagios are:
- Monitor your entire IT infrastructure
- Spot problems before they occur
- Know immediately when problems arise
On the other hand, StatsD provides the following key features:
- buckets: Each stat is in its own "bucket". They are not predefined anywhere. Buckets can be named anything that will translate to Graphite (periods make folders, etc.)
- values: Each stat will have a value. How it is interpreted depends on modifiers. In general, values should be integers.
- flush: After the flush interval timeout (defined by config.flushInterval, default 10 seconds), stats are aggregated and sent to an upstream backend service.
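The bucket, value, and flush cycle described above can be modeled with a small in-memory aggregator. This is a simplified sketch, not StatsD's actual implementation: the class name `MiniAggregator` is hypothetical, it handles only counters (`c`) and timers (`ms`), and it ignores sample rates. In the real daemon a timer would trigger `flush` every `config.flushInterval`; here it is called by hand.

```python
from collections import defaultdict
from typing import Dict, List


class MiniAggregator:
    """Toy model of bucket aggregation between flushes (assumption-laden sketch)."""

    def __init__(self) -> None:
        self.counters: Dict[str, int] = defaultdict(int)
        self.timers: Dict[str, List[float]] = defaultdict(list)

    def handle(self, line: str) -> None:
        """Parse one '<bucket>:<value>|<type>' line and update its bucket."""
        bucket, rest = line.split(":", 1)
        value, metric_type = rest.split("|")[:2]
        if metric_type == "c":
            self.counters[bucket] += int(value)
        elif metric_type == "ms":
            self.timers[bucket].append(float(value))

    def flush(self) -> dict:
        """Return the aggregates collected so far, then reset all buckets."""
        snapshot = {
            "counters": dict(self.counters),
            "timer_means": {b: sum(v) / len(v) for b, v in self.timers.items()},
        }
        self.counters.clear()
        self.timers.clear()
        return snapshot
```

Buckets are created on first use, which mirrors the point that they are not predefined anywhere. The snapshot returned by `flush` stands in for what would be forwarded to a backend such as Graphite.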
"It just works" is the primary reason why developers consider Nagios over the competitors, whereas "Single responsibility" was stated as the key factor in picking StatsD.
Nagios and StatsD are both open source tools. StatsD, with 14.2K GitHub stars and 1.83K forks, appears to be more popular than Nagios, with 60 GitHub stars and 36 forks.
According to the StackShare community, Nagios has broader approval, being mentioned in 177 company stacks and 40 developer stacks, compared to StatsD, which is listed in 72 company stacks and 16 developer stacks.
Why we spent several years building an open source, large-scale metrics alerting system, M3, built for Prometheus:
By late 2014, all services, infrastructure, and servers at Uber emitted metrics to a Graphite stack that stored them using the Whisper file format in a sharded Carbon cluster. We used Grafana for dashboarding and Nagios for alerting, issuing Graphite threshold checks via source-controlled scripts. While this worked for a while, expanding the Carbon cluster required a manual resharding process and, due to lack of replication, any single node’s disk failure caused permanent loss of its associated metrics. In short, this solution was not able to meet our needs as the company continued to grow.
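To make "Graphite threshold checks" concrete, here is a minimal sketch of the decision logic such a Nagios check might contain. It is not Uber's script, which the source does not show. It assumes datapoints arrive as `[value, timestamp]` pairs with possible nulls (the shape of Graphite's JSON render output) and maps the latest value to the standard Nagios plugin exit codes. The function names and thresholds are hypothetical.

```python
from typing import List, Optional, Tuple

# Standard Nagios plugin exit codes
OK, WARNING, CRITICAL, UNKNOWN = 0, 1, 2, 3


def latest_value(datapoints: List[list]) -> Optional[float]:
    """Return the most recent non-null value from [value, timestamp] pairs."""
    for value, _timestamp in reversed(datapoints):
        if value is not None:
            return value
    return None


def check_threshold(datapoints: List[list], warn: float,
                    crit: float) -> Tuple[int, str]:
    """Map the latest datapoint to an exit code (higher values are worse)."""
    value = latest_value(datapoints)
    if value is None:
        return UNKNOWN, "UNKNOWN - no datapoints"
    if value >= crit:
        return CRITICAL, f"CRITICAL - value {value} >= {crit}"
    if value >= warn:
        return WARNING, f"WARNING - value {value} >= {warn}"
    return OK, f"OK - value {value}"
```

A wrapper script would print the message and exit with the returned code, which is how Nagios learns the state of the check.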
To ensure the scalability of Uber’s metrics backend, we decided to build out a system that provided fault tolerant metrics ingestion, storage, and querying as a managed platform...
(GitHub : https://github.com/m3db/m3)
Data science and engineering teams at Lyft maintain several big data pipelines that serve as the foundation for various types of analysis throughout the business.
Apache Airflow sits at the center of this big data infrastructure, allowing users to “programmatically author, schedule, and monitor data pipelines.” Airflow is an open source tool, and “Lyft is the very first Airflow adopter in production since the project was open sourced around three years ago.”
There are several key components of the architecture. A web UI allows users to view the status of their queries, along with an audit trail of any modifications to the query. A metadata database stores things like job status and task instance status. A multi-process scheduler handles job requests and triggers the executor to execute those tasks.
Airflow supports several executors, though Lyft uses CeleryExecutor to scale task execution in production. Airflow is deployed to three Amazon Auto Scaling Groups, with each associated with a celery queue.
Audit logs supplied to the web UI are powered by the existing Airflow audit logs as well as Flask signal.
Datadog, StatsD, Grafana, and PagerDuty are all used to monitor the Airflow system.
We use collectd because of its low footprint and great capabilities. We use it to monitor our Google Compute Engine machines. More interestingly, we set up collectd as a StatsD replacement: all our Clojure services push application-level metrics using our own metrics library, and collectd pushes them to Stackdriver.
A huge part of our continuous deployment practices is to have granular alerting and monitoring across the platform. To do this, we run Sentry on-premise, inside our VPCs, for our event alerting, and we run an awesome observability and monitoring system consisting of StatsD, Graphite and Grafana. We have dashboards using this system to monitor our core subsystems so that we can know the health of any given subsystem at any moment. This system ties into our PagerDuty rotation, as well as alerts from some of our Amazon CloudWatch alarms (we’re looking to migrate all of these to our internal monitoring system soon).
We use Nagios to monitor our stack and alert us when problems arise. Nagios allows us to monitor every aspect of each of our servers such as running processes, CPU usage, disk usage, and more. This means that as soon as problems arise, we can detect them and call out an engineer to resolve the issues as soon as possible.
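A disk usage check of the kind mentioned here could look roughly like the following. This is an illustrative sketch rather than the configuration this team runs: the function names, the `DISK` label, and the 80/90 percent thresholds are assumptions. The status line follows the Nagios plugin convention of human-readable text, a `|` separator, then performance data as `label=value;warn;crit`.

```python
import shutil
from typing import Tuple


def used_percent(path: str = "/") -> float:
    """Percentage of the filesystem at `path` that is in use."""
    usage = shutil.disk_usage(path)
    return 100.0 * usage.used / usage.total


def format_status(pct: float, warn: float = 80.0,
                  crit: float = 90.0) -> Tuple[int, str]:
    """Return (exit_code, status_line) for a usage percentage."""
    if pct >= crit:
        code, label = 2, "CRITICAL"
    elif pct >= warn:
        code, label = 1, "WARNING"
    else:
        code, label = 0, "OK"
    # Performance data lets Nagios add-ons graph the value over time.
    perfdata = f"used_pct={pct:.1f}%;{warn:g};{crit:g}"
    return code, f"DISK {label} - {pct:.1f}% used | {perfdata}"
```

A plugin built on this would call `format_status(used_percent("/"))`, print the status line, and exit with the code (0 for OK, 1 for WARNING, 2 for CRITICAL).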
StatsD is used to track the number of messages we're publishing and the type of realtime subscribers. So it shows the number of longpoll connections, the number of websocket connections etc. It also tracks how Redis is performing.
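The kind of tracking described here maps naturally onto two StatsD metric types: a counter (`c`) for published messages and a gauge (`g`) for the current number of connections of each kind. The sketch below only composes the metric lines and hands them to an `emit` callable, which in practice would send them over UDP. The class name `RealtimeStats` and the bucket names are hypothetical, not taken from this team's setup.

```python
from typing import Callable, Dict


class RealtimeStats:
    """Compose StatsD lines for message and connection tracking (sketch)."""

    def __init__(self, emit: Callable[[str], None]) -> None:
        self._emit = emit
        self._connections: Dict[str, int] = {}

    def message_published(self) -> None:
        # Counter: StatsD sums these up over each flush interval.
        self._emit("messages.published:1|c")

    def connection_opened(self, kind: str) -> None:
        self._set(kind, self._connections.get(kind, 0) + 1)

    def connection_closed(self, kind: str) -> None:
        self._set(kind, max(0, self._connections.get(kind, 0) - 1))

    def _set(self, kind: str, count: int) -> None:
        # Gauge: reports the current level rather than a rate.
        self._connections[kind] = count
        self._emit(f"connections.{kind}:{count}|g")
```

Passing `lines.append` as `emit` makes the output easy to inspect in a test, while a production caller would pass a function that writes each line to a UDP socket.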
We use Nagios to monitor customer instances of Bridge and proactively alert us about issues like queue sizes, downed services, errors in logs, etc.
We use Nagios-based OpsView to monitor our server farm and keep everything running smoothly.