We use Datadog to centralise our log outputs, monitor our hosts and set alerts for our tools.

Datadog
Hey there! We are looking at Datadog, Dynatrace, AppDynamics, and New Relic as options for our web application monitoring.
Current Environment: .NET Core Web app hosted on Microsoft IIS
Future Environment: Web app will be hosted on Microsoft Azure
Tech Stacks: IIS, RabbitMQ, Redis, Microsoft SQL Server
Requirements: Infra Monitoring, APM, Real-User Monitoring (user activity monitoring, i.e., time spent on a page, most active page, etc.), Service Tracing, Root Cause Analysis, and Centralized Log Management.
Please advise on the above. Thanks!
My team is divided on using Centreon or Zabbix for enterprise monitoring and alert automation. Can someone let us know which one is better? There is one more tool called Datadog that we are using for cloud assets. Of course, Datadog presents us with huge bills. So we want to have a comparative study. Suggestions and advice are welcome. Thanks!
I work at Volvo Car Corporation as a consultant Project Manager. We have deployed Zabbix in all of our factories for factory monitoring because after thorough investigation we saw that Zabbix supports the wide variety of Operating Systems, hardware peripherals and devices a Car Manufacturer has.
No other tool had the same amount of built-in support for our production environment, and we didn't want to end up using a different tool for each area. That is the major strong point of Zabbix, and it's free of course. Another strong point is the documentation, which is widely available: the Zabbix YouTube channel with tutorial videos, Zabbix Share which holds free templates, the Zabbix online documentation and the Zabbix forum also helped us out quite a bit. Deployment is quite easy since it uses templates, so almost all configuration can be done on the server side.
To conclude, we are really pleased with the tool so far, it helped us detect several causes of issues that were a pain to solve in the past.
Centreon is part of the Nagios ecosystem, meaning there is a huge number of resources you can find in the community (plugins, skills, add-ons). Zabbix's monitoring paradigm is totally different from Centreon's: Centreon plugins have some intelligence of their own when they are launched, whereas Zabbix monitoring rules are configured centrally on the raw data collected. Testing both will help you understand :) Users tend to say Centreon may be faster to set up and deploy. In the end, both are full of monitoring features.
Centreon has, out of the box, a full catalog of probes from cloud to the edge (https://www.centreon.com/en/plugins-pack-list/). As soon as you have defined your monitoring policies and templates, you can deploy them fast through the command line API or the REST API. Centreon plays well in the ITSM, Automation and AIOps spaces, with many connectors for Prometheus, ServiceNow, GLPI, Ansible, Chef, Splunk, ... The polling server mode is one of Centreon's differentiators: you set up one or more remote servers and choose between multiple information-exchange mechanisms. It is powerful and resilient for remote, VPN, DMZ and satellite networks.
Centreon is good value for the price for data collection (availability, performance, fault) on a wide range of technologies (physical, legacy, cloud). There is pro support and an enterprise version with dashboards and reporting. IT Central Station gathers a lot of user feedback you can rely on for both Centreon and Zabbix: https://www.itcentralstation.com/products/centreon-reviews
We are looking for a centralised monitoring solution for our application deployed on Amazon EKS. We would like to monitor using metrics from Kubernetes, AWS services (NeptuneDB, AWS Elastic Load Balancing (ELB), Amazon EBS, Amazon S3, etc) and application microservice's custom metrics.
We expect to run around 80 microservices (not replicas). I think there will be a total of 200-250 microservices in the system, with 10-12 slave nodes.
We tried Prometheus, but it looks like maintenance is a big issue. We need to manage scaling, maintain the storage, and deal with multiple exporters and Grafana. I feel this alone needs a few dedicated resources (at least 2-3 people) to manage. I am not sure if I am thinking in the right direction. Please confirm.
You mentioned that Datadog and Sysdig charge per host. Do they charge per slave node?
I can't say anything about Sysdig. I clearly prefer Datadog because:
- they provide plenty of easy to "switch-on" plugins for various technologies (incl. most of AWS)
- easy to code (python) agent plugins / api for own metrics
- brilliant dashboarding / alarms with many customization options
- pricing is OK, there are cheaper options for specific use cases but if you want superior dashboarding / alarms I haven't seen a good competitor (despite your own Prometheus / Grafana / Kibana dog food)
IMHO NewRelic is "promising since years" ;) good ideas but bad integration between their products. Their Dashboard query language is really nice but lacks critical functions like multiple data sets or advanced calculations. Needless to say you get all of that with Datadog.
Need help setting up a monitoring / logging / alarm infrastructure? Send me a message!
Thanks for the reply. I am working with the Datadog trial version now. I am able to see my container/pod/VM metrics in Datadog.
I am now trying to set up the JMX integration with Autodiscovery, but I am not able to see the JVM metrics in Datadog. Can you please help with this?
Here is my deployment yaml:
```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
  namespace: datadog
  annotations:
    ad.datadoghq.com/myapp.check_names: >-
      '["myapp"]'
    ad.datadoghq.com/myapp.init_configs: >-
      '[{"is_jmx": true, "collect_default_metrics": true}]'
    ad.datadoghq.com/tomcat.instances: >-
      '[{"host": "%%host%%","port":"5000"}]'
  labels:
    app: myapp
spec:
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: nexus.nslhub.com/sample-java-app:2.0
          imagePullPolicy: Always
          ports:
            - containerPort: 8080
              name: http
            - containerPort: 5000
              name: jmx
      imagePullSecrets:
        - name: myappsecret
      nodeSelector:
        kubernetes.io/hostname: ip-10-5-7-173.ap-south-1.compute.internal
```
I would like to help, but there could be hundreds of reasons why the incoming and outgoing JMX ports are not accessible from the Agent.
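That said, a few things in the posted manifest can be checked mechanically before digging into network access. The sketch below is a small, hypothetical lint in Python; the function name `find_autodiscovery_issues` and its messages are illustrative and not part of any Datadog tooling. It rests on three assumptions about Autodiscovery annotations: the Agent reads them from the pod, so in a Deployment they belong under `spec.template.metadata.annotations`; the identifier in `ad.datadoghq.com/<identifier>.<suffix>` has to match a container name; and each value has to be valid JSON. Note that with a `>-` block scalar, quote characters on the following line become part of the value, which is how the values are reproduced here.

```python
import json
import re

# Autodiscovery v1 annotation keys: ad.datadoghq.com/<container>.<suffix>
AD_KEY = re.compile(
    r"^ad\.datadoghq\.com/(?P<container>[^./]+)\."
    r"(?P<suffix>check_names|init_configs|instances)$"
)
REQUIRED = {"check_names", "init_configs", "instances"}


def find_autodiscovery_issues(deployment):
    """Return human-readable problems with a Deployment's Autodiscovery annotations."""
    issues = []
    template = deployment.get("spec", {}).get("template", {})
    top_level = deployment.get("metadata", {}).get("annotations") or {}
    pod_level = template.get("metadata", {}).get("annotations") or {}
    containers = {c["name"] for c in template.get("spec", {}).get("containers", [])}

    if any(AD_KEY.match(key) for key in top_level):
        issues.append(
            "annotations are set on the Deployment itself; move them to "
            "spec.template.metadata.annotations so they end up on the pods"
        )

    # Check every annotation, wherever it was set, and group suffixes per identifier.
    seen = {}
    for key, value in {**top_level, **pod_level}.items():
        match = AD_KEY.match(key)
        if not match:
            continue
        seen.setdefault(match["container"], set()).add(match["suffix"])
        try:
            json.loads(value)
        except ValueError:
            issues.append(f"{key} is not valid JSON: {value}")

    for identifier, suffixes in sorted(seen.items()):
        if identifier not in containers:
            issues.append(
                f"'{identifier}' does not match any container name "
                f"in the pod ({sorted(containers)})"
            )
            continue
        missing = sorted(REQUIRED - suffixes)
        if missing:
            issues.append(f"'{identifier}' is missing {missing}")
    return issues


# The relevant parts of the posted manifest as a plain dict. The values keep
# the literal single quotes that the ">-" block scalars would preserve.
posted = {
    "metadata": {
        "annotations": {
            "ad.datadoghq.com/myapp.check_names": """'["myapp"]'""",
            "ad.datadoghq.com/myapp.init_configs":
                """'[{"is_jmx": true, "collect_default_metrics": true}]'""",
            "ad.datadoghq.com/tomcat.instances":
                """'[{"host": "%%host%%","port":"5000"}]'""",
        }
    },
    "spec": {
        "template": {
            "metadata": {"labels": {"app": "myapp"}},
            "spec": {"containers": [{"name": "myapp"}]},
        }
    },
}

issues = find_autodiscovery_issues(posted)
for issue in issues:
    print("-", issue)
```

If these assumptions hold for your Agent version, the `tomcat` identifier and the annotation placement are worth fixing first; connectivity to port 5000 only matters once the Agent actually picks up the check configuration.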
Hi Medeti,
you are right. Building something for your stack with open source is heavy lifting. A lot of people I know start with such a set-up, but quickly run into frustration as they need to dedicate their best people to building a monitoring solution that does the job in a professional way.
As you are microservice focussed and are looking for 'low implementation and maintenance effort', you might want to have a look at INSTANA, which was built with modern tool stacks in mind. https://www.instana.com/apm-for-microservices/
We have a public sand-box available if you just want to have a look at the product once and of course also a free-trial: https://www.instana.com/getting-started-with-apm/
Let me know if you need anything on top.
We use AppOptics. I am curious what the current leaders in APM are for small companies (50 employees) that use Python, MariaDB, RabbitMQ, and Google Cloud Storage. We run both Celery and Gunicorn services. We are considering Datadog or some other deep code profiling tool that can spot I/O, DB, or other response time/request rate issues.
If you want to get deep insights and fast issue resolution, have a look at INSTANA.
There is a public sandbox where you can get first insights and a feeling for the tool. If you like it, you can also run a free trial.
We are running Python & Celery, our stack is based on AWS ECS. We are using NewRelic. This tool is just amazing, both for API and Offline workers. It would provide any metric I was looking for, including a profiler, SLA / SLO dashboards, infrastructure metrics. It has alerting capability that is easily integrated with Pingdom / PagerDuty / Webhooks.
I see StatsD is commonly used in conjunction with Datadog. In fact, Datadog even has their own StatsD daemon (called DogStatsD) embedded in the DataDog agent. Can someone explain to me what it is that StatsD gives you which you don't already have with Datadog's APM and distributed tracing functionality?
DogStatsD is not quite a standard StatsD daemon: it implements a large subset of the original (Etsy) StatsD features, plus some Datadog-specific extensions (such as histograms). It is used to send custom metrics to Datadog, and its big advantage is the developer experience, which is familiar and easy to use, just like any StatsD client. That makes it trivial to replace an existing StatsD metrics client in any application with the Datadog version in order to publish metrics to Datadog.
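To make that concrete: what goes over the wire is a small text datagram per metric, sent over UDP, so an application can report its own numbers (orders placed, queue depth, render time) in addition to what tracing collects. Below is a minimal sketch using only the Python standard library. It assumes the documented DogStatsD datagram layout `<name>:<value>|<type>|@<sample_rate>|#<tags>` and an Agent listening on the default UDP port 8125 on localhost; the helper names, metric names and tags are made up for illustration. In a real application you would more likely use an existing StatsD or Datadog client library.

```python
import socket


def format_datagram(name, value, metric_type, tags=None, sample_rate=None):
    """Build one datagram: <name>:<value>|<type>[|@<sample_rate>][|#<tag>,<tag>]."""
    parts = [f"{name}:{value}", metric_type]
    if sample_rate is not None:
        parts.append(f"@{sample_rate}")
    if tags:
        # Tags are the Datadog-specific extension to the plain StatsD format.
        parts.append("#" + ",".join(tags))
    return "|".join(parts)


def send_metric(datagram, host="127.0.0.1", port=8125):
    """Fire-and-forget UDP send; assumes an Agent is listening on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_DGRAM) as sock:
        sock.sendto(datagram.encode("utf-8"), (host, port))


# Hypothetical application metrics: "c" is a counter, "h" a histogram, "g" a gauge.
counter = format_datagram("checkout.orders", 1, "c", tags=["env:staging", "plan:pro"])
timing = format_datagram("report.render_time", 0.42, "h", sample_rate=0.5)

print(counter)
print(timing)
```

Calling `send_metric(counter)` on a host that runs the Agent would submit the counter. Because UDP is fire-and-forget, the call does not block the application or fail when no Agent is listening.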
I'm considering moving from Flask to Quart, does anyone have some experience with this migration?
I expect possible problems with connexion, which we use for our OpenAPI specification.
It would be good if someone could point out the downsides of moving to the Quart framework so I can double-check whether my plan is worth doing.
Other libs and tools used in the project: SQLAlchemy, alembic, PostgreSQL, Datadog
cons for now:
- Refactoring uncertainty (not sure how big of a task is it)
- Connexion might not work with Quart (moving to another library)
- ...
Coming from a Ruby background, we've been users of New Relic for quite some time. When we adopted Elixir, the New Relic integration was young and missing essential features, so we gave AppSignal a try. It worked for quite some time; we even implemented a :telemetry reporter for AppSignal. But it was difficult to correlate data across two monitoring solutions, New Relic was undergoing a UI overhaul which made it difficult to use, and AppSignal was missing the flexibility we needed. We had some fans of Datadog, so we gave it a try and it worked out perfectly. Datadog works great with Ruby, Elixir, and JavaScript, and has powerful features our engineers love to use (notebooks, dashboards, very flexible alerting). Cherry on top: thanks to the Datadog Terraform provider everything is written as code, allowing us to collaborate on our Datadog setup.
Via acquisitions and internal product developments over the last year+, Splunk provides really differentiated APM and monitoring for microservices and AWS. I'd recommend giving it a peek if you haven't yet! For some validation, a recent Cloud Observability vendor report by GigaOm came out and ranked Splunk as the "top performer" in the space. Hope this helps in your search.
We use a combination of Java and C# microservices on AWS. We started off with CloudWatch but found it severely lacking - even for basic logging functionality; mostly because the way it sets up log groups is not very useful for distributed applications. It gets hard to find the right logs for the right instance; the interface is rather lacking, etc.
We looked into several alternatives. Our final decision fell upon:
- Datadog, bundled with every Docker image during the CI/CD build process. Datadog's agents hook easily into our existing processes. No real code had to be changed other than the build script and the Dockerfile (https://docs.datadoghq.com/tracing/setup_overview/setup/java/?tab=containers). Datadog has been very good at providing insights on many different levels (performance, errors, infrastructure load) and can be set up to send automated alerts when unexpected behavior happens.
- Kibana, to centralize our logging into a more easily searchable and filterable configuration.
I would also recommend considering Dynatrace. I believe it comes at a higher price, but I fondly look back to my time working with the tool in the past. Dynatrace is remarkably deep and smart; it ended up being very good at helping us find tricky issues like memory leaks, it helped us monitor performance, trace user paths throughout our apps and so much more. I understand they've evolved quite a bit since I last used them, investing heavily into AI components to improve the experience. Worth the consideration.