Site Reliability Engineer at Cooperative Bank Of Thessaly · Needs advice on Graylog and Prometheus

We are a small bank with 5 VMware ESXi servers hosting mainly Windows Server VMs. These VMs run numerous Windows services, and most of them have Microsoft SQL Server and Microsoft IIS installed. We also have some applications that write application logs (mainly to a DB table), a few Hangfire instances, and one MQ Series server.

Management has given me the task of site reliability (I'm fairly new to this). That means all Windows services must run 24/7, so I have to know if a service fails to start. All databases must run properly, so I have to know about locks, query performance, and any SQL Agent job failures. The same goes for IIS: websites and services must be up and running all the time.

In addition, I must collect all the Hangfire job failures (of which there are a lot), as well as general server metrics such as CPU, RAM, disk I/O, and disk sizes.

On top of all this, I must set up alerts via Slack, SMS, or email. So the question is: which tool or stack of tools can achieve all that?
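
To make the "service is down, send an alert" requirement concrete, here is a rough Python sketch of what that check might look like. It assumes the service state is available as Prometheus-style text with a metric shaped like `windows_service_state{name="...",state="..."}` (the shape and label order are an assumption about what an exporter emits, so check your own exporter's output), and it only builds the JSON body for a Slack incoming webhook rather than sending it. The function names and sample data are hypothetical.

```python
import json
import re

# Hypothetical sample of exporter output. The metric name and label order
# are assumptions; verify them against what your exporter really emits.
SAMPLE = """\
# HELP windows_service_state The state of the service
windows_service_state{name="mssqlserver",state="running"} 1
windows_service_state{name="mssqlserver",state="stopped"} 0
windows_service_state{name="w3svc",state="running"} 0
windows_service_state{name="w3svc",state="stopped"} 1
"""

LINE = re.compile(
    r'^windows_service_state\{name="([^"]+)",state="([^"]+)"\}\s+(\S+)$'
)


def services_not_running(exposition_text):
    """Return sorted names of services with no state="running" sample equal to 1."""
    running = {}
    for line in exposition_text.splitlines():
        match = LINE.match(line.strip())
        if not match:
            continue  # skip comments and unrelated metrics
        name, state, value = match.groups()
        running.setdefault(name, False)
        if state == "running" and float(value) == 1.0:
            running[name] = True
    return sorted(name for name, ok in running.items() if not ok)


def slack_payload(host, services):
    """Build the JSON body for a Slack incoming webhook (a plain "text" message)."""
    text = f"{host}: services not running: {', '.join(services)}"
    return json.dumps({"text": text})


if __name__ == "__main__":
    down = services_not_running(SAMPLE)
    if down:
        print(slack_payload("srv1", down))
```

In practice a monitoring stack would normally do this with its own alert rules and notification channels; the sketch only shows the logic such a rule encodes.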

4 upvotes·13.1K views
Replies (1)
Recommends New Relic

Please check out New Relic. It has most of the capabilities you are looking for, including dashboards and alerts with SMS/email notification options.

3 upvotes·2.2K views
Prakash Mohankumar