Oct 15, 2014
Throwing more hardware at Cassandra and no more multi-tenancy
On June 3, 2014, PagerDuty experienced a major issue: their Cassandra pipeline had stopped processing events and was refusing new ones. The result was an outage lasting 3 hours, along with a further period of degraded performance.
"Cassandra seems to have two modes: fine and catastrophe" said one of the PagerDuty engineers, as a seemingly routine repair had cascaded into a very bad situation. Constant memory pressure and underprovisioned amounts of RAM were isolated as a few of the factors that pointed to weaknesses in the way the cluster was set up.
After the outage, each node in the Cassandra cluster was replaced with an m2.2xlarge EC2 instance with 4 cores and 32GB of RAM. PagerDuty also moved away from its multi-tenant Cassandra setup at that point, to better isolate failures in the future.
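
To illustrate what dropping multi-tenancy means in practice, here is a minimal sketch using the DataStax Java driver: instead of several services sharing one set of contact points, each service connects only to its own dedicated cluster, so trouble in one pipeline's cluster cannot spill over into the others. The hostnames, cluster name, and keyspace below are hypothetical, not PagerDuty's actual configuration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class DedicatedClusterExample {
        public static void main(String[] args) {
            // Connect the events service to its own dedicated Cassandra cluster
            // (assumed hostname and keyspace), rather than a shared, multi-tenant one.
            Cluster eventsCluster = Cluster.builder()
                    .addContactPoint("events-cassandra-1.internal") // hypothetical host
                    .build();
            Session eventsSession = eventsCluster.connect("events"); // hypothetical keyspace

            // ... run queries for the events pipeline only ...

            eventsSession.close();
            eventsCluster.close();
        }
    }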

