Oct 15, 2014
Throwing more hardware at Cassandra and no more multi-tenancy
On June 3, 2014, PagerDuty experienced a major issue: their Cassandra pipeline had stopped processing events and was refusing new ones. The result was an outage lasting 3 hours, along with a further period of degraded performance.
"Cassandra seems to have two modes: fine and catastrophe" said one of the PagerDuty engineers, as a seemingly routine repair had cascaded into a very bad situation. Constant memory pressure and underprovisioned amounts of RAM were isolated as a few of the factors that pointed to weaknesses in the way the cluster was set up.
After the outage, each node in the Cassandra cluster was replaced with an m2.2xlarge EC2 instance with 4 cores and 32GB of RAM. PagerDuty also moved away from its multi-tenant Cassandra setup at that point, to better isolate failures in the future.
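
To illustrate what dropping multi-tenancy means in practice, here is a minimal sketch using the DataStax Java driver: instead of several services sharing one set of contact points, each service connects only to its own dedicated cluster, so trouble in one pipeline's cluster cannot spill over into the others. The hostnames, cluster name, and keyspace below are hypothetical, not PagerDuty's actual configuration.

    import com.datastax.driver.core.Cluster;
    import com.datastax.driver.core.Session;

    public class DedicatedClusterExample {
        public static void main(String[] args) {
            // Connect the events service to its own dedicated Cassandra cluster
            // (assumed hostname and keyspace), rather than a shared, multi-tenant one.
            Cluster eventsCluster = Cluster.builder()
                    .addContactPoint("events-cassandra-1.internal") // hypothetical host
                    .build();
            Session eventsSession = eventsCluster.connect("events"); // hypothetical keyspace

            // ... run queries for the events pipeline only ...

            eventsSession.close();
            eventsCluster.close();
        }
    }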

