Originally, we had a single MongoDB replica set that we stored everything on. As we scaled, we realized two things:
- A single Mongo replica set wasn’t going to cut it for our many quickly growing collections
- Analytics and rich searching don’t scale well in Mongo.
To solve for the first item, we now run multiple large scale Mongo deployments with a mix of replica sets and sharded replica sets (depends on the application activity for the given database). In solving for the second item, we now run multiple large Elasticsearch deployments to provide the majority of our rich searching functionality.
We also heavily use Redis across the entire platform for things like distributed locking, caching, and backing part of our job queuing layer. This has led to our most recent (and ongoing!) scaling challenge.
In a distributed world, auditing database access, credential management and rotation, and onboarding can be a nightmare. Someone running a query on a staging DB that’s taking down the test environment forever? Good luck hunting that down. Have a new engineer onboard and they need to run an audit query on the staging DB to see if their new code might break an old schema? Have fun configuring that. Need to run your periodic credential rotation, ...enjoy. This was not only a huge pain point for our team, but me personally, and then strongDM came into the picture.
strongDM acts as a control plane to manage access to every database and server. By centralizing all database credentials & ssh keys in strongDM, onboarding and offboarding becomes much faster.
I seriously cannot imagine working without strongDM now. It’s one of those tools that seamlessly fits into your workflow and you can’t envision work without it.
Mixmax was originally built using Meteor as a single monolithic app. As more users began to onboard, we started noticing scaling issues, and so we broke out our first microservice: our Compose service, for writing emails and Sequences, was born as a Node.js service. Soon after that, we broke out all recipient searching and storage functionality to another Node.js microservice, our Contacts service. This practice of breaking out microservices in order to help our system more appropriately scale, by being more explicit about each microservice’s responsibilities, continued as we broke out numerous more microservices.
A huge part of our continuous deployment practices is to have granular alerting and monitoring across the platform. To do this, we run Sentry on-premise, inside our VPCs, for our event alerting, and we run an awesome observability and monitoring system consisting of StatsD, Graphite and Grafana. We have dashboards using this system to monitor our core subsystems so that we can know the health of any given subsystem at any moment. This system ties into our PagerDuty rotation, as well as alerts from some of our Amazon CloudWatch alarms (we’re looking to migrate all of these to our internal monitoring system soon).