How Mixmax Uses Node and Go to Process 250M Events a Day

Background

Mixmax is the product that your team uses to communicate with the outside world. What Slack did for internal communication, we’re doing for email and external communication.

Building a communication platform means processing a TON of data. Our backend, built primarily in Node and Go, processes up to 250M events a day, with 200k events per minute at peak load. As the glue for an organization’s communication, not only are we processing a huge number of internal events, but we’re also processing data from external sources like CRMs and ATSs, totalling 3.2 million events and amounting to a data volume exceeding 14 GB each hour. We've already scaled our platform up 2x in the past 3 months and plan to grow another 10x this year, all while maintaining the strict "three nines" uptime that our customers expect, as they rely on Mixmax all day to get their work done.

I’m the Head of Platform Engineering at Mixmax, which means that I spend most of my time supporting and unblocking our engineering teams. I’ve spent most of my career working in SaaS, with a stint in security.

Mixmax Engineering

Our engineering team currently consists of 15 engineers with highly varied backgrounds. Today, everyone on the team is a full-stack engineer, although all of us have our own strengths (e.g. Elasticsearch, MongoDB, devops, security). It makes for an amazing mix with everyone bringing their own superpower to the table. Our team is also highly distributed, with engineers in Australia, Canada, Mexico, and the US.

We’re self-organized into a constantly varying number of teams. We have two evergreen teams, our Core and Support teams, which are responsible for the two pillars of engineering departments - stability and quality. Beyond those two teams, we create teams around our product priorities. This means each product team lives for the duration of the development life-cycle, and no longer. This dynamic nature allows us to more seamlessly share and distribute knowledge across the team so that we’re all constantly learning and growing. Teams are also cross-functional, which helps us have consistent and open feedback between everyone in engineering, product and design. Each team defines what their success and failure metrics are, as well as how they will measure their own progress (some teams do two-week sprints, some do Kanban, etc). The one constant is that every team agrees on and publicizes the metrics that they use to monitor their own success.

Orthogonal to our teams, we have two guilds: our web guild and our platform guild. Our guilds are for helping us improve and develop our individual strengths, not for siloing knowledge. Our guilds focus on elevating best practices for their areas of ownership, as well as helping to mentor and provide safety nets for members outside their guild. One clear distinction that we draw is that we explicitly ensure that guild members are not the only ones doing work that would fall into their areas. First and foremost, all engineers are expected to focus on their team, helping them to achieve their goals - as an example, this means that “platform work” is not meant to be worked on by only platform guild members.

Initial architecture and application evolution

Mixmax was originally built using Meteor as a single monolithic app. As more users began to onboard, we started noticing scaling issues, and so we broke out our first microservice: our Compose service, for writing emails and Sequences, was born as a Node.js service. Soon after that, we broke out all recipient searching and storage functionality to another Node.js microservice, our Contacts service. We continued this practice, breaking out many more microservices so that the system could scale more appropriately, with each microservice’s responsibilities made explicit.

This resulted in a system with many Node.js microservices and one still fairly large Meteor service. All of these Node.js services did, and still do, run on Elastic Beanstalk in AWS, as we optimized for developer velocity by using a managed deployment platform. The Meteor app ran in Galaxy, which necessitated a subdomain-based microservice approach for that main Meteor app to talk to the other microservices.

As we began to scale super quickly, with more and more customers joining the platform, we started to see that the Meteor app was still having a lot of trouble scaling due to how it tried to provide its reactivity layer. To be honest, this led to a brutal summer of playing Galaxy container whack-a-mole as containers would saturate their CPU and become unresponsive. I’ll never forget hacking away at building a new microservice to relieve the load on the system so that we’d stop getting paged every 30-40 minutes. Luckily, we’ve never had to do that again! After stabilizing the system, we had to build out two more microservices to provide the necessary reactivity and authentication layers as we rebuilt our Meteor app from the ground up in Node. This also had the added benefit of being able to deploy the entire application in the same AWS VPCs. Thankfully, AWS had also released their ALB product so that we didn’t have to build and maintain our own websocket layer in EC2. All of our microservices, except for one special Go one, are now in Node with an Nginx frontend on each instance, all behind ELBs or ALBs running in Elastic Beanstalk.

Data storage at Mixmax

Originally, we had a single Mongo replica set that we stored everything on. As we scaled, we realized two things:

1. A single Mongo replica set wasn’t going to cut it for our many quickly growing collections.
2. Analytics and rich searching don’t scale well in Mongo.

To solve for the first item, we now run multiple large-scale Mongo deployments with a mix of replica sets and sharded replica sets (depending on the application activity for the given database). In solving for the second item, we now run multiple large Elasticsearch deployments to provide the majority of our rich searching functionality.

We also heavily use Redis across the entire platform for things like distributed locking, caching, and backing part of our job queuing layer. This has led to our most recent (and ongoing!) scaling challenge.

(here’s a screenshot of the tool that we use to administer our worker queues that live on Redis)

Asynchronous processing at Mixmax

At Mixmax, we have multiple queueing systems running that all exhibit very different behaviours, due to all the different ways that our platform is used. We went through quite a few Redis-backed job queueing technologies before we arrived at our current place (from Kue to bull-queue to bee-queue to a mix of bee-queue and AWS Kinesis). Our current stack, a mix of bee-queue and AWS Kinesis, allows us to both seamlessly handle our steadily active queues (e.g. for sending emails) and weather the storm of work that powers our CRM syncing engines. This has been a really fun challenge, as part of this system handles jobs in the high hundreds of millions a day, with sporadic spikes of millions of jobs per minute. We’ve made huge progress here, and we still have a lot of progress to make as we continue to scale this asynchronous processing system.

How we ship

Our workflow centers on getting code live ASAP. Our CI pipeline is built around GitHub as our VCS, tied into Travis CI. Our CD pipeline then continues from there, using AWS Elastic Beanstalk to deploy new application versions.

All developers are able to work on a local copy of the entire infrastructure. Once a developer has their code ready, it goes through review on GitHub - side note, we’re loving all the work that they’re putting into their code review tooling. After code is reviewed and good to go, it lands on our staging environment, where we manually QA a few core flows before we’ll elevate the code to be released on our production environment. For running all of our services locally, we currently use a mix of supervisord and a tool built by one of our engineers named custody.

A huge part of our continuous deployment practices is to have granular alerting and monitoring across the platform. To do this, we run Sentry on-premise, inside our VPCs, for our event alerting, and we run an awesome observability and monitoring system consisting of StatsD, Graphite and Grafana. We have dashboards built on this system to monitor our core subsystems so that we can know the health of any given subsystem at any moment. This system ties into our PagerDuty rotation, as do alerts from some of our CloudWatch alarms (we’re looking to migrate all of these to our internal monitoring system soon).

(screenshot of our monitoring cluster monitoring our strongDM gateways)

Security hygiene at distributed scale

Being a distributed team is in our DNA. One challenge that we’ve faced as a part of being such a distributed team is providing auditable, available, secure and stable access to databases in our private networks for engineers that are authorized and need to have access to them. In a distributed world, auditing database access, credential management and rotation, and onboarding can be a nightmare. Someone running a query on a staging DB that’s taking down the test environment for everyone? Good luck hunting that down. Have a new engineer onboard and they need to run an audit query on the staging DB to see if their new code might break an old schema? Have fun configuring that. Need to run your periodic credential rotation? Enjoy. This was a huge pain point not only for our team but for me personally, and then strongDM came into the picture.

strongDM acts as a control plane to manage access to every database and server. By centralizing all database credentials & SSH keys in strongDM, onboarding and offboarding become much faster. Simply add a user to a group, and since the user never has access to the DB credentials (strongDM handles that), you never need to worry about rotating credentials purely due to employee offboarding. For auditing, since strongDM knows and can monitor each user’s connection, you have direct insight into every single query or access that a user makes - a godsend for auditing. When it comes to periodically rotating keys, it’s even simpler, as you’re rotating credential sets instead of credentials per user, without any action needed from a single other engineer - it simply works. Our engineers have enjoyed strongDM so much that some have even tweeted about it in moments of pure joy.

I seriously cannot imagine working without strongDM now. It’s one of those tools that seamlessly fits into your workflow and you can’t envision work without it.

(screenshots of strongDM in action)

What’s next? Processing all the things.

It’s an exciting time to be at Mixmax: our entire company is scaling quickly (we’ve grown 4x in the last two years, and this trend isn’t slowing down) along with our customer base. This means that we’re processing more data than ever before, and we’re having to get more and more creative to keep up with the amount of data coming in.

We’re currently prototyping our next generation processing systems, building them out in different languages, with different tech - it’s a fantastic time to join and help us figure out our future direction as an engineering team, all while working on a platform that our customers love!