How SendGrid Scaled to 40 Billion Emails Per Month

SendGrid
SendGrid is a digital communication platform that enables businesses to engage with their customers via email reliably, effectively and at scale.

Written by Seth Ammons, Principal Engineer at SendGrid


Some background

Founded in 2009, after graduating from the TechStars program, SendGrid developed an industry-disrupting, cloud-based email service to solve the challenges of reliably delivering emails on behalf of growing companies. We currently send over a billion emails daily (with a peak of over 2 billion emails sent in a single day) for companies like Spotify, Yelp, Uber, and Airbnb. We focus on work that enables our customers to be successful and reach their customers. If you open up your inbox, chances are that some of that mail was sent through SendGrid.

I'm a principal engineer here, and I've worked on nearly all aspects of our backend infrastructure over the last seven years. Currently, I work with the team responsible for our outbound MTA (mail transfer agent, the software that communicates with inbox providers). The team is based out of our Irvine office and I really enjoy coming to work every day. We focus heavily on testable, quality software, and our ability to drive excellence in engineering is aided by our manager, who was formerly a developer and team lead for our delivery and processing teams.

Besides sending to and handling responses from the multitude of inbox providers, our work encompasses the handling of bounced email and unsubscribes. We work closely, often pairing on the more difficult tasks, striving to make sure that wanted mail is delivered, and maintaining suppression systems to prevent unwanted mail from being processed.


Early days at SendGrid

SendGrid's backend architecture has changed a lot since its inception. What started off as a glorified Postfix install grew into a large push-based system, and that system is currently transforming into a more scalable pull-based model. As part of this transition, we are moving more and more of our services into the cloud.

In our legacy, push-based system, SMTP or HTTP API requests came in via our edge nodes, and those nodes pushed the requests to our processing cluster's on-disk queues. Once there, mail was mutated per user settings (link tracking, unsubscribe footers, dynamic content substitution, etc.), and then pushed to our MTA software's on-disk queues. After mail was placed into the MTA's queue, the MTA worked to send it out as quickly and efficiently as possible while applying algorithms to enhance the deliverability rate of that mail.

This system worked really well, but it did have some downsides such as potential event processing delays or even potential mail loss in the event of total node failure. In light of these drawbacks, we've worked towards a pull-based model backed by a distributed file system.


Our current architecture

In our new, pull-based model, the basic systems in place are still there (edge node receiving, processing, delivering); however, we've flipped the dynamic from pushing onto queues to pulling from our custom distributed queues instead (more on this shortly). This change allows our systems to be ephemeral, stateless services that can be spun up or down to match customer needs in a more real-time fashion, and this will be more evident as we increase our presence in and usage of Amazon Web Services (AWS).
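To make the push-to-pull flip concrete, here is a minimal sketch in Go of a stateless worker pulling from a queue. It is an illustration under assumptions, not SendGrid's code: the `Queue` interface, the in-memory `memQueue`, and the rule that a message is acknowledged only after it has been handled are all hypothetical. The point it shows is that when the queue, not the worker, holds the state, a worker can disappear mid-message and the message can be handed out again.

```go
package main

import (
	"errors"
	"fmt"
)

// Message is a hypothetical unit of work (one email to process).
type Message struct {
	ID string
}

// Queue is a hypothetical pull-style queue. A pulled message stays
// "in flight" inside the queue until the worker acknowledges it.
type Queue interface {
	Pull() (Message, bool)
	Ack(id string)
}

// memQueue is an in-memory stand-in for a durable, shared queue.
type memQueue struct {
	pending  []Message
	inflight map[string]Message
}

func newMemQueue(msgs ...Message) *memQueue {
	return &memQueue{pending: msgs, inflight: make(map[string]Message)}
}

func (q *memQueue) Pull() (Message, bool) {
	if len(q.pending) == 0 {
		return Message{}, false
	}
	m := q.pending[0]
	q.pending = q.pending[1:]
	q.inflight[m.ID] = m
	return m, true
}

func (q *memQueue) Ack(id string) { delete(q.inflight, id) }

// Requeue makes every unacknowledged message available again, modeling
// what a queue might do after a worker vanishes.
func (q *memQueue) Requeue() {
	for id, m := range q.inflight {
		q.pending = append(q.pending, m)
		delete(q.inflight, id)
	}
}

// errSimulated stands in for a worker failing partway through a message.
var errSimulated = errors.New("simulated failure while processing")

// drain pulls until the queue is empty and acknowledges a message only
// after handle succeeds. It returns how many messages were acknowledged.
func drain(q Queue, handle func(Message) error) int {
	done := 0
	for {
		m, ok := q.Pull()
		if !ok {
			return done
		}
		if err := handle(m); err != nil {
			continue // never acked, so the queue still owns the message
		}
		q.Ack(m.ID)
		done++
	}
}

func main() {
	q := newMemQueue(Message{ID: "a"}, Message{ID: "b"}, Message{ID: "c"})
	flaky := func(m Message) error {
		if m.ID == "b" {
			return errSimulated
		}
		return nil
	}
	fmt.Println("first pass acked:", drain(q, flaky))
	q.Requeue()
	fmt.Println("second pass acked:", drain(q, func(Message) error { return nil }))
}
```

Because the worker keeps nothing between iterations, adding or removing workers is just a matter of starting or stopping processes that run the same loop.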

The majority of our backend services are written (or being rewritten) in Go. The concurrent nature of our systems made Go a natural and easy choice as we moved from Perl's AnyEvent and Python's Twisted. These services leverage Redis and/or Redis Sentinel for caching and MySQL for data persistence. Metrics are emitted to Graphite and displayed in Grafana. Logs are forwarded from STDOUT to Syslog, where we have utilities that slurp these logs up for Kafka and Splunk. We have alerts set up through PagerDuty, which is fed by data from Splunk Alerts and Sensu checks.

Remember when I mentioned a custom, distributed queue? For queueing between services, one of our teams developed a specialized "heap of heaps" service called SGS (SendGrid Scheduler) that is backed by Ceph. This core piece of technology was needed to ensure that we could fairly dequeue messages without crowding out smaller senders or less popular recipient domains when larger users send out large blasts to popular inbox domains.
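The internals of SGS are not described here, so the following is only one plausible reading of "heap of heaps", sketched in Go with the standard `container/heap` package. The assumptions are loud ones: each sender gets an inner heap of its own messages (ordered by a hypothetical enqueue sequence number), and an outer heap orders senders by how many messages they have already been served. All type and function names are invented for illustration, and nothing here involves Ceph.

```go
package main

import (
	"container/heap"
	"fmt"
)

// msgHeap is the inner heap: one sender's pending sequence numbers, oldest first.
type msgHeap []int

func (h msgHeap) Len() int           { return len(h) }
func (h msgHeap) Less(i, j int) bool { return h[i] < h[j] }
func (h msgHeap) Swap(i, j int)      { h[i], h[j] = h[j], h[i] }
func (h *msgHeap) Push(x any)        { *h = append(*h, x.(int)) }
func (h *msgHeap) Pop() any {
	old := *h
	m := old[len(old)-1]
	*h = old[:len(old)-1]
	return m
}

// senderQueue pairs an inner heap with a count of messages already served.
type senderQueue struct {
	name   string
	served int
	msgs   msgHeap
}

// senderHeap is the outer heap: the least-served sender sits at the root.
type senderHeap []*senderQueue

func (h senderHeap) Len() int { return len(h) }
func (h senderHeap) Less(i, j int) bool {
	if h[i].served != h[j].served {
		return h[i].served < h[j].served
	}
	return h[i].name < h[j].name // deterministic tie-break
}
func (h senderHeap) Swap(i, j int) { h[i], h[j] = h[j], h[i] }
func (h *senderHeap) Push(x any)   { *h = append(*h, x.(*senderQueue)) }
func (h *senderHeap) Pop() any {
	old := *h
	q := old[len(old)-1]
	*h = old[:len(old)-1]
	return q
}

// Scheduler is the toy heap of heaps; only senders with pending mail sit in the outer heap.
type Scheduler struct {
	senders senderHeap
	byName  map[string]*senderQueue
	nextSeq int
}

func NewScheduler() *Scheduler {
	return &Scheduler{byName: make(map[string]*senderQueue)}
}

// Enqueue adds one message for sender.
func (s *Scheduler) Enqueue(sender string) {
	q, ok := s.byName[sender]
	if !ok {
		q = &senderQueue{name: sender}
		s.byName[sender] = q
	}
	wasIdle := len(q.msgs) == 0
	heap.Push(&q.msgs, s.nextSeq)
	s.nextSeq++
	if wasIdle {
		heap.Push(&s.senders, q) // (re)join the outer heap
	}
}

// Dequeue returns the oldest message of the least-served sender.
func (s *Scheduler) Dequeue() (sender string, seq int, ok bool) {
	if len(s.senders) == 0 {
		return "", 0, false
	}
	q := s.senders[0]
	seq = heap.Pop(&q.msgs).(int)
	q.served++
	if len(q.msgs) == 0 {
		heap.Pop(&s.senders) // drained: leave until it enqueues again
	} else {
		heap.Fix(&s.senders, 0) // served changed: restore heap order
	}
	return q.name, seq, true
}

func main() {
	s := NewScheduler()
	for _, name := range []string{"big", "big", "big", "big", "small"} {
		s.Enqueue(name)
	}
	for name, seq, ok := s.Dequeue(); ok; name, seq, ok = s.Dequeue() {
		fmt.Println(name, seq)
	}
}
```

In this toy run the small sender's single message comes out second rather than fifth, even though it was enqueued behind a four-message blast. A real scheduler would likely key the outer heap on something richer than a served count (recipient domain, time windows, priorities), but the nesting is the same.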

Roping this back in, what does our MTA's tech stack look like? The previous, legacy version was Perl with AnyEvent that had a parent process that forked off child processes. The parent scheduled work, and the children delivered mail. The switch to Go removed the callback hell and the forking, as Go's concurrency model is much easier to work with compared to AnyEvent. Being statically typed and compiled lets us actually know what variables are in scope in any given function, the absence of which was a major drawback to the Perl service. So, yeah, we like Go.

Requests come into the MTA, and we secure a connection to the inbox provider using either the customer's IP address, or a shared one from our pool. Once we establish a connection, we blast out as much mail for that user to that email domain as the inbox provider will allow. We rinse and repeat, ensuring that each customer and each domain get a fair opportunity for delivering mail.
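The "rinse and repeat" loop can be pictured as rounds. In the hedged Go sketch below, each round groups queued mail by (customer, recipient domain), takes at most as many messages per group as the inbox provider allows on one connection, and leaves the remainder for the next round. The `Mail` and `Batch` types, the `nextRound` function, and the `perConn` limit callback are all hypothetical names, and IP selection is left out entirely.

```go
package main

import "fmt"

// Mail is a hypothetical queued message.
type Mail struct {
	User, Domain, ID string
}

// connKey identifies one connection: a customer sending to a domain.
type connKey struct {
	User, Domain string
}

// Batch is the mail sent over one connection in one round.
type Batch struct {
	Key  connKey
	Mail []Mail
}

// nextRound builds one batch per (user, domain) pair in first-seen
// order, capped at perConn(domain) messages, and returns the leftover
// mail for a later round.
func nextRound(queue []Mail, perConn func(domain string) int) (batches []Batch, rest []Mail) {
	index := make(map[connKey]int) // key -> position in batches
	for _, m := range queue {
		k := connKey{User: m.User, Domain: m.Domain}
		i, seen := index[k]
		if !seen {
			i = len(batches)
			index[k] = i
			batches = append(batches, Batch{Key: k})
		}
		if len(batches[i].Mail) < perConn(m.Domain) {
			batches[i].Mail = append(batches[i].Mail, m)
		} else {
			rest = append(rest, m)
		}
	}
	return batches, rest
}

func main() {
	queue := []Mail{
		{User: "acme", Domain: "inbox.example", ID: "a1"},
		{User: "acme", Domain: "inbox.example", ID: "a2"},
		{User: "acme", Domain: "inbox.example", ID: "a3"},
		{User: "tiny", Domain: "inbox.example", ID: "t1"},
	}
	limit := func(string) int { return 2 } // assumed provider limit per connection
	for round := 1; len(queue) > 0; round++ {
		var batches []Batch
		batches, queue = nextRound(queue, limit)
		for _, b := range batches {
			fmt.Println("round", round, b.Key.User, "->", b.Key.Domain, len(b.Mail), "messages")
		}
	}
}
```

With a limit of two, the larger customer's third message waits for the second round while the smaller customer's only message goes out in the first.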


Testing Transactional Email

One of our recent and interesting challenges is ensuring system fidelity as we continue to move from a Perl-focused stack to Go. There are basically two schools of thought on how to achieve this: test all the things, or deploy quickly and revert quickly in production. We've done a lot of both. We've put a fair amount of effort into improving our tooling and monitoring around quickly deploying and quickly reverting. In fact, just yesterday, our newest team member and I deployed a new caching strategy, and it only made it onto one production node (after working great in our pre-production environments) for around a minute before it was reverted.

There are certain things that are nearly impossible to test before production, and in this case, it was iptables. Production-specific settings aside, we have a system where customers expect things to just work™. At our scale, and on our team, using production as testing could potentially result in lost mail. At the same time, we need to refactor and/or rewrite code to keep up with our growth. Doing this with confidence requires tests, tests, and more tests.

We have our suites of unit tests, unit-integration tests (tests that verify our immediate integration with databases, caches, and first-order connected services), and system-integration tests that actually use our email sending API to send an email through our entire system and ensure that expected events are sent to users' webhooks and messages are received in inboxes with the expected data.


Dockerizing SendGrid

For that unit-integration layer, we leverage Docker. Our incoming edge is where the upstream service finishes processing a message and hands it to us for delivery, and our outgoing edge is where we actually communicate with someone's inbox. We don't actually want to set up a bunch of receiving MTAs and such, but we still need to test behavior at that layer. Our solution is still a work in progress, but it gets the lion's share of use cases covered so we can confidently refactor and push new features and know we did not break anything.

This Docker setup leverages DNSMasq for setting up MX and A records and ensures they point to running mock inbox sinks. These inboxes are configured from a base image with multiple options. We can specify that the sink's TLS certificate is expired or improperly set up, and we can have the sinks respond slowly or with given errors at different parts of the SMTP conversation. We can ensure that we are backing off and deferring email if the inbox provider says to do so. This detailed faking of the outside world allows us to automate all kinds of outside behavior and ensure that our services behave as expected.
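The kind of behavior such a sink lets you assert on can be sketched in a few lines of Go. The reply-code classes come from RFC 5321 (2yz positive completion, 4yz transient failure, 5yz permanent failure). Everything else is an assumption made for illustration: the `Outcome` names, the doubling backoff policy, and the one-minute base and one-hour cap are hypothetical and are not taken from SendGrid's MTA.

```go
package main

import (
	"fmt"
	"time"
)

// Outcome is a hypothetical classification of a final SMTP reply.
type Outcome int

const (
	Delivered Outcome = iota // 2yz: accepted by the inbox provider
	Deferred                 // 4yz: transient failure, retry later
	Bounced                  // 5yz: permanent failure, do not retry
	Other                    // intermediate (1yz/3yz) or malformed codes
)

var names = [...]string{"delivered", "deferred", "bounced", "other"}

// classify maps an SMTP reply code to an Outcome using the RFC 5321
// reply-code classes.
func classify(code int) Outcome {
	switch {
	case code >= 200 && code < 300:
		return Delivered
	case code >= 400 && code < 500:
		return Deferred
	case code >= 500 && code < 600:
		return Bounced
	default:
		return Other
	}
}

// Assumed retry policy, chosen only for this example.
const (
	baseDelay = time.Minute
	maxDelay  = time.Hour
)

// backoff doubles base once per prior attempt and caps the result at max.
func backoff(attempt int, base, max time.Duration) time.Duration {
	d := base
	for i := 0; i < attempt; i++ {
		d *= 2
		if d >= max {
			return max
		}
	}
	return d
}

func main() {
	for _, code := range []int{250, 421, 550} {
		fmt.Println(code, names[classify(code)])
	}
	for attempt := 0; attempt < 8; attempt++ {
		fmt.Println("attempt", attempt, "wait", backoff(attempt, baseDelay, maxDelay))
	}
}
```

A test against a mock sink configured to answer with a 4yz code would then check that the message is deferred and that the next attempt is scheduled no earlier than the computed delay.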

In addition to the fancy Docker setup, we have captured and sanitized production logs for the behavior of our legacy Perl MTA, and we can test that the log output from the new Go version behaves the same way as the old version. These tests are set up to allow us to switch between the legacy and new version of the MTA and ensure that both systems behave in a legacy-compatible way. Not only can we ensure that we operate against a variety of issues we've seen over time from inboxes, but we know that the newest version of our MTA continues to cover all the same expected behaviors of the legacy version.
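A log-parity check of this kind can be sketched in Go as follows. The log line format, the `msg_id=` field, and the choice of which fields count as volatile are all hypothetical; the idea is simply to mask whatever legitimately differs between runs (timestamps, generated IDs) and then compare the captured legacy output with the new output line by line.

```go
package main

import (
	"fmt"
	"regexp"
	"strings"
)

// volatile matches fields assumed to differ between runs: UTC timestamps
// such as 2017-03-01T10:00:00Z and hypothetical msg_id=... tokens.
var volatile = regexp.MustCompile(`\d{4}-\d{2}-\d{2}T\d{2}:\d{2}:\d{2}Z|msg_id=\S+`)

// normalize masks volatile fields so two runs can be compared.
func normalize(line string) string {
	return strings.TrimSpace(volatile.ReplaceAllString(line, "<X>"))
}

// diffLogs returns the 0-based indexes of lines whose normalized forms
// differ. A line present in only one of the logs counts as a difference.
func diffLogs(legacy, candidate []string) []int {
	var diffs []int
	n := len(legacy)
	if len(candidate) > n {
		n = len(candidate)
	}
	for i := 0; i < n; i++ {
		if i >= len(legacy) || i >= len(candidate) ||
			normalize(legacy[i]) != normalize(candidate[i]) {
			diffs = append(diffs, i)
		}
	}
	return diffs
}

func main() {
	legacy := []string{
		"2017-03-01T10:00:00Z msg_id=abc123 status=delivered code=250",
		"2017-03-01T10:00:01Z msg_id=def456 status=deferred code=421",
	}
	candidate := []string{
		"2018-06-02T11:30:00Z msg_id=zzz999 status=delivered code=250",
		"2018-06-02T11:30:05Z msg_id=yyy888 status=bounced code=550",
	}
	fmt.Println("differing lines:", diffLogs(legacy, candidate))
}
```

Running the same comparison with the legacy binary and the new binary behind a switch is what turns the captured logs into a regression suite for both.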

Oh, and these tests are still fast. All of our unit-integration tests are run and an artifact produced and ready for deployment in under five minutes in our CI system. If it is not pulling Docker images, our local development environment can run these unit-integration tests in under 10 seconds.

As described above, we develop locally in Docker. Our docker-compose file spins up containers with fancy DNS settings and all our dependencies, allowing us to test the MTA against a variety of MX and TLS settings, alongside a variety of potential inbox responses and behaviors. Everyone uses their editor of choice and we often pair up on more complex tasks to prevent siloed system understanding.

When we've gone through code reviews (every code and config change goes through a code review) and feel good about the level of automated testing (no one can sign off on their own code's functionality; a quality assurance engineer or other developer has to verify functionality), we merge our changes via a bot that interacts with GitHub (the bot maintains our versions and change logs). After BuildKite has a green build and our binary is shipped to our repo servers, we are good to roll out deploys to our data centers and to keep pushing the needle on the performance of our system.


In Closing

Our systems are changing, our capacity is increasing, and our problems continue to be interesting. Every Monday, I'm still excited to show up, roll up my sleeves, and help push our product forward to tackle greater and greater scale. It is great to be part of a company that strives to keep innovating and improving and to better serve our customers every day.

SendGrid is always looking for top talent to join our team. For more information on positions at SendGrid, please visit our careers page at: https://sendgrid.com/careers/.
