How Sentry Receives 20 Billion Events Per Month While Preparing to Handle Twice That

Sentry
Developers use Sentry to cut time to resolution for application issues from five hours to five minutes.

By James Cunningham, Operations Engineer, Sentry.


About Sentry

Sentry illustration

Unless your engineering team is staffed by angels who commute down to the office from heaven every morning, we’re pretty confident you run into plenty of problems developing and iterating on your applications in production. Sentry provides all the tools you need to find, triage, reproduce, and fix application-level issues before your users even know there was a problem. With the added bonus that you won’t get any more nasty looks from support engineers at happy hour.

By automating error detection and aggregating and adding important context to stack traces, Sentry helps you proactively correct the errors that are doing the most harm to your business more efficiently and durably and with minimal disruption. Closing the gap between the product team and customers improves productivity, speeds up the entire development process, and helps engineers focus on what they do best: build apps that make users’ lives better.

I was personally a Sentry user way before I was an employee. Early on at my previous company, I was tasked with upgrading the open-source error tracking service that hadn’t really been maintained or used for a while. I reached out for help and heard back from David (Sentry’s co-founder) and Matt (Sentry’s second engineer), meeting two of my future co-workers on IRC years before I ever saw their faces (protip: connect with Matt on LinkedIn).


This is Matt


They were incredibly helpful and, when I went looking for a new job, I thought, “Hey, this is a very nice piece of software, and the people who are running it are really mindful of their community. I’d love to be a part of that.” Today, I spend my waking hours happily keeping Sentry’s hosted service operational, available, and responsive to our exponentially-increasing event volume (editor’s note: when he’s not trolling new hires on Slack for their taste in hip-hop and Fruit Gushers).

A Powerful Side Project

Sentry started as (and remains) an open-source project, growing out of an error logging tool David built in 2008. He displayed a truly shrewd notion of branding even then, giving the project a catchy name that companies the world over remain jealous of to this day: django-db-log. For the longest time, Sentry’s subtitle on GitHub was “A simple Django app, built with love.” A slightly more accurate description probably would have included Starcraft and Soylent alongside love; regardless, this captured what Sentry was all about.

That original build nine years ago was Django and Celery (an asynchronous task queue for Python), with Postgres as the database and Redis as the power behind Celery.

A Fast-Growing Company

As you might expect, Sentry usage has grown exponentially over the past decade, and the infrastructure has changed and matured to accommodate massive scale. We now host the open-source project as a SaaS product. Sentry has SDKs for just about every framework, platform, and language and integrations with the most popular developer tools, which helps make it incredibly easy to adopt. Today, Sentry is central to the error tracking and resolution workflows of tens of thousands of organizations and more than 100,000 active users around the world, many of whom support implementations for some of the biggest properties on the internet: Dropbox, Uber, Stripe, Airbnb, Xbox Live, HubSpot, and more. That’s 5 billion events per week, just from the hosted service.

When a customer sends events to Sentry, they don’t receive a laundry list of notifications; they get the aggregate issue, with counts of how often it has occurred and which of their users are experiencing it. This is all presented very simply and cleanly in Sentry, but if a user wants individual events, we’ll provide those too. We save every single event we accept, which gets very expensive to do in a traditional relational database.
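To make the aggregation idea concrete, here is a minimal sketch in Python. It is not Sentry's grouping code: the `fingerprint` and `user` keys and the `Issue` class are assumptions made up for illustration, and real grouping is far more involved.

```python
from collections import defaultdict
from dataclasses import dataclass, field


@dataclass
class Issue:
    """Aggregate view of all events that share one fingerprint."""
    count: int = 0
    users: set = field(default_factory=set)


def aggregate(events):
    """Roll raw events up into issues keyed by a (hypothetical) fingerprint."""
    issues = defaultdict(Issue)
    for event in events:
        issue = issues[event["fingerprint"]]
        issue.count += 1                 # how often the issue occurred
        issue.users.add(event["user"])   # which users experienced it
    return dict(issues)


events = [
    {"fingerprint": "TypeError:checkout", "user": "alice"},
    {"fingerprint": "TypeError:checkout", "user": "bob"},
    {"fingerprint": "TypeError:checkout", "user": "alice"},
    {"fingerprint": "KeyError:profile", "user": "carol"},
]
issues = aggregate(events)
```

Four events collapse into two issues here, and each issue carries its own occurrence count and set of affected users.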

One of the first improvements Sentry made to address scalability was storing all of these events in a distributed key-value store. There are a variety of key-value stores out there, all with their promises and pitfalls, but when evaluating solutions, we ultimately chose Riak. Our Riak cluster does exactly what we want it to: write event data to more than one location, grow or shrink in size upon request, and persist through normal failure scenarios.

The first major infrastructure project that I contributed to when joining Sentry was horizontally scaling our ability to execute offline tasks. As Sentry runs throughout the day, there are about 50 different offline tasks that we execute—anything from “process this event, pretty please” to “send all of these cool people some emails.” There are some that we execute once a day and some that execute thousands per second.

Managing this variety requires a reliably high-throughput message-passing technology. We use Celery with RabbitMQ as its message broker, and we stumbled upon a great RabbitMQ feature called Federation that allows us to partition our task queue across any number of RabbitMQ servers and gives us the confidence that, if any single server gets backlogged, others will pitch in and distribute some of the backlogged tasks to their consumers.
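As a rough mental model of that behavior, consider the toy sketch below. It is plain Python, not RabbitMQ or Celery code, and the `FederatedQueues` class and server names are invented for illustration: each server holds its own partition of the queue, and a consumer whose local queue is empty pulls from the most backlogged peer.

```python
from collections import deque


class FederatedQueues:
    """Toy model: tasks are partitioned across servers; idle servers pull from busy peers."""

    def __init__(self, server_names):
        self.queues = {name: deque() for name in server_names}

    def publish(self, server, task):
        """Enqueue a task on one server's partition."""
        self.queues[server].append(task)

    def consume(self, server):
        """Return the next task for a consumer attached to `server`, or None."""
        local = self.queues[server]
        if local:
            return local.popleft()
        # Local queue is empty: take work from the most backlogged peer.
        busiest = max(self.queues.values(), key=len)
        return busiest.popleft() if busiest else None


cluster = FederatedQueues(["rabbit-1", "rabbit-2"])
for n in range(3):
    cluster.publish("rabbit-1", f"task-{n}")

# A consumer on the idle server picks up work from the backlogged one.
picked = cluster.consume("rabbit-2")
```

The point of the model is only the shape of the guarantee: no single backlogged server strands its tasks while consumers elsewhere sit idle.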

Another project we’ve undertaken is setting up safeguards in front of our application to protect it from unpredictable and unwanted traffic. When accepting events, we would be crazy to just expose the Python web process to the public Internet and say, “Alright, give me all you got!” Instead, we use two different proxying services that sit in front of our web machines:

  • NGINX, our product-aware proxy, handles many of the upper bounds that we have deemed reasonable. It is responsible for a variety of bounds, but its most popular one is protecting Sentry from exceedingly large event volumes. Every so often, a user will run into a problem where they’ve deployed their code out into the abyss, and their event volume clocks in at a few zeroes higher than what they signed up for.
  • In front of NGINX, we use another proxying service called HAProxy, which acts as a delta of connections without any of that product awareness logic and has much higher throughput. All it does is accept connections and send them off to different NGINX servers, allowing us to gracefully add or remove NGINX servers as we see fit.
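The kind of upper bound described in the NGINX bullet can be sketched as a per-project fixed-window counter. This is an illustration in Python rather than NGINX configuration, and the class name, the limit, and the window length are all assumptions, not Sentry's real values.

```python
import time
from collections import defaultdict


class FixedWindowLimiter:
    """Allow at most `limit` events per project in each `window`-second window."""

    def __init__(self, limit, window=60.0, clock=time.monotonic):
        self.limit = limit
        self.window = window
        self.clock = clock  # injectable so the example is deterministic
        self.counts = defaultdict(int)  # (project, window index) -> events seen

    def allow(self, project):
        """Return True if the event is within bounds, False if it should be rejected."""
        bucket = (project, int(self.clock() // self.window))
        if self.counts[bucket] >= self.limit:
            return False  # over the bound: reject before the app sees it
        self.counts[bucket] += 1
        return True


now = [0.0]  # fake clock, in seconds
limiter = FixedWindowLimiter(limit=2, window=60.0, clock=lambda: now[0])
results = [limiter.allow("project-a") for _ in range(3)]  # third one is rejected
now[0] = 61.0  # the next window starts with a fresh budget
after_reset = limiter.allow("project-a")
```

A production limiter would also expire old windows instead of keeping every counter forever; that is left out to keep the sketch short.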


Everything is fine now


An Evolving Architecture

Sentry began life as a traditional Django application, and has gone through a couple of architecture iterations since. The current Sentry dashboard, which is what customers use to browse and debug their production issues, has evolved into a single-page application written in React and Reflux (an early Flux library). We write ES6 and transpile it to browser-compatible JavaScript using Babel and Webpack. For fetching and submitting data, we communicate with the Django backend through a straightforward REST-based HTTP API.

The event processing pipeline, which is responsible for handling all of the ingested event data that makes it through to our offline task processing, is written primarily in Python. For particularly intense code paths, like our source map processing pipeline, we have begun rewriting those bits in Rust. Rust’s lack of garbage collection makes it a particularly convenient language for embedding in Python. It allows us to easily build a Python extension where all memory is managed from the Python side (if the Python wrapper gets collected by the Python GC, we clean up the Rust object as well).


Sentry Releases animation


A Simple Deploy Workflow

For the most part, Sentry is still a classically monolithic app. This is driven, in part, by the fact that Sentry is still open-source, and we want to make it easy for our community to install and run the server themselves. To do this, we provide installation details for a Docker image that contains all of Sentry’s core services in one place. This monolithic nature makes contributing to and deploying Sentry ourselves relatively straightforward.

When someone wants to commit a change to the codebase, it is submitted as a pull request to our public project on GitHub. From there, Travis CI runs a set of parallelized builds, which include not only unit and integration tests, but also visual regression tests that are managed through Percy. Since we’re still an open-source project that supports different relational databases, we run test suites not only for Postgres, but also for MySQL and SQLite.

Once all tests are green, the code has been reviewed, and any detected UI changes have been approved, the code is merged through GitHub. We then use an internal open-source tool named Freight to build and deploy our Docker image to production. Additionally, Freight injects the only closed source piece of Sentry, our billing platform. Once the image is in production, we trigger a rolling restart of every Sentry container to pick up the new image.
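For readers unfamiliar with the term, a rolling restart can be sketched in a few lines of Python. This is not how Freight or our tooling is implemented: the `restart` and `is_healthy` callbacks, the batch size, and the stop-on-unhealthy rule are assumptions added for illustration.

```python
def rolling_restart(containers, restart, is_healthy, batch_size=1):
    """Restart containers batch by batch; stop early if one comes back unhealthy.

    Returns the list of containers that were restarted.
    """
    restarted = []
    for start in range(0, len(containers), batch_size):
        batch = containers[start:start + batch_size]
        for container in batch:
            restart(container)
            restarted.append(container)
        if not all(is_healthy(c) for c in batch):
            break  # leave the remaining containers on the old image
    return restarted


log = []
done = rolling_restart(
    ["web-1", "web-2", "web-3", "web-4", "web-5"],
    restart=log.append,                       # stand-in for a real restart call
    is_healthy=lambda name: name != "web-3",  # pretend web-3 fails its check
    batch_size=2,
)
```

Because only one batch is down at a time, the rest of the fleet keeps serving traffic, and in this sketch a bad image stops spreading once a batch fails its check.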


Sentry plus Slack integration GIF


An Unpredictable World

One of our biggest challenges is that Sentry’s traffic is inherently unpredictable, and there’s simply no way to foresee when a user’s application is going to melt down and send us a huge influx of events. On bare metal, we handled this by preparing for the worst(ish) and over-provisioning machines in case of an event deluge. Unfortunately, as demand grew, our time window for needing new machines shrank. We started demanding more from our provider, requesting machines before they were needed, and keeping common machines idle for days on end, waiting to see which component needed them the most.

For that reason, we made the leap to Google Cloud Platform (GCP) in July 2017 to give ourselves greater flexibility. Calling it a “leap” makes it sound impulsive, but the transition actually took months of planning. And no matter how long we spent projecting resource usage within Google Compute Engine, we never would have predicted our increased throughput. Due to GCP’s default microarchitecture, Haswell, we noticed an immediate performance increase across our CPU-intensive workloads, namely source map processing. The operations team spent the next few weeks making conservative reductions in our infrastructure, and still managed to cut our costs by roughly 20%. No fancy cloud technology, no giant infrastructure undertaking -- just new rocks that were better at math.

You can find way more detail about it on the Google Cloud Platform Blog.

Observability and Action

A big reason we can sustain Sentry is that it falls into a category of observability tooling that requires a non-trivial amount of resources to host. We run Sentry ourselves because we’ve gotten pretty good at it. We rely on Sentry to track errors in our production app and help us set priorities for iteration, based on user experience and impact.

But when it comes to the rest of our monitoring stack, we apply the same thinking as the users signing up for Sentry’s hosted service every day: “It’s better to pay for uptime in dollars than in engineering hours.” (If you haven’t used Sentry’s hosted service, it only takes a couple minutes and a few lines of code to set up.)

We use a few toolchains outside of our production environment. I could write an essay detailing each (and I probably will), but let’s just outline how I would get notified that we’ve regressed in our 95th percentile of request latency:

  • Each host running a web server sends the timing of requests to Stripe’s Veneur
  • Veneur creates histograms of request timings and forwards those to Datadog
  • A Datadog threshold alert detects we’ve gone higher than 500ms
  • The threshold alert is configured to notify a Slack channel and a PagerDuty rotation
  • The PagerDuty rotation notifies both operations engineers currently on-call
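The threshold check at the heart of that chain can be sketched as follows. This is plain Python, not Veneur or Datadog code: the nearest-rank percentile method and the function names are assumptions, and only the 500ms threshold and the 95th percentile come from the steps above.

```python
import math


def percentile(samples, pct):
    """Nearest-rank percentile of a non-empty list of samples."""
    ordered = sorted(samples)
    rank = math.ceil(pct * len(ordered) / 100)  # 1-based rank
    return ordered[max(rank, 1) - 1]


def should_alert(timings_ms, threshold_ms=500, pct=95):
    """True when the chosen percentile of request timings exceeds the threshold."""
    return percentile(timings_ms, pct) > threshold_ms


# 5% of requests are slow: the 95th percentile is still a fast request.
healthy = [120] * 95 + [900] * 5
# 10% of requests are slow: the 95th percentile is now a slow request.
regressed = [120] * 90 + [900] * 10

healthy_alert = should_alert(healthy)
regressed_alert = should_alert(regressed)
```

A percentile is used instead of an average so that a handful of very slow requests does not page anyone, while a regression that affects more than 5% of requests does.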


Sentry welcome gif

We introduce every new employee with their own welcome gif


Fantastic Co-Workers

Our Engineering org is split into four teams in two programs: Product and Infrastructure. Their names do a pretty solid job describing their purposes, but:

  • Product is broken into the Workflow and Growth teams. Workflow focuses specifically on how our users interact with Sentry throughout their own workflows and development processes. Growth looks at the tweaks we can make that will increase the likelihood that a new user will find Sentry relevant, onboard effectively, and stick around to use it more and more.

  • Infrastructure is broken into the Platform and Operations teams. Platform is dedicated to all of the Sentry code that powers our API, including event ingestion. Operations is where I live, and we’re dedicated to building, deploying, maintaining, and monitoring all of the components that keep sentry.io stable.

We also have an unofficial fifth team that plays a large part in Sentry’s development and will always outnumber the others: our open-source contributors. Sentry’s entire codebase is right on GitHub for the whole world to see, and many improvements to our service have been introduced by users and community members who don’t work here.

Other Stacks

Just as Sentry is a part of many software teams’ stacks, we rely on a number of additional commercial and open-source services to help run our business. We use Stripe to handle customer billing, SendGrid for reliable email delivery, Slack for team communication, Google Analytics for basic web analytics, BigQuery for data warehousing, and Jira for project management.

On the open-source side, our growth and BI teams use Redash to derive useful statistics from our data. We use Jekyll to publish sentry.io and other online marketing content, like our blog.

Closing


Sentry team photo


Open source, open company. That’s our credo, and it really captures what we’re all about. As I mentioned earlier, I applied for a job at Sentry because it’s such a nice piece of software, and the people who run the company are mindful about the role of the community. Since everyone who works here is also a member of the open-source community, that mindfulness extends to and flows between employees.

Growth is inevitable here. The hard decision is not what to scale, but when. It’s the Operations team’s responsibility to put engineering hours into the right initiative and balance scale with security, reliability, and productivity. Maybe you want to make some of those hard decisions on my team?

Or maybe operations isn’t your thing, but you want to build something open-source. Want to contribute to Sentry beyond just code? We’re hiring pretty much across the organization and would love to talk to you if you’ve read this entire post and think you still might be as into Sentry as I am.
