How Raygun Processes Millions of Error Events Per Second


Who am I?

My name is John-Daniel Trask, and I’m the Co-Founder & CEO of Raygun.

I started writing software when I was nine years old, and now have more than 25 years of experience with software development. In high school, I started selling software commercially, and I have always been deeply passionate about the intersection of business and software. In both cases, it’s about achieving more than a single human can and amplifying human capability.


John-Daniel Trask, CEO of Raygun


These days I don’t actively write code on the Raygun product, but I still lead product direction and help our performance architecture team make critical tech decisions.


Background

Raygun provides a window into how users are really experiencing your software applications.

We help companies like Microsoft, Nordstrom, and Coca-Cola give their customers error-free user experiences by monitoring their apps.

Every day, the Raygun platform processes billions of data points with crash reporting, real user monitoring and APM (application performance management).

Unlike traditional logging, Raygun silently monitors applications for issues affecting end users in production, then allows teams to pinpoint the root cause behind a problem with greater speed and accuracy, providing detailed diagnostic information for developers. Raygun makes fixing issues 1000x faster than traditional debugging methods using logs and incomplete information.


Raygun crash interface


From the outset, Raygun has been full-stack, able to track crashes and user experiences across 25+ programming languages and platforms, including web, mobile and server side. This includes iOS, Android and Xamarin stacks for mobile.

Customers integrate a small SDK, and with a couple of lines of code, have full visibility into the health of their software, across their entire technology stack.
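To make that concrete, here’s a minimal sketch of what wiring up the .NET provider looks like. The API key is a placeholder, and the exact namespaces and helpers vary between SDK versions and platforms, so treat it as illustrative rather than definitive.

using System;
using Mindscape.Raygun4Net;

public static class ErrorReporting
{
    // A single client, constructed with your application's API key (placeholder)
    private static readonly RaygunClient _raygun = new RaygunClient("YOUR_APP_API_KEY");

    public static void Run(Action work)
    {
        try
        {
            work();
        }
        catch (Exception ex)
        {
            // Ship the exception (and its stack trace) to Raygun,
            // then rethrow so normal error handling still applies
            _raygun.Send(ex);
            throw;
        }
    }
}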


Raygun’s initial (MVP) architecture and stack

Our original MVP consisted of a customer-facing Microsoft ASP.NET MVC web application with an API layer written using Mono on Linux machines. The marketing site was a mixture of N2 CMS and WordPress for the blog.

We chose .NET because it was a platform our team was familiar with, and we knew it well enough to apply the performance tips and tricks that get the most out of it. That experience helped us get to market faster and deliver great performance.

We get questioned about it frequently since startups don’t tend to look at Microsoft as much, but it’s been an absolute competitive edge for us. We’ve also benefited significantly from Microsoft’s continued investment and the performance of .NET Core, which has made us very happy with our decision to choose .NET at the start.


Raygun’s current architecture and stack

The core Web application is still a Microsoft ASP.NET MVC application. Not too much has changed from a fundamental technology standpoint.

We originally built using Mono, which just bled memory and would need to be constantly recycled. So we looked around at the options and what would be well suited to the highly transactional nature of our API. We settled on Node.js, feeling that the event loop model worked well given the lightweight workload of each message being processed. This served us well for several years.

The biggest change to date has been increasing throughput by 2,000 percent with a change from Node.js to .NET Core.

When we started to look at .NET Core in early 2016, it became quite obvious that being able to asynchronously hand off to our queuing service greatly improved throughput. Unfortunately, at the time, Node.js didn’t provide an easy mechanism to do this, while .NET Core had great concurrency capabilities from day one.

This meant that our servers spent less time blocking on the hand-off and could start processing the next inbound message. This was the core component of the performance improvement.
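As a rough illustration of the pattern (not our actual ingestion code), here’s what an ASP.NET Core endpoint that asynchronously hands a payload off to a queue can look like. IQueueClient and its EnqueueAsync method are hypothetical stand-ins for whichever queuing service sits behind the API, and the route is a placeholder.

using System.IO;
using System.Threading.Tasks;
using Microsoft.AspNetCore.Mvc;

// Hypothetical abstraction over the queuing service
public interface IQueueClient
{
    Task EnqueueAsync(string payload);
}

[ApiController]
public class IngestionController : ControllerBase
{
    private readonly IQueueClient _queue;

    public IngestionController(IQueueClient queue) => _queue = queue;

    [HttpPost("entries")] // illustrative route only
    public async Task<IActionResult> Post()
    {
        // Read the inbound crash report payload
        using var reader = new StreamReader(Request.Body);
        var payload = await reader.ReadToEndAsync();

        // Awaiting the hand-off yields the thread back to the server
        // instead of blocking it, so it can pick up the next request
        await _queue.EnqueueAsync(payload);

        return Accepted();
    }
}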

Beyond just the async handling, we constantly benchmark Node and several of the common web frameworks. Our most recent performance test between Hapi, Express, and a few other frameworks found some interesting results, which you can read about here.

We were previously using Express to handle some aspects of the web workload, and from our own testing we could see that it introduced a layer of performance cost. We were comfortable with that at the time, but the work Microsoft has invested in .NET Core’s web server capabilities has been a huge win.

Memory usage

Another popular question about our increased throughput was around memory usage: whether we achieved any gain, and whether this was something we needed to manage carefully at high throughput. We did indeed see a gain, although in both cases memory usage was fairly static.

Our Node.js deployments would operate with a 1GB footprint, while the .NET Core deployment reduced that footprint to 400MB. Both deployments involve a level of ‘working’ memory associated with each concurrent active request, and along with the smaller overall footprint, that operating overhead was also reduced in the move to .NET Core.

Raygun isn’t the only team benefiting from this kind of throughput improvement.

The MMO Age of Ascent also benefited from a switch to .NET Core, processing 2,300 percent more requests per second – a truly amazing result!

Performance is absolutely a feature. We know that the better the performance we deliver, the happier we make our customers, and the more efficiently we can run our infrastructure.

Ditching WordPress

There’s no doubt WordPress is a great, user-friendly CMS. When we started the company, our blog wasn’t really a top priority, and it ended up hosted on a fairly obscure server within our setup. That didn’t change much until recently, when it became harder to manage and make significant updates.

As our marketing team grew, so did the amount of traffic that found us through our content marketing. We found ourselves struggling to maintain our WordPress install given the number of theme updates, plugins and security patches that needed to be applied.

Our biggest driver to find an alternative, however, was just how slow WordPress is at serving content to the end user. I know there will be die-hard fans out there with ways to set things up so that WordPress sites load quickly, but we needed something a lot more streamlined.

We could see in our own Real User Monitoring tool that many users were experiencing page load speeds of over five seconds, even longer in worst case scenarios.

Hugo is an open-source static site generator that has enabled us to cut load times more than fivefold and make our blog far more maintainable across the whole team. The Raygun marketing site still runs on a .NET CMS called N2, but we plan to swap that out for Hugo as well in the future.

Hosting infrastructure

We chose AWS because, at the time, it was really the only cloud provider to choose from. We tend to use their basic building blocks (EC2, ELB, S3, RDS) rather than vendor-specific components like databases and queuing. We deliberately decided to do this to ensure we could provide multi-cloud support, or potentially move to another cloud provider if the offering was better for our customers. While we’re satisfied with AWS, we do review our decision each year and have looked at the Azure and Google Cloud offerings.

We’ve used c3.large nodes for both the Node.js deployment and the .NET Core deployment. Both sit as backends behind an nginx instance and are managed with EC2 auto-scaling groups behind a standard AWS load balancer (ELB).

Business tools and utilities

Intercom has been part of our stack since the very beginning, assisting with customer communications. It seems like every SaaS product uses it these days; you see the familiar pop-up chat window or the Intercom icon in the bottom right-hand corner of many web apps.

I think they’ve done a good job of innovating in the right areas; the problems we’ve had with the product, or any missing functionality, have usually been addressed in later updates. They have a solid product that does what it needs to do well.



What does our engineering team look like?

Our Chief Technology Officer and Co-Founder is Jeremy Boyd, who has been a Microsoft Regional Director for more than a decade. He sets the technology direction that underpins the Raygun Platform development.

Our engineering team spends significant time on performance-related concerns due to the scale of data that is being ingested, managed and alerted for our customers. Performance improvements not only provide better customer experiences but also drive down our cost to serve, meaning we can reinvest in improving our product.

We take an agile approach to engineering at Raygun, with weekly sprints. The team decides which tasks they want to work on from the list of priorities that goes into each sprint. Each product area has a dedicated ‘champion’, so backend infrastructure, application front end and marketing work are all undertaken with someone overseeing the output from an engineering standpoint.

Though we use JIRA for managing sprints, issues and tickets, we’ve actually found that physically printing the items out and putting them onto a whiteboard has been very effective for managing the workload and its progress from ‘to-do’ to ‘done’.

At the moment we haven’t split into separate per-product teams, as we want to encourage team members to learn various parts of the system so that all the knowledge doesn’t sit with a single employee. Longer term, as the team grows, we do envisage splitting into more dedicated groups rather than just front end, back end and infrastructure.


How we build, test, and deploy at Raygun

At Raygun, we use GitHub for our source code management. We have a continuous integration infrastructure in place, using JetBrains TeamCity. This undertakes builds, runs tests and generates the deployable assets on each commit that gets pushed to our repositories.

Regarding environments, we run several stages in the pipeline:

Office environment

A set of environments in our Wellington office where the team can do internal acceptance testing with the Product team.

Beta environment

This environment runs on the same infrastructure as our production environment and helps simulate real-world conditions.

Production

Live to customers.

We perform deployments with Octopus Deploy. We deeply integrate our pipeline into Slack, so the team always has visibility of who’s pushing code, where deployments are going, and so on.
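As a simple illustration of that kind of visibility (not our exact setup), a deployment step can post a message to a Slack incoming webhook; the webhook URL and message format below are placeholders.

using System.Net.Http;
using System.Text;
using System.Threading.Tasks;

public static class DeploymentNotifier
{
    private static readonly HttpClient _http = new HttpClient();

    public static Task NotifyAsync(string deployer, string project, string environment)
    {
        // Slack incoming webhooks accept a simple JSON payload
        var json = "{\"text\":\"" + deployer + " deployed " + project +
                   " to " + environment + "\"}";

        return _http.PostAsync(
            "https://hooks.slack.com/services/YOUR/WEBHOOK/URL", // placeholder
            new StringContent(json, Encoding.UTF8, "application/json"));
    }
}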


Raygun Real User Monitoring


We also, of course, run Raygun on Raygun. If a bad deployment starts causing errors for users, they pop up in our operations Slack channel and we’re alerted instantly. It provides a real-time feedback loop telling us whether we need to roll back a deployment, and acts as our safety net should any issues that impact users slip through to production.


Our biggest engineering hurdle - and how we fixed it

An engineering problem that stands out was from November 2015. At the time, several team members, including myself, were in Dublin, Ireland for a conference. We were out at dinner and started getting monitoring alerts about the load on our API being too high. It would auto-scale out, but if it auto-scaled too much, alerts were triggered.

Our philosophy has always been that app outages are bad, but losing inbound data is unacceptable. Given this was an issue with our ingestion services, it was a P1 event that needed resolving.

While it was evening in Dublin, it was daytime in New Zealand, almost on the opposite side of the earth. I connected with the team in NZ, who could start triaging the issue immediately and begin reviewing our metrics and reports, hoping they would show some simple but dumb mistake that could be easily fixed.

I excused myself from the dinner and returned to my hotel room. The issue continued to worsen, and the pressure to find the fault was rising. A couple of the other team members at the dinner turned up at the hotel room and we set up a mini war room right there. We had an open line with the New Zealand team, with all of us combing through any data we could find to identify what had happened.

We theorized a range of possible causes for this behavior, assigned each team member one to investigate and report back on, and cleared through many dead ends quite quickly.

As time progressed, one of the team members noticed that one of our databases was running red hot. That was the lead we needed. It turned out that the API servers were hitting that database. But why? That database held the valid API keys that we accept data from.

Digging deeper, we identified that while the API nodes were caching the active keys, any data sent with an API key that wasn’t in the cache triggered a check against the database. Basically, the code was:

if (!_cache.ContainsKey(apiKey))
{
    // Cache miss: check the database for the key
    // (LookUpApiKeyInDatabase stands in for the actual database call)
    var keyDetails = LookUpApiKeyInDatabase(apiKey);

    // Only valid keys were put into the cache, so a key that didn't
    // exist meant a database query on every single request
    if (keyDetails != null)
    {
        _cache.Add(apiKey, keyDetails);
    }
}

But what would happen if somebody was sending a lot of data, with a non-existent API key?

It turned out this would effectively result in a denial-of-service attack. The API servers themselves weren’t running particularly hot, but the database was getting smashed so hard that the queries were stuck in query queues.

We deployed a fix, and immediately saw the processing rate improve, bringing our resource consumption on the API layer down by more than 80%.
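One way to close this kind of gap (not necessarily the exact change we shipped) is to cache negative lookups as well, so a flood of requests with an unknown API key hits the database at most once per key. Here’s a sketch under that assumption, with IApiKeyStore as a hypothetical data-access abstraction.

using System.Collections.Concurrent;

public interface IApiKeyStore
{
    bool ApiKeyExists(string apiKey);
}

public class ApiKeyValidator
{
    // Caches both positive and negative results
    private readonly ConcurrentDictionary<string, bool> _cache =
        new ConcurrentDictionary<string, bool>();

    private readonly IApiKeyStore _database;

    public ApiKeyValidator(IApiKeyStore database) => _database = database;

    public bool IsValid(string apiKey)
    {
        // The database is consulted at most once per distinct key;
        // unknown keys are remembered as invalid rather than re-queried
        return _cache.GetOrAdd(apiKey, key => _database.ApiKeyExists(key));
    }
}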


What’s next for Raygun?

When it comes to performance, getting the complete picture is very compelling. Our vision is to provide the most sophisticated and integrated monitoring platform.

Raygun Github Integration

To achieve that vision, we recently announced the launch of our APM product to complement the existing platform. This is a huge step forward: Raygun will be the only platform that can connect software crashes, server-side traces and end-user performance data into a single, complete view.

At launch, Raygun APM will support the Microsoft stack (.NET, .NET Core), with additional language support coming later in 2018. Our focus is on lifting APM out of its current early-2000s feel and making it actionable and automated. Raygun APM will include an advanced rules engine that detects common programming mistakes (e.g. slow methods, N+1 queries, chatty API calls) and generates issues that can be assigned to team members to resolve.
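For the unfamiliar, an N+1 query is the classic shape of mistake a rules engine like this is meant to flag: one query to fetch a list, then one more query per item inside a loop. The repository types below are hypothetical; the point is the pattern.

using System.Collections.Generic;
using System.Linq;

public class Order { public int CustomerId; }
public class Customer { public string Name; }

// Hypothetical data-access interfaces
public interface IOrderRepository { IEnumerable<Order> GetRecent(); }
public interface ICustomerRepository
{
    Customer GetById(int id);                               // one query per call
    IEnumerable<Customer> GetByIds(IEnumerable<int> ids);   // one batched query
}

public class OrderReport
{
    private readonly IOrderRepository _orders;
    private readonly ICustomerRepository _customers;

    public OrderReport(IOrderRepository orders, ICustomerRepository customers)
    {
        _orders = orders;
        _customers = customers;
    }

    // N+1: one query for the orders, then another query for every order
    public IEnumerable<string> SlowCustomerNames()
    {
        foreach (var order in _orders.GetRecent())
            yield return _customers.GetById(order.CustomerId).Name;
    }

    // Better: fetch all the customers with a single batched query
    public IEnumerable<string> FastCustomerNames()
    {
        var customerIds = _orders.GetRecent().Select(o => o.CustomerId).Distinct();
        return _customers.GetByIds(customerIds).Select(c => c.Name);
    }
}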
