Rust at OneSignal

OneSignal
OneSignal is a high volume mobile push, web push, email, and in-app messaging service.

This post is by Joe Wilm of OneSignal

Earlier last year, we announced OnePush, our notification delivery system written in Rust.

In this post, we will cover improvements in our delivery capabilities since then, an interactive tour of OnePush’s subsystems, and reflections on our experience shipping production Rust code. We hope you'll find it insightful!

Delivery Stats

OnePush was built to scale deliveries at OneSignal. To know whether this endeavor was a success, we collect metrics such as historical delivery counts and delivery throughput. Here's how OnePush is performing:

  • OneSignal had ~10,000 users at the start of 2016 and has over 110,000 at the time of publishing this post. (Over 10x growth!)
  • We've increased the number of daily notifications sent by 20x in the same period.
  • OnePush delivers over 2 billion notifications per week.
  • OnePush is fast - We've observed sustained deliveries up to 125,000/second and spikes up to 175,000/second.

The title image on this post is a screenshot from our live delivery monitoring. Each bar represents deliveries occurring in that second, and each vertical division denotes 5,000 deliveries. The colors represent different platforms like iOS, Android, Chrome WebPush, etc. Every single one of them was delivered by OnePush.

OnePush

OnePush comprises several subsystems for loading notifications, delivering them over HTTP/1.1 and HTTP/2, and processing events and results.

Choosing Rust

Choosing the programming language for a core system is a big decision. If not careful, one could end up with months of time invested and get stuck writing library code instead of the application itself. This is less of a concern with programming languages that have a mature ecosystem, but that's not exactly Rust just yet. On the other hand, Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules.

Given that we now have a production system written in Rust, it's obvious which side of this trade we landed on. Our experience has been positive overall and indeed we have had fantastic results. The following sections discuss the specific pros and cons we considered for building OnePush in Rust, what risks we accepted at the outset, the successes we had, and issues we ran into.

Reasons to not use Rust

The Rust ecosystem is young. Even if there exists a library for your purpose, it's not guaranteed to be robust enough for a production deployment. Additionally, many libraries today have a "truck factor" of 1. If the library's developer gets hit by a truck, it's going to be on you to maintain it.

Next, Rust's tooling story is weak. You can use tools like Racer and YCM to get pretty far, but they fail in a lot of cases. Good tooling is a necessity, especially for developers who are getting up to speed.

Having team members (who may be unfamiliar with Rust) contribute to the project may take a lot of "ramp-up" time. This risk has turned out to be quite real, but it hasn't stopped other members of our team from contributing patches to the project. Mentoring from team members more proficient with the language and familiar with the code base helped a lot here.

Finally, iteration times can be long. This wasn't something we anticipated up front, but build times have become onerous for us. A build from scratch now falls into the category of "go make coffee and play some ping-pong." Recompiling a couple of changes isn't exactly quick either.

Before settling on Rust, we considered writing OnePush in Go. Go has a lot going for it for this sort of application - its concurrency model is perfectly suited for managing many async TCP connections, and the ecosystem has good libraries for HTTP requests, Redis and PostgreSQL clients, and serialization. Go is also more approachable for someone unfamiliar with the language; this makes the code base more accessible to the rest of your team. Go's developer tools have also had more time to mature than Rust's.

Why choose Rust

Despite the negatives and the presence of a good alternative, Rust has a lot going for it that makes it a good choice for us. As mentioned earlier:

"Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules."

This is huge. Being able to encode constraints of your application in the type system makes it possible to refactor, modify, or replace large swaths of code with confidence. The type system is our ultimate "move quickly and don't break things" secret weapon.

Rust's error handling model forces developers to handle every corner case. Even if there is a system with the potential to panic, it can be moved into its own thread for isolation. More recently, it has become possible to catch panics within a thread instead of only at the boundary. Languages like Go make it too easy to ignore errors.
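
As a rough illustration of the two isolation options mentioned above (this is not OnePush code), the sketch below turns a panic into an ordinary `Err`, first with `std::panic::catch_unwind` inside the current thread and then at a thread boundary via `JoinHandle::join`. The names `risky_parse`, `describe`, `run_caught`, and `run_in_thread` are made up for the example, and it assumes the default unwinding panic strategy (with `panic = "abort"` nothing can be caught).

```rust
use std::any::Any;
use std::panic;
use std::thread;

/// Hypothetical task that panics on one specific input.
fn risky_parse(input: &str) -> usize {
    if input.is_empty() {
        panic!("empty payload");
    }
    input.len()
}

/// Turns a panic payload into a printable message.
/// `panic!` with a literal yields `&'static str`; with format arguments it yields `String`.
fn describe(cause: Box<dyn Any + Send>) -> String {
    if let Some(msg) = cause.downcast_ref::<&str>() {
        msg.to_string()
    } else if let Some(msg) = cause.downcast_ref::<String>() {
        msg.clone()
    } else {
        "unknown panic".to_string()
    }
}

/// Catches the panic inside the current thread.
fn run_caught(input: &str) -> Result<usize, String> {
    panic::catch_unwind(|| risky_parse(input)).map_err(describe)
}

/// Isolates the task in its own thread; a panic surfaces as `Err` from `join`.
fn run_in_thread(input: String) -> Result<usize, String> {
    thread::spawn(move || risky_parse(&input))
        .join()
        .map_err(describe)
}
```

Either way the caller gets a `Result` back and can decide what to do with the failure, although the default panic hook will still print the panic message to stderr.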

Next, OnePush needed to be fast. Rust makes writing multithreaded programs quite easy. The Send and Sync traits work together to ensure such programs are free from data races.
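
A minimal sketch of what that compile-time checking looks like in practice, using only the standard library. The function `count_deliveries` and its parameters are hypothetical and not taken from OnePush; the point is only that the shared counter must be a `Send + Sync` type for the code to compile at all.

```rust
use std::sync::atomic::{AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Spawns `workers` threads that each record `per_worker` deliveries
/// in one shared counter, then returns the total.
fn count_deliveries(workers: u64, per_worker: u64) -> u64 {
    // `Arc<AtomicU64>` is `Send + Sync`, so it may be moved into and shared across threads.
    let delivered = Arc::new(AtomicU64::new(0));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let delivered = Arc::clone(&delivered);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    delivered.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();

    // Joining every worker guarantees all increments are visible to the final load.
    for handle in handles {
        handle.join().expect("worker panicked");
    }
    delivered.load(Ordering::Relaxed)
}
```

Swapping `Arc` for `Rc`, or `AtomicU64` for `Cell<u64>`, makes `thread::spawn` reject the closure at compile time, because `Rc` is not `Send` and `Cell` is not `Sync`.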

At the end of the day, our OnePush service is just a program optimized for sending a lot of HTTP requests. The library ecosystem offered everything we needed to build this system: an async HTTP/2 client, an async HTTP/1.1 client, a Redis client library and a PostgreSQL client library. We are fortunate that the Rust community is full of talented and ambitious developers who have already published a great many quality libraries that suit our specific needs.

Finally, the developer leading the effort had experience with and a strong preference for Rust. There are plenty of technologies that would have met our requirements, but deferring to a personal preference made a lot of sense. Having engineers excited about what they are working on is incredibly valuable. Such intrinsic motivation increases developer happiness and reduces burnout. Imagine going to work every day and getting to work on something you're excited about! Developer happiness is important to us as a company. Being able to provide so much by going with one technology versus another was a no-brainer.

Risks

Aside from the general reasons to not use Rust discussed above, we had a few additional concerns for this particular project.

As a glorified HTTP client, OnePush needed to be able to send lots of HTTP/1.1 requests very quickly. In the beginning, this wasn't quite as true because of our scale and because Android notifications could be batched into single requests. Going forward, we expected a huge increase in HTTP/1.1 outgoing request volume due to growth and the new WebPush specification with encrypted payloads. Hyper (Rust's HTTP library) had an async branch that was just a prototype when we started. We hoped that, by the time we truly needed an async client, it would be ready.

As it turned out, the initial async Rotor-based branch of Hyper never stabilized because tokio and futures were announced in August 2016. By the time we really needed the async branch, we ended up having to spend a week or two debugging, stress-testing and fixing the Rotor-based hyper::Client. This turned out to be OK since it was a chance to give back to the Rust community.

Since we would be on the nightly channel for serde derive and clippy lints, another risk was spending a lot of time doing rustc upgrades. We avoided this situation by pinning to specific versions of the compiler and upgrading infrequently. When we did upgrade, the process required finding a recent rustc that was supported by both libraries. This will become less of an issue very soon with the advent of Macros 1.1.

Finally, Solicit (Rust's HTTP/2 library) uses three threads per connection. Although this is fine in isolation, having 20,000 connections quickly becomes expensive. We've mitigated this issue by using a short keep-alive to limit the number of active connections and by taking advantage of Apple's HTTP/2 provider API (APNs), which allows 500 requests in-flight per connection.
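
To put numbers on this, here is a back-of-the-envelope sketch that uses only the two figures above (three threads per connection, 500 requests in flight per connection). The constant and function names are made up for illustration, and the workload of 20,000 concurrent requests is just the connection count from the text reused as an example.

```rust
/// Threads Solicit uses per HTTP/2 connection (figure from the text).
const THREADS_PER_CONNECTION: u64 = 3;
/// Requests allowed in flight on one APNs connection (figure from the text).
const MAX_IN_FLIGHT_PER_CONNECTION: u64 = 500;

/// Threads needed to keep `connections` HTTP/2 connections open.
fn threads_for(connections: u64) -> u64 {
    connections * THREADS_PER_CONNECTION
}

/// Fewest connections that can carry `in_flight` concurrent requests
/// when each connection is fully multiplexed (rounds up).
fn connections_for(in_flight: u64) -> u64 {
    (in_flight + MAX_IN_FLIGHT_PER_CONNECTION - 1) / MAX_IN_FLIGHT_PER_CONNECTION
}
```

With these figures, 20,000 open connections cost `threads_for(20_000)`, or 60,000 threads, whereas 20,000 concurrent requests spread over fully multiplexed connections need only `connections_for(20_000)`, or 40 connections and 120 threads.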

Unexpected Issues

For the most part, we knew what we were getting into building such a system in Rust. However, one thorn in our side that we didn't anticipate was rust-openssl upgrades. We are stuck on an earlier version of rust-openssl since the Solicit library depends on an API that has been removed since v0.8.0. This means that we are unable to upgrade other dependencies which rely on rust-openssl until we fix the Solicit issue.

Another minor issue at one time was the limited test framework. A common feature for test frameworks is to have some setup and teardown steps that run before and after a test. We say this issue was minor because we were able to work around its absence by generating many tests declaratively with macros (discussed below).

Successes

Writing OnePush in Rust has been hugely successful for us. We've been able to easily meet our performance and scaling goals with the application. OnePush is capable of delivering over 100k notifications per second and efficiently maximizes the use of system resources. Despite being highly multithreaded, race conditions have not been an issue for us. Even better, OnePush needs very little attention. We were able to leave it running without any issues through the holiday break.

Regressions are very infrequent. There's a huge class of bugs in languages like Ruby that just aren't possible in Rust. When combined with good test coverage, it becomes difficult to break things - all thanks to Rust's fantastic type system. This isn't just about regressions either. The compiler and type system make refactoring basically fool-proof. We like to say that Rust enables belligerent refactoring - making dramatic changes and then working with the compiler to bring your project back to a working state.

The macro system has been another big win. Our favorite example of how this saves us engineering time is using macros for writing tests declaratively. For example, a large set of our tests cover the Terminal. Each test takes some Events as input, and then the state of Redis and Postgres is checked after processing the event. The macro system enabled us to remove all of the boilerplate for these tests and declaratively say what the event is and what the expected outcome should be. Writing a test for this system today looks like this:

// Invoking terminal test-writing macro
push_test! {
    // The part before the arrow ends up being the test name.
    // The `response` describes an `Event`, and the rest describes the system
    // state after processing it. There are more parameters that can be
    // specified, but the default values are acceptable in this case.
    apns_success => {
        response: apns::Response::Success,
        success: 1,
        sending_done: true
    },
    // .. and so on
}

Writing a lot of similar tests in this fashion enables us to get a lot of coverage without a lot of work. It also helps us work around the lack of features in the Rust test system (such as before/after hooks).
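
For readers curious how such a macro could be put together, here is a minimal `macro_rules!` sketch in the same spirit. It is not the actual `push_test!` from OnePush: the `Response` and `Outcome` types and the `process` function are simplified stand-ins invented for this example, and the real tests check Redis and Postgres rather than an in-memory struct.

```rust
/// Hypothetical stand-in for the events fed to the system under test.
#[derive(Debug, Clone, Copy)]
enum Response {
    Success,
    Failure,
}

/// Hypothetical stand-in for the state checked after processing an event.
#[derive(Debug, Default, PartialEq)]
struct Outcome {
    success: u32,
    failed: u32,
    sending_done: bool,
}

/// Hypothetical stand-in for the system under test.
fn process(response: Response) -> Outcome {
    let mut outcome = Outcome::default();
    match response {
        Response::Success => outcome.success += 1,
        Response::Failure => outcome.failed += 1,
    }
    outcome.sending_done = true;
    outcome
}

/// Expands each `name => { response: ..., field: value, ... }` entry
/// into its own `#[test]` function.
macro_rules! push_test {
    ($($name:ident => {
        response: $resp:expr,
        $($field:ident : $val:expr),* $(,)?
    }),* $(,)?) => {
        $(
            #[test]
            fn $name() {
                // Shared setup would go here, standing in for a "before" hook.
                let expected = Outcome {
                    $($field: $val,)*
                    // Fields that are not listed keep their default values.
                    ..Outcome::default()
                };
                assert_eq!(process($resp), expected);
                // Shared teardown would go here, standing in for an "after" hook.
            }
        )*
    };
}

push_test! {
    apns_success => {
        response: Response::Success,
        success: 1,
        sending_done: true
    },
    apns_failure => {
        response: Response::Failure,
        failed: 1,
        sending_done: true
    },
}
```

Because every generated function shares one body, the setup and teardown steps only have to be written once, inside the macro.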

The final thing we want to comment on here is serde. This library enables adding a #[derive(Deserialize)] attribute to a struct and getting a deserialize implementation. Combined with our serde-redis library, this makes it possible to load data out of Redis like so:

/// A person has a name and an ID.
///
/// This is just some data with a derived
/// Deserialize implementation
#[derive(Deserialize)]
struct Person {
    name: String,
    id: u64
}

// Gets a `Person` out of redis
let person: Person = redis.hgetall("person")?;

On the left-hand side of the line fetching person, there's a binding name with a type annotation. On the right-hand side, there's a call to Redis with HGETALL, and a ?. The ? is a bit of error handling; if the request is successful and deserialization works, person will be a valid Person, and the name and id fields can be used directly with knowledge that they were returned from Redis. If something goes wrong, like Redis is unreachable or there is data missing for the Person (such as a missing id), an error is returned from the current function.

This is really powerful! We can just describe our data, add this derive attribute and then safely load the data out of Redis. To get the same effect in a dynamic language, one would need to load this dictionary out of Redis and write a bunch of boilerplate to validate that the returned fields are correct. This sort of thing makes Rust more expressive than many high-level languages.
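
To make the error path concrete, here is a standard-library-only sketch of roughly the validation one would otherwise write by hand, with `?` returning early on the first problem. It does not use serde or serde-redis: the `LoadError` type and the `field` and `person_from_hash` functions are hypothetical, and the HGETALL reply is assumed to arrive as a `HashMap<String, String>`.

```rust
use std::collections::HashMap;

#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    id: u64,
}

/// Hypothetical error type covering the failure cases described above.
#[derive(Debug, PartialEq)]
enum LoadError {
    MissingField(&'static str),
    InvalidId(String),
}

/// Looks up one field, turning an absent key into an error.
fn field<'a>(
    hash: &'a HashMap<String, String>,
    key: &'static str,
) -> Result<&'a String, LoadError> {
    hash.get(key).ok_or(LoadError::MissingField(key))
}

/// Builds a `Person` from a hash reply; each `?` returns the error to the caller.
fn person_from_hash(hash: &HashMap<String, String>) -> Result<Person, LoadError> {
    let name = field(hash, "name")?.clone();
    let raw_id = field(hash, "id")?;
    let id = raw_id
        .parse::<u64>()
        .map_err(|_| LoadError::InvalidId(raw_id.clone()))?;
    Ok(Person { name, id })
}
```

A hash without an `id` key yields `Err(LoadError::MissingField("id"))`, and the caller never sees a half-built `Person`. The derived `Deserialize` implementation spares us from writing this kind of code for every struct.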

Open Source

Early adoption in an ecosystem means there are lots of opportunities for open source contributions. The most notable of our contributions is a project called serde-redis, a Redis deserialization backend for serde. We've also had the opportunity to contribute several patches to Hyper's Rotor-based async client. We use that client in OnePush and have made billions of HTTP requests with it.

What's next

We've come far with OnePush, but there's still more work to do! Here are just a few of our upcoming projects related to OnePush:

  • Upgrade to Hyper's Tokio-based async implementation. We probably won't be super early adopters here since we've got an HTTP client with a lot of production miles on it right now.
  • Rework result processing to use futures. The Terminal's concurrency from threads is limited, whereas something backed by mio could have much higher throughput. This would require futures-compatible Redis and Postgres clients.
  • Replace Solicit's thread-based async client with a mio-based one. We've actually got a prototype of something from earlier in 2016.

We also have a new internal application written in Rust which we hope to blog about soon! It's a core piece of our monitoring which is responsible for collecting statistics from our production systems and storing them in InfluxDB.

Conclusion

We've had fantastic results building one of our core systems in Rust. It has delivered many billions of notifications, and it's delivering more and more each day. We hope that sharing our experience as early adopters in the Rust ecosystem will be helpful to others when making similar decisions. We've certainly found Rust to be a secret weapon for quickly building robust systems.

Like what we're doing? We're hiring!
