Rust at OneSignal

OneSignal
OneSignal is a high volume mobile push, web push, email, and in-app messaging service.

This post is by Joe Wilm of OneSignal

Earlier last year, we announced OnePush, our notification delivery system written in Rust.

In this post, we will cover improvements in our delivery capabilities since then, an interactive tour of OnePush’s subsystems, and reflections on our experience shipping production Rust code. We hope you'll find it insightful!

Delivery Stats

OnePush was built to scale deliveries at OneSignal. To know whether this endeavor was a success, we collect metrics such as historical delivery counts and delivery throughput. Here's how OnePush is performing:

  • OneSignal had ~10,000 users at the start of 2016 and has over 110,000 at the time of publishing this post. (Over 10x growth!)
  • We've increased the number of daily notifications sent by 20x in the same period.
  • OnePush delivers over 2 billion notifications per week.
  • OnePush is fast - We've observed sustained deliveries up to 125,000/second and spikes up to 175,000/second.

The title image on this post is a screenshot from our live delivery monitoring. Each bar represents deliveries occurring in that second, and each vertical division denotes 5,000 deliveries. The colors represent different platforms like iOS, Android, Chrome WebPush, etc. Every single one of them was delivered by OnePush.

OnePush

OnePush is composed of several subsystems for loading notifications, delivering notifications across HTTP/1.1 and HTTP/2, and processing events and results.

Choosing Rust

Choosing the programming language for a core system is a big decision. If not careful, one could end up with months of time invested and get stuck writing library code instead of the application itself. This is less of a concern with programming languages that have a mature ecosystem, but that's not exactly Rust just yet. On the other hand, Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules.

Given that we now have a production system written in Rust, it's obvious which side of this trade-off we landed on. Our experience has been positive overall, and we have had fantastic results. The following sections discuss the specific pros and cons we considered for building OnePush in Rust, the risks we accepted at the outset, the successes we had, and the issues we ran into.

Reasons to not use Rust

The Rust ecosystem is young. Even if there exists a library for your purpose, it's not guaranteed to be robust enough for a production deployment. Additionally, many libraries today have a "truck factor" of 1. If the library's developer gets hit by a truck, it's going to be on you to maintain it.

Next, Rust's tooling story is weak. You can use tools like Racer and YCM to get pretty far, but they fail in a lot of cases. Good tooling is a necessity, especially for developers who are getting up to speed.

Having team members (who may be unfamiliar with Rust) contribute to the project may take a lot of "ramp-up" time. This risk has turned out to be quite real, but it hasn't stopped other members of our team from contributing patches to the project. Mentoring from team members more proficient with the language and familiar with the code base helped a lot here.

Finally, iteration times can be long. This wasn't something we anticipated up front, but build times have become onerous for us. A build from scratch now falls into the category of "go make coffee and play some ping-pong." Recompiling a couple of changes isn't exactly quick either.

Before settling on Rust, we considered writing OnePush in Go. Go has a lot going for it for this sort of application - its concurrency model is perfectly suited for managing many async TCP connections, and the ecosystem has good libraries for HTTP requests, Redis and PostgreSQL clients, and serialization. Go is also more approachable for someone unfamiliar with the language; this makes the code base more accessible to the rest of your team. Go's developer tools have also had more time to mature than Rust's.

Why choose Rust

Despite the negatives and the presence of a good alternative, Rust has a lot going for it that makes it a good choice for us. As mentioned earlier,

Rust enables one to build robust, complex systems quickly and fearlessly thanks to its powerful type system and ownership rules

This is huge. Being able to encode constraints of your application in the type system makes it possible to refactor, modify, or replace large swaths of code with confidence. The type system is our ultimate "move quickly and don't break things" secret weapon.
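
As an illustration of encoding a constraint in types, here is a minimal sketch. The `Platform` and `Protocol` enums and the `protocol_for` function are hypothetical names, not OnePush's actual types; the mapping only mirrors this post's description of APNs going over HTTP/2 and Android and WebPush going over HTTP/1.1.

```rust
/// Hypothetical set of delivery platforms.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Platform {
    Ios,
    Android,
    WebPush,
}

/// Hypothetical transport choices.
#[derive(Debug, PartialEq)]
enum Protocol {
    Http1,
    Http2,
}

/// Picks the transport for a platform. The `match` has no catch-all arm,
/// so the compiler checks that every variant is covered.
fn protocol_for(platform: Platform) -> Protocol {
    match platform {
        Platform::Ios => Protocol::Http2,
        Platform::Android | Platform::WebPush => Protocol::Http1,
    }
}
```

Adding a new `Platform` variant makes this function fail to compile until the new case is handled, which is the kind of compiler-guided refactoring described here.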

Rust's error handling model forces developers to handle every corner case. Even if there is a system with the potential to panic, it can be moved into its own thread for isolation. More recently, it has become possible to catch panics within a thread instead of only at the boundary. Languages like Go make it too easy to ignore errors.
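
The three mechanisms mentioned above can be sketched with the standard library alone. The function names below are hypothetical and the parsing task is only a stand-in for real work; this is not code from OnePush.

```rust
use std::num::ParseIntError;
use std::panic;
use std::thread;

/// Explicit error handling: the `?` returns early with the error
/// instead of silently ignoring it.
fn parse_count(raw: &str) -> Result<u32, ParseIntError> {
    let count = raw.trim().parse::<u32>()?;
    Ok(count)
}

/// Isolation: code that may panic runs on its own thread, and the panic
/// surfaces as an `Err` from `join` at the thread boundary.
fn parse_isolated(raw: String) -> Result<u32, String> {
    let handle = thread::spawn(move || raw.trim().parse::<u32>().expect("not a number"));
    handle.join().map_err(|_| "worker thread panicked".to_string())
}

/// Catching a panic inside the current thread with `catch_unwind`.
/// This only works for unwinding panics, not with `panic = "abort"`.
fn parse_caught(raw: &str) -> Result<u32, String> {
    panic::catch_unwind(|| raw.trim().parse::<u32>().expect("not a number"))
        .map_err(|_| "panic caught in place".to_string())
}
```

In all three cases the caller receives a `Result` and has to decide what to do with the failure.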

Next, OnePush needed to be fast. Rust makes writing multithreaded programs quite easy. The Send and Sync traits work together to ensure such programs are free from data races.
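
A small sketch of what `Send` and `Sync` buy in practice, using only the standard library. The `count_deliveries` function is a hypothetical example, not part of OnePush.

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Spawns `workers` threads that each add `per_worker` to a shared counter
/// and returns the final total.
fn count_deliveries(workers: u64, per_worker: u64) -> u64 {
    // `Arc<Mutex<u64>>` is `Send + Sync`, so it may be shared across threads.
    // Swapping `Arc` for `Rc`, or dropping the `Mutex`, is rejected at
    // compile time rather than becoming a data race at runtime.
    let total = Arc::new(Mutex::new(0u64));

    let handles: Vec<_> = (0..workers)
        .map(|_| {
            let total = Arc::clone(&total);
            thread::spawn(move || {
                for _ in 0..per_worker {
                    *total.lock().unwrap() += 1;
                }
            })
        })
        .collect();

    for handle in handles {
        handle.join().unwrap();
    }

    let result = *total.lock().unwrap();
    result
}
```

For instance, `count_deliveries(4, 1000)` always returns 4000, because every increment happens while the mutex is held.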

At the end of the day, our OnePush service is just a program optimized for sending a lot of HTTP requests. The library ecosystem offered everything we needed to build this system: An async HTTP/2 client, an async HTTP/1.1 client, a Redis client library and a PostgreSQL client library. We are fortunate that the Rust community is full of talented and ambitious developers who have already published a great deal of quality libraries that suit our specific needs.

Finally, the developer leading the effort had experience with, and a strong preference for, Rust. There are plenty of technologies that would have met our requirements, but deferring to a personal preference made a lot of sense. Having engineers excited about what they are working on is incredibly valuable. Such intrinsic motivation increases developer happiness and reduces burnout. Imagine going to work every day and getting to work on something you're excited about! Developer happiness is important to us as a company. Being able to provide so much by going with one technology versus another was a no-brainer.

Risks

Aside from the general risks of choosing Rust discussed above, we had a few additional concerns for this particular project.

As a glorified HTTP client, OnePush needed to be able to send lots of HTTP/1.1 requests very quickly. In the beginning, this wasn't quite as true because of our scale and because Android notifications could be batched into single requests. Going forward, we expected a huge increase in HTTP/1.1 outgoing request volume due to growth and the new WebPush specification with encrypted payloads. Hyper (Rust's HTTP library) had an async branch that was just a prototype when we started. We hoped that, by the time we truly needed an async client, it would be ready.

As it turned out, the initial async Rotor-based branch of Hyper never stabilized, because tokio and futures were announced in August 2016. By the time we really needed the async branch, we ended up having to spend a week or two debugging, stress-testing, and fixing the Rotor-based hyper::Client. This turned out to be OK since it was a chance to give back to the Rust community.

Since we would be on the nightly channel for serde derive and clippy lints, another risk was spending a lot of time doing rustc upgrades. We avoided this situation by pinning to specific versions of the compiler and upgrading infrequently. When we did upgrade, the process required finding a recent rustc that was supported by both libraries. This will become less of an issue very soon with the advent of Macros 1.1.

Finally, Solicit (Rust's HTTP/2 library) uses three threads per connection. Although this is fine in isolation, having 20,000 connections quickly becomes expensive. We've mitigated this issue by using a short keep-alive to limit the number of active connections and by taking advantage of Apple's HTTP/2 provider API (APNs), which allows 500 requests in-flight per connection.
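
To make the cost concrete, here is the arithmetic as a short sketch. The two constants come from the figures above; the helper functions and the idea of sizing connections from a target number of in-flight requests are illustrative assumptions, not a description of how OnePush schedules connections.

```rust
/// Threads used by each connection, per the figure quoted above.
const THREADS_PER_CONNECTION: u64 = 3;
/// In-flight requests allowed per APNs connection, per the figure quoted above.
const MAX_IN_FLIGHT_PER_CONNECTION: u64 = 500;

/// Total threads needed for a given number of open connections.
fn threads_for(connections: u64) -> u64 {
    connections * THREADS_PER_CONNECTION
}

/// Connections needed to keep `in_flight` requests outstanding when each
/// connection multiplexes up to the per-connection limit (rounded up).
fn connections_for(in_flight: u64) -> u64 {
    (in_flight + MAX_IN_FLIGHT_PER_CONNECTION - 1) / MAX_IN_FLIGHT_PER_CONNECTION
}
```

With these numbers, 20,000 connections mean 60,000 threads, while 20,000 in-flight requests multiplexed at 500 per connection need only 40 connections, or 120 threads.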

Unexpected Issues

For the most part, we knew what we were getting into building such a system in Rust. However, one thorn in our side that we didn't anticipate was rust-openssl upgrades. We are stuck on an earlier version of rust-openssl because the Solicit library depends on an API that was removed as of v0.8.0. This means that we are unable to upgrade other dependencies which rely on rust-openssl until we fix the Solicit issue.

Another minor issue at one time was the limited test framework. A common feature for test frameworks is to have some setup and teardown steps that run before and after a test. We say this issue was minor because we were able to work around its absence by generating many tests declaratively with macros (discussed below).

Successes

Writing OnePush in Rust has been hugely successful for us. We've been able to easily meet our performance and scaling goals with the application. OnePush is capable of delivering over 100k notifications per second and efficiently maximizes the use of system resources. Despite being highly multithreaded, race conditions have not been an issue for us. Even better, OnePush needs very little attention. We were able to leave it running without any issues through the holiday break.

Regressions are very infrequent. There's a huge class of bugs in languages like Ruby that just aren't possible in Rust. When combined with good test coverage, it becomes difficult to break things - all thanks to Rust's fantastic type system. This isn't just about regressions either. The compiler and type system make refactoring basically fool-proof. We like to say that Rust enables belligerent refactoring - making dramatic changes and then working with the compiler to bring your project back to a working state.

The macro system has been another big win. Our favorite example of how this saves us engineering time is using macros for writing tests declaratively. For example, we have a large set of tests for the Terminal. Each test takes some Events as input, and then the state of Redis and Postgres is checked for correctness after processing the event. The macro system enabled us to remove all of the boilerplate for these tests and declaratively say what the event is and what the expected outcome should be. Writing a test for this system today looks like this:

// Invoking terminal test-writing macro
push_test! {
    // The part before the arrow ends up being the test name.
    // The `response` describes an `Event`, and the rest describes the system
    // state after processing it. There are more parameters that can be
    // specified, but the default values are acceptable in this case.
    apns_success => {
        response: apns::Response::Success,
        success: 1,
        sending_done: true
    },
    // .. and so on
}

Writing a lot of similar tests in this fashion enables us to get a lot of coverage without a lot of work. It also helps us work around the lack of features in the Rust test system (such as before/after hooks).
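
For readers curious how such a macro can be built, here is a minimal `macro_rules!` sketch of the technique. It is not the real `push_test!`: the `event_tests!` macro, the `Response` and `Stats` types, and the `process` function are all hypothetical stand-ins, and the real tests check Redis and Postgres state rather than an in-memory struct.

```rust
/// Hypothetical event type standing in for the real `Event`s.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Response {
    Success,
    Failure,
}

/// Hypothetical system state checked after processing an event.
#[derive(Debug, Default, PartialEq)]
struct Stats {
    success: u32,
    failed: u32,
    sending_done: bool,
}

/// Hypothetical stand-in for the code under test.
fn process(response: Response) -> Stats {
    let mut stats = Stats::default();
    match response {
        Response::Success => stats.success += 1,
        Response::Failure => stats.failed += 1,
    }
    stats.sending_done = true;
    stats
}

/// Expands each `name => { response: .., field: value, .. }` entry into a
/// `#[test]` function. Fields that are not listed keep their default values.
macro_rules! event_tests {
    ($($name:ident => {
        response: $response:expr,
        $($field:ident : $value:expr),* $(,)?
    }),* $(,)?) => {
        $(
            #[test]
            fn $name() {
                let expected = Stats { $($field: $value,)* ..Stats::default() };
                assert_eq!(process($response), expected);
            }
        )*
    };
}

event_tests! {
    apns_success => { response: Response::Success, success: 1, sending_done: true },
    apns_failure => { response: Response::Failure, failed: 1, sending_done: true },
}
```

Each entry becomes its own named test under `cargo test`, so a failure points directly at the scenario that broke.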

The final thing we want to comment on here is serde. This library enables adding a #[derive(Deserialize)] attribute to a struct and getting a deserialize implementation. Combined with our serde-redis library, this makes it possible to load data out of Redis like so:

/// A person has a name and an ID.
///
/// This is just some data with a derived
/// Deserialize implementation
#[derive(Deserialize)]
struct Person {
    name: String,
    id: u64
}

// Gets a `Person` out of redis
let person: Person = redis.hgetall("person")?;

On the left-hand side of the line fetching person, there's a binding name with a type annotation. On the right-hand side, there's a call to Redis with HGETALL, and a ?. The ? is a bit of error handling; if the request is successful and deserialization works, person will be a valid Person, and the name and id fields can be used directly with knowledge that they were returned from Redis. If something goes wrong, like Redis is unreachable or there is data missing for the Person (such as a missing id), an error is returned from the current function.

This is really powerful! We can just describe our data, add this derive attribute and then safely load the data out of Redis. To get the same effect in a dynamic language, one would need to load this dictionary out of Redis and write a bunch of boilerplate to validate that the returned fields are correct. This sort of thing makes Rust more expressive than many high-level languages.
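
For contrast, this is roughly the validation one would write by hand without a derived implementation. It is only a sketch: `person_from_hash` is a hypothetical helper, and a `HashMap<String, String>` stands in for the raw HGETALL reply.

```rust
use std::collections::HashMap;

/// Same shape as the `Person` above, without the derive.
#[derive(Debug, PartialEq)]
struct Person {
    name: String,
    id: u64,
}

/// Builds a `Person` from raw field/value pairs, checking each field by hand.
fn person_from_hash(fields: &HashMap<String, String>) -> Result<Person, String> {
    // Every field needs its own presence check...
    let name = fields
        .get("name")
        .ok_or_else(|| "missing field: name".to_string())?
        .clone();

    // ...and every non-string field needs its own parsing step.
    let id = fields
        .get("id")
        .ok_or_else(|| "missing field: id".to_string())?
        .parse::<u64>()
        .map_err(|e| format!("invalid id: {}", e))?;

    Ok(Person { name, id })
}
```

This boilerplate grows with every field and every struct, which is the work the derive attribute takes over.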

Open Source

Early adoption in an ecosystem means there are lots of opportunities for open source contributions. The most notable of our contributions is a project called serde-redis, a Redis deserialization backend for serde. We've also had the opportunity to contribute several patches to Hyper's Rotor-based async client. We use that client in OnePush and have made billions of HTTP requests with it.

What's next

We've come far with OnePush, but there's still more work to do! Here are just a few of our upcoming projects related to OnePush:

  • Upgrade to Hyper's Tokio-based async implementation. We probably won't be super early adopters here since we've got an HTTP client with a lot of production miles on it right now.
  • Rework result processing to use futures. The Terminal's concurrency from threads is limited, whereas something backed by mio could have much higher throughput. This would require futures-compatible Redis and Postgres clients.
  • Replace Solicit's thread-based async client with a mio-based one. We already have a prototype from earlier in 2016.

We also have a new internal application written in Rust which we hope to blog about soon! It's a core piece of our monitoring which is responsible for collecting statistics from our production systems and storing them in InfluxDB.

Conclusion

We've had fantastic results building one of our core systems in Rust. It has delivered many billions of notifications, and it's delivering more and more each day. We hope that sharing our experience as early adopters in the Rust ecosystem will be helpful to others when making similar decisions. We've certainly found Rust to be a secret weapon for quickly building robust systems.

Like what we're doing? We're hiring!
