Scaling Zapier to Automate Billions of Tasks

Zapier
Zapier is for busy people who know their time is better spent selling, marketing, or coding. Instead of wasting valuable time coming up with complicated systems - you can use Zapier to automate the web services you and your team are already using on a daily basis.

Editor's note: By Bryan Helmig, Co-founder & CTO at Zapier



Zapier is a web service that automates data flow between over 500 web apps, including MailChimp, Salesforce, GitHub, Trello and many more.

Imagine building a workflow (or a "Zap" as we call it) that triggers when a user fills out your Typeform form, then automatically creates an event on your Google Calendar, sends a Slack notification and finishes up by adding a row to a Google Sheets spreadsheet. That's Zapier. Building Zaps like this is very easy, even for non-technical users, and is infinitely customizable.

As CTO and co-founder, I built much of the original core system, and today lead the engineering team. I'd like to take you on a journey through our stack, how we built it and how we're still improving it today!

The Teams Behind the Curtains

It takes a lot to make Zapier tick, so we have four distinct teams in engineering:

  • The frontend team, which works on the very powerful workflow editor.
  • The full stack team, which is cross-functional but focuses on the workflow engine.
  • The devops team, which keeps the engine humming.
  • The platform team, which helps with QA, and onboards partners to our developer platform.

All told, this involves about 15 engineers (and is growing!).

The Architecture

Our stack isn't going to win any novelty awards — we're using some pretty standard (but awesome) tools to power Zapier. More interesting are the ways we're using them to solve our particular brand of problems, but let's get the basics out of the way:

The Frontend

We're smack in the middle of transitioning from Backbone to React. We use Babel for ES6 and Webpack + Gulp to compile the frontend. We rely heavily on CodeMirror to do some of the more complex input widgets we need, and use React + Redux to do much of the heavy lifting for the uber-powerful Zap editor.

The Backend

Python powers a large majority of our backend. Django is the framework of choice for the HTTP side of things. Celery is a massive part of our distributed workflow engine. Most of the routine API work is done with the epic requests library (with a bunch of custom adapters and abstractions).

The Data

MySQL is our primary relational data store — you'll find our users, Zaps and more inside MySQL. Memcached and McRouter make an appearance as the ubiquitous caching layer. Other types of data go in other data stores that make more sense. For example, in-flight task counts for billing and throttling find themselves in Redis, and Elasticsearch stores historical activity feeds for Zaps. For data analysis we love us some AWS Redshift.

The Platform

Most of our platform is nestled inside our fairly monolithic core Python codebase, but there are lots of interesting offshoots that offer specialized functionality. The best example might be how we utilize AWS Lambda to run partner/user provided code to customize app behavior and API communication.

The Infrastructure

Since Zapier runs on AWS, we have quite a bit of power at our fingertips. EC2 and VPC are the centerpiece there, though we do use RDS where possible along with copious numbers of autoscaling groups to ensure the pool of servers is in tip-top shape. Jenkins, Terraform, Puppet and Ansible are all daily tools for the devops team. For monitoring, we can't rave enough about StatsD, Graylog, and Sentry (they're so good).

Some Rough Numbers

These numbers represent a rough minimum to help the reader gauge the general size and dimensions of Zapier's architecture:

  • over ~8m tasks automated daily
  • over ~60m API calls daily
  • over ~10m inbound webhooks daily
  • ~12 c3.2xlarge boxes running HTTP behind ELB
  • ~100 m3.2xlarge background workers running Celery (split amongst polling, hooks, email, misc)
  • ~3 m3.medium RabbitMQ nodes in a cluster
  • ~4 r3.2xlarge Redis instances - one hot, two failover, one backup/imaging
  • ~12 m2.xlarge Memcached instances behind ~6 c3.xlarge McRouter instances
  • ~10 m3.xlarge Elasticsearch instances behind ~3 m3.xlarge no-data Elasticsearch instances
  • ~6 m3.xlarge Elasticsearch instances behind ~1 c3.2xlarge Graylog server
  • ~10 dc1.large Redshift nodes in a cluster
  • 1 master db.m2.2xlarge RDS MySQL instance w/ ~2 more replicas for both production reads and analysis
  • a handful of supporting RDS MySQL instances (more details below)
  • ...and tons of microservices and miscellaneous specialty services

Improving the Architecture

While the broad strokes of the architecture remain the same - we've only performed a few massive migrations - a lot of work has been done to grow the product in two categories:

  1. Supporting big new product features
  2. Scaling the application for more users

Let's dive into some examples of each, with as many nitty-gritty details as possible without getting bogged down!

Big Features like Multi-Step Zaps

When we started Zapier (fun fact: it was first called Snapier!) at a Startup Weekend, we laid out the basic architecture in less than 54 hours (fueled by lots of coffee and more than a few beers). Overall, it was decent. We kept the design really, really simple, which was the right call at the time.

Except where it was too simple. Specifically, we made Zaps two-stepped: a trigger paired with an action, full stop.

It didn't take long for us to realize the missed opportunity, but the transition was going to be pretty complex. We had to implement a directed rooted tree with support for an arbitrary number of steps (nodes), but maintain 1-to-1 support for existing Zaps (of which there were already hundreds of thousands). And we had to do that while preserving support for hundreds of independent partner APIs.

Starting at the data model, we built a very simple directed rooted tree implementation in MySQL. Just imagine a table where every row has a self-referencing parent_id foreign key, plus an extra root_id foreign key to simplify queries, and you've pretty much got it. We discussed switching to a proper graph database (like Neo4j) but decided against it because the sorts of queries we make are simple and run over small, isolated graphs (roughly 2-50 nodes).
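To make the table shape concrete, here is a minimal sketch of a parent_id/root_id tree. It is not our actual schema: SQLite stands in for MySQL so the snippet runs anywhere, and the table name, the app column and the sample rows are all made up for illustration.

```python
import sqlite3

# SQLite stands in for MySQL. Only parent_id and root_id come from the
# article; the table name, the app column and the rows are hypothetical.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE step (
        id        INTEGER PRIMARY KEY,
        parent_id INTEGER REFERENCES step(id),  -- NULL for the root (the trigger)
        root_id   INTEGER REFERENCES step(id),  -- identical for every node of one Zap
        app       TEXT NOT NULL
    )
    """
)
conn.execute("CREATE INDEX step_root ON step(root_id)")

rows = [
    (1, None, 1, "typeform"),        # root of the first Zap
    (2, 1, 1, "google_calendar"),
    (3, 2, 1, "slack"),
    (4, 3, 1, "google_sheets"),
    (5, None, 5, "github"),          # root of a second, unrelated Zap
    (6, 5, 5, "trello"),
]
conn.executemany("INSERT INTO step VALUES (?, ?, ?, ?)", rows)


def load_tree(conn, root_id):
    """Fetch every node of one Zap with a single query on root_id."""
    cur = conn.execute(
        "SELECT id, parent_id, app FROM step WHERE root_id = ? ORDER BY id",
        (root_id,),
    )
    apps, children = {}, {}
    for step_id, parent_id, app in cur:
        apps[step_id] = app
        children.setdefault(parent_id, []).append(step_id)
    return apps, children


apps, children = load_tree(conn, 1)
```

The root_id column is what keeps queries simple: loading a whole Zap is one flat SELECT rather than a recursive walk up or down parent_id, and the tree is rebuilt in memory from the small result set.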

A key aspect to make this work is inter-step independence. Every step has to consume some data (which folder to read from or which list ID to add to, for example), do some API magic, and return some data (say, the new file created or the new card added to a list), but otherwise be ignorant of its placement in the workflow. Each independent step is as dumb as a rock.

In the middle exists the omniscient workflow engine which coordinates independent Celery tasks by stringing together steps as tasks — one step feeding into the next as defined by the Zap's directed rooted tree. This omniscient engine also houses all the other goodies like error & retry handling, reporting, logging, throttling and more.
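The split between dumb steps and an omniscient engine can be sketched in plain Python. This is an illustration under assumptions, not our engine: the step functions, the TREE and STEPS dictionaries and the retry count are hypothetical, and an in-process queue stands in for Celery tasks and the message broker.

```python
from collections import deque


# Each step only maps input data to output data. It knows nothing about
# where it sits in the workflow. All three steps are hypothetical.
def new_calendar_event(data):
    return {"event_id": 42, "title": data["title"]}


def slack_message(data):
    return {"text": "Created event %s" % data["event_id"]}


def sheet_row(data):
    return {"row": [data["event_id"], data["title"]]}


STEPS = {"calendar": new_calendar_event, "slack": slack_message, "sheet": sheet_row}

# The Zap's tree (step name -> child step names). Only the engine sees it.
TREE = {"calendar": ["slack", "sheet"], "slack": [], "sheet": []}


def run_zap(root, payload, max_retries=2):
    """Walk the tree from the root, feeding each step's output to its children."""
    results = {}
    queue = deque([(root, payload)])  # stand-in for a task queue
    while queue:
        name, data = queue.popleft()
        for attempt in range(max_retries + 1):
            try:
                output = STEPS[name](data)
                break
            except Exception:
                if attempt == max_retries:
                    raise  # retries exhausted, surface the error
        results[name] = output
        for child in TREE[name]:
            queue.append((child, output))
    return results


results = run_zap("calendar", {"title": "Demo"})
```

Because retries and traversal live in run_zap, a new step is just another function in STEPS; nothing in the step itself changes when the workflow is rearranged.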

Even after we nailed the backend support, we had another huge problem: how do you build a UI for this thing?

First, you make sure you have some amazing designers and JavaScript engineers on the team. Then you wrestle with nested Backbone views for a while before moving on to React. :-) In all seriousness: React is a godsend for the sorts of complex interfaces we are building.

One of the unique things about React is that its performance characteristics are developer-friendly, but only as long as you have your data figured out. If you aren't using immutable data structures, you should use a structural sharing library for all mutations, along with a deep Object.freeze() in development to catch spots where you attempt to mutate data directly.

There are tons of challenges in building such a complex UI, much of it around testing and feedback, but a huge amount of time was spent just getting the long-tailed data from different APIs to fit elegantly into the same places. Just about every weird shape of data has to be accounted for.

Finally, we were tasked with getting the new editor in front of users for alpha and beta testing. To do this we shipped both versions of the editor simultaneously and used feature switches to opt users in. We did months and months of testing and tweaking before we were happy with the result - you can check out the Multi-Step Zap launch page to get an idea of where it ended up.
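A feature switch of this kind can be as small as the sketch below. It is illustrative only: the switch name, the user IDs and the percentage rollout are assumptions (the article only says users were opted in), and a real setup would keep the configuration in a database or cache rather than in a module-level dictionary.

```python
import hashlib

# Hypothetical switch configuration: explicit opt-ins plus a percentage
# rollout. The percentage part is an assumption, not something we describe.
SWITCHES = {
    "multi_step_editor": {"opted_in": {101, 202}, "percent": 10},
}


def bucket(switch, user_id):
    """Map (switch, user) to a stable number in the range 0..99."""
    digest = hashlib.sha256(("%s:%s" % (switch, user_id)).encode()).hexdigest()
    return int(digest, 16) % 100


def is_enabled(switch, user_id):
    config = SWITCHES.get(switch)
    if config is None:
        return False  # unknown switches are always off
    if user_id in config["opted_in"]:
        return True
    return bucket(switch, user_id) < config["percent"]


def editor_for(user_id):
    """Pick which of the two simultaneously shipped editors to serve."""
    if is_enabled("multi_step_editor", user_id):
        return "new_editor"
    return "legacy_editor"
```

Hashing the switch name together with the user ID keeps each user's experience stable between requests while letting different switches roll out to different slices of users.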



Scaling the Application

It would all be for nought if the service weren't up and running reliably. As such, much of our attention is focused jointly on application design to support horizontal scalability and redundant infrastructure to ensure availability.

One of the wiser decisions we've made so far is to double down on tech we're comfortable with and spin out isolated functionality when we hit a bottleneck. The key is to reuse the exact same solution and move it to another box where it is free to roam fresh pastures of CPU and RAM.

For example, over the last year or so we noticed session data had eaten up a ton of our primary database's IO and storage. Since session data is effectively a key/value arrangement with softer consistency requirements, we heavily debated options like Cassandra or Riak (or even Redis!), but ultimately decided to stand up a dedicated MySQL instance with a single sessions table.

Our instinct as engineers was to find the tool best suited to the job, but as a practical matter, the job didn't warrant additional operational complexity. We know MySQL, it can do simple key/value storage and our application already supports it. Talk about a no-brainer.
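For a sense of how little is needed, here is a rough sketch of key/value session storage in a single relational table. SQLite stands in for the dedicated MySQL instance, and the column names and the put/get helpers are hypothetical, not our real schema.

```python
import sqlite3
import time

# SQLite stands in for the dedicated MySQL instance; column names are made up.
conn = sqlite3.connect(":memory:")
conn.execute(
    """
    CREATE TABLE sessions (
        session_key TEXT PRIMARY KEY,
        data        TEXT NOT NULL,
        expires_at  REAL NOT NULL
    )
    """
)


def put(key, data, ttl=3600, now=None):
    """Insert or overwrite a session. MySQL would use REPLACE INTO or
    INSERT ... ON DUPLICATE KEY UPDATE instead of INSERT OR REPLACE."""
    now = time.time() if now is None else now
    conn.execute(
        "INSERT OR REPLACE INTO sessions VALUES (?, ?, ?)",
        (key, data, now + ttl),
    )


def get(key, now=None):
    """Return the session payload, or None if it is missing or expired."""
    now = time.time() if now is None else now
    row = conn.execute(
        "SELECT data FROM sessions WHERE session_key = ? AND expires_at > ?",
        (key, now),
    ).fetchone()
    return row[0] if row else None


put("abc", '{"user": 1}', ttl=10, now=100)
```

Every access is a primary-key lookup or a single-row write, which is why a plain relational table copes fine with a workload that looks like it belongs in a key/value store.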

Further, careful design of the application can make horizontal scaling equally simple. Long-running background tasks (like our Multi-Step Zaps) aren't bound by strict consistency requirements due to their light write pattern, so it is trivial (and safe!) to use MySQL read-only replicas as the primary touch point. Even if we occasionally get horrible replica lag measured in minutes, 99.9% of Zaps aren't changing (and certainly aren't changing soon), so they continue to hum along.
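In a Django codebase, one way to point reads at replicas is a database router. The class below is a sketch of that idea rather than our production code: the replica aliases are hypothetical entries in settings.DATABASES, and the router would be enabled through the DATABASE_ROUTERS setting in the worker processes that can tolerate replica lag.

```python
import random


class ReplicaRouter:
    """Django-style database router: reads go to replicas, writes to the primary.

    Django routers are duck-typed, so no Django import is needed here.
    """

    # Hypothetical aliases; they must match keys in settings.DATABASES.
    replicas = ["replica_1", "replica_2"]

    def db_for_read(self, model, **hints):
        # Spread reads across the replicas at random.
        return random.choice(self.replicas)

    def db_for_write(self, model, **hints):
        # All writes go to the primary.
        return "default"

    def allow_relation(self, obj1, obj2, **hints):
        # Every alias holds the same data, so relations are always fine.
        return True

    def allow_migrate(self, db, app_label, model_name=None, **hints):
        # Schema changes only run against the primary.
        return db == "default"


router = ReplicaRouter()
```

Code that must read its own writes can still opt out per query with Django's using("default") on a queryset, so the relaxed default does not have to apply everywhere.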

Another good practice is to assume the worst. Design for failure from the beginning. While usually this is easier said than done, nowadays it is actually surprisingly easy to do. For starters: use auto scaling groups with auto-replacement. A common misconception is that ASGs are only for scaling to accommodate fluctuating load. Wrong! The ASG + ELB combo can be your backbone of reliability, one that enables you to randomly kill instances without worry since they get replaced in short order.

Somehow we keep re-learning that the simpler the system is, the better you'll sleep.

The Day-To-Day

Locally, our engineers enjoy a fully functioning environment courtesy of Docker. docker-machine and docker-compose together stand up proper versions of MySQL, Memcached, Redis, Elasticsearch as well as all the web and background workers. We generally recommend running npm and even runserver locally, as file watching is kind of broken with VirtualBox.

The canonical GitHub "pull request model" drives most of our projects that are engineering focused, where day-to-day work is logged and final code review happens before merging. Hackpad houses the majority of our documentation, including copious onboarding documentation.

A big thing at Zapier is all hands support. Every four or five weeks, every engineer spends a full week in support helping debug and fix difficult customer issues. This is hugely important to us as it provides a baseline for understanding customers' pain (plus, you might have to deal with the bug you shipped!).

For CI and deployment, we use Jenkins to run tests on every commit in every PR as well as to provide a "one-click deploy" that anyone at the company can press. It's not uncommon for a new engineer to click the deploy button the first week on the job!

We have a full staging environment in a standalone VPC, as well as a handful of standalone web boxes perfect for testing long lived pull requests. Canary deploys to production are common — complete with full logs of any errors courtesy of Graylog.

You Can Zapier, Too!

Developers can use Zapier to do some pretty awesome stuff.

In addition to Multi-Step Zaps, we've also launched the ability to write Python and JavaScript as Code steps in your workflow. No need to host and run scripts yourself; we take care of all of that. We also provide bindings to call out to the web (requests and fetch) and even store a bit of state between runs!

Our users are employing Code steps to build Slack bots (and games!), to replace one-off scripts and lots more. I personally use Code steps to write bots and tools to track code & bug metrics to a spreadsheet, transform oddly formatted data and replace a ton of crontabs.

Or, if you have an API that you want non-developers to be able to consume, we have a pretty epic Developer Platform, too. Simply define your triggers, searches and actions, and any user can mix your API into their workflows and integrate your app with over 500 apps like GitHub, Salesforce, Google Docs, and more.

And, we are often hiring, so keep an eye on our jobs page if you'd like to help us help people work faster and automate their most tedious tasks!

