Scaling Zapier to Automate Billions of Tasks

19,687
Zapier
Zapier is for busy people who know their time is better spent selling, marketing, or coding. Instead of wasting valuable time coming up with complicated systems - you can use Zapier to automate the web services you and your team are already using on a daily basis.

Editor's note: By Bryan Helmig, ‎Co-founder & CTO at Zapier



Zapier is a web service that automates data flow between over 500 web apps, including MailChimp, Salesforce, GitHub, Trello and many more.

Imagine building a workflow (or a "Zap" as we call it) that triggers when a user fills out your Typeform form, then automatically creates an event on your Google Calendar, sends a Slack notification and finishes up by adding a row to a Google Sheets spreadsheet. That's Zapier. Building Zaps like this is very easy, even for non-technical users, and is infinitely customizable.

As CTO and co-founder, I built much of the original core system, and today lead the engineering team. I'd like to take you on a journey through our stack, how we built it and how we're still improving it today!

The Teams Behind the Curtains

It takes a lot to make Zapier tick, so we have four distinct teams in engineering:

  • The frontend team, which works on the very powerful workflow editor.
  • The full stack team, which is cross-functional but focuses on the workflow engine.
  • The devops team, which keeps the engine humming.
  • The platform team, which helps with QA, and onboards partners to our developer platform.

All told, this involves about 15 engineers (and is growing!).

The Architecture

Our stack isn't going to win any novelty awards — we're using some pretty standard (but awesome) tools to power Zapier. More interesting are the ways we're using them to solve our particular brand of problems, but let's get the basics out of the way:

The Frontend

We're smack in the middle of transitioning from Backbone to React. We use Babel for ES6 and Webpack + Gulp to compile the frontend. We rely heavily on CodeMirror to do some of the more complex input widgets we need, and use React + Redux to do much of the heavy lifting for the uber-powerful Zap editor.

The Backend

Python powers a large majority of our backend. Django is the framework of choice for the HTTP side of things. Celery is a massive part of our distributed workflow engine. Most of the routine API work is done with the epic requests library (with a bunch of custom adapters and abstractions).

The Data

MySQL is our primary relational data store — you'll find our users, Zaps and more inside MySQL. Memcached and McRouter make an appearance as the ubiquitous caching layer. Other types of data go in other data stores that make more sense. For example, in-flight task counts for billing and throttling find themselves in Redis, and Elasticsearch stores historical activity feeds for Zaps. For data analysis we love us some AWS Redshift.

The Platform

Most of our platform is nestled inside our fairly monolithic core Python codebase, but there are lots of interesting offshoots that offer specialized functionality. The best example might be how we utilize AWS Lambda to run partner/user provided code to customize app behavior and API communication.

The Infrastructure

Since Zapier runs on AWS, we have quite a bit of power at our fingertips. EC2 and VPC are the centerpiece there, though we do use RDS where possible along with copious numbers of autoscaling groups to ensure the pool of servers are in tip-top shape. Jenkins, Terraform, Puppet and Ansible are all daily tools for the devops team. For monitoring, we can't rave enough about Statsd, Graylog, and Sentry (they're so good).

Some Rough Numbers

These numbers represent a rough minimum to help the reader guage the general size and dimensions of Zapier's architecture:

  • over ~8m tasks automated daily
  • over ~60m API calls daily
  • over ~10m inbound webhooks daily
  • ~12 c3.2xlarge boxes running HTTP behind ELB
  • ~100 m3.2xlarge background workers running Celery (split amongst polling, hooks, email, misc)
  • ~3 m3.medium RabbitMQ nodes in a cluster
  • ~4 r3.2xlarge Redis instances - one hot, two failover, one backup/imaging
  • ~12 m2.xlarge Memcached instances behind ~6 c3.xlarge McRouter instances
  • ~10 m3.xlarge ElasticSearch instances behind ~3 m3.xlarge no-data ElasticSearch instances
  • ~6 m3.xlarge ElasticSearch instances behind ~1 c3.2xlarge Graylog server
  • ~10 dc1.large Redshift nodes in a cluster
  • 1 master db.m2.2xlarge RDS MySQL instance w/ ~2 more replicas for both production reads and analysis
  • a handful of supporting RDS MySQL instance (more details below)
  • ...and tons of microservices and miscellaneous specialty services

Improving the Architecture

While the broad strokes of the architecture remain the same - we've only performed a few massive migrations - a lot of work has been done to grow the product in two categories:

  1. Supporting big new product features
  2. Scaling the application for more users

Let's dive into some examples of each, with as many nitty-gritty details as possible without getting bogged down!

Big Features like Multi-Step Zaps

When we started Zapier (fun fact: it was first called Snapier!) at a Startup weekend, we laid out the basic architecture in less than 54 hours (fueled by lots of coffee and more than a few beers). Overall, it was decent. We kept the design really, really simple, which was the right call at the time.

Except where it was too simple. Specifically, we made Zaps two-stepped: a trigger paired with an action, full stop.

It didn't take long for us to realize the missed opportunity, but the transition was going to be pretty complex. We had to implement a directed rooted tree with support for an arbitrary numbers of steps (nodes), but maintain 1-to-1 support for existing Zaps (of which there were already hundreds of thousands). And we had to do that while preserving support for hundreds of independent partner APIs.

Starting at the data model, we built a very simple directed rooted tree implementation in MySQL. Just imagine a table where every row has a self-referencing parent_id foreign key, plus an extra root_id foreign key to simplify queries, and you pretty much got it. We discussed switching to a proper graph database (like neo4j) but decided against it because the sorts of queries we make are simple and over isolated graphs of smaller sizes (roughly ~2-50 nodes).

A key aspect to make this work is inter-step independence. Every step has to consume some data (which folder to read from or which list ID to add to, for example), do some API magic, and return some data (say, the new file created or the new card added to a list), but otherwise be ignorant of its placement in the workflow. Each independent step is as dumb as a rock.

In the middle exists the omniscient workflow engine which coordinates independent Celery tasks by stringing together steps as tasks — one step feeding into the next as defined by the Zap's directed rooted tree. This omniscient engine also houses all the other goodies like error & retry handling, reporting, logging, throttling and more.

Even after we nailed the backend support, we had another huge problem: how do you build a UI for this thing?

First, you make sure you have some amazing designers and Javascript engineers on the team. Then you wrestle with nested Backbone views for a while before moving onto React. :-) In all seriousness: React is a godsend for the sorts of complex interfaces we are building.

One of the unique things about React is the performance characteristics are developer friendly, but only as long as you have your data figured out. If you aren't using immutable data structures, you should use some structural sharing library to do all mutations, along with deep Object.freeze() in development to catch spots where you attempt mutation directly.

There are tons of challenges in building such a complex UI, much of it around testing and feedback, but a huge amount of time was spent just getting the long-tailed data from different APIs to fit elegantly into the same places. Just about every weird shape of data has to be accounted for.

Finally, we were tasked with getting the new editor in front of users for alpha and beta testing. To do this we shipped both versions of the editor simultaneously and used feature switches to opt users in. We did months and months of testing and tweaking before we were happy with the result - you can check out the Multi-Step Zap launch page to get an idea of where it ended up.



Scaling the Application

It would all be for nought if the service isn't up and running reliably. As such, much of our attention is focused jointly on application design to support horizontal scalability and redundant infrastructure to ensure availability.

Some of the wiser decisions we've made so far is to double down on tech we're comfortable with and spin out isolated functionality when we hit a bottleneck. The key is to reuse the exact same solution and move it to another box where it is free to roam fresh pastures of CPU and RAM.

For example, over the last year or so we noticed session data had eaten up a ton of our primary database's IO and storage. Since session data is effectively a key/value arrangement with softer consistency requirements, we heavily debated options like Cassandra or Riak (or even Redis!), but ultimately decided to stand up a dedicated MySQL instance with a single sessions table.

Our instinct as engineers was to find the tool best suited to the job, but as a practical matter, the job didn't warrant additional operational complexity. We know MySQL, it can do simple key/value storage and our application already supports it. Talk about a no-brainer.

Further, careful design of the application can make horizontal scaling equally simple. Long running background tasks (like our Multi-Step Zaps) aren't bound by strict consistency requirements due to their light write pattern, so it is trivial (and safe!) to use MySQL read-only replicas as the primary touch point. Even if we occasionally get horrible replica lag measured in the minutes, 99.9% of Zaps aren't changing — and certainly aren't changing soon — so they continue to hum along.

Another good practice is to assume the worst. Design for failure from the beginning. While usually this is easier said than done, nowadays it is actually surprisingly easy to do. For starters: use auto scaling groups with auto-replacement. A common misconception is that ASGs are only for scaling to accommodate fluctuating load. Wrong! The ASG + ELB combo can be your backbone of reliability, one that enables you to randomly kill instances without worry since they get replaced in quick order.

Somehow we keep re-learning that the simpler the system is, the better you'll sleep.

The Day-To-Day

Locally, our engineers enjoy a fully functioning environment courtesy of Docker. docker-machine and docker-compose together stand up proper versions of MySQL, Memcached, Redis, Elasticsearch as well as all the web and background workers. We generally recommend running npm and even runserver locally, as file watching is kind of broken with VirtualBox.

The canonical GitHub "pull request model" drives most of our projects that are engineering focused, where day-to-day work is logged and final code review happens before merging. Hackpad houses the majority of our documentation, including copious onboarding documentation.

A big thing at Zapier is all hands support. Every four or five weeks, every engineer spends a full week in support helping debug and fix difficult customer issues. This is hugely important to us as it provides a baseline for understanding customers' pain (plus, you might have to deal with the bug you shipped!).

For CI and deployment, we use Jenkins to run tests on every commit in every PR as well as to provide a "one-click deploy" that anyone at the company can press. It's not uncommon for a new engineer to click the deploy button the first week on the job!

We have a full staging environment in a standalone VPC, as well as a handful of standalone web boxes perfect for testing long lived pull requests. Canary deploys to production are common — complete with full logs of any errors courtesy of Graylog.

You Can Zapier, Too!

Developers can use Zapier to do some pretty awesome stuff.

In addition to Multi-Step Zaps, we've also launched the ability to write Python and Javascript as Code steps in your workflow. No need to host and run scripts yourself — we take care of all of that. We also provide bindings to call out to the web (requests and fetch) and even store a bit of state between runs!

Our users are employing Code steps to build Slack bots (and games!), to replace one-off scripts and lots more. I personally use Code steps to write bots and tools to track code & bug metrics to a spreadsheet, transform oddly formatted data and replace a ton of crontabs.

Or, if you have an API that you want non-developers to be able to consume, we have a pretty epic Developer Platform, too. Simply define your triggers, searches and actions, and any user can mix your API into their workflows and integrate your app with over 500 apps like GitHub, Salesforce, Google Docs, and more.

And, we are often hiring, so keep an eye on our jobs page if you'd like to help us help people work faster and automate their most tedious tasks!


Zapier
Zapier is for busy people who know their time is better spent selling, marketing, or coding. Instead of wasting valuable time coming up with complicated systems - you can use Zapier to automate the web services you and your team are already using on a daily basis.
Tools mentioned in article
Open jobs at Zapier
Technical Support Engineer

Technical Support Engineer

Hi there!


We're looking for a Technical Support Engineer to join the Support Engineering team at Zapier. Zapier is on a mission to make everyone more productive at work. Over 3 million professionals already use Zapier to save more time, but there are millions more to reach. One way we do this is by cultivating and supporting relationships with partners, who sometimes require assistance in building or troubleshooting the app integrations they own. If you love interacting with people on a daily basis to help them optimize their work just as much as you love diving deep into code to squash bugs, read on…


We know applying for and taking on a new job at any company requires a leap of faith. We want you to feel comfortable and excited to apply at Zapier. To help share a bit more about life at Zapier, here are a few resources in addition to the job description that can give you an inside look at what life is like at Zapier. Hopefully, you'll take the leap of faith and apply.

 

Our Commitment to Applicants

Culture and Values at Zapier

Zapier Guide to Remote Work

Zapier Code of Conduct

Diversity and Inclusivity at Zapier


Zapier is proud to be an equal opportunity workplace dedicated to pursuing and hiring a diverse workforce.

 

About You

 

You’re empathetic. You’ll be working directly with customers using our services as well as developers who are building on our platform as they overcome problems. You’re able to put yourself in their shoes and help point them in the right direction—whether that means sending a link to relevant documentation or explaining a more complex concept in clear terms.

 

You love code and APIs. You are proficient at reading and writing code and genuinely enjoy making and maintaining software. You’ve worked with many APIs and have a fundamental understanding of how they work. You have a solid intuition for what could be causing an API to respond to a request with an error, and you know the little tricks you can employ to get misbehaving requests back on track. You're comfortable working with code and logs to diagnose, fix, and safeguard against API issues. 

 

You love figuring things out. You enjoy being presented with situations that don't have an immediately obvious answer and relish finding the solution. You are excited, not intimidated, by problems you don't know the answer to. You love applying the things you have learned to unfamiliar situations in order to see the deeper patterns that connect seemingly disparate issues. 

 

You love variety. You would enjoy a multifaceted role that interacts with a wide variety of people, topics, and issues. The idea of seeing something new and different every day is invigorating. You are able to learn and act quickly to keep up with rapidly changing and sometimes unfamiliar situations. 

 

You’re an excellent written communicator. We’re a 100% remote team, and writing is our primary means of communication at Zapier and to our end users. You have a very strong command of written English and your writing is concise but effective.

 

You’re solid at time management. You can balance a variety of projects and responsibilities without getting overwhelmed. As a part of a distributed team, you’ll be trusted to work with minimal supervision. As a part of a growing company, you have an opportunity to make a big impact, and you’re keen to build processes that’ll make your job more efficient over time. 

 

Things You’ll Do

  • Serve as a point of internal escalation on technical issues within the Support org, helping our Customer Champions level up their troubleshooting skills and tackle harder issues
  • Create and improve documentation to help users and partners help themselves
  • Employ your programming skills to triage and fix bugs on our platform, and to oversee fixes coming in from Customer Champions
  • Focus on sustainability by seeking out projects that improve the lives of the people around you and the customers they support
  • Find other opportunities to move the team, the org, and the company forward, such as contributing to building and maintaining internal tools, code review, mentoring

 

About Zapier

Zapier helps people across the world automate the boring and tedious parts of their job. We do that by helping everyone connect the web applications they already use and love.

 

We believe that there are jobs a computer is best at doing and that there are jobs a human is best at doing. We want to empower businesses to create processes and systems that let computers do what they are best at doing and let humans do what they are best at doing.

We believe that with the right tools, you can have a big impact with less hassle.

We believe in small teams. Small teams are fast and nimble. Small teams mean less bureaucracy and less management and more getting things done.

 

We believe in a safe, welcoming, and inclusive environment. All teammates at Zapier agree to a code of conduct.

 

The Whole Package

Location: Remote

 

Our distributed environment lets us work with the best people. You don't have to be located in the USA either. Some team members live in the United Kingdom, Thailand, India, Nigeria, Taiwan, Guatemala, New Zealand, Australia, and more! You just need the skills and drive to succeed in this role and the ability to work from anywhere.

 

Compensation:

  • Competitive salary (we don't use remote as an excuse to pay less)
  • Great healthcare + dental + vision coverage*
  • Retirement plan with 4% company match*
  • Profit sharing
  • 2 annual company retreats to awesome places
  • 14 weeks paid leave for new parents of biological or adopted children
  • Pick your own equipment. We'll set you up with whatever Apple laptop + monitor combo you want plus any software you need.
  • Unlimited vacation policy. Plus we require you to take at least 2 weeks off each year. We see most employees take 4-5 weeks off per year. This isn't a vague policy where unlimited vacation means no vacation.
  • Travel of 5% - 10% for company retreats which rotate to various cities throughout North America
  • Work with awesome companies around the world. We partner with great software companies all over the world and you'll constantly get to interact with people from these great companies

 

*While we take care of our international folks as best we can, currently, healthcare and retirement plans are only available to US-based employees.

 

How to Apply


We have a non-standard application process. To jump-start the process we ask a few questions we normally would ask at the start of an interview. This helps speed up the process and lets us get to know you a bit better right out of the gate. Please make sure to answer each question.

After you apply, you are going to hear back from us, even if we don't seem like a good fit. In fact, throughout the process, we strive to make sure you never go more than seven days without hearing from us.

Optional: Share anonymously some demographic information about yourself to help us better track trends related to the backgrounds of candidates interested in working at Zapier in order for us to build a team that represents the users at Zapier and the broader world population.

Zapier is an equal opportunity employer. We're excited to work with talented and empathetic people no matter their race, color, gender, sexual orientation, gender identity/expression, religion, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), or any other basis protected by law. Our code of conduct provides a beacon for the kind of company we strive to be, and we celebrate our differences because those differences are what allow us to make a product that serves a global user base.

Verified by
Co-founder
You may also like