How Stitch Consolidates A Billion Records Per Day

By Jake Stein, Co-founder & CEO of Stitch.

Stitch is an ETL service built for developers. We power data infrastructure for hundreds of amazing organizations. For example:

Indiegogo uses Stitch as their fully managed data pipeline
Postman consolidates data from SaaS tools and internal applications to get insights into customer behavior.
Charity:Water uses Stitch to access and understand the data generated by Netsuite

ETL is an acronym for extract, transform, and load, and it refers to the process by which data is:

Extracted from source systems (databases like MongoDB and MySQL, SaaS tools like Salesforce and Zendesk, and ad networks like Adwords and Facebook Ads, etc)
Transformed into a different structure or format
Loaded into a destination, which in our case is a customer’s data warehouse (Redshift, BigQuery, Postgres, or Snowflake).

Stitch is different from traditional ETL tools because it focuses primarily on extraction and loading, and it facilitates a new workflow where customers run transformation inside their data warehouses, usually defined in SQL. This is sometimes called ELT rather than ETL, and its increasing traction is a function of data warehouses moving to the cloud. ETL tools got started loading data into Netezza, Teradata, and other on-premises data warehouses. These were very expensive, and the latency of adding capacity was measured in weeks or months. When operating in that environment, it’s rational to do as much preparation work as possible in the pipeline prior to loading data.

Modern cloud data warehouses—like Amazon Redshift, Google BigQuery, and Snowflake—work much differently. They are far more cost effective and can be scaled up and down on demand. This enables transformation to be done in the warehouse rather than in the data pipeline. We wrote a blog post on why ELT is a better model than ETL, and here is a quick summary of the benefits:

Transformation can be done in the same language and/or tool as analysis
Raw data is available in the same data warehouse as transformed data
A single centralized data warehouse can support many different uses, which may each require different transformations

Our customers typically set up a data warehouse for at least one of three reasons:

Deploy business intelligence and data visualization tools
Enable analysts to interactively query data
Build an internal application that takes advantage of external data to power recommendation, personalization, or segmentation

Stitch launched August 1st of 2016, and we’ve been fortunate to see lots of demand. Here are some statistics on our traction and scale so far:

300+ customers
1 billion+ records consolidated per day (for context, this is roughly double the number of tweets per day)
66 data sources
4 data warehouse destinations
25 team members

Origin story

Stitch began as an internal project at RJMetrics, which developed data analysis software for ecommerce companies. In November 2015, the ETL product was offered as a separate product from the analysis software under the name RJMetrics Pipeline. In August 2016, Magento acquired RJMetrics, and RJMetrics Pipeline was spun out into a standalone company called Stitch.

The Stack

Our language of choice is Clojure, which was first deployed in a single microservice as a proof of concept for breaking down the original RJMetrics PHP monolith. Although few members of our team had Clojure experience at the time, the productivity of REPL-based programming resonated enough that it became our default. Today, after multiple years of work splitting out microservices, that monolith is no more - we have 17 Clojure microservices, and PHP is not a part of our stack. We have also brought on board a number of seasoned Clojure engineers, and found it to be a useful recruiting tool for attracting folks who want to work with it.

When we’re not writing Clojure, we use Python for the open source code in the Singer project, coffee and javascript on our front end, and a healthy amount of bash for streamlining our development environment and infrastructure tooling.

The Front

Stitch’s frontend is used to configure data sources and destinations and monitor the status of each. Although we have been using AngularJS since its early days, we recently introduced React components into our front end, which many of our developers find easier to work with. We started using Coffeescript when it was one of the few options for a more expressive alternative to vanilla JavaScript, but today we opt to instead write new code in ES6, which we feel is a more mature alternative.

Since we support over 60 different data sources, and we expect that number to continue growing, we are also investing in making the interface components for configuring each source reusable so that adding a new source is as simple as declaring some information about its parameters.

Stitch Data Sources

The Middle

The majority of our Clojure microservices are simple web services that wrap a transactional database with CRUD operations and a little bit of business logic. We use both MySQL and Postgres for transactional data persistence, having transitioned from the former to the latter for newer services to take advantage of the new features coming out of the Postgres community. Fortunately, we expect our scale requirements for these transactional services to remain quite small, and AWS RDS makes it easy enough for us to operate either.

Most of our Clojure best practices can be summed up by the phrase "keep it simple." We avoid more complex web frameworks in favor of using the Ring library to build web service routes, and we prefer sending SQL directly to the JDBC library rather than using a complicated ORM or SQL DSL. We also steer clear of Clojure libraries that are wrappers on top of Java libraries, since they can cause friction if they aren’t comprehensive, and Clojure’s interop features make it easy enough to interface with Java code directly. For testing, we find that the core.test library that comes with Clojure is all we need.

A trip down the data pipeline

A data point’s journey begins at its source, where it is either pushed to Stitch via a webhook, or pulled on a schedule by one of the data extraction scripts that Stitch runs from the Singer project. In either case, its next stop is "the gate" - or the Import API as our docs call it - a Clojure web service that accepts JSON-formatted data either point-at-a-time or in large batches. It does a quick validation check on the JSON and an authentication check on the request’s API token before writing the data to a central Kafka queue.

At this point, Stitch has accepted the data, and all of our attention goes to meeting our most important service-level target: don’t lose data. To confidently meet this goal, we replicate our Kafka cluster across three different data centers and require every data point to be written to two of them before it is accepted.

We have the capacity to retain data on Kafka for multiple days to ensure nothing is lost if downstream processing is delayed, but under normal conditions it is read off of the queue seconds later by "the streamery" - a multithreaded Clojure application that writes the data to files on S3 in batches separated by the database table it is destined for. The streamery will cut batches after reaching either a maximum memory limit, or amount of time elapsed since the last batch. Its design aims to maximize throughput while also guarding against data loss or data leaking between data sets.

Batches that have been written to S3 enter "the spool" - a queue of work waiting to be processed by one of our “loaders” - Clojure applications that read data from S3 and do whatever processing is necessary before finally loading the data into the customer’s data warehouse. We currently have loaders for Redshift, Postgres, BigQuery and Snowflake. We made each of these a separate codebase and runtime because the preprocessing steps required for each destination can vary significantly. Operating them separately also allows each to scale and fail independently, which is especially important when one of the cloud-based services has downtime or undergoes maintenance.

This blog post goes into more depth on some of the product and technology changes we’ve made over time, and the diagram below provides an overview of our architecture.

Stitch Architecture

Product Development

Our development environment is based on a VirtualBox VM that is configured with the same Chef code we use in production, ensuring consistency between the two. The majority of the team uses emacs as an IDE, along with the CIDER package to connect to the Clojure REPL of our applications over the network. Most of our engineers screens are tiled with tmux windows for editing and monitoring different services in their development environment. We use GitHub to host our code repositories, and CircleCI automatically runs our test suites before code is merged.

Infrastructure

Stitch is run entirely on AWS. All of our transactional databases are run with RDS, and we rely on S3 for data persistence in various stages of our pipeline. Our product integrates with Redshift as a data destination, and we also use Redshift as an internal data warehouse (powered by Stitch, of course).

We have two internal projects for managing our infrastructure: "boxcutter" contains Chef code for configuring each of our software applications, and “cloudcutter” contains Terraform code for configuring the virtual hardware each runs on.

The majority of our services run on stateless EC2 instances that are managed by AWS OpsWorks. We configure OpsWorks stacks and layers in terraform with the chef recipes, network, load balancer, and hardware configuration that a service requires, and then we use Jenkins along with a custom script to provision instances in those layers with a specific code release. Jenkins also handles our dozens of daily code deploys by provisioning new instances and swapping them into load balancers or otherwise activating them once health checks have passed.

We recently introduced Kubernetes into our infrastructure to run the scheduled jobs that execute Singer code to extract data from various sources. Although we tend to be wary of shiny new toys, Kubernetes has proven to be a good fit for this problem, and its stability, strong community and helpful tooling have made it easy for us to incorporate into our operations. While we don’t have any plans to move our entire infrastructure to Kubernetes, we expect to expand our use of it as a scheduled job runner.

Operations

We use Datadog to monitor our application and infrastructure, which is connected to Slack and PagerDuty to wake us up in the event of a problem. We use an ELK stack for aggregating logs, with the AWS Elasticsearch service managing the search cluster.

Integrations

Our integrations to external data sources were previously written in an internal Clojure framework, codenamed Sourcerer, which enabled developers to declaratively describe an API and automatically create a working integration made up of reusable components for common tasks like authorization. As we scaled the number of integrations, this system required more and more workarounds to enable our "reusable" components to keep working. As an example, we've found that two APIs that theoretically conform to the Oauth 2.0 spec rarely implement it in such a way that the same authorization code reliably works for both.

We also found that many of our users had a handful of data sources that were highly specific to their business, and they asked us if they could build their own integrations to run on our infrastructure alongside our official integrations. These requests, combined with Sourcerer’s maintainability problems, led us to start Singer, our open source project for integrations.

Singer

Singer publicly launched in March. At its core, it is a specification for communication between two kinds of programs:

Taps, which pull data from data sources
Targets, which send data to destinations

Any combination of tap and target will work together as long as they conform to the Singer specification. The wire format is JSON, and we use JSON schema to add metadata and types. While Singer works great with Stitch, it's important to know that Singer taps and targets are fully functional, self contained applications that you can run regardless of whether or not you’re a Stitch user.

We built the original taps, targets, and libraries used in Singer in Python rather than Clojure. While we continue to be happy with Clojure for our internal services, we felt that its relatively narrow adoption could impede Singer's growth. We chose Python both because it is well suited to the task, and it seems to have reached critical mass among data engineers. All that being said, the Singer spec is language agnostic, and integrations and libraries have been developed in JavaScript, Go, and Clojure.

All of our new integration development is in Singer, and we've converted about a quarter of our existing integrations from Sourcerer to Singer. The community has built 22 new integrations, and more are coming out all of the time. Much of the development so far has been from our customers and partners who run their integrations on our infrastructure and take advantage of our auto scaling, credential management, scheduling, monitoring, and notification services.

We run Singer integrations within Docker containers on our Kubernetes cluster. Each container contains a tap, a target, and an orchestrator that passes information from our internal services to the other two applications.

The future

We're investing a lot into converting and open sourcing the balance of our legacy Sourcerer integrations to Singer. That's a priority for us both to continue to drive adoption of the open source project and to take advantage of the better and more consistent tooling that we have developed around Singer.

We're also developing APIs to enable customers and partners to programmatically control Stitch accounts. We find that our customers virtually always use Stitch in conjunction with other products, and our goal is to improve that joint user experience. There are a lot of applications that work better with access to external data, and our goal is to be the conduit through which that access takes place.

If you're interested in open source, data integration, Docker, Kubernetes, Kafka, Clojure, Angular.js, or Python, we're hiring.

Business Tools

Slack

Group Chat & Notifications