How LaunchDarkly Serves Over 4 Billion Feature Flags Daily

Editor's note: By John Kodumal, CTO, LaunchDarkly

LaunchDarkly Platform

Background

Feature flagging (wrapping a feature in a flag that’s controlled outside of deployment) is a technique for effective continuous delivery. For example, you can wrap a new signup form in a feature flag and then control which users see that form, all without having to redeploy code or modify a database. Engineering-driven companies (think Google, Facebook, Twitter) invest heavily in custom-built feature flag management systems to roll features out to whom they want, when they want. Smaller companies either build and maintain their own feature flagging infrastructure or use simple open source projects that often don't even have a UI. I was previously an engineering manager at Atlassian, where I’d seen a team work on an internal feature flagging system, so I was aware of the complexity of the problem and the investment required to build a product that addressed the needs of larger development teams and enterprises. That’s where we saw an opportunity to start LaunchDarkly.
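To make the signup form example concrete, here is a minimal sketch of a flag check in Go. It is not the LaunchDarkly SDK: `FlagStore`, `Enabled`, `SignupForm`, and the flag key `new-signup-form` are hypothetical names made up for illustration, and the hard-coded table stands in for settings that would normally come from a flag service.

```go
package main

import "fmt"

// FlagStore is a hypothetical in-memory flag table: flag key -> set of user
// keys that should see the feature. A real system would load these settings
// from a flag service rather than hard-coding them.
type FlagStore map[string]map[string]bool

// Enabled reports whether the flag is on for the given user.
// Unknown flags and unknown users fall back to false (the existing behavior).
func (s FlagStore) Enabled(flag, user string) bool {
	return s[flag][user]
}

// SignupForm picks which form to render based on the flag.
func SignupForm(s FlagStore, user string) string {
	if s.Enabled("new-signup-form", user) {
		return "new signup form"
	}
	return "old signup form"
}

func main() {
	flags := FlagStore{"new-signup-form": {"alice": true}}
	fmt.Println(SignupForm(flags, "alice"))
	fmt.Println(SignupForm(flags, "bob"))
}
```

Changing who sees the new form means changing the flag table, not the code path, which is the property that lets a rollout happen without a redeploy.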

We're currently serving over 4 billion feature flag requests per day for companies like Microsoft, Atlassian, Ten-X, and CircleCI. Many of our customers report that we’ve changed the way they do development-- we de-risk new feature launches, eliminate the need for painful long-lived branches, and empower product managers, QA, and others to use feature flags to improve their users’ experience.

General Architecture

You can think of LaunchDarkly as being split up into three pieces: a monolithic web application, a streaming API that serves feature flags, and an analytics processing pipeline that's structured as a set of microservices. We've written almost all of this in Go.

Go has really worked well for us. We love that our services compile from scratch in seconds, and produce small statically linked binaries that can be deployed easily and run in a small footprint. I'd done a lot with Scala at Atlassian, but I'd grown frustrated with the slow compilation times and overhead of the JVM. Our monolith has about a 6MB memory footprint-- try that on the JVM!

I'm generally not a fan of large web frameworks like Django or Rails. Too much "magic" for me. I prefer to build on top of smaller libraries that serve specific needs. To that end, both our monolith and our microservices rely heavily on a home-built framework layer that uses libraries like Gorilla Mux.

Our framework makes it trivial to add a new resource to our REST API and get a ton of essential functionality out of the box-- with a few lines of code, you get authentication, APM with New Relic, metrics pumped to Graphite, CORS support, and more.

The web application monolith has a pretty standard architecture. Some of the technologies we use include:

MongoDB -- as our core application data store. It's popular to make fun of Mongo these days, but we've found it to be a great database technology as long as you don't store too many things in it. Anything you can count on your fingers and toes should be fine.
ElasticSearch -- handles user search and segmentation.
Redis -- caching, of course.
HAProxy -- as a load balancer.

LaunchDarkly Architecture

Serving feature flags, fast

One of the cool and novel parts of LaunchDarkly is our streaming architecture, which allows us to serve feature flag changes instantly. Think of it like a real-time, in-memory database containing feature flag settings. The closest comparison would be something like Firebase, except Firebase is really more focused on the client-side web and mobile, whereas we do that and the server-side.

We use several technologies to drive our streaming API. The most important is Pushpin / Fanout. These technologies free us from managing long-lived streaming connections ourselves, so we can focus on building simple REST APIs.

We also use Fastly as a CDN. Fastly is perfect for us-- we can use VCL to write custom caching rules, and can purge content in milliseconds. If you're caching dynamic content (as opposed to, say, cat GIFs), or you find yourself needing to purge content programmatically, or you want the flexibility of Varnish in addition to the global network of POPs a CDN can provide, Fastly is the best choice out there. Their support team is also fantastic.

When assembled together, these technologies allow our customers to change their feature flag settings on our dashboard and have their new rollout settings streamed to thousands of servers in a hundred milliseconds or less.
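The "real-time, in-memory database" idea can be illustrated with a toy fan-out store. This is only a sketch of the concept inside a single process; it says nothing about how Pushpin, Fanout, or Fastly work, and `Store`, `Update`, `Subscribe`, and `Set` are hypothetical names.

```go
package main

import (
	"fmt"
	"sync"
)

// Update is one flag change pushed to subscribers.
type Update struct {
	Key string
	On  bool
}

// Store is a hypothetical in-memory flag table that fans every change out to
// its subscribers, standing in for the streaming layer.
type Store struct {
	mu    sync.Mutex
	flags map[string]bool
	subs  []chan Update
}

// NewStore returns an empty store with no subscribers.
func NewStore() *Store {
	return &Store{flags: map[string]bool{}}
}

// Subscribe returns a buffered channel that receives every later change.
func (s *Store) Subscribe() <-chan Update {
	s.mu.Lock()
	defer s.mu.Unlock()
	ch := make(chan Update, 16)
	s.subs = append(s.subs, ch)
	return ch
}

// Get returns the current value of a flag (false if it was never set).
func (s *Store) Get(key string) bool {
	s.mu.Lock()
	defer s.mu.Unlock()
	return s.flags[key]
}

// Set records the new value and pushes it to all subscribers. In this sketch
// a subscriber whose buffer is full simply misses the update; a real system
// would need a resync path.
func (s *Store) Set(key string, on bool) {
	s.mu.Lock()
	defer s.mu.Unlock()
	s.flags[key] = on
	for _, ch := range s.subs {
		select {
		case ch <- Update{Key: key, On: on}:
		default: // buffer full: drop rather than block the writer
		}
	}
}

func main() {
	s := NewStore()
	server := s.Subscribe() // e.g. one connected application server
	s.Set("new-signup-form", true)
	u := <-server
	fmt.Println(u.Key, u.On)
}
```

The point of the shape is that subscribers learn about a change when it happens rather than by polling, which is what makes sub-second propagation possible.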

Analytics at scale

The other huge component of LaunchDarkly is our analytics processing pipeline. Our customers request over 4 billion feature flags per day, and we use analytics data from these requests to power a lot of the features in our product. A/B testing is an obvious example, but we also do things like determine when a feature flag has stopped being requested, so that you can manage technical debt and clean up old flags.

Our current pipeline involves an HTTP microservice that writes analytics data to DynamoDB. If we need to do any further processing (say, for A/B testing), then we enqueue another job into SQS. Another microservice reads jobs off of the SQS queue and processes them. Right now, we're actively evolving this pipeline. We've found that when we're under heavy load, we need to buffer calls to DynamoDB while we expand capacity instead of trying to process them immediately. Kafka is perfect for this-- so we're splitting that HTTP microservice into a smaller HTTP service that simply queues events to Kafka, and another service that processes Kafka queues.
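The split described above, a thin front end that only enqueues plus a separate consumer that does the writes, can be sketched in one process. Here a buffered channel stands in for Kafka and a map stands in for DynamoDB; `Pipeline`, `Event`, `Enqueue`, and the buffer size are hypothetical, and a real Kafka topic is durable where this channel is not.

```go
package main

import (
	"fmt"
	"sync"
)

// Event is one analytics record: which flag was requested for which user.
type Event struct {
	FlagKey string
	UserKey string
}

// Pipeline decouples accepting events from processing them.
type Pipeline struct {
	queue  chan Event // stand-in for a Kafka topic
	wg     sync.WaitGroup
	mu     sync.Mutex
	counts map[string]int // stand-in for the DynamoDB table
}

// NewPipeline starts one consumer goroutine behind a buffer of the given size.
func NewPipeline(buffer int) *Pipeline {
	p := &Pipeline{queue: make(chan Event, buffer), counts: map[string]int{}}
	p.wg.Add(1)
	go p.consume()
	return p
}

// Enqueue is all the HTTP front end does: hand the event to the queue and
// return. It reports false if the buffer is full.
func (p *Pipeline) Enqueue(e Event) bool {
	select {
	case p.queue <- e:
		return true
	default:
		return false
	}
}

// consume drains the queue at its own pace and writes to the store.
func (p *Pipeline) consume() {
	defer p.wg.Done()
	for e := range p.queue {
		p.mu.Lock()
		p.counts[e.FlagKey]++
		p.mu.Unlock()
	}
}

// Close stops accepting events and waits for the consumer to drain the queue.
func (p *Pipeline) Close() {
	close(p.queue)
	p.wg.Wait()
}

// Count returns how many events have been processed for a flag.
func (p *Pipeline) Count(flag string) int {
	p.mu.Lock()
	defer p.mu.Unlock()
	return p.counts[flag]
}

func main() {
	p := NewPipeline(1024)
	for i := 0; i < 100; i++ {
		p.Enqueue(Event{FlagKey: "new-signup-form", UserKey: fmt.Sprint("user-", i)})
	}
	p.Close()
	fmt.Println(p.Count("new-signup-form"))
}
```

Because the front end never waits on the store, a burst of traffic fills the buffer instead of stalling requests while write capacity catches up.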

We actually use LaunchDarkly to control this evolution. We have a feature flag that controls whether a request goes through our old analytics pipeline, or the new Kafka-based pipeline we're rolling out. Once the new pipeline is enabled for all customers, we can clean up the code and switch over completely to the Kafka pipeline. This is a use case that surprises a lot of customers-- they think of feature flags in terms of controlling user-visible features (release toggles), but they are extremely valuable for other use cases like ops toggles, experiments, and permission management.
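An ops toggle like this one can be sketched as a percentage rollout keyed on the customer. The hashing scheme, the flag key `kafka-pipeline`, and the names `bucket`, `UseNewPipeline`, and `Route` are assumptions for illustration, not how LaunchDarkly actually assigns rollouts.

```go
package main

import (
	"fmt"
	"hash/fnv"
)

// bucket maps a (flag, customer) pair to a stable value in [0, 100).
// The same customer always lands in the same bucket for a given flag.
func bucket(flagKey, customer string) uint32 {
	h := fnv.New32a()
	h.Write([]byte(flagKey + ":" + customer))
	return h.Sum32() % 100
}

// UseNewPipeline reports whether this customer falls inside the rollout
// percentage for the hypothetical "kafka-pipeline" flag.
func UseNewPipeline(customer string, percent uint32) bool {
	return bucket("kafka-pipeline", customer) < percent
}

// Route names the pipeline that should handle this customer's events.
func Route(customer string, percent uint32) string {
	if UseNewPipeline(customer, percent) {
		return "kafka"
	}
	return "legacy"
}

func main() {
	for _, c := range []string{"acme", "globex", "initech"} {
		fmt.Println(c, Route(c, 25))
	}
}
```

Because each customer's bucket is fixed, raising the percentage only ever moves customers from the old pipeline to the new one, and setting it to 100 is the point at which the old code path can be deleted.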

As we scaled this service out to handle tens of thousands of requests per second, we learned an important lesson about microservice construction. When we first built many of these services, we thought in terms of building a separate service per concern. For example, we’d build a single service that would both read in analytics events and serve the autocomplete functionality on the site. The web application would make a sub-request to this service when it had an autocomplete request from the site.

We quickly learned that the need for fault tolerance and isolation trumps the conceptual neatness of having a service per concern. With fault tolerance in mind, we sliced our services along a different axis-- separating high-throughput analytics writes from the lower-volume read requests coming from the site. This shift dramatically improved the performance of our site, as well as our ability to evolve and scale the huge write load we see on the analytics side.

Infrastructure

As you might have inferred, we use AWS as our hosting provider. We’re fairly conservative when it comes to adopting new technologies-- deployment for us consists of a set of Ansible scripts that spin up EC2 boxes for our various services. We don’t yet use ECS or Docker containers-- which by extension means we don’t use anything for container orchestration. A long while back, we spiked a migration to Mesosphere, but we ran into enough issues that we didn’t proceed. We do think that these technologies are the future, but that future is not now, at least for us.

So maturity is one issue that prevents us from adopting some of the latest whiz-bang ops technology. Cost is another: there are technologies that we find interesting, like Amazon’s API Gateway, but the pricing models just don’t work for us-- at tens of thousands of requests per second, they’re non-starters.

Other services

For customer communications and support, we use Intercom, Slack, and GrooveHQ. We also recently started using elevio, and we've found it's a great way to turn Intercom questions into trackable support tickets.

We use ReadMe.io for our product and developer API documentation, GitHub holds all our code hostage, and CircleCI helps us integrate continuously.

What’s next?

We’re constantly evolving our service to improve efficiency and scale. Besides the Kafka switchover, we’re looking at using Cassandra for some of the work that DynamoDB is doing right now. We are also keenly interested in Disque as a queuing solution, especially because we’ve had so much positive experience with Redis.

More aspirationally, we might try spiking some of our new services in Rust. I’m a functional programmer at heart, and while I am appreciative of the speed and tooling around Go, it would be nice to regain some of the expressiveness and elegance of a functional language while retaining what we like about Go (the fast compilation times, ease of deployment). If we do try it out, we’ll do so in a cautious manner, and isolate the trial to a new microservice somewhere.