Written by John Kodumal, CTO and Co-Founder, LaunchDarkly
Background
LaunchDarkly is a feature management platform—we make it easy for software teams to adopt feature flags, helping them eliminate risk in their software development cycles. When we first wrote about our stack, we served about 4 billion feature flags a day. Last month, we averaged over 200 billion flags daily. To me, that's a mind-boggling number, and a testament to the degree to which we're able to change the way teams do software development. Some additional metrics:
- Our global P99 flag update latency (the time it takes for a feature flag change on our dashboard to be reflected in your application) is under 500ms
- Our primary Elasticsearch cluster indexes 175M+ docs / day
- At daily peak, more than 1.5 million mobile devices and browsers, plus 500k+ servers, are connected to our streaming APIs
- Our event ingestion pipeline processes 40 billion events per day
We've scaled all our services through a process of gradual evolution, with an occasional bit of punctuated equilibrium. We've never rewritten a service from scratch, nor have we ever had to completely re-architect any of our services (we did migrate one service from a SaaS provider to a homegrown replacement; more on that later). In fact, from a high level, our stack is very similar to what we described in our earlier post:
- A Go monolith that serves our REST API and UI (JS / React)
- A Go microservice that powers our streaming API
- An event ingestion / transformation pipeline implemented as a set of Go microservices
We use AWS as our cloud provider, and Fastly as our CDN.
Let's talk about some of the changes we've made to scale these systems.
Buy first, build if necessary
Over the past year, we've shifted our philosophy on managed services and have moved several critical parts of our infrastructure away from self-managed options. The most prominent was our shift away from HAProxy to AWS's managed application load balancers (ALBs). As we scaled, managing our HAProxy fleet became a larger and larger burden. We spent a significant amount of time tuning our configuration files and benchmarking different EC2 instance types to maximize throughput. Emerging needs like DDoS protection and auto scaling turned into large projects that we needed to schedule urgently. Instead of continuing this investment, we chose to shift to managed ALB instances. This was a large project, but it quickly paid for itself as we've nearly eliminated the time spent managing load balancers. We also gained DDoS protection and auto scaling "for free".
As we've evolved or added additional infrastructure to our stack, we've biased towards managed services:
- Most new backing stores are Amazon RDS instances now. We do use self-managed PostgreSQL with TimescaleDB for time-series data; we make it highly available using Patroni and Consul.
- We also use managed Elasticache instances instead of spinning up EC2 instances to run Redis workloads.
- In our previous StackShare article, I wrote about a project to incorporate Kafka into our event ingestion pipeline. In keeping with our bias towards managed services, we adopted Amazon Kinesis instead of Kafka.
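To make the Kinesis choice concrete, here's a rough sketch of what the producer side of an event pipeline can look like with the AWS SDK for Go. This is illustrative only, not our actual pipeline code: the stream name, partition key, and payload shape are placeholders.

```go
package main

import (
	"log"

	"github.com/aws/aws-sdk-go/aws"
	"github.com/aws/aws-sdk-go/aws/session"
	"github.com/aws/aws-sdk-go/service/kinesis"
)

func main() {
	// Shared session; region and credentials come from the environment or AWS config.
	sess := session.Must(session.NewSession())
	svc := kinesis.New(sess)

	// "events" and the JSON payload are placeholders, not our real stream or schema.
	_, err := svc.PutRecord(&kinesis.PutRecordInput{
		StreamName:   aws.String("events"),
		PartitionKey: aws.String("environment-123"), // spreads records across shards
		Data:         []byte(`{"kind":"feature","key":"example-flag"}`),
	})
	if err != nil {
		log.Fatalf("failed to put record: %v", err)
	}
}
```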
Managed services do have some drawbacks:
- They're almost never cheaper (in raw dollars) than self-managed alternatives. Pricing is often more opaque, more variable, and harder to predict
- Much less visibility into the operation, errors, and availability of the service
- Vendor lock-in
Still, it's a false economy to compare only the raw cost of a managed service against a self-managed one: factor in your team's time and the math is usually pretty clear.
There is one notable case where we've moved from a managed SaaS solution to a homegrown service. LaunchDarkly relies on a novel streaming architecture to push feature flag changes out in near real-time. Our SDKs create persistent outbound HTTPS connections to the LaunchDarkly streaming APIs. When you change a feature flag on your dashboard, that change is pushed out using the server-sent events (SSE) protocol. When we initially built our streaming service, we relied heavily on a third-party service, Fanout, to manage persistent connections. Fanout worked well for us, but over time we found that we could introduce domain-specific performance and cost optimizations if we built a custom service for our use case. We created a Go microservice that manages persistent connections and is heavily optimized for the unique workloads associated with feature flag delivery. We use NATS as a message broker to connect our REST API to a fleet of EC2 instances running this microservice. Each of these instances can manage over 50,000 concurrent SSE connections.
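If you haven't worked with SSE before, the mechanics are simple enough to sketch in a few lines of Go. The sketch below is not our streaming service (the real fan-out, subscription, and NATS plumbing are far more involved); it just shows the shape of a handler that holds a connection open and flushes events as they arrive. The `flagUpdates` channel is a stand-in for messages from a broker.

```go
package main

import (
	"fmt"
	"net/http"
)

// flagUpdates stands in for messages arriving from a broker such as NATS.
var flagUpdates = make(chan string)

func streamHandler(w http.ResponseWriter, r *http.Request) {
	flusher, ok := w.(http.Flusher)
	if !ok {
		http.Error(w, "streaming unsupported", http.StatusInternalServerError)
		return
	}

	// Standard SSE headers: keep the connection open and disable caching.
	w.Header().Set("Content-Type", "text/event-stream")
	w.Header().Set("Cache-Control", "no-cache")
	w.Header().Set("Connection", "keep-alive")

	for {
		select {
		case <-r.Context().Done():
			return // client disconnected
		case payload := <-flagUpdates:
			// Each SSE event is "event:" and "data:" lines followed by a blank line.
			fmt.Fprintf(w, "event: patch\ndata: %s\n\n", payload)
			flusher.Flush()
		}
	}
}

func main() {
	http.HandleFunc("/stream", streamHandler)
	http.ListenAndServe(":8080", nil)
}
```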
At scale, everything is a tight loop
Some of our analytics services receive tens of thousands of requests per second. One of the biggest things we've learned over the past year is that at this scale, there's almost no such thing as premature optimization. Because of the sheer volume of requests, every handler you write is effectively running in a tight loop. We found that to keep meeting our service level objectives and cost goals at scale, we had to do two things repeatedly:
- Profile aggressively to identify and address CPU and memory bottlenecks
- Apply a set of micro-patterns to handle specific workloads
Profiling must be done periodically: as traffic scales and old bottlenecks are eliminated, new ones constantly emerge. As an example, at one point, we found that the "front-door" microservice for our analytics pipeline was CPU-bound parsing JSON. We switched from Go's built-in encoding/json package to easyjson, which uses compile-time specialization to eliminate slow runtime reflection in JSON parsing.
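Making this kind of profiling routine is mostly a matter of keeping the hooks in place. Here's a minimal sketch of exposing Go's built-in pprof endpoints on an internal port; the port and the bare `select {}` placeholder are illustrative, not our actual service setup.

```go
package main

import (
	"log"
	"net/http"
	_ "net/http/pprof" // registers /debug/pprof/* handlers on the default mux
)

func main() {
	// Expose profiling on an internal-only port, separate from service traffic.
	go func() {
		log.Println(http.ListenAndServe("localhost:6060", nil))
	}()

	// ... normal service initialization and request handling would go here ...
	select {}
}
```

With that in place, `go tool pprof http://localhost:6060/debug/pprof/profile` captures a CPU profile you can inspect for hot spots like the JSON parsing issue above.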
We also identified a set of "micro-patterns" that we have extracted as self-contained libraries so they can be applied in appropriate contexts. Some examples:
- Read coalescing—In a read-heavy workload, expensive calls to fetch data can be queued to await the first read—a kind of memoization. This pattern is encapsulated in Google's singleflight package (see the sketch after this list).
- Write coalescing—The dual of read coalescing. In a write-heavy workload, where last write wins, writes can be queued and discarded in favor of the latest write attempt.
- Multi-layer caching—In scenarios where an in-process, in-memory cache is necessary for performance, horizontal scaling can reduce cache hit rates. We make our fleet more resilient to this effect by employing multiple layers of caching—for example, backing an in-memory cache with a shared Redis cache before finally falling back to a slower persistent disk-backed store.
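As a rough illustration of the first pattern, here's what read coalescing looks like with the singleflight package. The `fetchFlag` function and flag key are placeholders for an expensive backing-store read, not our actual code.

```go
package main

import (
	"fmt"

	"golang.org/x/sync/singleflight"
)

var group singleflight.Group

// fetchFlag is a stand-in for an expensive backing-store read.
func fetchFlag(key string) (string, error) {
	// ... query Redis, DynamoDB, etc. ...
	return "flag-config-for-" + key, nil
}

// getFlag coalesces concurrent reads: while one fetch for a key is in flight,
// other callers asking for the same key wait for and share that result.
func getFlag(key string) (string, error) {
	v, err, _ := group.Do(key, func() (interface{}, error) {
		return fetchFlag(key)
	})
	if err != nil {
		return "", err
	}
	return v.(string), nil
}

func main() {
	cfg, _ := getFlag("example-flag")
	fmt.Println(cfg)
}
```

Under a reconnection storm, thousands of goroutines asking for the same flag configuration collapse into a single backing-store read, which is exactly the behavior you want in a tight loop.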
These simple patterns improved performance at scale and also helped us deal with bad traffic patterns like reconnection storms.
Get good at managing change
Scaling up isn't just about improving your services and architecture. It requires equal investment in people, processes, and tools. One thing we really focused on, from a process and tools perspective, is understanding change. Better visibility into changes being made to the service had a massively positive impact on service reliability. Here are a few things we did to improve visibility:
- Internal changelog service: This service catalogues intentional changes being made to the system. This includes deploys, instance type changes, configuration changes, feature flag changes, and more. Anything that could potentially impact the service (either in a positive or negative way) is catalogued here. We couldn't find anything off the shelf here, so we built something ourselves.
- COGS (cost of goods sold) log: Very similar to our changelog, but focused on changes to the cost of running our services. If we scale out a service, change instance types, or purchase reserved instances, we add an entry to this log. For us, this is just a Confluence page.
- Observability / APM: We use a mix of Graphite / Grafana and Honeycomb.io to understand what is happening in our services at runtime. We're big fans of Honeycomb here.
- Operational and release feature flags: We feature flag most changes using LaunchDarkly. Most new changes are protected by release flags (short-lived flags that are used to protect the initial rollout and rollback of a feature). We also create operational flags: long-lived flags that act as control switches for the application (a minimal example follows this list). Observability lets us understand change, and feature flags allow us to react to change to maintain availability or improve user experience.
- Spinnaker / Armory: LaunchDarkly is almost a five-year-old company, and our methodology for deploying was state of the art... for 2014. We recently undertook a project to modernize the way we deploy our software, moving from Ansible-based deploy scripts executed on our local machines to using Spinnaker (along with Terraform and Packer) as the basis of our deployment system. We've been using Armory's enterprise Spinnaker offering to make this project a reality.
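Going back to operational flags for a moment: here's a minimal sketch of what gating a code path behind a flag looks like with a v4-era LaunchDarkly Go server-side SDK. Import paths and user construction differ across SDK versions, and the SDK key and flag key here are placeholders.

```go
package main

import (
	"log"
	"time"

	ld "gopkg.in/launchdarkly/go-server-sdk.v4"
)

func main() {
	// The SDK key and flag key are placeholders.
	client, err := ld.MakeClient("YOUR_SDK_KEY", 5*time.Second)
	if err != nil {
		log.Fatal(err)
	}
	defer client.Close()

	user := ld.NewUser("service-instance-42")

	// Operational flag acting as a control switch: default to the old
	// code path if the flag can't be evaluated.
	useNewPipeline, _ := client.BoolVariation("use-new-event-pipeline", user, false)
	if useNewPipeline {
		// ... new code path ...
	} else {
		// ... existing code path ...
	}
}
```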
Like the sound of this stack? Learn more about LaunchDarkly.